Apache Spark is the most popular Apache open-source project to date, and it has become a catalyst for the adoption of big data infrastructure. Spark uses in-memory processing and offers high performance for complex computations such as machine learning, streaming analytics, and graph processing.
Providing support for Hadoop and Spark in SAP Predictive Analytics is crucial to serving our customers' needs because:
- Data is getting bigger and wider
- Performance and speed expectations are rising
- Customers are looking for optimized processing with proper utilization of their Hadoop resources
- Customers want to leverage their existing workforce to perform predictive analysis on their big data assets; SAP Predictive Analytics provides a business-friendly tool that does not demand data science or big data developer skills
Native Spark Modeling
Native Spark Modeling executes the automated predictive models directly on Hadoop using the Spark engine.
Before Native Spark Modeling, the predictive modeling engine was essentially the SAP Predictive Analytics desktop or the SAP Predictive Analytics server. With Native Spark Modeling, data-intensive tasks are delegated to Spark, which avoids transferring data between the SAP Predictive Analytics client and the data source platform (in this case, the Hadoop platform).
The left side of the diagram shows the existing process without Spark, in which a large data transfer took place at a significant performance cost for big data. The right side shows the same process using Native Spark Modeling to execute the machine learning computation close to the data.
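To make the "computation close to the data" idea concrete, here is a minimal, plain-Python sketch. It is not SAP's algorithm and does not use Spark; the function names, the `segment` column, and the generated data are all hypothetical. It only illustrates the general pattern: each data partition is summarized where it lives, and only the small summaries are moved and merged, instead of every row.

```python
from collections import Counter

def make_partition(n, offset):
    """Hypothetical data: n rows with a categorical feature and a Yes/No target."""
    return [
        {
            "segment": "ABC"[(i + offset) % 3],
            "Credit_card_Exist": "Yes" if (i + offset) % 2 == 0 else "No",
        }
        for i in range(n)
    ]

def partition_counts(rows, feature, target):
    """Runs next to the data: reduce one partition to (feature, target) counts."""
    return Counter((row[feature], row[target]) for row in rows)

def merge_counts(partials):
    """Runs on the client side: combine the small per-partition summaries."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Three partitions of 1,000 rows each stand in for data stored on the cluster.
partitions = [make_partition(1000, offset) for offset in (0, 1000, 2000)]

partials = [partition_counts(p, "segment", "Credit_card_Exist") for p in partitions]
summary = merge_counts(partials)

rows_moved_without_delegation = sum(len(p) for p in partitions)
cells_moved_with_delegation = sum(len(c) for c in partials)

print(rows_moved_without_delegation, cells_moved_with_delegation)
```

In this toy setup the delegated version moves a handful of counters per partition rather than thousands of rows, and the gap widens as the data grows. Real training statistics are richer than simple counts, but the transfer pattern is the point of the sketch.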
- After installing the SAP Predictive Analytics tool, users will find a “SparkConnector” folder that contains the developed functionality in the form of a ‘jar’ file. A user, typically an administrator, needs to define the Spark and YARN connection properties in the configuration files for each ODBC DSN that they intend to use with Native Spark Modeling. Refer to the Native Spark Modeling configuration section in the SAP Predictive Analytics documentation.
- They load up SAP Predictive Analytics
- They open up Modeler
- They make sure the ‘Native Spark Modeling’ flag in the Model Training Delegation option is switched ON under the “File->Preferences” menu
- They choose Classification/Regression (as of PA 2.5, only classification models are supported on Spark; regression will follow soon).
- They can choose an existing Hive table using the ‘Use Database Table’ option, or an Analytical Dataset based on Hive tables using the ‘Use Data Manager’ option. SAP Automated Analytics proprietary algorithms are made to scale across any amount of information, both long and wide. The wider the datasets, the stronger the predictive power! If you are wondering how wider datasets come about, consider an e-commerce example in which weblogs are analyzed to understand the trends behind purchases. As you build aggregates in Data Manager, even more columns are added for analysis.
- They load the description of the dataset from Hive or from a file, or choose “Analyze”
- They choose a target field from the loaded dataset to run the training against, e.g. Credit_card_Exist(=Yes/No)
- They generate the model, which is now executed on the Spark engine. Notice that the progress bar now shows progress messages for Spark
- They can watch the ongoing Spark jobs in the application WebUI, which can be opened from the browser using http://localhost:4040/
- Once the model is generated, they have the same choices as with a traditional database, e.g. the Smart Variable Contribution report in the Automated
- Finally, they can manage the model lifecycle using the SAP Predictive Analytics Model Manager component. For example, if a user wants to retrain the model at frequent intervals, they can schedule the task from Model Manager for their data in Hadoop using the model that was trained on Spark. The retraining in this case is processed on Spark as well.
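The Spark jobs mentioned in the WebUI step can also be checked from a script, because Spark serves a monitoring REST API under `/api/v1` on the same port as the application WebUI. The sketch below assumes the driver UI is reachable at http://localhost:4040/ as in the step above (on a YARN cluster the address may differ). The helper names are hypothetical, and the sample record is made up for illustration, using field names from Spark's job listing.

```python
import json
from urllib.request import urlopen

# Assumption: same host and port as the WebUI mentioned in the steps above.
BASE_URL = "http://localhost:4040/api/v1"

def fetch_json(url, timeout=5):
    """GET a URL and decode the JSON body."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)

def summarize_jobs(jobs):
    """Turn a list of Spark job records into one progress line per job."""
    lines = []
    for job in jobs:
        total = job.get("numTasks", 0)
        done = job.get("numCompletedTasks", 0)
        pct = 100.0 * done / total if total else 0.0
        lines.append(
            f"job {job['jobId']} [{job['status']}] {done}/{total} tasks ({pct:.0f}%)"
        )
    return lines

def job_report(base_url=BASE_URL):
    """Query every application on this UI and summarize its jobs."""
    report = []
    for app in fetch_json(f"{base_url}/applications"):
        jobs = fetch_json(f"{base_url}/applications/{app['id']}/jobs")
        report.extend(summarize_jobs(jobs))
    return report

# Made-up record, shaped like one entry of the jobs listing.
sample_jobs = [
    {"jobId": 3, "status": "RUNNING", "numTasks": 8, "numCompletedTasks": 2}
]
print("\n".join(summarize_jobs(sample_jobs)))
```

While a model is training, calling `job_report()` from a machine that can reach the driver returns one line per Spark job. The WebUI on port 4040 is only available while the Spark application is running, so the call fails once training has finished.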
Data will continue to grow, and enterprises will continue to shift more and more of it onto Hadoop platforms; they can now begin to apply predictive solutions on top of it to gain meaningful insights.
Companies have different options for predictive analysis, such as SAP Predictive Analytics or open-source machine learning libraries. SAP Predictive Analytics, however, stands out with its full-stack support on Hadoop: from data manipulation on big data, to model training on Spark, to in-database apply and retrain for production-ready big data.
In conclusion, Native Spark Modeling is a key foundation of SAP’s predictive big data architecture, enabling performance gains of 7-10 times and more for big data. It is also prepared to scale as your data and infrastructure grow in the future. With these advantages in performance and scalability, business analysts can build more predictive models and fail early without having to worry about big data technology.