Predictive Scoring in Spark using Spark Streaming
In the last decade, companies like Google, Facebook and Netflix have led the way in collecting and monetizing the huge amounts of data generated by consumers’ everyday activity. They treat this data as a strategic asset: every decision in their organizations is data driven, and so is every product they sell. This has created enormous interest among traditional enterprises, which can easily see the benefit of putting their own data to work in the same way.
The techniques they have used to achieve this are based on Hadoop and its surrounding ecosystem of technologies, which allow gigantic amounts of data to be collected, stored and processed at low cost. Initially Hadoop was used primarily for batch workloads, but in the last couple of years Apache Spark has emerged as a technology that lets batch and real-time processing run side by side on the same infrastructure. This combination is known as a Lambda Architecture. It is particularly suited to predictive analytics, where patterns are typically identified in a historical dataset on a regular basis (Model Training) and new incoming records are then checked in real time to see whether they match these patterns (Scoring). For example, a bank will identify what constitutes a profitable customer during Model Training. It will then look at new customers in real time to see whether they have high potential profitability. Scoring has to happen quickly: if a profitable new customer is not given the correct treatment, they may walk away. This is why a real-time capability matters.
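To make the split between Model Training and Scoring concrete, here is a minimal sketch in plain Java. It is an illustration only: the `Customer` class, the single `balance` feature and the midpoint threshold rule are hypothetical, and they are far simpler than any model SAP Predictive Analytics would produce.

```java
import java.util.List;

// Hypothetical illustration of the train-then-score split; not an SAP PA or Spark API.
class TrainThenScore {

    // A historical customer with one feature and a known outcome (assumed schema).
    static final class Customer {
        final double balance;
        final boolean profitable;

        Customer(double balance, boolean profitable) {
            this.balance = balance;
            this.profitable = profitable;
        }
    }

    // Batch step (Model Training): learn a balance threshold as the midpoint between
    // the average balance of profitable customers and that of all other customers.
    static double train(List<Customer> history) {
        double avgProfitable = history.stream().filter(c -> c.profitable)
                .mapToDouble(c -> c.balance).average().orElse(0.0);
        double avgOther = history.stream().filter(c -> !c.profitable)
                .mapToDouble(c -> c.balance).average().orElse(0.0);
        return (avgProfitable + avgOther) / 2.0;
    }

    // Real-time step (Scoring): check one new record against the learned pattern.
    static boolean score(double threshold, double newCustomerBalance) {
        return newCustomerBalance >= threshold;
    }

    public static void main(String[] args) {
        List<Customer> history = List.of(
                new Customer(9000, true), new Customer(7000, true),
                new Customer(1000, false), new Customer(3000, false));
        double threshold = train(history); // would run periodically on historical data
        System.out.println("threshold = " + threshold);
        // would run once per incoming record
        System.out.println("high potential? " + score(threshold, 6500));
    }
}
```

The point of the design is that `train` is expensive and runs rarely, while `score` is cheap and runs for every new record, which is what lets it sit on the real-time path.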
This document shows how SAP Predictive Analytics can be used to train models on existing data in Spark and then deploy them to Spark Streaming, where they score incoming data in real time.
Spark and Spark Streaming
Spark Streaming adds a real-time streaming capability to Apache Spark, while Spark SQL provides one of its batch mechanisms. Spark Streaming enables Spark to process 100,000-500,000 records per node per second and to reach sub-second latency.
A key feature of Spark is fault tolerance, the ability of a system to keep operating properly when a failure occurs. As in core Spark, raw data in Spark Streaming is distributed and replicated in memory across the cluster, so it can be reproduced if data is lost. Likewise, the data stream in Spark Streaming can be recomputed if a node in the cluster fails.
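The idea of recovering by recomputation can be sketched in a few lines of plain Java. This is a conceptual illustration, not Spark's implementation: the `LineageSketch` class and its methods are hypothetical, and the sketch simply assumes that the raw input survives a failure (because it is replicated) while computed results may be lost.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Conceptual sketch of recovery by recomputation; this is NOT Spark's implementation.
class LineageSketch {
    // Raw input partitions, assumed to survive failures because they are replicated.
    private final List<List<String>> replicatedInput;
    // The recorded transformation, i.e. how each result is derived from the input.
    private final Function<String, Integer> transform;
    // Computed partitions held in memory; entries can be lost when a node fails.
    private final Map<Integer, List<Integer>> cache = new HashMap<>();

    LineageSketch(List<List<String>> replicatedInput, Function<String, Integer> transform) {
        this.replicatedInput = replicatedInput;
        this.transform = transform;
    }

    // Return a computed partition, rebuilding it from the input if it is missing.
    List<Integer> partition(int index) {
        return cache.computeIfAbsent(index, i -> {
            List<Integer> out = new ArrayList<>();
            for (String rec : replicatedInput.get(i)) {
                out.add(transform.apply(rec));
            }
            return out;
        });
    }

    // Simulate a node failure that wipes one computed partition from memory.
    void loseNode(int index) {
        cache.remove(index);
    }

    boolean isCached(int index) {
        return cache.containsKey(index);
    }

    public static void main(String[] args) {
        LineageSketch data = new LineageSketch(
                List.of(List.of("1", "2"), List.of("3")), Integer::parseInt);
        System.out.println("before failure: " + data.partition(0));
        data.loseNode(0);
        System.out.println("after recovery: " + data.partition(0));
    }
}
```

Because the transformation is deterministic and the input is still available, the partition rebuilt after `loseNode` is identical to the one that was lost.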
Spark Streaming provides APIs in Scala, Java and Python. The APIs can read data from, or write results to, multiple sources including Flume, HDFS, Kafka and raw TCP streams. What’s more, users can create Resilient Distributed Datasets (RDDs), the basic abstraction in Spark, through normal Spark programming. These RDDs can be combined with data from the streaming sources as input for Spark Streaming.
Bank’s potential customer workflow
In the bank’s potential-customer scenario, the bank can use SAP PA to train the model on existing customer data that may reside in Hadoop. Thanks to the big data technologies embedded in SAP Predictive Analytics since version PA 2.2, models can be trained on tens of gigabytes of data in seconds or minutes. To keep ahead of the competition, let’s see how the bank can use Spark Streaming together with SAP PA to perform real-time scoring, in other words to predict ‘new potential customers’ in real time.
- Steps 1 and 2 depicted above are carried out in SAP Predictive Analytics. The Automated mode of SAP Predictive Analytics (PA) automates the whole data science life cycle, from data preparation to validation of the analysis results. The models that Automated Analytics generates can be exported in different languages to reproduce the operations performed by Data Encoding, Clustering and Classification/Regression. The generated code can then be used outside SAP to apply models to new data. [For more details on applying a model, see: how to integrate code generated by scorer]. In this case the model is exported as a Java class file.
- Embed the model in Spark by invoking KxJModelFactory, a utility of the Java API offered by SAP PA, and then deploy it on Spark using spark-submit.
- The model score generator is a Java project that imports the KxJRT library from SAP PA to apply the model. It receives target data and requests from users on a given port in Spark. (In our POC we set port 10999 to receive scoring requests as a TCP stream, but the port number is configurable.) Internally, Spark Streaming divides the input data stream into batches, which the Spark engine then processes to generate the final stream of results, also in batches. [More details: spark streaming programming]
- Apply the PA model, which generates the scores, and output the results from Spark Streaming. (In our POC we wrote the output scores to text files.)
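The last two steps can be sketched in plain Java to show the shape of the flow: lines arrive on a stream, are grouped into batches, and each record in a batch is scored and written out. This sketch deliberately uses neither Spark nor the KxJRT library. The `ScoringModel` interface is a hypothetical stand-in for the class exported from SAP PA, the comma-separated record layout is assumed, and batches are cut by record count, whereas Spark Streaming cuts them by time interval.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of the scoring flow; it does not use Spark or the KxJRT library.
class ScoringFlowSketch {

    // Hypothetical stand-in for the model class exported from SAP PA.
    interface ScoringModel {
        double score(String[] fields);
    }

    // Group incoming lines into fixed-size batches. Spark Streaming batches by
    // time interval instead; a record count keeps this sketch deterministic.
    static List<List<String>> toBatches(BufferedReader in, int batchSize) throws IOException {
        List<List<String>> batches = new ArrayList<>();
        List<String> current = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
            current.add(line);
            if (current.size() == batchSize) {
                batches.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            batches.add(current); // flush the final partial batch
        }
        return batches;
    }

    // Apply the model to every record of one batch and write "record,score" lines.
    static void scoreBatch(List<String> batch, ScoringModel model, Writer out) throws IOException {
        for (String record : batch) {
            out.write(record + "," + model.score(record.split(",")) + "\n");
        }
    }

    public static void main(String[] args) throws IOException {
        // In the POC the input arrives on TCP port 10999; a reader wrapped around
        // a socket's input stream could replace this in-memory sample.
        BufferedReader in = new BufferedReader(new StringReader("25,1000\n40,9000\n31,4000\n"));
        // Toy model (assumption): the score is the second field divided by 10000.
        ScoringModel model = fields -> Double.parseDouble(fields[1]) / 10000.0;
        StringWriter out = new StringWriter(); // a FileWriter would give text-file output
        for (List<String> batch : toBatches(in, 2)) {
            scoreBatch(batch, model, out);
        }
        System.out.print(out);
    }
}
```

Keeping the model behind a small interface means the batching and output code does not need to change when a retrained model is exported and swapped in.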