Over the last decade, companies like Google, Facebook, and Netflix have led the way in collecting and monetizing the huge amounts of data generated by consumers' everyday activity. They treat this data as a strategic asset: every decision in their organizations is data driven, and every product they sell is data driven. This has created enormous interest among traditional enterprises, which can easily see the benefit of putting their own data to work in the same way.
The techniques they have used to achieve this are based on Hadoop and its surrounding ecosystem of technologies, which allow gigantic amounts of data to be collected, stored, and processed at low cost. Hadoop was initially used mainly for batch workloads, but in the last couple of years Apache Spark has emerged as a technology that lets batch and real-time processing run side by side on the same infrastructure. This arrangement is known as a Lambda Architecture.

It is particularly well suited to predictive analytics, where patterns are typically identified in a historical dataset on a regular schedule (Model Training) and new incoming records are then checked in real time to see whether they match those patterns (Scoring). For example, a bank will identify what constitutes a profitable customer during Model Training, then examine new customers in real time to see whether they have high profit potential. It is critically important that scoring happens quickly: if a profitable new customer is not given the correct treatment, they may walk away. This is why a real-time capability matters.
The following document shows how SAP Predictive Analytics can be used to train models on existing data in Spark and then deploy them to Spark Streaming, where they can score incoming data in real time.
Spark Streaming adds a real-time streaming capability to Apache Spark, while Spark SQL provides one of its batch mechanisms. Spark Streaming enables Spark to process 100,000 to 500,000 records per node per second and to reach sub-second latency.
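As a minimal illustration of the programming model, the following Scala sketch counts words arriving on a raw TCP stream in one-second micro-batches. The host, port, and batch interval are illustrative choices, not values prescribed by Spark.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // local[2]: one core for the receiver, one for processing.
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")

    // The batch interval drives latency: each micro-batch is processed once per second.
    val ssc = new StreamingContext(conf, Seconds(1))

    // Read a raw TCP text stream (for testing: `nc -lk 9999` on the same host).
    val lines = ssc.socketTextStream("localhost", 9999)

    // Classic word count over each micro-batch.
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```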
A key feature of Spark is fault tolerance: the ability of a system to continue operating properly when a failure occurs. As in core Spark, the raw data received by Spark Streaming is distributed and replicated in memory across the cluster, so it can be reproduced if it is lost. Likewise, the data stream itself can be recomputed if one of the nodes that make up the cluster fails.
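In practice this recovery is enabled through checkpointing. The sketch below uses a placeholder checkpoint directory and shows the usual pattern: after a driver failure, the streaming context is rebuilt from the checkpoint rather than starting from scratch.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FaultTolerantStream {
  // Placeholder path; in production this should point at reliable storage such as HDFS.
  val checkpointDir = "hdfs:///tmp/streaming-checkpoint"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("FaultTolerantStream")
    val ssc = new StreamingContext(conf, Seconds(1))
    // Persist metadata and lineage so lost batches can be recomputed after a failure.
    ssc.checkpoint(checkpointDir)
    ssc.socketTextStream("localhost", 9999).count().print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On a clean start this calls createContext(); on restart after a crash
    // it rebuilds the streaming context from the checkpoint instead.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```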
Spark Streaming provides APIs in Scala, Java, and Python. The APIs can read data from, and write results to, multiple sources, including Flume, HDFS, Kafka, and raw TCP streams. What's more, users can create Resilient Distributed Datasets (RDDs), the basic abstraction in Spark, through normal Spark programming; a queue of such RDDs, combined with data from those external sources, can serve as the input for Spark Streaming.
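Feeding hand-built RDDs into a stream is handy for testing. This sketch, modeled on Spark's queue-stream example, pushes one RDD per second into a queue that Spark Streaming drains at a rate of one RDD per batch.

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object QueueStreamExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("QueueStreamExample").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    // RDDs created by normal Spark programming become the stream's input:
    // Spark Streaming dequeues one RDD per batch interval.
    val rddQueue = new mutable.Queue[RDD[Int]]()
    val stream = ssc.queueStream(rddQueue)
    stream.map(x => (x % 10, 1)).reduceByKey(_ + _).print()

    ssc.start()
    // Push a fresh RDD into the queue once per second for five seconds.
    for (_ <- 1 to 5) {
      rddQueue.synchronized { rddQueue += ssc.sparkContext.makeRDD(1 to 1000, 4) }
      Thread.sleep(1000)
    }
    ssc.stop()
  }
}
```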
In the bank's potential-customer scenario, the bank can use SAP Predictive Analytics to train a model on the existing customer data that may reside in Hadoop. Thanks to the big data technologies embedded in SAP Predictive Analytics since version 2.2, models can be trained in seconds or minutes on tens of gigabytes of data. To stay ahead of the competition, let's see how the bank can combine Spark Streaming with SAP Predictive Analytics to perform real-time scoring, in other words, to identify 'new potential customers' in real time.
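The wiring might look like the following sketch. Here `scoreCustomer` is a purely hypothetical stand-in for the scoring logic exported from the trained SAP Predictive Analytics model, and the input source, record format, and threshold are illustrative assumptions.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RealTimeScoring {
  // Hypothetical stand-in for the scoring function produced by the trained
  // SAP Predictive Analytics model; here it just scales the last CSV field.
  def scoreCustomer(record: String): Double = {
    val income = record.split(",").last.trim.toDouble
    math.min(income / 100000.0, 1.0)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RealTimeScoring")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Assume new customer records arrive as CSV lines over a raw TCP stream;
    // a Kafka or Flume receiver would slot in the same way.
    val customers = ssc.socketTextStream("localhost", 9999)

    // Score each record as it arrives and keep the high-potential customers.
    val potentials = customers
      .map(record => (record, scoreCustomer(record)))
      .filter { case (_, score) => score > 0.8 } // illustrative cutoff

    potentials.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Because scoring happens inside each micro-batch, every new customer is evaluated within roughly one batch interval of arriving, which is what makes the real-time treatment of promising customers possible.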