With more and more data becoming dynamic (whether it be data from sensors, social media, financial transactions, or other constantly changing sources), finding a machine learning model that can change and adapt to streamed data can be a big challenge.
Whether the goal is fraud detection, predicting purchasing patterns, or social media sentiment analysis, a machine learning model that can’t accommodate streamed data creates obstacles. In most cases, you’re stuck retraining your model infrequently and maintaining a large pre-existing data set. As patterns change, you might have to retrain the model with additional historical, labelled data to keep the algorithm accurate. And if you want to detect hidden patterns in the data, you’ll have to store enough data to make sure the predictions are accurate. Each time there’s something new, you’ll need to re-analyze the entire dataset.
The Advantage of Streaming + Machine Learning
Incremental machine learning algorithms learn and update a model on the fly, so predictions are based on a dynamic model. Supervised learning in streaming continuously learns as new data arrives and is labelled, so you can do accurate scoring in real-time, and have it adapt to changing situations. You don’t have to wait until more data is collected.
Unsupervised learning in streaming is able to detect novel patterns in streaming data in real-time without any re-analysis of previously examined data. This means you only need to persist a comparatively small amount of data, and the analysis adapts as the streamed data and its patterns change.
By combining smart data streaming with integrated machine learning algorithms, you can leverage both supervised and unsupervised learning to train models, score and cluster data all in real-time with modest memory and storage requirements. As of SPS 11, Smart Data Streaming has two classification functions that work in tandem for supervised learning: Hoeffding Tree Training and Hoeffding Tree Scoring, and one clustering function for unsupervised learning: DenStream Clustering. These are native CCL functions that can be used directly within streaming projects.
Training and Scoring Data with the Hoeffding Tree Algorithm
The adaptive Hoeffding Tree is an incremental decision tree algorithm that only needs a limited number of samples to choose the best tree node splitting attribute. So, you can continually train the model on streamed data, and you don’t need to re-train before starting to score data.
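The "limited number of samples" comes from the Hoeffding bound, which says how far an observed average can be from the true average after n observations. The Python sketch below shows the textbook form of that check, assuming a split metric with range R (for example, information gain with two classes has R = 1) and a confidence parameter delta. The function names are hypothetical, and this is only the basic bound, not the adaptive variant or the actual parameters used by the streaming functions.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """epsilon = sqrt(R^2 * ln(1/delta) / (2 * n))"""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float,
                 value_range: float, delta: float, n: int) -> bool:
    """Split once the best attribute leads the runner-up by more than epsilon."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


# Same observed gains, different sample counts (illustrative values only).
for n in (200, 5000):
    eps = hoeffding_bound(1.0, 1e-7, n)
    print(n, round(eps, 4), should_split(0.30, 0.20, 1.0, 1e-7, n))
```

With the same gap between the two best attributes, the leaf holds off at 200 samples and splits at 5000, because the bound shrinks as more events arrive.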
DenStream Clustering Algorithm
DenStream Clustering is an incremental clustering algorithm that uses micro-clusters to summarize clusters of arbitrary shapes and an elaborate pruning technique to detect outliers. Again, as your data streams in, the model adapts and changes – outliers may end up becoming new clusters.
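As a rough illustration of what a micro-cluster is, the Python sketch below keeps a decayed weight, a decayed linear sum, and a decayed squared sum, which is enough to derive a center and a radius without storing the points themselves. It follows the general scheme of the published DenStream algorithm (fading function 2^(-lambda * t), and a weight threshold of beta * mu separating potential clusters from outliers), but the class, its method names, and the parameter values are assumptions made for this example, not the implementation inside the streaming function.

```python
import math


class MicroCluster:
    """Hypothetical, simplified micro-cluster summary with exponential fading."""

    def __init__(self, dim: int, decay: float):
        self.decay = decay              # lambda in the fading function
        self.weight = 0.0
        self.linear_sum = [0.0] * dim
        self.squared_sum = [0.0] * dim
        self.last_update = 0.0

    def fade(self, now: float) -> None:
        # Older points count for less: scale everything by 2^(-lambda * dt).
        factor = 2.0 ** (-self.decay * (now - self.last_update))
        self.weight *= factor
        self.linear_sum = [v * factor for v in self.linear_sum]
        self.squared_sum = [v * factor for v in self.squared_sum]
        self.last_update = now

    def insert(self, point, now: float) -> None:
        self.fade(now)
        self.weight += 1.0
        for i, x in enumerate(point):
            self.linear_sum[i] += x
            self.squared_sum[i] += x * x

    def center(self):
        return [v / self.weight for v in self.linear_sum]

    def radius(self) -> float:
        # Square root of the summed per-dimension weighted variance.
        var = sum(s / self.weight - (l / self.weight) ** 2
                  for l, s in zip(self.linear_sum, self.squared_sum))
        return math.sqrt(max(var, 0.0))

    def is_potential_core(self, beta: float, mu: float) -> bool:
        # Below beta * mu, the micro-cluster is treated as an outlier candidate.
        return self.weight >= beta * mu


mc = MicroCluster(dim=2, decay=1.0)
mc.insert([0.0, 0.0], now=0.0)
mc.insert([2.0, 0.0], now=0.0)
print(mc.center(), mc.radius(), mc.is_potential_core(beta=0.5, mu=3.0))
mc.fade(now=1.0)  # no new points for one time unit
print(mc.weight, mc.is_potential_core(beta=0.5, mu=3.0))
```

The same mechanism works in the other direction: an outlier micro-cluster that keeps receiving points gains weight and can be promoted, which is how outliers may end up becoming new clusters.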
Using Machine Learning with Streaming
We’ve specifically designed these algorithms to operate against live data streams in real-time. To use machine learning functions in a streaming project, you need to define a predictive analysis model, and configure machine learning function parameters. You can add, edit or delete models within SAP HANA studio, using the data services view in the SAP HANA Streaming Development perspective.
It’s simple to define the parameters of a machine learning model within the Streaming Development perspective.
Once they are configured, you can use saved models in streaming projects to run analytic algorithms on sets of incoming data. You can do this in the CCL editor or the visual editor.
The model (bottom element) interacts with other elements in the streaming project.
Same Steps Each Time
To use any algorithm in a streaming project:
- Make sure you’re connected to a streaming server.
- Create or load a HANA data service. If you’re going to create multiple models, keep in mind that the permissions enabled here apply to all models using this data service.
- In the model folder of your data service, add or update a model. (You can find the specific parameters for each type of model in the Machine Learning section of the Streaming Developer Guide.)
- Once you’ve got a model, add or open a streaming project.
- Next, add a model, an input stream and a model stream to the project, using the same schema.
When you start running the project, you can see the data in the stream view. You can also go to the HANA Admin Console, and see where the model information is automatically stored in tables.
Videos and Further Reading
We’ve developed a series of videos related to using Machine Learning with streaming, including an overview and demos specifically related to setting up each type of model, and using that model within a streaming project. You can find these videos (along with others related to streaming) on the Smart Data Streaming playlist of the HANA Academy YouTube channel.
Hoeffding Tree videos
Using the algorithm for training data
Using the algorithm for scoring data
DenStream Clustering videos
And for details about each of the algorithms and their parameters, as well as a step-by-step workflow and CCL examples, see the related documentation on the SAP Help Portal: Machine Learning with Streaming.