Streaming Analytics: Using Web IDE for Machine Learning
Predictive analysis can be complex and intimidating to set up. While its applications are practically endless, the learning curve can be a challenge. However, whether you’re a data scientist or simply a developer who wants to start working with machine learning, SAP HANA streaming analytics provides a simple interface to get you started.
In streaming analytics (and of course, SAP HANA itself), you don’t need to be an expert on data mining to make data work for you. On top of that, you can do practically everything through your browser, which is where the HANA Web IDE comes in.
I’ve talked about the machine learning capabilities of streaming analytics in a previous blog, but never really delved into the whole process. In summary, streaming analytics can apply predictive analysis functions to streaming data in real time, using customized machine learning models. This means you can keep getting better and better data without manually entering massive amounts of information multiple times (though if you want to do that, streaming analytics can handle that too).
Before getting into specifics for Web IDE, here’s a quick reminder of the types of models we have available in streaming analytics:
- Hoeffding Tree Training for Classification, which continuously works to discover predictive relationships, even as the streaming data changes.
- Hoeffding Tree Scoring for Classification, which applies the trained predictive model to the data.
- Decision Tree Scoring, which applies trained decision tree models imported from SAP HANA tables to the data.
- DenStream Clustering, which groups and prunes data object points, based on their weighted significance.
You can read more about each model type in the streaming analytics documentation.
In contrast to some earlier content, this blog is a more comprehensive, start-to-finish guide on how to work with these models in Web IDE and the streaming analytics runtime tool.
Note: If you’ve worked with the streaming runtime tool before and have existing workspaces and a HANA data service, skip ahead to Part 3.
Part 1: Enabling the streaming analytics runtime tool (or SRTT for short)
In Web IDE, there are two main ways to prep a machine learning model for use in a streaming analytics project:
- Create a model directly in the plugin.
- Import a workspace and all of its models into SRTT, then edit the models there.
Either way, you’ll need to have both streaming plugins enabled in the Web IDE.
Part 2: Choosing a workspace and connecting to HANA
Next, you’ll need a workspace to put models into.
- Open the streaming analytics runtime tool.
- Add a workspace. You have two options here:
- Build a streaming project from Web IDE, and let it automatically create a workspace for you.
- Register an existing workspace: in SRTT, right-click on Streaming Workspaces, select Register Custom Workspace, and fill out your details.
Then, you need to create a data service connection to HANA. Why is this needed right away? The models you create are going to store metadata in HANA, which includes properties, version information, a snapshot of the latest content, and more.
If you don’t have a HANA data service:
- In the relevant workspace, right-click on Data Services and select Add Hana Service.
- Enter your connection details and save them. The data service is available to use immediately after you save it.
Part 3: Creating or editing the model
Now, you can load and edit any existing models, or create a completely new one.
- Drill down through the workspace folder into the models (Workspace > Project > PAL Models).
If you’re unfamiliar with the acronym, PAL stands for “predictive analysis library”.
- Right-click a data service folder to create a new model, or
- Open a data service folder, then double-click an existing model to edit it.
Either way, the model opens up to the right of the workspace pane.
- Set up your model properties and parameters.
Note: Depending on the model that you pick, there may be a group of mathematical settings at the bottom of the list, beginning after Sync Point. These are parameters for the algorithm itself, and unlike most of the general properties for the model, they won’t show up in CCL.
- Enter any name and description for your model. You’ll need to reference the model by name in CCL.
- Choose a function to base the model on.
If you pick a scoring function, make sure that you have an existing trained model to reference (for Decision Tree Scoring, you’ll need to import it from HANA first).
- Enter an input schema. This has to be in a specific format, depending on the function:
- Training: [IDS]+S
- Scoring: [IS][IDS]+ (the [IDS]+ portion has to be the same as in the referenced training model)
- Clustering: [IS][IDS]+
Legend: I=Integer; D=Double; S=String; + = any number of any combination of columns in the preceding set of brackets.
For example: [IDS]+S could be integer, string, double, double, string, string.
- Enter an output schema.
- Training: D
- Scoring: [IS]SD
- Clustering: [IS]ISDISS
For example, [IS]SD could be either (integer, string, double) or (string, string, double).
- Set the memory quota to the maximum amount of memory you want to allocate to the model, in megabytes.
- Set a sync point as a number of rows (N) or seconds (S). This is the period after which streaming analytics will sync with the HANA database.
- If you’re using a scoring function, enter the referenced training model name.
- If you’re using a training or clustering function, set the parameters for the algorithm. I won’t get into them in this blog, but you can read up on what each parameter does in the documentation. For now, you can accept the defaults.
- Save the model.
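If you want to sanity-check a schema before typing it into the model editor, the formats above happen to read like regular expressions over the one-letter type codes. Here’s a minimal Python sketch of that idea. It isn’t part of streaming analytics or SRTT; the function names are made up for illustration, and the last helper reflects my reading of the rule that the [IDS]+ portion of a scoring schema has to match the referenced training model.

```python
import re

# One-letter codes from the legend: I = integer, D = double, S = string.
TYPE_CODES = {"integer": "I", "double": "D", "string": "S"}

# Schema formats as listed in Part 3, reused here as regular expressions.
INPUT_FORMATS = {
    "training": "[IDS]+S",
    "scoring": "[IS][IDS]+",
    "clustering": "[IS][IDS]+",
}
OUTPUT_FORMATS = {
    "training": "D",
    "scoring": "[IS]SD",
    "clustering": "[IS]ISDISS",
}


def encode(column_types):
    """Turn a list such as ['integer', 'string'] into the code string 'IS'."""
    return "".join(TYPE_CODES[t.lower()] for t in column_types)


def schema_matches(column_types, fmt):
    """Return True if the column types satisfy the entire format string."""
    return re.fullmatch(fmt, encode(column_types)) is not None


def scoring_features_match(scoring_input, training_input):
    """Assumption: compare the scoring columns after the leading [IS] column
    with the training columns before the trailing S column."""
    return encode(scoring_input)[1:] == encode(training_input)[:-1]


# The training example from the text: integer, string, double, double, string, string.
training = ["integer", "string", "double", "double", "string", "string"]
print(schema_matches(training, INPUT_FORMATS["training"]))  # True

# A training schema has to end in a string column, so this one is rejected.
print(schema_matches(["integer", "double"], INPUT_FORMATS["training"]))  # False
```

The check only covers the order and types of columns. Column names, and whether the editor applies any further rules, are outside what this sketch knows about.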
Part 4: Using the model in a streaming analytics project
Once you have a model set up, you can add it to any streaming project using a DECLARE MODEL element.
- Switch over to the Web IDE and open the CCL file for the project you’re going to use (MTA > streaming analytics module > model).
- Add the following code:
DECLARE MODEL <model-name>
TYPE MACHINE_LEARNING
INPUT SCHEMA (<input-schema>)
OUTPUT SCHEMA (<output-schema>)
PROPERTIES dataservice = '<HANA-service-name>';
- Attach an input stream and an output stream to the model.
And you’re done! Now you can run some data through the model to test it. Simply build the streaming project to the workspace where your model is saved, then switch back over to the SRTT to see the model’s calculated results.
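If you declare several models, filling in the DECLARE MODEL template by hand gets repetitive. As a rough sketch, here’s a small Python helper that renders the statement from Part 4 from a list of columns. The model name, column names, and data service name below are all hypothetical placeholders, and I’m assuming each schema entry is written as a column name followed by its type, so check the generated text against your own project before pasting it into CCL.

```python
def render_schema(columns):
    """Format [('Id', 'integer'), ...] as 'Id integer, ...'."""
    return ", ".join(f"{name} {ctype}" for name, ctype in columns)


def render_declare_model(model_name, input_columns, output_columns, data_service):
    """Fill in the DECLARE MODEL template shown in Part 4."""
    return (
        f"DECLARE MODEL {model_name}\n"
        f"TYPE MACHINE_LEARNING\n"
        f"INPUT SCHEMA ({render_schema(input_columns)})\n"
        f"OUTPUT SCHEMA ({render_schema(output_columns)})\n"
        f"PROPERTIES dataservice = '{data_service}';"
    )


# Hypothetical training model: input follows [IDS]+S, output follows D.
statement = render_declare_model(
    "MyTrainingModel",
    [("Id", "integer"), ("Reading", "double"), ("Label", "string")],
    [("Result", "double")],
    "MyHanaService",
)
print(statement)
```

The helper only builds text. You still need to attach the input and output streams in CCL yourself.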