Hoeffding Tree Overview – Creating a Training Model (Part 2)
Welcome to part 2 of the Hoeffding Tree machine learning series. This series teaches you how to build a streaming project in SAP HANA studio that can execute a training model. The second video is now available here. For a refresher on how to create a training model, check out the first video and blog.
As promised, we’ve posted this second blog in tandem with the video release to give you a quick overview while offering a sneak peek of what’s to come. Even though we’re working in studio, keep in mind that this series also applies to Web IDE users. If you want to create machine learning models in Web IDE, then check out this blog post.
Here’s an overview of part 2 and a look at what’s to come.
Part 2 is the conclusion of the training phase, which involves creating and building a project that uses the Hoeffding Tree Training algorithm to train your model. For this tutorial, you’ll build the project using a code snippet. After compiling and running the project, you’ll upload data from a CSV file and view the output. You can download the code snippets and source data from https://github.com/saphanaacademy/SDS. Bookmark this page, as you’ll need the resources there for later videos in this series. The code in GitHub is not official SAP code; it is sample code that you’ll use for this tutorial series.
For this video, the training data comes from the CLAIMS_TRAIN.csv file. Each row describes one insurance claim: the policy type, the claimant’s age, the claim amount, the claimant’s department (the OCCUPATION column in the schema), and whether the claim is fraudulent. The input stream, ‘in1’, collects this data.
Here’s a look at the data:
Before this data can be collected, you need to create and build a streaming project.
Creating and building a project
In studio or Web IDE, create a new project. Then copy the CCL from the sha_hoeffding_train.ccl file and use it to replace the default CCL in your project. This tutorial loads its data from a CSV file, but streaming analytics can take input from many other sources, such as an XML file or an IoT device.
The CCL structure for executing a machine learning model is simple: declare the model, set the schemas, set the data service connection, and create the input and output streams. You need one input stream to gather the data and one output stream to execute the model. Our example adds a second output stream that collects the results the model produces.
Here’s what the CCL looks like:
CREATE SCHEMA schema_in (
    POLICY string,
    AGE integer,
    AMOUNT integer,
    OCCUPATION string,
    FRAUD string
);

CREATE SCHEMA schema_out (ACCURACY double);

DECLARE MODEL sha_hoeffding_train
TYPE MACHINE_LEARNING
INPUT SCHEMA schema_in
OUTPUT SCHEMA schema_out
PROPERTIES dataservice = 'hanadb';

CREATE INPUT STREAM in1 SCHEMA schema_in;

CREATE OUTPUT STREAM model_stream AS
    EXECUTE MODEL sha_hoeffding_train FROM in1;

CREATE OUTPUT STREAM out1 AS
    SELECT * FROM model_stream;
These input and output schemas mirror the schemas you used in the first video, where you created your training model. The ‘DECLARE MODEL’ statement sets the name of the model, the input and output schemas, and the data service. The input stream, ‘in1’, collects the training data, and the output stream, ‘model_stream’, collects this data from ‘in1’ and executes the training model. Another output stream, ‘out1’, collects all results. You can write ‘out1’ to a HANA table, or stream it to anywhere you please.
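Because ‘in1’ expects rows shaped like schema_in, it can be handy to check a CSV file against that shape before uploading it. The following is a minimal Python sketch of such a check, not part of the SAP sample code. It assumes the file is comma-delimited with no header row and that the columns appear in schema order. The helper name `check_rows` and the sample rows are made up for illustration and are not taken from CLAIMS_TRAIN.csv.

```python
import csv
import io

# Column names and types copied from schema_in in the CCL above.
SCHEMA_IN = [
    ("POLICY", str),
    ("AGE", int),
    ("AMOUNT", int),
    ("OCCUPATION", str),
    ("FRAUD", str),
]


def check_rows(csv_text):
    """Return (row_number, problem) pairs for rows that do not fit SCHEMA_IN."""
    problems = []
    for row_number, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        if len(row) != len(SCHEMA_IN):
            problems.append(
                (row_number, f"expected {len(SCHEMA_IN)} fields, got {len(row)}")
            )
            continue
        for (name, kind), value in zip(SCHEMA_IN, row):
            if kind is int:
                try:
                    int(value)
                except ValueError:
                    problems.append(
                        (row_number, f"{name} is not an integer: {value!r}")
                    )
    return problems


# Hypothetical rows for illustration only (assumption: no header, comma-delimited).
sample = "Vehicle,34,1200,Sales,No\nHome,forty,900,IT,Yes\nTravel,51,300\n"
print(check_rows(sample))
```

To check a real file, read its contents into a string and pass that to `check_rows`. An empty result means every row has five fields and integer values for AGE and AMOUNT.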
Before moving on, make sure the name of the training model you’re declaring and executing is the same as the name of the model you created in Part 1. The project won’t run properly if the names don’t match. The code snippet uses ‘sha_hoeffding_train’; in our case, the model from Part 1 is named ‘hoeffdingtrain’, so the statements become:
DECLARE MODEL hoeffdingtrain
TYPE MACHINE_LEARNING
INPUT SCHEMA schema_in
OUTPUT SCHEMA schema_out
PROPERTIES dataservice = 'hanadb';

CREATE INPUT STREAM in1 SCHEMA schema_in;

CREATE OUTPUT STREAM model_stream AS
    EXECUTE MODEL hoeffdingtrain FROM in1;
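If you edit the CCL by hand, it is easy to rename the model in one statement and miss the other. The following is a rough Python sketch of a text check that the DECLARE MODEL and EXECUTE MODEL statements both use the expected name. It is a plain regular-expression scan, not a CCL parser, and the function names `model_names` and `names_match` are hypothetical helpers invented for this example.

```python
import re


def model_names(ccl_text):
    """Return the model names found after DECLARE MODEL and EXECUTE MODEL."""
    declared = re.findall(r"DECLARE\s+MODEL\s+(\w+)", ccl_text, flags=re.IGNORECASE)
    executed = re.findall(r"EXECUTE\s+MODEL\s+(\w+)", ccl_text, flags=re.IGNORECASE)
    return declared, executed


def names_match(ccl_text, expected):
    """True only if both statements exist and every name equals `expected`."""
    declared, executed = model_names(ccl_text)
    if not declared or not executed:
        return False
    return all(name == expected for name in declared + executed)


# The statements from this post, with the model name from Part 1.
ccl = """
DECLARE MODEL hoeffdingtrain
TYPE MACHINE_LEARNING
INPUT SCHEMA schema_in
OUTPUT SCHEMA schema_out
PROPERTIES dataservice = 'hanadb';

CREATE INPUT STREAM in1 SCHEMA schema_in;

CREATE OUTPUT STREAM model_stream AS
    EXECUTE MODEL hoeffdingtrain FROM in1;
"""

print(model_names(ccl))
print(names_match(ccl, "hoeffdingtrain"))
```

This only confirms that the two statements agree with the name you pass in. Whether a model with that name actually exists in the HANA data service still has to be checked in the Part 1 setup.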
With the CCL in place, compile and run the project. Once it is running, it can continuously receive data and keep building on your model. Open the input and output streams so you can see the data as it uploads.
Uploading data and viewing the output
Via the File Upload view, upload the CLAIMS_TRAIN.csv file:
The input tab shows the 16 rows of training data:
The accuracy of the model (in terms of predictions) increases with more data. You can view the accuracy in the output tab:
Once the model’s accuracy is high enough for your purposes (80%, for example), you can begin using it to make predictions about future events.
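To make the idea of accuracy changing as records arrive more concrete, here is a small Python sketch that computes a running accuracy over a sequence of predictions and compares the latest value to a threshold. It only illustrates the concept and is not how the Hoeffding Tree Training algorithm computes the ACCURACY column. The function names, the 0 to 1 scale, and the prediction and label values are all assumptions made up for this example.

```python
def running_accuracy(predictions, labels):
    """Yield the fraction of correct predictions after each record."""
    correct = 0
    for seen, (predicted, actual) in enumerate(zip(predictions, labels), start=1):
        if predicted == actual:
            correct += 1
        yield correct / seen


def meets_threshold(accuracies, threshold=0.8):
    """True if the most recent accuracy value is at or above the threshold."""
    return bool(accuracies) and accuracies[-1] >= threshold


# Hypothetical predicted and actual FRAUD values, for illustration only.
predicted = ["No", "Yes", "No", "No", "Yes"]
actual = ["No", "No", "No", "No", "Yes"]

history = list(running_accuracy(predicted, actual))
print(history)
print(meets_threshold(history))
```

The check looks at the latest value rather than the first value to cross the threshold, because early values computed from only a handful of records can be misleadingly high.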
Stay tuned for part 3 of the video series (and its associated blog post), where you’ll create a scoring model that references your training model. To close the series, you’ll create and build a project that executes the scoring model to predict whether future insurance claims are fraudulent.
For more on machine learning models, check out the Model Management section of the Streaming Analytics Developer Guide. If you’re interested in creating machine learning models in Web IDE, then check out this blog post.