[SAP HANA Academy] Live3: Preform Clustering
[Update: April 5th, 2016 – The Live3 on HCP tutorial series was created using the SAP HANA Cloud Platform free developer trial landscape in January 2015. The HCP landscape has significantly evolved over the past year. Therefore one may encounter many issues while following along with the series using the most recent version of the free developer trail edition of HCP.]
The SAP HANA Academy’s Philip Mugglestone continues along in Live3 course by reviewing and running the SQL Script necessary to setup and execute K-means clustering analysis using the SAP HANA predictive analysis library. Philip also shows how to create views that will be accessed via web services down the road. Please watch Philip’s video below.
(0:30 – 2:00) Pasting Predictive SQL code and Replacing Schema Name
With Notepad++ open file 05 setupPredictive.sql from the scripts folder of the Live3 code repository on GitHub. Copy and paste the code into a SQL console in Eclipse.
First make sure you’re set to the proper schema. Second you must paste your schema name everywhere in the code where it currently has neo_ as a placeholder. There are five lines of code that must be corrected and you do have the option of preforming a global replace.
(2:00 – 4:35) Examination of the Setup Predictive Code – HCP AFL Wrapper Generator Stored Procedure
The first part of the code will preform a clean up in case the syntax has been previously run.
The next section creates a stored procedure using a special HCP stored procedure called HCP AFL Wrapper Generator. As HCP is a multi-tenant version of SAP HANA, this stored procedure is slightly different than the on-premise SAP HANA AFL Wrapper Generator. This stored procedure creates a set of input table types that reflects the input data table, the parameter table and the two results tables.
A reference to these table types is put into a signature table. Then the Wrapper Generator is called with the name of the procedure we want it to create (Tweeters_Cluster), the name of the algorithm we want to use (KMEANS) and the four input/output tables we want to use from the signature table. This allows SAP HANA to generate a stored procedure to preform the clustering that we can then call later on.
After executing the stored procedure, Philip examines it in the procedures folder of the newly created SYS_AFL schema.
(4:35 – 6:15) Examination of the Setup Predictive Code – Creating Parameter and Output Tables
The next part of the code creates the parameter table that will be used. The table’s type has already been defined in the stored procedure so a column table can be created using the like statement. Then values are inserted into the table. Philip recommends optimizing the number of clusters that will be created by setting a minimum value of 5 and a maximum value of 10. Execute that section of code to create the parameter table.
Continuing on in the code is two lines that create a pair of empty output tables called PAL_RESULTS and PAL_Centers. These output tables are created again with the like statement so they use the table types that have already been set up. The table types respect the structure that was listed in the documentation.
(6:15 – 8: 35) Running the Code and Examining the Tables
To actually run the clustering is rather simple. First the results tables must be empty and then the stored procedure must be called. This is the section that is run on a regular basis, possibly every minute or hour. The input table/view is the Twitter view, which will send real-time data straight into the stored procedure in the SAP HANA engine. The stored procedure does an intense job of reading the entire set of data and iterating it up to 100 times to work out the clusters.
At the end of the code is a set of select statements that let us see the data in the output tables. The first result table shows a cluster number between 0 and 9 for each of the Twitter users from the input data. The distance number in the results table details how far a user is away from the center of their assigned cluster. The second output table shows the stance and influence value for each cluster.
(8:35 – 10:30) Examination of the Setup Predictive Code – Creating Views for Web Services
The code also creates a pair of views. The TweetersClustered view pulls in the original tweeting information, adds 1 to the cluster number and counts the total number of tweets each user sent. This enables us to see the stance, influence, the total number of tweets and the cluster number for each Twitter user.
The Clusters view shows the center of the clusters. This view adds one to the cluster’s value and shows the number of people in the cluster using a join of a select of the number of users assigned to each cluster.
(10:30 – 12:00) Analysis of the TweetersClustered View
Open a data preview for the TweetsClustered view in the HCP schema and go to the Analysis tab. Drag the user and the clusterNumber over on the Labels axis and the stance and influence over on the Values axis. Change to a bubble chart to see a plot of the influence and stance for each individual Twitter user with a color for each individual cluster number.
Follow along with the Live3 on HCP course here.
SAP HANA Academy over 900 free tutorial videos on using SAP HANA and SAP HANA Cloud Platform.