[SAP HANA Academy] Live3: Explain Clustering
[Update: April 5th, 2016 – The Live3 on HCP tutorial series was created using the SAP HANA Cloud Platform free developer trial landscape in January 2015. The HCP landscape has evolved significantly over the past year, so you may encounter issues when following along with the series on the most recent version of the free developer trial edition of HCP.]
In the next part of the Live3 course, Philip Mugglestone explains how the SAP HANA Predictive Analysis Library (PAL) can be used to cluster similar Tweeters together based on their influence and stance scores. The video also reviews the k-means clustering algorithm. Check out Philip’s tutorial video below.
(0:35 – 3:20) Overview of PAL
For in-depth information about PAL, browse through this playlist of 84 videos from Philip in the SAP HANA Academy. The playlist covers many of PAL’s native algorithms, including clustering with the K-means algorithm.
Reading through the SAP HANA PAL documentation is vital for getting a full understanding of the myriad capabilities PAL offers. In a web browser visit help.sap.com/hana and click on SAP HANA options. Select the SAP HANA Predictive link and then you can choose to view the PAL documentation in a PDF or online.
PAL is a set of data mining algorithms embedded in the SAP HANA engine (where the data actually resides). By navigating through the documentation page you can find information on K-means clustering.
(3:20 – 4:40) K-means Clustering Information
K-means takes a set of input records (in Live3, Twitter users) together with numeric attributes for each record (influence and stance) and groups the records into clusters based on similarities in those attributes.
K-means clustering in PAL is a table-based mechanism. The documentation is the go-to source for K-means clustering information, including the data types of your input data, which parameters are required, how many clusters to create and where their centers are.
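To make the idea concrete, here is a minimal pure-Python sketch of the k-means loop (assign each point to its nearest center, then move each center to the mean of its members). It is only an illustration of the general algorithm, not PAL's implementation or its parameters; the handles, scores, and the choice to seed the centers with the first k points are all assumptions made for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points seed the centers (a simplification)."""
    centers = [points[i] for i in range(k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignments, centers

# Hypothetical (stance, influence) scores keyed by made-up Twitter handles.
tweeters = {
    "@fan_a": (3.0, 5.0),
    "@critic_b": (-3.0, 4.0),
    "@fan_c": (2.0, 6.0),
    "@critic_d": (-2.0, 5.0),
}
handles = list(tweeters)
labels, centers = kmeans([tweeters[h] for h in handles], k=2)
clusters = dict(zip(handles, labels))  # handle -> cluster number
```

In this toy data the two "fan" handles end up in one cluster and the two "critic" handles in the other, which mirrors the goal of grouping Tweeters with similar stance and influence.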
(4:40 – 7:20) Visualizing Tweeters’ Stance and Influence Scores
Back in Eclipse, do a data preview on the Tweeters table we just created. This Tweeters table will be the input table for the predictive analysis. Our ID will be the Twitter users’ handles and our inputs will be the stance and influence scores.
Clicking on the Distinct values tab quickly displays the range of the stance and influence values for all of the Twitter users. In Philip’s data on the Australian Open, over 67% of the users have a stance score of 0, so they are considered neutral, while over 70% have an influence score of -1.
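The percentages on the Distinct values tab are simply the share of rows holding each value. Outside of Eclipse, the same kind of summary can be sketched in a few lines of Python. The scores below are invented for illustration and are not Philip's data, and `value_shares` is a hypothetical helper name.

```python
from collections import Counter

def value_shares(values):
    """Return {value: percentage of rows holding that value} for one column."""
    counts = Counter(values)
    total = len(values)
    return {value: 100.0 * n / total for value, n in counts.items()}

# Hypothetical stance scores for eight users (illustration only).
stance_scores = [0, 0, 0, 1, -1, 0, 2, 0]
shares = value_shares(stance_scores)
neutral_share = shares[0]  # share of users with a stance score of 0
```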
To further analyze the data, Philip clicks the Analytics tab and then drags both the stance and influence numerics to the Values axis. Then he selects a scatter chart to visualize a cross section of scores for each user. This divides all of the users into quadrants based on their stance and influence. One piece of business value we could quickly derive would be to target the people in the top left quadrant, who are highly influential and expressing negative views, with educational outreach.
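The quadrant reading of the scatter chart can also be expressed as a small rule. The sketch below assumes stance is on the horizontal axis and influence on the vertical axis (consistent with the top left quadrant being influential and negative), and it assumes cut-off values of 0 for both scores. Those cut-offs, the function name, and the sample users are illustrative assumptions, not values from the video.

```python
def quadrant(stance, influence, stance_cut=0.0, influence_cut=0.0):
    """Name the scatter-chart quadrant for one user.

    Assumed layout: stance on the x-axis, influence on the y-axis.
    Scores exactly on a cut-off fall to the right / bottom (a simplification).
    """
    vertical = "top" if influence > influence_cut else "bottom"
    horizontal = "left" if stance < stance_cut else "right"
    return f"{vertical} {horizontal}"

# Hypothetical users as (handle, stance, influence).
users = [
    ("@critic_b", -3, 4),
    ("@fan_a", 3, 5),
    ("@quiet_e", 0, -1),
]

# Influential users expressing negative views: candidates for outreach.
outreach = [h for h, s, i in users if quadrant(s, i) == "top left"]
```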
Follow along with the Live3 course here.
SAP HANA Academy: over 900 free tutorial videos on using SAP HANA and SAP HANA Cloud Platform.