Going Back to School with the SAP HANA Academy: PAL: 79. Clustering – K-Medians in SPS09

It’s really strange that after a blog in which I quoted a comedian paraphrasing the SAS, I am now thinking about comedians, instead of K-Medians.  In this video Philip looks at a new clustering algorithm (K-Medians) added with SPS09 of SAP HANA in the Predictive Analysis Library.   

Picture1.png  

However, having reconsidered the title, I am going back to school not as a student but as a new teacher.  My first post was teaching Mathematics. I had not done Mathematics since I was 18, and in the intervening twelve years I had been to university to do Politics, worked on Emergency Response, been on a Graduate Management Scheme and been an IT contractor.  I used to discuss lessons with an experienced teacher who, despite being a non-smoker, used to spend his lunchtime in the smoking staff room.  So I love the style shown in this video.  Nothing is assumed and key points are repeated at just the right time.

Philip starts, as always, by referring to the reference guide and then demonstrates the mean and median using the Maths is Fun website.  Thankfully this stirred no memories for me: showing my age (I only taught Mathematics for a year), computers were not as widespread in teaching back then. I remember a senior teacher who thought she used IT in her lessons (a new lesson requirement) by having one slide with her lesson objectives projected onto the wall.

Anyway, back to the K-Medians algorithm.  It uses the median, rather than the mean, of the data in each cluster to calculate the cluster centre.  The table below shows each algorithm's name, the type of average it is derived from and how that average is calculated for a given set of numbers.

| Algorithm | Average | Calculation |
|-----------|---------|-------------|
| K-Means | Mean | Add all the numbers together. Divide by the number of numbers. |
| K-Medians | Median | Put all the numbers in order of size. If the number of numbers is odd, select the middle number. If the number of numbers is even, calculate the mean of the two middle numbers. |
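The two calculations in the table can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the PAL code; the sample numbers are made up, with one outlier chosen to show how far the mean and median can drift apart.

```python
def mean(values):
    """Add all the numbers together, then divide by the number of numbers."""
    return sum(values) / len(values)


def median(values):
    """Put the numbers in order, then take the middle one.

    For an even count, return the mean of the two middle numbers.
    """
    ordered = sorted(values)
    count = len(ordered)
    middle = count // 2
    if count % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


sample = [3, 5, 7, 9, 100]  # hypothetical data with one outlier
print(mean(sample))    # 24.8
print(median(sample))  # 7
```

The single large value drags the mean up to 24.8 while the median stays at 7, which is why the two algorithms can place their cluster centres differently.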

Philip points out that the mean and median may not be the same, and that this affects how the initial cluster centres are calculated.  The K-Means and K-Medians algorithms iterate, optimizing until the results settle down.  This is a very intensive process which is best done directly in memory in the HANA engine itself.
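To make the "iterate until the results settle down" idea concrete, here is a minimal sketch of a K-Medians loop in plain Python. It is not the PAL implementation: the Manhattan distance, the function names and the way the initial centres are passed in are all assumptions made for illustration.

```python
from statistics import median


def manhattan(a, b):
    """Sum of absolute differences between two points (an assumed distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def assign(points, centres):
    """Give each point the index of its nearest centre."""
    return [
        min(range(len(centres)), key=lambda k: manhattan(p, centres[k]))
        for p in points
    ]


def k_medians(points, initial_centres, max_iterations=100):
    """Alternate assignment and median updates until the centres stop moving."""
    centres = [tuple(c) for c in initial_centres]
    for _ in range(max_iterations):
        labels = assign(points, centres)
        new_centres = []
        for k, old_centre in enumerate(centres):
            members = [p for p, label in zip(points, labels) if label == k]
            if not members:
                # Keep an empty cluster's centre where it was.
                new_centres.append(old_centre)
                continue
            # The new centre is the column-by-column median of the members.
            new_centres.append(tuple(median(col) for col in zip(*members)))
        if new_centres == centres:
            break  # the result has settled down
        centres = new_centres
    return assign(points, centres), centres


points = [(1, 1), (2, 1), (1, 2), (10, 10), (11, 10), (10, 11)]
labels, centres = k_medians(points, [(0, 0), (12, 12)])
print(labels)   # [0, 0, 0, 1, 1, 1]
print(centres)  # [(1, 1), (10, 10)]
```

Swapping `median` for a mean in the update step would turn this sketch into K-Means, which is the whole difference the table above describes.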

Philip then refers back to the PAL Reference Guide and discusses how the algorithm deals with categorical data.  He uses the example of gender, i.e. male or female.  The algorithm can automatically convert such categorical data into numeric values so that they can be used in the formulae.
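One common way of turning a category such as gender into numbers is to give each category its own 0/1 column. The sketch below shows that idea only; whether PAL encodes categories exactly this way is an assumption here, and the `one_hot` helper is hypothetical.

```python
def one_hot(values):
    """Return the sorted category names and one 0/1 row per input value."""
    categories = sorted(set(values))
    encoded = [[1 if value == c else 0 for c in categories] for value in values]
    return categories, encoded


categories, encoded = one_hot(["male", "female", "female"])
print(categories)  # ['female', 'male']
print(encoded)     # [[0, 1], [1, 0], [1, 0]]
```

Once the values are numeric like this, they can take part in the same distance and averaging calculations as the other columns.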

You can see the input and output tables are similar to those for K-Means which was already available before SPS09.


Picture2.png

 

If you look at the code you can see that very little has changed since SPS08 except for the AFL wrapper procedure and the ability to specify a schema. 

Remember that with SPS09 you can’t use SYS_AFL anymore.

Picture3.png 

You also need to have the data in the right format.  The customers table in the PAL schema includes the customer name, which is not needed.  You only need an ID and the numeric values you are going to cluster by.  The example below shows clustering on four values. This code will create a results table, a centres table and an output view that lets you add the customer name back in.  The algorithm creates the centre ID, so each customer is assigned a cluster number.  Please note the auto-numbering starts from zero, but you can change that as below by adding 1 so that it starts from 1.  This can also be applied to the centre ID.

Picture4.png 
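The idea behind the output view, putting the customer name back and shifting the zero-based cluster number up by 1, can be sketched outside HANA as follows. The customer names, IDs and the `build_view` helper are invented for illustration; the real view in the video is written in SQL.

```python
def build_view(customers, results):
    """Join names onto the results and make cluster numbers start from 1."""
    return [
        (customer_id, customers[customer_id], cluster + 1)
        for customer_id, cluster in results
    ]


# Hypothetical data: ID -> name, and (ID, zero-based cluster) result rows.
customers = {101: "Alice", 102: "Bob", 103: "Carol"}
results = [(101, 0), (102, 2), (103, 1)]

for row in build_view(customers, results):
    print(row)
# (101, 'Alice', 1)
# (102, 'Bob', 3)
# (103, 'Carol', 2)
```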

The parameters below allow you to change the clustering and include options to do with seeding. 

Picture5.png

Philip then uses the manual to discuss the different parameters available, including how to do the seeding, normalization and other functions.

Picture6.png
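As a rough illustration of why normalization matters, here is a min-max rescaling sketch that puts every column on a 0 to 1 scale, so that a large-valued column such as income does not swamp a small-valued one such as loyalty. This is one generic approach and an assumption on my part; the options PAL actually offers are the ones listed in the reference guide.

```python
def min_max_normalize(rows):
    """Rescale each column to the range 0..1 (a constant column becomes 0.0)."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        [
            (value - low) / (high - low) if high != low else 0.0
            for value, low, high in zip(row, lows, highs)
        ]
        for row in rows
    ]


rows = [[100, 1], [200, 3], [300, 5]]  # hypothetical income and loyalty values
print(min_max_normalize(rows))  # [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
```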

After telling us what’s going to happen next and making sure we all understand where the documentation is and which parts are being referred to, Philip deems us ready to call the procedure.

Picture7.png 

Above, in the results view, you can see that the customer name has been added and that 1 has been added to the customer number so that it starts from 1. You can also see which cluster each customer has been assigned to, based on the combination of lifespend, newspend, income and loyalty.

Picture8.png

In the centre table we get a row for each cluster, which holds the values of lifespend, newspend, income and loyalty that represent the centre of that cluster. These values are hard to visualise from raw table data alone and are best viewed in a chart.

Picture9.png

You can also change the group number as below (5) to get a different number of clusters.

Picture10.png

Now when you run the algorithm it will create five different clusters as shown above and below.

Picture11.png

Picture12.png

This is also shown in the chart after it has been refreshed. You can also see how many customers are in each cluster.

Picture13.png
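Counting how many customers fall into each cluster is a simple tally over the assignments. A minimal sketch, using made-up cluster numbers rather than the data from the video:

```python
from collections import Counter


def cluster_sizes(assignments):
    """Count the customers in each cluster, ordered by cluster number."""
    return dict(sorted(Counter(assignments).items()))


assignments = [1, 3, 2, 1, 5, 4, 1, 2]  # hypothetical cluster number per customer
print(cluster_sizes(assignments))  # {1: 3, 2: 2, 3: 1, 4: 1, 5: 1}
```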


Plenary


Philip has made an easy-to-follow, self-contained video that makes no assumptions of prior learning and revisits the key differences between SPS09 and SPS08 as appropriate.  He has made the learning relevant and engaging.  He has rekindled fond memories of teaching Mathematics back in the days of innocence.


My memories of learning to teach mathematics are forever associated with a time back in the day when the interesting teachers smoked.  I gleaned pearls of wisdom with stinging eyes while trying not to breathe deeply in a smoke-filled environment.  As Political Correctness took over, the spaces allocated to smokers got smaller and smaller.  The room smokers were consigned to became more inconvenient with the passing of time.  Eventually, just before the school was rebuilt, they ended up in a windowless room where requests about the extractor fan were met with a rebuttal stating that if smokers wanted fresh air they could sit in the staff room and not smoke. After the school was rebuilt it was designated non-smoking.  The smokers used to sit on the cricket pavilion opposite the school until they were told that they were making the school look disreputable.  The last I heard, “the magnificent seven”, who included my non-smoking friend, were confined to sharing adjacent cars parked off the school premises.

The banishment of smoking, teaching mathematics and the advance of computers are intertwined for me.  Thanks to Philip, I now also associate calculating averages with a little bit of code rather than a smoke filled room.

