Skip to Content

PAL is an optional component from SAP HANA and the main porpoise of PAL is to enable the modelers to perform predictive analysis over big volumes of data. Currently, PAL includes classic and universal predictive analysis algorithms from six data-mining categories:

  • Clustering
  • Classification
  • Association
  • Time Series
  • Preprocessing
  • Miscellaneous

To be able to call PAL functions from SQLScript you need to be on SPS05 and install the Application Function Library (AFL) which includes PAL and BFL (Business Functions Library).

In this post I’m going to focus on how to use the Apriori Algorithm. Let’s take a look at what the PAL user guide says about it:

“…Apriori is a classic predictive analysis algorithm for finding association rules used in association analysis. Association analysis uncovers the hidden patterns, correlations or casual structures among a set of items or objects. For example, association analysis enables you to understand what products and services customers tend to purchase at the same time. By analyzing the purchasing trends of your customers with association analysis, you can predict their future behavior…”

In other words, this algorithm will find frequent itemsets. This algorithm is designed to run on transactional tables, that means that you can for example use your Sales Order Items table to find patterns on how your customers consume your products or services without the need of having to pre-process the information stored in the table. So running this algorithm is quite simple. The easiest way to understand the porpoise of this algorithm is to go to Amazon and browse a product (any), you’ll notice that there is a section called “Frequently Bought Together”. The content of that section is created using an Association Algorithm like the one we are talking about in this blog (if not the same).

When you run the Apriori procedure in SAP HANA, the results will be stored in a column table that contains all the associations uncovered by the algorithm with three measures: Support, Confidence and Lift


Support

Is the proportion of transactions in the data set which contain a particular itemset. Let’s say we have a Sales Order Items table that looks like this:

Transaction ID

Item

1

Milk

1

Bread

2

Butter

3

Beer

4

Milk

4

Bread

4

Butter

5

Milk

5

Butter

In this example, the Itemset {Milk, Bread, Butter} will have a Support = 1/5, because only 1 transaction out of 5 contains that particular Itemset. If we use the Itemset {Milk, Bread}, the Support will be 2/5, because 2 transactions out of 5 contain that combination.

Confidence

This measure will tell you how often a customer buys Item B when he buys Item A. For example, if we use the table above, the Confidence of {Milk, Bread} will be 2/3, because out of 3 transactions that contain Milk, only 2 contain the Itemset {Milk, Bread}. So the Confidence is defined as:

form1.PNG

Lift

In general, if we have a high Support and a high Confidence, this means we have a strong association. However, this can be deceptive in some cases because if Item A and Item B have a high support, we can have a high value for confidence even when they are independent. A better measure to judge the strength of an association rule is the Lift. The Lift ratio is the confidence of the rule divided by the confidence assuming the consequent (Item B) and the antecedent (Item A) are independent:

form2.PNG

A Lift ratio greater than 1.0 suggests that we have a strong rule. The larger the Lift ratio, the greater is the strength of the association. In our previous example, the Lift Ratio for {Milk, Bread} will be:

form3.PNG

Ok, now that we understand the different concepts associated with this algorithm, let the coding begin!

Code

For my example I would like to know which #HASHTAGS are frequently twitted together in Twitter when Twitting about SAP HANA, so I downloaded a number of Tweets from Twitter (15.000 to be precise) that contain the words SAP and HANA. After downloading the data I created a column table with the following structure to store the Tweets:

CREATE COLUMN TABLE TWT (

       “ID” SMALLINT,

       “DATE” INTEGER, –> I use Integer for the date because I’m using the UNIX TimeStamp Format

       “CONTENT” NVARCHAR(200),

  PRIMARY KEY (“ID”));

After uploading the Tweets to SAP HANA, I ran a Text Analysis process by executing the following statement:

CREATE FULLTEXT INDEX TWT_FTI ON TWT(“CONTENT”)

TEXT ANALYSIS ON CONFIGURATION ‘EXTARCTION_CORE’;

The reason why I ran this Text Analysis process is because I want to identify all the #HASHTAGS in my Tweets. After creating the Full Text Index, a new table called $TA_TWT_FTI is created. Let’s take a look at the content of the table:

/wp-content/uploads/2013/01/cap1_179625.png

As you can see, in this table I can easily identify all the different #HASHTAGS by only filtering by column TYPE = SOCIAL_MEDIA/TOPIC_TWITTER. If you would like to know more about the Text Analysis Process, take a look at my previous post where I describe in detail how to use it. My next step is creating a SQL View on this table to show only #HASHTAGS:

CREATE VIEW “PAL_TRANS_TAB” AS

SELECT ID, TOKEN FROM “$TA_TWT_FTI”

WHERE TYPE = ‘SOCIAL_MEDIA/TOPIC_TWITTER’;

The resulting view looks like this:

/wp-content/uploads/2013/01/cap2_179626.png

This looks pretty similar to a Sales Order Items table, doesn’t it? Just replace the ID with the Sales Order ID and the Token with the Product or Service ID included in the Sales Order. Now I’m ready to run the Apriori algorithm:

SET SCHEMA _SYS_AFL;

/* CREATE TABLE TYPE FOR MY INPUT DATA */

DROP TYPE PAL_DATA_T;

CREATE TYPE PAL_DATA_T AS TABLE(

“ID” INT,

“TOKEN” VARCHAR(100)

);

/* CREATE TABLE TYPE FOR THE OUTPUT TABLE */

DROP TYPE PAL_RESULT_T;

CREATE TYPE PAL_RESULT_T AS TABLE(

“PRERULE” VARCHAR(500),

“POSTRULE” VARCHAR(500),

“SUPPORT” DOUBLE,

“CONFIDENCE” DOUBLE,

“LIFT” DOUBLE

);

/* CREATE TABLE TYPE FOR THE OUTPUT PMML MODEL */

DROP TYPE PAL_PMMLMODEL_T;

CREATE TYPE PAL_PMMLMODEL_T AS TABLE(

“ID” INT,

“PMMLMODEL” VARCHAR(5000)

);

/* CREATE TABLE TYPE FOR THE TABLE THAT WILL CONTAIN THE INPUT PARAMETERS */

DROP TYPE PAL_CONTROL_T;

CREATE TYPE PAL_CONTROL_T AS TABLE(

“NAME” VARCHAR (50),

“INTARGS” INTEGER,

“DOUBLEARGS” DOUBLE,

“STRINGARGS” VARCHAR (100)

);

/* CREATE TABLE THAT WILL POINT TO THE DIFFERENT TYPES I’M USING TO RUN THE ALGORITHM */

DROP TABLE PDATA;

CREATE COLUMN TABLE PDATA(

“ID” INT,

“TYPENAME” VARCHAR(100),

“DIRECTION” VARCHAR(100) );

/* FILL THE TABLE */

INSERT INTO PDATA VALUES (1, ‘_SYS_AFL.PAL_DATA_T’, ‘in’);

INSERT INTO PDATA VALUES (2, ‘_SYS_AFL.PAL_CONTROL_T’, ‘in’);

INSERT INTO PDATA VALUES (3, ‘_SYS_AFL.PAL_RESULT_T’, ‘out’);

INSERT INTO PDATA VALUES (4, ‘_SYS_AFL.PAL_PMMLMODEL_T’, ‘out’);

/* GENERATE THE APRIORI PROCEDURE */

DROP PROCEDURE PAL_APRIORI_RULE;

DROP TYPE PAL_APRIORI_RULE__TT_P1;

DROP TYPE PAL_APRIORI_RULE__TT_P2;

DROP TYPE PAL_APRIORI_RULE__TT_P3;

DROP TYPE PAL_APRIORI_RULE__TT_P4;

call SYSTEM.afl_wrapper_generator(‘PAL_APRIORI_RULE’, ‘AFLPAL’, ‘APRIORIRULE’, PDATA);

/* CREATE TABLE THAT WILL CONTAIN THE INPUT PARAMETERS */

DROP TABLE PAL_CONTROL_TAB;

CREATE COLUMN TABLE PAL_CONTROL_TAB(

“NAME” VARCHAR (50),

“INTARGS” INTEGER,

“DOUBLEARGS” DOUBLE,

“STRINGARGS” VARCHAR (100)

);

/* FILL THE TABLE */

/* NUMBER OF THREADS TO BE USED */

INSERT INTO PAL_CONTROL_TAB VALUES (‘THREAD_NUMBER’, 2, null, null);

/* MINIMUM SUPPORT THRESHOLD */

INSERT INTO PAL_CONTROL_TAB VALUES (‘MIN_SUPPORT’, null, 0.02, null);

/* MINIMUM CONFIDENCE THRESHOLD */

INSERT INTO PAL_CONTROL_TAB VALUES (‘MIN_CONFIDENCE’, null, 0.02, null);

/* CREATE THE OUTPUT TABLE */

DROP TABLE PAL_RESULT_TAB;

CREATE COLUMN TABLE PAL_RESULT_TAB(

“PRERULE” VARCHAR(500),

“POSTRULE” VARCHAR(500),

“SUPPORT” Double,

“CONFIDENCE” Double,

“LIFT” DOUBLE

);

/* CREATE THE OUTPUT TABLE WITH THE PMML MODEL */

DROP TABLE PAL_PMMLMODEL_TAB;

CREATE COLUMN TABLE PAL_PMMLMODEL_TAB (

“ID” INT,

“PMMLMODEL” VARCHAR(5000)

);

/* CALL THE APRIORI ALGORITHM */

CALL PAL_APRIORI_RULE(PAL_TRANS_TAB, PAL_CONTROL_TAB, PAL_RESULT_TAB, PAL_PMMLMODEL_TAB) WITH overview;

/* SHOW THE RESULTS */

SELECT * FROM PAL_RESULT_TAB;

SELECT * FROM PAL_PMMLMODEL_TAB;

This is how the output table looks like:

/wp-content/uploads/2013/01/cap3_179633.png

Of course, the strongest rules are {#HANA, #SAP} and {#SAP, #HANA} because that is the selection criteria I used to download the data from Twitter, but the interesting thing about this algorithm is that it not only creates rules with two items, the rules can contain multiple items, for example, people that Tweet using #HASHTAGS #SAP and #SAPTechEd in the same Tweet, usually include the #HASHTAG #HANA. I hope you enjoyed it!

Follow me on Twitter: @LukiSpa

Info en Español sobre SAP HANA™:

www.HablemosHANA.com


To report this post you need to login first.

12 Comments

You must be Logged on to comment or reply to a post.

  1. Swati Vakharia

    Hi,

    When I try to create TYPE “PAL_DATA_T” within schema “_SYS_AFL”, it gives me authorization issue.

    Kindly let me know if there are any specific authorizations required to create the TYPE?

    Regards,

    Swati V

    (0) 
    1. Lucas Sparvieri Post author

      Hi Swati, I would suggest asking to your system admin to provide you with SQL Priveleges to the _SYS_AFL Schema. Something like this should help:

      GRANT CREATE ANY ON SCHEMA _SYS_AFL TO <USER_NAME>;

      Let me know how it goes!

      (0) 
      1. Krishna Gosetty

        Hi Lucas,

        SAP note 1846194 – Lack of permissons when using AFL says

        • Never garnt any kind of access to schema _SYS_AFL to any user.

        So, will it still work if these table types are created in use schemas?

        (0) 
  2. R. Konijnenburg

    Hi Lucas,

    Fails at following step:

    Could not execute ‘call SYSTEM.afl_wrapper_generator(‘PAL_APRIORI_RULE’, ‘AFLPAL’, ‘APRIORIRULE’, PDATA)’

    SAP DBTech JDBC: [328]: invalid name of function or procedure: AFL_WRAPPER_GENERATOR: line 1 col 13 (at pos 12)

    Any idea (procedure is not under schema “system” I believe…)

    Found this:

    http://help.sap.com/hana/hana_dev_bfl_en.pdf

    However dont quite follow this (where is this mount?):

    Step

    1

    Create the AFL_WRAPPER_GENERATOR Procedure

    Before using any AFL function, you need to create the

    AFL_WRAPPER_GENERATOR

    procedure. It is

    used to generate a wrapper for the AFL functions that take tables with

    a

    variable number of columns

    as inputs. This

    procedure only needs to be created once.

    1.

    Make sure

    that

    you are the SYSTEM user.

    2.

    Go to

    /hanamnt/<

    SID

    >/HDB <

    instance

    _number

    >/exe/

    plugins/afl/

    and e

    xecute

    the

    afl_wrapper_generator.sql

    script file.

    (0) 
    1. Lucas Sparvieri Post author

      Hi Ronald, seems like I forgot to mention something important in my blog post 🙂 . In order to create the AFL_WRAPPER_GENERATOR procedure under the SYSTEM schema you need to get into your HANA Box and navigate to the folder /hanamnt/<SID>/HDB <instance_number>/exe/plugins/afl/. In there you will find a SQL script file called afl_wrapper_generator.sql, you need to execute this script from HANA studio to generate the missing procedure. After executing this script you should see the AFL_WRAPPER_GENERATOR procedure under the SYSTEM schema. Only the SYSTEM user has privileges to execute this script. One thing that is worth mentioning, is that this procedure will go away every time that you restart your HANA server (?!), so keep a copy of this script in your local computer because you will need to run it quite a few times. I will update the blog to mention this! Thanks for trying it out !!! Let me know how it goes!!

      Best, Lucas.

      (0) 
    1. Lucas Sparvieri Post author

      Hi Arman, didn’t try it but it should. Nevertheless I would suggest using the AFM instead of writing all this code if you are on rev 70

      Cheers

      (0) 
  3. Robert Gehlis

    Hi,

    I just did an analysis just like this. Only problem is, that I’m getting Lift values that are just off the charts, ranging all the way from 40 to 1000. Did anyone else have a similar problem?

    (0) 
  4. Greg Baynham

    Hello Lucas, can you update the blog post to include information about where the procedure is stored (_SYS_AFL schema)?  Also, we ran into the error 423 when running the AFL_WRAPPER_GENERATOR which looked like a big problem till we realized someone had already created the procedure and we couldn’t overwrite it.

    (0) 
    1. Lucas Sparvieri Post author

      Hi Greg, actually, in the latest revisions you can create the procedure in any schema, when you run the AFL_WRAPPER_GENERATOR you can specify the target schema:

      CALL “SYS”.AFLLANG_WRAPPER_PROCEDURE_CREATE( ‘AFLPAL’, ‘APRIORIRULE’, ‘<SCHEMA_NAME>‘, ‘PAL_APRIORI_RULE_PROC’, PAL_APRIORI_PDATA_TBL);

      Hope this helps!

      (0) 
      1. Matthias S

        Hi Lucas,

        Can you may update your post for SPS09?

        I’m stuck in creating the AFLLANG_WRAPPER_PROCEDURE

        I get error SAP DBTech JDBC: [260]: invalid column name:  [260] invalid column name

        CALL “SYS”.AFLLANG_WRAPPER_PROCEDURE_CREATE( ‘AFLPAL’, ‘APRIORIRULE’, ‘_SYS_AFL’, ‘PAL_APRIORI_RULE’, PDATA);

        Thanks in advance

        (0) 

Leave a Reply