Technology Blogs by SAP
Learn how to extend and personalize SAP applications. Follow the SAP technology blog for insights into SAP BTP, ABAP, SAP Analytics Cloud, SAP HANA, and more.
Former Member

We also have a Chinese version of this blog.

1 Application scenario

  

   SAP HANA is an in-memory database that keeps data resident in memory for fast access. At the same time, physical disk storage is used for data persistence and logging, to prevent data loss in case of power failure. This architecture greatly reduces data access time and is what makes SAP HANA "high speed".

  

      In the traditional data model, the database is just a tool for storing and fetching data. For an application like the one in the figure below, the client fetches the data from the database, computes the result, and finally writes it back to the database. If the data set is large, the transmission overhead is high; and if the client does not have enough memory, the calculation and analysis themselves can also be very slow.

      With the help of its large memory, SAP HANA offers a different solution: move the data-intensive calculation into the database layer. In this way we eliminate the overhead of data transmission. A typical framework looks as follows:

        For simple calculations we can use SQLScript, which provides basic variable definitions and flow-control statements. But for complicated calculation and analysis, such as cluster analysis, SQLScript alone may not be convenient. For this purpose, SAP HANA provides the AFL (Application Function Library), which implements a number of algorithms in C++ and packages them into a library that SQLScript can call. This greatly enriches what SQLScript can do.
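As a rough illustration of SQLScript's variable definitions and flow control, consider the sketch below. The procedure and table names (CHECK_DATA_SIZE, MY_DATA_TBL) are invented for this example and are not taken from the PAL documentation:

```sql
-- Hypothetical SQLScript sketch: a scalar variable plus IF/ELSE flow control.
-- CHECK_DATA_SIZE and MY_DATA_TBL are invented names for illustration only.
CREATE PROCEDURE CHECK_DATA_SIZE (OUT size_label NVARCHAR(10))
LANGUAGE SQLSCRIPT AS
BEGIN
    DECLARE row_cnt INTEGER;
    -- read a scalar value into the variable
    SELECT COUNT(*) INTO row_cnt FROM MY_DATA_TBL;
    -- branch on the variable (note the colon prefix when reading it)
    IF :row_cnt > 1000 THEN
        size_label := 'LARGE';
    ELSE
        size_label := 'SMALL';
    END IF;
END;
```

This kind of procedural logic covers simple cases; the point of AFL is to handle the algorithms that would be painful to express this way.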


2 PAL introduction


   PAL (Predictive Analysis Library) is one of the libraries under the AFL framework. It is mainly used for prediction and analysis, and it provides many data mining algorithms. For different application scenarios, the PAL functions fall into the following categories:


     (1) Cluster analysis

      (2) Classification

      (3) Association analysis

      (4) Time series analysis

      (5) Data preprocessing

      (6) Statistical analysis

      (7) Social network analysis


     For each category there are many specific algorithms, for example the k-means algorithm in the cluster analysis category.

       It is worth mentioning that AFL is a separate package; you need to install it first if you want to use it.


3  Basic steps


    To use a PAL function, there are three steps.

     (1) Generate the AFL_WRAPPER_GENERATOR and AFL_WRAPPER_ERASER procedures

           It is simple to generate these two procedures. The AFL package contains two files named afl_wrapper_generator.sql and afl_wrapper_eraser.sql; copy their content into the SQL console and execute it. After that, you need to grant the execute permission:

   GRANT EXECUTE ON system.afl_wrapper_generator to USER1;

   GRANT EXECUTE ON system.afl_wrapper_eraser to USER1;


    This step needs to be executed only once, the first time you use AFL.


   (2) Generate the algorithm instance

          CALL SYSTEM.AFL_WRAPPER_GENERATOR(

              '<procedure_name>',       

              '<area_name>',

              '<function_name>', <signature_table>);

Procedure_name: the name of the procedure (algorithm instance) to generate;

Area_name: usually AFLPAL;

Function_name: the algorithm name;

Signature_table: the signature table describing the input and output table types.


(3) Call the algorithm

CALL <procedure_name>(

<data_input_table> {,…},   

<parameter_table>,

<output_table> {,…}) with overview;

Procedure_name: the algorithm instance name;

Data_input_table: the input data table(s);

Parameter_table: the parameter table;

Output_table: the table(s) for the output.

4  Demo

I will show the usage with a demo of the DBSCAN algorithm. (Because the AFL_WRAPPER_GENERATOR procedure already exists in my environment, the first step does not need to be executed; and here we assume that the schema is named TEST.)

DBSCAN is a clustering algorithm that is good at handling noise; for more details, see DBSCAN - Wikipedia, the free encyclopedia.



/* create the data table type with two attributes */
CREATE TYPE PAL_DBSCAN_DATA_T AS TABLE ( ID integer, ATTRIB1 double, ATTRIB2 double);
/*table type for the control arguments*/
CREATE TYPE PAL_CONTROL_T AS TABLE( NAME varchar(50), INTARGS integer, DOUBLEARGS double, STRINGARGS varchar(100));
/*create result table type*/
CREATE TYPE PAL_DBSCAN_RESULTS_T AS TABLE( ID integer, RESULT integer);
/*create parameter table*/
CREATE COLUMN TABLE PAL_DBSCAN_PDATA_TBL( "ID" INT, "TYPENAME" VARCHAR(100), "DIRECTION" VARCHAR(100) );
/*insert some arguments to the table*/
INSERT INTO PAL_DBSCAN_PDATA_TBL VALUES (1, 'TEST.PAL_DBSCAN_DATA_T', 'in'); 
INSERT INTO PAL_DBSCAN_PDATA_TBL VALUES (2, 'TEST.PAL_CONTROL_T', 'in'); 
INSERT INTO PAL_DBSCAN_PDATA_TBL VALUES (3, 'TEST.PAL_DBSCAN_RESULTS_T', 'out'); 
/*assign permission (the signature table lives in the TEST schema assumed above)*/
GRANT SELECT ON TEST.PAL_DBSCAN_PDATA_TBL to SYSTEM;
/*generate the DBSCAN algorithm instance (the eraser call drops any previous instance; on the very first run it reports an error, which can be ignored)*/
call SYSTEM.afl_wrapper_eraser('PAL_DBSCAN9');
call SYSTEM.afl_wrapper_generator('PAL_DBSCAN9', 'AFLPAL', 'DBSCAN', PAL_DBSCAN_PDATA_TBL);
/* create data table*/
CREATE COLUMN TABLE PAL_DBSCAN_DATA_TBL ( ID integer, ATTRIB1 double, ATTRIB2 double);
/*insert test data*/
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(1,0.10,0.10);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(2,0.11,0.10);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(3,0.10,0.11);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(4,0.11,0.11);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(5,0.12,0.11);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(6,0.11,0.12);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(7,0.12,0.12);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(8,0.12,0.13);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(9,0.13,0.12);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(10,0.13,0.13);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(11,0.13,0.14);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(12,0.14,0.13);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(13,10.10,10.10);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(14,10.11,10.10);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(15,10.10,10.11);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(16,10.11,10.11);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(17,10.11,10.12);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(18,10.12,10.11);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(19,10.12,10.12);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(20,10.12,10.13);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(21,10.13,10.12);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(22,10.13,10.13);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(23,10.13,10.14);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(24,10.14,10.13);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(25,4.10,4.10);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(26,7.11,7.10);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(27,-3.10,-3.11);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(28,16.11,16.11);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(29,20.11,20.12);
INSERT INTO PAL_DBSCAN_DATA_TBL VALUES(30,15.12,15.11);
/*create temp table*/
CREATE LOCAL TEMPORARY COLUMN TABLE #PAL_CONTROL_TBL( NAME varchar(50), INTARGS integer, DOUBLEARGS double, STRINGARGS varchar(100));
/*set the input arguments*/
/*use 18 threads*/
INSERT INTO #PAL_CONTROL_TBL VALUES('THREAD_NUMBER',18,null,null);
/*determine the algorithm parameters automatically*/
INSERT INTO #PAL_CONTROL_TBL VALUES('AUTO_PARAM',null,null,'true');
/*use the Manhattan distance*/
INSERT INTO #PAL_CONTROL_TBL VALUES('DISTANCE_METHOD',1,null,null);
/*result table*/
CREATE COLUMN TABLE PAL_DBSCAN_RESULTS_TBL( ID integer, RESULT integer);
/*call the algorithm*/
CALL _SYS_AFL.PAL_DBSCAN9(PAL_DBSCAN_DATA_TBL, "#PAL_CONTROL_TBL", PAL_DBSCAN_RESULTS_TBL) with overview;
/*see the result*/
SELECT * FROM PAL_DBSCAN_RESULTS_TBL; 
 

If executed correctly, you will see the result above: the records are clustered into three groups, where 0 and 1 are cluster IDs and -1 marks points classified as noise.
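To see how many points landed in each group, a simple aggregation over the result table can help. This query is an extra addition, not part of the original demo; with the demo data above you would expect the two dense groups plus a noise group:

```sql
/* count the points per cluster; in DBSCAN's output, -1 marks the noise points */
SELECT RESULT, COUNT(*) AS POINT_COUNT
FROM PAL_DBSCAN_RESULTS_TBL
GROUP BY RESULT
ORDER BY RESULT;
```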



5   Conclusion

    This article introduced PAL in SAP HANA with the example of DBSCAN. Many other algorithms follow similar steps: the main work is to prepare the data, define the arguments according to the documentation, and finally call the algorithm.

      From the efficiency perspective, PAL takes full advantage of SAP HANA's large memory. If used properly, it is really fast for the analysis of big data!

       [Note: the SAP HANA version used for the test cases here is SAP HANA SPS06]
