Enhancements to the Predictive Analysis Library in...

pmugglestone · ‎04-27-2018

HANA 2.0 SPS 03 has been released and there are lots of juicy goodies for the Predictive Analysis Library (PAL).

The PAL is HANA's native in-database machine machine learning capability and a lot of effort has gone into enhancing productivity and ease of use for data scientists and developers that wish to embed machine learning into their applications. That's in addition to improved performance when training and scoring models and, of course, no new HANA release would be complete without a sprinkling of new PAL algorithms.

In this blog I’ll introduce selected updates and show you where to find hands-on tutorial videos:

Model evaluation and hyper parameter selection

Logical parallel processing

New algorithms for Recommendation, Social Network Analysis, and Time Series

Enhanced real-time state-enabled scoring

Model evaluation and hyper parameter selection

A major challenge when training models is how to build the most robust model possible whilst avoiding over-fitting. With dozens of tuning parameters - identifying optimal parameter settings can also be a long and frustrating process.

SPS03 introduces the model evaluation and hyper parameter selection capability to help speed up and optimize this process and is available for 10 algorithms.

Cross validation and bootstrapping are supported as re-sampling methods along with a number of evaluation metrics including error rate and area under curve.

Candidate algorithm-specific parameter values can be set via an explicit list or range and the optimal search strategy can be performed either by trying all possible values (grid) or by randomly selecting values.

Here's a code snippet showing some of the key parameters involved:

-- model evaluation & parameter search parameters

INSERT INTO "#Params" VALUES ('PROGRESS_INDICATOR_ID', null, null, 'Train Logistic Regression');

INSERT INTO "#Params" VALUES ('RESAMPLING_METHOD', null, null, 'stratified_cv'); -- cv, stratified_cv, bootstrap, stratified_bootstrap

INSERT INTO "#Params" VALUES ('FOLD_NUM', 5, null, null);

INSERT INTO "#Params" VALUES ('EVALUATION_METRIC', null, null, 'AUC'); -- RMSE, MAE, ERROR_RATE, NLL, AUC

INSERT INTO "#Params" VALUES ('REPEAT_TIMES', 2, null, null);

INSERT INTO "#Params" VALUES ('PARAM_SEARCH_STRATEGY', null, null, 'grid'); -- grid, random

--INSERT INTO "#Params" VALUES ('RANDOM_SEARCH_TIMES', 2, null, null); -- use when search strategy = random



-- algorithm specific parameter values

INSERT INTO "#Params" VALUES ('ENET_LAMBDA_VALUES', null, null, '{0.01,0.02,0.005,0.001}'); -- discrete values

INSERT INTO "#Params" VALUES ('ENET_ALPHA_RANGE', null, null, '[0.005,0.001,0.01]'); -- range: start, increment, end

Given that a complex and comprehensive model evaluation and parameter selection task can take some time to complete, a nice touch is that it's possible to specify an execution id and then query a system view to check on progress whilst the task is running:

-- check progress

SELECT * FROM "_SYS_AFL"."FUNCTION_PROGRESS_IN_AFLPAL" WHERE "EXECUTION_ID" = 'Train Logistic Regression';

Watch video tutorials of model evaluation and hyper parameter search in action:

PAL 153 Model Evaluation Parameter Selection - Logistic Regression

PAL 154 Model Evaluation Parameter Selection - Decision Tree

Logical parallel processing

SPS03 also introduces the ability to perform logical group-by processing and do that in parallel.

So perhaps you have some training data and you'd like to build several models - each with it's own distinct set of parameters. That can now be done in a simple call rather than one per set of parameters.

Likewise you might have data that's grouped by company, instead of making a separate call for each company you can now make a single call and the prediction will be made for each company present in the data.

The key to doing this is to include the group-by column in the input data or parameter table (or both) and also to refer to the these column(s) in the procedure call using WITH HINT - PARALLEL BY PARAMETER VALUES as shown in the following example:

-- multiple groups, group-specific parameters



CREATE TABLE "StockPrices" ("stockId" VARCHAR(3), "timeId" INTEGER, "price" DOUBLE);

CREATE LOCAL TEMPORARY COLUMN TABLE "#Params" ("stockId" VARCHAR(3), "name" VARCHAR(60), "intArgs" INTEGER, "doubleArgs" DOUBLE, "stringArgs" VARCHAR(100));

CREATE COLUMN TABLE "Forecast" ("stockId" VARCHAR(3), "timeId" INTEGER, "predictedPrice" DOUBLE, "pi1Lower" DOUBLE, "pi1Upper" DOUBLE, "pi2Lower" DOUBLE, "pi2Upper" DOUBLE);

CREATE COLUMN TABLE "Statistics" ("stockId" VARCHAR(3), "name" NVARCHAR(100), "value" NVARCHAR(100));



INSERT INTO "StockPrices" VALUES ('ABC', 1, 362);

INSERT INTO "StockPrices" VALUES ('ABC', 2, 385);

INSERT INTO "StockPrices" VALUES ('ABC', 3, 432);

INSERT INTO "StockPrices" VALUES ('ABC', 4, 341);

INSERT INTO "StockPrices" VALUES ('ABC', 5, 382);

INSERT INTO "StockPrices" VALUES ('ABC', 6, 409);

INSERT INTO "StockPrices" VALUES ('ABC', 7, 498);

INSERT INTO "StockPrices" VALUES ('ABC', 8, 387);

INSERT INTO "StockPrices" VALUES ('ABC', 9, 473);

INSERT INTO "StockPrices" VALUES ('ABC', 10, 513);

INSERT INTO "StockPrices" VALUES ('ABC', 11, 582);

INSERT INTO "StockPrices" VALUES ('ABC', 12, 474);

INSERT INTO "StockPrices" VALUES ('XYZ', 1, 544);

INSERT INTO "StockPrices" VALUES ('XYZ', 2, 582);

INSERT INTO "StockPrices" VALUES ('XYZ', 3, 681);

INSERT INTO "StockPrices" VALUES ('XYZ', 4, 557);

INSERT INTO "StockPrices" VALUES ('XYZ', 5, 626);

INSERT INTO "StockPrices" VALUES ('XYZ', 6, 654);

INSERT INTO "StockPrices" VALUES ('XYZ', 7, 691);

INSERT INTO "StockPrices" VALUES ('XYZ', 8, 712);

INSERT INTO "StockPrices" VALUES ('XYZ', 9, 674);

INSERT INTO "StockPrices" VALUES ('XYZ', 10, 732);

INSERT INTO "StockPrices" VALUES ('XYZ', 11, 745);

INSERT INTO "StockPrices" VALUES ('XYZ', 12, 775);



INSERT INTO "#Params" VALUES ('ABC', 'MODELSELECTION', 1, null, null); 

INSERT INTO "#Params" VALUES ('ABC', 'FORECAST_NUM', 3, null, null); 

INSERT INTO "#Params" VALUES ('ABC', 'INITIAL_METHOD', 0, null, null); 

INSERT INTO "#Params" VALUES ('XYZ', 'MODELSELECTION', 1, null, null); 

INSERT INTO "#Params" VALUES ('XYZ', 'FORECAST_NUM', 3, null, null); 

INSERT INTO "#Params" VALUES ('XYZ', 'INITIAL_METHOD', 1, null, null); 

INSERT INTO "#Params" VALUES ('XYZ', 'PREDICTION_CONFIDENCE_1', null, 0.75, null); 

INSERT INTO "#Params" VALUES ('XYZ', 'PREDICTION_CONFIDENCE_2', null, 0.90, null); 



CALL "_SYS_AFL"."PAL_AUTO_EXPSMOOTH" ("StockPrices", "#Params", "Forecast", "Statistics") WITH OVERVIEW WITH HINT (PARALLEL_BY_PARAMETER_VALUES(p1."stockId", p2."stockId"));

Of course output tables also need to include the group-by column(s) so that results can be correctly identified.

Watch a video tutorial of logical parallel group-by processing in action:

PAL 155 Parallel By Parameter Values

New algorithms for Recommendation, Social Network Analysis, and Time Series

A couple of new recommendation algorithms have been added in SPS03.

Field-Aware Factorization Machine - useful for click through rates in advertising

Alternating Least Squares - similar to the existing Factorized Polynomial Regression Model but without global or side features

For social network analysis, the Page Rank algorithm has been added. This can be used to determine a web pages rank or importance based on links to that page:

-- page rank



CREATE COLUMN TABLE "Data" ("pageFrom" NVARCHAR(100), "pageTo" NVARCHAR(100));

CREATE LOCAL TEMPORARY COLUMN TABLE "#Params" ("name" VARCHAR(60), "intArgs" INTEGER, "doubleArgs" DOUBLE, "stringArgs" VARCHAR(100));

CREATE COLUMN TABLE "Results" ("page" NVARCHAR(100), "rank" DOUBLE);



INSERT INTO "Data" VALUES ('Home', 'About');

INSERT INTO "Data" VALUES ('Home', 'Product');

INSERT INTO "Data" VALUES ('Home', 'Links');

INSERT INTO "Data" VALUES ('About', 'Home');

INSERT INTO "Data" VALUES ('Product', 'Home');

INSERT INTO "Data" VALUES ('Links', 'Home');

INSERT INTO "Data" VALUES ('Links', 'External Site A');

INSERT INTO "Data" VALUES ('Links', 'External Site B');

INSERT INTO "Data" VALUES ('Links', 'External Site C');

INSERT INTO "Data" VALUES ('Links', 'External Site D');

INSERT INTO "Data" VALUES ('External Site A', 'Home');

INSERT INTO "Data" VALUES ('External Site A', 'Product');

INSERT INTO "Data" VALUES ('External Site B', 'Home');

INSERT INTO "Data" VALUES ('External Site C', 'Home');

INSERT INTO "Data" VALUES ('External Site D', 'Home');

INSERT INTO "Data" VALUES ('External Site D', 'Product');



INSERT INTO "#Params" VALUES ('DAMPING', null, 0.85, null); -- default: 0.85



CALL "_SYS_AFL"."PAL_PAGERANK" ("Data", "#Params", "Results") WITH OVERVIEW;

Finally, the Hierarchical Forecast introduces top-down, bottom-up and optimal combination forecasting to a dataset that incorporates a hierarchy.

Watch video tutorials of the new algorithms in action:

PAL 156 Recommender - Field-Aware Factorization Machine

PAL 157 Recommender - Alternating Least Squares

PAL 159 Social Network Analysis - Page Rank

PAL 160 Time Series - Hierarchical Forecast

Enhanced real-time scoring

Real-time scoring via state-enabled models has been available for a while now. It enables parsed models to be stored in-memory which dramatically improves performance when making repeated predict or scoring calls - especially with large, complex models as the model only needs to be parsed once.

SPS03 brings full support meaning that additional algorithms are supported (14 in total) and access is now far simpler through the introduction of two new type-any procedures:

PAL_CREATE_MODEL_STATE

PAL_DELETE_MODEL_STATE

Scoring is done via the regular PAL_xyz_PREDICT procedure for the algorithm in question by passing the state enabled id via the parameter table.

So no need for those pesky wrapper procedures anymore. Yay!

Here's an example for the newly introduced Alternating Least Squares (ALS) algorithm:

-- create state



CREATE LOCAL TEMPORARY COLUMN TABLE "#Empty" ("id" INTEGER);

CREATE LOCAL TEMPORARY COLUMN TABLE "#Params" ("name" VARCHAR(60), "intArgs" INTEGER, "doubleArgs" DOUBLE, "stringArgs" VARCHAR(100));

CREATE COLUMN TABLE "State" ("name" VARCHAR(50), "value" VARCHAR(100));



INSERT INTO "#Params" VALUES ('ALGORITHM', 24, null, null); -- 1: SVM, 2: Random DT, 3: Decision Tree, 4: Cluster Assignment, 5: LDA Inference, 6: Binning, 7: Naive Bayes, 8: PCA, 9: BPNN, 11: FRM, 16: Multiple Linear Regression, 20: Logistic Regression, 24: ALS

INSERT INTO "#Params" VALUES ('STATE_DESCRIPTION', null, null, 'My Trained ALS Model');



CALL "_SYS_AFL"."PAL_CREATE_MODEL_STATE" ("ModelMetadata", "ModelMap", "ModelFactors", "#Empty", "#Empty", "#Params", "State") WITH OVERVIEW;



SELECT * FROM "State";



SELECT * FROM "SYS"."M_AFL_STATES";





-- predict with state



TRUNCATE TABLE "#Params";



CREATE COLUMN TABLE "RatingsNew" ("id" INTEGER, "user" NVARCHAR(255), "album" NVARCHAR(255));

CREATE COLUMN TABLE "EmptyModelMetadata" LIKE "ModelMetadata";

CREATE COLUMN TABLE "EmptyModelMap" LIKE "ModelMap";

CREATE COLUMN TABLE "EmptyModelFactors" LIKE "ModelFactors";



INSERT INTO "#Params" ("name", "stringArgs") SELECT "name", "value" FROM "State";



TRUNCATE TABLE "RatingsNew";

INSERT INTO "RatingsNew" VALUES (1,'Julie','Led Zeppelin IV');

--INSERT INTO "RatingsNew" VALUES (1,'Jamie','Born in the U.S.A.');



CALL "_SYS_AFL"."PAL_ALS_PREDICT" ("RatingsNew", "EmptyModelMetadata", "EmptyModelMap", "EmptyModelFactors", "#Params", ?);





-- delete state



TRUNCATE TABLE "#Params";



CALL "_SYS_AFL"."PAL_DELETE_MODEL_STATE" ("State", "#Params", ?);



TRUNCATE TABLE "State";



SELECT * FROM "SYS"."M_AFL_STATES";

Watch a video tutorial of state-enabled scoring in action:

PAL 158 Real Time Scoring - Alternating Least Squares

Last but not least, 14 established PAL algorithms have been enhanced to support additional methods and parameters as well as improve performance.

So all-in-all lot's of cool new stuff in this release!

For the bigger picture, the PAL playlist covers all aspects of SAP HANA Predictive Analysis Library.

If you’re interested to learn about what’s new with HANA 2 SPS 03 in general check out the following playlist.

Happy in-database machine learning with the PAL!

–

The SAP HANA Academy provides free online video tutorials for the developers, consultants, partners and customers of SAP HANA.

Topics range from practical how-to instructions on administration, data loading and modeling, and integration with other SAP solutions, to more conceptual projects to help build out new solutions using mobile applications or predictive analysis.

For the full library, see SAP HANA Academy Library – by the SAP HANA Academy

For the full list of blogs, see Blog Posts – by the SAP HANA Academy

Enhancements to the Predictive Analysis Library in SPS03

Get Your SAP HANA Idea Incubator Badge Today!

SCN Mission - SAP HANA Quiz Challenge is now retired

Share your #HANAStory and Win