Technical Articles
Dirk Kemper

Large-scale forecasting using HANA, APL and Node.js

Background

Last year I became involved in a project for a retailer based in the Netherlands that had finished construction of a new distribution center. This innovative new DC is fully mechanized and is operated with a minimal amount of personnel, which is different from conventional distribution centers, where sudden large order spikes are fulfilled by having more staff pick these orders in parallel. The automated nature of the machinery in this new DC means that the order picking process has a fixed capacity and cannot be scaled up easily by adding more personnel on the fly. In order to optimize its capacity, the DC therefore also performs tasks during quiet hours.

These tasks are derived from a forecast the client must supply to the DC controlling software. In practice, a 7-day forecast needs to be provided to the DC on a daily basis.

Forecasting setup and project approach

For generating these forecasts an SAP HANA system was loaded with a data mart of previous daily goods movements from the DC to all stores on an individual article level. The data mart contains about 5 years of history for each of the 70 stores, with an assortment of about 10,000 articles. As you can tell this is quite a large dataset, but the long history is very helpful in detecting seasonality patterns that recur yearly.

During the project we are loosely applying the CRISP-DM project approach in an iterative fashion. This approach makes sure that phases such as data understanding, modeling and evaluation are conducted iteratively, improving upon the setup each time round.

The technology platform was based on the Automated Predictive Library (APL) on a HANA 2.0 SPS05 system, with an SAP Data Services system for triggering the database stored procedures and performing ETL on the resulting forecast.

Models are trained on a store/article level, which means that 700,000 models are trained during a single run. As the underlying hardware has a large number of available cores, all stores can be trained in parallel. Every model has about 150 input variables, ranging from data about the articles themselves, the stores, bank holidays and the weather to the past goods movements of that article over the preceding days. All of these variables are fed to APL, which handles the selection of the most relevant variables, removes variables that are strongly correlated with each other, and so on. The modeling process takes about 12 hours to complete, running in parallel for all stores.

Technical challenges

The main technical challenge of this setup was optimizing the performance of both the train and apply procedures. As indicated, the system manages to train the 700,000 models in about 12 hours, which was acceptable because this process is scheduled to run weekly. It is assumed that the models retain their value during the week, as they have been trained on multiple years of data and the external circumstances will not change that quickly. However, the in-database APL apply procedure ran for about 9 hours, which would not fit in the appointed time slot between other tasks the system was performing daily.

The solution came in the form of a JavaScript-based apply engine which, as of a recent release, is bundled with the APL[1]. This engine performs the same calculations as the in-database apply procedure but is much faster at it, able to do a single prediction in several milliseconds. Running the code through Node.js allows the JavaScript engine to be run standalone on the command line. Moreover, Node.js makes it possible to parallelize the workload of applying the millions of models efficiently. This brought the apply process down to about 20 minutes.

The high-level steps to make this setup work are as follows:

  1. Train a model using the in-database APL train functions as normal
  2. Export the model to JSON format using the EXPORT_APPLY_CODE built-in function
  3. Extract the JSON model and a JSON-formatted extract of the future days the model needs to be applied on from the database into a filesystem
  4. Apply the model onto the future days to get the forecast

As you can see, these are a few extra steps compared to simply calling an in-database procedure to perform a forecast within HANA. But even when taking the additional exports of the model and forecast data into account, the total runtime is more than an order of magnitude faster than the in-database apply. In fact this process now runs in about 45 minutes, including the time taken to move data to the file system.

Detailed technical explanation

I will now give a detailed explanation of the high-level steps so you can try a similar setup for your own forecasting scenario. You will need access to a HANA environment with APL version 2018.2 or later, using Eclipse (or HANA studio).

Training the APL model

For the brevity of this blog post I am assuming some prior knowledge of training a model using APL. There are good examples available in the APL documentation on https://help.sap.com under the CREATE_MODEL_AND_TRAIN function reference. A set of good examples is also available by downloading the APL package from the SAP support site and looking into the ‘samples’ folder.

In this setup I will be training a regression model with a target called TARGET_GOODS_MOVEMENTS based on a set of independent variables. For the sake of this example I will just be using two: DISCOUNT_INDICATOR, a boolean (0 or 1) variable indicating whether the article is discounted, and DISCOUNT_VALUE, a continuous value indicating the applied discount. A real-world example may use hundreds of independent variables.

This makes the following table which we call MOVEMENT_HISTORY_TABLE:

Variable                | Storage | Value type | Description
DATE                    | Date    | Continuous | Date of the goods movement
DISCOUNT_INDICATOR      | Number  | Ordinal    | Indicates if article is discounted
DISCOUNT_VALUE          | Number  | Continuous | Applied discount to the article
TARGET_GOODS_MOVEMENTS  | Number  | Continuous | Number of goods movements

Note that this is the structure of the historical dataset. Obviously the future dataset will not contain values for the TARGET_GOODS_MOVEMENTS column; however, you do need future information on whether the article is discounted. This needs to be available over the same time horizon on which you want to perform the forecast.
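To make the contrast concrete, a single future-day record can be pictured as a list of variable/value pairs, which is the shape the JavaScript scoring runtime consumes later on. This is a minimal sketch; the concrete values below are made up for illustration:

```javascript
// One hypothetical future-day record in variable/value form. The column
// names come from the table above; the concrete values are invented.
// TARGET_GOODS_MOVEMENTS is deliberately absent: that is what the model
// will predict for this day.
var futureDay = [
  { variable: 'DATE', value: '2021-06-01' },
  { variable: 'DISCOUNT_INDICATOR', value: 1 },
  { variable: 'DISCOUNT_VALUE', value: 0.25 }
];

console.log(JSON.stringify(futureDay));
```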

When calling CREATE_MODEL_AND_TRAIN make sure to output the trained model into a table called OUT_MODEL_REGRESSION.

Export the model to JSON format

After training, the model is stored in binary format in table OUT_MODEL_REGRESSION. It is necessary to convert it into a JSON-based format which can be consumed by the JavaScript apply engine. The APL code to perform this conversion is below:

-- --------------------------------------------------------------------------
-- Create the input/output tables used as arguments for the APL function
-- --------------------------------------------------------------------------
drop table FUNC_HEADER;
create table FUNC_HEADER like "SAP_PA_APL"."sap.pa.apl.base::BASE.T.FUNCTION_HEADER";

drop table OUT_EXPORT_JSON_REGRESSION;
create table OUT_EXPORT_JSON_REGRESSION (
    "OID" VARCHAR(50),
    "KEY" VARCHAR(100),
    "VALUE" NCLOB
);

-- --------------------------------------------------------------------------
-- Execute the APL function using its AFL wrapper and the actual input/output tables
-- --------------------------------------------------------------------------

-- Export JSON

drop table EXPORT_CODE_CONFIG;
create table EXPORT_CODE_CONFIG like "SAP_PA_APL"."sap.pa.apl.base::BASE.T.OPERATION_CONFIG_EXTENDED";
insert into EXPORT_CODE_CONFIG values ('APL/CodeType', 'JSON',null);
insert into EXPORT_CODE_CONFIG values ('APL/CodeTarget', 'class',null);

DO BEGIN    
    header           = select * from FUNC_HEADER;            
    model_in         = select * from OUT_MODEL_REGRESSION;
    export_config    = select * from EXPORT_CODE_CONFIG;

    "SAP_PA_APL"."sap.pa.apl.base::EXPORT_APPLY_CODE"(:header, :model_in, :export_config, :out_code);

    -- store result into table
    insert into  "USER_APL"."OUT_EXPORT_JSON_REGRESSION"            select * from :out_code;
END;

At this point table OUT_EXPORT_JSON_REGRESSION will contain your model in JSON format. This is a human-readable format which may expose some details about the model you had not noticed earlier. To view it in HANA studio, make sure to convert the model payload to a VARCHAR:

SELECT TO_VARCHAR(VALUE) FROM OUT_EXPORT_JSON_REGRESSION

Now right-click on the output cell, select “Copy cells” and paste it into your favorite text editor. This will display the JSON in a formatted way. Note for instance that the DATE variable is automatically transformed into derived variables like “MonthOfYear”, “DayOfWeek” and many more to make a single date more descriptive. This is one of the activities that APL automates for you.

Extract the JSON format of the model and future days

At this point we are ready to extract the table records from the database into files which can be processed by the JavaScript engine. To perform a forecast we obviously need the model that was exported to JSON format in the previous step, but also one or more records representing future days on which the forecast needs to be applied.

To be able to run the next few programs you’ll need a Node.js runtime installed. Node.js is a scripting runtime similar to environments like Python or Perl, but using JavaScript as its language. It is in fact the same JavaScript engine as is used in Google’s Chrome browser, packaged as a standalone executable so it can be used for general-purpose programming.

Please refer to https://nodejs.org for instructions on installing Node.js; native binaries are available for Windows and macOS. On Linux, use your distribution’s package manager.

Node.js comes with a package manager called ‘npm’, which is useful for installing and managing libraries or add-ons you require in your program. For this example we need the SAP HANA client package to connect Node.js to HANA natively and easily. Run ‘npm install @sap/hana-client’ from the working directory where the scripts are saved.

Save the below code to a file (e.g. extract-models.js) and run it using ‘node extract-models.js’. The program will set up a connection to the HANA system and write the model from the table to a single file. Note: do not forget to modify the connection parameters to match your HANA system.

var hana = require('@sap/hana-client');
var hanaStream = require('@sap/hana-client/extension/Stream');
var fs = require('fs');

var connOptions = { serverNode: 'hanasystem:32015', UID: 'USER_APL', PWD: '012345', 
   sslValidateCertificate: 'false' };

var connection = hana.createConnection();

connection.connect(connOptions, function(err) {
    if (err) {
        return console.error(err);
    }

    var stmt = connection.prepare('select * from USER_APL.OUT_EXPORT_JSON_REGRESSION');
    var rs = stmt.execQuery();
    var stream = hanaStream.createObjectStream(rs);
    var writeStream = fs.createWriteStream('model-output.json');

    stream.on('readable', function () {
        var data;
        while (null !== (data = stream.read())) {
            // write JSON model record to file
            writeStream.write(data.VALUE);    // model is in the VALUE column
        }
    });

    stream.on('end', function () {
        // done
        connection.close();
    });
});

To extract the future days you will be forecasting, create a new file called ‘extract-forecast-days.js’ and use the below code. The differences from the previous extraction program are the SQL statement, the output file name and the structure written per record.

var hana = require('@sap/hana-client');
var hanaStream = require('@sap/hana-client/extension/Stream');
var fs = require('fs');

var connOptions = { serverNode: 'hanasystem:32015', UID: 'USER_APL', PWD: '012345',
   sslValidateCertificate: 'false' };
var connection = hana.createConnection();

connection.connect(connOptions, function(err) {
    if (err) {
        return console.error(err);
    }

    var stmt = connection.prepare('select DATE, DISCOUNT_INDICATOR, DISCOUNT_VALUE ' +
        'from USER_APL.MOVEMENT_HISTORY_TABLE');

    var rs = stmt.execQuery();
    var stream = hanaStream.createObjectStream(rs);
    var writeStream = fs.createWriteStream('forecast-days-output.json');

    stream.on('readable', function () {
        var data;
        while (null !== (data = stream.read())) {
            // write JSON forecast-day record to file
            var struct = [
              { variable: 'DATE', value: data.DATE },
              { variable: 'DISCOUNT_INDICATOR', value: data.DISCOUNT_INDICATOR },
              { variable: 'DISCOUNT_VALUE', value: data.DISCOUNT_VALUE }
            ];

            writeStream.write(JSON.stringify(struct));
        }
    });

    stream.on('end', function () {
        // done
        connection.close();
    });
});

After running the modified code you should have an additional file called ‘forecast-days-output.json’ together with the ‘model-output.json’ file created in the previous program.

Apply the model onto future days to get the forecast

Both files created in the previous steps now need to be combined to generate the forecast. The model encodes information about the expected target based on the independent variables DATE (and its derived variables), DISCOUNT_INDICATOR and DISCOUNT_VALUE. These independent variables are part of the future days in the forecasting horizon and are expected to be known upfront.

Before running the code download the APL package from https://support.sap.com by navigating to ‘Software Downloads’ and searching for ‘APL’. After extracting the downloaded package you will find the scoring runtime in the ‘samples/runtimes/javascript’ directory. Copy the files from this directory into a folder called ‘lib’ which resides in your own working directory where you have stored the extraction scripts.

Now save the below code into a file called ‘predict.js’:

const runtime = require("./lib/autoRuntime");
const fs = require('fs');

let modelDefinition = JSON.parse(fs.readFileSync('model-output.json', 'utf8'));
let predictRow = JSON.parse(fs.readFileSync('forecast-days-output.json', 'utf8'));
let forecastEngine = runtime.createEngine(modelDefinition);
let prediction = forecastEngine.getScore(predictRow);
let score = prediction["score"];

console.log('number of goods movements predicted: ' + score);

After running the code you will see a statement logging the prediction for the future day you want to forecast. These steps need to be repeated for each day in the forecasting horizon, so you may need to create a ‘forecast-days-output.json’ file for each of those days. The prediction program should then loop over those files to get a full forecast, possibly logging the results to a local file or writing them back into the HANA database.

Wrapup

I have shown you an approach for taking a model trained in HANA using APL and applying it in a standalone Node.js environment, by exporting the model to a dedicated format and applying it to the forecasting horizon. The end result in our solution was a huge runtime improvement over the in-database apply functions that APL provides in HANA itself: the apply went down from hours to minutes. This makes the solution very useful for large-scale processing of forecasts.

Of course there are many improvements that could be made to this setup which require more in-depth knowledge of the Node.js engine and its programming model, like managing a larger number of models or enabling multithreading to allow for parallel model scoring. If interested, search the internet for resources on this; the current approach gives you a baseline setup.

In the next part of this blog post I will present an approach for explainable forecasting using the APL which allows a business user to look “under the hood” in case of forecasting anomalies.

This blog post has also been published on https://analytix.nl/enterprise-ai/large-scale-forecasting-using-hana-apl-and-node-js/

 

[1] https://blogs.sap.com/2020/12/07/hands-on-tutorial-score-your-apl-model-in-stand-alone-javascript/

Comments
      Adam Nekola

      Thank you Dirk Kemper for this very informative and intriguing article!

      I myself am trying to dig deeper into forecasting using HANA libraries as well as standard tools like R or Python and this article is very helpful!

      Could I ask some questions?

      1. Approximately how long did each step of CRISP-DM take, and approximately how long did the project take from start to end?
      2. How were you selecting which input variables to use? Was it mostly the consultants or the client that suggested which ones to use, or was it a cooperation?
      3. If I understand correctly, every week when you create new forecasts you always use all of the available data for the model, is that correct? Do you think that it is necessary to use all of the data? Wouldn't, for example, just the last 1 or 2 years be more accurate instead of using 5-year-old data, which could be irrelevant? Maybe it would even increase the speed/performance of the estimation. And do you plan to use all of the data even after another 5-10 years, when the amount of data will have increased drastically?

      Thank you very much for your answers!
      Adam Nekola

      Dirk Kemper (Blog Post Author)

      Hi Adam,

      1. The project has been running for about a year now. We went through about 4 iterations of the CRISP-DM process, which in fact is key to improving your forecast performance. We have seen many issues in the dataset which required solving, as correct data is absolutely crucial to getting your predictions right. Even the slightest bugs or offsets in the dataset will severely impact the prediction performance.
      2. Together with the client we have tried to use as many input variables as the data marts had available. APL will calculate for each model which variables to actually use. The following types of variables were excluded:
        • Numbers which were the result of other forecasting packages (in order not to offset the APL prediction by incorrect predictions from other methods)
        • Information which is available historically but not forward-looking, like the number of customer visits in a store. This sounds easy enough but is not always easy to determine.
        • Information which is not generic enough, like promotion codes that resemble a date
      3. This is a very good question and is actually one of the topics under investigation now. We are noticing the history is quite broad at this point and for many articles there have been changes in buying behaviour due to external factors like COVID or new legal limits on alcoholic beverage discounts. Optimizing the historical data limit will be a topic going forward.
      Andreas Forster

      Many Thanks for taking the time to share this cool use case and implementation with the community! Dirk Kemper

      Marc DANIAU

      Thanks for describing your use case Dirk Kemper

      Do you mind using the tag APL to appear in this list? https://blogs.sap.com/tag/APL

       

      Dirk Kemper (Blog Post Author)

      Thanks for noticing, I just added the tag