Skip to Content
Technical Articles
Author's profile photo Wei Han

SAP BTP Data & Analytics Showcase – Machine Learning via Python in SAP Data Warehouse Cloud

Introduction

This is the second part of the blog series “SAP BTP Data & Analytics Showcase – Empower Data Scientist with Flexibility in an End-to-End Way”. I recommend you to look into our overall blog to gain a better understanding of the end-to-end scenario and the 1st blog post for data preparation part.

In this blog post, we’d like to demonstrate how to use Python Script to access training data set from SAP Data Warehouse Cloud, create machine learning models, run prediction through ML models and write results back to SAP Data Warehouse Cloud (for modelling and visualisation purpose).

Following the upcoming sections in this blog post, you’re able to learn and complete the implementation in three steps:

  • Connect to SAP Data Warehouse Cloud from Python script
  • Machine learning via Python in Jupyter Notebook – Train ML models with prepared dataset, run ML models for prediction and save prediction results back to SAP Data Warehouse Cloud
  • Consume prediction results in a data model and visualise findings in SAP Analytics Cloud

1. Connect to SAP Data Warehouse Cloud from Python

We need some configuration to enable connection to SAP Data Warehouse Cloud via Python, so that we can access our prepared training data for machine learning and later consume the prediction data for modelling in SAP Data Warehouse Cloud. For this part, I highly recommend you to read this nice blog post first and follow the step-by-step guideline there.

I summarised some key takeaways and want to share with you. They are issues or confusion I was facing during the implementation, and I hope this will help you with a smooth start.

Takeaway 1: Use SAP HANA Cloud machine learning libraries in SAP Data Warehouse Cloud

To enable your Python script to run SAP HANA Automated Predictive Library (APL) and SAP HANA Predictive Analysis Library (PAL), you need first create a ticket. Please refer to this documentation for a step-by-step guideline. You can check if the enablement is successful in your tenant as follows:

Go to “Database Users” under Space and click on the create button. In the popup after creation, you should see the checkbox “Enable Automated Predictive Library (APL) and Predictive Analysis Library (PAL)” is enabled for selection.

 

Takeaway 2: Space Schema vs Open SQL Schema? Read vs Write Access?

You can create a database user for every person who want to work directly with your database. For every database user, an so-called OpenSQL schema is created automatically, which is a different schema from your space schema.

  • Open SQL schema is the space name + database user prefix – CPM_DEMO_202111#PYTHON:  This schema is used to create a database connection and write access to artefacts. For example, I created a connection to my HANA database via Open SQL Schema in Python script:

Inside the Graphical View Builder, you can find this Open SQL schema under Sources and consume the artefacts of it in your data models.

  • Space schema is the same as my space name – CPM_DEMO_202111: I used this schema to read access all views (you can also read data from tables, but you need to build a view on top of your tables) in SAP Data Warehouse Cloud. For example, I read the view “TK_GasolinePrices_Covid19” via my space schema in my Python script:

Takeaway 3: Where to find your connection details?

If you’re curious how to find the connection details related to this database user, such as host, port or password, you can do as follows:

2. Machine Learning via Python in Jupyter Notebook

In this blog post of Data Preparation, we have already created a view (training data) under SAP Data Warehouse Cloud, including all the necessary data from historical gasoline prices, station master, federal states and COVID-19 case number. You can look into this blog post for more details.

Our objective is to predict the next 7 days gasoline prices (Diesel, Super E5 and Super E10), taking the factor of COVID-19 numbers into account. To achieve this goal, Forecasting Multiple Time-Series is applied in this use case. The training data set was grouped by federal state, and each state represents one time-series forecasting. For demo purpose, we take the federal state “Baden-Württemberg” as an example. You can repeat the implementation for other states if you like.

As I have basic knowledge of machine learning, this blog post from my colleague Andreas, Forster helps me a lot. So, I recommend you to read Andreas’ explanation about multiple time-series forecasting and how to implement it via Python script in Jupyter Notebook.

Step 1: Connect to SAP Data Warehouse Cloud

First, you can utilise the hana_ml library to establish a connection to the database of SAP Data Warehouse Cloud. We have mentioned how to find the connection details in Takeaway 3 of “Part 1: Connect to SAP Data Warehouse Cloud from Python”.

1. Import and check the version of hana_ml library

import hana_ml
print(hana_ml.__version__)

2. Establish a connection to database under SAP Data Warehouse Cloud via Open SQL Schema (‘CPM_DEMO_202111#PYTHON’)

from hana_ml import dataframe
from hana_ml.dataframe import ConnectionContext

conn = dataframe.ConnectionContext(address="YOUR HOST URL", port =443, user = "OPEN SQL SCHEMA NAME")

3. Check read access to training data set via Space Schema (‘CPM_DEMO_202111’)

df_remote = conn.sql('select * from "CPM_DEMO_202111"."TK_GasolinePrices_Covid19"')
df_remote.count()

Step 2: Create machine learning model via HANA-APL library

As a next step, I’m going to use the AutoML algorithm “AutoTimeSeries” from the SAP HANA Automated Predictive Library (APL) in the python script. The HANA-APL library is more simple and user-friendly for beginners of machine learning like me.

1. Before creating and saving new forecasts, drop old forecasts from the target table, if it exists.

from hdbcli import dbapi
dbapi_cursor = conn.connection.cursor()

if conn.sql("SELECT TABLE_OID FROM SYS.TABLES WHERE TABLE_NAME = 'GASOLINEPRICE_FORECASTS_SEGMENTED';").count() > 0:
   dbapi_cursor.execute("""TRUNCATE TABLE GASOLINEPRICE_FORECASTS_SEGMENTED;""")

2. Machine learning model creation

Let’s look at the following Python codes for ML model creation. What is important here is that the last day inside our training data is “2021-10-21”. So, the forecasting shall start with date “2021-10-22”.

# In this example, one time-series is shown. So, the state includes only "Baden-Württemberg"
df_remote_segment = conn.sql('SELECT * FROM "CPM_DEMO_202111"."TK_GasolinePrices_Covid19"' )
    
# Sort data ascending, a requirement of the Automated Predictive Library
df_remote_segment = df_remote_segment.sort("DATE", desc = False)
# Check the last day in training data set, namely 2021-10-21
# Therefore, the forecasting shall starts with 2021-10-22
df_remote_segment.collect().tail(5)

import pandas as pd
columns = ["SEGMENT", "MAPE"]
df_allmapes = pd.DataFrame(columns=columns)

# Import HANA-APL library
from hana_ml.algorithms.apl.time_series import AutoTimeSeries
# Forecast next 7 days --> Super E5, based on column "DATE"
tsapl = AutoTimeSeries(time_column_name = 'DATE', target = 'E5', horizon = 7)
tsapl.fit(data = df_remote_segment)

Step 3: Run prediction and store results via Open SQL Schema

In this step, we will apply the established ML model to forecast Super E5 price in the next 7 days (starting from 2021-10-22) and write the forecasting results back to Open SQL schema (“CPM_DEMO_202111#PYTHON”) in SAP Data Warehouse Cloud. Let’s see the following statements in Python script together.

from hana_ml.dataframe import DataFrame
# Extra-predictable variables' values on the forecast period
forecast_df = DataFrame(conn,
                      'SELECT * '
                      'from "CPM_DEMO_202111"."TK_GasolinePrices_Covid19"'
                      'WHERE "E5" is null LIMIT 7')

df_remote_aplforecast = tsapl.forecast(data= forecast_df)

# Check the next 7 days forecasting
df_remote_aplforecast_next7 = df_remote_aplforecast.collect().tail(7)
df_remote_aplforecast_next7 

# Write the forecasting results in the target table of SAP Data Warehouse Cloud
df_remote_aplforecast = df_remote_aplforecast.select('*', ("'" + segment_bw + "'", 'SEGMENT'))
df_remote_aplforecast = df_remote_aplforecast.cast("SEGMENT", "NVARCHAR(100)") # Needed to ensure future segments with longer names can be stored
df_remote_aplforecast.save("GASOLINEPRICE_FORECASTS_SEGMENTED", append = True)

You can check the prediction data in the target table by using the graphical view builder in SAP Data Warehouse Cloud, and later consume it in your data models.

Step 4: Check the quality of machine learning model

Finally, let’s check quality of the created machine learning model and visualise the accuracy via Python. Actually, the accuracy of prediction is quite good.

# Track model quality
ts_mape = tsapl.get_performance_metrics()['MAPE'][0]
df_allmapes = df_allmapes.append({"SEGMENT": segment_bw, "MAPE": ts_mape}, ignore_index=True)
df_allmapes

%matplotlib inline
import matplotlib.pyplot as plt
df_remote = conn.sql("SELECT * FROM GASOLINEPRICE_FORECASTS_SEGMENTED WHERE SEGMENT ='Baden-Württemberg'")
ts_data = df_remote.collect()
plt.figure(figsize=(15, 5))
plt.plot(ts_data["ACTUAL"], linewidth=1)
plt.plot(ts_data["PREDICTED"], linewidth=1)
plt.fill_between( ts_data.index, ts_data["LOWER_INT_95PCT"], ts_data["UPPER_INT_95PCT"], color='b', alpha=.1);

For demonstration purpose, we showed a time-series for one type of gasoline prices (Super E5) in one federal state (Baden-Württemberg). You can repeat the procedure and pursue multiple time-series forecasting in a loop by your own need.

3. Consume prediction results and visualise findings

In the last section of this blog post, we are going to consume prediction results in a data model and visualise the findings in a dashboard of SAP Analytics Cloud. So, let’s move back to the graphical view builder of SAP Data Warehouse Cloud again and build an analytical dataset there.

A simple view is created for the target table under Open SQL Schema as follows:

In SAP Analytics Cloud, a story was created to visualise the next 7 days forecasting of Super E5 prices. You can combine this chart in your final report and show this finding to business users.

Conclusion

Congratulations! You have completed all the implementation in this end-to-end demo. We’re intending to show how data scientists keep working with their familiar language, e.g., python, in their familiar tools like Jupyter Notebook, easily connect to prepared data in SAP Data Warehouse Cloud and pursue machine learning part. Hope you enjoyed this blog post!

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Assigned Tags

      4 Comments
      You must be Logged on to comment or reply to a post.
      Author's profile photo Deniz Osoy
      Deniz Osoy

      Excellent blog post!

      Author's profile photo Wei Han
      Wei Han
      Blog Post Author

      Thank you, Deniz!

      Author's profile photo Marc DANIAU
      Marc DANIAU

      Thank you for this showcase of machine learning.

      Do you mind using the tag APL to appear in this list ? https://blogs.sap.com/tag/APL

      Author's profile photo Wei Han
      Wei Han
      Blog Post Author

      Sure, I added it. Thank you for the hint. 🙂