Skip to Content
Technical Articles
Author's profile photo Andreas Forster

Azure Machine Learning triggering calculations / ML in SAP Data Warehouse Cloud

If you are comfortable with Azure Machine Learning, you might enjoy the option to use your familiar Azure front ends to trigger data processing in your SAP Data Warehouse Cloud / SAP HANA Cloud systems.

Connect from Azure to the data in SAP Data Warehouse Cloud / SAP HANA Cloud and carry out data explorations, preparations, calculations, and Machine Learning. The data remains in place, thereby avoiding unnecessary data movement and data duplication. Save your outcomes as semantic views or physical tables. And if needed, you can still extract the data. However, you might not need to move all granular records, just the aggregated / prepared / filtered data that you require.

If you would like to stay instead within the SAP Business Technology Platform, you have the familiar choices to deploy your Python / R code with SAP Data Intelligence, CloudFoundry and Kyma.

Either way, SAP’s Python package hana_ml and R package hana.ml.r make it easy to trigger such an advanced analysis from any Python or R environment on the data held in SAP Data Warehouse Cloud.

Let’s go through and an example, which uses a notebook in Azure Machine Learning to explore the data in SAP Data Warehouse Cloud, aggregate it and create a time-series forecast on the aggregates. All triggered from Azure, executed in SAP Data Warehouse Cloud. All necessary code is in this blog.

Prerequisite

In this scenario we are assuming that you are working with SAP Data Warehouse Cloud. However, an implementation with SAP HANA Cloud would be very similar:

  • You need access to a SAP Data Warehouse Cloud system with three virtual CPUs (free trial is not sufficient)
  • The script server must be activated, see the documentation
  • Create a Database User in your SAP Data Warehouse Cloud space, with the options “Enable Automated Predictive Library and Predictive Analysis Library”, “Enable Read Access” and “Enable Write Access” selected, see the documentation
  • To trigger the Machine Learning in SAP Data Warehouse Cloud, you need that user’s password and the system’s host name, see this tutorial

Azure Machine Learning

How to use that SAP Data Warehouse Cloud system now from Microsoft Azure? Open the Azure portal (the free trial is sufficient for this scenario) and go into the “Azure Machine Learning” service. Create a workspace called “DWCworkspace” with a new resource group “DWCresourcegroup”. When the deployment is complete, click on “Go to resource” to open the “DWCworkspace”. From there, launch the “Azure Machine Learning Studio”.

 

Create first a Compute instance. Since the number crunching will be delegated to SAP Data Warehouse Cloud, select the smallest possible, “2 cores, 8 GB RAM” should do. Name it “DWCcompute”.

 

When the compute instance is running, click on it. Take note of the “Public IP address”. Typically this needs to be added to the “IP Allowlist” in SAP Data Warehouse Cloud. See the documentation.

Then create a Notebook “DWC time series” in the Microsoft Azure ML Studio. You see on top, that the notebook is running on our new “DWCcompute” instance.

 

Through the notebook install the hana_ml library from PyPI, as well as the shapely package which is needed for geo-spatial analysis.

!pip install hana-ml==2.13.22072200
!pip install shapely
In the following cell, display the installed version of the hana_ml library, which should return the specified “2.13.22072200”.
import hana_ml
print(hana_ml.__version__)
With the hana_ml library installed, you can establish a connection to the SAP Data Warehouse Cloud. Just enter your database user’s details (see above). This code should result with the output “True”, to confirm that the connection was created. Should this fail, double-check that the IP address of your compute instance was added to the “IP Allowlist” in SAP Data Warehouse Cloud. See the additional comments towards the end of this blog to keep your password secure.
import hana_ml.dataframe as dataframe
conn = dataframe.ConnectionContext(address='[YOURDBUSERSHOSTNAME]',
                                   port=443, 
                                   user='[YOURDBUSER]', 
                                   password='[YOURDBUSERSPASSWORD]')
conn.connection.isconnected()
In your own projects you might be working with data that is already in SAP Data Warehouse Cloud. To follow this example, load some data into the system for us to work with. Load the Power consumption of Tetouan city Data Set from the UCI Machine Learning Repository into a pandas DataFrame. (Salam, A., & El Hibaoui, A. (2018, December). Comparison of Machine Learning Algorithms for the Power Consumption Prediction: Case Study of Tetouan city“. It shows the power consumption of the power consumption of the Moroccan city in 10-minute intervals for the year 2017. We will use the data to create a time-series forecast. (Salam, A., & El Hibaoui, A. (2018, December). Comparison of Machine Learning Algorithms for the Power Consumption Prediction:-Case Study of Tetouan city–. In 2018 6th International Renewable and Sustainable Energy Conference (IRSEC) (pp. 1-5). IEEE)
import pandas as pd 
df_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/00616/Tetuan%20City%20power%20consumption.csv') 
df_data.columns = map(str.upper, df_data.columns)
df_data.columns = df_data.columns.str.replace(' ', '_')
df_data.DATETIME = pd.to_datetime(df_data.DATETIME)
Use the hana_ml library and the connection you have already established, to load this data into SAP Data Warehouse Cloud. The table and the appropriate columns will be created automatically.
import hana_ml.dataframe as dataframe
df_remote = dataframe.create_dataframe_from_pandas(connection_context = conn, 
                                                   pandas_df=df_data, 
                                                   table_name='POWERCONSUMPTION',
                                                   force=True,
                                                   replace=False)
Create a hana_ml DataFrame, which points to the table in SAP Data Warehouse Cloud. This variable points to the data that remains in SAP Data Warehouse Cloud. We will be using this variable extensively. Its collect method returns the data as pandas DataFrame. To have a peek at the data, just select 5 rows.
df_remote = conn.table('POWERCONSUMPTION')
df_remote.head(5).collect()

 

You can check on the column types.

df_remote.dtypes()

 

Have SAP Data Warehouse Cloud calculate some statistics about the data content.

df_remote.describe().collect()

 

The above distribution statistics were all calculated in SAP Data Warehouse Cloud. Only the results were transferred to your notebook. You can see the SELECT statement that was created by the describe()-method.

print(df_remote.describe().select_statement)

 

Explore the data with a UnifiedReport of the hana_ml library.

from hana_ml.visualizers.unified_report import UnifiedReport
UnifiedReport(df_remote).build().display()

 

You can also request individual calculations on the data, or ready-made exploratory plots. Have SAP Data Warehouse Cloud calculate a correlation matrix on the granular data. As you would expect, there is for instance a fair positive correlation between temperature and power consumption.

import matplotlib.pyplot as plt
from hana_ml.visualizers.eda import EDAVisualizer
f = plt.figure()
ax1 = f.add_subplot(111) # 111 refers to 1x1 grid, 1st subplot
eda = EDAVisualizer(ax1)
ax, corr_data = eda.correlation_plot(data=df_remote, cmap='coolwarm')
Now plot part of the data’s time series. Download the records of the last 7 weeks and plot them locally. The calculation 6 * 24 * 7 specifies how many most recent rows we are interested in. The data is in 10-minute intervals, so 6 gets a full hour. Multiplied by 24 to get a day, multiplied by 7 to get 7 weeks. The power consumption clearly has a strong weekly pattern.
df_data = df_remote.tail(n=6*24*7, ref_col='DATETIME').collect()
df_data.drop('DATETIME', axis=1).plot(subplots=True,
                                    figsize=(10, 14));
For further analysis add a DATE and MONTH column. These columns are added logically. The original table remains unchanged. Not data gets duplicated.
df_remote = conn.table('POWERCONSUMPTION')
df_remote = df_remote.select('*', ('TO_DATE(DATETIME)', 'DATE'))
df_remote = df_remote.select('*', ('LEFT(DATE, 7)', 'MONTH'))
df_remote.head(5).collect()
With the MONTH column, we can now create a boxplot that shows how consumption changes through the year. August has the highest median consumption.
import matplotlib.pyplot as plt
from hana_ml.visualizers.eda import EDAVisualizer
f = plt.figure()
ax1 = f.add_subplot(111) # 111 refers to 1x1 grid, 1st subplot
eda = EDAVisualizer(ax1)
ax, cont = eda.box_plot(data = df_remote,
                        column='ZONE_1_POWER_CONSUMPTION', groupby='MONTH', outliers=True)
ax.legend(bbox_to_anchor=(1, 1));
Let’s get ready for a daily forecast. Have SAP Data Warehouse Cloud aggregate the 10-minute intervals to daily aggregates and plot the consumption of the last 3 weeks. This is a native Python plot, there will be many options to make this look nicer. By the way, such transformation can also be saved as tables or views in SAP Data Warehouse Cloud, where other processes and programs could continue working with the data.
df_remote_daily = df_remote.agg([('sum', 'ZONE_1_POWER_CONSUMPTION', 'ZONE_1_POWER_CONSUMPTION_SUM')], group_by='DATE')
df_data = df_remote_daily.tail(n=21, ref_col='DATE').collect()

import seaborn as sns
sns.set_theme()
sns.set(rc={'figure.figsize':(15,5)})
sns.lineplot(data=df_data, x="DATE", y="ZONE_1_POWER_CONSUMPTION_SUM")
plt.xticks(rotation=45);
SAP Data Warehouse Cloud has an Automated ML framework built-in as well as 100+ individual algorithms. The hana_ml library’s documentation has all the details. For our time-series forecast we choose to use AdditiveModelForecast, which is an implementation of the Prophet algorithm and train a model.
from hana_ml.algorithms.pal.tsa.additive_model_forecast import AdditiveModelForecast
amf = AdditiveModelForecast(growth='linear')
amf.fit(data=df_remote_daily.select('DATE', 'ZONE_1_POWER_CONSUMPTION_SUM'))
To create predictions, you just need to specify for which dates predictions should be produced. Use the historic data (on which the model was trained) to dynamically obtain a  pandas DataFrame, which lists the following 21 dates.
import pandas as pd
strLastDate = str( df_remote_daily.select('DATE').max())
df_topredict = pd.date_range(strLastDate, periods=1 + 21, freq="D", closed='right').to_frame()
df_topredict.columns = ['DATE']
df_topredict['DUMMY'] = 0
df_topredict
Bring this data through the hana_ml library as temporary table to SAP Data Warehouse Cloud.
import hana_ml.dataframe as dataframe
df_rem_topredict = dataframe.create_dataframe_from_pandas(connection_context=conn, 
                                                   pandas_df=df_topredict, 
                                                   table_name='#TOPREDICTDAILY',
                                                   force=True,
                                                   replace=False)
Use the temporary table as input for the prediction.
df_rem_predicted = amf.predict(data=df_rem_topredict)
With the prediction now available as hana_ml DataFrame, you are free to use the predictions as required by your use case. Download the predictions for instance into a  pandas DataFrame and have them plotted. The weekly pattern is nicely represented.
df_data = df_rem_predicted.collect()

from matplotlib import pyplot as plt
plt.plot(df_data['DATE'], df_data['YHAT'])
plt.fill_between(df_data['DATE'],df_data['YHAT_LOWER'], df_data['YHAT_UPPER'], alpha=.3);
Or look at the predictions, still from Azure Machine Learning Studio. Here with the predicted columns renamed before the display.
df_rem_predicted = df_rem_predicted.rename_columns(['DATE', 'PREDICTION', 'PREDICTION_LOWER', 'PREDICTION_UPPER'])
df_rem_predicted.collect()
Or save the predictions as table to SAP Data Warehouse Cloud, where SAP Analytics Cloud, or another tool of your choice could pick it to share with a broader user base.
df_rem_predicted.save('POWERCONSUMPTION_21DAYPRED')

 

SAP Analytics Cloud can use a live connection to SAP Data Warehouse Cloud. Just create a view in SAP Data Warehouse Cloud on top of the table. Semantic usage for the view is “Analytical Dataset” and expose it for consumption. SAP Analytics Cloud can then access the data.

 

You have used the Azure Machine Learning Studio to explore data in SAP Data Warehouse Cloud, prepare the data, create a forecast and share the predictions to SAP Analytics Cloud. No data got duplicated along the way, thereby reducing architectural and governmental complexity. This logic can be further automated of course through Azure to regularly produce and share the latest predictions.

Rounding things off

To round off the use of the notebook, here are two related topics you might find useful: keeping the password secure, and creating an environment with the hana_ml package.

Database user password

In the above example, the database user password is entered as clear text in the code. Pragmatic, but not ideal. One option to avoid this, is to simply omit the password. The notebook will prompt you for it when running the cell.
import hana_ml.dataframe as dataframe
conn = dataframe.ConnectionContext(address='[YOURDBUSERSHOSTNAME]',
                                   port=443, 
                                   user='[YOURDBUSER]')
conn.connection.isconnected()
If you don’t want to enter the password every time you run the cell, you can also store / receive it as a secret in Azure.

Environment

In the above example the Python package hana_ml was installed from the notebook. You can also provide a kernel, that has all necessary components installed.
Save the content of the following box as hanaml2.13.22072200azure.yml file. This file specifies the dependencies required to run the code in this example.
name: hanaml2.13.22072200
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.9
  - ipykernel=6.2.0
  - jupyterlab=3.2.5
  - matplotlib=3.4.3 
  - seaborn=0.11.2
  - shapely=1.8.0 
  - numba=0.54.1
  - shap=0.39.0
  - plotly=5.5.0
  - openpyxl=3.0.9
  - pip=21.3.1
  - pip:
    - hana-ml==2.13.22072200

 

Upload the file into the folder where you have the notebook. Then, click the little “Open terminal” icon above and create an environment with this command:

conda env create -f hanaml2.13.22072200azure.yml   

 

When the environment has been build, add it as kernel to the system.

python -m ipykernel install --user --name=hanaml2.13.22072200

 

Now create a new notebook and on the top right you can select the new kernel. Without any further pip install, you have access to the hana_ml package.

Summary

Triggering calculations in SAP Data Warehouse Cloud and SAP HANA Cloud avoids the complexity and risks of moving data around and duplicating it in multiple places. However, there is also the option to bring data from SAP Data Warehouse Cloud to Azure for Machine Learning, without permanently saving it in Azure.

Assigned Tags

      1 Comment
      You must be Logged on to comment or reply to a post.
      Author's profile photo Andrey Rzhaksinskiy
      Andrey Rzhaksinskiy

      Good hands-on, Andreas.

      Integration between algorithms given on different platforms is getting easier