Welcome to week 7 of Machine Learning in a Box!

abdel_dadouche · ‎03-27-2018

In case you are catching the train running, here is the link to the introduction blog of the Machine Learning in a Box series which allow you to get the series from the start. At the end of this introduction blog you will find the links for each elements of the series.

Before we get started, a quick recap from last week time

Last time, we looked at how to leverage the SAP HANA R integration which opens the door to about 11,000 packages. So, if you feel like the built-in libraries (APL & PAL) don't offer what you need or if you feel like doing something your way too, now you can!

I hope you all managed to try this out, and probably some of you already started comparing the PAL implementation with R algorithms. Feel free to share your feedback!

I know that I promised to dive into the TensorFlow integration last time I publish (3 weeks ago).

But due to technical difficulties on my NUC and some of the Virtual Machines I run, which I had to rebuild a couple of times (yes, can still mess up a system if you don't follow instructions), I decided to postpone this piece until I can figure out an easy path to setup the TensorFlow Serving ModelServer on a SUSE or Red Hat system, and provide an interesting but simple model examples.

Next week, I'll do my best get the External Machine Learning (EML) library piece out!

But this week, the topic will be Jupyter Notebook!

Welcome to week 7 of Machine Learning in a Box!

Jupyter Notebook

You may ask why Abdel is talking about Jupyter Notebook as this is not an SAP product. The reason is simple: it's a great tool to be used with SAP products. But also because, the use of Jupyter Notebook is becoming very common these days for Machine Learning related activities.

This is probably the perfect tool to use with SAP HANA, express edition for all your Machine Learning activities as it will allow you to do almost everything you need there:

Run with SQL queries and use the results in visualizations

Code in Python or R and leverage the thousands of available packages while consuming data from your SAP HANA instance if you

and so many other things

About Jupyter Notebook and Project Jupyter

The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text.

Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.

Project Jupyter is a non-profit, open-source project, born out of the IPython Project in 2014 as it evolved to support interactive data science and scientific computing across all programming languages.

Jupyter is developed in the open on GitHub, through the consensus of the Jupyter community. For more information on our governance approach, please see our Governance Document.

Installing Jupyter Notebook

You can install Jupyter Notebook both locally on your machine or on your SAP HANA, express edition server.

The installation is pretty straight forward if you choose a local installation, you can refer to the online installation guide.

Now, if you decide to setup Jupyter on your SAP HANA, express edition, you will need to consider a few things which I have documented in a tutorial.

For example, you will need to generate a configuration file and enable the use of the machine IP address instead of localhost, in addition to set up SSL (optional) and a password (optional).

As usual, I produce a tutorial that guide you through the installation, the configuration and the connectivity to you SAP HANA, express edition instance. Here is the link:

Use Jupyter Notebook with SAP HANA, express edition

Feel free to use the "provide feedback" link in the tutorial to let me know what you think about it.

SAP HANA, express edition + Jupyter + SQLAlchemy = Magic

You have probably realized that Jupyter was initially designed to run Python script (and is actually built using a series of python modules).

This means that you can use the SAP HANA Python driver to connect and consume your data. But would require you to write Python code using the Python Database API.

What if I can tell you that you could build a Jupyter Notebook that runs SQL with using the Python Database API?

This would be like magic? In fact, it's ipython-sql magic. Thanks to Catherine Devlin works, you can now prefix your SQL statement with a simple %sql and get the results!

Using ipython-sql magic implies the use SQLAlchemy, a great Python SQL Toolkit and Object Relational Mapper module, and the SQLAlchemy Dialect for SAP HANA which is part of the open-source SAP repository on Git.

To start playing with it, add and adjust the following code in the first cell of your notebook:

import sqlalchemy

%reload_ext sql

%config SqlMagic.displaylimit = 5



hxe_connection = 'hana://ML_USER:Welcome18Welcome18@hxehost:39015';



%sql $hxe_connection

Then you can prefix all your SQL statement with a %sql like this:

%sql select * FROM M_DATABASE;

If you want you can then manipulate the return result set (API😞

result = _

print(result)

If you want to learn more about open source @ SAP make sure you follow my colleague jonathanb or check his blog Introducing Open Source at SAP for more details.

SAP HANA, express edition + Jupyter + (R or Python) = code, test, demo & share

When you enter the "Machine Learning" universe, you will probably think that your day to day job will be to run algorithms on your data.

But this is quite far from reality.

As you read in Machine Learning in a Box (week 2) : Project Methodologies, a big portion of the effort is spent (and sometime wasted) on data preparation.

So you will invest plenty of your time analyzing data and finding evidence that you will need to document. And I think (and from experience building lots documentation and PowerPoint), you have to make it as interactive as possible.

If we look at the Iris flower dataset, and even if it's "just" 4 numerical attributes to predict species, you may use different visualization to achieve that with your language of choice.

Let's say you want to plot the frequency of species based on each of the 4 attributes, this is what you would code with Python Kernel in Jupyter as an example using sklearn and pyplot from matplotlib:

import pandas as pd

import matplotlib.pyplot as plt 



from sqlalchemy import create_engine

engine = create_engine('hana://ML_USER:Welcome18Welcome18@localhost:39015')



iris_db = pd.read_sql_query("select * from R_DATA.IRIS",engine)



target_names  = iris_db.Species.unique()

feature_names = list(iris_db.drop('Species', axis=1).columns.tolist())



fig, axes = plt.subplots(figsize=(20,10), nrows=2, ncols=2)

fig.suptitle("Iris Frequency Histogram", fontsize=16)

colors= ['red', 'green', 'blue']



for i, ax in enumerate(axes.flat):

    for idx_label, color in zip(range(len(target_names)), colors):

        ax.hist(

            iris_db.loc[iris_db.Species == target_names[idx_label], feature_names[i]], 

            label=target_names[label], 

            color=color

        )

        ax.set_xlabel(feature_names[i])

        ax.set_ylabel('Frequency')

axes[1,1].legend(loc='upper right')

plt.show()

But you can do exactly the same with a R Kernel:

library(ggplot2) 

library(grid)

library(gridExtra)

library("RJDBC")



jdbcDriver <- JDBC(driverClass="com.sap.db.jdbc.Driver", "/usr/sap/hdbclient/ngdbc.jar")

jdbcConnection <- dbConnect(

    jdbcDriver,

    "jdbc:sap://localhost:39015/?autocommit=false",

    "ML_USER",

    "Welcome18Welcome18"

)

iris <- dbGetQuery(jdbcConnection, "select * from R_DATA.IRIS")



sl <- ggplot(data=iris, aes(x=Sepal.Length)) + 

    geom_histogram(binwidth=0.2, aes(fill=Species)) + 

    xlab("Sepal Length") + 

    theme(legend.position="none") + 

    ylab("Frequency")



sw <- ggplot(data=iris, aes(x=Sepal.Width )) + 

    geom_histogram(binwidth=0.2, aes(fill=Species)) + 

    xlab("Sepal Width" ) + 

    theme(legend.position="none") + 

    ylab("Frequency")



pl <- ggplot(data=iris, aes(x=Petal.Length)) +

    geom_histogram(binwidth=0.2, aes(fill=Species)) +

    xlab("Petal Length") +

    theme(legend.position="none") +

    ylab("Frequency")



pw <- ggplot(data=iris, aes(x=Petal.Width )) +

    geom_histogram(binwidth=0.2, aes(fill=Species)) +

    xlab("Petal Width") +

    theme(legend.position="none") +

    ylab("Frequency")



grid.arrange(sl, sw, pl, pw, ncol = 2, nrow = 2, top = textGrob("Iris Frequency Histogram"))



dbDisconnect(jdbcConnection)

And this will output something like this:

Or using scatter plots in Python:

import pandas as pd

import matplotlib.pyplot as plt 



from sqlalchemy import create_engine

engine = create_engine('hana://ML_USER:Welcome18Welcome18@localhost:39015')



iris_db = pd.read_sql_query("select * from R_DATA.IRIS",engine)



target_names  = iris_db.Species.unique()

feature_names = list(iris_db.drop('Species', axis=1).columns.tolist())

feature_len = len(iris.feature_names)

target_len  = len(iris.target_names)





fig, axes = plt.subplots(

    figsize=(20,10),

    nrows=dim, 

    ncols=dim, 

    sharex='col', 

    sharey='row'

)

fig.suptitle("Edgar Anderson's Iris Data", fontsize=16)



plot_colors = ['blue', 'white', 'red']



for x in range(feature_len):

    for y in range(feature_len):

        ax = axes[x,y]

        if x == y:

            ax.text(0.5, 0.5, 

                    feature_names[x], 

                    transform=ax.transAxes, 

                    horizontalalignment='center', 

                    verticalalignment='center', 

                    fontsize=16

                   )

        else:

            for idx_class, color in zip(range(target_len), plot_colors):

                idx = np.where(iris_db.Species == idx_class)                

                ax.scatter(

                    iris_db.loc[iris_db.Species == target_names[idx_class], feature_names[x]], 

                    iris_db.loc[iris_db.Species == target_names[idx_class], feature_names[y]], 

                    c=color, 

                    cmap=plt.cm.RdYlBu, 

                    label=target_names[idx_class], 

                    edgecolor='black', 

                    s=40

                )

plt.show()

And in R:

library("RJDBC")

jdbcDriver <- JDBC(driverClass="com.sap.db.jdbc.Driver", "/usr/sap/hdbclient/ngdbc.jar")

jdbcConnection <- dbConnect(

    jdbcDriver,

    "jdbc:sap://localhost:39015/?autocommit=false",

    "ML_USER",

    "Welcome18Welcome18"

)

iris_db <- dbGetQuery(jdbcConnection, "select * from R_DATA.IRIS")



# repalce text value by color

iris_db$SpeciesColor[iris_db$Species == "setosa"] <- "blue"

iris_db$SpeciesColor[iris_db$Species == "versicolor"] <- "white"

iris_db$SpeciesColor[iris_db$Species == "virginica"] <- "red"



pairs(

    iris_db[1:4], 

    main = "Edgar Anderson's Iris Data", 

    pch = 21, 

    bg = unclass(iris_db$SpeciesColor)

)

dbDisconnect(jdbcConnection)

What other options do I have to "demo & share" but without too much coding?

Then I think SAP Lumira is what you are looking for. And the good news is that you even have now a dedicated openSAP course

Getting Started with SAP Lumira

Conclusion

Adding Jupyter to our software stack will really help you achieve some of the key goals in data science: code, test, demo & share.

Jupyter is definitely the tool I'll use and push during the next hackathon I will attend or host.

This time, the next publication will be about the TensorFlow and I already built the content.

I'm just missing a simple example around image processing, for which one of the colleague in Walldorf kindly sent me an example that I need to investigate.

After this I will take a pause in the publication, as I'm expecting a new family member on the 1st of April (no joke).

(Remember sharing && giving feedback is caring!)

UPDATE: Here are the links to all the Machine Learning in a Box weekly blogs: