Machine Learning in a Box (week 7) : Jupyter Notebook
In case you are catching the train running, here is the link to the introduction blog of the Machine Learning in a Box series which allow you to get the series from the start. At the end of this introduction blog you will find the links for each elements of the series.
Before we get started, a quick recap from last
Last time, we looked at how to leverage the SAP HANA R integration which opens the door to about 11,000 packages. So, if you feel like the built-in libraries (APL & PAL) don’t offer what you need or if you feel like doing something your way too, now you can!
I hope you all managed to try this out, and probably some of you already started comparing the PAL implementation with R algorithms. Feel free to share your feedback!
I know that I promised to dive into the TensorFlow integration last time I publish (3 weeks ago).
But due to technical difficulties on my NUC and some of the Virtual Machines I run, which I had to rebuild a couple of times (yes, can still mess up a system if you don’t follow instructions), I decided to postpone this piece until I can figure out an easy path to setup the TensorFlow Serving ModelServer on a SUSE or Red Hat system, and provide an interesting but simple model examples.
Next week, I’ll do my best get the External Machine Learning (EML) library piece out!
But this week, the topic will be Jupyter Notebook!
Welcome to week 7 of Machine Learning in a Box!
You may ask why Abdel is talking about Jupyter Notebook as this is not an SAP product. The reason is simple: it’s a great tool to be used with SAP products. But also because, the use of Jupyter Notebook is becoming very common these days for Machine Learning related activities.
This is probably the perfect tool to use with SAP HANA, express edition for all your Machine Learning activities as it will allow you to do almost everything you need there:
- Run with SQL queries and use the results in visualizations
- Code in Python or R and leverage the thousands of available packages while consuming data from your SAP HANA instance if you
- and so many other things
About Jupyter Notebook and Project Jupyter
The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text.
Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.
Project Jupyter is a non-profit, open-source project, born out of the IPython Project in 2014 as it evolved to support interactive data science and scientific computing across all programming languages.
Jupyter is developed in the open on GitHub, through the consensus of the Jupyter community. For more information on our governance approach, please see our Governance Document.
Installing Jupyter Notebook
You can install Jupyter Notebook both locally on your machine or on your SAP HANA, express edition server.
The installation is pretty straight forward if you choose a local installation, you can refer to the online installation guide.
Now, if you decide to setup Jupyter on your SAP HANA, express edition, you will need to consider a few things which I have documented in a tutorial.
For example, you will need to generate a configuration file and enable the use of the machine IP address instead of localhost, in addition to set up SSL (optional) and a password (optional).
As usual, I produce a tutorial that guide you through the installation, the configuration and the connectivity to you SAP HANA, express edition instance. Here is the link:
Feel free to use the “provide feedback” link in the tutorial to let me know what you think about it.
SAP HANA, express edition + Jupyter + SQLAlchemy = Magic
You have probably realized that Jupyter was initially designed to run Python script (and is actually built using a series of python modules).
This means that you can use the SAP HANA Python driver to connect and consume your data. But would require you to write Python code using the Python Database API.
What if I can tell you that you could build a Jupyter Notebook that runs SQL with using the Python Database API?
This would be like magic? In fact, it’s ipython-sql magic. Thanks to Catherine Devlin works, you can now prefix your SQL statement with a simple %sql and get the results!
Using ipython-sql magic implies the use SQLAlchemy, a great Python SQL Toolkit and Object Relational Mapper module, and the SQLAlchemy Dialect for SAP HANA which is part of the open-source SAP repository on Git.
To start playing with it, add and adjust the following code in the first cell of your notebook:
import sqlalchemy %reload_ext sql %config SqlMagic.displaylimit = 5 hxe_connection = 'hana://ML_USER:Welcome18Welcome18@hxehost:39015'; %sql $hxe_connection
Then you can prefix all your SQL statement with a %sql like this:
%sql select * FROM M_DATABASE;
If you want you can then manipulate the return result set (API):
result = _ print(result)
If you want to learn more about open source @ SAP make sure you follow my colleague Jonathan Baker or check his blog Introducing Open Source at SAP for more details.
SAP HANA, express edition + Jupyter + (R or Python) = code, test, demo & share
When you enter the “Machine Learning” universe, you will probably think that your day to day job will be to run algorithms on your data.
But this is quite far from reality.
As you read in Machine Learning in a Box (week 2) : Project Methodologies, a big portion of the effort is spent (and sometime wasted) on data preparation.
So you will invest plenty of your time analyzing data and finding evidence that you will need to document. And I think (and from experience building lots documentation and PowerPoint), you have to make it as interactive as possible.
If we look at the Iris flower dataset, and even if it’s “just” 4 numerical attributes to predict species, you may use different visualization to achieve that with your language of choice.
Let’s say you want to plot the frequency of species based on each of the 4 attributes, this is what you would code with Python Kernel in Jupyter as an example using sklearn and pyplot from matplotlib:
import pandas as pd import matplotlib.pyplot as plt from sqlalchemy import create_engine engine = create_engine('hana://ML_USER:Welcome18Welcome18@localhost:39015') iris_db = pd.read_sql_query("select * from R_DATA.IRIS",engine) target_names = iris_db.Species.unique() feature_names = list(iris_db.drop('Species', axis=1).columns.tolist()) fig, axes = plt.subplots(figsize=(20,10), nrows=2, ncols=2) fig.suptitle("Iris Frequency Histogram", fontsize=16) colors= ['red', 'green', 'blue'] for i, ax in enumerate(axes.flat): for idx_label, color in zip(range(len(target_names)), colors): ax.hist( iris_db.loc[iris_db.Species == target_names[idx_label], feature_names[i]], label=target_names[label], color=color ) ax.set_xlabel(feature_names[i]) ax.set_ylabel('Frequency') axes[1,1].legend(loc='upper right') plt.show()
But you can do exactly the same with a R Kernel:
library(ggplot2) library(grid) library(gridExtra) library("RJDBC") jdbcDriver <- JDBC(driverClass="com.sap.db.jdbc.Driver", "/usr/sap/hdbclient/ngdbc.jar") jdbcConnection <- dbConnect( jdbcDriver, "jdbc:sap://localhost:39015/?autocommit=false", "ML_USER", "Welcome18Welcome18" ) iris <- dbGetQuery(jdbcConnection, "select * from R_DATA.IRIS") sl <- ggplot(data=iris, aes(x=Sepal.Length)) + geom_histogram(binwidth=0.2, aes(fill=Species)) + xlab("Sepal Length") + theme(legend.position="none") + ylab("Frequency") sw <- ggplot(data=iris, aes(x=Sepal.Width )) + geom_histogram(binwidth=0.2, aes(fill=Species)) + xlab("Sepal Width" ) + theme(legend.position="none") + ylab("Frequency") pl <- ggplot(data=iris, aes(x=Petal.Length)) + geom_histogram(binwidth=0.2, aes(fill=Species)) + xlab("Petal Length") + theme(legend.position="none") + ylab("Frequency") pw <- ggplot(data=iris, aes(x=Petal.Width )) + geom_histogram(binwidth=0.2, aes(fill=Species)) + xlab("Petal Width") + theme(legend.position="none") + ylab("Frequency") grid.arrange(sl, sw, pl, pw, ncol = 2, nrow = 2, top = textGrob("Iris Frequency Histogram")) dbDisconnect(jdbcConnection)
And this will output something like this:
Or using scatter plots in Python:
import pandas as pd import matplotlib.pyplot as plt from sqlalchemy import create_engine engine = create_engine('hana://ML_USER:Welcome18Welcome18@localhost:39015') iris_db = pd.read_sql_query("select * from R_DATA.IRIS",engine) target_names = iris_db.Species.unique() feature_names = list(iris_db.drop('Species', axis=1).columns.tolist()) feature_len = len(iris.feature_names) target_len = len(iris.target_names) fig, axes = plt.subplots( figsize=(20,10), nrows=dim, ncols=dim, sharex='col', sharey='row' ) fig.suptitle("Edgar Anderson's Iris Data", fontsize=16) plot_colors = ['blue', 'white', 'red'] for x in range(feature_len): for y in range(feature_len): ax = axes[x,y] if x == y: ax.text(0.5, 0.5, feature_names[x], transform=ax.transAxes, horizontalalignment='center', verticalalignment='center', fontsize=16 ) else: for idx_class, color in zip(range(target_len), plot_colors): idx = np.where(iris_db.Species == idx_class) ax.scatter( iris_db.loc[iris_db.Species == target_names[idx_class], feature_names[x]], iris_db.loc[iris_db.Species == target_names[idx_class], feature_names[y]], c=color, cmap=plt.cm.RdYlBu, label=target_names[idx_class], edgecolor='black', s=40 ) plt.show()
And in R:
library("RJDBC") jdbcDriver <- JDBC(driverClass="com.sap.db.jdbc.Driver", "/usr/sap/hdbclient/ngdbc.jar") jdbcConnection <- dbConnect( jdbcDriver, "jdbc:sap://localhost:39015/?autocommit=false", "ML_USER", "Welcome18Welcome18" ) iris_db <- dbGetQuery(jdbcConnection, "select * from R_DATA.IRIS") # repalce text value by color iris_db$SpeciesColor[iris_db$Species == "setosa"] <- "blue" iris_db$SpeciesColor[iris_db$Species == "versicolor"] <- "white" iris_db$SpeciesColor[iris_db$Species == "virginica"] <- "red" pairs( iris_db[1:4], main = "Edgar Anderson's Iris Data", pch = 21, bg = unclass(iris_db$SpeciesColor) ) dbDisconnect(jdbcConnection)
What other options do I have to “demo & share” but without too much coding?
Then I think SAP Lumira is what you are looking for. And the good news is that you even have now a dedicated openSAP course
Adding Jupyter to our software stack will really help you achieve some of the key goals in data science: code, test, demo & share.
Jupyter is definitely the tool I’ll use and push during the next hackathon I will attend or host.
This time, the next publication will be about the TensorFlow and I already built the content.
I’m just missing a simple example around image processing, for which one of the colleague in Walldorf kindly sent me an example that I need to investigate.
After this I will take a pause in the publication, as I’m expecting a new family member on the 1st of April (no joke).
(Remember sharing && giving feedback is caring!)
UPDATE: Here are the links to all the Machine Learning in a Box weekly blogs:
- Introducing “Project: Machine Learning in a Box”
- Machine Learning in a Box (part 2) : Project Methodologies
- Recap Machine Learning in a Box (part 2) : Project Methodologies
- Machine Learning in a Box (part 3) : Algorithms Learning Styles
- Machine Learning in a Box (part 4) : Get your environment up and running
- Machine Learning in a Box (part 5) : Upload Machine Learning Datasets
- Machine Learning in a Box (part 6) : SAP HANA R Integration
- Machine Learning in a Box (part 7) : Jupyter Notebook
- Machine Learning in a Box (part 8) : SAP HANA EML and TensorFlow Integration
- Machine Learning in a Box (part 9) : Build your first Machine Learning application
- Machine Learning in a Box (part 10) : JupyterLab
Is this possible using R only can i not use python instead ?
I'm not sure I get your question, but here are some comments that may help you clarify it.
You can choose which language you want to use in Jupyter by creating either R or Python Kernel Notebook. But, you can't mix R and Python in the same notebook.
The goal, with this blog and the associated tutorial, was to demonstrate that if you are a Python or R lover, you can still use the same tool, Jupyter, and achieve the same goal, consume data from SAP HANA, express edition.
Hope this helps.
you got my question correctly , i just wanna use python instead of 'R' for the same thing , i will use python & follow the series of blogs of ML.
Great that I was able to clarify.
I know that Python (just like R) offer a tremendous number of capabilities, but always keep in mind that you have built-in libraries in HANA that allows you to run many industry standard algorithm.
Using the built-in libraries will not require data to be extracted and transferred to your Python as they will consumed in place.
Thanks for clarification on usage of 'R' instead of python , will go then 'R' way only as of now , will see later if i can try something using python as well , if i keep in mind to learn ML in terms of SAP HANA which language i should mostly opt.
All of them including SQL!
First, you have the ability to run your ML inside HANA via SQL with the HANA libraries, the HANA R integration mixing it with R, and EML that allows you to get access to TensorFlow serving server which is in Python.
Then you can consume your HANA data to run your ML activities outside in more or less anything that can connect to HANA via JDBC, ODBC, Python...
Each options have pros and cons, but my rule of thumbs is always to start simple, and since you will most likely have to manipulate your data, let's start with SQL in HANA, and then use the built in libraries like PAL.
And if you feel like you are missing capabilities or want to compare with your approach, then you can explore R and Python.
Hi Abdel ,
Really thanks a lot for guiding me how to proceed on ML exploration as it is very true i am new in this & making my hands dirty on the Stuff & when i see series of blogs which have posted that encouraged me to try ml with SAP HANA.
Hi Abdel ,
HXE is not listed in my tenant database list hence i tried to create it using SQL Command mentioned in the doc but i ma facing authorization error , Can you please suggest -
"CREATE DATABASE HXE SYSTEM USER PASSWORD sap123C;"
Could not execute 'CREATE DATABASE HXE SYSTEM USER PASSWORD sap123C'
SAP DBTech JDBC: : insufficient privilege: Not authorized
Can you tell which version of HXE you are running? And which "flavor"? VM? Binary install? Docker? on GCP, AWS, Azure or CAL?
Are you sure you are connected as SYSTEM on the SYSTEMDB? (usually using port 39013 from your SQL query tool).
Have you checked the following tutorial : Prepare your SAP HANA, express edition instance for Machine Learning
Hi Abdel ,
Yes , I am sure i am connected , I am using Virtual machine method , No GCP.
Hi Abdel ,
One more error previous one got resolved by assigning appropriate authorization but right now while saving the data on remote/localhost directory i am getting the error -
scp: /usr/sap/HXE/HDB90/work/data_1.csv: No such file or directory
Can you guide please here .
Hi Shivam Shukla
data_1.csv is created as part of Step 1: Save the sample data from Import CSV into SAP HANA, express edition using IMPORT FROM SQL command
The directory /usr/sap/HXE/HDB90/work/ exists with all SAP HANA, express edition virtual machines.
The idea is to insert the set of lines in this file, then use the IMPORT FROM SQL command to import it into the following table:
Thanks a lot for the series, Jupyter Notebook and Hana combination is something I was looking forward too. It really helps to do quick prototyping for data that resides in Hana DB. I has installed the Hana Express Edition in Virtual Machine and Jupyter Notebooks in local machine and was facing error No module named ‘hdbcli’.
To resolve this I had to install the SAP hana client as mentioned