SAP Data Intelligence: The difference between “local” Jupyter Notebook dev and “SAP Data Intelligence”
The purpose of this blog post is to explain how to use Jupyter Notebook on SAP Data Intelligence and how it differs from “local” Jupyter Notebook development.
The differences are almost entirely limited to the parts that depend on the local environment, such as file paths (imagine a file passed to a call like “pd.read_csv”, which most data scientists have used) and libraries. So I am convinced that data scientists who already use Jupyter Notebook can smoothly use it on SAP Data Intelligence :)
Here is one use case for machine learning. We can build an image similarity scoring page for an online shop. Once we create an ML model with the image similarity scoring API on SAP Data Intelligence, the model can detect which product appears in an image uploaded by a user.
I will now explain how to use Jupyter Notebook on SAP Data Intelligence.
■How to install libraries
Begin by installing the scikit-learn library, which is very popular for Machine Learning in Python on tabular data such as ours.
Running the imports below for the first time fails with an error because the necessary libraries are not installed.
However, we can install them as usual with the pip install command, as shown below.
import numpy as np
import pandas as pd
from sklearn import feature_extraction, linear_model, model_selection, preprocessing
pip install scikit-learn
After installing scikit-learn, the sklearn import succeeds, and Jupyter Notebook behaves just as it does in a local environment.
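Before running pip, it can help to check which modules are actually missing in the notebook environment. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical and not part of SAP Data Intelligence. Note that the package installed as scikit-learn is imported under the name sklearn.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from `module_names` that cannot be imported here."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# Import names used in this post (scikit-learn is imported as "sklearn").
required = ["numpy", "pandas", "sklearn"]
print("Missing modules:", missing_modules(required))
```

Any name that is printed still needs a pip install in the notebook before the imports will work.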
■Upload .csv from local laptop
We can use the pre-defined connection for the DI Data Lake to upload .csv files.
In the Metadata Explorer menu, click the “shared” folder >> View preparations >> Upload file (upper right of the page).
After that, the data configuration page appears.
In this example, three .csv files were uploaded.
The path to the folder can be seen in Metadata Explorer.
In this case, the folder path is the one used in the code below (/shared/MY_CSV_FILES). The snippet accesses the files and combines them into a single data frame.
!pip install hdfs

from hdfs import InsecureClient
import pandas as pd

# Connect to the DI Data Lake through its WebHDFS endpoint
client = InsecureClient('http://datalake:50070')
client.status("/")

# List the uploaded files in the shared folder
fnames = client.list('/shared/MY_CSV_FILES')

# Read each file and combine everything into one data frame
data = pd.DataFrame()
for f in fnames:
    with client.read('/shared/MY_CSV_FILES/' + f, encoding='utf-8') as reader:
        data_file = pd.read_csv(reader)
        data = pd.concat([data_file, data])
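The same loading logic can also be wrapped in a small reusable function. This is only a sketch under a few assumptions: the function name `load_csv_folder` is hypothetical, `client` is any object offering `list()` and `read()` the way the hdfs client above does, non-.csv files are skipped, and the frames are concatenated once at the end in sorted file-name order with a fresh index (unlike the loop above, which prepends each file).

```python
import posixpath

import pandas as pd


def load_csv_folder(client, folder):
    """Read every .csv file in `folder` and return one combined DataFrame."""
    frames = []
    for name in sorted(client.list(folder)):
        if not name.lower().endswith(".csv"):
            continue  # skip anything that is not a CSV file
        path = posixpath.join(folder, name)
        with client.read(path, encoding="utf-8") as reader:
            frames.append(pd.read_csv(reader))
    if not frames:
        return pd.DataFrame()  # empty folder: return an empty frame
    # Concatenate once instead of growing the frame inside the loop.
    return pd.concat(frames, ignore_index=True)
```

In a notebook on SAP Data Intelligence, it could be called as load_csv_folder(InsecureClient('http://datalake:50070'), '/shared/MY_CSV_FILES').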
The rest of the development in Jupyter Notebook is as simple as writing the required code.
As a data scientist, you have now seen that analysis and development on SAP Data Intelligence are no different from working in a local environment.
In addition, we no longer need to waste time rebuilding the development environment after a laptop OS update.
※If you’re a Mac user
I guess you had to rebuild your Anaconda environment when you updated to macOS Catalina, because the update relocates folders and the environment then needs to be fixed.
Thank you for reading this blog post.