SAP Data Intelligence: Accessing DataLake in Jupyter notebooks using Data Manager
Note : This is an older blog post, and the SAP DI Python SDK has since been updated. Please refer to https://help.sap.com/docs/SAP_DATA_INTELLIGENCE/5ac15e8fccb447199fda4509e813bf9f/12f7abac63844d4293a339a8effb3521.html for up-to-date documentation.
SAP Data Intelligence brings together the tools familiar to a Data Scientist while still providing the advantages of an enterprise-level Data Science platform, connectivity to a large number of data sources and easy integration with other SAP services.
This blog post specifically covers a use case frequently encountered by Data Scientists using SAP DI. As a Data Scientist you will often find yourself experimenting with datasets in Jupyter notebooks. More often than not, these datasets will be stored on SAP DI’s internal Semantic Data Lake (SDL). How do you access these files directly from the comfort of your Jupyter notebook? You could of course create your own hdfs InsecureClient, figure out the connection parameters and make a raw call to fetch the contents of the file. But is there an easier way to achieve the same result? This is where ML Data Manager comes in.
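For comparison, the raw approach mentioned above might look roughly like the sketch below. It assumes the third-party hdfs Python package is installed and that the DataLake is reachable over WebHDFS. The URL, user name and file path are placeholders rather than real SAP DI connection parameters, and the helper names make_client and read_csv_rows are hypothetical.

```python
import csv
import io


def make_client(url, user):
    """Create a WebHDFS client. Requires the third-party `hdfs` package."""
    # Imported lazily so the rest of this sketch runs without the package.
    from hdfs import InsecureClient
    return InsecureClient(url, user=user)


def read_csv_rows(client, path):
    """Fetch a CSV file through the client and return its rows as dicts."""
    # client.read() is a context manager; with an encoding set it yields text.
    with client.read(path, encoding="utf-8") as reader:
        text = reader.read()
    return list(csv.DictReader(io.StringIO(text)))


# Example usage (placeholder values, adjust to your own environment):
# client = make_client("http://<namenode-host>:<port>", user="<user>")
# rows = read_csv_rows(client, "/path/to/file.csv")
```

Even in this small form you have to find the endpoint, user and full file path yourself, which is exactly the overhead the rest of this post avoids.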
ML Data Manager allows you to organize your files and datasets hierarchically while also helping you capture relevant metadata such as features and lineage. It provides a framework within which you can guarantee the traceability and reproducibility of your ML pipelines, and it does this by organizing your data into workspaces and datacollections. A workspace corresponds to an ML or customer use case you might be working on. Within a workspace, Data Manager allows you to create multiple datacollections, each of which corresponds to a problem-specific set of data. As your experimentation moves along, you might transform your data into cleaner, well-structured datasets, which you would then write to another datacollection within the workspace. Data Manager allows you to specify the hierarchy of these datacollections so that their relationships are easy to identify. Planned capabilities also include being able to mark a datacollection as immutable. (Note : Lineage and hierarchy features are supported by the Data Manager API; UI support is under development.)
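To make the workspace and datacollection hierarchy a little more concrete, here is a small, purely illustrative Python model of it. This is not the Data Manager API: the class names, fields and the lineage helper are all hypothetical and only mirror the concepts described above (a workspace holding datacollections, each optionally derived from a parent).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DataCollection:
    name: str
    parent: Optional[str] = None  # datacollection this one was derived from
    files: List[str] = field(default_factory=list)


@dataclass
class Workspace:
    name: str
    collections: Dict[str, DataCollection] = field(default_factory=dict)

    def add_collection(self, name, parent=None):
        """Register a datacollection, optionally below an existing parent."""
        if name in self.collections:
            raise ValueError(f"datacollection already exists: {name}")
        if parent is not None and parent not in self.collections:
            raise KeyError(f"unknown parent datacollection: {parent}")
        collection = DataCollection(name=name, parent=parent)
        self.collections[name] = collection
        return collection

    def lineage(self, name):
        """Return the chain of names from a datacollection back to its root."""
        chain = []
        current = name
        while current is not None:
            chain.append(current)
            current = self.collections[current].parent
        return chain


ws = Workspace("churn-prediction")
ws.add_collection("raw")
ws.add_collection("cleaned", parent="raw")
ws.add_collection("features", parent="cleaned")
print(ws.lineage("features"))  # ['features', 'cleaned', 'raw']
```

Storing only the parent name on each datacollection keeps the model flat while still letting you walk back from a cleaned dataset to the raw data it came from.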
Now that we have a quick understanding of what Data Manager is, let’s move along with a direct example so you can see how it works.
Here’s a test scenario you could follow in order to understand how exactly this would work. Let’s say you have a csv file on your local system. First, you wish to move it to SAP DI’s DataLake and then read this file into a pandas dataframe within a Jupyter notebook.
Step 1 : Uploading files to DataLake via Data Manager
Log in to your SAP DI instance and click the Data Manager tile on the launchpad.
You will be redirected to the Data Manager UI where you will see a list of existing workspaces (or an empty list if no workspaces exist).
A Data Manager workspace corresponds to a Machine Learning use case you are working on, so you would typically create a separate workspace for each of your use cases.
Start by creating a new workspace by clicking the “Create” button at the top right. Provide a name and description for your workspace and click “Create” again.
Note : Your user account should have appropriate permissions for accessing the DataLake for you to be able to create Workspaces and DataCollections in ML Data Manager.
Once the workspace is successfully created, you will be redirected to the workspace details page where you will see an empty list of DataCollections. Create a DataCollection with a name and description of your choice.
After successful creation of the DataCollection, you will be redirected to the DataCollection details page. This page allows you to view the contents of your DataCollection, which is empty for now. Let’s upload our csv file to this DataCollection.
Click the “Edit in Metadata Explorer” button on the top right. This will take you to Metadata Explorer where you can upload files to the DataLake folder of the DataCollection we created in the previous step.
Click on the upload files icon and a file upload dialog will open. Click on “+” and select the file you wish to upload. Click on the “Upload” button and you should see a progress-bar indicating the upload progress.
Note : The upload icon will be disabled if you do not have the correct permissions for writing to the DataLake. You will need the sap.dh.metadata policy applied to your user in order to be able to upload files to the DataLake.
Our file is now on the DataLake. You can close Metadata Explorer and switch back to ML Data Manager’s DataCollection details page. Make sure the content tab shows your uploaded file in the content list. (Might require a content list refresh. Use the refresh button next to the “Edit in Metadata Explorer” button.)
That’s it from the ML Data Manager side. Our file is now ready to be read from our notebook.
Step 2 : Reading DataLake files from Jupyter notebook
Navigate back to the launchpad and click on the ML Scenario Manager tile.
In ML Scenario Manager, you could choose a pre-existing scenario and notebook, but for the sake of this article let’s go ahead and create a fresh scenario.
Click the “+” button and provide a name and description for your ML Scenario. Click “Create” to create the scenario.
Once the scenario is successfully created, create a notebook within the scenario with a name and description.
This will redirect you to SAP DI’s Jupyter Lab instance. The notebook you created in the previous step should be automatically opened for you with a kernel selection pop-up. Select Python 3 as your kernel.
SAP DI’s Jupyter Lab comes pre-installed with the SAP DI Data Browser extension. You can find it in the left sidebar’s bottom-most icon.
The Data Browser extension allows you to view metadata catalogs as well as collections. In our case, however, we will use the extension’s ability to access ML Data Manager’s workspaces and datacollections. At the top of the extension’s sidebar, you will see three icons. Click the third icon, which gives access to Data Manager entities.
Right off the bat, you will see that the extension sidebar displays the Workspace that we created in our previous steps via ML Data Manager. Double-click on the Workspace name and you will see the DataCollection we created within this Workspace. Double-click the DataCollection name and you will see the file we uploaded within this DataCollection.
The small clipboard icon next to the file name copies a Python code snippet to your clipboard that lets you quickly read this file in your notebook. Click the clipboard icon, then click within any cell in the notebook and paste (Ctrl+V) the clipboard contents.
And there you have it. If you run this cell, you will see that the code reads the csv file from the DataLake and loads it into a pandas dataframe, which you can then use for further experimentation within the notebook.
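The copied snippet may follow a pattern along the lines of the sketch below. Treat this as an assumption based on older versions of the sapdi package: the calls get_workspace, get_datacollection, open and get_reader may be named differently in your SDK version, so the snippet copied from the Data Browser is the authoritative one. The function name read_datalake_csv and its parameters are hypothetical, and the sketch parses the file with Python’s built-in csv module instead of pandas so that it has no extra dependencies.

```python
import csv
import io


def read_datalake_csv(sdk, workspace, collection, file_path):
    """Read a CSV file from a datacollection and return its rows as dicts.

    `sdk` is expected to be the imported sapdi module (or a stand-in
    exposing the same calls). All SDK call names below are assumptions.
    """
    ws = sdk.get_workspace(name=workspace)             # assumed SDK call
    dc = ws.get_datacollection(name=collection)        # assumed SDK call
    with dc.open(file_path).get_reader() as reader:    # assumed SDK calls
        raw = reader.read()
    # The reader may return bytes or text, so handle both.
    text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
    return list(csv.DictReader(io.StringIO(text)))


# Example usage inside an SAP DI notebook (placeholder names):
# import sapdi
# rows = read_datalake_csv(sapdi, "<workspace>", "<datacollection>", "/<file>.csv")
```

In a notebook you would more likely hand the reader straight to pandas.read_csv to get a dataframe, which is what the text above describes.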
Note : The code snippet makes use of the sapdi Python package for reading DataLake files. There are many other functionalities exposed via the SAP Data Intelligence Python SDK, which you can discover from the documentation.
And there we have it. The csv file that began from our local system now sits on the DataLake and can be directly accessed from a Jupyter notebook.
There could be multiple ways of achieving the same result as that of this blog-post. However, the primary aim of this blog was to show the intended way of using ML Data Manager, ML Scenario Manager and Jupyter notebooks for reading files from the DataLake.
What do you think of the Data Manager application? Is there anything you would like to add that could make the user experience more fruitful? Reach out to us at the SAP Data Intelligence team or leave your comments and questions below, and we’ll try our best to respond to all of them.