Kavish Nareshchandra Dahekar

SAP Data Intelligence: Accessing DataLake in Jupyter notebooks using Data Manager

NOTE : This is an old blog entry and the SAP DI Python SDK has since been updated. Please refer to the current documentation for up-to-date information.

SAP Data Intelligence brings together all the tools familiar to a Data Scientist while still providing them with the advantages of an enterprise level Data Science platform, connectivity to a large number of data sources and easy integration with other SAP services.

This blog post specifically covers a use case frequently encountered by Data Scientists using SAP DI. As a Data Scientist you will often find yourself experimenting with datasets in Jupyter notebooks. More often than not, these datasets will be stored on SAP DI’s internal Semantic Data Lake (SDL). How do you access these files directly from the comfort of your Jupyter notebook? You could of course create your own hdfs InsecureClient, figure out the connection parameters and make a raw call to fetch the contents of the file. But is there an easier way to achieve the same result? This is where ML Data Manager comes in.

ML Data Manager allows you to organize your files/datasets in a hierarchical manner while also helping you capture relevant metadata like features and lineage. ML Data Manager provides a framework within which you can guarantee the traceability and reproducibility of your ML pipelines. It does this by organizing your data into workspaces and datacollections. A workspace here correlates to an ML or customer use-case you might be working on. Within the workspace, Data Manager allows you to create multiple datacollections. Each datacollection correlates to a problem-specific set of data. As your experimentation moves along, you might morph your data into cleaner, well-structured datasets which you would then write to another datacollection within the workspace. Data Manager allows you to specify the hierarchy of these datacollections for easy identification of their relationships. Future capabilities also include being able to mark a datacollection as immutable. (Note : Lineage and hierarchy features are supported by the Data Manager API; UI support is under development.)
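To make the workspace/datacollection hierarchy concrete, here is a small illustrative sketch in Python. This is not the Data Manager API; the class and field names are invented purely to show how workspaces, datacollections, and lineage relate conceptually:

```python
# Illustrative sketch only -- NOT the real Data Manager API.
# Shows the conceptual relationship: a workspace groups datacollections,
# and a datacollection can record which collection it was derived from.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataCollection:
    name: str
    # Lineage: the collection this one was derived from, if any.
    parent: Optional["DataCollection"] = None

    def lineage(self) -> List[str]:
        """Walk the parent chain from this collection back to the root."""
        chain, node = [], self
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return chain


@dataclass
class Workspace:
    name: str
    collections: List[DataCollection] = field(default_factory=list)


# One workspace per use-case; raw data is cleaned into a derived collection.
ws = Workspace("churn-prediction")
raw = DataCollection("raw-events")
clean = DataCollection("cleaned-events", parent=raw)
ws.collections += [raw, clean]

print(clean.lineage())  # ['cleaned-events', 'raw-events']
```

The parent link is what lets you trace a cleaned dataset back to its raw source, which is the traceability idea the (API-level) lineage feature captures.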

Now that we have a quick understanding of what Data Manager is, let’s move along with a direct example so you can see how it works.

Here’s a test scenario you could follow in order to understand how exactly this would work. Let’s say you have a csv file on your local system. First, you wish to move it to SAP DI’s DataLake and then read this file into a pandas dataframe within a Jupyter notebook.
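If you want to follow along but don't have a csv file handy, a few lines of Python will create one locally. The filename and columns below are arbitrary examples, not values the later steps depend on:

```python
# Create a tiny sample CSV to upload in the steps that follow.
# Filename and columns are arbitrary -- use anything you like.
import csv

rows = [
    ["id", "name", "score"],
    [1, "alice", 0.9],
    [2, "bob", 0.7],
]
with open("sample.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Quick sanity check that the file round-trips.
with open("sample.csv", newline="") as f:
    print(list(csv.reader(f))[0])  # ['id', 'name', 'score']
```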

Step 1 : Uploading files to DataLake via Data Manager

Login to your SAP DI instance and click the Data Manager tile from the launchpad.

You will be redirected to the Data Manager UI where you will see a list of existing workspaces (or an empty list if no workspaces exist).

A Data Manager workspace corresponds to a Machine Learning use-case you could be working on, in which case you will create a separate workspace for each of your use-cases.

Start by creating a new workspace by clicking the “Create” button at the top right. Provide a name and description for your workspace and click “Create” again.

Note : Your user account should have appropriate permissions for accessing the DataLake for you to be able to create Workspaces and DataCollections in ML Data Manager.

Once the workspace is successfully created, you will be redirected to the workspace details page where you will see an empty list of DataCollections. Create a DataCollection with a name and description of your choice.

After successful creation of the DataCollection, you will be redirected to the DataCollection details page. This page allows you to view the contents of your DataCollection, which are currently empty. Let’s upload our csv file to this DataCollection.

Click the “Edit in Metadata Explorer” button on the top right. This will take you to Metadata Explorer where you can upload files to the DataLake folder of the DataCollection we created in the previous step.


Click on the upload files icon and a file upload dialog will open. Click on “+” and select the file you wish to upload. Click on the “Upload” button and you should see a progress-bar indicating the upload progress.

Note : The upload icon will be disabled if you do not have the correct permissions for writing to the DataLake. You will need the sap.dh.metadata policy applied to your user in order to be able to upload files to the DataLake.

Our file is now on the DataLake. You can close Metadata Explorer and switch back to ML Data Manager’s DataCollection details page. Make sure the content tab shows your uploaded file in the content list. (This might require refreshing the content list; use the refresh button next to the “Edit in Metadata Explorer” button.)

That’s it from ML Data Manager side. Our file is now ready to be read from our notebook.


Step 2 : Reading DataLake files from Jupyter notebook

Navigate back to the launchpad and click on the ML Scenario Manager tile.

In ML Scenario Manager, you could choose a pre-existing scenario and notebook, but for the sake of this article let’s go ahead and create a fresh scenario.

Click the “+” button and provide a name and description for your ML Scenario. Click “Create” to create the scenario.

Once the scenario is successfully created, create a notebook within the scenario with a name and description.

This will redirect you to SAP DI’s Jupyter Lab instance. The notebook you created in the previous step should be automatically opened for you with a kernel selection pop-up. Select Python 3 as your kernel.

SAP DI’s Jupyter Lab comes pre-installed with the SAP DI Data Browser extension. You can find it in the left sidebar’s bottom-most icon.

The Data Browser extension allows you to view metadata catalogs as well as collections. In our case, however, we will utilize the extension’s ability to access ML Data Manager’s workspaces and datacollections. At the top of the Data Browser extension’s sidebar, you will see three icons. Click the third icon, which allows us to access Data Manager entities.

Right off the bat, you will see that the extension sidebar displays the Workspace that we created in our previous steps via ML Data Manager. Double-click on the Workspace name and you will see the DataCollection we created within this Workspace. Double-click the DataCollection name and you will see the file we uploaded within this DataCollection.

Clicking the small clipboard icon next to the file name copies a python code snippet to your clipboard that lets you quickly read this file in your notebook. Click the icon, then click within any cell in the notebook and paste (Ctrl+V) the clipboard contents.

And there you have it. If you run this cell, you will see that the code reads the csv file from the DataLake and writes it to a pandas dataframe which you can then use for further experimentation within the notebook.
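The essence of the pasted snippet is that the SDK hands a file-like reader to pandas. As a rough, self-contained stand-in (an in-memory buffer replaces the real DataLake reader here, and the sample data is made up for illustration):

```python
# Stand-in for what the generated snippet ultimately does: pandas is given
# a file-like reader and parses the csv into a dataframe. In the real
# snippet, the sapdi SDK supplies the reader; here io.StringIO plays
# that role with made-up sample data.
import io

import pandas as pd

reader = io.StringIO("id,name,score\n1,alice,0.9\n2,bob,0.7\n")
df = pd.read_csv(reader)

print(df.shape)  # (2, 3)
```

Because `pd.read_csv` accepts any file-like object, the same dataframe code works whether the reader comes from the DataLake, local disk, or memory.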

Note: Notice that the code snippet makes use of the sapdi python package for reading DataLake files. There are many other functionalities exposed via the SAP Data Intelligence Python SDK which you can discover from the documentation.

And there we have it. The csv file that began from our local system now sits on the DataLake and can be directly accessed from a Jupyter notebook.


There could be multiple ways of achieving the same result as this blog post. However, the primary aim of this blog was to show the intended way of using ML Data Manager, ML Scenario Manager and Jupyter notebooks for reading files from the DataLake.

What do you think of the Data Manager application? Anything you would like to add that could make the user experience more fruitful? Reach out to us at the SAP Data Intelligence team or leave your comments and questions below; we’ll try our best to respond to all of them.

Thank you.


      Jeremy Yu

      Thanks Kavish! This is very helpful!

      Raphael Geisel

      Thank you for this helpful article, Kavish!


      Is it also possible to write in the other direction, i.e. results of scripting into the data lake or other SAP systems like Data Warehouse Cloud?

      So far I only know the possibility to store e.g. csv files in the notebook itself (cell 52).

      Kavish Nareshchandra Dahekar (Blog Post Author)

      Hey Raphael, the other direction should also be possible as far as I know, but the steps would vary based on the target SAP system.

      Here's a blog I found to integrate SAP DI with SAP Data Warehouse Cloud :

      Hope that helps.

      Marcus Schiffer



      We are trying to use larger files (e.g. 2.4 GB) with the Jupyter notebooks in DI.

      Whenever we load these files (either from the SDL repository or a local file), Jupyter crashes. Smaller files (up to 1 GB) can be read without problems.

      That seems strange, given that DI is advertised as an ML solution (and these typically handle larger files).

      Is there a way to make Jupyter handle larger files in DI?

      Any help appreciated.




      Kavish Nareshchandra Dahekar (Blog Post Author)

      Hi Marcus,

      Jupyter's default memory request and limit are set at 1Gi and 4Gi, respectively. If Jupyter runs out of memory while trying to read large data, the kernel crashes.

      You can check the "Memory Limits in JupyterLab" section of the "What's New in SAP Data Intelligence?" help document for more info and how to increase this limit :

      Also note that increasing the memory allocated to Jupyter is subject to available resources on the DI cluster. I would suggest gradually increasing the memory until you are able to load your file. One important point: after changing the memory limit you must restart the Jupyter instance from System Management for the change to take effect.

      Hope that helps.

      David Bertsche

      Hi Kavish,

      When I use the python code in step 2 it works, but I get the following warning:

      "/opt/conda/lib/python3.7/site-packages/ DeprecatedWarning: get_workspace is deprecated as of 0.3.30. This is separate from the ipykernel package so we can avoid doing imports until"

      Is this going to stop working soon? What is the recommended alternate method now?

      Thanks - David

      Kavish Nareshchandra Dahekar (Blog Post Author)

      Hi David,

      Data Manager has been deprecated as of Jan 2021 and will be removed around Sept 2021, which is why you see the deprecation warning.

      More details here:

      David Bertsche

      Hi Kavish,

      I can't tell from the support note whether this warning can be safely ignored, or whether it means that the steps you describe in your post won't work any more after the deprecation planned for September. If it's the latter, what is the new recommended method for accessing the DataLake from Jupyter?

      Thanks, David

      Kavish Nareshchandra Dahekar (Blog Post Author)

      Hi David,

      The steps described in the blog will stop working after the planned deprecation. As far as I know, alternate ways to support the same functionality are planned or in progress.

      Feel free to drop me an email so I can get you in touch with the concerned team that's working on this.


      MadanKumar Pichamuthu

      very precise and clear.. thanks for taking the time to write this.. 🙂

      Ahmed Abdelhady

      Thanks Kavish for the great effort. Just a request for clarification regarding the data lake: can Data Intelligence be used as a data lake without connecting to Hadoop, and are there any limitations in this scenario, such as limits on data size or data types (structured or unstructured)?


      Kavish Nareshchandra Dahekar (Blog Post Author)

      Hi Ahmed, thanks for your comment.

      I'll connect you to someone via email who can answer this question. Please do update this comment thread once your query is resolved for future visitors.


      Former Member

      Hi, I tried your approach; however, every time I try to use sapdi.get_workspace() I get an error:

      module 'sapdi' has no attribute 'get_workspace'

      I checked via the help() function and this function is not available to me.

      Can you help me with this issue?


      Thanks in advance

      Kavish Nareshchandra Dahekar (Blog Post Author)

      Hi Agnieszka, this is a very old article. The SAP DI Python SDK has most probably been updated since; I will mention this in the article.