Installing Python packages from tarball/zip files into SAP Data Intelligence: An example with hana_ml

To make sure a Python script executes properly, you need a custom environment that specifies the required modules and packages. Pip, the standard tool for installing packages from the Python Package Index (PyPI), simplifies this process significantly. Beyond installing from PyPI, it can also install from local archive and wheel files.

In this blog post, I will demonstrate how to create a Docker file in SAP Data Intelligence (SAP DI) for a Python environment that includes packages located in archive files. As an example, we are going to use the hana_ml package. If you want to learn more about this package and how to download it, please consult this blog post. Additional installation details with HANA Express can be found here.

Uploading the hana_ml archive into the DI System

To install from a local archive (tarball/zip) file during the Docker definition step, the archive first needs to be uploaded into DI. We are going to look at the following two options to achieve this:

  • Through the DI System Management
  • Using the Modeller Repository directly

With the first method, we open DI System Management from the DI Launchpad and go to the Files tab. Then, we navigate to the path files -> vflow -> dockerfiles and choose a location for the new Docker file.

Clicking Import File in the menu (top right) lets you locate the file on your hard drive and select it.

After refreshing the My Workspace section (first button on the left), you should see the newly uploaded file in the selected path.

The procedure for uploading the archive file directly from the Modeller is similar. In the Modeller, select the Repository tab and use the Import File menu to upload the archive file into the destination folder.

Remark: The import function is currently used to import solutions into DI (packed as tar.gz archives), so it automatically unpacks any archive file it is given. To prevent this, rename your archive file by removing the extension before uploading (e.g. from hana_ml-1.0.5.tar.gz to hana_ml-1.0.5). Once it is uploaded, it is up to you whether you rename it back or leave it as it is.
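
The rename from the remark can also be scripted before uploading. The following is a minimal sketch, assuming Python 3 on your local machine; the helper name strip_archive_suffix and the list of handled extensions are hypothetical and not part of any SAP tooling. It copies the archive instead of renaming it, so the original download stays untouched.

```python
import shutil
import tempfile
from pathlib import Path


def strip_archive_suffix(archive: Path) -> Path:
    """Copy `archive` next to itself under a name without .tar.gz/.tgz/.zip."""
    name = archive.name
    for suffix in (".tar.gz", ".tgz", ".zip"):
        if name.endswith(suffix):
            target = archive.with_name(name[: -len(suffix)])
            shutil.copyfile(archive, target)  # the original file is kept
            return target
    raise ValueError("not a recognised archive name: " + name)


if __name__ == "__main__":
    # Demo with a throwaway placeholder file; use your real download path instead.
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "hana_ml-1.0.5.tar.gz"
        src.write_bytes(b"placeholder")
        print(strip_archive_suffix(src).name)
```

The resulting file (hana_ml-1.0.5 in the demo) is the one to pick in the Import File dialog.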

Creating and building the Docker file

The process of building a Docker file in SAP DI has been described extensively in this excellent blog post. To avoid repeating those steps, I will continue directly with the new file definition, which is simplified and aims to only demonstrate the required lines of code:

FROM $com.sap.python27
COPY hana_ml-1.0.5 hana_ml.tar.gz
RUN pip install hana_ml.tar.gz

The first line specifies the Docker file that ours inherits from, here the Python 2.7 one. The COPY command copies the uploaded local archive file into the Docker container (renaming it back to hana_ml.tar.gz for easier reference). In the last step, pip installs the local file into the container environment.
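
To confirm that the pip step actually made the package available, a quick import check can be run inside the resulting environment, for example at the top of a test operator. This is a minimal sketch and not part of the original setup; the helper name can_import is hypothetical, and the code is written to run on both Python 2.7 and Python 3.

```python
import importlib


def can_import(module_name):
    """Return True if `module_name` can be imported in this interpreter."""
    try:
        importlib.import_module(module_name)
        return True
    except ImportError:
        return False


if __name__ == "__main__":
    # Inside the built image this is expected to report True for hana_ml.
    print("hana_ml importable:", can_import("hana_ml"))
```

If the check reports False, the most likely causes are a failed Docker build or an operator that is not running in the intended image.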

Finally, the tags of the Docker file need to be updated, so that it can be used in custom operators or groups during the pipeline (graph) creation process. The tags I have selected here are:

The first three tags are internal requirements of DI, since the Docker file inherits from Python 2.7. The last one was chosen by me to refer to this Docker environment. It will be used in the next step, during the pipeline creation process, to tag the Python operators that require this environment, so that the corresponding Python scripts execute successfully.

Specifying the new Docker environment in a pipeline

The last step in the process is to link the newly created Docker environment with the pipelines, using the tags. The example pipeline created here uses OpenAPI to expose an APL model stored in HANA for scoring. The idea is that the user calls the API endpoint with the data to be scored, and the Python operator uses hana_ml to load the model from the repository and apply it to the new data set. Finally, the results are sent back to the requester and, at the same time, shown in Wiretap for debugging purposes.

The contents of the Python operator are as follows:

from hana_ml import dataframe  # provides ConnectionContext
from hana_ml.algorithms.apl.classification import AutoClassifier

def on_input(data):
    # create a data frame from the input message (skipped)
    ...
    # create connection context (placeholder address and credentials)
    conn = dataframe.ConnectionContext(address='xx.xx.xx.xx', port='00000', user='MYUSER',
                                       password='MYPASS', encrypt='true',
                                       sslValidateCertificate='false')

    # load the model from the HANA repository
    model2 = AutoClassifier(conn_context=conn)
    model2.load_model(schema_name='MY_SCHEMA_REPO', table_name='DI__APL_MODEL')

    # apply the model
    applyout2 = model2.predict(data.body)
    # fetch the scoring results from HANA into a local data frame
    outmsg = applyout2.collect()

    # prepare the output message from the dataframe (skipped)
    ...
    api.send("output", outmsg)

# call on_input for every message arriving at the "input" port
api.set_port_callback("input", on_input)

Note that the Python operator was added to a group called Hana ML. This allows us to specify the tag selected in the previous step (hanaMLdocker) in the Group Settings and thus link the Docker environment with this operator.

Now the Python script stored in the operator will be executed in the right Docker container, which provides the environment we defined during the Docker file creation process above. This ensures that the hana_ml package can be imported and the script executes successfully.

Note: If you want to use the Configuration Manager to handle the HANA connection (and not hardcode the credentials in the Python script), please check out this great blog post.
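
Independently of that approach, the general idea of keeping credentials out of the script can be illustrated with plain Python. The following is only a sketch of a generic alternative, not the mechanism described in the linked post: it assumes the connection settings are available as environment variables, and the variable names (HANA_ADDRESS and so on) as well as the helper name hana_connection_kwargs are hypothetical.

```python
import os


def hana_connection_kwargs(env=None):
    """Build ConnectionContext keyword arguments from environment variables.

    The variable names used here are hypothetical; pick your own.
    """
    env = os.environ if env is None else env
    required = ("HANA_ADDRESS", "HANA_PORT", "HANA_USER", "HANA_PASSWORD")
    missing = [name for name in required if name not in env]
    if missing:
        raise KeyError("missing settings: " + ", ".join(missing))
    return {
        "address": env["HANA_ADDRESS"],
        "port": env["HANA_PORT"],
        "user": env["HANA_USER"],
        "password": env["HANA_PASSWORD"],
    }


# Usage inside the operator (requires hana_ml, so it is left as a comment):
# conn = dataframe.ConnectionContext(**hana_connection_kwargs())
```

Failing early with the list of missing settings makes a misconfigured environment easier to spot than a connection error raised later by the HANA client.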

I hope this will be helpful to anyone trying to use pip with local archive files. By the way, the same procedure also works for preparing a custom R environment and using custom-made R packages.

Thanks,

Stojan

4 Comments
  • Hi Stojan,

    good example on how to install non-pypi python libs.

    I also particularly liked your approach on how to expose the model inference task as a RESTful API with the OpenAPI Servlow operator. You should explore that more in another blog.

    One comment I’d make, though, is that while it’s fine for prototyping, in productive deployments you probably don’t want to have the .whl files directly uploaded to the vrep, since it will be copied for each user and you don’t have any control on versions etc. I’d recommend using a local nexus repo if you have libs that are not in pypi or if you don’t have internet access from your DH cluster. This blog by Remi explains that concept:
    https://blogs.sap.com/2019/08/15/using-sap-data-hub-without-internet-access/

    Cheers,
    Henrique.

    • Thanks for your comment, Henrique and for recommending Remi’s great blog post. Very interesting to see his approach on how to build your own offline repo, which I agree will be more suitable for a productive environment.

      Thanks,

      –Stojan