Installing Python packages from tarball/zip files into SAP Data Intelligence: An example with hana_ml
To make sure a Python script executes properly, it should run in a custom environment that specifies the required modules and packages. Pip, the standard package installer for Python, simplifies this considerably by fetching packages from the Python Package Index (PyPI). In addition to this standard functionality, it can also install from local archive and wheel files.
In this blog post, I will demonstrate how to create a Docker file in SAP Data Intelligence (SAP DI) for a Python environment that includes packages located in archive files. As an example, we are going to use the hana_ml package. If you want to learn more about this package and how to download it, please consult this blog post. Additional installation details with HANA Express can be found here.
Uploading the hana_ml archive into the DI System
To install from a local archive (tarball/zip) file during the Docker definition step, the archive first needs to be uploaded into DI. We are going to look at the following two options to achieve this:
- Through the DI System Management
- Using the Modeler repository directly
With the first method, we open DI System Management from the DI Launchpad and proceed to the Files tab. Then, we navigate to the path files->vflow->dockerfiles and choose a location for the new Docker file:
With a click on Import File from the menu (top right side), the file can be located on the hard drive and selected:
After refreshing the My Workspace section (first button on the left), you should see the newly uploaded file in the selected path.
The procedure to upload the archive file using the Modeler directly is similar. In the Modeler, select the Repository tab and use the Import File menu to upload the archive file into the destination folder, as shown below:
Remark: The import function is currently meant for importing solutions into DI (packed as tar.gz archives), so it automatically unpacks every archive file it is given. To prevent this, simply rename your archive file by removing the extension (e.g. from hana_ml-1.0.5.tar.gz to hana_ml-1.0.5) before uploading it. Once uploaded, it is up to you whether you rename it back or leave it as it is.
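If you upload archives regularly, the rename from the remark above can be scripted. The following is only a minimal sketch: the helper names and the list of extensions are my own assumptions for illustration and are not part of DI.

```python
import os

# Extensions to strip before upload; this list is an assumption for illustration.
ARCHIVE_EXTENSIONS = (".tar.gz", ".tgz", ".tar", ".zip")


def strip_archive_extension(filename):
    """Return filename without a known archive extension (unchanged if none matches)."""
    for ext in ARCHIVE_EXTENSIONS:
        if filename.lower().endswith(ext):
            return filename[: -len(ext)]
    return filename


def rename_for_upload(path):
    """Rename the archive on disk so the DI import does not unpack it; return the new path."""
    new_path = strip_archive_extension(path)
    if new_path != path:
        os.rename(path, new_path)
    return new_path
```

Calling rename_for_upload("hana_ml-1.0.5.tar.gz") renames the file to hana_ml-1.0.5 in the same directory, which is the name the Docker file below copies from.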
Creating and building the Docker file
The process of building a Docker file in SAP DI has been described extensively in this excellent blog post. To avoid repeating those steps, I will continue directly with the new file definition, which is simplified and aims to only demonstrate the required lines of code:
    FROM $com.sap.python27
    COPY hana_ml-1.0.5 hana_ml.tar.gz
    RUN pip install hana_ml.tar.gz
The first line specifies the base image that our Docker file inherits from. With the COPY command, the uploaded local archive file is copied into the Docker container (and renamed at the same time, for easier reference). In the next step, pip installs the local file into the container environment.
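Before building the image, it can be worth checking on your own machine that pip accepts the archive at all. The sketch below runs the same kind of install as the RUN line, but from Python and outside of DI. It assumes a local Python installation with pip available and the archive in the working directory; the function names are hypothetical.

```python
import subprocess
import sys


def pip_install_command(archive_path):
    """Build the command that installs a local archive with the current interpreter's pip."""
    return [sys.executable, "-m", "pip", "install", archive_path]


def install_local_archive(archive_path):
    """Run pip against the local archive; raises CalledProcessError if pip fails."""
    subprocess.check_call(pip_install_command(archive_path))
```

For example, install_local_archive("hana_ml-1.0.5.tar.gz") would install the package into the environment of the interpreter running the script, ideally a throwaway virtual environment.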
Finally, the tags of the Docker file need to be updated, so that it can be used in custom operators or groups during the pipeline (graph) creation process. The tags I have selected here are:
The first three tags are internal requirements of DI, since the Docker file inherits from Python 2.7. I chose the last one to refer to this Docker environment. It will be used in the next step, during the pipeline creation process, to tag the Python operators that require this environment for their scripts to execute successfully.
Specifying the new Docker environment in a pipeline
The last step in the process is to link the newly created Docker environment with the pipelines, using the tags. The example pipeline created here uses OpenAPI to expose an APL model stored in HANA for scoring. The idea is that the user calls the API endpoint with the data to be scored, and the Python operator uses hana_ml to load the model from the repository and apply it to the new data set. Finally, the results are sent back to the requester and, at the same time, shown in Wiretap for debugging purposes.
The contents of the Python operator are as follows:
    from hana_ml import dataframe
    from hana_ml.algorithms.apl.classification import AutoClassifier

    def on_input(data):
        # create a data frame from the input message (skipped)
        # ...

        # create connection context (address, port and credentials are placeholders)
        conn = dataframe.ConnectionContext(address='xx.xx.xx.xx', port='00000', user='MYUSER',
                                           password='MYPASS', encrypt='true',
                                           sslValidateCertificate='false')

        # load the model from the HANA repository
        model2 = AutoClassifier(conn_context=conn)
        model2.load_model(schema_name='MY_SCHEMA_REPO', table_name='DI__APL_MODEL')

        # apply the model
        applyout2 = model2.predict(data.body)
        outmsg = applyout2.collect()

        # prepare the output message from the dataframe (skipped)
        # ...
        api.send("output", outmsg)

    api.set_port_callback("input", on_input)
Note that the Python operator was added to a group called Hana ML. This allows us to specify the tag selected during the previous step (hanaMLdocker) in the Group Settings, and thus to link the Docker environment with this operator:
Now the Python script stored in the operator will be executed in the right Docker container, which provides the environment we defined during the Docker file creation process above. This ensures that the package hana_ml can be imported and that the execution will be successful.
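If the operator ends up in the wrong container (for example because the tag is missing from the group), the import of hana_ml fails as soon as the script is loaded. A small check like the one below can make that failure easier to spot in the logs. It is only a sketch: the helper name is my own, and it relies on nothing more than the standard importlib module.

```python
import importlib


def is_importable(module_name):
    """Return True if module_name can be imported in the current environment."""
    try:
        importlib.import_module(module_name)
        return True
    except ImportError:
        return False
```

At the top of the operator script you could, for instance, call is_importable("hana_ml") and raise an error with a clear message if it returns False, pointing to the missing Docker tag.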
Note: If you want to use the Configuration Manager to handle the HANA connection (and not hardcode the credentials in the Python script) please check out this great blog post.
I hope this will be helpful to anyone trying to use pip with local archive files. By the way, the same procedure also works for preparing a custom R environment and using custom-made R packages.