Offline Custom Operator Development

Introduction

The SAP Data Intelligence Modeler is designed for developing data processing pipelines without any scripting. The “Structured Data Transform” operators in particular go a long way toward delivering on this promise. Nonetheless, there is always a need for more flexibility, and sometimes a script is a shortcut for a lengthy ‘no-code’ pipeline. I have to admit that writing custom operators in the embedded editor is no fun when the code spans more than a few lines. The main reason is that testing can only be done by starting a whole pipeline, including the time-consuming Docker image build. In addition, you have no tools for debugging or for watching variables at run-time.

To mitigate this, data scientists in particular use Jupyter Notebooks for code development, although for many developers this is not a really satisfying remedy. Others use cut-and-paste together with mock-ups of the APIs to develop operators offline. I developed tools of my own to support writing offline Python operators from scratch, which I described in a previous blog. Nonetheless, I have to admit that there is a need for a solution that is more sophisticated but also easier to use. In a discussion with our Business Application Studio colleagues I learnt that they mostly use Yeoman for their development support tools, and I wondered whether it could help me create an operator development assistant too. So I started to learn Node.js and Yeoman, and along the way produced an offline development tool, di-pyoperator. The outcome surprised me: ever since, I have used only this tool for offline development. I hope it will be of some benefit to you as well.

The di-pyoperator generator is publicly available on GitHub and can be extended under the Apache-2.0 licence. It enables you to

  • use your favourite IDE (integrated development environment)
  • download and upload SAP Data Intelligence operators
  • use a mock-api and a test package
  • do a quick-start by providing scaffolding code based on the operator definition

 

Installation

To install the tool you need Node.js, which also brings the package manager npm. On macOS the easiest way is to use Homebrew in a terminal:

brew install node

Once Node.js is installed, get Yeoman with

npm install -g yo

and the generator-di-pyoperator with

npm install -g generator-di-pyoperator

For the integration with SAP Data Intelligence the tool uses the System Management Command-Line Client (vctl). The documentation describes how to download it. You have to add it to your execution path.
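Before the first run it can be handy to confirm that everything mentioned above is reachable from your shell. The following is only a minimal sketch: the list of tool names is taken from this section, while the helper name `find_missing_tools` is made up for illustration and is not part of di-pyoperator.

```python
import shutil

# Command-line tools this section relies on; adjust the list to your setup.
REQUIRED_TOOLS = ["node", "npm", "yo", "vctl"]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found.")
```

If vctl shows up as missing, the folder it was downloaded to has probably not been added to the execution path yet.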

This is all you need. You can now start using di-pyoperator to develop your first offline operator.

Example

As an example I took a requirement that is a bit too complex, at least for my level of expertise, to implement without some preliminary testing: exploding an embedded list in a csv-file.
From Kaggle you can download a csv-file of Netflix titles that has a column containing the cast of each movie or TV show. Because you are interested in the actors, you want to explode the ‘cast’ column into one row per actor to produce a flat table, and send the result as a message.table to the outport. Additionally, you want to filter the data on movie or TV show using a config parameter.

With this first release of the tool, the starting point is the operator definition in SAP Data Intelligence.

Define the Operator in SAP Data Intelligence

Add a new operator ‘movie.explodeNetflixTitles’ in the SAP Data Intelligence Modeler. The current version of di-pyoperator requires a package name (‘movie’) in front of the operator name (explodeNetflixTitles). Because I think this is good practice anyway, I am not sure I will change this ;).

For the basic information enter the following:

  • Inport and Outport
    • input – basic – message.file (csv-file)
    • output – basic – message.table
  • Configuration
    • type – Type – string (filter for movie/tv)

The script is left unchanged. The code will be added offline.

Offline Development

Download and add Custom Code

Open your favourite Python IDE (PyCharm in this example) and

  1. Create a project-folder
  2. Open a terminal window (within IDE)
  3. Enter the command
    # To initialize the offline environment
    yo di-pyoperator --init

    # For regular use
    yo di-pyoperator

    and add the required information.

All files are downloaded:

and the script-code has already been initialised for a quick-start:

# First 3 lines generated by di-pyoperator - DO NOT CHANGE (Deleted again when uploaded.)
from utils.mock_di_api import mock_api
api = mock_api(__file__)

import pandas as pd
import copy
import io


def on_input(msg) :

    # Due to input-format PROPOSED transformation into DataFrame
    df = pd.read_csv(io.BytesIO(msg.body))

    # config parameter 
    #api.config.type = '*'    # datatype : string



    # Sending to outport output
    # Due to output-format PROPOSED transformation into message.table
    #df.columns = map(str.upper, df.columns)  # for saving to DB upper case is usual
    columns = []
    for col in df.columns : 
        columns.append({"class": str(df[col].dtype),'name': col})
    att = copy.deepcopy(msg.attributes)
    att['table'] = {'columns':columns,'name':'TABLE','version':1}
    out_msg = api.Message(attributes=att, body= df.values.tolist())
    api.send('output',out_msg)    # datatype: message.table

api.set_port_callback('input',on_input)   # datatype: message.file
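
To illustrate why a mock of the `api` object is enough to run such a script locally, here is a minimal sketch of the idea. It is not the implementation shipped in `utils.mock_di_api`: the class `MiniMockApi` and its `feed` helper are hypothetical and only mimic the three calls used in the generated script (`api.Message`, `api.send`, `api.set_port_callback`) plus `api.config`.

```python
from types import SimpleNamespace


class MiniMockApi:
    """Hypothetical stand-in for the `api` object; not the tool's real mock."""

    class Message:
        def __init__(self, body=None, attributes=None):
            self.body = body
            self.attributes = attributes if attributes is not None else {}

    def __init__(self, **config):
        self.config = SimpleNamespace(**config)  # config parameters as attributes
        self.callbacks = {}                      # port name -> callback
        self.sent = []                           # (port, message) pairs captured by send()

    def set_port_callback(self, port, callback):
        self.callbacks[port] = callback

    def send(self, port, msg):
        # Instead of passing data to the next operator, just record it
        self.sent.append((port, msg))

    def feed(self, port, msg):
        """Test helper: deliver a message to the callback registered on a port."""
        self.callbacks[port](msg)


api = MiniMockApi(type="Movie")


def on_input(msg):
    # Echo the body in upper case, just to have something observable
    api.send("output", api.Message(attributes=msg.attributes, body=msg.body.upper()))


api.set_port_callback("input", on_input)
api.feed("input", api.Message(body="hello", attributes={"testfile": "demo"}))

port, out = api.sent[0]
print(port, out.body)
```

Because `send` only records what the operator emits, a test script can inspect the captured messages after feeding the inport, without starting a pipeline.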

 

Now you can add your custom code. The following is my result after several trials, in which I learnt, for example, that some titles have no cast.

    ### Custom Code
    df = df[['title', 'type', 'cast', 'release_year']]
    df = df.loc[df['cast'].notna()]
    df['cast'] = df['cast'].apply(lambda x: x.split(','))
    df = df.explode('cast')
    df = df.rename(columns={'cast': 'ACTOR'})
    df['ACTOR'] = df['ACTOR'].str.strip()
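
The steps above can be tried in isolation on a tiny sample. The sketch below wraps them in a function and adds the filter on movie or TV show from the requirement, which the snippet above does not show. The sample rows are made up, the function name `explode_cast` is hypothetical, and treating `'*'` as "no filter" is an assumption based on the commented default in the generated script.

```python
import io

import pandas as pd

# Tiny made-up sample with the relevant columns of the Kaggle file (not real data).
SAMPLE_CSV = """show_id,type,title,cast,release_year
x1,Movie,Film A,"Actor One, Actor Two",2019
x2,TV Show,Series B,,2020
x3,TV Show,Series C,"Actor Three ,Actor One",2021
"""


def explode_cast(df, type_filter="*"):
    """Return one row per (title, actor); '*' keeps all types (assumed convention)."""
    df = df[["title", "type", "cast", "release_year"]]
    df = df.loc[df["cast"].notna()]              # some titles have no cast
    if type_filter != "*":
        df = df.loc[df["type"] == type_filter]   # e.g. 'Movie' or 'TV Show'
    df = df.assign(cast=df["cast"].str.split(","))
    df = df.explode("cast").rename(columns={"cast": "ACTOR"})
    df["ACTOR"] = df["ACTOR"].str.strip()        # remove blanks around the names
    return df.reset_index(drop=True)


sample = pd.read_csv(io.StringIO(SAMPLE_CSV))
all_rows = explode_cast(sample)
tv_rows = explode_cast(sample, type_filter="TV Show")
print(all_rows)
print(tv_rows)
```

In the operator itself the filter value would come from `api.config.type` rather than from a function argument.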

Testing

For testing you can download the data from Kaggle (https://www.kaggle.com/shivamb/netflix-shows) and store it in the folder <your project>/testdata/movie/explodeNetflixTitles

Data Subset

show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
s1,TV Show,3%,,"João Miguel, Bianca Comparato, Michel Gomes, Rodolfo Valente, Vaneza Oliveira, Rafael Lozano, Viviane Porto, Mel Fronckowiak, Sergio Mamberti, Zezé Motta, Celso Frateschi",Brazil,"August 14, 2020",2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi & Fantasy","In a future where the elite inhabit an island paradise far from the crowded slums, you get one chance to join the 3% saved from squalor."
s2,Movie,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, Azalia Ortiz, Octavio Michel, Carmen Beato",Mexico,"December 23, 2016",2016,TV-MA,93 min,"Dramas, International Movies","After a devastating earthquake hits Mexico City, trapped survivors from all walks of life wait to be rescued while trying desperately to stay alive."
...

To run the local test, modify script_test.py by replacing the name of the test file. Because of the given folder structure you do not need to provide the whole path.

msg = optest.get_msgfile('netflix_titles.csv')

Running the script_test.py shows that it produces the expected result:

/usr/local/bin/python3.8 "/Users/me/GitHub/di-pyoperators/operators/movie/explodeNetflixTitles/script_test.py"
*********************
Port: output
Attributes: {'testfile': '/Users/d051079/OneDrive - SAP SE/GitHub/di-pyoperators/testdata/movie/explodeNetflixTitles/netflix_titles_xxl.csv', 'table': {'columns': [{'class': 'object', 'name': 'title'}, {'class': 'object', 'name': 'type'}, {'class': 'object', 'name': 'ACTOR'}, {'class': 'int64', 'name': 'release_year'}], 'name': 'TABLE', 'version': 1}}
Data: [['3%', 'TV Show', 'João Miguel', 2020], ['3%', 'TV Show', 'Bianca Comparato', 2020], ['3%', 'TV Show', 'Michel Gomes', 2020], ['3%', 'TV Show', 'Rodolfo Valente', 2020], ['3%', 'TV Show', 'Vaneza Oliveira', 2020], ['3%', 'TV Show', 'Rafael Lozano', 2020], ['3%', 'TV Show', 'Viviane Porto', 2020], ['3%', 'TV Show', 'Mel Fronckowiak', 2020], ['3%', 'TV Show', 'Sergio Mamberti', 2020], ['3%', 'TV Show', 'Zezé Motta', 2020], ['3%', 'TV Show', 'Celso Frateschi', 2020], ['7:19', 'Movie', 'Demián Bichir', 2016], ['7:19', 'Movie', 'Héctor Bonilla', 2016], ['7:19', 'Movie', 'Oscar Serrano', 2016], ['7:19', 'Movie', 'Azalia Ortiz', 2016], ['7:19', 'Movie', 'Octavio Michel', 2016], ['7:19', 'Movie', 'Carmen Beato', 2016].....]

Process finished with exit code 0
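
Instead of reading the printed output by eye, a small consistency check can be applied to the attributes and data of the message. This is a sketch only: `check_table_message` is a hypothetical helper, not part of the generated test package, and the two data rows are copied from the output above.

```python
def check_table_message(attributes, body):
    """Raise AssertionError if column metadata and row data disagree."""
    columns = attributes["table"]["columns"]
    names = [col["name"] for col in columns]
    assert len(names) == len(set(names)), "duplicate column names"
    for i, row in enumerate(body):
        assert len(row) == len(columns), (
            f"row {i} has {len(row)} values, expected {len(columns)}"
        )
    return names


attributes = {
    "table": {
        "columns": [
            {"class": "object", "name": "title"},
            {"class": "object", "name": "type"},
            {"class": "object", "name": "ACTOR"},
            {"class": "int64", "name": "release_year"},
        ],
        "name": "TABLE",
        "version": 1,
    }
}
body = [
    ["3%", "TV Show", "João Miguel", 2020],
    ["7:19", "Movie", "Demián Bichir", 2016],
]
print(check_table_message(attributes, body))
```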

 

Upload the operator

Now you can upload the operator by calling yo di-pyoperator again. Note that all your previous entries have been stored, so after selecting ‘Upload’ you can just hit the enter key at each prompt.

di-pyoperators$ yo di-pyoperator 
? Download or Upload operator Upload
? SAP Data Intelligence URL https://vsystem.ingress.xxxxxxx.dh-canary.shoot.live.k8s-hana.ondemand.com/
? Tenant pm-dev1
? User thorstenh
? Password [hidden]
? Operator movie.explodeNetflixTitles

 

Git Repository Integration

It is planned to integrate Git seamlessly into the Modeler in one of the next releases. In the meantime, Christian Sengstock has outlined in his well-known blog how you can integrate it using VS Code. When you develop custom operators offline, integration using the features of your IDE is a piece of cake. For example, I now have a GitHub repository for all my custom operators that I connected to my PyCharm IDE, so I can use the toolbar for update/pull, commit and push.

Integration Test

Now that you have developed the custom operator, you can use it in a pipeline that reads the data from an object store and stores it in a HANA DB table with the following structure:

CREATE COLUMN TABLE "DEMO"."NETFLIX_ACTORS"(
	"TYPE" NVARCHAR(25),
	"TITLE" NVARCHAR(150),
	"ACTOR" NVARCHAR(150),
	"RELEASE_YEAR" INTEGER,
	PRIMARY KEY (
		"TITLE",
		"ACTOR"
	)
)

And it all works perfectly fine without further adjustments.
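
Should a load into a table like this ever fail on the primary key, it can help to check the operator output offline first. The sketch below upper-cases and reorders the columns to match the CREATE statement and separates rows that repeat a (TITLE, ACTOR) combination. It rests on assumptions: the rows are made up, `prepare_for_table` is a hypothetical helper, and whether reordering is needed at all depends on how the downstream operator maps columns, which is not covered here.

```python
import pandas as pd

# Column order and key of the target table, taken from the CREATE statement
TABLE_COLUMNS = ["TYPE", "TITLE", "ACTOR", "RELEASE_YEAR"]
PRIMARY_KEY = ["TITLE", "ACTOR"]


def prepare_for_table(df):
    """Upper-case column names, reorder them, and split off key duplicates."""
    df = df.rename(columns=str.upper)[TABLE_COLUMNS]
    is_duplicate = df.duplicated(subset=PRIMARY_KEY, keep="first")
    return df.loc[~is_duplicate], df.loc[is_duplicate]


# Made-up rows for illustration only
rows = pd.DataFrame(
    [
        ["Film A", "Movie", "Actor One", 2019],
        ["Film A", "Movie", "Actor Two", 2019],
        ["Film A", "TV Show", "Actor One", 2021],  # same title and actor as the first row
    ],
    columns=["title", "type", "ACTOR", "release_year"],
)
clean, rejected = prepare_for_table(rows)
print(clean)
print(rejected)
```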

Conclusion

I hope di-pyoperator helps you as much as it has already helped me. Once you have installed the required applications it is easy to use, and the scaffolding shortens development a lot. At least 80% of my ports convey data of type message.file, message.table or message, and for these cases I no longer have to care about the transformations. In addition, because of the easy integration I develop nearly all custom operators offline and avoid the time wasted when I overestimate my ability to write code without wrong indentation, missing brackets or similar carelessness.

If you would like more information on how I developed the tool, or more detailed documentation, check out the README.md and the code published on GitHub.

 
