Skip to Content
Technical Articles

Pandas and SAP Data Intelligence

Introduction

Within minutes when I am introducing data processing with the Modeler of Data Intelligence to a data scientist I get the question of how Pandas is supported. Pandas is the “high-performance, easy-to-use data structures and data analysis” (citation) package for Python that no data scientist can ignore unless she is still an R-aficionada.

My usual answer : “You can add whatever package you like to a docker container and with the Custom Python operator you can use them as you like. And of course pandas is the tool of choice for developing more elaborate data analysing operators. Without saying, you always use Pandas when working with the Jupyter Notebook of the ML Scenario Manager.

Admittedly this is not the whole truth. You can use Pandas for more than analysis. The versatility and the performance makes it also as an ideal tool for data transformations. In particular when the  standard Data Transform operators are not providing all functionalities you need.

In the following I describe how I use pandas with SAP Data Intelligence and introduce to some helper functions I have written.

Data Transformation with Pandas

To DataFrame From CSV

The first step when using pandas is converting the data into the DateFrame format. Because in all my projects the source format was exclusively CSV, therefore I have only written an operator expecting the csv (byte-coded) output from the Read Operator port ‘outFile’.

The “Read File” connectivity operator covers a host of connection types and provides already very useful features, like polling and selecting files using regular expressions. The later is very useful because you can divide the data into smaller files where the filename encodes a meaning like date or data range, as sort of intermediate step before embarking on using more elaborate techniques like Parquet.

Example

consumption_5000kWh_2019_06_06.csv.gz
consumption_2000kWh_2019_06_24.csv.gz
consumption_3000kWh_2019_06_11.csv.gz
consumption_3000kWh_2019_06_27.csv.gz
consumption_3000kWh_2019_06_13.csv.gz
consumption_3000kWh_2019_05_30.csv.gz
consumption_5000kWh_2019_07_02.csv.gz
consumption_3000kWh_2019_06_25.csv.gz
consumption_2000kWh_2019_06_10.csv.gz
consumption_5000kWh_2019_05_27.csv.gz
consumption_5000kWh_2019_06_04.csv.gz
....

The operator also provides meta-data information using the message attributes that helps you to decide how many files will be provided.

The receiving operator should cope with these features that means basically collecting all data and adding this to a DataFrame before sending it to the output port.

I have written an operator that is using the pandas method read_csv for reading the receiving data and put some of the arguments into the parameters of api.config. If there is a need for additional ones you can download and change the code from fromCSV (GitHub).

From DataFrame to CSV

Passing the DataFrame data to the WriteFile Operator is less complex. With the use of pandas.to_csv and a couple of parameter you can easily create this operator or use the one I have developed.

Passing Data

Now that you have created the entry port – creating DataFrames –  and are able to write them back, you need a kind of working operators based on DataFrames and for this in the first place you need to agree on how you like to pass DataFrames. I decided on using the envelope pattern of messages. With this dictionary datatype I can put the metadata to the “attributes” element and add the DataFrame to the “body” without data conversion. Thanks to the Data Intelligence architects pipelines are running per default within one Docker Container provided that all operator tags are encompassed by one Docker image. In addition you have the “grouping” feature in the pipeline Modeler to force explicitly the container builder to do so. That means that the operator within one container share the same memory space and can handle when basically the memory address of the DataFrame is shared.

To ensure that only pandas-operators can be plugged together I have created a subtype of the standard data type message

message.DataFrame

This is nothing more than a name given as a data type to the input and output ports of an operator and the Modeler tests if the names match. You can detect the special data type by the colour coding of the port nodes. E.g. in the “From CSV” operator colours refer to

  • pink –  standard message
  • yellow –  message.DataFrame
  • turquoise – string

Configuration Parameters

Many of the operators designed for reuse are wrappers around pandas methods like read_csv(), to_csv(), drop(), fillna, merge(), … and therefore we have to cope with a fundamental syntax difference between Data Intelligence operators and Python function calls.

  1. Modeler discern between input data and configuration parameters while the latter only have arguments.
  2. The configuration UI of the Modeler displays all parameters while Python can use default values for arguments not passed.
  3. The configuration UI of the Modeler offers only basic data types (boolean, number, string) while the function can work with all kinds of data types like lists, dictionaries, etc.

The first difference is of no further complication due to our context of having a well defined data processing pattern sending DataFrames through a pipeline. We mostly do not have to ponder about what is data and what is a parameter and therefore set this more philosophical question  aside. Sometimes if the parameter is more complex then it is for convenient reasons designated as data but later below I describe how to cope with this.

The 2nd and the 3rd difference need some more considerations.

Configuration Parameter Definition

When you first design your pandas operators you might select the parameters from the pandas function arguments list depending on the specific requirements of your data process. But it will not take long that you need to add other arguments to the operator configuration. In particular these small arguments like low_memory=True or error_bad_lines=False, etc. can be very annoying. You might need these parameters only once or for testing purpose only but nonetheless it takes some effort to add and they clatter your configuration pane.

It took me some suffering before I tackled this issue. Now I only add the core arguments to the operator configuration and leave all the other arguments to a particular “Keyword Arguments” textfield.

With this design patter I have all the flexibility like calling the pandas function directly without any development effort. The call in the operator script would then look like :

pd.read_csv(csv_io, api.config.separator, usecols=use_cols, dtype=typemap, \
            decimal=api.config.decimal, \
            nrows=nrows, **kwargs)

For putting all arguments in a textfield, however, you need a parser and this leads us to the 3rd requirement of how to pass parameters of a more complex structure.

Textfield Parser

There are arguments that require data types beyond the primitive ones (integer, float, boolean and string). In particular the pandas functions expects often arrays, dictionaries and nested dictionaries, e.g.

  • usecols argument of pandas.read_csv: array of column names
  • mapper argument of pandas.rename: dictionary
  • to_replace argument of pandas.replace: nested dictionary

I started to write the parsing for each textfield occurence before realising that putting all in a module (part of the sdi_utils package) might be more efficient. Because I am more familiar with regex than with tokenise I used the first one without being sure if it was the right choice. But you can decide this on your own (GitHub).

The defined format of such textfields are quite simple and intuitive. There are 3 kinds of separators that can be set with each call individually

  • inner separator for separating the elements of a list (default: ‘,’)
  • outer separator for separating lists of lists (default: ‘;’
  • map separator for mapping keys to values (default ‘:’)

and you can quote elements either with single or double quotes. Brackets are ignored (exception: read_json) when parsed but could be added for a better reading experience.

Finally there are 3 modifiers at the beginning of a string possible (not case-sensitive):

  • NONE that visualises the ‘no value’ better than empty strings
  • ALL for all values should be used, e.g. all columns should be read from a csv file
  • NOT for defining an exclusion list

On default the values are checked for being a number (integer or float) or boolean. But each function has a toggle if only strings should be passed (argument test_number=True).

Currently (Dec 2019) the textfield_parser covers the following cases:

  • read_value
  • read_list
  • read_dict
  • read_dict_of_lists
  • read_dict_of_dict
  • read_json
  • read_comparisons

Examples:

Lists with Modifier

  • Not Mercedes, Renault, Citroen, Peugeaut, ‘Rolls Royce’

Dictionary:

  • ‘Mercedes’: ‘expensive’, ‘Audi’: ‘sportiv’, ‘VW’: ‘people’, ‘Citroen’: ‘cool’, ‘Rolls Rocye’: ‘royal’

Nested Dictionary

  • {‘High Class’: {‘Mercedes’: ‘expensive’, ‘BWM’: ‘sporty’}, ‘Sport Class’: {‘Porsche’: None, ‘Ferrari’: ‘special’}, ‘Middle Class’: {‘VW’: ‘people’, ‘Renault’: ‘fashionable’, ‘Citroen’: ‘classy’, ‘Peugeot’: ‘modern’}, ‘Luxury Class’: {‘Rolls Rocye’: ‘royal’, ‘Bentley’: ‘rare’}, ‘All’: {4: 8.9, 6: 90, 7: 7}}

Comparisons

  • anna > 1.70, norbert != 900, cindy <= 1.65

Example sdi_pandas

The concept and tools described in the previous chapter I am using myself in all my projects where data transformations are a task. All operators that have the potential to be reused are added to a GitHub repository that you are invited to visit and use for your projects.

Because I am using the local development pattern described in a earlier blog

Local Development of Custom Python Operators

you can download the source code directory, modify and test it locally before uploading it for your purpose.

Conclusion

Finally we have all the pieces that it needs to create quickly data transformation operators that offers the unbound flexibility of pandas. These operators can then be used by SAP Data Intelligence user who are not too familiar with scripting. Thus we got further to the end of efficiently use the scarce resources of developing data engineers who provide the ready-to-use tools for the more ‘ETL’-trained data engineers.

Please let me know if and how you use pandas with Data Intelligence and what you like to have in the future.

 

Be the first to leave a comment
You must be Logged on to comment or reply to a post.