
Not enough love: The Python User-Defined Transform



In my opinion, the Python user-defined transform (UDT) included in Data Services (Data Quality -> UserDefined) bridges several gaps in the functionality of Data Services.  This little transform gives you access to records individually and lets you perform any manipulation of those records.  This post has two aims: (1) to encourage readers to consider the Python transform the next time things get tricky, and (2) to give experienced developers an explanation of how to speed up their Python development in BODS.

Currently, if you want to apply some manipulation or transformation record by record you have two options:

  1. Write a custom function in the BODS Scripting language and apply this function as a mapping in a query.
  2. Insert a UDT and write some python code to manipulate each record.

How to choose?  Well, I would be all for keeping things within Data Services, but the built-in scripting language is rather thin on functionality and doesn’t give you direct access to records, simply because it does not run inside a data flow.  In favour of going the Python route are the ease and readability of the language, the richness of its standard functionality and the ability to import any module that you could need.  Furthermore, with Python, data can be loaded into memory in lists, tuples or hash-table-like dictionaries.  This enables cross-record comparisons, aggregations, remapping, transposes and any manipulation that you can imagine!  I hope to explain how useful this transform is in BODS and how nicely it beefs up the functionality.
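As a small taste of the cross-record work this makes possible, here is a minimal sketch (plain Python, outside of BODS) that aggregates a value per key across records using a dictionary.  The record layout and field names are invented for illustration:


# Records as a list of dictionaries, the shape you might build up
# inside a per-collection UDT (field names are made up for this sketch).
```python
records = [
    {'customer': 'A', 'amount': 10.0},
    {'customer': 'B', 'amount': 5.0},
    {'customer': 'A', 'amount': 2.5},
]

# Cross-record aggregation in one pass: total amount per customer.
# This kind of whole-dataset view is what a per-field mapping cannot do.
totals = {}
for rec in records:
    totals[rec['customer']] = totals.get(rec['customer'], 0.0) + rec['amount']
```
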

For reference, the UDT is documented in chapter 11 of http://help.sap.com/businessobject/product_guides/sbods42/en/ds_42_reference_en.pdf

The best way to learn Python is perhaps just to dive in, keeping a decent tutorial and reference close at hand.  I won’t recommend a specific tutorial; rather, Google around and find one that is at the right level for your programming ability!

Making Python development easier

When developing I like to be able to code, run, check (repeat).  Writing Python code in the Python Smart Editor of the UDT is cumbersome and ugly if you are used to a richer editor.  Though it is a good place to start learning to use Python in BODS because of the “I/O Fields” and “Python API” tabs, clicking through to the editor every time you want to test will likely drive you mad.  So how about developing and testing your validation function or data structure transform on your local machine, using your favourite editor or IDE (personally I choose Vim for Python)?  The following two tips show how to achieve this.

Tip#1: Importing Python modules

Standard Python modules installed on the server can be imported as usual using import.  This allows the developer to leverage datetime, string manipulation, file IO and various other useful built-in modules.  Developers can also write their own modules, with functions and classes as needed.  Custom modules, however, must be set up on the server, which isn’t normally accessible to Data Services designers.

The alternative is to dynamically import custom modules given their path on the server using the imp module.  Say you wrote a custom module to process some records called mymodule.py containing a function myfunction.  After placing this module on the file server at an accessible location you can access its classes and functions in the following way:


import imp

# Load a module from an explicit path on the file server,
# bypassing the server's installed site-packages
mymodule = imp.load_source('mymodule', '/path/to/mymodule.py')
mymodule.myfunction()


This enables encapsulation and code reuse.  You can either edit the file directly on the server, or re-upload it with updates, using your preferred editor.  What I find particularly useful is that as a data analyst/scientist/consultant/guy (who knows these days) I can build up an arsenal of useful classes and functions in a python module that I can reuse where needed.

Tip#2: Developing and testing from the comfort of your own environment

To do this you just need to write a module that will mimic the functionality of the BODS classes.  I have written a module “fakeBODS.py” that uses a CSV file to mimic the data that comes into a data transform (see attached).  CSV input was useful because the transforms I was building were working mostly with flat files.  The code may need slight adaptation for your use case.

Declaring instances of these classes outside of BODS allows you to compile and run your BODS Python code on your local machine.  Below is an example of a wrapping function that I have used to run “RunValidations”, a function that uses the DataManager and Collection, outside of BODS.  It uses the same flat file input and achieves the same result!  This has sped up my development time, and has allowed me to thoroughly test implementations of new requirements on a fast changing project.


def test_wrapper():
    import fakeBODS
    # Fake stand-ins for the objects BODS normally passes to the UDT
    Collection = fakeBODS.FLDataCollection('csv_dump/tmeta.csv')
    DataManager = fakeBODS.FLDataManager()
    RunValidations(DataManager, Collection, 'validationFunctions.py', 'Lookups/')

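For readers who want to try this before downloading anything, here is a minimal sketch of what such a fakeBODS module might contain.  The class and method names (FLDataCollection, FLDataManager, Size, GetRecord, GetField/SetField) follow the BODS Python API, but the implementation below is my illustrative assumption, not the attached module verbatim:

```python
import csv


class FLDataRecord:
    """Mimics a BODS data record: fields accessed via GetField/SetField."""
    def __init__(self, fields=None):
        self._fields = dict(fields or {})

    def GetField(self, name):
        return self._fields.get(name)

    def SetField(self, name, value):
        self._fields[name] = value


class FLDataCollection:
    """Mimics the BODS collection object, fed from a CSV file
    instead of an upstream dataflow."""
    def __init__(self, csv_path):
        with open(csv_path) as f:
            self._rows = list(csv.DictReader(f))

    def Size(self):
        return len(self._rows)

    def GetRecord(self, record, index):
        # BODS collections are 1-indexed
        for name, value in self._rows[index - 1].items():
            record.SetField(name, value)
        return record


class FLDataManager:
    """Stub of the BODS DataManager; extend with whatever calls your code uses."""
    def NewDataRecord(self):
        return FLDataRecord()
```

With these stand-ins in place, code written against the real BODS objects runs unchanged on your desktop.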


Limitations of UDT

There are some disappointing limitations that I have come across that you should be aware of before setting off:

  • The size of an output column (as of BODS 4.1) is limited to 255 characters.  A workaround is to write to flat files.
  • You can only access data passed as input fields to the transform.  Variables, for example, have to be mapped to an input column before the UDT if you want to use them in your code.
  • There is no built-in functionality to do lookups in tables or execute SQL through datastore connections from the transform.
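To illustrate the flat-file workaround for the 255-character limit, here is a hedged sketch of a helper that appends long field values to a tab-delimited file instead of mapping them to an output column.  The Collection/DataManager calls follow the BODS per-collection API; the function name, field names and output path are placeholders of my own:

```python
def spill_long_fields(Collection, DataManager, out_path, key_field, text_field):
    """Append key + long text from each record to a flat file,
    bypassing the 255-character limit on UDT output columns."""
    with open(out_path, 'a') as out:
        # BODS collections are 1-indexed
        for i in range(1, Collection.Size() + 1):
            record = DataManager.NewDataRecord()
            Collection.GetRecord(record, i)
            out.write('%s\t%s\n' % (record.GetField(key_field),
                                    record.GetField(text_field)))
```

The resulting file can then be picked up by a normal flat-file source in a downstream dataflow.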

How a powerful coding language complements a rich ETL tool

Python code is so quick and powerful that I am starting to draw all my solutions out of Data Services into custom python modules.  It is faster, clearer for me to understand, and more adaptable.  However, this is something to be careful of.  SAP BODS is a great ETL tool, and is a brilliant cockpit from which to direct your data flows because of its high-level features such as authorizations, database connections and graphical job and workflow building.  The combination of the two, in my opinion, makes for an ideal ETL tool.

This is possibly best demonstrated by example.  On a recent project (my first, really), with the help of Python transforms and modules that I wrote, I was able to solve the following:

  • Dynamic table creation and loading
  • Executable metadata (functions contained in Excel spreadsheets)
  • Complicated data quality analysis and reporting (made easy)
  • Reliable unicode character and formatting export from Excel

Data Services 4.1, on the other hand, was indispensable in solving the following requirements:

  • Multi-user support with protected data (aliases for schemas)
  • Maintainable centralized processes in a central object library with limited access for certain users
  • A framework for users to build their own Jobs using centralized processes.

The two complemented each other brilliantly to reach a solid solution.

Going forward

With the rise of large amounts of unstructured data and the non-trivial data manipulations that come with it, I believe that every Data analyst/scientist should have a go-to language in their back pocket.  As a trained physicist with a background in C/C++ (ROOT) I found Python incredibly easy to master and put it forward as one to consider first.

I do not know what the plan is for this transform going forward into the Data Services Eclipse workbench, but hopefully the merits of allowing a rich language to interact with your data inside of BODS are obvious enough to keep it around.  I plan to research this a bit more and follow up this post with another article.

about me…

This is my first post on SCN.  I am new to SAP and have a fresh perspective of the products and look forward to contributing on this topic if there is interest.  When I get the chance I plan to blog about the use of Vim for a data analyst and the manipulation of data structures using Python.


17 Comments


  1. Ashok Vemulapalli

    I kind of like the idea of using a programming language in BODS to extend the ability of the tool, but I am a little skeptical about using Python in places where the tool already provides the functionality.

    As I said, I like the idea, and thanks for a good post.

    1. Jake Bouma Post author

      Agreed — scepticism will save you from doing unnecessary backflips in python when in fact most problems can be solved very quickly with Data Services.  It does lack some freedom needed for tricky situations though, where Python comes into play.  Particularly, with growing amounts of diverse unstructured data increasing the demands of the “T” in ETL, I feel that BODS will need the power of a programming language to keep up.

  2. Akash Banerjee

    Hi Jake,

    This is a really good share.  I have worked with this transform, using Python code in the Python Smart Editor of the UDT to cleanse some data in a previous project of mine.  I used the per record option, but I was not confident using the per collection option.  It would be nice if you could share an example of how to use the per collection option.

    Regards

    Akash

    1. Arne Weitzel

      Nice post! I have been using the per collection option. It’s actually quite easy: you have to set up a break group. But for legal reasons, I cannot provide the code here.

  3. Michael Jess

    Nice post! I’ve been looking for ways to reuse python code in our project for quite some time. Is there a way to include custom python libs in the repository export ATL? Putting them in some accessible place does not appear very customer-friendly to me.

  4. Joshua Blythe

    I usually suggest against writing python code inside Data Services, and I have a LOT of python code getting called by DS.  If it resides outside of DS, you don’t have to migrate it.  User-defined transforms can’t be pushed down and create a break in your dataflow, causing your data to be paged.  The collection sortation does not allow you to alternate ascending/descending break keys, and the preservation of source sortation may or may not work (I don’t trust it).

    That being said, I found that if you happen to use complex data structures inside of a user-defined transform (think complex python dictionary), you can easily pickle the structure out to disk.  Then read that structure in on your desktop, complete your processing on said dictionary, then easily migrate your code over to DS.  This makes debugging the actual logic much simpler than trying to do it inside DS.
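    A minimal sketch of that pickle round-trip (the path and the dictionary contents here are illustrative, not from any real job):

```python
import os
import pickle
import tempfile

# Inside the UDT: dump the intermediate structure to disk so the logic
# can be developed against real data later (contents are made up here).
state = {'break_group_1': ['rec1', 'rec2'], 'break_group_2': ['rec3']}
path = os.path.join(tempfile.gettempdir(), 'udt_state.pkl')
with open(path, 'wb') as f:
    pickle.dump(state, f)

# On your desktop: load it back, debug your processing on the real
# structure, then migrate the finished code into the UDT.
with open(path, 'rb') as f:
    restored = pickle.load(f)
```
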

    And don’t forget you can use the locals() dictionary to store data between collections.

    1. Jake Bouma Post author

      Nice strategy using a pickled dict!

      Risks and rewards of using Python.  Of course if you are after dataflow performance then you need to stay within BODS.  But if you need to be agile, or need to enrich using python libraries, it is worth the cost.  Consider the BODS Twitter blueprints — fully python.

      I see even HANA’s text analysis is python powered.  Why wouldn’t it be?  String processing in python is very comfortable.

      So how do we overcome the performance limitations?  I feel strongly that this would be a valuable thing for SAP to pursue.  Could the BODS python modules be extended to give more access to the dataflow?

  5. Jake Bouma Post author

    Anyone using python Pandas?  I have been so impressed with this new data analysis framework.  Together with ipython this solves all my ad hoc data analysis requirements, and makes reporting on it pretty fun.

    Now I just use BODS to move data around, and run Pandas straight on the database.

  6. Joshua Blythe

    To make debugging your python code a little easier, you can redirect stdout to a file and use simple print statements.

    import sys

    # Save the real stdout, then redirect print output to a log file
    stdoutSave = sys.stdout
    sys.stdout = open(r'<folder path>\logfile.txt', 'a')

    <UDT CODE GOES HERE>

    # Close the log and restore the original stdout
    sys.stdout.close()
    sys.stdout = stdoutSave

  7. Jose Palacios

    Hello,

    thank you for the post, very interesting!  Python is a powerful, flexible, open-source language that is easy to learn, easy to use, and has powerful libraries for data manipulation and analysis, so it is a logical option for BODS.

  8. Jake Bouma Post author

    Busy trying to get PyRFC module working in a docker container to query SAP systems.  The battle to make SAP better friends with Python continues.  😏

  9. Swetha N

    hi all,

    is there any tutorial available for python coding syntax inside BODS?

    also, how do you evaluate whether to use a user-defined transform or a custom function for a given scenario?

  10. Venkata Ramana

    Hi Jake,

    Your blog is very informative.

    I am doing twitter analysis following the user-defined transform best practices, but I am facing the problem of the tweet length, including hashtags etc., crossing the output field length of 255.  As you mentioned, an output field in the UDT cannot take more than 255 characters, and the workaround is to write to a flat file.

    Please suggest how to write to a flat file from the collection.

    Thanks & Regards,
    Ramana.

  11. Steven Dauw

    Does anybody know how to add python libs to Data Services?

    In a normal env. this would be pip install…  But with Data Services it is less clear 🙁 Python is a powerful add-on to the tool.

    But without the possibility to add the vast range of available Python libs it loses a lot of power.

    Side comment: the 4.2 version still sticks with Python 2.7. That seems odd, to say the least.

    Regards,

    Steven Dauw

