Not enough love: The Python User-Defined Transform
In my opinion, the python user-defined transform (UDT) included in Data Services (Data Quality -> UserDefined) bridges several gaps in the functionality of Data Services. This little transform allows you to access records individually and perform any manipulation of those records. This post has two aims: (1) to encourage readers to consider the Python transform the next time things get tricky and (2) to give experienced developers an explanation on how to speed up their Python development in BODS.
Currently, if you want to apply some manipulation or transformation record by record you have two options:
- Write a custom function in the BODS Scripting language and apply this function as a mapping in a query.
- Insert a UDT and write some python code to manipulate each record.
How to choose? Well, I would be all for keeping things within Data Services, but the built-in scripting language is a bit dry of functionality and doesn’t give you direct access to records simply because it is not in a data flow. In favour of going the python route are the ease and readability of the language, the richness of standard functionality and the ability to import any module that you could need. Furthermore with Python data can be loaded into memory in lists, tuples or hash-table like dictionaries. This enables cross-record comparisons, aggregations, remapping, transposes and any manipulation that you can imagine! I hope to explain how useful this transform is in BODS and how nicely it beefs up the functionality.
For reference, the UDT is documented chapter 11 of http://help.sap.com/businessobject/product_guides/sbods42/en/ds_42_reference_en.pdf
The best way to learn python is perhaps just to dive in, keeping a decent tutorial and reference close at hand. I won’t recommend a specific tutorial; rather google and find one that is on the correct level for your programming ability!
Making Python development easier
When developing I like to be able to code, run, check (repeat). Writing Python code in the Python Smart Editor of the UDT is cumbersome and ugly if you are used to a richer editor. Though it is a good place to start with learning to use the Python in BODS because of the “I/O Fields” and “Python API” tabs, clicking through to the editor every time you want to test will likely drive you mad. So how about developing and testing your validation function or data structure transform on your local machine, using your favourite editor or IDE (personally I choose Vim for Python)? The following two tips show how to achieve this.
Tip#1: Importing Python modules
Standard Python modules installed on the server can be imported as per usual using import. This allows the developer to leverage datetime, string manipulation, file IO and various other useful built-in modules. Developers can also write their own modules, with functions and classes as needed. Custom modules must be set up on the server, which isn’t normally accessible to Data Services Designers.
The alternative is to dynamically import custom modules given their path on the server using the imp module. Say you wrote a custom module to process some records called mymodule.py containing a function myfunction. After placing this module on the file server at an accessible location you can access its classes and functions in the following way
import imp mymodule = imp.load_source('mymodule', '/path/to/mymodule.py') mymodule.myfunction()
This enables encapsulation and code reuse. You can either edit the file directly on the server, or re-upload it with updates, using your preferred editor. What I find particularly useful is that as a data analyst/scientist/consultant/guy (who knows these days) I can build up an arsenal of useful classes and functions in a python module that I can reuse where needed.
Tip#2: Developing and testing from the comfort of your own environment
To do this you just need to write a module that will mimic the functionality of the BODS classes. I have written a module “fakeBODS.py” that uses a csv file to mimic the data that comes into a data transform (see attached). Csv input was useful because the transforms I was building were working mostly with flat files. The code may need to be adapted slightly as needed.
Declaring instances of these classes outside of BODS allows you to compile and run your BODS Python code on your local machine. Below is an example of a wrapping function that I have used to run “RunValidations”, a function that uses the DataManager and Collection, outside of BODS. It uses the same flat file input and achieves the same result! This has sped up my development time, and has allowed me to thoroughly test implementations of new requirements on a fast changing project.
def test_wrapper(): import fakeBODS Collection = fakeBODS.FLDataCollection('csv_dump/tmeta.csv') DataManager = fakeBODS.FLDataManager() RunValidations(DataManager, Collection, 'validationFunctions.py', 'Lookups/')
Limitations of UDT
There are some disappointing limitations that I have come across that you should be aware of before setting off:
- The size of an output column (as of BODS 4.1) is limited to 255 characters. Workaround can be done using flat files.
- You can only access data passed as input fields to the transform. Variables for example have to be mapped to an input column before the UDT if you want to use them in your code.
- There is no built-in functionality to do lookups in tables or execute sql through datastore connections from the transform.
How a powerful coding language complements a rich ETL tool
Python code is so quick and powerful that I am starting to draw all my solutions out of Data Services into custom python modules. It is faster, clearer for me to understand, and more adaptable. However, this is something to be careful of. SAP BODS is a great ETL tool, and is a brilliant cockpit from which to direct your data flows because of its high-level features such as authorizations, database connections and graphical job and workflow building. The combination of the two, in my opinion, makes for an ideal ETL tool.
This is possibly best demonstrated by example. On a recent project (my first really) with the help of Python transforms and modules that I wrote I was able to solve the following:
- Dynamic table creation and loading
- Executeable metadata (functions contained in excel spreadsheets)
- Complicated data quality analysis and reporting (made easy)
- Reliable unicode character and formatting export from excel
Data Services 4.1 on the other hand was indispensable in solving the following requirements
- Multi-user support with protected data (aliases for schemas)
- Maintainable centralized processes in a central object library with limited access for certain users
- A framework for users to build their own Jobs using centralized processes.
The two complemented each other brilliantly to reach a solid solution.
With the rise of large amounts of unstructured data and the non-trivial data manipulations that come with it, I believe that every Data analyst/scientist should have a go-to language in their back pocket. As a trained physicist with a background in C/C++ (ROOT) I found Python incredibly easy to master and put it forward as one to consider first.
I do not know what the plan is for this transform going forward into the Data Services Eclipse workbench, but hopefully the merits of allowing a rich language to interact with your data inside of BODS are obvious enough to keep it around. I plan to research this a bit more and follow up this post with another article.
This is my first post on SCN. I am new to SAP and have a fresh perspective of the products and look forward to contributing on this topic if there is interest. When I get the chance I plan to blog about the use of Vim for a data analyst and the manipulation of data structures using Python.