How to make SAP Data Services better
I have been married to SAP Data Services (BODS) since 1999. While employed by Acta and later BusinessObjects and SAP, I used BODS as a consultant, delivered trainings, created training material and certifications, worked as a developer on it, was one of its product managers and had a bit of an architectural role as well. Some features I am proud of, some got implemented despite my arguments against them, and for one or two you might want to curse me.
Meanwhile the world has moved on and it would be a good time to carry BODS into the new world order as well.
Installation
My first change would be the installation of BODS. The biggest mistake we made was the BusinessObjects integration. The argument back then was that the same people using BO are also using BODS, but I never bought into that. In BO you maintain the permissions for thousands of report users; in BODS there are just two or three users with different types of permissions. So what is the value of the integration? What are the commonalities?
Anyway, by now that question should be settled – the BO Enterprise Server is no longer the center of all future development at SAP, is it? Furthermore, the integration was never finished; for example, central repo users are still not maintained in BOE but in the BODS admin web pages.
Removing BOE is pretty simple. The Designer application connects to the repository database using the database credentials, just as it did before the BOE integration.
Two features only the BOE integration provides are read-only repos and a quick repo switch. Both would need to work via a direct database login as well: one database login that can use different repos, with full permissions on some and read permissions on others. Such functionality exists – it is called database security. The user logs in with his personal database login and is presented with a list of all database schemas he can see that contain an AL_LANG table. These are obviously all the repositories this user has access to. If he has read and write permissions on that table, he can use the repo to design new objects; with select permissions only, it is a read-only repo.
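To illustrate the idea, here is a minimal sketch of what such a repository picker could do with a direct database login. It assumes a SQL Server repository database, the pyodbc driver and a hypothetical DSN name; other databases would need their own catalog queries.

```python
# Minimal sketch: list all repositories the current database login can see and
# classify each as read-write or read-only. SQL Server catalog views are
# assumed; the DSN name is hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=REPO_DB_SERVER;Trusted_Connection=yes")
cur = conn.cursor()

# sys.tables only returns objects the login is allowed to see, so every
# schema found here is a repository this user has at least read access to.
cur.execute("""
    SELECT s.name AS repo_schema,
           HAS_PERMS_BY_NAME(s.name + '.AL_LANG', 'OBJECT', 'UPDATE') AS writable
    FROM sys.tables t
    JOIN sys.schemas s ON s.schema_id = t.schema_id
    WHERE t.name = 'AL_LANG'
""")

for repo_schema, writable in cur.fetchall():
    print(repo_schema, "read-write" if writable else "read-only")
```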
Jobserver installed as container image
The job server and the web server I would wrap into a downloadable container image. Installing a new job server would then be as simple as executing a “docker run” command. Upgrades, job server groups, scaling, multiple job servers on the same host computer... none of that would be an issue any longer.
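Purely as a sketch – the image name, port and environment variables are invented, since no such official image exists – starting a job server could then look like this, shown here via the Docker SDK for Python rather than the plain command line:

```python
# Sketch: start a containerized job server, the scripted equivalent of a
# single "docker run" command. Image name, port and variables are made up.
import docker

client = docker.from_env()
client.containers.run(
    "sap/ds-jobserver:latest",                 # hypothetical image
    name="ds_jobserver_1",
    detach=True,
    ports={"3500/tcp": 3500},                  # assumed job server port
    environment={"DS_REPO": "repodb:1433/DS_LOCAL_REPO"},  # hypothetical
    restart_policy={"Name": "always"},
)
```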
With these two relatively simple changes to the product, the installation would be done in minutes instead of hours, would essentially never fail, and upgrades would be easy as well.
Designer UI
While the al_engine is state of the art, the Designer application looked old-fashioned even 15 years ago, not to mention how it feels today. In addition, the Designer has the following issues:
- New transforms require code changes in lots of places; there is no good object hierarchy in the code. That is the reason why no new transforms were added in the last 10 years – too expensive.
- In some areas the Designer follows the pattern: if the backend has 100 switches, the UI screen shows 100 switches. That is not user-friendly.
- Some settings the user has to enter manually although they could be read from the connected systems – the database codepage, for example.
- A web-based UI would be nice if it were as powerful as the Windows UI. If not, then better to stick with a native OS application.
The Data Services Workbench was the first attempt to revamp the underlying technology; its aim was to create a 1:1 port of the Designer in a modern Eclipse IDE. The important parts were implemented quickly, but the rest never made it into the product. Management rightly questioned the value of replacing a C++ UI with an Eclipse UI when it is the same thing. Hence the project was stopped, and now we have both Designer and Workbench: Designer is superior in almost everything, while the Workbench got some goodies in the area of Data Quality transforms. Which one to use?
Together with another developer I started a proof of concept showing how a new UI could look. We took the good parts of the Designer, got rid of the annoying parts, and all customers we showed the result to were thrilled. The already efficient workflow improved by at least a factor of two with this UI. The highlights were:
- Data-driven configuration of transforms as a secondary option. Some things can be done much more efficiently via data-driven UIs than via formulas. For example, joining two tables along their foreign key relationships by simply showing the ER diagram is easier to understand than a where/join clause spanning five lines.
- But data-driven UIs must be just a different view of the same thing. You can still see the formulas if you like, and if you change the formula, the data-driven UI changes as well – until the point is reached where the data-driven representation no longer makes sense; then that tab is read-only or disabled. For example, a theta join “where a.emp=b.emp and a.date between b.from and b.to” can be visualized as a line between two tables, but there is not much value in it. You cannot drag the line from one column to another and expect the formula to still make sense.
- Automapping is available in the Designer as well, but only if the columns have the same name. Yet the situation where you have the output of a transform and an existing target table and need to align the two via a Query transform is a task everybody has to do thousands of times. We implemented an automapping that calculated the probability of a mapping based on column name (source FROM_DATE to target T_FROM_DATE would get a high rank), data type (varchar(100) -> varchar(100) gets a high ranking, varchar(10) -> varchar(100) less, and int -> date a very low one), the existing mapping formula and other available information. Then you could set a threshold; everything above it was pre-checked as “map those”, and the user could add other columns as well (see the sketch below).
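A minimal sketch of that scoring idea – the weights, threshold and column names are purely illustrative, not the ones we actually used:

```python
# Sketch of the automapping idea: score every candidate source column for a
# target column by name similarity and data type compatibility, then propose
# all pairs above a threshold. Weights and thresholds are illustrative only.
from difflib import SequenceMatcher

def name_score(src: str, tgt: str) -> float:
    # FROM_DATE vs. T_FROM_DATE ranks high, unrelated names rank low.
    return SequenceMatcher(None, src.upper(), tgt.upper()).ratio()

def type_score(src: str, tgt: str) -> float:
    if src == tgt:                                  # varchar(100) -> varchar(100)
        return 1.0
    if src.split("(")[0] == tgt.split("(")[0]:      # varchar(10) -> varchar(100)
        return 0.7
    return 0.1                                      # int -> date and similar

def propose_mappings(source_cols, target_cols, threshold=0.6):
    proposals = []
    for tgt_name, tgt_type in target_cols:
        score, best_src = max(
            (name_score(s, tgt_name) * 0.7 + type_score(t, tgt_type) * 0.3, s)
            for s, t in source_cols
        )
        if score >= threshold:
            proposals.append((best_src, tgt_name, round(score, 2)))
    return proposals

print(propose_mappings(
    [("FROM_DATE", "date"), ("AMOUNT", "decimal(10,2)")],
    [("T_FROM_DATE", "date"), ("T_AMOUNT", "decimal(10,2)"), ("LOAD_TS", "datetime")],
))
```

In the actual prototype, the existing mapping formula and other repository metadata fed into the same score; the principle stays the same.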
But that was the time when Cloud started, and the project was killed. “Only web frontends are good frontends” was the mantra – nobody can be bothered to download and install an application. Granted, a web frontend would be better if user efficiency were equal, its development equally expensive and many users were working with it. But none of these statements were true, and frankly, even today you install IDEs and use web frontends only occasionally.
The biggest mistake in that area, however, was to question the user requirements. The ETL engine has so many features and options not for the fun of it, but because they are needed. Not every customer uses all transforms, we can agree on that. But drawing the conclusion that, because all customers use the Query transform, the Query transform is the only transform needed is … brave.
You end up with a data integration tool that lets you build 80% of the dataflows the user needs. Such a tool is worthless. The customer cannot say: well, at least we loaded the order header table; the line items we cannot load. Too bad, but 80% of the tables is enough, isn’t it?
No, the philosophy must be the one Data Services is built on: normal things are easy, complex things are possible, and where the tool has no out-of-the-box solution, workarounds exist.
This mistake of 80%-of-the-tables-are-enough was repeated over and over: Data Services Workbench. Cloud Platform Integration for Data Services (CPI-DS). Agile Data Preparation (ADP).
What do Data Services users use today? The same old Designer.
Engine
Once the Designer issue is fixed, it would be time to look into missing transforms and functions, contradictions within the product (e.g. a dataflow with a global variable can be valid and invalid at the same time) and little enhancements to some transforms (the History Preserving transform comes to mind).
A larger topic would be Change Data Capture. Instead of using database-specific objects to read changes, or not providing a CDC solution at all, CDC should be standardized. In other words, building delta dataflows should be easy. In fact, most of the time there should be a single dataflow only, and the optimizer knows what to do for the initial load and for the delta load.
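Purely as a conceptual sketch – table, column and function names are invented – a standardized delta pattern boils down to something like this, with the engine rather than the developer choosing the branch:

```python
# Conceptual sketch of a single standardized dataflow that covers both the
# initial load and the delta load. All names are illustrative; in the real
# product the optimizer would make this decision, not hand-written code.
def load_orders(read_rows, write_rows, last_run=None):
    if last_run is None:
        # Initial load: read everything, truncate-and-load semantics.
        rows = read_rows("SELECT * FROM ORDERS")
        write_rows(rows, mode="truncate_insert")
    else:
        # Delta load: only rows changed since the last successful run,
        # merged into the target on the primary key.
        rows = read_rows("SELECT * FROM ORDERS WHERE CHANGE_TS > ?", (last_run,))
        write_rows(rows, mode="merge", key=["ORDER_ID"])
```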
But more about those things later, in case I get lively feedback in the comments on this post.
A web-based UI would definitely make many customers happy. That said, the future roadmap for Data Services from SAP doesn't look promising in terms of new features.
Hi Werner,
it’s always a great pleasure to read your blog posts.
I totally agree that there is still a lot of room for improvement / enhancements / new features for SAP Data Services. Your blog post is a perfect summary of some bigger ones. And in addition I’m sure that there are still a lot of new requirements from the existing Data Services users (even though the SAP Continuous Influence Session / the former SAP Idea Place is not available anymore).
But I’m afraid that the bigger investments will be made somewhere else. That is why it is all the more important that you continue to contribute your ideas.
Best regards
Marcus
I couldn't agree more with the above. I always had the impression that when SAP bought Business Objects, they didn't realise they also acquired a state-of-the-art ETL tool. Despite SAP largely neglecting the application for 12(?) years, it still has a place and with some updates could still be a leader in the market.
The client installation is a bit of a nightmare though - compounded by compatibility issues between server version and client version.
Please give us something Eclipse based, or indeed Web based.
A Snowflake adapter wouldn't go amiss either, as well as connectors to other Cloud native sources and targets.
Thanks for this post Werner, as always spot on and a pleasure to read.
Jan.
All great ideas!
Maybe undo. Lol.
Flipping this around slightly, there would be huge value, and a big competitive advantage, in real time dashboards showing current state, and the ability to drill down on individual jobs, across time windows, views by business unit for data consumers or analytics admins.
Or how about views that help prioritize maintenance activities by showing recurring problems organized by number of failures or time to repair or impact to downstream processes.
Or smart reports that can identify unexpected changes to source data that aren't part of the known data dictionary (simple to implement) or, even better, let's predict downstream failures during data extraction and pause the process instead of sending the bad data forward, resulting in costly rollback and reprocessing.
All EIM monitoring is like watching the matrix from the outside. Just a bunch of green squiggles.
The data for real-time, robust, adaptive analytics is already sitting in the repos. Show me what this enterprise-level product is doing across my whole enterprise- right now.
What I had planned with regard to monitoring is to take the output of “show optimized dataflow” and draw a dataflow that looks like the original one but is based on the engine optimizer's output.
Similar to the way I showed what DoP does: you designed a dataflow with 4 nodes, and “display optimized dataflow” redesigned it into 12 nodes. https://wiki.scn.sap.com/wiki/display/EIM/Degree+of+Parallelism
The rendering engine for that does exist already in the WebAdmin's Autodoc feature.
Next we could overlay each node with the number of rows it has processed so far and each line with the number of rows currently in the queue. This monitor log data can be shown live while the monitor log file is being written, or you could move through the execution time with a selector and show the state at any given second. This information was previously not available, so I had asked for the monitor log file to be enhanced with it. That part even made it into the product.
Then you can see at a glance where the bottleneck is: if all input buffers up to one transform are full and its output buffer is empty, that transform is obviously the bottleneck.
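As a rough illustration of that rule – transform names and fill levels are made up:

```python
# Rough illustration of the bottleneck rule: flag the transform whose input
# buffer is (nearly) full while its output buffer is (nearly) empty.
def find_bottleneck(pipeline):
    # pipeline: list of (transform_name, input_fill, output_fill), fill in 0..1
    for name, input_fill, output_fill in pipeline:
        if input_fill > 0.9 and output_fill < 0.1:
            return name
    return None

snapshot = [
    ("Reader ORDERS",   0.0,  0.95),
    ("Query join",      0.95, 0.92),
    ("Address cleanse", 0.98, 0.05),   # consumes slower than it is fed
    ("Loader DWH",      0.05, 0.0),
]
print(find_bottleneck(snapshot))       # -> Address cleanse
```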
The recurring-problems idea was themed “Operational statistics and dashboards”. It got started but is far from extensive or complete.