Machine Learning Prototype – Automating ETL
Here’s a write up of a prototype we demonstrated at the Intel booth at SAP TechEd 2018. This was performed using Intel Skylake CPUs with Machine Learning (ML) to automate ETL (Extract Transform and Load) mappings. There’s a video of the demo at the bottom.
As often is the case this was a team effort, you can see those who built this prototype on the slides, both the SAP team and the Intel team. The main contributors who I would like to recognise would be Nidhi Sawhney, Stojan Maleschlijski and John Santerre.
The following components were used for the prototype
- Python for implementing a Google TensorFlow model created with Keras
- SAP HANA Smart Data Integration (SDI) for acquiring new data
- SAP HANA External Machine Learning (EML) for inference (scoring) new data
- SAP Analytics Cloud for model evaluation.
The scenario we are address is the following.
A company receives data from their customers in a variety of different formats. A single data file received could contain a variety of formats and a variety of data fields, even different types of transactions. This prevents them from using a traditional ETL approach. These files and formats do change, but the type of data remains consistent.
They have already built rules manually, which are then used for ETL mappings.
The rules are held at a row (record) level.
For each record they would know.
- What type of data it was.
- What is the field separator
- Field to Column mapping
We then wanted to see if we could teach a machine to perform the same task and to what level of accuracy could we achieve.
If you look closely at the same of data below, you should be able to see that within one file we have numerous problems that would need to be handled within an ETL process. None of these are insurmountable, but significant logic and rules are required.
- Different field separators (“,” and “|”)
- Different field orders, look at the currency (GBP, EUR)
- Different number of fields per record
- Different date formats
- Different type of data (transactions)
We then wanted to teach a machine to learn how to process this data. We would train a deep learning models with previously labelled data. These models will learn to understand patterns from strings of transactions.
First we identify the type of transaction, by teaching it from previous transactions.
Secondly we perform the field to column identification.
To train the ML model we used both a horizontal scan and vertical scan of the data, this allows the model to “see” the data, so that it recognises different types of data, much in the same way that we do. We also used Intel CPUs and not GPUs and with the right configuration (correctly compiled MKL tensorflow python wheel and environment variables shown below)
## Linux OS Variables MKL_THREADING_LAYER=GNU KMP_BLOCKTIME=1
The Architecture is shown below, the model training and scoring is performed outside of SAP HANA, using similar Intel Skylake hardware.
The specification of the Intel hardware that we used is below.
One of the challenges with ML is having specialised hardware with GPUs, however we proved that these are not always required.
I made a short video that captures the main pieces of the Machine Learning Prototype.
- Model Training in Python
- Model Evaluation in SAP Analytics Cloud (SAC)
- Data Acquisition with SAP HANA Smart Data Integration (SDI)
- Scoring Data with SAP HANA External Machine Learning (EML)
It begins with the training of the ML model using Python.
We then evaluate the model using SAC, here we can see instances where there is very data the model does note perform well as it requires more data. The bars represent the record count for each of the fields, and the line is the model accuracy (F1 Score).
Finally we use the SAP HANA WebIDE to acquire new data with SDI and then to score this data using the External Machine Learning. The model captured in the video is classifying the transaction (trade) type.