A typical data science story starts with a business use case in which the experts want to improve their processes, whether the goal is to increase profit, lower cost, enable early detection, improve system usability or raise customer satisfaction. Whatever the reason, the first step is to collect some data.
In real-world scenarios, datasets can’t simply be downloaded from Kaggle or Data.gov; they are collected across different silos accessible to the company. This data is usually stored in various systems such as SAP HANA, SAP Vora and ERP systems, and on various file systems such as HDFS, S3 and WASB, and it may appear in different formats such as Parquet, ORC, Avro and CSV. Companies have different opinions on whether to keep their data on-premise, in the cloud or both. I believe the challenge for a data scientist is to bring all of this data together and work across it seamlessly.
Another challenge is that real data is not always clean and ready to be processed. According to Forbes, data scientists spend 60% of their time on cleaning and organizing data. The same research reports that 76% of data scientists find data preparation the least enjoyable part of their job. There are always missing values and anomalies to handle, and normalization, sampling, data standardization, de-duplication and enrichment to perform, all of which consume the data scientist’s time. After the data is cleansed, the user needs to develop a mathematical model that represents the data, cluster the data or find rules in it, and in general come up with a system that can predict future behavior and perhaps make recommendations.
Of course, if the model is good enough, we should think about reusing it on new, unseen datasets and make sure that it serves its purpose, whether that is prediction, early detection or something else. It would be great if the platform could monitor the model’s quality on our behalf and send an alert when its accuracy drops below a certain threshold. In general, operationalization plays a big role in machine-learning-based development and maintenance; it includes, but is not limited to, topics such as access control policy, security, orchestration, scheduling and governance.
Data Science process flow
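The accuracy alert described above can be illustrated with a few lines of plain Python. This is only a minimal sketch of the idea and does not use any SAP Data Hub API; the function names `accuracy` and `needs_alert` and the 0.9 threshold are assumptions made for this example.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty batch")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def needs_alert(predictions: Sequence[int], labels: Sequence[int],
                threshold: float = 0.9) -> bool:
    """Return True when accuracy on the batch falls below the threshold."""
    return accuracy(predictions, labels) < threshold


# Example batch: 3 of 4 predictions are correct, so accuracy is 0.75.
if needs_alert([1, 0, 1, 1], [1, 0, 0, 1], threshold=0.9):
    print("ALERT: model accuracy dropped below 0.9")
```

In practice such a check would run on every newly labeled batch, and the print statement would be replaced by whatever notification channel the platform offers.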
The purpose of this blog is to break down the data-science-related tasks and to demonstrate how SAP Data Hub helps with each of them and which features it provides to facilitate them.
Simultaneous Access to Multiple Data Sources:
SAP Data Hub provides data scientists with a flexible tool to ingest, prepare and process data. For instance, within SAP Data Hub you can use visual artifacts in Vora Tools to read from different systems. Vora is the distributed execution engine for SAP Data Hub; it has specialized storage and processing engines for the most common “Big Data” types, such as Time Series, JSON Collections, Graph, Disk-stored, Streaming tables and In-Memory Relational. You can also use its partitioning capabilities to distribute your data across your cluster.
Create different table types and views in Vora Modeler
Use the Vora Modeler to read from supported file systems and file types such as Avro, Parquet, CSV and ORC
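To give a feeling for what partitioning data across a cluster means, here is a generic hash-partitioning sketch in plain Python. It is not Vora's implementation or syntax; the helper names `partition_for` and `partition_rows` and the sample rows are hypothetical and only illustrate the general technique of assigning each row to a node by hashing a key.

```python
import hashlib
from collections import defaultdict


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions


def partition_rows(rows, key_field, num_partitions):
    """Group rows (dicts) by the partition their key hashes to."""
    partitions = defaultdict(list)
    for row in rows:
        index = partition_for(str(row[key_field]), num_partitions)
        partitions[index].append(row)
    return dict(partitions)


# Hypothetical sample data: sensor readings partitioned by sensor id.
rows = [{"sensor": "s1", "value": 3}, {"sensor": "s2", "value": 5},
        {"sensor": "s1", "value": 4}]
for index, part in sorted(partition_rows(rows, "sensor", 4).items()):
    print(index, part)
```

Because the hash is stable, all rows with the same key always land in the same partition, which is what makes key-based lookups and joins local to one node.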
The Pipeline Modeler is another useful tool in the SAP Data Hub portfolio. It provides operators that connect natively to different sources such as Vora and BW. You can also connect to Smart Data Integration (SDI) and Data Services (DS) to read from and write to multiple services. Please refer to the DS Product Availability Matrix (slides 20-31) and the SDI Product Availability Matrix (slides 11-18) for the list of these services and the compatible versions.
Some predefined operators for data ingestion
Data Discovery, Preparation, and Enrichment
Once you have all your data in one place, the preprocessing phase begins, in which you clean, refine, enrich, sample or normalize your data. If you need to search through your datasets, regardless of the source, you can use the Metadata Management feature in the SAP Data Hub Cockpit.
For interactive data cleansing, use SAP Agile Data Preparation (ADP). ADP provides functions such as “Reading HANA Calculation views”, “Add formula to columns”, “Aggregate Data”, “Replace null values” and “Add Columns”. For more information on ADP please refer here.
SAP ADP: Interactive Data Manipulation and Enrichment
If your data resides in Vora, use the Vora Modeler for data enrichment and discovery. For more information on the Vora Modeler, check here (the SAP Vora Tools and Data Modeling with SAP Vora chapters).
If you are one of those data scientists who prefer to write a script to clean the data, you can either connect to your Zeppelin notebook or add a JavaScript, Python or R operator to your pipeline and write your own code. To learn how, follow this blog.
Use the Pipeline engine to code the data preprocessing
Connect to your Zeppelin Notebook to play with your data
Using pipeline to preprocess your images
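As an illustration of the kind of cleansing script you might write yourself, the following sketch performs three of the steps mentioned earlier: de-duplication, filling missing values and normalization. It uses only the Python standard library and is independent of any SAP Data Hub operator API; the function name `clean` and the `(id, value)` record layout are assumptions for this example.

```python
from statistics import mean


def clean(rows):
    """Clean a list of (id, value) records.

    Steps: drop exact duplicates, fill missing values with the mean of
    the observed values, then min-max normalize values to [0, 1].
    """
    # 1. De-duplicate while preserving the original order.
    unique = list(dict.fromkeys(rows))

    # 2. Impute missing values (None) with the mean of the observed ones.
    observed = [v for _, v in unique if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    fill = mean(observed)
    filled = [(i, fill if v is None else v) for i, v in unique]

    # 3. Min-max normalization; a constant column maps to 0.0.
    lo = min(v for _, v in filled)
    hi = max(v for _, v in filled)
    span = hi - lo
    return [(i, (v - lo) / span if span else 0.0) for i, v in filled]


raw = [("a", 10.0), ("b", None), ("a", 10.0), ("c", 30.0)]
print(clean(raw))  # [('a', 0.0), ('b', 0.5), ('c', 1.0)]
```

Mean imputation and min-max scaling are only one possible choice; depending on the data, a median fill or z-score standardization may be more appropriate.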
Data visualization can improve your understanding of the problem and the effectiveness of your solution. It can help with comparing groups, displaying and discovering connections, finding anomalies and trends, highlighting some data against the rest, and much more.
Visualize your data in Data Discovery, using the Data Hub Cockpit to access metadata, profile or preview the data, or run simple filtering. Use your Jupyter or Zeppelin notebook to import matplotlib, ggplot2 or any of your favorite libraries and visualize your data with Python or R. If your data resides in Vora, use the Vora Modeler to visualize Time Series, Graph, Collection or Relational data.
Visualize your data using Vora Tools, Zeppelin or Jupyter notebook
Data Processing and Modeling
Depending on the problem you’re trying to solve you can extract features from your data or run machine learning algorithms for Classification/Regression, Clustering, Time Series, Recommendation, Dimensionality Reduction, etc.
Feature extraction using SAP Leonardo within Data Hub Pipeline
SAP Data Hub provides a flow-based programming model called the Pipeline Engine, which lets you develop and execute your scripts in a containerized environment managed by Kubernetes. This means you can work in your language of choice, such as R or Python, either with a predefined operator or by developing a custom one. You can also declare the required libraries in a Dockerfile and tag your operator for execution in that image. For a step-by-step guide please refer to Jens’s blog here.
With SAP Data Hub 1.3, the Pipeline Modeler provides more than 230 predefined operators, organized into categories such as Python, R, PA, Image Processing, TensorFlow, SAP Leonardo MLF and Natural Language Processing. Below you can see the different categories:
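The flow-based idea, small operators connected into a graph with data flowing from one to the next, can be mimicked in a few lines of Python using generators. This is a toy illustration only and has nothing to do with the actual Pipeline Engine API; the operator names `source`, `drop_empty` and `to_upper` and the helper `run_graph` are made up for this sketch.

```python
def source(records):
    """Source operator: emits records one at a time."""
    for record in records:
        yield record


def to_upper(stream):
    """Transform operator: upper-cases each incoming text record."""
    for record in stream:
        yield record.upper()


def drop_empty(stream):
    """Filter operator: discards records that are empty after stripping."""
    for record in stream:
        if record.strip():
            yield record


def run_graph(records, operators):
    """Wire the operators in order and collect what reaches the sink."""
    stream = source(records)
    for operator in operators:
        stream = operator(stream)
    return list(stream)


result = run_graph(["vora", "  ", "hana"], [drop_empty, to_upper])
print(result)  # ['VORA', 'HANA']
```

Each operator only knows its input and output, so operators can be reordered, replaced or reused in another graph without changing their code.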
Categories of predefined operators with SAP Data Hub Pipeline Modeler
Predefined operators such as Image Processing, Tensorflow, NLP, R and PA
Predefined operators such as Object Trackers, Python2, Python 3 and SAP Leonardo MLF
These operators are configurable, and each comes with documentation that describes its purpose and includes a how-to guide.
Every Operator provides you with documentation
Configure each operator on the right side of the graph
To make things easier, we also provide predefined graphs that serve as examples of how to use different operators, or that can simply be reused and reconfigured for your data scenario. Choose the Graphs tab in the Pipeline Modeler to access the predefined graphs.
An example graph which uses MNIST dataset for digit recognition with TensorFlow
Object Detection Pipeline using multiple pre-defined and custom operators
If you are happy with your model, you should think about operationalization: how to orchestrate, schedule and monitor the execution flow, and how to debug the process if necessary.
Orchestrate the execution of the workflow by using the SAP Data Hub Modeler. You can use this workflow to write the result simultaneously to multiple destinations such as Vora, HANA, BW, Amazon S3, GCP and HDFS.
Orchestration in SAP Data Hub Modeler
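Writing one result to several destinations at the same time is a classic fan-out pattern. The sketch below shows the pattern with Python threads and in-memory stand-ins; `InMemorySink` and `write_to_all` are hypothetical names, and real connectors for Vora, HDFS or S3 are deliberately left out.

```python
from concurrent.futures import ThreadPoolExecutor


class InMemorySink:
    """Hypothetical stand-in for a real destination such as a table or bucket."""

    def __init__(self, name):
        self.name = name
        self.rows = []

    def write(self, rows):
        self.rows.extend(rows)
        return len(rows)


def write_to_all(rows, sinks):
    """Write the same rows to every sink concurrently.

    Returns a dict mapping sink name to the number of rows written.
    """
    if not sinks:
        return {}
    with ThreadPoolExecutor(max_workers=len(sinks)) as pool:
        futures = {sink.name: pool.submit(sink.write, rows) for sink in sinks}
        # .result() re-raises any exception from the worker thread.
        return {name: future.result() for name, future in futures.items()}


sinks = [InMemorySink("vora"), InMemorySink("hdfs"), InMemorySink("s3")]
print(write_to_all([{"id": 1}, {"id": 2}], sinks))
# {'vora': 2, 'hdfs': 2, 's3': 2}
```

Each sink gets its own worker thread, so a slow destination does not hold up the others, and a failure in any one of them surfaces when its result is collected.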
You can also schedule the pipeline to run when you have more system bandwidth or less traffic. A run can be triggered by the arrival of new data or on a schedule of your choice, for instance every night at 12 am.
Scheduling with SAP Data Hub
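For the "every night at 12 am" case, the core of any scheduler is working out when the next run is due. Here is a minimal sketch of that calculation with the standard `datetime` module; the function name `next_midnight` and the sample date are arbitrary, and time zones are ignored for simplicity.

```python
from datetime import datetime, time, timedelta


def next_midnight(now: datetime) -> datetime:
    """Return the first midnight strictly after `now`."""
    tomorrow = now.date() + timedelta(days=1)
    return datetime.combine(tomorrow, time.min)


now = datetime(2018, 7, 15, 21, 30)  # arbitrary example timestamp
run_at = next_midnight(now)
print(run_at)        # 2018-07-16 00:00:00
print(run_at - now)  # 2:30:00
```

A scheduler would sleep for the printed interval, trigger the pipeline and then repeat the calculation for the following night.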
SAP Data Hub also provides monitoring and debugging capabilities through the Data Hub Cockpit, the Pipeline Modeler or the Kubernetes Dashboard.
Monitoring with SAP Data Hub
Check this blog for more information on how to trace detailed errors in the Pipeline Modeler or the Kubernetes Dashboard.
Trace the detailed error in Pipeline Modeler [more info]
Debugging in Kubernetes Dashboard [How to]
SAP Data Hub is a flexible platform that can be used for data science scenarios. It not only allows you to use your preferred language, such as Python or R, but also provides modeling tools such as the Vora Modeler, ADP, the Pipeline Engine and the Data Hub Modeler that facilitate data tasks. You can easily connect to multiple SAP or non-SAP sources, in the cloud, on-premise or both, and manage the data and metadata in a single platform. You can also distribute your executions, reuse your code within an operator and even connect to your GitHub repository for version control. Finally, you can orchestrate, schedule and monitor your executions and write to multiple destinations at the same time.