SAP Data Intelligence Metadata Explorer – Part 2
Curious about how to automate the data profiling and data quality processes in SAP Data Intelligence?
I am back with one more blog post. This time, let’s see how we can automate the generation of profiling reports and data quality reports. I will start with a short recap of my earlier blog post to set the context for this one.
In Blog1, we looked at the features of Metadata Explorer. However, it requires a lot of manual effort to publish a dataset, profile it, define data quality rules, and create scorecards. Another drawback is that you cannot export the profiling report, the scorecard report, or the metadata of the data, even though most organizations get a lot of value from analyzing these outputs and acting on them.
Generating profiling reports with Metadata Explorer is challenging mainly because the Profiling report or Data Validation report for a particular dataset has to be triggered manually (please refer to Blog1 for a more detailed view of generating the Profiling report and Data Quality report manually). This challenge can be overcome with the Pipeline Modeler, using standard operators together with the pipeline scheduler.
The manual process and the automated process are depicted in the following diagram.
Data profiling and data validation can be automated in the Pipeline Modeler with operators that are available natively in SAP Data Intelligence. The results of the profiling and data validation operators can be loaded into any target destination.
Automated Data Profiling Process:
More than 10 operators are available natively in SAP Data Intelligence to generate profiling reports from different sources. All the operators used to generate profiling reports are listed below.
There might be scenarios where no operator is compatible with your source. In our case we were working with an AWS S3 bucket, and there is no standard operator that generates a profiling report directly on data residing in AWS S3. In that situation you can use the “Local File Profiler” operator, but this operator requires the file to be available locally to SAP Data Intelligence.
The following tutorial illustrates the process of getting information from Amazon S3 and automating the generation of profiling reports. The scenario is depicted at a high level in the following diagram:
We used several standard operators to generate profiling reports on data stored in an AWS S3 bucket. Below is the modeled graph, which comprises the operators that generate the profiling report. The raw data is read from the AWS S3 bucket, the profiling report is generated on the fly, and it is written back to AWS S3. We attached a Cron job to the graph so that the report is regenerated on a schedule and overwrites the existing profiling report, picking up any changes in the source data.
Automated Data Quality Process:
Now let’s use the standard operators to automate the Data Quality process. A few operators let you define data quality rules on the source data. All the operators used for the Data Quality process are listed below.
The following tutorial illustrates the process of getting information from Amazon S3 and automating the Data Quality process. The scenario is depicted at a high level in the following diagram:
We used several standard operators to run Data Validation on data stored in an AWS S3 bucket. Below is the modeled graph, which comprises the operators that generate the Data Validation report. The raw data is read from the AWS S3 bucket and the result of Data Validation is written back to AWS S3. Data Validation has four output ports: Pass, Fail, Fail Information, and Error Message. Based on the validation rules and the Fail action of each rule, the data is distributed across the output ports. In the scenario below we have two validation rules, one with the Pass action and the other with the Fail action. The output of these rules is written to AWS S3 in different files. We attached a Cron job to the graph so that the validation is rerun on a schedule and overwrites the existing Data Validation result, picking up any changes in the source data.
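To make the port routing more concrete, here is a minimal Python sketch of how records could be spread across the Pass, Fail, and Fail Information outputs. It is not the operator's implementation: the `Rule` class, the `validate` function, the sample records, and the rule names are all hypothetical. It also assumes that a rule's fail action decides whether a violating record still goes to Pass or is sent to Fail. The Error Message port is left out.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    check: Callable[[dict], bool]  # returns True when the record satisfies the rule
    fail_action: str               # "PASS" or "FAIL": where a violating record is routed (assumed semantics)


def validate(records, rules):
    """Route each record to a pass or fail list and log every rule violation."""
    ports = {"pass": [], "fail": [], "fail_info": []}
    for rec in records:
        send_to_fail = False
        for rule in rules:
            if rule.check(rec):
                continue
            # Every violation is recorded, whatever the fail action is.
            ports["fail_info"].append(
                {"record_id": rec["id"], "rule": rule.name, "action": rule.fail_action}
            )
            if rule.fail_action == "FAIL":
                send_to_fail = True
        ports["fail" if send_to_fail else "pass"].append(rec)
    return ports


# Hypothetical sample data: one clean row, one blank country, one negative amount.
records = [
    {"id": "1", "country": "DE", "amount": "10"},
    {"id": "2", "country": "", "amount": "5"},
    {"id": "3", "country": "US", "amount": "-1"},
]
rules = [
    Rule("country_not_blank", lambda r: r["country"] != "", "PASS"),
    Rule("amount_non_negative", lambda r: float(r["amount"]) >= 0, "FAIL"),
]
ports = validate(records, rules)
```

In this sketch, the record with the blank country still ends up in the pass list because its rule uses the Pass action, while the record with the negative amount is routed to the fail list. Both violations appear in the fail information list.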
Now, let’s dive deeper into how to automate the generation of profiling reports using the Pipeline Modeler.
Read data from Amazon S3
Now let’s start using this data in a data pipeline. We first create a new graph in the Modeler application and add a Flow agent File Consumer operator to read the data from Amazon S3:
The Flow agent File Consumer reads data from any of the supported cloud storages or from a local file. It uses the Flow agent sub-engine for execution.
The supported cloud storages are:
- Azure Data Lake (ADL)
- Azure Data Lake V2 (ADL_V2)
- Google Cloud Storage (GCS)
- Amazon S3 (S3)
- Windows Azure Storage Blob (WASB)
- Semantic Data Lake (SDL)
- Alibaba OSS (OSS)
Open the configuration of the Flow agent File Consumer operator and point it to the connected storage location where the .csv file resides:
First choose the Storage Type (S3 in this case).
Read data from the Flow agent File Consumer
Next, we read the data from the Flow agent File Consumer and produce a file. For this we add a Flow agent File Producer to the same graph. The location of the file depends on the specified storage type, which can be cloud storage or a local file.
Open the configuration of the Flow agent File Producer operator and specify the storage type, file name, mode, format, and CSV properties.
In our case we keep the file local, because the “Local File Profiler” operator generates the profiling report on a local file.
This operator outputs the file name, which acts as the input to the next operator.
Generation of profiling report
For this we add the “Local File Profiler” operator to the same graph. This operator accepts the file name from the previous operator through its input port. Alternatively, the file name can be specified explicitly in the configuration properties.
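To give a feel for what a profiling report contains, the sketch below computes a few per-column statistics for a small CSV. It is only an illustration in plain Python and not what the Local File Profiler does internally. The function name `profile_csv`, the chosen statistics, and the sample data are assumptions.

```python
import csv
import io


def profile_csv(text):
    """Return simple per-column statistics for CSV text that has a header row."""
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: {"count": 0, "blank": 0, "values": set()} for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            if name is None:
                continue  # extra fields beyond the header are ignored
            stats = columns[name]
            stats["count"] += 1
            if value is None or value.strip() == "":
                stats["blank"] += 1
            else:
                stats["values"].add(value)
    return {
        name: {
            "count": s["count"],
            "blank": s["blank"],
            "distinct": len(s["values"]),
            # min and max compare the raw strings, so they are lexicographic.
            "min": min(s["values"]) if s["values"] else None,
            "max": max(s["values"]) if s["values"] else None,
        }
        for name, s in columns.items()
    }


# Hypothetical sample: three rows, one of them with a blank country.
SAMPLE = "id,country\n1,DE\n2,\n3,DE\n"
report = profile_csv(SAMPLE)
```

The resulting dictionary has one entry per column, which makes it easy to serialize in a later step.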
Write the Profiled data as .json to Amazon S3
In the final part of the graph, we write the results back to Amazon S3 in a different file format, .json. For this, we add a “Format Converter” operator to the existing graph to convert the result to .json. The input port of the Format Converter is of type blob, so we first convert the output of the “Local File Profiler” to a blob object using the “To-Blob Converter”.
Open the configuration of the Format Converter operator and specify the target format as .json.
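Conceptually, these conversions turn the profiling result into JSON text and then into a sequence of bytes, since a blob is just bytes. The Python sketch below shows that idea outside the Modeler. The function name `to_json_blob` and the sample report are hypothetical, and the sketch does not reproduce how the converter operators work internally.

```python
import json


def to_json_blob(report: dict) -> bytes:
    """Serialize a report to JSON text and encode it as UTF-8 bytes."""
    text = json.dumps(report, indent=2, sort_keys=True)
    return text.encode("utf-8")


# Hypothetical profiling result for a single column.
report = {"country": {"count": 3, "blank": 1, "distinct": 1}}
blob = to_json_blob(report)

# Decoding and parsing the blob gives back the original structure.
restored = json.loads(blob.decode("utf-8"))
```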
The last step is to write the result back to Amazon S3. For this, we add a “Write File” operator to the same graph. Before the “Write File” operator, we add a “To-File Converter” to convert the data to the file type.
Open the configuration of the “Write File” operator, choose an existing connection to S3, and provide the necessary properties.
That’s it. Now save the graph and click Run.
Wait until the graph switches to the Running state and then to Completed.
Inspect the results
After the graph execution has completed, you can use the Metadata Explorer application to inspect the written .json file.
Finally, create a Cron job for the graph so that it executes at regular intervals. The Cron job can be created from the Schedule tab in the Monitoring tile by providing the necessary details.
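If the schedule is entered as a cron expression, it can help to check what the expression means before saving it. The sketch below assumes the standard five-field layout (minute, hour, day of month, month, day of week) and supports only `*`, single numbers, and comma-separated lists. The helper names and the sample expressions are hypothetical, and the exact syntax accepted by the scheduler should be checked in the product documentation.

```python
from datetime import datetime


def field_matches(field: str, value: int) -> bool:
    """Check one cron field that is '*', a number, or a comma-separated list of numbers."""
    for part in field.split(","):
        if part == "*" or int(part) == value:
            return True
    return False


def cron_matches(expr: str, when: datetime) -> bool:
    """Return True if 'when' falls on a minute selected by the five-field expression.

    Simplification: all five fields must match, which differs from classic cron
    when both day of month and day of week are restricted.
    """
    minute, hour, day, month, weekday = expr.split()
    # Cron counts Sunday as 0, while datetime.weekday() counts Monday as 0.
    cron_weekday = (when.weekday() + 1) % 7
    return (
        field_matches(minute, when.minute)
        and field_matches(hour, when.hour)
        and field_matches(day, when.day)
        and field_matches(month, when.month)
        and field_matches(weekday, cron_weekday)
    )


NIGHTLY = "0 2 * * *"          # every day at 02:00
MONDAY_MORNING = "30 6 * * 1"  # every Monday at 06:30

runs_at_two = cron_matches(NIGHTLY, datetime(2024, 1, 15, 2, 0))
runs_at_three = cron_matches(NIGHTLY, datetime(2024, 1, 15, 3, 0))
```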
We have seen the detailed steps to automate data profiling. The same steps can be applied to data quality, using the Data Quality operators that are natively available in SAP Data Intelligence.
In this blog we explored how to automate the Data Profiling and Data Quality processes using SAP Data Intelligence. Exporting the results of data profiling and data quality gives you a more complete analysis of the data, which leads to more insights for feature engineering and further data modeling.
This automation helps customers implement a data strategy and informs the creation of data quality rules and scorecards.
You can post any queries or concerns in the comment section below, and feel free to share any findings that I missed in this blog.
Happy Learning 🙂
Great information - thank you!
thanks for your blog post!
Is there any possibility to schedule the profiling within the Metadata Explorer?
Or could we reuse the generated output in the Metadata Explorer?
Thanks and best regards
With the current version of DI, it is not technically possible to schedule the profiling within the Metadata Explorer.
For the second point, we can definitely use the output of the profiling pipeline in Metadata Explorer, and we can leverage all the functionalities of Metadata Explorer on the generated output.
thanks for your answer.
By using the output, do you mean that the result can be profiled like any other file, or do you mean that we can use the results from the file as profiling results of the initial dataset within the UI of Metadata Explorer?
If the source (connection) supports the profiling capability, then you can profile the output file.
For example, if you generate the file output in AWS S3 and AWS S3 supports the profiling capability, then you can profile the file like any other file. If the source doesn't support the profiling capability on files, then you can only view the file in Metadata Explorer.
Thank you for a very informative blog.
We can profile, validate, and enrich the data in the Modeler as well as in Metadata Explorer.
From your blog, I understand that we can automate and schedule the data quality processes in the Modeler, which can't be achieved in Metadata Explorer yet.
My question is why we have similar capabilities in both components. If we can achieve data quality, validation, and profiling in the Modeler, then why would we perform these activities in MDE?
Could you please elaborate?
I don't see the profiling operators in the Data Intelligence Modeler.
How can I find them?
Please make sure you select all the relevant operators that you would like to use in operator task bar.
There is no category [INTERNAL] Profiling Operators.
Probably I should open them in some way.
Operators flagged with INTERNAL are used for internal activities and do not have the same lifecycle as normal operators. They are used internally for very specific scenarios, do not go through a deprecation process, and are not guaranteed to be backwards compatible between releases. We do not recommend using them in any production scenario, as they may break with any upgrade that is done on the system.
Hello Pavan ,
I tried profiling the same way: I pulled data from SDL and stored it back in SDL without doing any validation, but after running the pipeline I could not see that my file had been profiled.
I cannot see a copy of that file in the catalog.
SAP Data Intelligence version 3.1.43
Could you help me understand?
First of all, congratulations on the publication.
One question, is it possible to automate data publishing? Do you have a “how-to”?
So that you can automate the entire data ingestion process: load, profile and publish.
Thanks in advance.
Hello, I have an opinion on this even though I am not working on the SAP side. My approach has always been to use a separate data load pipeline in the Modeler app to push data to SDL from a data source - and then use the Task Schedule tool within the Metadata Explorer app to do scheduled publications, rulebooks and profilings.