Intelligent rule mining and data dependency analys...

juliahauri · ‎11-13-2020

The SAP Data Intelligence product team recently sponsored a Business Content Sprint-At-Home initiative in which it engaged several partners to develop business content for SAP Data Intelligence. SAP asked us to take part in the content sprint, and we were happy to contribute our technical and business expertise with SAP Data Intelligence.

The scenario we developed centers on intelligent rule mining. In this blog post we explain what we mean by that, why it is important, how it works, and what benefits the solution offers.

What intelligent rule mining is about

Data rules describe business operations, definitions and constraints that apply to an organization’s data. They reflect the business structure, serve to guide or control organizational behavior and are put in place to help the organization achieve its goals.

Data rules vary in complexity. Data guidelines, e.g., requiring a “.” and an “@” character for a valid e-mail address, can be classified as non-complex rules and are easy to define at field level. However, there is more value to extract from data. Based on the semantics of your data, complex rules can also be mined in the form of if-then dependencies, e.g., given the value of a field A, what are the most likely values of other fields? Unlike non-complex data guidelines, these dependencies require an initial analysis step and are therefore harder to find and use. Extracting complex dependencies is particularly exhausting for domain experts, because it is difficult to check whether a dependency holds across several systems. Furthermore, identifying incorrect data entries is a complex and time-consuming task that data stewards or data owners in enterprises perform manually.
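To make the distinction concrete, here is a minimal Python sketch of both rule types. The sample records, the field names and the helper functions are invented for illustration and are not part of the solution itself: the first function is a field-level guideline, the second measures how strongly one field value implies the values of another field.

```python
from collections import Counter

# Hypothetical sample records; field names and values are assumptions for illustration.
records = [
    {"email": "a@example.com", "country": "DE", "currency": "EUR"},
    {"email": "b@example.com", "country": "DE", "currency": "EUR"},
    {"email": "c.example.com", "country": "DE", "currency": "USD"},
    {"email": "d@example.com", "country": "US", "currency": "USD"},
]


def is_valid_email(value):
    """Non-complex field-level guideline: the value contains an "@" and a "."."""
    return "@" in value and "." in value


def dependency_confidence(rows, if_field, if_value, then_field):
    """If-then dependency: among rows where if_field == if_value,
    return the share of each value observed in then_field."""
    observed = [row[then_field] for row in rows if row[if_field] == if_value]
    if not observed:
        return {}
    counts = Counter(observed)
    return {value: n / len(observed) for value, n in counts.items()}


invalid_emails = [r["email"] for r in records if not is_valid_email(r["email"])]
de_currency = dependency_confidence(records, "country", "DE", "currency")

print(invalid_emails)  # the one address without an "@"
print(de_currency)     # share of each currency among rows with country DE
```

The field-level check needs no knowledge of the data, whereas the dependency can only be found by analyzing the records first, which is the step that rule mining automates.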

Camelot’s intelligent rule mining solution powered by SAP Data Intelligence offers a flexible and versatile toolset to identify data dependencies and improve the overall data quality of your enterprise – within and beyond your master data. Free up qualified resources by reducing time spent on tedious data mining tasks and easily identify scenario-specific rules based on your data.

Identify data dependencies

Using the connection management of SAP Data Intelligence, you can connect directly to your SAP HANA databases and access your data landscape. From there, you can select the available tables and columns to construct a scenario for rule mining. You can also upload your data tables in CSV format to the rule mining solution. Mining the data then automatically extracts and identifies relationships and dependencies between seemingly independent data based on the selected data source, e.g., product development status or hazardous materials.
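As a rough sketch of what constructing a scenario from a CSV upload could look like, the following Python snippet reads a CSV table, keeps only the selected columns and turns every row into a set of "column=value" items, a common input shape for rule mining. The CSV content, the column names and the function to_transactions are assumptions made for this example, not the interface of the solution.

```python
import csv
import io

# Hypothetical CSV upload; columns and values are invented for illustration.
CSV_TEXT = """material,status,hazardous,storage
M1,released,yes,cabinet
M2,released,no,shelf
M3,development,yes,cabinet
"""


def to_transactions(csv_text, columns):
    """Turn each CSV row into a set of "column=value" items for the selected columns.
    Empty cells are skipped so that missing values do not become items."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        frozenset(f"{col}={row[col]}" for col in columns if row[col] != "")
        for row in reader
    ]


# Select the columns that make up the mining scenario; "material" is left out
# because an identifier column carries no reusable dependency.
transactions = to_transactions(CSV_TEXT, ["status", "hazardous", "storage"])
for t in transactions:
    print(sorted(t))
```

Encoding the column name into each item keeps equal values from different columns apart, so "yes" in one field is never confused with "yes" in another.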

Manage data rules in rule sets

Based on the mined data, rules can be identified and rated by their relevance. Users can define golden rule sets for their enterprise, store them in the repository and export them to Excel. They can also compare rule sets with new data sources and generate data quality reports, e.g., by comparing master data rules with supply chain data.
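The comparison of a golden rule set with a new data source can be pictured roughly as follows. This is a simplified Python sketch under our own assumptions: the Rule structure, the use of confidence as the relevance score, the sample rows and the report layout are all hypothetical and only illustrate the idea of a data quality report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical if-then rule: IF if_field == if_value THEN then_field == then_value."""
    if_field: str
    if_value: str
    then_field: str
    then_value: str
    confidence: float  # assumed relevance score from the mining step


def quality_report(rules, rows):
    """Check every rule against the rows of a new data source.
    Rules are listed by descending confidence; compliance is None if a rule never applies."""
    report = []
    for rule in sorted(rules, key=lambda r: r.confidence, reverse=True):
        applicable = [r for r in rows if r.get(rule.if_field) == rule.if_value]
        violations = [r for r in applicable if r.get(rule.then_field) != rule.then_value]
        report.append({
            "rule": f"IF {rule.if_field}={rule.if_value} THEN {rule.then_field}={rule.then_value}",
            "applicable_rows": len(applicable),
            "violations": len(violations),
            "compliance": 1 - len(violations) / len(applicable) if applicable else None,
        })
    return report


# Invented golden rule set (e.g., mined from master data) and new rows (e.g., supply chain data).
golden_rules = [
    Rule("hazardous", "yes", "storage", "cabinet", 0.95),
    Rule("status", "development", "sellable", "no", 0.99),
]
supply_chain_rows = [
    {"hazardous": "yes", "storage": "cabinet", "status": "released", "sellable": "yes"},
    {"hazardous": "yes", "storage": "shelf", "status": "released", "sellable": "yes"},
    {"hazardous": "no", "storage": "shelf", "status": "development", "sellable": "no"},
]

report = quality_report(golden_rules, supply_chain_rows)
for entry in report:
    print(entry)
```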

Enable value prediction for enterprise applications

With those data rules in place, data outliers and rule violations can be detected quickly by validating rule conformity and expected values for any other data source in your enterprise. In case of inconclusive data, affected rows are highlighted in your data set and correct values can be populated automatically. Besides rule management in our intuitive UI, the solution also allows programmatic access to the mined knowledge through an exposed API: simply submit a combination of fields with some of their respective values, and the response contains the most likely values for the empty fields.
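The request and response pattern described above can be sketched in plain Python. Please note that this is not the actual API of the solution: the function suggest_values, the request shape (empty fields passed as None) and the response shape are assumptions, and the lookup simply counts matching historical rows instead of querying mined rules.

```python
from collections import Counter


def suggest_values(knowledge_rows, request):
    """For every empty (None) field in the request, return the most frequent value
    among the known rows that agree with all filled-in fields, plus its share."""
    known = {field: value for field, value in request.items() if value is not None}
    matches = [
        row for row in knowledge_rows
        if all(row.get(field) == value for field, value in known.items())
    ]
    suggestions = {}
    for field, value in request.items():
        if value is not None:
            continue  # field already filled in by the caller
        counts = Counter(row[field] for row in matches if row.get(field) is not None)
        if counts:
            best, n = counts.most_common(1)[0]
            suggestions[field] = {"value": best, "confidence": n / sum(counts.values())}
    return suggestions


# Invented knowledge base rows for illustration.
knowledge = [
    {"hazardous": "yes", "storage": "cabinet", "label": "GHS02"},
    {"hazardous": "yes", "storage": "cabinet", "label": "GHS02"},
    {"hazardous": "yes", "storage": "shelf", "label": "GHS07"},
    {"hazardous": "no", "storage": "shelf", "label": "none"},
]

# Submit one known field and two empty fields, as a caller of the API would.
response = suggest_values(knowledge, {"hazardous": "yes", "storage": None, "label": None})
print(response)
```

Returning a confidence next to each suggested value lets the calling application decide whether to populate the field automatically or only offer it as an input suggestion.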

Benefits

Identify complex enterprise rule sets and compare them against additional data sources within your enterprise. The solution detects data inconsistencies through outlier detection, confidence values and frequency distributions, and it supports users during field value population with input suggestions and automated data population options. This leads to higher overall data quality, which in turn strengthens decision-making in your enterprise and yields efficiency gains.

Technical information

SAP Data Intelligence allows versatile manipulation of data across the enterprise landscape by encapsulating logic in operators inside data pipelines. This paradigm offers maximum code re-usability, a simple interface for tuning settings and the flexibility to accommodate different system requirements. Following this principle, we offer a rule mining operator that takes the data and the columns to be mined as input and outputs the mined results to be stored in the knowledge base. Once the data is stored, an OpenAPI operator is used to expose different functionality to the user based on the generated knowledge (e.g., rules summary and filtering, field value population, non-compliant rows, etc.). The process is described in the diagram below:

Figure: Intelligent Rule Mining Process

The rule mining operator uses an implementation of the Parallel FP-Growth (Frequent Pattern Tree Growth) algorithm. The operator leverages a local Spark cluster to distribute the rule mining work, which shortens processing time. In SAP® Data Intelligence, you also have the option to assign dedicated CPU and memory to the pipeline depending on the complexity of the data.
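For readers who have not worked with frequent pattern growth before, here is a compact single-machine Python sketch of the idea. It is not the code of our operator and it is not parallel: it only shows how an FP-tree is built, how frequent itemsets are mined from conditional pattern bases, and how if-then rules with a confidence value are derived from them. All names, the sample transactions and the thresholds are assumptions. Spark MLlib also ships an FP-Growth implementation for distributed use.

```python
from collections import defaultdict


class Node:
    """One FP-tree node: an item, a link to its parent, and a count."""
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}


def build_tree(weighted, min_count):
    """Build an FP-tree from (items, weight) pairs; return header table and item counts."""
    counts = defaultdict(int)
    for items, weight in weighted:
        for item in set(items):
            counts[item] += weight
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    root, header = Node(None, None), defaultdict(list)
    for items, weight in weighted:
        # Keep frequent items only, most frequent first (ties broken by name).
        path = sorted((i for i in set(items) if i in frequent),
                      key=lambda i: (-frequent[i], i))
        node = root
        for item in path:
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += weight
    return header, frequent


def fp_growth(weighted, min_count, suffix=frozenset()):
    """Return {itemset: count} for all itemsets that occur at least min_count times."""
    header, frequent = build_tree(weighted, min_count)
    result = {}
    for item, nodes in header.items():
        itemset = suffix | {item}
        result[itemset] = frequent[item]
        # Conditional pattern base: the prefix path of every node that holds `item`.
        base = []
        for node in nodes:
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                base.append((path, node.count))
        result.update(fp_growth(base, min_count, itemset))
    return result


def derive_rules(itemsets, min_confidence):
    """Derive single-consequent rules (antecedent, consequent, confidence)."""
    rules = []
    for itemset, count in itemsets.items():
        for consequent in itemset:
            antecedent = itemset - {consequent}
            if antecedent and count / itemsets[antecedent] >= min_confidence:
                rules.append((antecedent, consequent, count / itemsets[antecedent]))
    return rules


# Invented transactions in "column=value" form.
transactions = [
    {"hazardous=yes", "storage=cabinet", "status=released"},
    {"hazardous=yes", "storage=cabinet", "status=development"},
    {"hazardous=yes", "storage=cabinet", "status=released"},
    {"hazardous=no", "storage=shelf", "status=released"},
]
itemsets = fp_growth([(t, 1) for t in transactions], min_count=2)
mined_rules = derive_rules(itemsets, min_confidence=0.9)
for antecedent, consequent, confidence in mined_rules:
    print(sorted(antecedent), "->", consequent, round(confidence, 2))
```

The tree compresses transactions that share a prefix into one path, so the data is scanned only twice and no candidate itemsets have to be generated. The parallel variant distributes the conditional mining step across workers.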

You can easily ingest new data sources and map them to the same pipeline by leveraging the prebuilt SAP operators. Data can be provisioned from different clouds or on-premises databases (MSSQL, MySQL…). Furthermore, data jobs can be started via SAP Data Services, and SAP® Business Warehouse Data Marts can be consumed. In a similar manner, the extracted knowledge can be pushed to any downstream system.

Want to learn more?

If you would like to learn more about this scenario, please feel free to ask questions via the comments section of this blog post.

You can also reach out to us via this blog or email.

We also developed the Extreme Data Maintenance Solution as part of the SAP Data Intelligence Content Sprint-At-Home initiative.

You can also learn more by visiting the following community pages:

SAP Analytics Cloud: https://community.sap.com/topics/cloud-analytics

SAP Data Warehouse Cloud: https://community.sap.com/topics/data-warehouse-cloud

SAP Data Intelligence: https://community.sap.com/topics/data-intelligence