Fuzzy Matching Reimagined with Data Science - Part 2
Ever looked for matches that were less than 100% perfect? Key-based matches not working for you? Until a few years ago, fuzzy matching was the only answer we had. Today, data science algorithms are breaching the boundaries of the “possible”. Find out how we used algorithms typically applied to document similarity to solve the traditional fuzzy matching problem using SAP Data Intelligence & SAP Analytics Cloud.
In Part 1 of this blog series, we set the context for the problem, described the pre-processing, and explained the models used. In this part, we assemble all the models in SAP Data Intelligence (DI), generate the Master and Bridge files, and write them to SAP Analytics Cloud to visualize the results.
Step 01: Load data
The three data sources (positive, quarantine, and de-quarantine) are stored in the Semantic Data Lake (SDL). These files are later loaded into the Modeler for profile matching.
Step 02: Create the training pipeline
- In the ML Scenario Manager, we create a new pipeline from the “Blank” template. We load the datasets into the graph with the Read File Operator, setting configurationType to “Connection Management” and connectionID to “DI_DATA_LAKE” in its configuration, and entering the file location in the path field. The Read File Operator for the positive data is shown below. All three files (Positive, Quarantine, and De-Quarantine) are loaded into the graph in the same way.
- These three files are passed as input to the Python3 Operator for profile matching. The steps described in Part 1 are implemented inside the Python3 Operator: the data is pre-processed and passed to the functions nearest_neighbour and cosine_similarity to find similarities between the data sources. The snippet below loads the data inside the Python3 Operator and passes it to these two functions.
- Both algorithms generate a dataframe that groups similar profiles and assigns each group a similarity score. These groups are provided to a data steward, who confirms the machine’s merge / de-merge recommendations. The dataframe containing the individual source records and their corresponding profiles with the group ID is called the Bridge; it serves as a lineage file. The dataframe containing the golden record for each group ID, created from the best source attributes, is called the Master. Records that require review are written to a separate file for the data steward. The snippet below shows how to write the data to a file in the graph.
- To write a dataframe to a file, we convert it to a string and send it to the output port. This string is passed to the “in” port of the “toFile” Operator. We write all of these dataframes to the Semantic Data Lake (SDL). The configuration of the Write File Operator for the Master data is shown below.
- Before we run the graph, a Docker image needs to be created for the Python3 Operator. Please check this blog post by Andreas Forster to see how to create a Dockerfile and build it. For reference, our Dockerfile and tags are shown below.
- To use the Docker image for our Python3 Operator, we group the operator and add the Docker image tags to the group’s configuration, as shown below.
- Once the files bridge.csv and master.csv are written to SDL, these two files are written to SAP Analytics Cloud using “SAP Analytics Cloud Formatter” and “SAP Analytics Cloud Producer”. Check this blogpost by Ian Henry for the integration of SAP Data Intelligence and SAP Analytics Cloud.
- The final pipeline looks as below.
- Once the pipeline has executed, three CSV files (master.csv, bridge.csv, and steward.csv) are written to the result folder in SDL, as shown below.
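As a rough illustration of the profile-matching step inside the Python3 Operator, the sketch below scores profile strings with cosine similarity over character trigram counts and picks the nearest neighbour from a list of candidates. It is a standard-library stand-in, not the implementation from Part 1: the helper names (char_ngrams, ngram_cosine, best_match), the trigram features, and the sample names are all assumptions made for this example.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count character n-grams of a lower-cased, whitespace-normalised string."""
    cleaned = " ".join(text.lower().split())
    padded = f" {cleaned} "  # pad so word boundaries contribute n-grams too
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def ngram_cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is similar to nothing
    return dot / (norm_a * norm_b)


def best_match(query, candidates):
    """Return (index, score) of the candidate most similar to the query."""
    query_vec = char_ngrams(query)
    scores = [ngram_cosine(query_vec, char_ngrams(c)) for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return best, scores[best]


# Hypothetical profiles: one quarantine record matched against positive records.
positive_profiles = ["Maria Gonzalez", "John Smith"]
index, score = best_match("Jon Smith", positive_profiles)
# index 1 ("John Smith") is the closest candidate
```

In a real graph, a score threshold would typically decide which pairs are merged automatically and which are routed to the data steward; the post does not state the threshold used, so none is assumed here.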
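The relationship between the Bridge, the Master, and the string that is sent to the Write File Operator can be sketched as follows. This is a minimal example under stated assumptions: the column names, the sample records, and the rule for "best source attributes" (here simply the first non-empty value per group, in row order) are hypothetical, and plain lists of dictionaries stand in for the dataframes used in the actual pipeline.

```python
import csv
import io


def build_master(bridge_rows, attributes):
    """Build one golden record per group_id.

    Assumed rule: for each attribute, keep the first non-empty value
    encountered among the group's source records.
    """
    master = {}
    for row in bridge_rows:
        golden = master.setdefault(row["group_id"], {"group_id": row["group_id"]})
        for attr in attributes:
            if not golden.get(attr) and row.get(attr):
                golden[attr] = row[attr]
    # Make sure every golden record carries every attribute column.
    return [
        {"group_id": gid, **{attr: rec.get(attr, "") for attr in attributes}}
        for gid, rec in master.items()
    ]


def rows_to_csv(rows, fieldnames):
    """Serialise rows to a CSV string, ready to be sent to an output port."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical Bridge: every source record with the group it was assigned to.
bridge = [
    {"source": "positive", "record_id": "P1", "group_id": "G1",
     "name": "John Smith", "phone": ""},
    {"source": "quarantine", "record_id": "Q7", "group_id": "G1",
     "name": "Jon Smith", "phone": "555-0101"},
    {"source": "de-quarantine", "record_id": "D3", "group_id": "G2",
     "name": "Maria Gonzalez", "phone": "555-0199"},
]

master = build_master(bridge, ["name", "phone"])
bridge_csv = rows_to_csv(bridge, ["source", "record_id", "group_id", "name", "phone"])
master_csv = rows_to_csv(master, ["group_id", "name", "phone"])
# Inside the Python3 Operator, these strings would be sent to output ports,
# for example with api.send("master", master_csv); the port name is hypothetical.
```

With pandas dataframes, as used in the post, the equivalent of rows_to_csv is DataFrame.to_csv(index=False), which returns the CSV content as a string when no path is given.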
Step 03: Ingest review decision from Data Steward
- Once steward.csv is written to the result folder, the Data Steward downloads it to review the matches generated by nearest neighbour and cosine similarity. The Data Steward sets the “Review” column to “Y” for correct matches and “N” for incorrect matches, then uploads the reviewed CSV to the steward_data folder in SDL, as shown below.
- The decisions are used to update the files in a pipeline within the ML Scenario Manager. Here we filter the groups reviewed by the Data Steward and update the group IDs in the Master and Bridge files accordingly. Using the Read File Operator, steward_corrected.csv is loaded from SDL. This file is passed as input to the Python3 Operator, which filters the rows based on the “Review” column entered by the Data Steward. The reviewed data is written as bridge_reviewed.csv to the SDL result folder. Subsequently, the files are converted into datasets and written to the SAP Analytics Cloud tenant. The pipeline is shown below.
- To avoid manual intervention, this pipeline is scheduled to execute at the end of every day, after business hours. If new steward_corrected data has been generated for the day, the scheduled pipeline filters it and writes it to SAP Analytics Cloud. The configuration details of the scheduled graph are shown below.
- Once the graph has executed, bridge_reviewed.csv is created in SDL, as shown below.
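The filtering on the “Review” column described above might look roughly like the sketch below. Only the “Review” column and its “Y” / “N” values come from the post; the other column names, the sample rows, and the function name filter_reviewed are hypothetical. How rejected matches are re-grouped is not specified in the post, so the sketch only separates the decisions.

```python
import csv
import io


def filter_reviewed(steward_csv, review_column="Review"):
    """Split steward rows into confirmed ("Y") and rejected ("N") matches.

    Rows without a decision are left out of both lists so that they can be
    picked up again on a later scheduled run.
    """
    reader = csv.DictReader(io.StringIO(steward_csv))
    accepted, rejected = [], []
    for row in reader:
        decision = (row.get(review_column) or "").strip().upper()
        if decision == "Y":
            accepted.append(row)
        elif decision == "N":
            rejected.append(row)
    return accepted, rejected


# Hypothetical content of the reviewed steward file.
steward_corrected = (
    "record_id,group_id,score,Review\n"
    "P1,G1,0.74,Y\n"
    "Q7,G1,0.74,y\n"
    "D3,G2,0.41,N\n"
    "D9,G3,0.55,\n"
)

accepted, rejected = filter_reviewed(steward_corrected)
accepted_ids = [row["record_id"] for row in accepted]  # P1 and Q7
rejected_ids = [row["record_id"] for row in rejected]  # D3
```

Normalising the decision with strip() and upper() is a defensive choice for manually edited files, where a lower-case “y” or a trailing space is easy to introduce.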
In this blog post, we have shown a step-by-step approach to deploying these models in SAP Data Intelligence. Armed with the latest Bridge and Master files, written by SAP Data Intelligence into the SAP Analytics Cloud tenant, we are ready to visualize our results. Read Part 3 of this blog series to find out how the end user can review and consume these results.