Match Review with Data Cleansing Advisor (DCA)
Match Review is a tool in Information Steward which allows Data Stewards to engage in the process of reviewing match results produced by Data Services or by a data cleansing solution published by Data Cleansing Advisor. The process of creating a dataflow to produce match results that Information Steward can then use to create a Match Review task can be found by reading the following article: http://wiki.scn.sap.com/wiki/display/EIM/Match+Review. The process of creating a Match Review task can be found in this article: http://wiki.scn.sap.com/wiki/display/EIM/How+to+create+a+Match+Review+task+in+My+Worklist+and+review+the+suspects
This section will focus on leveraging Match Review to review the results of a data cleansing solution published by Information Steward’sData Cleansing Advisor. Data Services Workbench will be used to stage the appropriate data for a Match Review task. Data Cleansing Advisor simplifies the entire data cleansing process by intelligently recommending rules for cleansing and matching based upon SAP best practices. These rules can then be modified to further meet the Data Steward’s business requirements while immediately showing them the impact of the modifications. Once the Data Steward is satisfied with the results, the solution can be published and immediately used within Data Services Workbench by an ETL Developer.
Create a Dataflow
The published data cleansing solution is highlighted in yellow in the dataflow pictured below. The data cleansing solution represents the cleansing and match rules that were created and reviewed by the Data Steward in Information Steward. The image below is a basic example of what a dataflow will look like within Data Services Workbench in order to stage the appropriate tables to create a Match Review task in Information Steward.
The first step is to create a new project, dataflow and input source in Data Services Workbench. Add the input source that was used to create the data cleansing solution within Information Steward to the dataflow. If a sample set of input data was used for the data cleansing solution that was created in Information Steward, then you will want to reference the full (production) set of data. If you are unsure, then locate the “Data Cleansing Solutions” tab in Data Services Workbench and view the details of the published data cleansing solution.
Mapping Input Data
All relevant details, including the input source’s name and schema, will be available for you to review. Add the data cleansing solution from the same “Data Cleansing Solutions” tab to the dataflow.
The input source will need to be connected to the selected data cleansing solution. The next steps would be to map the input source’s fields to the input schema of the data cleansing solution and to select the relevant output fields that will be used to drive the Match Review task.
The published data cleansing solution will provide you details as to how the data was automatically mapped within Information Steward when viewing the “Input” tab. You can choose to manually map the fields or use the “Detect mappings” feature which will provide you suggestions as to how the data may be mapped. The image below how my example input data set was mapped to the data cleansing solution.
Selecting Output Fields
By default there is no output fields selected when the data cleansing solution is added to the dataflow. Match Review requires the following output fields to be available within the staged data table in order to create a Match Review task:
Along with the 3 required match fields noted above, you must also select fields to include as part of match review or best record creation.
The image below is the “Output” tab of the data cleansing solution and the output schema that will be used. All fields were selected in this example. You can customize the output schema in any way you want, but you do need to have the 3 match fields listed above along with some output fields for the review and best record creation process.
Preparing the Data for Match Review
Now that the data cleansing solution has been added to the dataflow and configured, the next step is to prepare the data for the appropriate staging tables. Two staging tables are created: a job status table and the data table that will hold the match groups to be reviewed.
The first query transform (called “DCAOutput”) is used to simply allow you to modify the column names to better fit your business requirements. For example, STD_ADDR1_LOCALITY1 may have its name changed to LOCALITY. Use this query transform to adjust the output schema from the data cleansing solution.
The second query transform (called “MatchReviewPrep”) is used to add the required fields for the job status table and the data table. The job status table requires only 2 columns: JOB_RUN_ID and JOB_STATUS. JOB_RUN_ID will get the value of “job_run_id()” and JOB_STATUS will get the value “Pending Match Review”. The JOB_RUN_ID column will allow newly match data to be added to a new Match Review task without modifying the status or data of a Match Review task that has already been created using the same input source. The JOB_STATUS field’s value will be used when setting up a Match Review task (http://wiki.scn.sap.com/wiki/display/EIM/How+to+create+a+Match+Review+task+in+My+Worklist+and+review+the+suspects).
The output schema of the “MatchReviewPrep” query transform needs to contain all required Match Review fields along with the output fields from the data cleansing solution. This transform will then need to be connected to your data staging table (as shown above). A field called SOURCE_RECORD_ID
needs to be added and given the value of “gen_row_num()” if your input schema does not have a record ID field. If it does, then make sure that this field is available within the output schema so that it is available within the data staging table.
The third, and final, query transform (called “JobStatusPrep”) is used to simply filter out all output fields from the “MatchReviewPrep” query transform to ensure that only the JOB_STATUS and JOB_RUN_ID fields exist. This transform will then need to be connected to your job status staging table (as shown
All that is left to do is to validate the dataflow for any issues, deploy and execute the dataflow so that the staging tables are created and populated. Once the dataflow has successfully created the required data, you can create a Match Review task in Information Steward.
There are two options when designing a dataflow to stage data to create a Match Review task. The first option (and what was described above) is to send all match groups regardless of match score to the data staging table to be reviewed within Match Review. The second option is to use a case transform to route suspect matches that have a MATCH_SCORE less than or equal to 93 to the data staging table to be reviewed within Match Review. This means all other matches will be sent to the final target awaiting the results of the Match Review process.
The image below shows a dataflow that was created to route suspect matches to the data staging table and auto matches to the final target table.
Data Cleansing Advisor Best Practices Blog Series
Determining Duplicates and a Matching Strategy
Publishing to Data Services Designer
Configuring Best Record Using Data Services Designer
Match Review with Data Cleansing Advisor (DCA)
Data Quality Assessment for Party Data
Using Data Cleansing Advisor (DCA) to Estimate Match Review Tasks
Creating a Data Cleansing Solution for Multiple Sources