Matching Methods in Data Services Match Transform
Matching Methods:
There are three matching methods are available in Data Services match transform for setting up the match criteria.
1. Rule Based method
2. Weighted Scoring method
3. Combination method
Match Method |
Description |
Rule Based |
Allows controlling which criteria determines the match. |
Weighted Scoring |
Allows assigning importance or weight to any criteria. |
Combination Method |
Combines the rule-based and weighted scoring methods of matching. |
1. Rule Based Method:
With rule-based matching, we rely only on our match and no-match scores to determine matches within the criteria.
The following example shows how to set up this method in the Match transform.
Criteria |
Record A |
Record B |
No Match Score |
Match Score |
Similarity Score |
First Name |
Mary |
Mary |
82 |
101 |
100 |
Last Name |
Smith |
Smitt |
74 |
101 |
80 |
E mail |
79 |
80 |
91 |
By entering a value of 101 in the Match score for every criterion except the last, the First Name and Last Name criteria never determine a match
By setting the Match and No-Match score for the E-mail criteria with no gap, any comparison that reaches the last criteria must either be a match or a no-match.
A match score of 101 ensures that the criterion does not cause the records to be a match, because two fields cannot be more than 100 percent alike.
In rule-based match scenarios, we should avoid gaps between the Match Score and No Match Score.
In the output file under “Match Type” we can find values D and R.
2. Weighted Scoring Method:
In a rule-based matching method, the application gives all of the criteria the same amount of importance (or weight). That is, if any criterion fails to meet the specified Match-Score, the application determines that the records do not match. When we use the weighted scoring method, we are relying on the total contribution score for determining matches, as opposed to using match and no match scores on their own.
Contribution Values:
Contribution values are our way of assigning weight to individual criteria. The higher the value, the more weight that criterion carries in determining matches. In general, criteria that might carry more weight than others include account numbers, Social Security numbers, customer numbers, Postcode1, and addresses.
Total 100. All contribution values for all criteria that have them must total 100. We do not need to have a contribution value for all of our criteria.
We can define a criteria’s contribution value in the Contribution-To-Weighted-Score option in the Criteria-Definition option group.
Contribution and total contribution score
The Match transform generates the contribution score for each criteria by multiplying the contribution value we assign with the similarity score (the percentage alike). These individual contribution scores are then added to get the Total contribution score.
Weighted Match Score
In the weighted scoring method, matches are determined only by comparing the total contribution score with the weighted match score. If the total contribution score is equal to or greater than the weighted match score, the records are considered a match. If the total weighted score is less than the weighted match score, the records are considered a no-match.
We can set the weighted match score in the Weighted Match Score option of the Criteria Match Spec option group.
Weighted scoring example
The following table is an example of how to set up weighted scoring. Notice the various types of scores that we have discussed. Also notice the following
· When setting up weighted scoring, the No Match Score must be set to -1, and the Match Score must be set to 101. These values ensure that neither a match nor a no-match can be found by using these scores
· We have assigned a contribution value to the E-mail criteria that give it the most importance.
Criteria |
Record A |
Record B |
No Match |
Match |
Similarity |
Contribution Value |
Contribution Score |
First Name |
Mary |
Mary |
-1 |
101 |
100 |
25 |
25 |
Last Name |
Smith |
Smitt |
-1 |
101 |
80 |
25 |
20 |
|
-1 |
101 |
91 |
50 |
46 |
||
Total Contribution |
91 |
In this example, the total contribution score is 91. If the weighted match score is 90 or less, the records are considered a no-match.
In the output file under “Match Type” we can find values D and W.
3. Combination Method:
This strategy combines the rule-based and weighted scoring methods of matching.
· A no-match can be determined by the similarity score of any criteria not equaling or exceeding the no-match score. However, a match cannot be determined by the match score (we must have a match score of 101).
· A match can be determined only by comparing a total contribution score with the weighted match score.
Criteria |
Record A |
Record B |
No Match |
Match |
Similarity |
Contribution Value |
Contribution Score |
First Name |
Mary |
Mary |
59 |
101 |
100 |
25 |
25 |
Last Name |
Smith |
Hope |
59 |
101 |
22 |
N/A |
N/A |
|
49 |
101 |
N/A |
N/A |
N/A |
||
Total Contribution |
N/A |
Hello Durai,
I just read your article about 'Matching Methods in Data Services Match Transform' and tested them.
The question that I have is: How do I add/ get the 'total contribution score' within the weighted scoring method? (I already defined the contribution scores)
Thanks in advance.
Best regards,
Mourice
Hi Mourice,
You have to assign contribution score for each criteria and should not be greater than 100%. You have to assign the weighted score on the level. If you click on the level you will see the weighted score option.
Thanks & Regards,
Ramana.
Hello Durai,
I did not get Similarity Score concept . Could you please explore more about this.
Regards,
Shazia
Hi Shazia,
Similarity score calculated by match fuzzy logic based on match transform options. Match score and no match score are the threshold values to considering the duplicates.
Thanks & Regards,
Ramana
Hi Venkat Ramana,
I am trying to find the duplicates by using the match transform.
EX:Source
Name
Soumyadeep
Soumyadeep
Somyadip
Ajay
Ajoy
In the above example
Can you explain Comparison rules in match Transform
Can you please help me that what exact match score it will get with an example of screen shots.
Thanks & Regards
Venkat
Hi Venkat,
Based the match score you will get the match status,match type etc.
if you are doing duplicate check on single field and you are getting only 100% match records means you have to change the match score limit. If you are using the field similarity it will check the character by character match and calculate the similarity score . If similarity score is greater than or equal to the match score then it give the duplicate records with the similarity score else it will go the unique records. you have to change the below settings for this.
Please share with me about your match transform settings screenshot like above for the your sample input data. I will check and if possible I will correct it.
If you need more information please contact me through mail.
Thanks & Regards,
Ramana.
Hi
In the above example how we can get exact (match score ,no match score and total contribution score)
Can you please explain with screen shot or detail view.
Regards,
Venkat
Hi Venkat Ramana,
I have gone through the screen shot which you shared with me but SET1_LEVEL_MATCH_SCORE is showing as 100% and i am not able to find exact match score.
Can you please share the screen shot of which you are getting exact match score.
Regards
Venkat
Hi Venkat,
Try to change the match score to 65 from 95 . no match score always should be less than match sore . Remaining two match scores equals to no match score.
Change the match score and run the job. I thought tour data similarity score is less than 95.
Thanks & Regards,
Ramana.
Hi Ramana,
I have tried with match score 65 and no match score 30 but it is still not getting the exact match score.
Regards
Venkat
Hi Venkat,
Please provide me your sample data. I will try and let you know .
Thanks & Regards,
Ramana
Hi Ramana,
This is the sample source file.
Source file:
Name
Soumyadeep Mitra
Saumya Meetra
Soumya
Venkat Chowdhury
Venkata Choudhri
Venkat
Ajay Gupta
Ajoy Guptu
Ajay
Manish Sing
Maneesh Singhas
Regards
Venkat
Hi Venkat ,
I didn't notice your screenshots properly.
There was a mistake in the match properties .
You have given the break group for input as shown below.
You have only one field but you gave the group with length 11.
This is almost equal to the your field values max length.
break group creates the group based on length and check the duplicates with in the group . In your case as you given the length 11 , each record is separate group until it has the exact match( 100%). With in the group finding the duplicates means here each group will contain only one record then it gives as unique record .
Try to remove the break group . For your scenario no need of any break group.
Please find the match transform settings and results .
for the field similarity.
Result :
In the above result , you missed few names like Ajay, Soumya because it has partial names . In this case you can use word similarity it will compare word by word instead of character by character( Filed Similarity).
Result :
I hope you can understand .
If you need any clarification feel free to contact me .
Thanks & Regards,
Ramana.
Hi Ramana,
Now i am getting the Match_Score.
Thanks for your support.
Regards
Venkat
Hi Ramana,
If suppose we are having fields like
Name Address Email
Venkat hyd abc@ibm.com
VenkatRao hyderabad abc@cts.com
soumyadeep mumbai def@wipro.com
Soumyadip delhi pqr@cgi.com
soumya kolkata xyz@pwc.com
Ruthvick bang gdh@cts.com
Sathvick chennai aaa@bbb.com
- what is the exact match score.
- Can you please share the screen shot of MATCHING LEVEL in Match Transform
I am not understanding . What you require .
What is the meaning of the question ?
- what is the exact match score.
- Can you please share the screen shot of MATCHING LEVEL in Match Transform
Hi Ramana,
I want to find out the match score in the above 3 field names for finding the duplicates.
I have done some examples but it is not showing the exact match score.
Lets Showing some outputs which i was facing.
In the above output
- "Beach" ,"Jasch" and "Zacch" are different names but it is generating as single group in set 1_Level_Group_Number. (2)
- "JPMOR" and "MORGA" are different names but it is generating as single group in set 1_Level_Group_Number. (6)
- "Power" and "Moder" are different names but it is generating as single group in set 1_Level_Group_Number. (7).
Can you please Share the MATCHING LEVEL options in in Match Transform
Regards
Venkat
Hi Ramana,
Can you please share the Match Transform Screen Shots what ever the modification you have done for finding the duplicates by using Multiple Field Names.
Regards
Venkat
Hi Venkat ,
You were giving two contradictory things .
1. First You have mentioned that you need to find the duplicates on three fields on the above data. As per your
Name Address Email
Venkat hyd abc@ibm.com
VenkatRao hyderabad abc@cts.com
soumyadeep mumbai def@wipro.com
Soumyadip delhi pqr@cgi.com
soumya kolkata xyz@pwc.com
Ruthvick bang gdh@cts.com
Sathvick chennai aaa@bbb.com
2. In the below you gave some other output and asking for improper result.
But it is more confusing me .
Finding duplicates on the single filed is different. If you want do match transform on multiple fields you can merge the fields and find the duplicates . If you want to use weighted score you can find the duplicates on multiple fields. Normally it will give duplicates for the first criteria only.
The other question you were asking why we are getting wrong results means you are using the match score is like that you were getting .
If you are using field similarity it will check character by character and calculate the similarity score. For your match score may be greater than or equal to similarity score.
"JPMOR" and "MORGA" are different names but it is generating as single group in set 1_Level_Group_Number. (6)
In this case 3 characters matching and similarity score around 60% .
Match transform is not straight forward transform unlike other integrator transforms or platform transforms. You have to study your requirements and you have to do .
Other thing you were asking my screenshots. I didn't understand that one also . Why you need screenshots for my match. Match options completely depend on requirements. That day you gave different things and I worked on that and shared the screenshots. Those screenshots will not workout for other as requirements are completely different.
last post requirement is same as what I gave it you before . Input is different . requirements may vary . According to that we need to change the options . But you should understand the match options first before doing development.First go through technical manual with examples .
Thanks & Regards,
Ramana.
Hi,
We have created a real time scenario where we are comparing data from web service (nested XML) and CRM tables data and generating Unique and duplicate Records using Base_Match transform of Data Quality.
Job is validated successfully , but when we are executing the job with sample data , we are getting below error:
<Table> for column <fieldname> doesn't occur in the FROM clause, for a top level query all columns must belong to some table in the from clause
Please help us in sorting this error and let us know if any other information is required from our end
Attached documents
1.Full job
2.Data from XML file
3.Data from CRM table
Joining condition as :Query.nameSur=Merge.nameSur and Query.ElectronicAddress.eAddress=Merge.eAddress
Thanks,
Venkat
Hi Venkat,
Check the from class of the query_11 transform and check merge and query schema is there or not?
Thanks & Regards,
Ramana.