SAP BusinessObjects Data Quality Management (DQM): Enhancements for more fields other than address data for Duplicate Check (Part 2)
These enhancements were successfully implemented in the following systems:
- Customer Relationship Management (CRM) 7.02 SP4
- SAP BusinessObjects Data Quality Management (DQM), version for SAP Solutions 4.0 SP2
- SAP BusinessObjects Data Services 4.0
This blog is relevant for:
- Data Services/DQM consultants who have been asked to customise the default DQM matching and break key generation jobs to include additional fields for duplicate matching.
This blog does not cover:
- Details around Postal Validation; we are only concerned with the Duplicate Check process.
- Details on how to make the required ABAP changes to pass through additional field data to the DQM jobs – see Part 1 of this article for more information.
DQM is an add-on for ERP/CRM systems to provide address validation (SAP calls it Postal Validation or PV) and duplicate checking for Business Partners. After installing the add-on and performing the initial setup you get address validation and duplicate checking via the “BP” transaction and automatic integration into the Account BSP Components out of the box.
The duplicate checking aspect of this product specifically, however, is heavily focused on address fields. It uses fuzzy logic algorithms to compare name and address data between Business Partners in order to determine whether duplicates exist. In some customer-specific cases, this isn’t a valid approach – e.g. in situations where address data is known to be either unreliable or variable, you would want your duplicate checking to happen on other available fields and reduce the emphasis on address.
This is the situation that arose at a particular customer that had purchased the DQM solution. This blog post will cover the steps we took to extend the standard DQM messages between ERP/CRM and the Data Services jobs and the DQM matching algorithm to include additional, customer-specific fields.
For additional background and a more detailed problem definition, see part 1 of this article: http://scn.sap.com/community/data-services/blog/2013/07/08/sap-business-objects-data-quality-management-dqm-enhancements-for-more-fields-other-than-address-data-for-duplicate-check written by Leigh Mason. He covers off the changes required on the ERP/CRM side and the DQM add-on. I will now proceed to detail the considerations on the Data Services side.
Final note before we begin: this article assumes a baseline installation of DQM has been installed (including the required DQM Data Services repositories).
A Break Key (or Match Code – an older term) is an index made up of search terms that are arranged in a specific order. It is used to reduce the number of potential records which are sent to DQM to match against to determine whether a record is a duplicate by filtering records have no realistic probability of being duplicates.
DQM comes with five best practice Break Keys. All of them segment the record set on the country field along with some combination of the other address fields. For example, the default best practice Break Key “Match Code 1” is made up of the two-character ISO country code, the first three characters of the postcode and the first character of the street name.
Since we’re adding additional fields to the matching algorithm and de-emphasising the importance of addresses data matching, we need to re-think the Break Key we use. The Break Key must be generic enough to support the widest possible match scenario you want to support. For example: if you want DQM to indicate a potential duplicate even where the address details between records are completely different (i.e. a potential match is identified based on matching non-address fields), then the Break Key cannot use any address fields to segment the record set – the segment (or “break group”) must be broader than this to capture these potential matches.
One pitfall here is that if your matching algorithm is broad enough, you may run out of options as far as which fields you can use as part of your Break Key. The risk here is that using very wide Break Keys can result in large record sets being sent to DQM for matching leading to performance issues, especially in a real-time processing scenario. Keep this in mind when defining custom Break Keys and refer to Appendix E the User Guide for instructions on using the included Break Key Profiler to assess the effectiveness of your Break Key.
We ended up using the first three characters of the Last Name field processed by a phonetic algorithm as our break key. This is a rather wide index, however it worked with our specific dataset.
Customising the Break Key
To customise the Break Key there are changes you must make on both the ERP/CRM back-end and in Data Services.
- In the Data Services Designer change the “$$DQ_SAP_Match_Function” substitution parameter to “MatchCodeCustom”.
- Change the “CF_DQ_SAP_Calculate_Match_Code_Custom” custom function to consume additional data fields as required and return the Break Key for each record.
ERP/CRM Back End
- If required, create the required Function Modules to pass through additional fields to DQM when calculating Break Keys. Refer to Leigh’s earlier blog for more information on this and the “Understanding break keys and data fields” section in the User’s Guide.
- Break Key values are calculated by running an initial report in ERP/CRM for existing records and real-time on commit for newly created records. If the Break Key algorithm is changed, the Break Key values for existing records must be updated before it will take effect. To do this you must re-run the initial reports to re-generate Break Key Values. Refer to Leigh’s earlier blog for more information on this and the “Run the initialization program” section in the User’s Guide.
Duplicate Check Algorithms
Now that we’ve configured our custom Break Key, let’s look at how we can customise the standard DQM matching algorithms to take into account the additional fields we want to match against.
Accessing Additional Fields
The standard input structure for duplicate check messages passed to DQM is as follows:
As Leigh explained in his blog post, the DQM RFC server places a hard constraint on the messages passed from an ERP/CRM back-end to the DQM services (both in terms of fields and the length of the overall message). The XML structure above cannot be customised (practically speaking).
In order to meet our requirements for matching on fields additional to those listed above, we created a design where we re-purposed fields that weren’t normally utilised in our use case. As we were dealing with Person Business Partner records (e.g. where ADDR_TYPE = “2”), the Organisation NAME1 and NAME2 fields in the structure above were typically blank. Depending on your particular requirements and Business Partner record types in use you may have to repurpose different fields to these.
In our case, the CRM component that calls DQM matching was customised to pipe Birthdate, Middle Name, Gender and Country of Origin data into these two fields (up to 80 characters of data in total). Refer to Leigh’s blog post for details more details on the ABAP side. As a result, we created the following specification for consumption by DQM:
The Data Services Dataflow where the DQM input structure is customised is DF_Realtime_DQ_SAP_Name_And_Address_Match (in the Data Services Job Job_Realtime_DQ_SAP_Name_And_Address_Match). The Create_Compound_Fields Query transform was customised to extract these four pieces of data from the NAME1 and NAME2 fields into their own fields on the output structure. Note: this was only done where the ADDR_TYPE is “2” – this piece of conditional logic is important in order to not break DQM matching functionality for non-Person record types.
Once you have these additional fields in your structure you are free to modify the rest of the Job_Realtime_DQ_SAP_Name_And_Address_Match job to suit your requirements. There are a couple of things to watch out for though – keep reading.
Updating the Match Transform
The Match transform in the DF_Realtime_DQ_SAP_Name_And_Address_Match dataflow performs matching using three match groups: Level_Name_O (Organisation), Level_Name_P (Person) and Level_Address.
The Organisation and Person match group scores are taken into account irrespective of the type of Business Partner record being matched because the Organisation match group is used for name cross-matching for Person records and vice versa. Names are cross-matched into the fields used for matching in the Create_Compount_Fields Query transform – look for the “CompoundNames*” output fields.
Thus, if you’re not replacing the default matching logic but simply enhancing it by adding fields and adjusting weighting, keep this cross-matching process in mind. If you add fields to the Organisation match group, you must do so for the Person match group irrespective of the Business Partner record type you’re interested in. Moreover, be careful to set the “One field blank operation” and “Both field blank operation” comparison rules to “IGNORE” in the field matching options:
This will ensure that if a field is not passed through in either the driver (new un-committed record) and passenger (existing record being matched against) record – as can happen if you’ve got Person-specific fields configured in the Organisation match group for cross-matching, but an Organisation-type record is being processed – then the weighting assigned to these fields will be redistributed equally to the other fields in the match group.
Please validate this thinking to ensure it meets your specific requirements, but this is the setup that we used in our case.
Weightings occur on two levels. The first is on individual fields within a match group in the Match transform discussed above. These can be adjusted within the Match transform in the Data Services Dataflow DF_Realtime_DQ_SAP_Name_And_Address_Match.
Subsequently, the match groups themselves are weighted and summed in order to calculate an overall match score. This weighting and the overall match score calculation is performed in the CF_DQ_SAP_Calculate_Match_Score Custom Function.
This Custom Function has two key elements to consider: by default it assigns a 30% weighting on the Person or Organisation match group score (e.g. “Name” data only in the out-of-the-box configuration) and 70% weighting on the Address match group score. This is something you will need to re-evaluate in conjunction to the relative weightings you give to each field within the Match transform as it will impact on the overall contribution of a particular field to the final match score DQM passes back to the back-end system.
Also, it has procedural rules to take the higher of the Person or Organisation match group score (as it assumes cross-matching was performed, as discussed in the section above). This may also have to be adjusted depending on your matching setup.
In our case, we simply adjusted the high-level weightings to 90% on the Person match group and 10% on the Address match group but largely kept the function the same.
This Custom Function is also where any records with a match score below a particular threshold value are excluded from the outgoing result set:
#if the matchScore is below the Threshold we set it to -1 and will drop the record in the next transform.
IF ($MatchScore < $Threshold)
$MatchScore = -1;
Records with a match score of “-1” are then filtered out in the subsequent Query transform. While this functionality can be bypassed or customised, by default the $Threshold is passed to DQM by the back-end system and is configurable via the DQM IMG in either ERP or CRM as part of the initial DQM setup (refer to the User’s Guide for more information on this).
The threshold value you use should be determined by your requirements and matching rules – e.g. determine what the lowest possible match score, given your particular matching rules, should constitute a potential duplicate that is presented to an end-user for their consideration.
Note: after making and saving your changes to the Dataflow (Match Transform) or Custom Function in Data Services you must restart the corresponding Realtime Service (Service_Realtime_DQ_SAP_Name_And_Address_Match) in the Data Services Management Console before changes take effect.
Often times when the business wants to customise the matching process in DQM they will present their requirements as a list of procedural rules – e.g.: if fields a, b and c match then return 100% match; else if fields a and b match return 66%, else if fields a and c match return 50%, etc.
This is slightly different, conceptually, to the way DQM performs matching by default. DQM comes with a standard matching algorithm (using the Data Services Match transform) that considers all input fields in one pass. Matching is granular – i.e. a similarity score is calculated for each field instead of a binary match/no-match result assumed by the example above – and happens in stages. In the first stage, asimilarity score is calculated for fields that are grouped together (e.g. Person, or Address fields) and in a subsequent stage, these group match scores are weighted again to determine the overall match score for the full record.
While the matching logic used by DQM can be completely replaced by procedural logic that literally follows customer requirements, it often doesn’t make sense to rip out standard DQM functionality for a number of reasons:
- Requirements can be fully met by delivering substantially similar or better results can be simply by carefully setting the field weightings used during matching
- Instead of a binary match/no-match result for each field comparison, using DQM-style matching logic can return a more granular similarity score for each field (e.g. a First Name match can be, say, 60% similar instead of simply 0% (not the same) or 100% (the same))
- Similarity scores are calculated using data type-specific functionality in Data Services where possible (i.e. Data Services uses specific algorithms and fuzzy-logic matching for names, dates, addresses, etc.), leading to improved accuracy and a higher likelihood of finding matches where duplicates exist
- DQM considers all of the input fields in one pass reducing the need to build difficult to maintain decision trees
- Matching via weighted similarity scores is infinitely more adjustable as weightings can be tweaked throughout testing and even when the solution has gone live
- This framework is delivered in the out-of-the-box DQM installation and leveraging it therefore saves a lot of time and effort
I include this section in this write up as this is a hurdle that we had to cross with our customer – but one I believe we were better off for crossing as it resulted in a more robust solution that didn’t require us to remove standard DQM functionality (on the Data Services side) and replace it with custom rules.
Without going into too much detail, we went through an analysis exercise where we decomposed the business rules we were provided and came up with field similarity score weightings that would result in the desired outcome in each case identified by the business.
Indicatively, the weightings (and fields) we ended up going into testing with looked like this (note the significant reduction in the emphasis on address data):
Summary and Final Thoughts
The DQM add-on combined with the Data Quality functionality of Data Services is a really powerful option for ensuring the ongoing maintenance of data quality levels in operational systems. The out of the box functionality will be 100% fit with requirements for some customers and thus represents a great option that can simply be “plugged in”.
Between Leigh Mason’s original blog post on customising this solution and this follow-up, we have collectively shown how you can customise the solution where the out of the box fields and matching algorithms do not meet customer requirements. Our customisations aimed to minimise enhancements and attempted to leverage the core DQM functionality as much as possible while still meeting our customer’s specific matching requirements.
Ultimately, however, it is still important to ensure a good fit between requirements and DQM’s capabilities up-front before going ahead with an implementation. While DQM is easily customisable in the ways that we have demonstrated, attempting to circumvent its core matching is a more involved, error-prone and time-consuming endeavour and certainly not recommended.
This requirements analysis, however, should be performed both from a bottom-up perspective and a top-down perspective simultaneously – i.e. consider the approach DQM takes to match for duplicates and assess whether it represents a better way of meeting the same matching outcomes as the customer’s stated requirements. That is certainly what we discovered at our customer.
And finally, as Leigh mentioned, DQM should be implemented in conjunction with an overall master data governance strategy – it’s not a replacement for one!