Data Quality Assessment for Party Data
Data Cleansing Advisor has the tools available to do an in-depth data quality assessment for party data. The following entities can be identified and cleansed by Data Cleansing Advisor: address, person, title, firm, phone, email, date and SSN. A single person, address, firm and title can be cleansed per record within the input source and 0-6 emails, phones, dates and SSNs can be cleansed. This section will explain the cleanse results and the tools included
within Data Cleansing Advisor to do a data quality assessment.
The initial UI when navigating to the cleanse results displays a bar chart that helps show you the quality of data that has been cleansed and the impact of the cleansing rules themselves. The bar charts start at a high-level (rows) and can drill-down to low-level (column) details. As you drill-down through the charts, a filter (e.g. Suspect Phone -> Phone1 -> Invalid) is automatically created to view the cleansed results. The bar charts are displayed on an entity-by-entity basis and navigation between each entity is as simple as selecting a different one from the dropdown box.
Date – Cleansed Data
The Date bar chart will initially show the records that are considered to be blank, suspect and cleansed. The same is true for Person, Title, Firm, Phone, Email and SSN. The “Cleansed” bar chart represents rows of records where a date (or other entity type) was parsed and successfully cleansed. The “Suspect” bar chart represents rows of records where a date (or other entity type) was parsed, but the value was either invalid, the result is of low confidence or additional data was found that could be skewing the result. Finally, the “Blank” bar chart simple represents rows of records where a date (or other entity type) did not have a value (null/blank). Images displayed within this section will give you a sample of the filters that can be created when reviewing the cleanse results bar charts.
The filter (Date: Cleansed -> Date4 -> No/minor change) displayed below will allow you to immediately see the rows of data where the input column “DATE4” had valid date’s that either had no change or a minor change to standardize the format. Details for each date input field (up to 6) will be displayed within
Being able to identify , on a column-by-column basis, where blank data exists or invalid data has been entered will help with the remediation of the issue.
Email – Blank Data
The image below shows all email columns (“EMAIL1”, “EMAIL2”, “EMAIL3”, “EMAIL4”, “EMAIL5” and “EMAIL6”) from the input source that do not have a value. Invalid or empty personal information is a data quality issue – especially when working with party data. You want to ensure that you customers,
suppliers or vendors can be contacted and their identifying information is complete.
After realizing that most of the email columns do not have a value, it may be a good idea to create a validation rule within Data Insight. It may not be apparent, but a Data Steward working with this data set may very well know that the input field “EMAIL1” is used for contacting customers. This field is only composed of blank value 6% of the time. A validation rule that ensures this column is not blank can be created within Data Insight and bound to this input source for future profiling purposes.
Phone – Suspect Data
Phone entity type cleansing is supported for both U.S. and international phone formats. The image below shows that 7 records within the data set had invalid data within the “PHONE1” field. Again, data remediation can be used to correct this issue, a basic transformation could be added to massage the data to a correct format or a validation rule could be derived from values we (and you) deem to be invalid.
Address – Invalid
The bar charts to show the impact of address cleansing vary slightly from the person, email, date and phone entity types. Address data for each row will initially be split into 3 categories: valid addresses, invalid addresses and corrected addresses. The data is split within each category using a combination of the reference data that is being used along with what happened during the address cleansing process.
The same functionality exists for the address cleanse results and what’s available for non-address cleanse results; being able to drill-down into the results from the row level down to the column level. The exception with address cleanse results is that the bar chart goes down to the component level since address information can be contained across multiple input columns of data. The filter (Invalid Addresses -> Address Line -> Street number) will return the 360 records that have an invalid street associated with them.
Drilling-down into the address results will give you very specified examples of what is invalid and what is valid (both valid addresses and corrected addresses). Data remediation or validation rules can be discovered using the address charts to determine why there is no address line or why no street numbers sometimes do not exist.
Data Cleansing Advisor will help you fully understand the impact the cleansing rules had on your input data and also the data quality issues that exist within your data. Being able to quickly and easily identify this information will help identify the processes that could be improved and to have a higher level of
data quality in the future.
Data Cleansing Advisor Best Practices Blog Series
Determining Duplicates and a Matching Strategy
Publishing to Data Services Designer
Configuring Best Record Using Data Services Designer
Match Review with Data Cleansing Advisor (DCA)
Data Quality Assessment for Party Data
Using Data Cleansing Advisor (DCA) to Estimate Match Review Tasks
Creating a Data Cleansing Solution for Multiple Sources