When searching for data analysis products, you will find many companies that offer solutions, but each calls its data analysis activity something different: discovery, assessment, analysis, profiling, exploration, diagnostics, or health check.
Some of these companies may even split the analysis into structural analysis and content analysis.
Ultimately, they all wind up providing the same set of statistics.
So what is the difference after all?
The difference lies not in the statistics themselves, but in how those statistics are used. The essence of the analysis is taking elements such as pattern analysis, distinct value analysis, and fill rate, and focusing on those elements that represent data defects which affect your business negatively.
For example, a “fill rate” of 98% is just another number, but when you show that 570 records with a missing industry sector could affect the accuracy of reports and lead to poor decisions, it becomes a business-relevant observation.
It may be good to know that there are 25 different patterns of phone numbers, but if one of those patterns is not valid (and the phone numbers matching it are therefore incorrect), communication with those customers will most likely fail.
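The two measurements above can be sketched in a few lines of code. This is a minimal illustration, not a vendor tool's implementation; the records and the field names `industry_sector` and `phone` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical sample records; field names are assumptions for illustration.
records = [
    {"industry_sector": "Retail", "phone": "+1-555-0100"},
    {"industry_sector": None,     "phone": "(555) 010-0101"},
    {"industry_sector": "Energy", "phone": "555.010.0102"},
    {"industry_sector": "Retail", "phone": "+1-555-0103"},
]

def fill_rate(records, field):
    """Fraction of records where the field is present and non-empty."""
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)

def pattern(value):
    """Generalize a value: every digit becomes 9, every letter becomes A,
    punctuation is kept, so values with the same shape collapse together."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))

def pattern_counts(records, field):
    """Count how many records share each generalized pattern of the field."""
    return Counter(pattern(r[field]) for r in records if r.get(field))

print(f"industry_sector fill rate: {fill_rate(records, 'industry_sector'):.0%}")
for pat, count in pattern_counts(records, "phone").most_common():
    print(f"{pat!r}: {count}")
```

Running this on the sample data reports a 75% fill rate for `industry_sector` and three distinct phone patterns; an unexpected pattern in that list is exactly the kind of finding worth tying back to a failed customer contact.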
Needless to say, missing or inaccurate data causes a high volume of customer returns, collection problems (due to incorrect invoicing), excess stock, and delayed customer orders.
The tools that deal with data quality analysis produce a huge amount of numbers. Using them all is not efficient: you can't see the forest for the trees. You need to sort through these numbers, disregard those with little relevance to your business processes, and focus on the subset that shows substantial damage to processes that rely on your data. Take these selected numbers and show how they indicate data quality issues that affect your business.
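The triage step described above can be sketched as a simple filter. The impact scores here are stand-ins for the judgment of a domain expert, and the metric names and threshold are assumptions, not output of any particular tool.

```python
# Hypothetical profiling output, each metric paired with an expert's estimate
# (0-10) of the business damage it indicates.
metrics = [
    {"name": "industry_sector fill rate", "value": "98%", "impact_score": 9},
    {"name": "unexpected phone patterns", "value": "4",   "impact_score": 8},
    {"name": "distinct currency codes",   "value": "3",   "impact_score": 1},
    {"name": "average field length",      "value": "17",  "impact_score": 0},
]

RELEVANCE_THRESHOLD = 5  # assumed cut-off; agree on it with your domain expert

def triage(metrics, threshold):
    """Keep only metrics whose estimated business impact clears the threshold,
    most damaging first; everything else is left out of the analysis report."""
    kept = [m for m in metrics if m["impact_score"] >= threshold]
    return sorted(kept, key=lambda m: m["impact_score"], reverse=True)

for m in triage(metrics, RELEVANCE_THRESHOLD):
    print(f"{m['name']}: {m['value']} (impact {m['impact_score']})")
```

Of the four sample metrics, only the two with tangible business impact survive the filter; the rest are the "trees" that would obscure the forest.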
This procedure can be compared to taking raw material (the numbers) and turning it into finished goods (the analysis). This production-like process is a mixture of automated and manual methods: the automated part is provided by the analysis tools, while the manual part depends on a knowledgeable person packaging the analysis results.
In short, a tool by itself is useless without significant human involvement.
Now you have the data quality analysis tool and a knowledgeable domain data expert. How do you make the most of both?
- Focus on the data quality issues that show tangible damage to your organization.
- Wherever possible, show that eliminating the defective data reduces expenses and increases revenue.
- Avoid excessive use of numbers.
By following these few steps when analyzing data, you can pave the way for a successful data quality program:
- Start with data quality analysis.
- Continue by planning, implementing, and monitoring quality improvement steps such as standardization, cleansing, deduplication, and classification.
- Complete the cycle by establishing a governance program that maintains the high quality level you have reached.
For additional details on data quality analysis of master data, please refer to:
Information sheets in SMART