In many projects it is essential to know and understand your data. It is equally essential to store this information and share it among all stakeholders, yet few tools on the market support this. Many companies don’t know what’s inside their systems and cannot measure data quality, even with manual and costly operations.
For these reasons, we decided to use PowerDesigner as a data profiling tool by creating a specific, reusable extension.
Why this choice? PowerDesigner performs reverse-engineering efficiently, you can use it to store data models in a repository, and you can share information with all stakeholders through the portal server or PowerDesigner Viewer. PowerDesigner already has an “Update Statistics” function, but we wanted complete control.
The first step was to define our goals:
- What is the data quality of my sources?
- How can I control all data profiling parameters?
- How can I add checks on vertical and horizontal constraints?
- How can I make analysis results shareable with other teams?
- How can I automate the analysis?
- How can I improve data knowledge?
- How can I determine appropriate data quality and cleansing rules in the early stages of a project?
- How can I easily share information with everyone involved in a project?
The next step was the realization…
Phase 1: create extended model
In this phase we create a physical extended model and extend the metaclasses for tables, columns, views, view columns, and Business Rules, plus a data source definition.
As you can see, the elements defined here are quite classical for data profiling, except for Business Rules and Values, which need some explanation. “Values” is used to capture all values of a field when the number of distinct values is lower than a given threshold. Business Rules check more functional constraints; for example, in a table containing several insurance policies, you may check the referenced dates depending on the policy type (Life/Non-Life). The “Data Source” element specifies the ODBC connection that will be used for data profiling.
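To illustrate the “Values” element (the extension itself is written in PowerDesigner VBScript; this is only a Python sketch, and the threshold name is an assumption, not the extension’s actual code):

```python
# Illustrative sketch of the "Values" capture: keep the full value list
# of a column only when its distinct count stays under a threshold.
# DISTINCT_THRESHOLD is an assumed model-level parameter.

DISTINCT_THRESHOLD = 20

def capture_values(rows, threshold=DISTINCT_THRESHOLD):
    """Return the sorted distinct values of a column, or None when
    there are too many distinct values to be worth storing."""
    distinct = set(rows)
    if len(distinct) <= threshold:
        return sorted(distinct)
    return None

# A low-cardinality column is captured...
print(capture_values(["Life", "Non-Life", "Life"]))  # ['Life', 'Non-Life']
# ...a high-cardinality one is not.
print(capture_values(range(1000)))                   # None
```

Storing the full value list only for low-cardinality fields keeps the model small while still documenting code-like columns (types, statuses, flags).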
Phase 2: retrieve ODBC information
In order to run data profiling, you must be able to connect to your database. ODBC information can be retrieved from the Windows registry, but it needs some handling: you have to analyze the registry keys to determine the configuration of a particular data source (presence of DSN, SERVER, DATABASE, HOST, UID, PWD, …).
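On Windows, system DSN entries live under `HKLM\SOFTWARE\ODBC\ODBC.INI\<dsn name>` and would be read with a registry API; the “handling” is assembling a connection string from whichever keywords are actually present. A Python sketch of that assembly step (function and parameter names are illustrative assumptions):

```python
# Sketch: build an ODBC connection string from registry values already
# read for one data source, keeping only the keywords that exist.
# The registry reading itself (e.g. via the stdlib winreg module on
# Windows) is omitted; names here are illustrative assumptions.

def build_connection_string(dsn_name, reg_values, uid=None, pwd=None):
    """Assemble an ODBC connection string, skipping absent keywords."""
    parts = ["DSN=" + dsn_name]
    for key in ("SERVER", "DATABASE", "HOST"):
        if key in reg_values:
            parts.append(f"{key}={reg_values[key]}")
    if uid:
        parts.append(f"UID={uid}")
    if pwd:
        parts.append(f"PWD={pwd}")
    return ";".join(parts)

print(build_connection_string(
    "SalesDW", {"SERVER": "sql01", "DATABASE": "sales"}, uid="profiler"))
# DSN=SalesDW;SERVER=sql01;DATABASE=sales;UID=profiler
```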
Phase 3: implement parameterization
Using PowerDesigner’s forms and menus capabilities, we make it possible to customize data profiling from the data model level down to table/view and column elements.
Here’s the data model form definition:
Here’s the display:
As you can see, many elements can be customized: the size of strings to be handled, the number of records to be checked (which turns the run into database sampling), and you can propagate these parameters to all tables and views. Nearly the same menu exists at table/view level and at column level, so with this approach you can control every parameter.
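The propagation described above can be pictured as model-level defaults that cascade down unless a table or view overrides them. A minimal Python sketch, with parameter names that are assumptions rather than the extension’s actual attribute names:

```python
# Sketch of parameter propagation: model-level profiling parameters
# cascade to tables/views unless overridden locally.
# Parameter names are illustrative assumptions.

MODEL_DEFAULTS = {"max_string_size": 255, "record_count": 0}  # 0 = full table

def effective_parameters(model_params, table_overrides=None):
    """Table/view settings win over model-level defaults."""
    params = dict(model_params)
    params.update(table_overrides or {})
    return params

# The model default profiles full tables; one large table is sampled.
print(effective_parameters(MODEL_DEFAULTS, {"record_count": 10000}))
# {'max_string_size': 255, 'record_count': 10000}
```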
Phase 4: execute data profiling
We defined several methods at the data model, table, and view levels.
Depending on the data model’s database type, you have to customize the generated SQL. For example, in the small piece of code below, we created a generic function that builds the ad hoc SQL. This allows the same extended model to be reused with any database supported by PowerDesigner.
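The original function is not reproduced here; the idea can be sketched in Python (the extension is VBScript, and this dialect list and function name are assumptions): the row-limiting clause is the part that varies per DBMS, so one function derives the profiling query from the model’s database type.

```python
# Sketch of "generic SQL" generation: the sampling clause differs per
# DBMS, so one function builds the ad hoc profiling query from the
# database type. Dialects and names are illustrative assumptions.

def profiling_query(dbms, table, columns, sample_size=0):
    """Build a SELECT for profiling, sampled when sample_size > 0."""
    cols = ", ".join(columns)
    if sample_size <= 0:                      # full-table profiling
        return f"SELECT {cols} FROM {table}"
    if dbms in ("SQL Server", "Sybase ASE"):  # TOP n syntax
        return f"SELECT TOP {sample_size} {cols} FROM {table}"
    if dbms == "Oracle":                      # ROWNUM filter
        return f"SELECT {cols} FROM {table} WHERE ROWNUM <= {sample_size}"
    # MySQL, PostgreSQL, and most others accept LIMIT
    return f"SELECT {cols} FROM {table} LIMIT {sample_size}"

print(profiling_query("Oracle", "POLICY", ["POLICY_TYPE", "START_DATE"], 1000))
# SELECT POLICY_TYPE, START_DATE FROM POLICY WHERE ROWNUM <= 1000
```

Centralizing the dialect handling in one function is what lets the rest of the extension stay database-agnostic.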
Transforming PowerDesigner into a data profiling tool was great fun: we can connect to virtually any database, perform sampled, complete, or customized data profiling, prepare reports in Excel, store the information within data models, and share it through the repository at any customer’s site.
A white paper is available on our web site.