This document will explain address cleansing using custom cleansing package which include cleansing of CITY, COUNTRY and PINCODE
Custom Cleansing Package :
In custom cleansing package , developer can parse the data & define static standard forms as well as Regular Expressions.
Suppose, User have address list of customers from legacy database, and that address list is like –
1 1 113 Germantown Ave
PHILADELPHIAS US 19115
Under Location column we have combination of “CITY COUNTRY and PINCODE”. Now User has a requirement that in target database Location should be broken into 3 parts / columns and data should be stored like:
1 1 113 Germantown Ave
Also in target database User want to have standardization of data like:
1. Postal Code: Consider only those postal codes which are 5 digits long.
2. City & Country grouping: There should be one single convention for country & city. For example in legacy data, user has “US “and “United States”. But in target, User just wants to see data in one convention i.e. United States .
Therefore, conversion of non stand forms into standard form.
Cleansing Package Builder: Data cleansing can be done through information steward under Cleansing Package Builder tab, as shown in below screen shot:
- Log in Information Steward and click Cleansing Package Builder.
- Under Cleansing package builder , click on New cleansing Package drop down and Custom Cleansing Package as shown below :
3. Create Custom cleansing Package: Give the desired name to cleansing package. Also a sample file is also required on which will include sample data of source system .
Browse the file as shown below:
4. Select the sample file from the local system and click Open
Sample file data will be available while creating CP and standardization.
5. Select the column which you want to parse and standardize. In this case we have “LOCATION ” which holds the list of (CITY, COUNTRY and PIN CODE)
6. Select all rows to analyze while building CP and click NEXT
7. Select the way by which you want to parse the data. In our scenario , we want to parse data with white spaces
8. Create a Category under which you will define attributes like CITY, COUNTRY, etc.
9. Custom cleansing package is created as shown. But we still need to standardize the data
10. Once you double click the CP_Cust , you will get below screen :
11. Add Attributes under Location. Click add to create attributes of LOCATION that is CITY COUNTRY and PINCODE
13. Once you have added attributes you can see those attributes appearing under “INPUT SAMPLE RECORDS” tab.
Below screenshot shows three different columns CITY , COUNTRY and PINCODE:
Now Data needs to be filled under these three columns with the help of “Input Sample Records”.
As shown in below screen shot PHILADELPHIAS should come under CITY column, US under COUNTRY and 19115 under PINCODE
Also, US needs to be replaced with United States (Standard form)
14. Select “CITY” and add Standard forms(which user want to see in target system)
Click add to include PHILADELPHIAS as standard form as shown below:
As and when PHILADELPHIAS added under standard forms same will be reflected under Variations and more Values can be added for one standard form. Suppose in legacy system there are chances that few records are with PHILADELPHIA , So this will also come under variation.
Click add to include PHILADELPHIA under variation panel
Also in above screen shot, it can be seen that now under CITY column PHILADELPHIAS is appearing.
In similar fashion it is required to add all the CITY names (which are expected to be in Source System) under Standard Form and their respective Variations.
Once standard forms and their Variations included for CITY, data will appear under CITY column as shown below :
15. Once standard forms are included for CITY, in similar way standard forms for COUNTRY needs to added.
In the Input Sample data we have US & we want it to appear as UNITED STATES as a standard form .
Select COUNTRY and click add against Standard Forms as shown below:
16. Include UNITED STATES under standard Forms and enter
17. Include US under Variation and click enter as shown below:
18. Under COUNTRY tab UNITED STATES appearing as shown below :
19. Once standard forms for COUNTRY are added, we move on to cleansing of PINCODE.
PINCODE cleansing is different from CITY & COUNTRY because PINCODE is a 5 digit no. and there will be many combinations for standard forms.
So to make cleansing simple we will use regular expression.
Select PINCODE and click on Define Context as shown below :
20. When you click Define Context you will get below screen , click add to define Regular Expression
21. Add context for PINCODE
22. Add Component
22. Give name to component as PINCODE_1 and select pattern as shown below :
Click define pattern button
23. Create regular expression and test and validate it, as shown in below screen shot:
For PIN CODE , we need 5 digit number only , therefore , regular expression will be as shown below:
24. Click OK to save the context
25. Once you save the context , you can see the required 5 digit PINCODE appearing under PINCODE column as shown below :
26. After adding regular Expression for PINCODE click Cleansing Package Tasks as shown below :
27. Publishing Custom Cleansing Package: Once you have created cleansing package , it will be available to you as a local copy and if you want that it should be visible to others and also at BODS side , you need to Publish it.
Select the Cleansing Package and click Publish as shown below
28. Give description and click Publish :
29. Published Cleansing Package will appear in published panel as shown below :