Address Cleanse Best Practices for India
The global address cleanse transform deployed within Data Services or DQM SDK supports a wide variety of input fields that can be mapped to an incoming data source. The input fields can be categorized as multiline, discrete or hybrid.
Multiline fields are generally used when the incoming data is in a mixed format and is not properly fielded within the source itself. This format of data is not widely used. The global address cleanse transform names these fields “MULTILINE[1-12]”.
Discrete fields within the global address cleanse transform represent data in an input source that is properly fielded. This is the most typical format of data used by our customers. For this use case, the input source will have address data in columns that generally describe the data component such as locality, region, country and address line. These fields are named “LOCALITY[1-3]”, “REGION”, “POSTCODE”, “COUNTRY”, “ADDRESS_LINE”.
The hybrid format uses both multiline and discrete fields to give greater flexibility to map the data from the input source to the global address cleanse transform.
The standardized and cleansed results may vary slightly depending on which input format is used to map the data into the global address cleanse transform. When the input data does not have an exact format (MULTILINE), the transform must do additional parsing to determine what the various address components are. For that reason, the discrete method is generally more accurate.
The goal of the different input formats that are supported is to give flexibility to the developers leveraging the global address cleanse transform. The same can be said about the output fields that are supported, which will be covered below.
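As a rough illustration of the three input styles, the mappings can be pictured as simple lookups from source columns to transform input fields. This is only a sketch: the source column names (`street_address`, `pin_code`, `addr_line_1`, and so on) and the helper functions are hypothetical, and only the transform field names come from the text above. The real mapping is done in the Data Services or DQM SDK configuration, not in Python.

```python
# Hypothetical source column names on the left; transform input field
# names on the right are the ones listed in this document.
DISCRETE_MAPPING = {
    "street_address": "ADDRESS_LINE",
    "city": "LOCALITY1",
    "state": "REGION",
    "pin_code": "POSTCODE",
    "country": "COUNTRY",
}


def multiline_mapping(source_columns):
    """Map unfielded source columns, in order, to MULTILINE1..MULTILINE12."""
    if len(source_columns) > 12:
        raise ValueError("only MULTILINE1 through MULTILINE12 are available")
    return {col: f"MULTILINE{i}" for i, col in enumerate(source_columns, start=1)}


def hybrid_mapping(multiline_columns, discrete_columns):
    """Combine free-form lines with properly fielded columns."""
    mapping = multiline_mapping(multiline_columns)
    mapping.update(discrete_columns)
    return mapping


# Hybrid example: two free-form address lines plus a fielded postcode and country.
example = hybrid_mapping(
    ["addr_line_1", "addr_line_2"],
    {"pin_code": "POSTCODE", "country": "COUNTRY"},
)
```

In the hybrid example, the two free-form columns land on MULTILINE1 and MULTILINE2, while the postcode and country keep their discrete fields.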
The global address cleanse transform supports both discrete and compound output fields. A discrete output field represents a single entity within an address. An example of this would be the PRIMARY_NAME1 (street name) output field. A compound output field represents a group of discrete fields. An example of this would be the PRIMARY_NAME_FULL1 output field, which has the following structure:
PRIMARY_PREFIX1 PRIMARY_NAME1 PRIMARY_TYPE1 PRIMARY_POSTFIX1
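To make the relationship between discrete and compound fields concrete, the structure above can be sketched as a join of the non-empty discrete components. The function name and the space-separated joining rule are assumptions made for illustration; the transform builds the compound field itself.

```python
def build_primary_name_full(prefix="", name="", type_="", postfix=""):
    """Join the non-empty discrete components in the documented order:
    PRIMARY_PREFIX1 PRIMARY_NAME1 PRIMARY_TYPE1 PRIMARY_POSTFIX1.
    """
    parts = [prefix, name, type_, postfix]
    return " ".join(p.strip() for p in parts if p and p.strip())


# Only PRIMARY_NAME1 is populated, so the compound field equals it.
full_name = build_primary_name_full(name="Sector 6")
```

Empty components are skipped, so a record with only a street name yields a compound value equal to that name.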
The global address cleanse transform standardizes and cleanses data on a country-by-country basis using purchased postal address files. The data that is output to the supported fields depends on two factors:
- The address data that can be parsed and identified using the input data
- The level of detail of the address data in the postal address file
Parsing and Identification
There are instances where the input data can be identified and parsed into discrete or compound components, but the data may not be standardized or corrected. For instance, the global address cleanse transform may be able to identify that a building name was found, but the following may happen:
- The building name could not be found in the postal address file because the quality of the input data was too low for a match to be made.
- The postal address file does not have building name data (this is the case for India).
Output Components Supported for India
As mentioned above, the data that can be standardized and cleansed is directly related to the level of detail found within a specific country's postal address file. An analysis was completed using the India postal address file, and the output fields listed below should consistently be populated. There are a few caveats to this:
- The data cannot be of such low quality that individual address components cannot be identified or parsed. If this happens, then they will not be used in the assignment process.
- If the data can be parsed and identified, then we can use it in the assignment process. However, using our fuzzy matching and lookup algorithms, similar data needs to be present in the postal address file for the standardization and cleansing to happen.
- Parsing of data uses keywords in the input data that are flagged as building name, POR (Point of Reference, used in many Indian addresses; see example below), or firm words. If these words are not present in the input, we have no way of knowing that the data is a building, POR or firm.
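The keyword-driven identification described in the last caveat can be sketched as follows. The keyword lists here are hypothetical and far smaller than anything a real parser would use; this document does not describe the transform's actual dictionaries, and the component labels are chosen only for readability. The sample lines are taken from the example later in this document.

```python
# Hypothetical keyword lists, for illustration only.
KEYWORDS = {
    "POINT_OF_REFERENCE": {"OPP", "OPPOSITE", "NEAR", "BEHIND"},
    "FIRM": {"LIMITED", "LTD", "PVT"},
    "BUILDING_NAME": {"TOWER", "APARTMENTS", "BHAVAN"},
}


def classify_line(line):
    """Return the first component type whose keyword appears in the line,
    or None when no flagged keyword is present."""
    tokens = {tok.strip(".,").upper() for tok in line.split()}
    for component, words in KEYWORDS.items():
        if tokens & words:
            return component
    return None


por = classify_line("OPP. VASHI RAILWAY STATION")   # keyword "OPP" is present
firm = classify_line("Wanbury Limited")             # keyword "LIMITED" is present
unknown = classify_line("BSEL TECHPARK")            # no flagged keyword
```

The last call returns None: without a flagged keyword, nothing tells the parser what kind of component the line holds, which is the situation the caveat describes.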
Output Fields with Cleansed Data – ordered from largest area to smallest
- POSTCODE_FULL, which contains:
  - POSTCODE_DESCRIPTION and POSTCODE_1
- PRIMARY_NAME_FULL1, which contains:
  - PRIMARY_NAME1, PRIMARY_TYPE1
  - If the address is a postal address (e.g. PO BOX), the following are used: POST_OFFICE_NAME, PRIMARY_NAME1
Output Fields with Parsed Data
As mentioned above, there are caveats as to when the output fields are populated using data from the postal address file. The following fields will be populated if the address component was parsed and identified, but the reference data does not contain the level of detail needed to correct it properly.
- PRIMARY_NAME_FULL2, which contains:
- SECONDARY_ADDRESS, which consists of:
- FLOOR_DESCRIPTION
Address Line Remainder, Last Line Remainder and Extra data
The sections above explain how the output fields are populated, based on which address components can be identified and on the level of detail found within the postal address file. This section builds on those concepts to explain where the input data goes when it could not be parsed or identified.
The ADDRESS_LINE_REMAINDER[1-4] output fields are typically populated in the following situation: an input field contains some data that was parsed and identified, along with other address data that could not be identified. The unidentified data is output to these fields so that the address data on input is preserved. The global address cleanse transform will not delete data that could not be leveraged during the assignment process. These fields should be referenced to further enhance your output if needed.
The LASTLINE_REMAINDER[1-4] output fields function just like the ADDRESS_LINE_REMAINDER[1-4] output fields, except that they hold lastline data that could not be identified. This preserves all input data that cannot be corrected and/or standardized.
The EXTRA[1-12] output fields are populated when no individual address component could be parsed or identified anywhere in the line found within an input field. In that case the entire line is output to an EXTRA field.
An example of these two concepts would be:
B1/42/44 SECTOR 6 LANE
OPP. VASHI RAILWAY STATION
Generated Global Address Cleanse Output:
- COUNTRY: India
- LOCALITY1: Navi Mumbai
- LOCALITY2: Vashi
- REGION: MH
- POSTCODE: 400703
- PRIMARY_NAME: Sector 6
- PRIMARY_NUMBER: B1/42/44
- FIRM: Wanbury Limited
- POINT OF REFERENCE: Opp. Vashi Railway Station
- ADDRESS_LINE_REMAINDER1: LANE
- EXTRA1: BSEL TECHPARK
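The routing of leftover data in this example can be sketched as follows. This is a simplified illustration, not the transform's implementation: the `route_leftovers` function is hypothetical, the list of identified phrases for each line is supplied by hand rather than produced by a parser, and lastline remainders are left out for brevity.

```python
def route_leftovers(parsed_lines):
    """parsed_lines: list of (line_text, identified_phrases) tuples.

    A line with no identified phrases goes to EXTRA in full; otherwise
    whatever is left after removing the identified phrases goes to
    ADDRESS_LINE_REMAINDER.
    """
    remainders, extras = [], []
    for text, identified in parsed_lines:
        if not identified:
            extras.append(text)
            continue
        leftover = text
        for phrase in identified:
            leftover = leftover.replace(phrase, " ", 1)
        leftover = " ".join(leftover.split())  # collapse whitespace
        if leftover:
            remainders.append(leftover)

    output = {}
    for i, value in enumerate(remainders[:4], start=1):
        output[f"ADDRESS_LINE_REMAINDER{i}"] = value
    for i, value in enumerate(extras[:12], start=1):
        output[f"EXTRA{i}"] = value
    return output


result = route_leftovers([
    ("BSEL TECHPARK", []),                                   # nothing identified
    ("B1/42/44 SECTOR 6 LANE", ["B1/42/44", "SECTOR 6"]),    # "LANE" is left over
    ("OPP. VASHI RAILWAY STATION", ["OPP. VASHI RAILWAY STATION"]),
])
```

With these inputs, "LANE" is preserved in ADDRESS_LINE_REMAINDER1 and "BSEL TECHPARK" in EXTRA1, matching the output listed above.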
Informational Codes (INFO_CODE)
The global address cleanse transform has a few output fields that describe the cleansing process. One in particular is called the INFO_CODE. ASSIGNMENT_LEVEL and ASSIGNMENT_TYPE can also be used; more information can be found in the documentation at http://help.sap.com.
The INFO_CODE field is populated with a 4-digit number, such as 3010, 3000, 3070 or 2020. This field can be analyzed, and any post-processing specific to your custom solution can then be performed on the output data if needed.
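One simple way to start such an analysis is to tally the INFO_CODE values in the output and pull out the records for a given code. The sketch below assumes the output has been loaded as a list of dictionaries keyed by output field name; that representation and both function names are hypothetical. It deliberately attaches no meaning to any code; consult the documentation for what each value indicates.

```python
from collections import Counter


def summarize_info_codes(records):
    """Count how many output records carry each INFO_CODE value."""
    counts = Counter()
    for record in records:
        code = (record.get("INFO_CODE") or "").strip()
        counts[code if code else "(blank)"] += 1
    return counts


def records_with_code(records, code):
    """Select the records carrying one specific INFO_CODE for post-processing."""
    return [r for r in records if (r.get("INFO_CODE") or "").strip() == code]


# Made-up sample records, for illustration only.
sample = [
    {"POSTCODE": "400703", "INFO_CODE": "3010"},
    {"POSTCODE": "400703", "INFO_CODE": "3010"},
    {"POSTCODE": "", "INFO_CODE": "2020"},
    {"POSTCODE": "400703", "INFO_CODE": ""},
]
summary = summarize_info_codes(sample)
to_review = records_with_code(sample, "2020")
```

The tally shows which codes dominate a batch, and the selection helper isolates the affected records so that code-specific handling can be applied to them.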