For the past few days, i was working on the task of cleaning the location table from the a certain dataset (location of machines in Mexico). For this task we had to apply this workflow:
- Export the data from HANA database into an acceptable format to Open Refine (.csv): The problem here was more of an administrative than technical one, there was a limit on the data we can export from HANA; to overcome that we had to write a JAVA Script to connect to HANA and export the data. For, that we re-used the Java classes used in remix. The output was a 2MB .csv file
- Import the data to open refine, create a new project and parse the csv correctly (semi-automatically done by open refine, we just have to define few parameters like what line to parse as column headers and so)
- Cleaning the data
- Export the data as xls for review in Excel and back to .csv to be imported back to HANA
Cleaning the data
The first data imported to Refine looked like this
The problems with the data we mainly:
- Unclean strings, Strings where stuffed with leading and trailing empty spaces and dots
- Same entities were presented using different names; for example Avenue 2 was presented by AV 2, AV2, Av.2 and so on
- Missing information:
- Missing existing fields in rows, for example missing cities for some entries
- Missing extra information need for the application like the longitude and latitude
- Inaccurate information: and this was the main concern, as we were not sure if the existing data was entered accurately and is sound, for example we noticed that most of the entries 99+% of them had either MX or DF as regions; for some rows it looked like the Street address was stuffed for neighborhood and city as well.
With these challenges there was another obstacle; understanding the Mexican hierarchy for addresses; the results returned from geo-location APIs like Google’s had a specific format, data was retrieved and divided into:
- country -> country information, for us it was always Mexico
- administrative_area_level_1 -> corresponds to region in our table
- locality -> corresponds to city in our case
- sub_locality -> if the administrative_area_level_1 was D.F then this is treated as the city
- neighborhood -> corresponds to county in our database which was changed later back to neighborhood
- route -> corresponds to the address 1 in our database.
we note that address2 and address3 are extra information used to locate the machine and are not going to be cleaned by us.
- We started cleaning the data by removing white spaces stuffed in strings, this was done by applying selecting the column we wish to clean -> edit cells -> common transform and then apply the command:
trim(value) //value is automatically selected from the value of the cell
- now, we want to check the different representation of values in the cells, to do so, we want to facet the entries. For each column we select Facet -> Text Facet. The following screenshots represents the result given when applied over the address1 column, we had around 1800 entries; and since we cant check those manually we needed to cluster the entries so that we can group similar results together. The screenshots below show the facets and the clusters screens.
we notice that we have many similar clusters that correspond to the same entity, so we manually start matching them and check the different results by checking all the available algorithms
- Key collision: fingerprint, ngram-fingerprint, metafone3, cologne-phonetic
- Nearest neighbor: levenshtein, PPM
now we wanted to start adding the data we want by fetching it from Google Geocoding API; but for the limits on the API we can only do few requests a day, so we decided to pick the rows that are missing the city value and execute the algorithm on them.To filter out only empty cities (empty citites from the imported data are denoted by a dot (.), so we go to the city column and filter text by the dot (.)Now we select the address1 column and select edit column-> add column by fetching urls. We want to optimize the query sent to Google so that we can the most accurate result, so we decided to query by the street name, city and country to be sure that we get the most accurate result possible. The command is:
"http://maps.google.com/maps/api/geocode/json?sensor=false&address=" + escape(value+'+'+cells["NEIGHBORHOOD"].value + '+' + cells["COUNTRY"].value, "url")
we set several parameters like the new column name, Throttle delay and what to do when an error occurs and then run the script. The result will be a JSON file filled in the new column specified. and now we have to start parsing the new json and filling in the new data.
- We start with the easy straightforward fields, the full address, longitude and latitude, for all the values from now on we go to the column that contains the json and select add column based on this column
- for the full address we execute:
- for the longitude we execute:
- for the full address we execute:
- for latitude we execute:
- for all the values except for the city we execute the following command but with changing the parameter that corresponds to the desired field in the json
//to bring the locality
join(forEach(cells["GMAP"].value.parseJson().results.address_components,v, if(v.types=="locality", v.short_name,"")),"")
//to bring the address1
join(forEach(cells["GMAP"].value.parseJson().results.address_components,v, if(v.types=="route", v.short_name,"")),"")
- now for the city column, we have a condition that we want to apply, so we execute the following command:
if(join( forEach(cells["GMAP"].value.parseJson().results.address_components,v, if(v.types=="administrative_area_level_1", v.short_name,"")),"") == "D.F.", join( forEach(cells["GMAP"].value.parseJson().results.address_components,v, if(v.types=="sublocality", v.short_name,"")),""),join( forEach(value.parseJson().results.address_components,v, if(v.types=="locality", v.short_name,"")),""))
Now after we finish our data will look like:
By looking at the above data, we notice that there are differences between the results returned from Google and our previous data, for example we notice an increase in the number of cities from 4 to 25 different ones only by cleaning 2000 rows ! in addition to that we added the lng.lat values which are very useful for fetching useful information in the future, as well as the full address which can be used for display purposed (better than the combination of the address fields).
This whole process needed lots of manual interaction, like removing the white spaces, checking out the clusters. I felt that lots of these tasks can be automated with pre-configured options. Moreover, the whole process of parsing the json was troublesome, GREL (the language used by Refine is not mature, there are no variables in the normal sense functions should be chained so that the value from one goes into the other and so); so having that done in java would allow for more complex operations to be made; for example when we find D.F then we choose sub-locality, but what if we do not have a sub-locality defined ? then we have to go back and choose the locality.