Skip to Content
Author's profile photo Former Member

Fuzzy Matching Dirty Data

Everyone knows how awesome HANA is at being a database, but it also has some pretty cool features that come along with it.  One of those is the built-in fuzzy search capabilities.  I recently used the fuzzy search feature to match up names of products we receive from customers to a standardized name of that product that would be used for reporting.  For an individual product, this is a pretty trivial task, but I needed to do it for thousands of products and it needed to be accurate.

My first step was figuring out what the best search parameters for my data would be. Luckily, we already had a fairly significant lookup table of product names from customers and matching standardized names that I could use to tweak my search parameters.  I also needed to be able to do my fuzzy search on potentially thousands of product names.  This necessitated me committing the evil deed of writing a cursor (don’t hurt me!) so I could run the fuzzy search on each product name.

create procedure sp_fuzzy_foo () 
  declare cursor c_products for 
      select product_name, std_product_name 
      from product_name_lookup; 
  for c_row as c_products do 
      insert into fuzzy_foo (product_name, std_product_name, fuzzy_product_name, score) 
      select  c_row.product_name,  
                score() as score  
      from std_product_names std 
      where contains(std.product_name, c_row.product_name, fuzzy(0.5, 
  end for; 

This procedure allowed me to compare the results of the fuzzy search to our already matched up list of product names from customers.  For my data, this set of parameters gave me a nearly 90% match rate when the fuzzy score was above 0.8 between the standardized product name and the product name the fuzzy search returned. It also highlighted some anomalies in the lookup table for 10% that did not match which suggests that it is even more accurate than the 90% match rate it returned.  While playing with the search parameters, I found one of the neatest aspects of the fuzzy search feature is term mappings.  My “dirty data” from the customers used a ton of acronyms that our standardized product names did not use.  The term mappings allowed me to tell the fuzzy search any time it sees “DBL” to map it to “Double”.  The more terms I added to the term mapping table, the more accurate it became. Once I had the search parameters down, I needed to do it on my product names from customers that didn’t have a matching standardized product name in our lookup table.  I created another procedure that performed these steps.

  1. Join the product names from the customers to the lookup table to find any already matched up names and insert into a “matched” table.  The products inserted in this step received a score of 1 as if they matched perfectly in the fuzzy search.
  2. Call the fuzzy search procedure on product names that are not in the lookup table and insert those results into a “fuzzy matched” table.
  3. Insert the top scoring fuzzy matched product for each product name from the customer into the “matched” table if it had a fuzzy score greater than 0.8.
  4. Insert any product with a score less than 0.8 into the “matched” table with a score of 0.
  5. For any fuzzy matched product with a score less than 0.8, insert those results in a “recommended” table.

With the output of this procedure, we have another lookup table for our standardized product names that will include the fuzzy matched products.  We also have a table with recommended product names that we can use as a starting point to add to our original lookup table.  The results are not perfect and there are some products that might receive multiple matching product names with the same top score, but it’s better than nothing and eliminates a ton of manual labor.

Assigned Tags

      You must be Logged on to comment or reply to a post.
      Author's profile photo Former Member
      Former Member

      This is a great approach.

      Have you compared this solution with the SAP DS DataQuality transformations?

      I was thinking in a similar solution, with the fuzzy search capabilities.

      Author's profile photo Former Member
      Former Member
      Blog Post Author

      I haven't got into much SAP DS work yet, but that is in my plans for the near future.  I will try it out there and see what happens.

      Author's profile photo G. Cheung
      G. Cheung

      Thanks for sharing Robert! Just the pointer I was looking for, to tackle a similar challenge.

      Author's profile photo Former Member
      Former Member
      Blog Post Author