Ever looked for matches that were less than 100% perfect? Key based matches not working for you? Until a few years ago, fuzzy matching was the only answer we had. Today, data science algorithms are breaching the boundaries of the “possible”. Find out how we used algorithms typically applied to document similarity to solve the traditional fuzzy matching problem using SAP Data Intelligence & SAP Analytics Cloud.
Recently we were approached by a municipality in India for an easy way to match profiles across official covid-19 case lists. The govt officers were eyeballing the data and manually finding records of individuals that overlapped between the positive case lists and the quarantine / de-quarantine lists. This was of course time consuming and a typical clarion call for automation. The traditional approach for this is a process called fuzzy matching, which would involve matching the profiles on the basis of Name, Gender, Age and Address much like a join, but where the constraints are relaxed to have a match that is less than 100% pure. It’s similarity to a join, in the context of a string based comparison, implies processing time will expand exponentially with the size of the database.
Until some years ago, this would have been the unavoidable drudgery that we would have to plough on through. But not anymore. The idea of TF-IDF (term frequency, inverse document frequency) stems from the need to highlight how important a word is to a document in a collection or a corpus. For ex. In documents to do with wild life conservation, you can expect a set of keywords to occur more often than say in documents to do with supply chain. But in large documents the frequency of occurrence can simply be due to its size. So we find a way to control for document size, and define TD-IDF as the number of times a word appears in a document, offset by number of documents in the corpus that contain the word. This is a technique typically applied in searches for document similarity in its general form. We can easily extend it to user profiles, especially in the case of names and addresses.
We were handed 3 csv files containing records of covid-19 positive, quarantined and de-quarantined cases respectively. There were certain columns which were present in all 3 csv files but due to the manual nature of data collection they were mostly in different formats. Had data been collected by a single vendor in the same format, the job of collating all this data from these sources would have been a simple join on the common columns. Since real world scenarios are far from ideal, we need to jump through hoops to get the files ready before modelling.
Clean and homogenize data, so the disparate files can blend seamlessly.
Cosine similarity is a very popular technique used to find similarity between two vectors. In the current scenario, we find the similarity between two string vectors. The function used to calculate cosine similarity is shown below.
In this approach, we compare a record from one set with all the records in the other set and obtain a similarity score. The function below takes tf-idf matrix that was created earlier as input.
In this blog, we have shown a step by step approach on how we cleaned the data, built models using it and how we later aggregated the results from the models so that it can then be presented to the stakeholders. In Part 2 of this blog post, we describe how these models were scheduled and deployed in SAP Data Intelligence.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
User | Count |
---|---|
11 | |
10 | |
10 | |
9 | |
8 | |
7 | |
7 | |
7 | |
7 | |
6 |