Search

We have long used search engines such as Google and Yahoo, and more recently we have seen them correct spelling mistakes typed into the search box. In this article I cover the concept of information retrieval for best-match searching using the vector model. It can be applied in applications for searching or for spelling correction.

Each document is defined by a set of words, also called index terms or tokens. These tokens can be matched against the tokens of a query when searching. In the Boolean model one marks each index term with 1 or 0 for its presence or absence in a document, and a document is predicted to be relevant or non-relevant based on which terms match; a matching term is weighted as 1, and the score is simply the number of query terms present in the document. The problem with the Boolean model is that every term is weighted equally: it does not treat a term that is repeated many times in a document as more important, nor does it give extra importance to a term that is rare in the collection. This leads us to the problem of finding the importance, or weight, of each index term when calculating the matching score. These weights can be learned from the existing documents using the TF-IDF (Term Frequency-Inverse Document Frequency) method of the vector model. Let us first define some terminology:

TF (Term Frequency): the number of times an index term appears in a document.

The term frequency is normalized by dividing it by the maximum term frequency in the document: TF_i = f_i / max_l f_l

Document Frequency: the number of documents in which a given term occurs (usually denoted by a small 'n', with a capital 'N' representing the total number of documents).

IDF (Inverse Document Frequency): measures how well a token discriminates between documents. It is the logarithm of the total number of documents divided by the number of documents in which the term occurs. If a term occurs in every document, its IDF is zero (log 1 = 0): such a term does not distinguish any document. IDF therefore highlights terms that are rare in the collection and gives them more importance than common ones.

IDF = log(N / n_i)

Wd – the weight of each index term in a document: Wd = TF * IDF
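The learning step described above can be sketched in Java as follows. This is an illustrative implementation of the standard TF-IDF formulas, not the actual code of the library; the class and method names are hypothetical.

```java
import java.util.*;

// Hypothetical sketch of the learning step: compute a TF-IDF weight for each
// index term of each document (documents are given as lists of tokens).
public class TfIdfWeights {

    // Returns one map per document: term -> normalized TF * IDF.
    public static List<Map<String, Double>> compute(List<List<String>> docs) {
        int n = docs.size();

        // Document frequency n_i: number of documents containing each term.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }

        List<Map<String, Double>> weights = new ArrayList<>();
        for (List<String> doc : docs) {
            // Raw term frequencies within this document.
            Map<String, Integer> tf = new HashMap<>();
            for (String term : doc) tf.merge(term, 1, Integer::sum);
            int maxTf = Collections.max(tf.values());

            // Weight = normalized TF (f_i / max_l f_l) times IDF = log(N / n_i).
            Map<String, Double> w = new HashMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double ntf = (double) e.getValue() / maxTf;
                double idf = Math.log((double) n / df.get(e.getKey()));
                w.put(e.getKey(), ntf * idf);
            }
            weights.add(w);
        }
        return weights;
    }
}
```

Note that a term occurring in every document ends up with weight zero, exactly as described above.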

Once the weights are computed for all index terms across all documents, we are ready to use them for querying, i.e., searching for a new string. The query is scored against each document with a similarity function, the cosine of the angle between the two weight vectors.

The user can pass a threshold to retrieve only the matches whose score is above it.

Similarity Function:

Sim(q, d) = ∑ (Wq * Wd) / (√∑ (Wq * Wq) * √∑ (Wd * Wd))

Where,

Wq – weight of each term of the query string.

Wd – weight of the corresponding index term in the document being compared.
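The similarity function above is plain cosine similarity over sparse weight vectors, and can be sketched like this (again an illustrative implementation, not the library's API):

```java
import java.util.*;

// Minimal sketch of Sim(q, d): the cosine of the angle between the query
// weight vector and a document weight vector, both stored as sparse maps.
public class CosineSimilarity {

    public static double sim(Map<String, Double> q, Map<String, Double> d) {
        // Numerator: sum of Wq * Wd over the terms shared by query and document.
        double dot = 0.0;
        for (Map.Entry<String, Double> e : q.entrySet()) {
            Double wd = d.get(e.getKey());
            if (wd != null) dot += e.getValue() * wd;
        }
        // Denominator: product of the two vector norms.
        double normQ = 0.0, normD = 0.0;
        for (double w : q.values()) normQ += w * w;
        for (double w : d.values()) normD += w * w;
        if (normQ == 0 || normD == 0) return 0.0;  // guard against empty vectors
        return dot / (Math.sqrt(normQ) * Math.sqrt(normD));
    }
}
```

The score is 1.0 for identical directions and 0.0 when query and document share no terms, which makes it easy to apply a user-supplied threshold.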

Overall, like many other machine learning algorithms, it is used in two steps:

1. Learning:  Learn the weights of document index terms.
2. Querying: Use the similarity function to find the relevant documents.

The Java implementation of the TF-IDF algorithm uses multi-threading and partitioning for the computation. When the data size exceeds 1000 documents, it partitions the data into 'n' segments and creates multiple threads that compute the document term frequencies of each segment in parallel.

The same partitioning and parallel-processing philosophy is applied while querying: the documents are first split into equal fractions, then multiple threads run in parallel to compute the score, or relevance, of each document. Once all threads have completed, the results are aggregated and the relevant documents are returned.
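The partition-and-aggregate idea while querying can be sketched as below. This is my own illustration of the approach using `ExecutorService`, with hypothetical names; the article's library may structure it differently.

```java
import java.util.*;
import java.util.concurrent.*;

// Illustrative sketch of parallel querying: split the document collection into
// equal fractions, score each fraction on its own thread, then aggregate.
public class ParallelScorer {

    public static double[] scoreAll(List<Map<String, Double>> docs,
                                    Map<String, Double> query,
                                    int partitions) {
        double[] scores = new double[docs.size()];
        ExecutorService pool = Executors.newFixedThreadPool(partitions);
        try {
            List<Future<?>> futures = new ArrayList<>();
            int chunk = (docs.size() + partitions - 1) / partitions;
            for (int p = 0; p < partitions; p++) {
                final int from = p * chunk;
                final int to = Math.min(from + chunk, docs.size());
                futures.add(pool.submit(() -> {
                    // Each thread scores its own contiguous segment.
                    for (int i = from; i < to; i++) {
                        scores[i] = cosine(query, docs.get(i));
                    }
                }));
            }
            for (Future<?> f : futures) f.get();  // wait for every partition
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
        return scores;
    }

    // Cosine similarity over sparse weight vectors, as defined earlier.
    static double cosine(Map<String, Double> q, Map<String, Double> d) {
        double dot = 0, nq = 0, nd = 0;
        for (Map.Entry<String, Double> e : q.entrySet()) {
            Double w = d.get(e.getKey());
            if (w != null) dot += e.getValue() * w;
        }
        for (double w : q.values()) nq += w * w;
        for (double w : d.values()) nd += w * w;
        return (nq == 0 || nd == 0) ? 0 : dot / (Math.sqrt(nq) * Math.sqrt(nd));
    }
}
```

Because each thread writes to a disjoint slice of the `scores` array, no locking is needed; aggregation is just reading the array after all futures complete.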

This proved a good solution for searching a large number of documents. It can be used in various scenarios:

1. In procurement, matching a commodity using its description.
2. Finding the right component or sub-component for a message using the message description.
3. Part-number matching: finding parts from unstructured textual (description) data.

And so on.

In scenarios like these, TF-IDF can find the matches directly. In other cases, consider millions of documents or descriptions classified into categories or a hierarchy. Here we can use TF-IDF as a first pass to detect the category or hierarchy node, reducing the search space at the node level itself. Afterwards we apply a much more expensive (performance-wise) algorithm to find the most relevant results among the documents mapped to that node. For example, in spend analytics we have commodities such as "Home Improvement", "Stationery", and so on.

The "Stationery" commodity contains pens, pencils, stickers, markers, and so on. In this scenario, when the description of a product expense is received, we first map it to the commodity it belongs to using TF-IDF and the similarity measure. Once we get the best-matching commodity above a good threshold, we process the sub-category level (pens, pencils, markers, etc.) with more advanced algorithms to find the most relevant match.

The implementation also supports the 'ngram' token concept, so the user can create tokens for a document in two ways. To explain the usage, take the example document string 'Unable to create file'.

With words as tokens, the document is represented as {'Unable', 'create', 'file'} (the word 'to' is dropped, presumably as a stop word). With the 'ngram' concept, where n = 3, the same document is represented as {'Una', 'nab', 'abl', 'ble', 'cre', 'rea', 'eat', 'ate', 'fil', 'ile'}.

The advantage of the 'ngram' approach is that we have a good chance of getting relevant results even with spelling mistakes in the queries, or even in the documents. This is not possible with full words as tokens, where a misspelled word misses the match completely. With the 'ngram' approach and n = 3, there is a good chance that the other tokens of the same word still match, so we still retrieve the relevant document.
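The two tokenization modes can be sketched as below. This is an illustrative tokenizer, not the library's actual one; in particular, the word splitter here does no stop-word removal, so it is a simplification of the example above.

```java
import java.util.*;

// Illustrative sketch of the two tokenization modes: whole-word tokens
// versus character 'ngram' tokens (sliding window of n characters per word).
public class Tokenizer {

    // Split on whitespace. A real implementation may also drop stop words
    // and words shorter than n.
    public static List<String> words(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Emit every n-character substring of each word, in order.
    public static List<String> ngrams(String text, int n) {
        List<String> tokens = new ArrayList<>();
        for (String word : words(text)) {
            for (int i = 0; i + n <= word.length(); i++) {
                tokens.add(word.substring(i, i + n));
            }
        }
        return tokens;
    }
}
```

With n = 3, 'Unable' yields {'Una', 'nab', 'abl', 'ble'}; a misspelling such as 'Unabel' would still share 'Una' and 'nab' with it, which is exactly why the ngram mode is robust to typos.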

To use the TF-IDF Java implementation, add the jar to your project and make use of the main class 'InvertedIndex'. I look forward to your feedback on the library, its usage scenarios, and how to improve it!

Link to the jar file: TF-IDF Jar File
