Hello Experts,


This blog is about one of the feature that SAP HANA provides, FUZZY SEARCH.


Now the question arises, what is Fuzzy search?!… So, Fuzzy search is the technique of finding strings that match a pattern approximately (rather than exactly). It is a type of search that will find matches even when users misspell words or enter only partial words for the search. It is also known as approximate string matching.


According to Fuzzy Search Reference guide, Fuzzy Search is a fast and fault-tolerant search feature for SAP HANA. The term “fault-tolerant search” means that a database query returns records even if the search term (the user input) contains additional or missing characters, or other types of spelling error.


Fuzzy search can be used in various applications, like:

  • Fault-tolerant check for Misspelled words and typos
  • Fault-tolerant search in text columns
  • Fault-tolerant search in structured database content
  • Fault-tolerant check for duplicate records

The best real world example of such fault-tolerant search is when you type “The United States of Amerika” in the Google Search, it automatically displays result for “The United States of America”.

In SAP HANA, Fuzzy Search can be called by using the CONTAINS() predicate with the FUZZY() option in the WHERE clause of a SELECT statement.

The basic SYNTAX is:

       


SELECT * FROM <tablename> WHERE CONTAINS (<column_name>, <search_string>, FUZZY (x))


Where, x is an argument that defines fuzzy threshold. It ranges from 0.0 to 1.0 and defines the level of error tolerance for the search. A search with FUZZY(x) returns all values that have a fuzzy score greater than or equal to x.

Fuzzy Search can only be applied for:

  • Column Table
  • Attribute View
  • Also on SQL views (created with the CREATE VIEW statement), and on joins of multiple tables and views, in some cases

       

          having column types as:

    • String (VARCHAR, NVARCHAR)
    • Text (TEXT, SHORTTEXT, FULLTEXT INDEX)
    • DATE

The CONTAINS() predicate can be used in the WHERE clause of a SELECT statement. It performs:

  1. A free style search on multiple columns
  2. A full-text search on one column containing large documents
  3. A search on one database column containing structured data

The type of search it performs depends on its arguments.

The SCORE() Function

The fuzzy search algorithm calculates a fuzzy score for each comparison, the SCORE() function can be used to retrieve the score. This is a numeric value between 0.0 and 1.0.

The score defines the similarity between the user input and the records returned by the search. A score of 1.0 means the strings are identical. A score of 0.0 means that there is no similarity. The higher the score, the more similar a record is to the search input.

We can request the score in the SELECT statement by using the SCORE() function. You can sort the results of a query by score in descending order to get the best records first (the best record is the record that is most similar to the user input). When more than one CONTAINS() is given in the WHERE clause or multiple columns is used in a SELECT statement, the score is calculated as a weighted average of the scores of all columns.

For example, consider we have a column table with two fields (ID integer, TXT TEXT) having values like different variations of word ‘hello world’. Then the Fuzzy Search for word ‘hello’ with Score will return the following:


SELECT TO_DECIMAL(SCORE(),3,2) AS score, * FROM <table_name> WHERE CONTAINS(txt, 'Hello', FUZZY(0.8))
ORDER BY score DESC;

/wp-content/uploads/2015/08/1_779630.png

Here, the words ‘hello’ and ‘Hello’ are having the score as 1, since the string matches completely. Whereas, word ‘ello’ is having the lowest score.


We can specify additional search options that change the default behavior of the fuzzy search as an additional string parameter for the FUZZY() function.

There are so many possible combinations of search options available. Lets try out combination of FUZZY() with similarCalculationMode.

Step 1. Create 1 column table as:


create column table <table_name>(
ID integer,
TXT varchar(20));


Step 2. Run following commands to Insert values into the table:



insert into <table_name> values(1,'hello');
insert into <table_name> values(3,'hell');
insert into <table_name> values(4,'hel');
insert into <table_name> values(5,'ello');
insert into <table_name> values(7,'hello world');
insert into <table_name> values(8,'hell world');
insert into <table_name> values(14,'helloworld');
insert into <table_name> values(15,'hellworld');
insert into <table_name> values(16,'HelloWorld');
insert into <table_name> values(17,'HELLO');
insert into <table_name> values(21,'world');
insert into <table_name> values(22,'word');


Step 3. Perform string search with option similarCalculationMode



SELECT TO_DECIMAL(SCORE(),3,2) AS score, * FROM <table_name>
WHERE CONTAINS(txt, 'Hello', FUZZY(0.8,'similarCalculationMode=compare'))
ORDER BY score DESC;

    

     We will get the output as:

     /wp-content/uploads/2015/08/2_779631.png

     Here, the FUZZY() compares all the strings in the table with the search string and gives the best matching results having SCORE() greater than 0.8


SELECT TO_DECIMAL(SCORE(),3,2) AS score, * FROM <table_name>
WHERE CONTAINS(txt, 'Hello', FUZZY(0.8,'similarCalculationMode=search'))
ORDER BY score DESC;

    

     We will get the output as:

     /wp-content/uploads/2015/08/3_779632.png

     Here, the FUZZY() searches all the strings in the table with the search string and gives the best matching results having SCORE() greater than 0.8. Notice the difference between search and compare here. As the part of result it also includes the strings composed of two words.


SELECT TO_DECIMAL(SCORE(),3,2) AS score, * FROM <table_name>
WHERE CONTAINS(txt, 'Hello', FUZZY(0.8,'similarCalculationMode=substringsearch'))
ORDER BY score DESC;

    

     We will get the output as:

     /wp-content/uploads/2015/08/4_779642.png

     Here, the FUZZY() searches all the strings in the table having substring ‘hello’ as the search string and gives the best matching results having SCORE() greater than 0.8.


Similarly, we can try other Available Properties of FUZZY SEARCH mentioned in the reference guide. Also, we can try different combinations of these properties to get the best possible result as per the requirement.

I hope you liked my first blog.

Happy Learning! 🙂

Thanks,

Pragati Gupta

To report this post you need to login first.

16 Comments

You must be Logged on to comment or reply to a post.

  1. Anindita Bhowmik

    Hi Pragati,

    Could you help me sort out numeric searches using fuzzy search for example,

    SELECT * FROM “<tablename>

    WHERE CONTAINS (<column of datatype string>,[1-9], FUZZY (0.8));

    where I want to search all numbers and alphanumeric numbers from the string column <column of datatype string> .

    In the scenario,

    Value_1: swxgwg 98789 sxkjxcg

    Value_2: xg8967

    Value_3: .244

    Value_4: dkwchwc

    only Value_2 and Value_3 would be identified.

    Thanks and Regards,

    Anindita Bhowmik

    (0) 
  2. rao sara

    What is Fuzzy Search?

    Also known as approximate string matching.
    Fuzzy search is the technique of finding strings that match a pattern approximately (rather than exactly).
    It is a type of search that will find matches even when users misspell words or enter in only partial words for the search.

    purpose:

    With the help of Fuzzy Search Misspellings and typos still provide relevant results.

    More Information: http://www.saphanacentral.com/p/how-to-use-fuzzy-search-in-sap-hana.html 

    (0) 

Leave a Reply