Skip to Content

FUZZY SEARCH

     FUZZY SEARCH not only can apply on one column, it can also apply on multiple columns. Let’s see an example, firstly we create a table named companies:

    


CREATE COLUMN TABLE companies
(
id INTEGER PRIMARY KEY,
companyname SHORTTEXT(200) FUZZY SEARCH INDEX ON,
contact SHORTTEXT(100) FUZZY SEARCH INDEX ON
);




     Then, insert the following values:


INSERT INTO companies VALUES (1, 'SAP Corp', 'Mister Master');
INSERT INTO companies VALUES (2, 'SAP in Walldorf Corp', 'Master Mister');
INSERT INTO companies VALUES (3, 'ASAP', 'Nister Naster');
INSERT INTO companies VALUES (4, 'ASAP Corp', 'Mixter Maxter');
INSERT INTO companies VALUES (5, 'BSAP orp', 'Imster Marter');
INSERT INTO companies VALUES (6, 'IBM Corp', 'M. Master');




    Next, we use “and” to perform multiple fuzzy search:


SELECT SCORE() AS score, * FROM companies
WHERE CONTAINS(companyname, 'IBM',
FUZZY(0.7,'textSearch=compare,bestMatchingTokenWeight=0.7'))
AND CONTAINS(contact, 'Master',
FUZZY(0.7,'textSearch=compare,bestMatchingTokenWeight=0.7'))
ORDER BY score DESC




/wp-content/uploads/2014/06/1_481166.png

     The score of this method is the means of all searched columns’ scores. We can also use freestyle search on multiple columns:


SELECT SCORE() AS score, * FROM companies
WHERE CONTAINS((companyname,contact), 'IBM Master', FUZZY(0.7))
ORDER BY score DESC;




/wp-content/uploads/2014/06/2_481167.png

We can see that the scores of the two search methods above are different. The reason is that freestyle search generally use TF/IDF to calculate the score, besides it can’t support ‘textSearch=compare’, ‘bestMatchingTokenWeight=0.7’ and other parameters.

similarCalculationMode

Option similarCalculationModecontrols how the similarity of two strings (or, for TEXT attributes, terms) is calculated. Basically, the similarity of two strings is defined by the number of common characters, wrong characters, additional characters in the search string, and additional characters in the reference string. The following calculation modes exist:

 

Mode

Impact on wrong charac

ters

Impact on additional

characters in search

Impact on additional

characters in table

search high high low
compare moderate high high
symmetricsearch high moderate moderate
substringsearch high high low

Examples:


TRUNCATE TABLE "TEST"."SEARCH_TEXT";
INSERT INTO "TEST"."SEARCH_TEXT" VALUES('HANAGEEK');
INSERT INTO "TEST"."SEARCH_TEXT" VALUES('HANAGEEKER');
INSERT INTO "TEST"."SEARCH_TEXT" VALUES('HANAGEK');
INSERT INTO "TEST"."SEARCH_TEXT" VALUES('HANAGAEK');




     Firstly, we use compare mode to perform the search:


SELECT SCORE() AS SCORE,*
FROM "TEST"."SEARCH_TEXT"
WHERE CONTAINS(CONTENT,'HANAGEEK',FUZZY(0.5,'similarCalculationMode=compare'))
ORDER BY SCORE DESC;




/wp-content/uploads/2014/06/3_481168.png

Then, we use search model:


SELECT SCORE() AS SCORE,*
FROM "TEST"."SEARCH_TEXT"
WHERE CONTAINS(CONTENT,'HANAGEEK',FUZZY(0.5,'similarCalculationMode=search'))
ORDER BY SCORE DESC;




4.png

Perform the search with compare model:

6.png

Then we use spellCheckFactor model to perform the search:


SELECT SCORE() AS SCORE,*
FROM "TEST"."SEARCH_TEXT"
WHERE CONTAINS(CONTENT,'cafe',FUZZY(0.5,'spellCheckFactor=0.8'))
ORDER BY SCORE DESC


/wp-content/uploads/2014/06/8_481180.png

We can conclude that scenarios “A” only has effects on strings. This is because all characters are replaced by lowercase characters without any diacritics before the fuzzy comparison takes place. This is called standardization. It is therefore possible to get a 100% match when comparing two unequal terms, because the standardization process returned two identical terms. 

interScriptMatching

 

Option interScriptMatching=on is used to find Latin transliterations of non-Latin search terms, and vice versa.

Often users cannot enter data using the ‘original’ characters of a foreign language. As a result, a transliteration to the Latin alphabet is used to enter this data. A German user who wants to create a new Chinese business partner for example, types in the city name as ‘Shanghai’ instead of using the Chinese characters (‘上海‘). Later, a Chinese user who searches for the business partner in ‘上海‘ does not find the data because the search term ‘上海‘ and the city name ‘Shanghai’ stored in the database do not have any characters in common.

To support search requirements as in the example above, search option interScriptMatching can be used. When interScriptMatching=on is used, it is possible to find Latin transliterations of non-Latin search terms, and vice versa. The behavior of fuzzy search changes as follows:

1. Search with Latin characters

(a)Latin alphabet database entries are searched as usual

(b)Non-Latin alphabet database entries are searched using a Latin transliteration of the original data

2. Search with non-Latin characters

(a)Latin alphabet database entries are searched using a Latin transliteration of the search term

(b)Non-Latin alphabet database entries are searched using the original search term

Examples:


CREATE COLUMN TABLE interscript
(
str NVARCHAR(100) PRIMARY KEY
);
INSERT INTO interscript VALUES ('Shanghai');
INSERT INTO interscript VALUES ('上海');
INSERT INTO interscript VALUES ('Beijing');
INSERT INTO interscript VALUES ('北京');
INSERT INTO interscript VALUES ('背景');


Firstly, we perform the search without the interScriptMatching model:


SELECT TO_DECIMAL(SCORE(),3,2), * FROM interscript WHERE CONTAINS(str, 'shanghai', FUZZY(0.7)) ORDER BY SCORE() DESC;

/wp-content/uploads/2014/06/9_481182.png

Then, we use interScriptMatching model to search:


SELECT TO_DECIMAL(SCORE(),3,2), * FROM interscript WHERE CONTAINS(str, 'shanghai',
FUZZY(0.7,'interScriptMatching=on')) ORDER BY SCORE() DESC

/wp-content/uploads/2014/06/10_481183.png

If we search ‘beijing’ the words ‘北京’ and ‘背景’ all will be searched out.


SELECT TO_DECIMAL(SCORE(),3,2), * FROM interscript WHERE CONTAINS(str, 'beijing',
FUZZY(0.7,'interScriptMatching=on')) ORDER BY SCORE() DESC

/wp-content/uploads/2014/06/11_481185.png

Be the first to leave a comment
You must be Logged on to comment or reply to a post.