FUZZY SEARCH

Former Member · ‎06-22-2014

FUZZY SEARCH not only can apply on one column, it can also apply on multiple columns. Let’s see an example, firstly we create a table named companies:



CREATE COLUMN TABLE companies

(

id INTEGER PRIMARY KEY,

companyname SHORTTEXT(200) FUZZY SEARCH INDEX ON,

contact SHORTTEXT(100) FUZZY SEARCH INDEX ON

);

Then, insert the following values:



INSERT INTO companies VALUES (1, 'SAP Corp', 'Mister Master');

INSERT INTO companies VALUES (2, 'SAP in Walldorf Corp', 'Master Mister');

INSERT INTO companies VALUES (3, 'ASAP', 'Nister Naster');

INSERT INTO companies VALUES (4, 'ASAP Corp', 'Mixter Maxter');

INSERT INTO companies VALUES (5, 'BSAP orp', 'Imster Marter');

INSERT INTO companies VALUES (6, 'IBM Corp', 'M. Master');

Next, we use “and” to perform multiple fuzzy search:



SELECT SCORE() AS score, * FROM companies

WHERE CONTAINS(companyname, 'IBM',

FUZZY(0.7,'textSearch=compare,bestMatchingTokenWeight=0.7'))

AND CONTAINS(contact, 'Master',

FUZZY(0.7,'textSearch=compare,bestMatchingTokenWeight=0.7'))

ORDER BY score DESC

The score of this method is the means of all searched columns’ scores. We can also use freestyle search on multiple columns:



SELECT SCORE() AS score, * FROM companies

WHERE CONTAINS((companyname,contact), 'IBM Master', FUZZY(0.7))

ORDER BY score DESC;

We can see that the scores of the two search methods above are different. The reason is that freestyle search generally use TF/IDF to calculate the score, besides it can’t support 'textSearch=compare', 'bestMatchingTokenWeight=0.7' and other parameters.

similarCalculationMode

Option similarCalculationModecontrols how the similarity of two strings (or, for TEXT attributes, terms) is calculated. Basically, the similarity of two strings is defined by the number of common characters, wrong characters, additional characters in the search string, and additional characters in the reference string. The following calculation modes exist:

Mode	Impact on wrong charac ters	Impact on additional characters in search	Impact on additional characters in table
search	high	high	low
compare	moderate	high	high
symmetricsearch	high	moderate	moderate
substringsearch	high	high	low

Examples:



TRUNCATE TABLE "TEST"."SEARCH_TEXT";

INSERT INTO "TEST"."SEARCH_TEXT" VALUES('HANAGEEK');

INSERT INTO "TEST"."SEARCH_TEXT" VALUES('HANAGEEKER');

INSERT INTO "TEST"."SEARCH_TEXT" VALUES('HANAGEK');

INSERT INTO "TEST"."SEARCH_TEXT" VALUES('HANAGAEK');

Firstly, we use compare mode to perform the search:



SELECT SCORE() AS SCORE,*

FROM "TEST"."SEARCH_TEXT"

WHERE CONTAINS(CONTENT,'HANAGEEK',FUZZY(0.5,'similarCalculationMode=compare'))

ORDER BY SCORE DESC;

Then, we use search model:



SELECT SCORE() AS SCORE,*

FROM "TEST"."SEARCH_TEXT"

WHERE CONTAINS(CONTENT,'HANAGEEK',FUZZY(0.5,'similarCalculationMode=search'))

ORDER BY SCORE DESC;

We can conclude that, compare search model has more effects compared to search model When the original string has additional characters than the search string; search model has more effects than compare model when wrong characters exists. So, when we want to search “SAP” not “SPA”, we should we search model:



TRUNCATE TABLE "TEST"."SEARCH_TEXT";

INSERT INTO "TEST"."SEARCH_TEXT" VALUES('SAP');

INSERT INTO "TEST"."SEARCH_TEXT" VALUES('SAP HANA');

INSERT INTO "TEST"."SEARCH_TEXT" VALUES('SPA');

Perform the search with search model:



SELECT SCORE() AS SCORE,*

FROM "TEST"."SEARCH_TEXT"

WHERE CONTAINS(CONTENT,'SAP',FUZZY(0.5,'similarCalculationMode=search'))

ORDER BY SCORE DESC;

Perform the search with compare model:

spellCheckFactor

Option spellCheckFactordefines the score for strings that are not identical but get a fuzzy score of 1.0.

There are two use cases for option spellCheckFactor:

(A) This option allows you to set the score for terms that are not fully equal but that would be a 100% match because of the internal character standardization used by the fuzzy search. For example, the terms 'Café' and 'cafe' give a score of 1.0 although the terms are not equal. For some users it might be necessary to distinguish between the terms. The decision whether two terms are equal is based on the term representation stored in the column dictionary. Option spellCheckFactortherefore works differently on string and text columns, as described in the following sections.

(B) The fuzzy search can return a 100% match for terms that are not identical but cannot be differentiated by the fuzzy-string-compare algorithm. For example, the fuzzy search cannot differentiate between the terms 'abaca' and 'acaba'. In this case, the spellCheckFactorcan be used to avoid a score of 1.0.

If (A) and (B) are not needed by an application, you can set the spellCheckFactor to 1.0 to disable the feature.

spellCheckFactor has different effects on string and text. We create a varchar type table SEARCH_TEXT_CHAR and text type table SEARCH_TEXT separately. Then insert the same words Café and cafe:



TRUNCATE TABLE "TEST"."SEARCH_TEXT";

INSERT INTO "TEST"."SEARCH_TEXT" VALUES('Café');

INSERT INTO "TEST"."SEARCH_TEXT" VALUES('cafe');

INSERT INTO "TEST"."SEARCH_TEXT_CHAR" VALUES('Café');

INSERT INTO "TEST"."SEARCH_TEXT_CHAR" VALUES('cafe');

We search “cafe” separately:



SELECT SCORE() AS SCORE,*

FROM "TEST"."SEARCH_TEXT"

WHERE CONTAINS(CONTENT,'cafe',FUZZY(0.5))

ORDER BY SCORE DESC;

Then we use spellCheckFactor model to perform the search:



SELECT SCORE() AS SCORE,*

FROM "TEST"."SEARCH_TEXT"

WHERE CONTAINS(CONTENT,'cafe',FUZZY(0.5,'spellCheckFactor=0.8'))

ORDER BY SCORE DESC

We can conclude that scenarios “A” only has effects on strings. This is because all characters are replaced by lowercase characters without any diacritics before the fuzzy comparison takes place. This is called standardization. It is therefore possible to get a 100% match when comparing two unequal terms, because the standardization process returned two identical terms.

interScriptMatching

Option interScriptMatching=on is used to find Latin transliterations of non-Latin search terms, and vice versa.

Often users cannot enter data using the 'original' characters of a foreign language. As a result, a transliteration to the Latin alphabet is used to enter this data. A German user who wants to create a new Chinese business partner for example, types in the city name as 'Shanghai' instead of using the Chinese characters ('上海'). Later, a Chinese user who searches for the business partner in '上海' does not find the data because the search term '上海' and the city name 'Shanghai' stored in the database do not have any characters in common.

To support search requirements as in the example above, search option interScriptMatching can be used. When interScriptMatching=on is used, it is possible to find Latin transliterations of non-Latin search terms, and vice versa. The behavior of fuzzy search changes as follows:

1. Search with Latin characters

(a)Latin alphabet database entries are searched as usual

(b)Non-Latin alphabet database entries are searched using a Latin transliteration of the original data

2. Search with non-Latin characters

(a)Latin alphabet database entries are searched using a Latin transliteration of the search term

(b)Non-Latin alphabet database entries are searched using the original search term

Examples:



CREATE COLUMN TABLE interscript

(

str NVARCHAR(100) PRIMARY KEY

);

INSERT INTO interscript VALUES ('Shanghai');

INSERT INTO interscript VALUES ('上海');

INSERT INTO interscript VALUES ('Beijing');

INSERT INTO interscript VALUES ('北京');

INSERT INTO interscript VALUES ('背景');

Firstly, we perform the search without the interScriptMatching model:



SELECT TO_DECIMAL(SCORE(),3,2), * FROM interscript WHERE CONTAINS(str, 'shanghai', FUZZY(0.7)) ORDER BY SCORE() DESC;

Then, we use interScriptMatching model to search:



SELECT TO_DECIMAL(SCORE(),3,2), * FROM interscript WHERE CONTAINS(str, 'shanghai',

FUZZY(0.7,'interScriptMatching=on')) ORDER BY SCORE() DESC

If we search ‘beijing’ the words ‘北京’ and ‘背景’ all will be searched out.



SELECT TO_DECIMAL(SCORE(),3,2), * FROM interscript WHERE CONTAINS(str, 'beijing',

FUZZY(0.7,'interScriptMatching=on')) ORDER BY SCORE() DESC

FUZZY SEARCH

similarCalculationMode

spellCheckFactor

interScriptMatching

Get Your SAP HANA Idea Incubator Badge Today!

SCN Mission - SAP HANA Quiz Challenge is now retired

Share your #HANAStory and Win