FUZZY SEARCH not only can apply on one column, it can also apply on multiple columns. Let’s see an example, firstly we create a table named companies:

    


CREATE COLUMN TABLE companies
(
id INTEGER PRIMARY KEY,
companyname SHORTTEXT(200) FUZZY SEARCH INDEX ON,
contact SHORTTEXT(100) FUZZY SEARCH INDEX ON
);




     Then, insert the following values:


INSERT INTO companies VALUES (1, 'SAP Corp', 'Mister Master');
INSERT INTO companies VALUES (2, 'SAP in Walldorf Corp', 'Master Mister');
INSERT INTO companies VALUES (3, 'ASAP', 'Nister Naster');
INSERT INTO companies VALUES (4, 'ASAP Corp', 'Mixter Maxter');
INSERT INTO companies VALUES (5, 'BSAP orp', 'Imster Marter');
INSERT INTO companies VALUES (6, 'IBM Corp', 'M. Master');




    Next, we use “and” to perform multiple fuzzy search:


SELECT SCORE() AS score, * FROM companies
WHERE CONTAINS(companyname, 'IBM',
FUZZY(0.7,'textSearch=compare,bestMatchingTokenWeight=0.7'))
AND CONTAINS(contact, 'Master',
FUZZY(0.7,'textSearch=compare,bestMatchingTokenWeight=0.7'))
ORDER BY score DESC




/wp-content/uploads/2014/06/1_481166.png

     The score of this method is the means of all searched columns’ scores. We can also use freestyle search on multiple columns:


SELECT SCORE() AS score, * FROM companies
WHERE CONTAINS((companyname,contact), 'IBM Master', FUZZY(0.7))
ORDER BY score DESC;




/wp-content/uploads/2014/06/2_481167.png

    We can see that the scores of the two search methods above are different. The reason is that freestyle search generally use TF/IDF to calculate the score, besides it can’t support ‘textSearch=compare’, ‘bestMatchingTokenWeight=0.7’ and other parameters.

similarCalculationMode

Option similarCalculationModecontrols how the similarity of two strings (or, for TEXT attributes, terms) is calculated. Basically, the similarity of two strings is defined by the number of common characters, wrong characters, additional characters in the search string, and additional characters in the reference string. The following calculation modes exist:


Mode

Impact on wrong charac

ters

Impact on additional

characters in search

Impact on additional

characters in table

search

high

high

low

compare

moderate

high

high

symmetricsearch

high

moderate

moderate

substringsearch

high

high

low

Examples:


TRUNCATE TABLE "TEST"."SEARCH_TEXT";
INSERT INTO "TEST"."SEARCH_TEXT" VALUES('HANAGEEK');
INSERT INTO "TEST"."SEARCH_TEXT" VALUES('HANAGEEKER');
INSERT INTO "TEST"."SEARCH_TEXT" VALUES('HANAGEK');
INSERT INTO "TEST"."SEARCH_TEXT" VALUES('HANAGAEK');




     Firstly, we use compare mode to perform the search:


SELECT SCORE() AS SCORE,*
FROM "TEST"."SEARCH_TEXT"
WHERE CONTAINS(CONTENT,'HANAGEEK',FUZZY(0.5,'similarCalculationMode=compare'))
ORDER BY SCORE DESC;




/wp-content/uploads/2014/06/3_481168.png

Then, we use search model:


SELECT SCORE() AS SCORE,*
FROM "TEST"."SEARCH_TEXT"
WHERE CONTAINS(CONTENT,'HANAGEEK',FUZZY(0.5,'similarCalculationMode=search'))
ORDER BY SCORE DESC;




4.png

    We can conclude that, compare search model has more effects compared to search model When the original string has additional characters than the search string; search model has more effects than compare model when wrong characters exists. So, when we want to search “SAP” not “SPA”, we should we search model:


TRUNCATE TABLE "TEST"."SEARCH_TEXT";
INSERT INTO "TEST"."SEARCH_TEXT" VALUES('SAP');
INSERT INTO "TEST"."SEARCH_TEXT" VALUES('SAP HANA');
INSERT INTO "TEST"."SEARCH_TEXT" VALUES('SPA');




Perform the search with search model:


SELECT SCORE() AS SCORE,*
FROM "TEST"."SEARCH_TEXT"
WHERE CONTAINS(CONTENT,'SAP',FUZZY(0.5,'similarCalculationMode=search'))
ORDER BY SCORE DESC;




/wp-content/uploads/2014/06/5_481173.png

Perform the search with compare model:

6.png

spellCheckFactor

Option spellCheckFactordefines the score for strings that are not identical but get a fuzzy score of 1.0.

There are two use cases for option spellCheckFactor:

(A) This option allows you to set the score for terms that are not fully equal but that would be a 100% match because of the internal character standardization used by the fuzzy search. For example, the terms ‘Café’ and ‘cafe’ give a score of 1.0 although the terms are not equal. For some users it might be necessary to distinguish between the terms. The decision whether two terms are equal is based on the term representation stored in the column dictionary. Option spellCheckFactortherefore works differently on string and text columns, as described in the following sections.

(B) The fuzzy search can return a 100% match for terms that are not identical but cannot be differentiated by the fuzzy-string-compare algorithm. For example, the fuzzy search cannot differentiate between the terms ‘abaca’ and ‘acaba’. In this case, the spellCheckFactorcan be used to avoid a score of 1.0.

If (A) and (B) are not needed by an application, you can set the spellCheckFactor to 1.0 to disable the feature.

spellCheckFactor has different effects on string and text. We create a varchar type table SEARCH_TEXT_CHAR and text type table SEARCH_TEXT separately. Then insert the same words Café and cafe:


TRUNCATE TABLE "TEST"."SEARCH_TEXT";
INSERT INTO "TEST"."SEARCH_TEXT" VALUES('Café');
INSERT INTO "TEST"."SEARCH_TEXT" VALUES('cafe');
INSERT INTO "TEST"."SEARCH_TEXT_CHAR" VALUES('Café');
INSERT INTO "TEST"."SEARCH_TEXT_CHAR" VALUES('cafe');




We search “cafe” separately:


SELECT SCORE() AS SCORE,*
FROM "TEST"."SEARCH_TEXT"
WHERE CONTAINS(CONTENT,'cafe',FUZZY(0.5))
ORDER BY SCORE DESC;







/wp-content/uploads/2014/06/7_481179.png

Then we use spellCheckFactor model to perform the search:


SELECT SCORE() AS SCORE,*
FROM "TEST"."SEARCH_TEXT"
WHERE CONTAINS(CONTENT,'cafe',FUZZY(0.5,'spellCheckFactor=0.8'))
ORDER BY SCORE DESC


/wp-content/uploads/2014/06/8_481180.png

We can conclude that scenarios “A” only has effects on strings. This is because all characters are replaced by lowercase characters without any diacritics before the fuzzy comparison takes place. This is called standardization. It is therefore possible to get a 100% match when comparing two unequal terms, because the standardization process returned two identical terms. 

interScriptMatching


Option interScriptMatching=on is used to find Latin transliterations of non-Latin search terms, and vice versa.

Often users cannot enter data using the ‘original’ characters of a foreign language. As a result, a transliteration to the Latin alphabet is used to enter this data. A German user who wants to create a new Chinese business partner for example, types in the city name as ‘Shanghai’ instead of using the Chinese characters (‘上海‘). Later, a Chinese user who searches for the business partner in ‘上海‘ does not find the data because the search term ‘上海‘ and the city name ‘Shanghai’ stored in the database do not have any characters in common.

To support search requirements as in the example above, search option interScriptMatching can be used. When interScriptMatching=on is used, it is possible to find Latin transliterations of non-Latin search terms, and vice versa. The behavior of fuzzy search changes as follows:

1. Search with Latin characters

(a)Latin alphabet database entries are searched as usual

(b)Non-Latin alphabet database entries are searched using a Latin transliteration of the original data

2. Search with non-Latin characters

(a)Latin alphabet database entries are searched using a Latin transliteration of the search term

(b)Non-Latin alphabet database entries are searched using the original search term

Examples:


CREATE COLUMN TABLE interscript
(
str NVARCHAR(100) PRIMARY KEY
);
INSERT INTO interscript VALUES ('Shanghai');
INSERT INTO interscript VALUES ('上海');
INSERT INTO interscript VALUES ('Beijing');
INSERT INTO interscript VALUES ('北京');
INSERT INTO interscript VALUES ('背景');


Firstly, we perform the search without the interScriptMatching model:


SELECT TO_DECIMAL(SCORE(),3,2), * FROM interscript WHERE CONTAINS(str, 'shanghai', FUZZY(0.7)) ORDER BY SCORE() DESC;

/wp-content/uploads/2014/06/9_481182.png

Then, we use interScriptMatching model to search:


SELECT TO_DECIMAL(SCORE(),3,2), * FROM interscript WHERE CONTAINS(str, 'shanghai',
FUZZY(0.7,'interScriptMatching=on')) ORDER BY SCORE() DESC

/wp-content/uploads/2014/06/10_481183.png

If we search ‘beijing’ the words ‘北京’ and ‘背景’ all will be searched out.


SELECT TO_DECIMAL(SCORE(),3,2), * FROM interscript WHERE CONTAINS(str, 'beijing',
FUZZY(0.7,'interScriptMatching=on')) ORDER BY SCORE() DESC

/wp-content/uploads/2014/06/11_481185.png

To report this post you need to login first.

Be the first to leave a comment

You must be Logged on to comment or reply to a post.

Leave a Reply