CREATE COLUMN TABLE companies
(
id INTEGER PRIMARY KEY,
companyname SHORTTEXT(200) FUZZY SEARCH INDEX ON,
contact SHORTTEXT(100) FUZZY SEARCH INDEX ON
);
INSERT INTO companies VALUES (1, 'SAP Corp', 'Mister Master');
INSERT INTO companies VALUES (2, 'SAP in Walldorf Corp', 'Master Mister');
INSERT INTO companies VALUES (3, 'ASAP', 'Nister Naster');
INSERT INTO companies VALUES (4, 'ASAP Corp', 'Mixter Maxter');
INSERT INTO companies VALUES (5, 'BSAP orp', 'Imster Marter');
INSERT INTO companies VALUES (6, 'IBM Corp', 'M. Master');
SELECT SCORE() AS score, * FROM companies
WHERE CONTAINS(companyname, 'IBM',
FUZZY(0.7,'textSearch=compare,bestMatchingTokenWeight=0.7'))
AND CONTAINS(contact, 'Master',
FUZZY(0.7,'textSearch=compare,bestMatchingTokenWeight=0.7'))
ORDER BY score DESC
SELECT SCORE() AS score, * FROM companies
WHERE CONTAINS((companyname,contact), 'IBM Master', FUZZY(0.7))
ORDER BY score DESC;
Option similarCalculationModecontrols how the similarity of two strings (or, for TEXT attributes, terms) is calculated. Basically, the similarity of two strings is defined by the number of common characters, wrong characters, additional characters in the search string, and additional characters in the reference string. The following calculation modes exist:
Mode | Impact on wrong charac ters | Impact on additional characters in search | Impact on additional characters in table |
search | high | high | low |
compare | moderate | high | high |
symmetricsearch | high | moderate | moderate |
substringsearch | high | high | low |
TRUNCATE TABLE "TEST"."SEARCH_TEXT";
INSERT INTO "TEST"."SEARCH_TEXT" VALUES('HANAGEEK');
INSERT INTO "TEST"."SEARCH_TEXT" VALUES('HANAGEEKER');
INSERT INTO "TEST"."SEARCH_TEXT" VALUES('HANAGEK');
INSERT INTO "TEST"."SEARCH_TEXT" VALUES('HANAGAEK');
SELECT SCORE() AS SCORE,*
FROM "TEST"."SEARCH_TEXT"
WHERE CONTAINS(CONTENT,'HANAGEEK',FUZZY(0.5,'similarCalculationMode=compare'))
ORDER BY SCORE DESC;
SELECT SCORE() AS SCORE,*
FROM "TEST"."SEARCH_TEXT"
WHERE CONTAINS(CONTENT,'HANAGEEK',FUZZY(0.5,'similarCalculationMode=search'))
ORDER BY SCORE DESC;
TRUNCATE TABLE "TEST"."SEARCH_TEXT";
INSERT INTO "TEST"."SEARCH_TEXT" VALUES('SAP');
INSERT INTO "TEST"."SEARCH_TEXT" VALUES('SAP HANA');
INSERT INTO "TEST"."SEARCH_TEXT" VALUES('SPA');
SELECT SCORE() AS SCORE,*
FROM "TEST"."SEARCH_TEXT"
WHERE CONTAINS(CONTENT,'SAP',FUZZY(0.5,'similarCalculationMode=search'))
ORDER BY SCORE DESC;
Option spellCheckFactordefines the score for strings that are not identical but get a fuzzy score of 1.0.
spellCheckFactor has different effects on string and text. We create a varchar type table SEARCH_TEXT_CHAR and text type table SEARCH_TEXT separately. Then insert the same words Café and cafe:
TRUNCATE TABLE "TEST"."SEARCH_TEXT";
INSERT INTO "TEST"."SEARCH_TEXT" VALUES('Café');
INSERT INTO "TEST"."SEARCH_TEXT" VALUES('cafe');
INSERT INTO "TEST"."SEARCH_TEXT_CHAR" VALUES('Café');
INSERT INTO "TEST"."SEARCH_TEXT_CHAR" VALUES('cafe');
SELECT SCORE() AS SCORE,*
FROM "TEST"."SEARCH_TEXT"
WHERE CONTAINS(CONTENT,'cafe',FUZZY(0.5))
ORDER BY SCORE DESC;
SELECT SCORE() AS SCORE,*
FROM "TEST"."SEARCH_TEXT"
WHERE CONTAINS(CONTENT,'cafe',FUZZY(0.5,'spellCheckFactor=0.8'))
ORDER BY SCORE DESC
We can conclude that scenarios “A” only has effects on strings. This is because all characters are replaced by lowercase characters without any diacritics before the fuzzy comparison takes place. This is called standardization. It is therefore possible to get a 100% match when comparing two unequal terms, because the standardization process returned two identical terms.
Option interScriptMatching=on is used to find Latin transliterations of non-Latin search terms, and vice versa.
Often users cannot enter data using the 'original' characters of a foreign language. As a result, a transliteration to the Latin alphabet is used to enter this data. A German user who wants to create a new Chinese business partner for example, types in the city name as 'Shanghai' instead of using the Chinese characters ('上海'). Later, a Chinese user who searches for the business partner in '上海' does not find the data because the search term '上海' and the city name 'Shanghai' stored in the database do not have any characters in common.
To support search requirements as in the example above, search option interScriptMatching can be used. When interScriptMatching=on is used, it is possible to find Latin transliterations of non-Latin search terms, and vice versa. The behavior of fuzzy search changes as follows:
(a)Latin alphabet database entries are searched as usual
(b)Non-Latin alphabet database entries are searched using a Latin transliteration of the original data
(a)Latin alphabet database entries are searched using a Latin transliteration of the search term
(b)Non-Latin alphabet database entries are searched using the original search term
Examples:
CREATE COLUMN TABLE interscript
(
str NVARCHAR(100) PRIMARY KEY
);
INSERT INTO interscript VALUES ('Shanghai');
INSERT INTO interscript VALUES ('上海');
INSERT INTO interscript VALUES ('Beijing');
INSERT INTO interscript VALUES ('北京');
INSERT INTO interscript VALUES ('背景');
SELECT TO_DECIMAL(SCORE(),3,2), * FROM interscript WHERE CONTAINS(str, 'shanghai', FUZZY(0.7)) ORDER BY SCORE() DESC;
SELECT TO_DECIMAL(SCORE(),3,2), * FROM interscript WHERE CONTAINS(str, 'shanghai',
FUZZY(0.7,'interScriptMatching=on')) ORDER BY SCORE() DESC
SELECT TO_DECIMAL(SCORE(),3,2), * FROM interscript WHERE CONTAINS(str, 'beijing',
FUZZY(0.7,'interScriptMatching=on')) ORDER BY SCORE() DESC
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
User | Count |
---|---|
13 | |
10 | |
10 | |
7 | |
6 | |
5 | |
5 | |
5 | |
4 | |
4 |