FUZZY SEARCH (一)
We have a Chinese version of this document.
SAP HANA supports several of text search methods. For example, “like” is well known and supported by all main database also supported by SAP HANA. Besides, SAP HANA also supports fuzzy search function. Fuzzy Search is a fast and fault-tolerant search feature for SAP HANA. A fuzzy search returns records even if the search term contains additional or missing characters or other types of spelling errors.
For example, we have table SEARCH_TEXT with three words “HANAGEEK”,”HANAGEEKER”,”HANAGEK”:
If we use fuzzy search functions to search HANAGEEK, all the three words will be returned no matter it contains additional or missing characters.
Fuzzy search is only available for column tables and attribute views and supports the following SQL types and column store types.
String types support a basic fuzzy string search. The values of a column are compared with the user input, using the fault-tolerant fuzzy string comparison.
When working with string types, the fuzzy string comparison always compares the full strings. If searching with ‘SAP’, for example, a record such as ‘SAP Deutschland AG & Co. KG’ gets a very low score, because only a very small part of the string is equal (3 of 27 characters match).
We insert the value ‘SAP Deutschland AG & Co. KG’ into table SEARCH_TEXT and then perform the fuzzy search:
INSERT INTO SEARCH_TEXT VALUES('SAP Deutschland AG & Co. KG');
We can see that the record contains “SAP” is not be queried out. It means the record “SAP Deutschland AG & Co. KG” has a low score. If we change the column type to TEXT, will the results change? Let’s keep on learning.
Text types support a more sophisticated kind of fuzzy search. Texts are tokenized (split into terms) and the fuzzy comparison is done term by term.
For example, when searching with ‘SAP’, a record such as ‘SAP Deutschland AG & Co. KG’ gets a high score, because the term ‘SAP’ exists in both texts.
We will take the same example for comparison. Firstly, we create a table SEARCH_TEXT_TMP with a TEXT type column:
CREATE COLUMN TABLE SEARCH_TEXT_TMP( CONTENT TEXT ); INSERT INTO SEARCH_TEXT_TMP VALUES('SAP Deutschland AG & Co. KG');
Then we perform the fuzzy search, we can see that the record will a relative high score:
SELECT SCORE(),* FROM SEARCH_TEXT_TMP WHERE CONTAINS(CONTENT, 'SAP', FUZZY(0.2));
As shown in the above results, the date 2000-10-02 has one more errors compared with the search date 2000-01-10. So, its’ score will be 0, other dates will all be searched out.
Score Calculation for Date Distance
The maximum allowed distance between dates can be defined using the search option ‘maxDateDistance’, which defines a number of days。The default for this option is 0, meaning that the feature is disabled. If we set this option to 5, then the date 5 days before or after the search date is validate.
(1) The identical date gets a score of 1.0.
(2) The date that is maxDateDistance days away from the search input gets a score that equals the fuzzySimilarity parameter (0.95 in the example above).
(3) The score of dates between the identical date and maxDateDistance is calculated as a linear function between the two dates defined above. In other words, for each day the score is reduced by ((1-fuzzySimilarity) / maxDateDistance).
(4) For dates outside the range of maxDateDistance, the score is 0.0.
For example, we insert dates between 2000-01-05 and 2000-01-15 and then perform the following search:
SELECT SCORE() AS score,TO_VARCHAR (CONTENT, 'YYYY-MM-DD') FROM SEARCH_DATE WHERE CONTAINS(CONTENT,'2000-01-10',FUZZY(0.95,'maxDateDistance=5')) ORDER BY score DESC
Search with FUZZY(0.0)
It is not possible to search for all values that have a fuzzy score greater than or equal to 0. This would return all values of a column and would result in large result sets. A search with FUZZY(0.0) therefore returns all values that have a fuzzy score greater than 0. Let’s see the following example:
TRUNCATE TABLE SEARCH_TEXT; INSERT INTO SEARCH_TEXT VALUES('HANAGEEK'); INSERT INTO SEARCH_TEXT VALUES('SAP'); SELECT SCORE() AS SCORE,* FROM SEARCH_TEXT WHERE CONTAINS(CONTENT,'HANAGEEK',FUZZY(0.0)) ORDER BY SCORE DESC;
We can see that, string SAP didn’t return. It means, the score of String SAP is 0 and FUZZY(0.0) only return those strings whose score is greater than 0.