SAP Sybase IQ UDA: Full Text Search by Examples
The Unstructured Data Analytics (UDA) option extent the capabilities of SAP Sybase IQ to perform Data Mining. Here you will find examples of queries that are possible using the UDA option. You will need a UDA license to run these queries.
For these examples I created a table as follow:
CREATE TABLE DBA.MyReport (
MyRptID integer NOT NULL,
MyRptText CLOB NULL,
PRIMARY KEY(MyRptID)
);
The table was populated with 1.7 millions of records, the MyRptText column was populated with clinical laboratory results saved in text format. The database as created with CASE RESPECT.
Two (2) indexes and one(1) text configuration object were created, a text index using the default_char configuration object, and a NGRAM text index using a new text configuration object.
Create text index RptTextIdx on DBA.MyReport
(MyRptText) configuration default_char;
— Create a new configuration object
CREATE TEXT CONFIGURATION myFuzzyTextConfig FROM default_char;
— Change to NGRAM and set maxlength to 5
ALTER TEXT CONFIGURATION myFuzzyTextConfig
TERM BREAKER NGRAM;
ALTER TEXT CONFIGURATION myFuzzyTextConfig
MAXIMUM TERM LENGTH 5;
CREATE TEXT INDEX myFuzzyTextIdx
ON DBA.MyReport (MyRptText) CONFIGURATION myFuzzyTextConfig;
All the queries were successfully executed, the results are not included in this document. No performance data is provided, but all of the queries executed in 1 minute or less.
Use Case 1: Find the 10 most relevant records relative to a term using the calculated score from the CONTAINS clause.
select MyRptID, ct.score, MyRptText
FROM MyReport CONTAINS (MyReport.MyRptText, ‘Difficile | DIFFICILE’) as ct
where ct.score > 10
order by ct.score desc;
Use Case 2: Search records containing a phrase (can also be done with LIKE)
select count(*)
FROM MyReport
CONTAINS (MyReport.MyRptText, ‘Clostridium difficile toxin A/B DETECTED’ ) as ct;
select MyRptID, MyRptText
FROM MyReport
CONTAINS (MyReport.MyRptText, ‘Clostridium difficile toxin A/B DETECTED’ ) as ct;
Use Case 3: Search records using Boolean expression
select *
FROM MyReport
CONTAINS (MyReport.MyRptText, ‘”Clostridium difficile toxin A/B DETECTED” and “ER”‘ ) as ct;
select *
FROM MyReport CONTAINS (MyReport.MyRptText,‘(Clostridium | difficile | DIFFICILLE) AND DETECTED’) as ct;
select *
FROM MyReport CONTAINS (MyReport.MyRptText,‘((Clostridium | difficile | DIFFICILLE) AND DETECTED) AND NOT (” by PCR”)’) as ct;
select *
FROM MyReport CONTAINS (MyReport.MyRptText,‘((Clostridium | difficile | DIFFICILLE) AND DETECTED) AND (” by PCR”)’) as ct;
Use case 4: Find terms in proximity
select *
FROM MyReport CONTAINS (MyReport.MyRptText,‘difficile NEAR blood’) as ct;
select *
FROM MyReport CONTAINS (MyReport.MyRptText,‘difficile NEAR negative’) as ct;
select *
FROM MyReport CONTAINS (MyReport.MyRptText,‘cholesterol’) as ct
order by ct.score desc;
select *
FROM MyReport CONTAINS (MyReport.MyRptText,‘gallbladder BEFORE cholesterol’) as ct
order by ct.score desc;
select MyRptID, ct.score, MyRptText
FROM MyReport CONTAINS (MyReport.MyRptText,‘cholesterol NEAR ultrasound’) as ct
order by ct.score desc;
select MyRptID, ct.score, MyRptText
FROM MyReport CONTAINS (MyReport.MyRptText,‘cholesterol NEAR[2,5] ultrasound’) as ct
order by ct.score desc;
Use case 5: Find terms using prefixes
select count(*)
FROM MyReport CONTAINS (MyReport.MyRptText,‘Urin*’) as ct;
select count(*)
FROM MyReport CONTAINS (MyReport.MyRptText,‘Urin* AND NOT Nega*’) as ct;
select count(*)
FROM MyReport CONTAINS (MyReport.MyRptText,‘Urin* AND Aty* AND NOT Nega* ‘) as ct;
select count(*)
FROM MyReport CONTAINS (MyReport.MyRptText,‘Urin* AND Aty* AND NOT Nega* AND blood ‘) as ct;
select MyRptID, ct.score, MyRptText
FROM MyReport CONTAINS (MyReport.MyRptText,‘Urin* AND Aty* AND NOT Nega* AND blood ‘) as ct
order by ct.score desc;
Use case 6: Search for term similar to a given term (FUZZY SEARCH)
select count(*) FROM MyReport
CONTAINS (MyReport.MyRptText, ‘FUZZY “difficile”‘ ) as ct;
select count(*) FROM MyReport
CONTAINS (MyReport.MyRptText, ‘difficile AND FUZZY “DETECTED”‘ ) as ct;
select count(*) FROM MyReport
CONTAINS (MyReport.MyRptText, ‘difficile AND FUZZY “DETECTED” AND NOT DETECTED’ ) as ct;
select top 100 * FROM MyReport
CONTAINS (MyReport.MyRptText, ‘difficile AND FUZZY “DETECTED” AND NOT DETECTED’ ) as ct
order by ct.score desc;
Use case 7: Find the number of indexed values in which a term appears
select * from sa_text_index_vocab(‘RptTextIdx’,‘MyReport’,‘DBA’)
where term like ‘%[Cc]holesterol%’;
select * from sa_text_index_vocab(‘RptTextIdx’,‘MyReport’,‘DBA’)
where term like ‘%[Cc][Hh][Oo][Ll][Ee][Ss][Tt][Ee][Rr][Oo][Ll]%’;
select * from sa_text_index_vocab(‘RptTextIdx’,‘MyReport’,‘DBA’)
where term like ‘%[dD][Ii][Ff][Ff][Ii][Cc][Ii][Ll][Ee]%’;
These are only a few examples of the type of queries that can be used on a CLOB column using the UDA option; This is not a complete list of possibilities.
Great write-up on this topic!
Thanks Mark.
Hi Guys!
I've been having problems with the fuzzy search, or any CONTAINS use.
I have a table with a VARCHAR(256) column, every CONTAINS search return "CONTAINS has NULL search term.". I tried the code written here, it worked with most CONTAINS examples used, but not the fuzzy. Looks like the NGRAM breaker is the problem.
Do you know why is it?