SAP Sybase IQ: Text Mining and Case Ignore Property
The SAP Sybase IQ Unstructured Data Analytics (UDA) option
extends the capabilities of SAP Sybase IQ to text analysis (data mining). This
option allows the creation of Character Large Object (CLOB) and Binary Large
Object (BLOB) columns, which are used to store and manipulate binary documents
(such as MS Excel or MS Word files) and long text (the filtered content of the
binary objects).
To obtain insight from those CLOBs, we need to index the
columns and use string functions to retrieve, compare, and extract
information.
A case-sensitive database can:
- Add complexity to the mining process by requiring complex query predicates, and
- Lead to omissions, because a term can appear in any combination of upper-
and lower-case characters (erroneous or not).
There are several options that can be used to minimize the
impact of case sensitivity during data mining. Let's look at some of them:
- Use every possible combination of upper and
lower case in the predicates of your queries (a lot of possibilities, not
recommended).
- Use a function in the predicate of the query to
convert the content of the column to upper or lower case before applying a
comparison operator.
- Convert the pre-filtered text to upper or lower
case before storing it in the CLOB column, and use the same case in all the
predicates of your queries.
- Create the database with the CASE IGNORE option;
this option cannot be changed after the database has been created.
An example of the second option:
-- Lower-case the column value so the comparison ignores case
SELECT * FROM MyUser.Mytable
WHERE LCASE(mycolum) LIKE '%term%'
This works well for string columns that are not CLOBs; the LCASE, UCASE, LOWER, and UPPER functions are not supported
on CLOB columns.
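For CLOB columns, the third option in the list above is one way around this limitation: normalize the case before the text reaches the column. The following is only a minimal sketch. The doc_id column and the sample text are hypothetical, and it assumes the filtered text is short enough to be passed as an ordinary string value; in practice the conversion would more likely happen in the filtering step outside the database.

```sql
-- Hypothetical layout: doc_id and the sample text are placeholders.
-- LOWER() is applied to the string value being inserted,
-- not to the CLOB column itself.
INSERT INTO MyUser.Mytable (doc_id, mycolum)
VALUES (1, LOWER('Quarterly Report: Revenue GREW in the Third Quarter'));

-- Every query then uses lower-case search terms only.
SELECT doc_id
FROM MyUser.Mytable
WHERE mycolum LIKE '%revenue%';
```

The trade-off is that the original capitalization of the text is lost in the CLOB column, so this only fits cases where the stored text is used purely for searching.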
If the SAP Sybase IQ database will be used primarily for data mining
and case can be ignored, it is recommended to create the database with the CASE
IGNORE property; by default, all SAP Sybase IQ databases are created with the CASE
RESPECT property.
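As a rough sketch of what that looks like, the CASE clause is given when the database is created. The file names below are hypothetical, and a real CREATE DATABASE statement would normally carry additional clauses (sizes, page size, collation) suited to the installation.

```sql
-- Hypothetical file names; adjust paths and add sizing clauses as needed.
CREATE DATABASE 'textmining.db'
CASE IGNORE
IQ PATH 'textmining.iq';

-- After connecting to the new database, check the setting:
-- this property reports whether the database is case sensitive.
SELECT DB_PROPERTY('CaseSensitive');
```

Because the setting cannot be changed later, it is worth running the check right after creation and before any documents are loaded.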