SAP HANA: Using a Custom Dictionary with CGUL Rules for Text Analytics
In this document, I want to share my thoughts on the topics below about Text Analytics in HANA (SAP HANA SP10).
When building a custom dictionary, there are generally two kinds of searches to perform:
1) Fixed string search
Ex: Searching for keywords or groups of words like “good people”, “people”, “Middle East”, “real number”
2) Dynamic pattern search
Ex: Searching for patterns like <8 digits>-<1 digit>, or tokens starting with $, like $30000 or $50000
I needed this in my own work as well, and asked about it here: How to achieve Pattern matching using custom di… | SCN
When deciding whether to use RegEx or CGUL rules, you have many parameters to think about, such as maintenance and performance. One of the most common use cases for creating a text analytics dictionary involves columns of data type CLOB or string, where there is a large amount of data to be analyzed (e.g., Twitter data or email data).
Moreover, RegEx does not work on the CLOB data type when the data is long, as mentioned in the threads below:
From this website : The American Presidency Project
I have loaded three records containing three election speeches — Bernie Sanders and Donald Trump in 2016, and JFK in 1960 — into the table mentioned below:
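The table definition was not shown; a minimal sketch of the DDL follows. Only the SPEECH_SUMMARY column is confirmed by the index definition later in this post — the other column names are my assumptions.

```sql
-- Hypothetical DDL: SPEECH_SUMMARY is the only column confirmed by the
-- full-text index statements below; the rest are illustrative.
CREATE COLUMN TABLE "ELECTION_SPEECH" (
    "ID"             INTEGER PRIMARY KEY,
    "CANDIDATE"      NVARCHAR(100),
    "SPEECH_YEAR"    INTEGER,
    "SPEECH_SUMMARY" NCLOB   -- full speech text; large, hence a LOB type
);
```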
SAP-Defined Dictionary:
If we want to quickly analyse the sentiment of the speeches, we can use one of the dictionaries delivered by SAP, such as EXTRACTION_CORE_VOICEOFCUSTOMER:
CREATE FULLTEXT INDEX "SPEECH_SUMMARY" ON "ELECTION_SPEECH" ("SPEECH_SUMMARY") CONFIGURATION 'EXTRACTION_CORE_VOICEOFCUSTOMER' TEXT ANALYSIS ON TOKEN SEPARATORS '';
Here are some sample charts on the index table created, i.e. $TA_SPEECH_SUMMARY:
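You can also query the $TA_ table directly to build such charts; the columns below follow the standard $TA_ schema that HANA generates for every full-text index with text analysis switched on:

```sql
-- Distribution of extracted token types (Sentiment, Topic, PERSON, ...)
SELECT "TA_TYPE", COUNT(*) AS "TOKEN_COUNT"
FROM "$TA_SPEECH_SUMMARY"
GROUP BY "TA_TYPE"
ORDER BY "TOKEN_COUNT" DESC;
```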
Fixed String Search:
But, as mentioned at the start of the document, suppose we want a fixed-string search on keywords like ‘President’, ‘John’, ‘Sanders’, ‘Trump’, ‘America’, ‘ISIS’, ‘good people’, and ‘Middle East’.
For this we can build a custom dictionary and further categorise the words into two categories, Person_name and Random, as shown below:
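A sketch of what election_speech.hdbtextdict could look like — the exact XML namespace and the entry list are illustrative, so check the SAP-delivered dictionaries for the authoritative schema:

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- Illustrative custom dictionary: two categories, Person_name and Random -->
<dictionary xmlns="http://www.sap.com/ta/4.0">
  <entity_category name="Person_name">
    <entity_name standard_form="Trump"/>
    <entity_name standard_form="Sanders"/>
    <entity_name standard_form="John"/>
  </entity_category>
  <entity_category name="Random">
    <entity_name standard_form="President"/>
    <entity_name standard_form="America"/>
    <entity_name standard_form="ISIS"/>
    <entity_name standard_form="good people"/>
    <entity_name standard_form="Middle East"/>
  </entity_category>
</dictionary>
```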
Once the hdbtextdict is complete, we need to create an hdbtextconfig file; in its pre-processing section we can also restrict extraction to only the search tokens requested in the dictionary.
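A sketch of the relevant fragment of election_speech.hdbtextconfig; in practice you would copy one of the delivered configurations (e.g. EXTRACTION_CORE_VOICEOFCUSTOMER) and adjust it, so treat the structure below as illustrative:

```xml
<!-- Fragment of the custom configuration: point the Dictionaries
     property at the custom .hdbtextdict created above -->
<property name="Dictionaries" type="string-list">
  <string-list-value>TEST.test::election_speech.hdbtextdict</string-list-value>
</property>
```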
CREATE FULLTEXT INDEX "SPEECH_SUMMARY" ON "ELECTION_SPEECH" ("SPEECH_SUMMARY") CONFIGURATION '<PackageName>::election_speech.hdbtextconfig' TEXT ANALYSIS ON TOKEN SEPARATORS '';
A sample output of the index, containing only the tokens of interest, is shown below:
Count of records in the index:
In this way we can not only categorize the words of interest into groups, but also limit the index size so that we don’t store data we don’t care about; as you can see above, the count decreased to 127 records.
Dynamic Pattern Search:
So far so good: we are able to tokenize both multi-word and single-word entries. But there can be requirements, like the one I had at work, to search by pattern and then store the matches in the index.
As mentioned at the start of the document, and just to explain the technical flow, let us try to search for tokens starting with $, to see how many times money is mentioned and the details of each mention.
We need to create the .hdbtextrule file (CGUL rules) mentioned below, whose syntax looks similar to RegEx.
Anything starting with $ and followed by 1 to 30 digits will now show up under the category “MONEY”.
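Under that description, the rule inside election_speech.hdbtextrule might look roughly like the following CGUL — the exact escaping and quantifier syntax here are my assumption (CGUL writes tokens between angle brackets), so verify it against the text analysis language reference:

```
* Illustrative CGUL rule: a single token that starts with a literal $
* and continues with 1 to 30 digits is tagged with the category MONEY.
#group MONEY: <\$[0-9]{1,30}>
```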
You would then need to add this rule to the config file as shown below:
<property name="ExtractionRules" type="string-list">
  <string-list-value>TEST.test::election_speech.hdbtextrule</string-list-value>
</property>
Once you rebuild the index (drop and re-create it), you will see the new output showing “MONEY” entries as well, as shown below:
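You can verify this directly with a query on the $TA_ table; the TA_ columns below are part of the standard $TA_ schema, while the source-table key column name (ID) is an assumption:

```sql
-- List every MONEY mention with its position in the speech text
SELECT "ID", "TA_TOKEN", "TA_COUNTER", "TA_SENTENCE"
FROM "$TA_SPEECH_SUMMARY"
WHERE "TA_TYPE" = 'MONEY'
ORDER BY "ID", "TA_COUNTER";
```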
This is a simple example of how we can leverage text analytics in HANA and understand CGUL rules.
Hope this document is helpful to you.
I would also like to thank Anthony Waite and the openSAP team for the helpful tutorial here.
Do have a look at the next blog on a similar topic if interested: