Hello Folks,

In this document, I want to share my thoughts on Text Analytics in HANA (SAP HANA SP10).


When building a custom dictionary, you generally need to perform two kinds of search:


1) Fixed string search

Ex: Searching for keywords or groups of words like “good people”, “people”, “Middle East”, “real number”


2) Dynamic Pattern search

Ex: Searching for patterns like <8 digit>-<1 digit>, or words starting with $ like $30000, $50000

I needed this in my own work as well and asked about it here: How to achieve Pattern matching using custom di… | SCN
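To make the distinction between the two kinds of search concrete, here is a minimal Python sketch. It is not HANA code: Python's re module only stands in for the idea of a pattern, and the function names and sample sentence are illustrative assumptions. HANA dictionaries and CGUL rules have their own matching behaviour.

```python
import re

# Fixed strings: the exact keywords or phrases to look for
# (taken from the fixed string examples above).
FIXED_TERMS = ["good people", "people", "Middle East", "real number"]

# Dynamic pattern: eight digits, a hyphen, then one digit.
# \b keeps a match from starting or ending inside a longer run of digits.
ID_PATTERN = re.compile(r"\b\d{8}-\d\b")


def find_fixed_terms(text):
    """Return the fixed terms that occur in text (case-insensitive substring match)."""
    lowered = text.lower()
    return [term for term in FIXED_TERMS if term.lower() in lowered]


def find_ids(text):
    """Return every <8 digit>-<1 digit> token found in text."""
    return ID_PATTERN.findall(text)


# Made-up sample sentence, for illustration only.
sample = "Good people in the Middle East filed claim 12345678-9 last year."
print(find_fixed_terms(sample))
print(find_ids(sample))
```

The fixed search needs every term listed in advance, while the pattern search describes the shape of a token, so it also finds values that nobody listed beforehand.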

When deciding whether to use RegEx or CGUL rules, you have to weigh factors such as maintenance and performance. One of the most common use cases for a text analytics dictionary involves columns of data type CLOB or string, where there is a lot of data to analyze (Ex: Twitter data or email data).

RegEx also does not work on the CLOB data type when the data is long, as discussed in the thread below:

Re: Search using regular expression over CLOB/NCLOB datatype

18.PNG

Sample Scenario:

From the website The American Presidency Project, I have loaded 3 records, one for each of 3 election speeches (Bernie Sanders and Trump in 2016, JFK in 1960), into the table shown below:

16.PNG

SAP Defined dictionary:

If you want to quickly analyse the sentiment of the speeches, you can use one of the configurations delivered by SAP, such as EXTRACTION_CORE_VOICEOFCUSTOMER:


-- Full-text index using the SAP-delivered voice-of-customer configuration
CREATE FULLTEXT INDEX "SPEECH_SUMMARY" ON "ELECTION_SPEECH"
("SPEECH_SUMMARY") CONFIGURATION 'EXTRACTION_CORE_VOICEOFCUSTOMER'
TEXT ANALYSIS ON
TOKEN SEPARATORS '';



28.PNG

Here are some sample charts built on the generated index table, $TA_SPEECH_SUMMARY:

20.PNG

22.PNG

Fixed string search:

As mentioned at the start of the document, you may instead want a fixed search on keywords like ‘President’, ‘John’, ‘Sanders’, ‘Trump’, ‘America’, ‘ISIS’, ‘good people’ and ‘Middle East’.

For this we can build a custom dictionary and further categorise the words into 2 categories, i.e. Person_name and Random, as shown below:

23.PNG

Once the .hdbtextdict file is complete, we need to create an .hdbtextconfig file. In the pre-processing section of the config file we can also add a filter that limits the output to only the search tokens requested in the dictionary.

25.PNG


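-- The index name is reused from the earlier example, so drop that index first if it still exists:
-- DROP FULLTEXT INDEX "SPEECH_SUMMARY";
CREATE FULLTEXT INDEX "SPEECH_SUMMARY" ON "ELECTION_SPEECH"
("SPEECH_SUMMARY")
CONFIGURATION '<PackageName>::election_speech.hdbtextconfig'
TEXT ANALYSIS ON
TOKEN SEPARATORS '';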



Sample output of the index, which now contains only the tokens of interest:

26.PNG

Count of records in index:

27.PNG

This way we not only group the words of interest into categories but also limit the index size, because data we are not interested in is no longer stored. As you can see above, the count decreased to 127 records.
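As a rough illustration of the idea, the Python sketch below tags only the dictionary entries and ignores every other word, which is why the result stays small. It is not how HANA implements dictionary matching. The assignment of the keywords to Person_name and Random, the function name and the sample sentence are assumptions made for the example.

```python
import re

# Hypothetical stand-in for the custom dictionary: category -> entries.
# Which keyword belongs to which category is an assumption.
DICTIONARY = {
    "Person_name": ["John", "Sanders", "Trump"],
    "Random": ["President", "America", "ISIS", "good people", "Middle East"],
}


def extract_entities(text):
    """Return (matched text, category, offset) for every dictionary hit, in text order."""
    hits = []
    for category, entries in DICTIONARY.items():
        for entry in entries:
            # Whole-word, case-insensitive match of the dictionary entry.
            pattern = re.compile(r"\b" + re.escape(entry) + r"\b", re.IGNORECASE)
            for match in pattern.finditer(text):
                hits.append((match.group(0), category, match.start()))
    # Sort by offset so the hits come back in reading order.
    return sorted(hits, key=lambda hit: hit[2])


# Made-up sample sentence, for illustration only.
sample = "President Trump said good people live in America."
for hit in extract_entities(sample):
    print(hit)
```

Words such as "said" or "live" never reach the result, which mirrors how the filtered index keeps only the requested tokens.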

Dynamic pattern search:

So far so good. We are able to extract single words as well as multi-word phrases. But there can also be requirements, like the one I had in my work, to search by pattern and store the matches in the index.

To explain the technical flow, let us take the example from the start of the document and search for tokens that start with $, to find out how many times money has been mentioned and the details of each mention.

We need to create an .hdbtextrule file (CGUL rules) like the one shown below, whose syntax looks similar to RegEx.

30.PNG

Anything that starts with $ and is followed by 1 to 30 digits will now show up under the category “MONEY”.
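To get a feel for which tokens a rule of this shape picks up, the behaviour can be approximated with a Python regular expression. This is only a sketch of the description above ($ followed by 1 to 30 digits), not the CGUL rule itself. The function name and sample sentence are assumptions.

```python
import re

# '$' followed by 1 to 30 digits. The lookahead (?!\d) rejects a run of
# more than 30 digits instead of silently cutting it off at 30.
MONEY_PATTERN = re.compile(r"\$\d{1,30}(?!\d)")


def find_money(text):
    """Return every token that looks like '$' followed by 1 to 30 digits."""
    return MONEY_PATTERN.findall(text)


# Made-up sample sentence, for illustration only.
sample = "We spent $30000 last year and $50000 this year, not $ 10."
print(find_money(sample))
```

Note that a pattern this simple does not handle thousands separators or decimals: for "$1,000" it would return only "$1".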

You need to add this rule to the config file as shown below:


    <property name="ExtractionRules" type="string-list">
     <string-list-value>TEST.test::election_speech.hdbtextrule</string-list-value>
    </property>



Once you rebuild the index, the new output shows the “MONEY” category as well:

29.PNG

This is a simple example of how we can leverage text analytics in HANA and get started with CGUL rules.

Hope this document is helpful to you guys.

I would also like to thank Anthony Waite and the openSAP team for their helpful tutorial, which helped me a lot: Text Analytics with SAP HANA Platform – Anthony Waite, Yolande Meessen, Bill Miller, and Michael Wiesner

Do have a look at the next blog on a similar topic if you are interested:

SAP HANA: Understanding regular expressio… | SCN

Yours

Krishna Tangudu


5 Comments


  1. chandan praharaj

    Nice doc on HANA TA. It helped me get more interested in this topic. Need more use cases like this from you. Thanks for all your effort.  🙂

  2. Suman Karanam

    Hi Krishna,

    Thanks for the blog. I had similar requirements and I observed that CGUL doesn’t support all regular expression syntax.

    I wanted to extract the last 3 letters of every token. [A-Za-z]{3}$ works as a regular expression but not in CGUL. Any suggestions?
