To Be or Not To Be: HANA Text Analysis CGUL rules ...

Former Member · ‎05-23-2014

To set the context, let's first do things in HANA Text Analysis (TA) without CGUL rules.

1. Let's create a small table with texts

So let's create a table that looks something like this:
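
A table definition consistent with the INSERT statements and the index used in this post could be sketched as follows. Only the schema, the table name and the columns "LANG" and "TEXT" come from the post; the key column name "ID", the fourth column name "REMARKS" and all data types are assumptions.

```sql
-- Sketch only: "ID", "REMARKS" and every data type/length are assumptions.
-- "LANG" and "TEXT" are the column names used by the full-text index later on.
CREATE COLUMN TABLE "S_JTRND"."TA_TEST" (
    "ID"      INTEGER PRIMARY KEY,  -- text analysis needs a primary key on the source table
    "LANG"    NVARCHAR(2),          -- language code, e.g. 'EN'
    "TEXT"    NVARCHAR(200),        -- the text to be analysed
    "REMARKS" NVARCHAR(200)         -- hypothetical fourth column; the inserts pass ''
);
```

The inserts in this post supply four values per row, which is why the sketch has four columns.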

Now let's insert two texts into it:

insert into "S_JTRND"."TA_TEST" values(1,'EN','TO BE','');

insert into "S_JTRND"."TA_TEST" values(2,'EN','NOT TO BE','');

The table now holds two rows, both with language 'EN': row 1 with the text TO BE and row 2 with the text NOT TO BE.

2. Text Analysis Via Dictionary

Now let's say we want to do text analysis with the following rules:

if the text is "TO BE" it is to be treated as POSITIVE_CONTEXT
if the text is "NOT TO BE" it is to be treated as NEGATIVE_CONTEXT

Let's create a dictionary holding these two values:

So, in the XSJS project we create a file english-Contextdict.hdbtextdict with the following content (also attached):

<dictionary xmlns="http://www.sap.com/ta/4.0">
<entity_category name="POSITIVE_CONTEXT">
<entity_name standard_form="TO BE">
</entity_name>
</entity_category>
<entity_category name="NEGATIVE_CONTEXT">
<entity_name standard_form="NOT TO BE">
</entity_name>
</entity_category>
</dictionary>

Now we use the dictionary above to create a configuration file (also attached):

So, copy the content of any existing .hdbtextconfig file and add the path to the dictionary above to it:

<property name="Dictionaries" type="string-list">
<string-list-value>JTRND.TABlog.dictonary::english-Contextdict.hdbtextdict</string-list-value>
</property>
</configuration>

3. Create a full-text index on the table using this configuration

CREATE FULLTEXT INDEX "IDX_CONTEXT" ON "S_JTRND"."TA_TEST" ("TEXT")
LANGUAGE COLUMN "LANG"
CONFIGURATION 'JTRND.TABlog.cfg::JT_TEST_CFG' ASYNC
LANGUAGE DETECTION ('en','de')
PHRASE INDEX RATIO 0.000000
FUZZY SEARCH INDEX OFF
SEARCH ONLY OFF
FAST PREPROCESS OFF
TEXT MINING OFF
TEXT ANALYSIS ON;

Check the TA results:
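
The results can be read from the table that HANA generates for the index, named "$TA_" plus the index name, in the same schema as the source table. A minimal query sketch, assuming the key column of the source table is called "ID" (the name is not given in this post):

```sql
-- "$TA_IDX_CONTEXT" is generated for the index "IDX_CONTEXT".
-- "ID" is an assumed key column name; the TA_* columns are the standard output columns.
SELECT "ID", "TA_RULE", "TA_COUNTER", "TA_TOKEN", "TA_TYPE", "TA_SENTENCE"
FROM "S_JTRND"."$TA_IDX_CONTEXT"
ORDER BY "ID", "TA_COUNTER";
```

Because the index is created with ASYNC, the rows may take a moment to appear after the index is created or new texts are inserted.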

Note: for NOT TO BE we did not get both POSITIVE_CONTEXT (for the substring TO BE) and NEGATIVE_CONTEXT. Although this is the result we want, it is a fluke: TA takes the longest matching string, so NOT TO BE won over its substring TO BE and we got only the negative. Relying on this could create problems.

Moving on, let's add more to this context: the text NOT-TO BE should also count as a NEGATIVE_CONTEXT. In fact, NOT followed by TO BE anywhere in the same sentence should be a NEGATIVE_CONTEXT.

Without changing anything, let's insert some more values and see how they look:

insert into "S_JTRND"."TA_TEST" values(3,'EN','NOT-TO BE','');

insert into "S_JTRND"."TA_TEST" values(4,'EN','NOT, TO BE','');

insert into "S_JTRND"."TA_TEST" values(5,'EN','NOT, Negates TO BE','');

Check the TA results:
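
To see at a glance which texts received which types, the TA output can be aggregated per source row. This is only a sketch: "ID" is an assumed key column name, and rows without any extracted entity show NULL in "TA_TYPES".

```sql
-- One line per source text with all extracted TA types, in order of occurrence.
-- "ID" is an assumed key column name; "TA_TYPES" is just an output alias.
SELECT t."ID",
       t."TEXT",
       STRING_AGG(ta."TA_TYPE", ', ' ORDER BY ta."TA_COUNTER") AS "TA_TYPES"
FROM "S_JTRND"."TA_TEST" AS t
LEFT OUTER JOIN "S_JTRND"."$TA_IDX_CONTEXT" AS ta
       ON ta."ID" = t."ID"
GROUP BY t."ID", t."TEXT"
ORDER BY t."ID";
```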

So you see we now have a problem: the dictionary only knows the exact phrase NOT TO BE. We could also have NOT, -, NEG, etc. as possible predecessors of TO BE that indicate a NEGATIVE_CONTEXT.

Solution 1: Put the synonyms of NOT into one category (NEGATIVE) and TO BE into a CONTEXT category. Then, in post-processing of the TA output, check whether the TA_TYPE values CONTEXT and NEGATIVE occur in the same sentence; if they do, it is a NEGATIVE CONTEXT.
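
Such a post-processing step could be sketched as a self-join on the TA table. This assumes the dictionary has been changed to emit the two categories CONTEXT and NEGATIVE described above, and that the key column is called "ID"; the offset comparison is an optional refinement so that the negation has to come before TO BE.

```sql
-- Sketch of Solution 1: derive a negative context from two dictionary hits
-- in the same sentence. Assumes TA_TYPE values 'CONTEXT' and 'NEGATIVE'
-- and an assumed key column "ID"; "DERIVED_TYPE" is just an output alias.
SELECT DISTINCT
       ctx."ID",
       ctx."TA_SENTENCE",
       'NEGATIVE_CONTEXT' AS "DERIVED_TYPE"
FROM "S_JTRND"."$TA_IDX_CONTEXT" AS ctx
INNER JOIN "S_JTRND"."$TA_IDX_CONTEXT" AS neg
        ON neg."ID" = ctx."ID"
       AND neg."TA_SENTENCE" = ctx."TA_SENTENCE"
       AND neg."TA_OFFSET" < ctx."TA_OFFSET"   -- optional: negation precedes TO BE
WHERE ctx."TA_TYPE" = 'CONTEXT'
  AND neg."TA_TYPE" = 'NEGATIVE';
```

This works, but every consumer of the TA table would have to repeat the join.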

But wouldn't it be great if the index could do this on its own?

CGUL Rules save the day:

So here we go:

4. Create a .rul file

CONTEXT.rul (also attached) containing the following rule:

#group NEGATIVE_CONTEXT (scope="sentence") : { <NOT> <>*? <TO> <>*? <BE> }

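In this rule, <NOT>, <TO> and <BE> match those tokens, and <>*? matches any tokens in between (shortest match), all within one sentence.

We need to compile this rule into a .fsm file and put it on the server under ...lexicon/lang (the compilation itself is out of scope for this blog; I have attached the compiled file here).

Now enhance your configuration file with a reference to this .fsm file: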

<property name="Dictionaries" type="string-list">
<string-list-value>JTRND.TABlog.dictonary::english-Contextdict.hdbtextdict</string-list-value>
</property>
<property name="ExtractionRules" type="string-list">
<string-list-value>CONTEXT.fsm</string-list-value>
</property>
</configuration>

5. Restart the indexserver process so that the newly compiled rule file is picked up by the system.

6. Recreate the index using the same statement as above and check the TA table:
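
Recreating the index means dropping the existing one first. A sketch of the sequence, again assuming a key column named "ID":

```sql
-- Drop the old index; this also removes its generated TA table.
DROP FULLTEXT INDEX "S_JTRND"."IDX_CONTEXT";

-- Now rerun the unchanged CREATE FULLTEXT INDEX "IDX_CONTEXT" statement from step 3.

-- Afterwards, look only at the two context types.
-- "ID" is an assumed key column name.
SELECT "ID", "TA_RULE", "TA_COUNTER", "TA_TOKEN", "TA_TYPE"
FROM "S_JTRND"."$TA_IDX_CONTEXT"
WHERE "TA_TYPE" IN ('POSITIVE_CONTEXT', 'NEGATIVE_CONTEXT')
ORDER BY "ID", "TA_COUNTER";
```

The "TA_RULE" column helps to tell apart which extraction step produced each row.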

As you can see, the highlighted values come from the rule and mark the extracted NEGATIVE_CONTEXT. Below them I kept, for comparison, the dictionary values that wrongly identified a POSITIVE_CONTEXT; ideally this case should not be handled by dictionaries.

So, in this context, to be or not to be: HANA Text Analysis CGUL rules indeed have the answer!

Hope this helps,

Bricks and Bats are Welcome
