
To set the context, let's first do things in HANA TA without the CGUL rules.

1. Let's create a small table with texts

So let's create a table which looks something like this:

TableDefinition.PNG

Now let's create two texts in it:

insert into "S_JTRND"."TA_TEST" values(1,'EN','TO BE','');

insert into "S_JTRND"."TA_TEST" values(2,'EN','NOT TO BE','');

So now the table entries look like:

TableData.PNG

2. Text analysis via dictionary

Now let's say we want to do text analysis where we say:

  1. if the text is “TO BE” it is to be treated as POSITIVE_CONTEXT
  2. if the text is “NOT TO BE” it is to be treated as NEGATIVE_CONTEXT

Let's create a dictionary to hold these two values:

So in the XSJS project we create english-Contextdict.hdbtextdict with the following content (also attached):

<dictionary xmlns="http://www.sap.com/ta/4.0">
  <entity_category name="POSITIVE_CONTEXT">
    <entity_name standard_form="TO BE">
      <variant name="TO BE" />
    </entity_name>
  </entity_category>
  <entity_category name="NEGATIVE_CONTEXT">
    <entity_name standard_form="NOT TO BE">
      <variant name="NOT TO BE" />
    </entity_name>
  </entity_category>
</dictionary>

Now we use the dictionary above to create a configuration file (also attached):

So, pick the content from any existing .hdbtextconfig and add the path to the above dictionary in it:

<configuration name="SAP.TextAnalysis.DocumentAnalysis.Extraction.ExtractionAnalyzer.TF" based-on="CommonSettings">
  <property name="Dictionaries" type="string-list">
    <string-list-value>JTRND.TABlog.dictonary::english-Contextdict.hdbtextdict</string-list-value>
  </property>
</configuration>

3. Create a full-text index on the table using this configuration

CREATE FULLTEXT INDEX "IDX_CONTEXT" ON "S_JTRND"."TA_TEST" ("TEXT")
  LANGUAGE COLUMN "LANG"
  CONFIGURATION 'JTRND.TABlog.cfg::JT_TEST_CFG' ASYNC
  LANGUAGE DETECTION ('en','de')
  PHRASE INDEX RATIO 0.000000
  FUZZY SEARCH INDEX OFF
  SEARCH ONLY OFF
  FAST PREPROCESS OFF
  TEXT MINING OFF
  TEXT ANALYSIS ON;

Check the TA results:

TA_1.PNG
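For readers following along without the screenshot, the results can also be read with a query. This is a minimal sketch, assuming the index writes to the generated table "$TA_IDX_CONTEXT" in the same schema and that the key column of TA_TEST is called "ID" (the table definition is only shown as an image, so that name is a guess). Because the index was created with ASYNC, the rows may take a moment to appear.

```sql
-- Read the text analysis output for the IDX_CONTEXT index.
-- Assumption: "ID" is the key column of "S_JTRND"."TA_TEST".
SELECT "ID",
       "TA_TOKEN",      -- the matched text
       "TA_TYPE",       -- e.g. POSITIVE_CONTEXT or NEGATIVE_CONTEXT
       "TA_NORMALIZED",
       "TA_SENTENCE",
       "TA_OFFSET"
  FROM "S_JTRND"."$TA_IDX_CONTEXT"
 ORDER BY "ID", "TA_COUNTER";
```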

Note: for NOT TO BE we did not get both POSITIVE_CONTEXT (for the substring TO BE) and NEGATIVE_CONTEXT. That is the result we want, but it is a fluke: TA took the longest matching string, so NOT TO BE and its substring TO BE together produced a single NEGATIVE_CONTEXT. Relying on this could create problems.

Moving on, let's add more to this context: the text NOT-TO BE should also count as NEGATIVE_CONTEXT. In fact, NOT followed by TO BE anywhere in the same sentence should be a NEGATIVE_CONTEXT.

Without changing anything, let's insert some more values and see how they look:

insert into "S_JTRND"."TA_TEST" values(3,'EN','NOT-TO BE','');

insert into "S_JTRND"."TA_TEST" values(4,'EN','NOT, TO BE','');

insert into "S_JTRND"."TA_TEST" values(5,'EN','NOT, Negates TO BE','');

Check the TA results:

TA_2.PNG

So you see we now have a problem. We could also have NOT, -, NEG, etc. as possible predecessors of TO BE that indicate a NEGATIVE_CONTEXT.

Solution 1: Put the synonyms of NOT in one category (NEGATIVE) and TO BE in a "CONTEXT" category. Then, in post-processing of the TA output, check whether TA_TYPE values CONTEXT and NEGATIVE occur in the same sentence; if so, it is a NEGATIVE_CONTEXT.
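A rough sketch of what that post-processing could look like, assuming a dictionary with the two categories CONTEXT and NEGATIVE has been deployed, that the results sit in "$TA_IDX_CONTEXT", and that the key column of TA_TEST is named "ID" (all three are assumptions, not something shown above):

```sql
-- Self-join the TA output: a CONTEXT hit counts as negative when a
-- NEGATIVE hit precedes it in the same sentence of the same document.
SELECT DISTINCT
       c."ID",
       c."TA_SENTENCE",
       n."TA_TOKEN" AS negation_token,
       c."TA_TOKEN" AS context_token
  FROM "S_JTRND"."$TA_IDX_CONTEXT" AS c
  JOIN "S_JTRND"."$TA_IDX_CONTEXT" AS n
    ON n."ID"           = c."ID"
   AND n."TA_PARAGRAPH" = c."TA_PARAGRAPH"
   AND n."TA_SENTENCE"  = c."TA_SENTENCE"
 WHERE c."TA_TYPE" = 'CONTEXT'
   AND n."TA_TYPE" = 'NEGATIVE'
   AND n."TA_OFFSET" < c."TA_OFFSET";   -- negation comes before TO BE
```

This works, but every consumer of the TA table has to repeat the join.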

But wouldn't it be great if the index could do this on its own?

CGUL Rules save the day:

So here we go:

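4. Create a .rul file

CONTEXT.rul (also attached) contains the following rule. It matches NOT, then TO, then BE within a single sentence, allowing any tokens in between (<> is the wildcard token and *? asks for the shortest match):

#group NEGATIVE_CONTEXT (scope="sentence") : { <NOT> <>*? <TO> <>*? <BE> }

We need to compile this rule to get a .fsm file and put it on the server under …lexicon/lang (compiling is out of scope for this blog; I have attached the compiled file here).

Now enhance your configuration file with a reference to this .fsm file.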

<configuration name="SAP.TextAnalysis.DocumentAnalysis.Extraction.ExtractionAnalyzer.TF" based-on="CommonSettings">
  <property name="Dictionaries" type="string-list">
    <string-list-value>JTRND.TABlog.dictonary::english-Contextdict.hdbtextdict</string-list-value>
  </property>
  <property name="ExtractionRules" type="string-list">
    <string-list-value>CONTEXT.fsm</string-list-value>
  </property>
</configuration>

5. Restart the indexserver process so that the newly compiled rule file is picked up by the system.

indexServerProcessRestart.PNG

6. Recreate the index using the same statement as above and check the TA table:

TA_3.PNG
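Recreating the index implies dropping the old one first. A minimal sketch of that sequence, again assuming the generated result table is "$TA_IDX_CONTEXT" and the key column of TA_TEST is named "ID" (a guess, since the table definition is only available as a screenshot):

```sql
-- Drop the existing index; its generated $TA table is expected to go with it.
DROP FULLTEXT INDEX "S_JTRND"."IDX_CONTEXT";

-- ... rerun the CREATE FULLTEXT INDEX statement from step 3 here ...

-- Then list only the extracted contexts, rule-based and dictionary-based.
SELECT "ID", "TA_RULE", "TA_TOKEN", "TA_TYPE", "TA_SENTENCE"
  FROM "S_JTRND"."$TA_IDX_CONTEXT"
 WHERE "TA_TYPE" IN ('POSITIVE_CONTEXT', 'NEGATIVE_CONTEXT')
 ORDER BY "ID", "TA_COUNTER";
```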

As you see, the highlighted values come from the rule and mark the extracted NEGATIVE_CONTEXT. Below them I kept, for comparison, the dictionary value that wrongly identified a POSITIVE_CONTEXT; ideally this case should not be handled by dictionaries.

So, in this context: To Be or Not To Be: HANA Text Analysis CGUL rules indeed have the answer!!

Hope this helps,

Bricks and Bats are Welcome
