HANA Text Analysis with Custom Dictionaries


Prerequisites:

  • How to create a developer workspace in HANA Studio
  • How to create and share a project in HANA Studio
  • How to run HANA Text Analysis on a table

The release of HANA SPS07 brings many new features. One of the main ones is support for custom dictionaries in Text Analysis. By default, HANA comes with three configurations for text analysis:

  • Core Extraction
  • Linguistic Analysis
  • Voice of Customer

A common issue when working with HANA Text Analysis is defining your own custom configuration for the Text Analysis engine to work with. The following sections show how to create your own custom dictionary so that you can get more out of HANA's text analysis capabilities.

Scenario:

Assume that your company manufactures laptops and has recently launched a new laptop series. You want to know whether consumers who bought the machines are facing any problems. They will very likely be tweeting, posting and blogging about the product on social media.

You are now harvesting massive amounts of unstructured data from social media, blogs, forums, e-mails and other channels. The main motivation is to understand how customers perceive the products (laptops). You may want to receive early warning of product defects and shortfalls, and to listen to channel-specific and market-specific customer concerns and delights.

With HANA SPS07 we can create custom dictionaries to detect occurrences of words, terms and phrases that Text Analysis would not detect without a custom dictionary.

Follow these steps to get started with custom dictionaries:

1. Create the source XML file

I have created some dummy data in a table named User_tweets with “ID” and “TEXT” columns.

User_tweets table structure

ID | TEXT
1  | The #lenovo T540 laptop’s latch are very loose.
2  | my laptop’s mic is too bad. It can’t record any voice. will not be buying #lenovo in near future
3  | LCD display is gone for my T520. Customer care too is pathetic.
4  | T530 performance is awesome. Only problem I am facing is with microphone. 🙁

The mycustomdict.xml file has the following structure:

<?xml version="1.0" encoding="UTF-8"?>
<dictionary name="LAPTOP_COMPONENTS">
   <entity_category name="Internal Parts">
      <entity_name standard_form="Inverter Board">
         <variant name="InverterBoard"/>
         <variant name="InvertrBoard"/>
      </entity_name>
      <entity_name standard_form="LCD Cable">
         <variant name="lcdcable"/>
         <variant name="cable lcd"/>
      </entity_name>
   </entity_category>
</dictionary>

For more on writing the source XML file for custom dictionaries, refer to the SAP HANA Text Analysis Extraction Customization Guide: http://help.sap.com/hana/SAP_HANA_Text_Analysis_Extraction_Customization_Guide_en.pdf

With this custom dictionary, the HANA text analysis engine will detect “Inverter Board” and “LCD Cable”, including the listed variants, as entities of type “Internal Parts”.
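If the list of parts grows, typing the XML by hand gets error-prone. Here is a minimal Python sketch that generates the same structure from a plain mapping. The function name build_dictionary and the shape of the input mapping are my own assumptions for illustration; only the element and attribute names come from the file shown above.

```python
import xml.etree.ElementTree as ET


def build_dictionary(name, categories):
    """Build a <dictionary> element from {category: {standard_form: [variants]}}."""
    root = ET.Element("dictionary", {"name": name})
    for category, entities in categories.items():
        cat = ET.SubElement(root, "entity_category", {"name": category})
        for standard_form, variants in entities.items():
            entity = ET.SubElement(cat, "entity_name", {"standard_form": standard_form})
            for variant in variants:
                ET.SubElement(entity, "variant", {"name": variant})
    return root


# Same content as mycustomdict.xml above.
parts = {
    "Internal Parts": {
        "Inverter Board": ["InverterBoard", "InvertrBoard"],
        "LCD Cable": ["lcdcable", "cable lcd"],
    }
}

dictionary = build_dictionary("LAPTOP_COMPONENTS", parts)
xml_text = ET.tostring(dictionary, encoding="unicode")

# To write the file with an XML declaration:
# ET.ElementTree(dictionary).write("mycustomdict.xml", encoding="UTF-8", xml_declaration=True)
```

The generated file still has to be compiled before HANA can use it.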

2. Compiling the mycustomdict.xml file to a .nc file

First copy the XML file to your HANA machine using an FTP client.

I have copied mycustomdict.xml to the /home/root/customDict folder.

You can find the dictionary compiler “tf-ncc” in your HANA installation at:

/<INSTALLATION_DIR>/<SID>/HDB<INSTANCE_NO>/exe/dat_bin_dir

Text analysis configuration files can be found at the following path:

/<INSTALLATION_DIR>/<SID>/SYS/global/hdb/custom/config/lexicon/lang

Run the compiler on the source mycustomdict.xml file:

export LD_LIBRARY_PATH=/<INSTALLATION_DIR>/<SID>/SYS/exe/hdb:/<INSTALLATION_DIR>/<SID>/SYS/exe/hdb/dat_bin_dir

/<INSTALLATION_DIR>/<SID>/HDB<INSTANCE_NO>/exe/dat_bin_dir/tf-ncc -d /<INSTALLATION_DIR>/<SID>/SYS/global/hdb/custom/config/lexicon/lang -o /<INSTALLATION_DIR>/<SID>/SYS/global/hdb/custom/config/lexicon/lang/mycustomdict.nc /home/root/customDict/mycustomdict.xml

After executing the above command, a file named mycustomdict.nc will be generated in the /<INSTALLATION_DIR>/<SID>/SYS/global/hdb/custom/config/lexicon/lang folder. The text analysis engine will use this file later.
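Because the paths are long and easy to mistype, it can help to assemble the compiler call in a small script. The following Python sketch only builds the argument list and the LD_LIBRARY_PATH value from the locations given above. The helper names and the example values "/usr/sap", "HDB" and "00" are hypothetical placeholders, not values from a real system. The only tf-ncc options used are -d and -o, as in the command above.

```python
import os
import subprocess


def library_path(install_dir, sid):
    """LD_LIBRARY_PATH value covering the HANA executables and dat_bin_dir."""
    exe = f"{install_dir}/{sid}/SYS/exe/hdb"
    return f"{exe}:{exe}/dat_bin_dir"


def tf_ncc_command(install_dir, sid, instance_no, source_xml, output_name):
    """Argument list for compiling source_xml into the custom lexicon folder."""
    lang_dir = f"{install_dir}/{sid}/SYS/global/hdb/custom/config/lexicon/lang"
    compiler = f"{install_dir}/{sid}/HDB{instance_no}/exe/dat_bin_dir/tf-ncc"
    return [compiler, "-d", lang_dir, "-o", f"{lang_dir}/{output_name}", source_xml]


def compile_dictionary(install_dir, sid, instance_no, source_xml, output_name):
    """Run the compiler; only meaningful on the HANA host itself."""
    env = dict(os.environ, LD_LIBRARY_PATH=library_path(install_dir, sid))
    command = tf_ncc_command(install_dir, sid, instance_no, source_xml, output_name)
    subprocess.run(command, env=env, check=True)


# Hypothetical installation values, for illustration only.
command = tf_ncc_command(
    "/usr/sap", "HDB", "00", "/home/root/customDict/mycustomdict.xml", "mycustomdict.nc"
)
```

Calling compile_dictionary with your own installation directory, SID and instance number runs the same command as the manual steps.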


3. Create custom HANA Text Analysis configuration file

After compiling the XML file, we need to create a custom text analysis configuration that refers to the compiled .nc file from the previous step. The configuration file specifies the text analysis processing steps to be performed and the options to use for each step.

In HANA Studio, create a workspace, then create and share a project. Under this project, create a new file with the extension “hdbtextconfig”. Copy into it the entire contents of one of the predefined configurations delivered by SAP, as mentioned above. They are located in the HANA repository package “sap.hana.ta.config”. For this scenario, I have copied the contents of the configuration file “EXTRACTION_CORE_VOICEOFCUSTOMER”.

For details, see “Creating a Text Analysis Configuration”, section 10.1.3.2.1 of the SAP HANA Developer Guide for SPS07: http://help.sap.com/hana/SAP_HANA_Developer_Guide_en.pdf

After copying, find the “Dictionaries” node under the configuration node named “SAP.TextAnalysis.DocumentAnalysis.Extraction.ExtractionAnalyzer.TF” and add a <string-list-value> child node for the compiled dictionary:

<string-list-value>mycustomdict.nc</string-list-value>
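If you maintain several configurations, the same edit can be scripted. The Python sketch below adds the dictionary entry to a heavily simplified stand-in for a configuration file. The element names "configuration" and "property" and the wrapper element in SAMPLE_CONFIG are assumptions made for this illustration; check them against the file you actually copied. A real file may also declare an XML namespace, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

EXTRACTION_NODE = "SAP.TextAnalysis.DocumentAnalysis.Extraction.ExtractionAnalyzer.TF"

# Simplified stand-in for a copied configuration (assumed element names).
SAMPLE_CONFIG = f"""<config>
  <configuration name="{EXTRACTION_NODE}">
    <property name="Dictionaries">
    </property>
  </configuration>
</config>"""


def add_dictionary(config_xml, dictionary_file):
    """Return config_xml with dictionary_file listed under the Dictionaries node."""
    root = ET.fromstring(config_xml)
    for config in root.iter("configuration"):
        if config.get("name") != EXTRACTION_NODE:
            continue
        for prop in config.iter("property"):
            if prop.get("name") != "Dictionaries":
                continue
            existing = [value.text for value in prop.findall("string-list-value")]
            if dictionary_file not in existing:  # avoid duplicate entries
                ET.SubElement(prop, "string-list-value").text = dictionary_file
            return ET.tostring(root, encoding="unicode")
    raise ValueError("Dictionaries node not found")


updated = add_dictionary(SAMPLE_CONFIG, "mycustomdict.nc")
```

Running the function a second time on its own output leaves the file unchanged, so it is safe to call repeatedly.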



Now save, commit and activate the .hdbtextconfig file. After activation, we can run the Text Analysis engine with the custom configuration by executing the following SQL statement, where <columnname> is the column that holds the text and <custom_configuration_file> is the full path of the activated configuration:

CREATE FULLTEXT INDEX <indexname> ON <tablename> (<columnname>) CONFIGURATION '<custom_configuration_file>'

TEXT ANALYSIS ON;
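When the index is created from a script, the statement can be put together with a small helper. This Python sketch only builds the SQL text and does not connect to HANA. The index name and the configuration path "mypackage::mycustomconfig" are hypothetical examples; use the full path of your own activated configuration. The sketch also names the text column explicitly, which CREATE FULLTEXT INDEX expects.

```python
def create_fulltext_index_sql(index_name, table_name, column_name, configuration):
    """Build a CREATE FULLTEXT INDEX statement with text analysis switched on."""
    config_literal = configuration.replace("'", "''")  # escape single quotes
    return (
        f'CREATE FULLTEXT INDEX "{index_name}" ON "{table_name}" ("{column_name}") '
        f"CONFIGURATION '{config_literal}' TEXT ANALYSIS ON"
    )


# Hypothetical names, for illustration only.
statement = create_fulltext_index_sql(
    "USER_TWEETS_IDX", "USER_TWEETS", "TEXT", "mypackage::mycustomconfig"
)
```

The resulting string can be pasted into the SQL console or executed through whichever database client you normally use.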

The analysis results are written to a table named “$TA_<indexname>”, which holds the extracted tokens together with their entity types. For our scenario table, the result looks like this:

[Screenshot: contents of the $TA_<indexname> result table]

As you can see, the Text Analysis engine has identified LCD, latch and Mic as internal parts. These results can be used for data mining or analytical purposes.
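As a small illustration of such follow-up analysis, the Python sketch below counts how often each part is mentioned. The rows are made-up stand-ins that mirror the tokens named above, not an export from a real system. I assume each row carries the document ID plus the TA_TOKEN and TA_TYPE columns of the result table.

```python
from collections import Counter

# Hypothetical rows: (ID, TA_TOKEN, TA_TYPE)
rows = [
    (1, "latch", "Internal Parts"),
    (2, "mic", "Internal Parts"),
    (3, "LCD", "Internal Parts"),
]


def mentions_by_token(rows, entity_type):
    """Count tokens of one entity type, ignoring upper and lower case."""
    return Counter(
        token.lower() for _doc_id, token, ta_type in rows if ta_type == entity_type
    )


part_mentions = mentions_by_token(rows, "Internal Parts")
```

On real data, the parts with the highest counts are the ones customers complain about most often.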


23 Comments


  1. David Wu

    Hi,

    My question is: how can I add the names of some places to the dictionary while HANA still extracts the words that do not belong to those names?

    Thanks.

  2. Abhi Pandey

    Hi Vishnu – Nice blog and a quick question

    We are on SPS 6, Rev 69 (last one before SPS 7).

    Still I was able to follow your blog – all the directories exist as indicated by you

    I reached the point of compiling the .xml file -> .nc file using tf-ncc. I tried various combinations, but the compiler kept giving me an error File not found ./cgc1

    Do you know what this error means – is this file available only in SPS 07?

    Thanks

    -abhi

  3. Abhi Pandey

    I was able to get past the previous error by copying the cgc1 file from Rev69 directory to the lang directory.

    The compile worked fine.

    However, the full text does not return any values

    I followed the steps:

    a. Created a Project, assigned to workspace (shared)

    b. In project explorer view Copied the file EXTRACTION_CORE_VOICEOFCUSTOMER

    c. Added the line referring to my dictionary

    d. Activated File (TD.hdbtextconfig)

    I see the activated object under the project

    In FullText search, I refer to ‘TD’ but no records generated

    Also, I copied the EXTRACTION_CORE_VOICEOFCUSTOMER, and made NO change.

    Activated File —> No records created

    I am not able to commit the Project because Change Management is not Active in our system…

    Or do I need to do something else.

    Please help!!!

    -abhi

    We are on SP6 , Rev 69

    1. Vishnu Kumar Jakhoria Post author

      Hi Abhi,

      When you copy the file 'EXTRACTION_CORE_VOICEOFCUSTOMER' to your folder, rename the copy and then make your changes. To run text analysis on the source table, execute the statement:

      CREATE FULLTEXT INDEX <indexname> ON <tablename> (<columnname>) CONFIGURATION '<custom_configuration_file>'

      TEXT ANALYSIS ON

      In the above statement, please provide the full path of your custom configuration file for the '<custom_configuration_file>' option.

      Regards,

      Vishnu

  4. John Appleby

    Very interesting. I’ve used the standard dictionaries and found them to be lacking, especially for product-specific stuff.

    How did you get on in reality with creating the dictionaries? How do you know what words to put in etc.?

    Would be very interesting if HANA were to be able to learn based on dictionaries.

    1. Vishnu Kumar Jakhoria Post author

      Hi John,

      Thanks for your comment 🙂 . Which words to put in depends entirely on the use case. Here I have taken the parts of a product. Other scenarios may include movie names, product names, part numbers, etc.

      Thanks,

      Vishnu

  5. Abhi Pandey

    Vishnu – Thanks!  I was able to configure and implement custom dictionaries in SPS6

    I saw John’s post. And I agree that there needs to be some intelligence in creating custom dictionaries. Here is something I would like to be able to do:

    a. Identify a handful of seed keywords

    b. Run fuzzy search with a score threshold. This will identify search lines resembling the seed keyword

    c. Extract tokens matching the seed token —–> This is the ????????

    d. Write a JAVA program to compile all the matching tokens

    e. Use an XML XSD generator to automatically generate the XML custom dictionary

    So, does anyone know how to identify what tokens were matched in the fuzzy search

    Thanks

    -abhi

    1. Sai Giridhar Varanasi

      Hi Ashlin,

      I faced the same issue as you did. The XML was totally fine. But I observed that writing

      <dictionary xmlns="http://www.sap.com/ta/config/4.0"> in the second line will solve your issue.

      Thanks

      Sai Giridhar Varanasi

       

  6. Franziskus Heep

    Hi,

    after I typed in following statement:

    CONFIGURATION 'EXTRACTION_CORE_VOICEOFCUSTOMER'

    I get the following error message:

    SAP DBTech JDBC: [257]: sql syntax error: incorrect syntax near "CONFIGURATION": line 1 col 1 (at pos 1)

    What can be the reason for that ?

     

     

