Former Member

HANA Text Analysis with Custom Dictionaries


Prerequisites. You should already know how to:

  • Create a developer workspace in HANA Studio
  • Create and share a project in HANA Studio
  • Run HANA Text Analysis on a table

With the release of HANA SPS07, a lot of new features are available. One of the main features is support for custom dictionaries in Text Analysis. By default, HANA comes with three configurations for text analysis:

  • Core Extraction
  • Linguistic Analysis
  • Voice of Customer

One of the main issues you can come across while working with HANA Text Analysis is defining your own custom configuration for the Text Analysis engine to work with. Below, you will find how to create your own custom dictionary so that you can benefit more from HANA's text analysis capabilities.


Assume that your company manufactures laptops and has recently launched a new laptop series. You want to know whether the consumers who have bought the machines are facing any problems. Those consumers will certainly be tweeting, posting and blogging about the product on social media.

You are now harvesting massive amounts of unstructured data from social media, blogs, forums, e-mails and other channels. The main motivation is to understand how customers perceive the products (laptops). You may want to receive early warning of product defects and shortfalls, and to listen to channel-specific and market-specific customer concerns and delights.

With HANA SPS07 we can create custom dictionaries to detect occurrences of words, terms or phrases that may not be detected when we run Text Analysis without a custom dictionary.

Follow these steps to get started with custom dictionaries:

1. Create the source XML file

I have created some dummy data in a table with “ID” and “TEXT” columns.

User_tweets table structure




The #lenovo T540 laptop’s latch are very loose.


my laptop’s mic is too bad. It can’t record any voice. will not be buying #lenovo in near future


LCD display is gone for my T520. Customer care too is pathetic.


T530 performance is awesome. Only problem I am facing is with microphone. 🙁

The mycustomdict.xml file has the following structure:

<?xml version="1.0" encoding="UTF-8"?>
<dictionary name="LAPTOP_COMPONENTS">
   <entity_category name="Internal Parts">
      <entity_name standard_form="Inverter Board">
            <variant name="InverterBoard"/>
            <variant name="InvertrBoard"/>
      </entity_name>
      <entity_name standard_form="LCD Cable">
            <variant name="lcdcable"/>
            <variant name="cable lcd"/>
      </entity_name>
   </entity_category>
</dictionary>

Please refer to the following guide to learn more about creating the source XML file for custom dictionaries.

Using the custom dictionary above, the HANA text analysis engine will detect “Inverter Board” and “LCD Cable” as entities of type “Internal Parts” of a laptop.
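Once a dictionary grows beyond a handful of entries, typing the XML by hand gets error-prone. The following is a minimal Python sketch, not part of the original setup, that generates a file with the same element structure as mycustomdict.xml using only the standard library. The function names build_dictionary and to_xml are my own, and ET.indent assumes Python 3.9 or later.

```python
import xml.etree.ElementTree as ET


def build_dictionary(name, categories):
    """Build a custom-dictionary element tree.

    `categories` maps a category name to a dict of
    {standard_form: [variant, ...]}.
    """
    root = ET.Element("dictionary", {"name": name})
    for category_name, entities in categories.items():
        category = ET.SubElement(root, "entity_category", {"name": category_name})
        for standard_form, variants in entities.items():
            entity = ET.SubElement(
                category, "entity_name", {"standard_form": standard_form}
            )
            for variant in variants:
                ET.SubElement(entity, "variant", {"name": variant})
    return root


def to_xml(root):
    """Serialize the tree with an explicit UTF-8 XML declaration."""
    ET.indent(root, space="   ")  # pretty-print; requires Python 3.9+
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    laptop_parts = {
        "Internal Parts": {
            "Inverter Board": ["InverterBoard", "InvertrBoard"],
            "LCD Cable": ["lcdcable", "cable lcd"],
        }
    }
    print(to_xml(build_dictionary("LAPTOP_COMPONENTS", laptop_parts)))
```

Because the serializer writes plain ASCII double quotes, this also avoids the typographic quote characters that tend to creep in when XML is copied from a web page.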

2. Compiling the mycustomdict.xml file to a .nc file

First, copy the XML file to your HANA machine, for example with an FTP client.

I have copied mycustomdict.xml to the /home/root/customDict folder.

You can find the dictionary compiler “tf-ncc” in your HANA installation at:


Text analysis configuration files can be found at the following path:


Run the compiler on the source mycustomdict.xml file:

export LD_LIBRARY_PATH=/<INSTALLATION_DIR>/<SID>/SYS/exe/hdb:/<INSTALLATION_DIR>/<SID>/SYS/exe/hdb/dat_bin_dir

/<INSTALLATION_DIR>/<SID>/HDB<INSTANCE_NO>/exe/hdb/dat_bin_dir/tf-ncc -d /<INSTALLATION_DIR>/<SID>/SYS/global/hdb/custom/config/lexicon/lang -o /<INSTALLATION_DIR>/<SID>/SYS/global/hdb/custom/config/lexicon/lang/ /home/root/customDict/mycustomdict.xml

After executing the above command, a compiled .nc file will be generated in the /<INSTALLATION_DIR>/<SID>/SYS/global/hdb/custom/config/lexicon/lang folder. The text analysis engine will use this file later.
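The compile command is long, and a single wrong path segment is enough to make it fail. As a rough sketch, the Python helper below assembles the LD_LIBRARY_PATH value and the tf-ncc argument list from the same placeholders used in the command above. The function name tf_ncc_invocation and the sample values "/usr/sap", "HDB" and "00" are hypothetical; substitute the values of your own installation.

```python
import posixpath


def tf_ncc_invocation(installation_dir, sid, instance_no, source_xml):
    """Return (env, argv) for compiling a custom dictionary with tf-ncc.

    The directory layout mirrors the paths shown in this post; it is
    not verified against any particular HANA revision.
    """
    root = posixpath.join(installation_dir, sid)
    exe_dir = posixpath.join(root, "SYS", "exe", "hdb")
    dat_bin_dir = posixpath.join(exe_dir, "dat_bin_dir")
    lang_dir = posixpath.join(
        root, "SYS", "global", "hdb", "custom", "config", "lexicon", "lang"
    )
    compiler = posixpath.join(
        root, "HDB" + instance_no, "exe", "hdb", "dat_bin_dir", "tf-ncc"
    )
    env = {"LD_LIBRARY_PATH": exe_dir + ":" + dat_bin_dir}
    # -d: directory with the language modules, -o: output location
    argv = [compiler, "-d", lang_dir, "-o", lang_dir + "/", source_xml]
    return env, argv


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    env, argv = tf_ncc_invocation(
        "/usr/sap", "HDB", "00", "/home/root/customDict/mycustomdict.xml"
    )
    print("export LD_LIBRARY_PATH=" + env["LD_LIBRARY_PATH"])
    print(" ".join(argv))
```

On the HANA host itself, the result could be passed to subprocess.run(argv, env={**os.environ, **env}, check=True) instead of being printed.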

3. Create custom HANA Text Analysis configuration file

After compiling the XML file, we need to create a custom text analysis configuration that refers to the compiled .nc file from the previous step. The configuration file specifies the text analysis processing steps to be performed and the options to use for each step.

In HANA Studio, create a workspace, then create and share a project. Under this project, create a new file with the extension “hdbtextconfig”. Copy into it the entire contents of one of the predefined configurations delivered by SAP, as mentioned above. They are located in the HANA repository package “sap.hana.ta.config”. For this scenario, I copied the contents of the configuration file “EXTRACTION_CORE_VOICEOFCUSTOMER”.

For details, see the section “Creating a Text Analysis Configuration” of the HANA Developer Guide for SPS07.

After copying, modify the “Dictionaries” node under the configuration node named “SAP.TextAnalysis.DocumentAnalysis.Extraction.ExtractionAnalyzer.TF” and add a <string-list-value> child node that refers to the compiled .nc file.
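If you maintain several configurations, the same change can be scripted. The Python sketch below adds a <string-list-value> child to the “Dictionaries” node of the extraction analyzer. Treat it as an illustration only: SAMPLE is a heavily simplified stand-in for a real .hdbtextconfig file (its root element name and the type attribute are assumptions), the function add_dictionary is my own, and “mycustomdict.nc” is a hypothetical name for the compiled dictionary.

```python
import xml.etree.ElementTree as ET

EXTRACTION_NODE = (
    "SAP.TextAnalysis.DocumentAnalysis.Extraction.ExtractionAnalyzer.TF"
)

# Simplified stand-in for a copied configuration (assumed layout).
SAMPLE = """<configuration-root>
  <configuration name="SAP.TextAnalysis.DocumentAnalysis.Extraction.ExtractionAnalyzer.TF">
    <property name="Dictionaries" type="string-list">
    </property>
  </configuration>
</configuration-root>"""


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def add_dictionary(config_root, dictionary_file):
    """Append a <string-list-value> entry to the Dictionaries property."""
    for configuration in config_root.iter():
        if local_name(configuration.tag) != "configuration":
            continue
        if configuration.get("name") != EXTRACTION_NODE:
            continue
        for prop in configuration:
            if local_name(prop.tag) == "property" and prop.get("name") == "Dictionaries":
                # Reuse the parent's namespace prefix, if it has one.
                namespace = prop.tag[: -len("property")]
                child = ET.SubElement(prop, namespace + "string-list-value")
                child.text = dictionary_file
                return child
    raise LookupError("Dictionaries property not found under " + EXTRACTION_NODE)


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    add_dictionary(root, "mycustomdict.nc")  # hypothetical file name
    print(ET.tostring(root, encoding="unicode"))
```

The helper raises LookupError instead of silently doing nothing, so a configuration that lacks the expected node is noticed before activation.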



Now save, commit and activate the .hdbtextconfig file. After activation, we can run the Text Analysis engine with the custom configuration. To run text analysis, execute the following SQL command, where <columnname> is the text column to analyze (“TEXT” in our table):

CREATE FULLTEXT INDEX <indexname> ON <tablename>(<columnname>) CONFIGURATION '<custom_configuration_file>' TEXT ANALYSIS ON;


The text analysis results are stored in a table named “$TA_<indexname>”. For our scenario table, the output of this table is:


As you can see, the Text Analysis engine has identified LCD, latch and mic as internal parts. These results can be used for data mining or analytical purposes.
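Several readers ran into syntax errors on the CREATE FULLTEXT INDEX statement, often because of typographic quotes copied from the page. The small Python sketch below builds the statement with plain ASCII quotes. It assumes, based on my understanding of the HANA SQL syntax, that the indexed column is given in parentheses, that TEXT ANALYSIS ON is needed for text analysis output, and that the results land in a table prefixed with $TA_. Please check these points against the SQL reference for your revision. The index, table and configuration names are hypothetical.

```python
def quote_identifier(name):
    """Wrap an identifier in double quotes, doubling embedded quotes.

    Quoted identifiers are case-sensitive, so the name must match the
    catalog entry exactly.
    """
    return '"' + name.replace('"', '""') + '"'


def create_fulltext_index_sql(index_name, table_name, column_name, configuration):
    """Build a CREATE FULLTEXT INDEX statement with text analysis enabled."""
    config_literal = "'" + configuration.replace("'", "''") + "'"
    return (
        "CREATE FULLTEXT INDEX " + quote_identifier(index_name)
        + " ON " + quote_identifier(table_name)
        + "(" + quote_identifier(column_name) + ")"
        + " CONFIGURATION " + config_literal
        + " TEXT ANALYSIS ON"
    )


def ta_table_name(index_name):
    """Name of the table that holds the text analysis results (assumed)."""
    return "$TA_" + index_name


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(create_fulltext_index_sql(
        "TWEETS_IDX", "USER_TWEETS", "TEXT", "mypackage::MY_TA_CONFIG"
    ))
    print(ta_table_name("TWEETS_IDX"))
```

Generating the statement this way keeps the quoting consistent; the resulting string can be pasted into the SQL console or sent through any database client.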

      Henrique Pinto


      Former Member
      Blog Post Author

      Thank you Henrique!

      Former Member

      Nice too! 🙂

      Former Member


      My question is: how can I add the names of some places to the dictionary while HANA still extracts the words that do not belong to those names?


      Former Member
      Blog Post Author

      Hi David,

      You can go through this guide:

      HANA will extract the words that don't belong to the names, but only those words will be shown for the categories you have created in the dictionary, in the columns TA_TYPE and TA_NORMALIZED.


      Vivek Singh Bhoj

      Nice blog Vishnu



      Rama Shankar

      Good blog - thanks!

      Rama Shankar

      Former Member

      Hi Vishnu - Nice blog and a quick question

      We are on SPS 6, Rev 69 (last one before SPS 7).

      Still, I was able to follow your blog; all the directories exist as you indicated.

      I reached the point of compiling the .xml file to a .nc file using tf-ncc. I tried various combinations, but the compiler kept giving me the error: File not found ./cgc1

      Do you know what this error means? Is this file available only in SPS 07?



      Former Member

      I was able to get past the previous error by copying the cgc1 file from Rev69 directory to the lang directory.

      The compile worked fine.

      However, the full text does not return any values

      I followed the steps:

      a. Created a Project, assigned to workspace (shared)

      b. In project explorer view Copied the file EXTRACTION_CORE_VOICEOFCUSTOMER

      c. Added the line referring to my dictionary

      d. Activated File (TD.hdbtextconfig)

      I see the activated object under the project

      In FullText search, I refer to 'TD' but no records generated

      Also, I copied the EXTRACTION_CORE_VOICEOFCUSTOMER, and made NO change.

      Activated File ---> No records created

      I am not able to commit the Project because Change Management is not Active in our system...

      Or do I need to do something else.

      Please help!!!


      We are on SP6 , Rev 69

      Former Member
      Blog Post Author

      Hi Abhi,

      When you copy the file 'EXTRACTION_CORE_VOICEOFCUSTOMER' to your folder, rename the copy and then make your changes. To run text analysis on the source table, execute the statement:

      CREATE FULLTEXT INDEX <indexname> ON <tablename>(<columnname>) CONFIGURATION '<custom_configuration_file>' TEXT ANALYSIS ON;


      In the above statement, for the 'custom_configuration_file' option, please provide the full path of your custom configuration file.



      John Appleby

      Very interesting. I've used the standard dictionaries and found them to be lacking, especially for product-specific stuff.

      How did you get on in reality with creating the dictionaries? How do you know what words to put in etc.?

      Would be very interesting if HANA were to be able to learn based on dictionaries.

      Former Member
      Blog Post Author

      Hi John,

      Thanks for your comment 🙂 . Which words to put in depends entirely on the use case. Here I have taken the parts of a product. Other scenarios may call for movie names, product names, part numbers, etc.



      Shaik Imtiyaz Shariff

      Good one Vishnu

      Former Member

      Vishnu - Thanks!  I was able to configure and implement custom dictionaries in SPS6

      I saw Jon's post. And I agree that there needs to be some intelligence in creating custom dictionaries. Here is something I would like to be able to do:

      a. Identify a handful of seed keywords

      b. Run fuzzy search with a score threshold. This will identify search lines resembling the seed keyword

      c. Extract tokens matching the seed token -----> This is the ????????

      d. Write a JAVA program to compile all the matching tokens

      e. Use an XML XSD generator to automatically generate the XML custom dictionary

      So, does anyone know how to identify which tokens were matched in the fuzzy search?



      Former Member

      Excellent Vishnu! I wanted something like this in my Social Media Analytics solution I am planning to develop 🙂


      Kunal Gandhi

      Kumar Mayuresh

      Excellent work Vishnu.

      Is it possible to do SPEECH RECOGNITION with SAP HANA and the spoken words should get stored in HANA tables ?

      Awaiting your response.



      Krishna Tangudu

      Hi Kumar,

      Did you look at this: Siri meets HANA


      Krishna Tangudu

      Kumar Mayuresh

      Thanks Krishna for sharing the link 🙂

      Tahir Hussain Babar

      What a great article! Do you have information on customising HANA's own dictionaries, rather than adding custom ones?

      Former Member

      Hi Vishnu

      A wonderful article!

      Can you please answer to my query posted in this thread:

      Issue with creation of custom dictionary configurations for text analysis over HANA Trial Instance

      I have been facing this trouble since a couple of days.



      Former Member

      I faced this issue @ compile time...

      please let me know the reason...


      There was no error in xml...

      Sai Giridhar Varanasi

      Hi Ashlin,

      I faced the same issue as you. The XML was totally fine, but I observed that writing

      <dictionary xmlns=""> in the second line solves the issue.


      Sai Giridhar Varanasi


      Rajaganapathi Rangdale Srinivasa Rao

      Hi Former Member ,


      I'm getting the same error when trying to use "uninflected", "uninflected_language" key word in the dictionary tag.

      Did you use the same? How did you overcome this issue? As mentioned below, I tried putting them on a separate line, but it didn't work. My XML itself is totally fine.

      Former Member

      It could be the quote character ("), as mentioned. Please look at Sai Giridhar Varanasi's comment; it should help.
      Former Member


      after I typed in following statement:


      I get the following error message:

      SAP DBTech JDBC: [257]: sql syntax error: incorrect syntax near "CONFIGURATION": line 1 col 1 (at pos 1)

      What can be the reason for that ?