Custom Dictionaries – SAP HANA Text Analysis
In this blog, I’ll discuss how to create custom dictionaries in SAP HANA. To implement certain custom use cases, customers have to create their own dictionaries for Text Analysis.
Use Case: A company has a number of products in its portfolio that are not completely covered by the standard configurations. In such a case, it can create a custom dictionary with an entity category Product and add all the product names in the portfolio.
SAP HANA is shipped with several predefined, standard text analysis configurations. These configurations are available in the “sap.hana.ta.config” repository package, as shown in Figure 1. For more details on the standard Text Analysis configurations, refer to the blog [ https://blogs.sap.com/2018/02/01/sap-hana-text-analysis-3/ ].
Figure 1: Standard Text Analysis Configurations
Steps involved in the implementation of Custom dictionaries
- Create a Custom dictionary
- Update the Configuration File or create a Custom Configuration File specifying Custom Dictionary
***********************************************************************************
Step 1: Create Custom Dictionary
A dictionary contains a number of user-defined entity types, each of which contains any number of entities of standard and variant types. In simple terms, a dictionary stores name variations in a structured manner so that they are accessible to the extraction process. Dictionaries are language-independent and can be created for all 34 supported languages.
Dictionary files must be in XML format and follow the syntax below:
<?xml version="1.0" encoding="UTF-8" ?>
<dictionary xmlns="http://www.sap.com/ta/4.0">
  <entity_category name="Category_Name">
    <entity_name standard_form="Entity_Name">
      <variant name="Variant_Name"/>
    </entity_name>
  </entity_category>
  ...
</dictionary>
Three parameters need to be specified when creating the dictionary:
- Category name
- Standard form of an entity: This is the complete or precise form of a given entity
- Variant names for an entity: These are less standard forms of a given entity
Figure 2 below shows the custom dictionary created for performing Custom Text Analysis.
Figure 2: Custom Dictionary File
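As an illustration of the syntax above, here is a minimal sketch of a dictionary for the product use case. The category name PRODUCT and all product and variant names are hypothetical placeholders, not the entries of the dictionary shown in Figure 2.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<dictionary xmlns="http://www.sap.com/ta/4.0">
  <!-- Hypothetical entity category for the company's product portfolio -->
  <entity_category name="PRODUCT">
    <!-- standard_form holds the complete, precise product name (made up here) -->
    <entity_name standard_form="Acme Widget Pro 2000">
      <!-- variants are the less standard names the product may appear under -->
      <variant name="Widget Pro"/>
      <variant name="AWP 2000"/>
    </entity_name>
    <entity_name standard_form="Acme Gadget Lite">
      <variant name="Gadget Lite"/>
    </entity_name>
  </entity_category>
</dictionary>
```

With a dictionary along these lines, a mention of "Widget Pro" in the analyzed text should be reported under the PRODUCT category once the dictionary is referenced from the configuration file.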
***********************************************************************************
Step 2: Update the Configuration File or create a Custom Configuration File specifying Custom Dictionary
Custom Text Analysis configurations can be used to perform custom text analysis using custom text analysis dictionaries and extraction rule sets. Create your own custom text analysis configuration files with the “.hdbtextconfig” file extension. Configuration files are also in XML format.
Below is a piece of code that shows the sequence of Text Analysis steps in XML format.
<configuration name="…AggregateAnalyzer.Aggregator">
  <property name="Analyzers" type="string-list">
    <string-list-value>…FormatConversionAnalyzer.FC</string-list-value>
    <string-list-value>…StructureAnalyzer.SA</string-list-value>
    <string-list-value>…LinguisticAnalyzer.LX</string-list-value>
    <string-list-value>…ExtractionAnalyzer.TF</string-list-value>
    <string-list-value>….GrammaticalRoleAnalyzer.GRA</string-list-value>
  </property>
</configuration>
In this configuration section, the following analyzers are available:
- “FormatConversionAnalyzer” is used for performing document conversion
- “StructureAnalyzer” is used for de-tagging and language detection. It performs mark-up removal, whitespace normalization and language detection
- “LinguisticAnalyzer” is used to perform linguistic analysis, which includes tokenization, identification of word base forms (stems) and part-of-speech tagging
- “ExtractionAnalyzer” is optional and is used for entity/relation extraction
- “GrammaticalRoleAnalyzer” is also optional and is used to identify functional relationships between elements
In our example, the custom text analysis configuration is managed within the SAP HANA repository. Figure 3 below highlights the property sections that enable the custom dictionaries and include the custom dictionary path.
Figure 3: Custom Configuration File
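Since Figure 3 is a screenshot, the dictionary reference can be sketched in text form as follows. This is an illustration under assumptions rather than the exact content of the figure: the configuration name keeps the same elided “…” prefix used in the analyzer listing above, the file name Cust_Dictionary is hypothetical, and the “.hdbtextdict” extension and package::file notation are assumed.

```xml
<!-- Sketch only: the "…" prefix is left elided, as in the analyzer listing above -->
<configuration name="…ExtractionAnalyzer.TF">
  <!-- "Dictionaries" lists the custom dictionaries used during entity extraction -->
  <property name="Dictionaries" type="string-list">
    <!-- Hypothetical file name; the package::file notation is an assumption -->
    <string-list-value>sap.hana.ta.dict::Cust_Dictionary.hdbtextdict</string-list-value>
  </property>
</configuration>
```

Each additional custom dictionary would presumably be listed as a further string-list-value entry in the same property.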
The dictionary is created in the “sap.hana.ta.dict” repository package and the Text Analysis configuration is created in the “sap.hana.ta.config” repository package, as seen in Figure 4 below.
Figure 4: Repository Path
*************************************************************************************
This custom configuration is used to extract basic entities from the text as well as entities of interest, including people, places, firms, URLs, and other common terms.
-- Sample table holding the text to analyze
CREATE COLUMN TABLE "EXT_CORE"
( ID INTEGER PRIMARY KEY,
STRING NVARCHAR(200) );
INSERT INTO "EXT_CORE" VALUES (1, 'Ruby likes working at SAP');
INSERT INTO "EXT_CORE" VALUES (2, 'Rohan dislikes soccer');
INSERT INTO "EXT_CORE" VALUES (3, 'Rohan really likes football');
INSERT INTO "EXT_CORE" VALUES (4, 'Australia won 74 Gold in Commonwealth Games India');
INSERT INTO "EXT_CORE" VALUES (5, 'India won 38 Gold in Games 2010');
-- Full-text index that runs text analysis with the custom configuration
CREATE FULLTEXT INDEX EXT_CORE_INDEX ON "EXT_CORE" ("STRING")
CONFIGURATION 'sap.hana.ta.config::Cust_Extraction_Core'
TEXT ANALYSIS ON;
-- The text analysis results are written to the generated $TA_ table
SELECT * FROM "$TA_EXT_CORE_INDEX";
Figure 5 below shows the result: the TA_RULE column contains Entity Extraction, and the new category names are listed alongside the available basic entities.
Figure 5: Custom Configuration – Entity Extraction
In summary, we covered detailed steps on how to create and implement custom dictionaries in SAP HANA for performing Text Analysis in certain custom use cases.
I hope that this is the right place for my comment.
The first configuration option, "EnableCustomDictionaries", shown in Figure 3 as part of the LinguisticAnalyzer's section, is not related to the subject of this blog entry. Users can create customized additions to linguistic analysis steps such as stemming, determining part of speech, and finding multi-word tokens and abbreviations by putting them in relevant files with the suffix "-cd" (for example "english-std.multiword-cd"). It is meant for rather advanced tinkering, and I'm not even sure at this moment how it can be accomplished in HANA. Setting "EnableCustomDictionaries" to "true" or "false" tells the Linguistic Analyzer whether to use those files, but it will not affect custom entity extraction, which is the topic discussed here. Seeing it mentioned in this context, I realize now that the name of that option can be confusing, unfortunately.
It is the other option, "Dictionaries", that you need to use. That one is part of the Extraction Analyzer's configuration.
Hi Tomasz,
I have updated the comment in the screenshot.
An option is available for enabling the custom dictionaries. In this case, the extraction analyzer example is used to explain the topic.
Regards, Esha
Hi Esha,
Thanks for your blogs. They're helping me a lot in my journey with the HANA database.
I'm trying to create custom dictionaries for text analysis in my HANA DB instance, just like you did in this blog, but I'm using HANA services in the Cloud Foundry environment on SAP Cloud Platform and I don't know how to deploy the custom dictionary I created. Do you have any idea how I can do that via the HANA database explorer?
Regards, Tairon
Thanks for the blog. It helped!