
     We have a Chinese version of this document.

     In projects based on SAP HANA text analysis, we often need to recognize specific names, products, and similar terms correctly, but the SAP HANA segmentation engine may not recognize new words correctly. Suppose we need to recognize and extract the words "上网卡", "上海中学", and "乔布斯" correctly.

     First, as before, we create the table SEGMENTATION_TEST and add a full-text index on the CONTENT column:


CREATE COLUMN TABLE "TEST"."SEGMENTATION_TEST" (
  "URL" VARCHAR(200),
  "CONTENT" NCLOB,
  "LANGU" VARCHAR(10),
  PRIMARY KEY ("URL")
);
CREATE FULLTEXT INDEX FT_INDEX
ON SEGMENTATION_TEST(CONTENT) TEXT ANALYSIS
ON CONFIGURATION 'LINGANALYSIS_FULL'
LANGUAGE COLUMN "LANGU";


     Then we insert the words "上网卡", "上海中学", and "乔布斯" into the table SEGMENTATION_TEST:


INSERT INTO "TEST"."SEGMENTATION_TEST"(URL,CONTENT,LANGU)
VALUES('XXX.XXX.XXX','上网卡','zh');
INSERT INTO "TEST"."SEGMENTATION_TEST"(URL,CONTENT,LANGU)
VALUES('XXX.XXX.XXX2','上海中学','zh');
INSERT INTO "TEST"."SEGMENTATION_TEST"(URL,CONTENT,LANGU)
VALUES('XXX.XXX.XXX3','乔布斯','zh');

    Then we query the generated table $TA_FT_INDEX; the results are shown below:

[Image: 1_481144.png — $TA_FT_INDEX query results with LINGANALYSIS_FULL]
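The generated $TA table can be inspected with a plain SELECT. A minimal sketch, assuming the standard $TA column layout (TA_TOKEN, TA_TYPE, TA_NORMALIZED, TA_COUNTER):

```sql
-- List the tokens the segmentation engine produced for each document,
-- in the order they appear in the text
SELECT "URL", "TA_TOKEN", "TA_TYPE", "TA_NORMALIZED"
FROM "TEST"."$TA_FT_INDEX"
ORDER BY "URL", "TA_COUNTER";
```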

     We can see that none of the three words is recognized correctly by the SAP HANA segmentation engine.

      To solve this problem, SAP HANA provides a custom dictionary. Words that are missing from the default dictionary can be added to the custom dictionary so that they are recognized correctly.

     The Chinese custom dictionary file of SAP HANA is simplified-chinese-std.sample-cd; its path is /usr/sap/XXX/SYS/global/hdb/custom/config/lexicon/lang/simplified-chinese-std.sample-cd, where XXX is your HANA instance name. The file already contains some examples:


<?xml encoding="euc-cn" ?>
<!-- Copyright 2013 SAP AG. All rights reserved.
SAP and the SAP logo are registered trademarks of SAP AG in Germany and other countries. Business Objects and the Business Objects logo are registered trademarks of Business Objects S.A., which is an SAP company.
-->
<!-- Sample tagger-lexicon client dictionary -->
<explicit-pair-list>
<!-- Common Nouns -->
<item key="海缆" analysis="海缆[Nn]"></item>
<item key="船艏" analysis="船艏[Nn]"></item>
<!-- Proper Names -->
<item key="张忠谋" analysis="张忠谋[Nn-Prop]"></item>
<item key="奇摩" analysis="奇摩[Nn-Prop]"></item>
</explicit-pair-list>

     SAP HANA supports two types of nouns: we add the identifier [Nn-Prop] to tag a word as a proper name, and [Nn] to tag it as a common noun. The final file after adding the three words is shown below:


<?xml encoding="euc-cn" ?>
<!-- Copyright 2013 SAP AG. All rights reserved.
SAP and the SAP logo are registered trademarks of SAP AG in Germany and other countries. Business Objects and the Business Objects logo are registered trademarks of Business Objects S.A., which is an SAP company.
-->
<!-- Sample tagger-lexicon client dictionary -->
<explicit-pair-list>
<!-- Common Nouns -->
<item key="海缆" analysis="海缆[Nn]"></item>
<item key="船艏" analysis="船艏[Nn]"></item>
<item key="上网卡" analysis="上网卡[Nn]"></item>
<item key="上海中学" analysis="上海中学[Nn]"></item>
<!-- Proper Names -->
<item key="张忠谋" analysis="张忠谋[Nn-Prop]"></item>
<item key="奇摩" analysis="奇摩[Nn-Prop]"></item>
<item key="乔布斯" analysis="乔布斯[Nn-Prop]"></item>
</explicit-pair-list>

     Then we truncate the table SEGMENTATION_TEST and insert the three words again; the results are shown below:
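The truncate-and-reload step described above could be run like this (same values as in the earlier INSERT statements):

```sql
-- Clear the table and re-insert the three test words
TRUNCATE TABLE "TEST"."SEGMENTATION_TEST";
INSERT INTO "TEST"."SEGMENTATION_TEST"(URL,CONTENT,LANGU)
VALUES('XXX.XXX.XXX','上网卡','zh');
INSERT INTO "TEST"."SEGMENTATION_TEST"(URL,CONTENT,LANGU)
VALUES('XXX.XXX.XXX2','上海中学','zh');
INSERT INTO "TEST"."SEGMENTATION_TEST"(URL,CONTENT,LANGU)
VALUES('XXX.XXX.XXX3','乔布斯','zh');
```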

[Image: 2_481145.png — $TA_FT_INDEX results after adding the custom dictionary entries]

     We can see that "上网卡", "上海中学", and "乔布斯" are now recognized correctly.

     Now we use the configuration EXTRACTION_CORE instead of LINGANALYSIS_FULL. As explained in the previous blog, EXTRACTION_CORE can identify entities such as names and groups. We first drop the previous full-text index and then create a new one with the configuration EXTRACTION_CORE:


DROP FULLTEXT INDEX "TEST"."FT_INDEX";
CREATE FULLTEXT INDEX FT_INDEX
ON SEGMENTATION_TEST(CONTENT) TEXT ANALYSIS
ON CONFIGURATION 'EXTRACTION_CORE'
LANGUAGE COLUMN "LANGU";

    Then we insert the following words and query the results:


INSERT INTO "TEST"."SEGMENTATION_TEST"(URL,CONTENT,LANGU)
VALUES('XXX.XXX.XXX3','乔布斯','zh');
INSERT INTO "TEST"."SEGMENTATION_TEST"(URL,CONTENT,LANGU)
VALUES('XXX.XXX.XXX4','海缆','zh');
INSERT INTO "TEST"."SEGMENTATION_TEST"(URL,CONTENT,LANGU)
VALUES('XXX.XXX.XXX5','张忠谋','zh');

[Image: 3_481146.png — $TA_FT_INDEX results with EXTRACTION_CORE]

   

     As shown in the picture above, the words "乔布斯" and "张忠谋" are recognized as personal names, while "海缆" is recognized as NOUN_GROUP. The reason is the identifier we assigned to each word: "乔布斯" and "张忠谋" carry [Nn-Prop], so they are tagged as personal names, while "海缆" carries [Nn], so it is tagged as NOUN_GROUP.
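To verify the assigned entity types without the screenshot, one could query the $TA table directly (TA_TOKEN and TA_TYPE are standard $TA columns; the exact type names returned depend on the extraction configuration):

```sql
-- Show each extracted token with the entity type assigned by EXTRACTION_CORE
SELECT "TA_TOKEN", "TA_TYPE"
FROM "TEST"."$TA_FT_INDEX"
ORDER BY "TA_TOKEN";
```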
