Skip to Content

     在一些基于SAP HANA的文本分析应用中,我们经常需要对特定的人名、产品等名词进行正确的识别。SAP HANA自带的分词组件可能无法正确识别新词,例如我们现在需要对“上网卡”、“上海中学”和“乔布斯”三个名词进行准确识别和提取。我们使用上篇blog建好的全文索引表SEGMENTATION_TEST进行测试:

    


CREATE COLUMN TABLE "TEST"."SEGMENTATION_TEST" (
  "URL" VARCHAR(200),
  "CONTENT" NCLOB,
  "LANGU" VARCHAR(10),
  PRIMARY KEY ("URL")
);
CREATE FULLTEXT INDEX FT_INDEX
ON SEGMENTATION_TEST(CONTENT) TEXT ANALYSIS
ON CONFIGURATION 'LINGANALYSIS_FULL'
LANGUAGE COLUMN "LANGU";
INSERT INTO "TEST"."SEGMENTATION_TEST"(URL,CONTENT,LANGU)
VALUES('XXX.XXX.XXX','上网卡','zh');
INSERT INTO "TEST"."SEGMENTATION_TEST"(URL,CONTENT,LANGU)
VALUES('XXX.XXX.XXX2','上海中学','zh');
INSERT INTO "TEST"."SEGMENTATION_TEST"(URL,CONTENT,LANGU)
VALUES('XXX.XXX.XXX3','乔布斯','zh');

     查询结果如下:

/wp-content/uploads/2014/03/1_411631.png

可以看出HANA没有能够正确的分出我们想要的结果。

    面对上述问题,我们可以通过使用SAP HANA提供的自定义分词词典来解决。

SAP HANA用户自定义词典文件为simplified-chinese-std.sample-cd,目录为\usr\sap\XXX\SYS\global\hdb\custom\config\lexicon\lang\simplified-chinese-std.sample-cd其中XXX是你SAP HANA的实例名。文件示例内容如下所示:


<?xml encoding="euc-cn" ?>
<!--?Copyright 2013 SAP AG. All rights reserved.
SAP and the SAP logo are registered trademarks of SAP AG in Germany and other countries. Business Objects and the Business Objects logo are registered trademarks of Business Objects S.A., which is an SAP company.
-->
<!-- Sample tagger-lexicon client dictionary -->
<explicit-pair-list>
<!-- Common Nouns -->
<item key = "海缆"       analysis = "海缆[Nn]"></item>         
<item key = "船艏"       analysis = "船艏[Nn]"></item>
<!-- Proper Names -->
<item key="张忠谋"                 analysis = "张忠谋[Nn-Prop]"></item>
<item key="奇摩"             analysis = "奇摩[Nn-Prop]"></item>
</explicit-pair-list>

     文件中支持两种类型的自定义词典,普通名词和人名,我们只需在自定义的人名analysis属性中加入[Nn-Prop]标识即可,然后将其它类型名词analysis属性加入[Nn]标识。最终我们更改过后的文件如下所示:


<?xml encoding="euc-cn" ?>
<!--?Copyright 2013 SAP AG. All rights reserved.
SAP and the SAP logo are registered trademarks of SAP AG in Germany and other countries. Business Objects and the Business Objects logo are registered trademarks of Business Objects S.A., which is an SAP company.
-->
<!-- Sample tagger-lexicon client dictionary -->
<explicit-pair-list>
<!-- Common Nouns -->
<item key = "海缆"       analysis = "海缆[Nn]"></item>         
<item key = "船艏"       analysis = "船艏[Nn]"></item>
<item key = "上网卡"       analysis = "上网卡[Nn]"></item>
<item key = "上海中学"       analysis = "上海中学[Nn]"></item>
<!-- Proper Names -->
<item key="张忠谋"                 analysis = "张忠谋[Nn-Prop]"></item>
<item key="奇摩"             analysis = "奇摩[Nn-Prop]"></item>
<item key="乔布斯"                 analysis = "乔布斯[Nn-Prop]"></item>
</explicit-pair-list>

     接着我们继续插入同样的数据,结果如下所示:

/wp-content/uploads/2014/03/2_411638.png

如上图所示,“上网卡”、“上海中学”和“乔布斯”已经正确的提取出来了。

         之前我们使用的全文索引配置是LINGANALYSIS_FULL,如果我们使用EXTRACTION_CORE,即提取实体名,如组织、场所和人名

等。


CREATE FULLTEXT INDEX FT_INDEX
ON SEGMENTATION_TEST(CONTENT) TEXT ANALYSIS
ON CONFIGURATION 'EXTRACTION_CORE'
LANGUAGE COLUMN "LANGU";
  然后插入数据:
INSERT INTO "TEST"."SEGMENTATION_TEST"(URL,CONTENT,LANGU)
VALUES('XXX.XXX.XXX3','乔布斯','zh');
INSERT INTO "TEST"."SEGMENTATION_TEST"(URL,CONTENT,LANGU)
VALUES('XXX.XXX.XXX4','海缆','zh');
INSERT INTO "TEST"."SEGMENTATION_TEST"(URL,CONTENT,LANGU)
VALUES('XXX.XXX.XXX5','张忠谋','zh');

查询结果如下:

/wp-content/uploads/2014/03/3_411639.png


     如上图所示,名词“乔布斯”和“张忠谋”在定义自定义词典时,我们加了限定属性[Nn-Prop]标识为人名,所以提取出来的类型TA_TYPEPERSON。而名词“海缆”则是普通名词,所以提取出来的类型TA_TYPE为普通名词组NOUN_GROUP

本文的测试案例所使用的SAP HANA版本为SAP HANA SP07 Revision 70.00

想获取更多SAP HANA学习资料或有任何疑问,请关注新浪微博@HANAGeek!我们欢迎你的加入!

To report this post you need to login first.

Be the first to leave a comment

You must be Logged on to comment or reply to a post.

Leave a Reply