Hi everyone, Happy Lunar New Year!

It’s now Chinese New Year, or Spring Festival, the most important festival in China. We usually spend the holiday with family and friends, much like Christmas in many other countries. At the beginning of the new year, people traditionally text greetings to relatives, friends and colleagues, e.g., 新年快乐 (Happy new year) or 恭喜发财 (roughly “Congratulations on your prosperity” or “Wishing you prosperity”). I’ve sent and received lots of greetings. Since SAP HANA has a text analysis feature, an idea came to my mind: can SAP HANA extract and analyze these greetings? 😕 I ran some tests, and out of the box it can’t. The good news is that we can customize SAP HANA text analysis so that it recognizes these greetings. In this post, I’ll show you how. We really want some greetings from SAP HANA. 😉

Something interesting… Ram or Sheep or Goat?

Before we start the technical stuff, I wanna show you something interesting… Yesterday when I opened Chrome, the following Google doodle appeared. 😆 Oh, yeah, it’s the year of the ram (see Chinese zodiac for more details). Google also celebrated Chinese New Year with us, haha. Unfortunately, the doodle is gone now. If you want to see it again, you can visit “Lunar New Year 2015” and “Google Doodle Rings in Chinese Lunar New Year”. The interesting thing is that there’s a debate about which animal it really is: ram, sheep or goat? If you’re interested, have a look at “Whatever Floats Your Goat: The 2015 Lunar New Year Animal Is Up For Debate” (Code Switch, NPR).

1.PNG

Greetings from SAP HANA – Customizing EXTRACTION_CORE

In this section, we want to get some greetings from SAP HANA. First of all, let’s see what happens without any customization. NOTE: All tests were run on SAP HANA SPS 09 Rev. 91.


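-- Start from a clean schema (skip the DROP on the first run; it fails if schema TA does not exist yet).
DROP SCHEMA TA CASCADE;
CREATE SCHEMA TA;
SET SCHEMA TA;
-- Source table: the text to analyze plus a language code per row.
CREATE COLUMN TABLE TA_TABLE (
  ID INTEGER PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
  CONTENT NVARCHAR(200),
  LANG NVARCHAR(2)
);
INSERT INTO TA_TABLE (CONTENT, LANG) VALUES ('新年快乐', 'ZH');
INSERT INTO TA_TABLE (CONTENT, LANG) VALUES ('恭喜发财', 'ZH');
-- Full-text index with text analysis; the results land in the table "$TA_TA_INDEX".
CREATE FULLTEXT INDEX TA_INDEX ON TA_TABLE (CONTENT)
CONFIGURATION 'EXTRACTION_CORE'
LANGUAGE COLUMN LANG
TEXT ANALYSIS ON;
SELECT * FROM "$TA_TA_INDEX";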











As you can see from the SQL above, we first created the source table with a language column and inserted two greetings, 新年快乐 (Happy new year) and 恭喜发财 (roughly “Wishing you prosperity”). Then we created a full-text index on the CONTENT column of the source table with the configuration “EXTRACTION_CORE” (for all configurations, see Text Analysis – SAP HANA Text Analysis Developer Guide – SAP Library) to trigger the text analysis. But the text analysis result is empty (for the structure of the $TA table, see Structure of the $TA Table – SAP HANA Text Analysis Developer Guide – SAP Library), since there is no “GREETING” among the predefined entity types.

2.PNG

So what can we do now? The solution is in the SAP HANA Text Analysis Extraction Customization Guide – SAP Library, which shows you how to customize the text analysis extraction. Want a video? Here you go: SAP HANA Academy – Text Analysis: 10. Custom Dictionaries – YouTube. Now let’s do it ourselves.

Step 1: Create an XS project and share it with the repository.

/wp-content/uploads/2015/02/3_1_650216.png

Step 2: Create the .hdbtextdict file. In this file you define your own TA_TYPE (entity_category) and TA_TOKEN (entity_name). We defined “GREETING” as the entity_category, and “新年快乐” and “恭喜发财” as entity_name entries.

/wp-content/uploads/2015/02/3_2_650217.png

GREETING.hdbtextdict


<?xml version="1.0" encoding="UTF-8"?>
<dictionary xmlns="http://www.sap.com/ta/4.0">
  <entity_category name="GREETING">
    <entity_name standard_form="新年快乐">
    </entity_name>
    <entity_name standard_form="恭喜发财">
    </entity_name>
  </entity_category>
</dictionary>










Step 3: Create the .hdbtextconfig file. This file needs to include the .hdbtextdict file from step 2. Here we first copy the content of SAP HANA’s standard EXTRACTION_CORE (sap.hana.ta.config::EXTRACTION_CORE) and then add our custom dictionary to it.

/wp-content/uploads/2015/02/3_3_650218.png

EXTRACTION_CORE_CUSTOM.hdbtextconfig


...
    <!-- List of repository objects containing Text Analysis extraction dictionaries. -->
    <property name="Dictionaries" type="string-list">
        <string-list-value>TACustom::GREETING.hdbtextdict</string-list-value>
    </property>
...









Step 4: Create our full text index using the custom .hdbtextconfig in step 3.


DROP FULLTEXT INDEX TA_INDEX;
CREATE FULLTEXT INDEX TA_INDEX ON TA_TABLE (CONTENT)
CONFIGURATION 'TACustom::EXTRACTION_CORE_CUSTOM'
LANGUAGE COLUMN LANG
TEXT ANALYSIS ON;
SELECT * FROM "$TA_TA_INDEX";









Now we can receive greetings from SAP HANA. 🙂

/wp-content/uploads/2015/02/4_1_650222.png
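
To look only at the custom entities, you can filter the $TA table by the new type and join it back to the source table. This is a minimal sketch: it assumes the table and index names used above, and that the $TA table carries the source table’s key column ID next to the standard TA_* columns. The column aliases are only illustrative.

```sql
-- List each source row together with the greeting that text analysis found in it.
-- Assumes TA_TABLE and TA_INDEX from the steps above; TA_TYPE holds the
-- entity_category name from the custom dictionary.
SELECT T.ID,
       T.CONTENT,
       A.TA_TOKEN      AS GREETING_TEXT,
       A.TA_NORMALIZED AS GREETING_NORMALIZED
FROM TA_TABLE AS T
INNER JOIN "$TA_TA_INDEX" AS A
        ON A.ID = T.ID
WHERE A.TA_TYPE = 'GREETING'
ORDER BY T.ID, A.TA_COUNTER;
```

Rows without a recognized greeting simply drop out of the result because of the inner join.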

There are many more greetings for Chinese New Year, so you can add as many entries to your .hdbtextdict file as you want.

Sentiment from SAP HANA – Customizing EXTRACTION_CORE_VOICEOFCUSTOMER

Now that we’ve got greetings from SAP HANA, what about sentiment analysis of these greetings? Out of curiosity, I tried sentiment analysis in both English and Chinese. Let’s have a look.


INSERT INTO TA_TABLE (CONTENT, LANG) VALUES ('Happy new year', 'EN');
INSERT INTO TA_TABLE (CONTENT, LANG) VALUES ('Congratulations for prosperity', 'EN');
SELECT * FROM TA_TABLE;








5.PNG

As you can see, I first added two greetings in English: “新年快乐” means “Happy new year”, while “恭喜发财” means “Congratulations for prosperity”. Then we run the text analysis with the configuration “EXTRACTION_CORE_VOICEOFCUSTOMER”, which detects the voice of the customer.


DROP FULLTEXT INDEX TA_INDEX;
CREATE FULLTEXT INDEX TA_INDEX ON TA_TABLE (CONTENT)
CONFIGURATION 'EXTRACTION_CORE_VOICEOFCUSTOMER'
LANGUAGE COLUMN LANG
TEXT ANALYSIS ON;
SELECT * FROM "$TA_TA_INDEX";







From the text analysis result, we can see that SAP HANA succeeded in extracting sentiment from both English greetings but failed to detect any in the Chinese ones. SAP HANA also extracts what exactly the sentiment is (red box) and the topic of the sentiment (blue box). SAP HANA SPS 09 brings an improvement here: the TA_PARENT column. With TA_PARENT, you can easily link a sentiment to its topic.

  • For “Happy new year”, “Happy” is a weak positive sentiment. So happy for what? For “new year”.
  • For “Congratulations for prosperity”, “Congratulations” is also a weak positive sentiment. So congratulations for what? For “prosperity”.
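
The pairing described above can be sketched as a self-join on the $TA table. This is only a sketch under two assumptions that are not confirmed in this post: that the sentiment word and its topic share the same TA_PARENT value (the TA_COUNTER of the enclosing sentiment entry), and that topics are reported with the type 'Topic'. Check the TA_TYPE values in your own result before relying on it.

```sql
-- Pair every sentiment token with the topic that belongs to the same parent entry.
-- Assumption: siblings share TA_PARENT, and topics have TA_TYPE = 'Topic'.
SELECT S.ID,
       S.TA_TYPE  AS SENTIMENT_TYPE,
       S.TA_TOKEN AS SENTIMENT_TEXT,
       P.TA_TOKEN AS TOPIC_TEXT
FROM "$TA_TA_INDEX" AS S
INNER JOIN "$TA_TA_INDEX" AS P
        ON P.ID = S.ID                 -- same source row
       AND P.TA_PARENT = S.TA_PARENT   -- same parent entry
WHERE P.TA_TYPE = 'Topic'
  AND S.TA_TYPE <> 'Topic'
ORDER BY S.ID, S.TA_COUNTER;
```

The join includes ID because TA_COUNTER values are only meaningful within one source row.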

/wp-content/uploads/2015/02/6_1_650231.png

Sentiment analysis seems to work well for English, so what about Chinese? Don’t worry: just as we customized EXTRACTION_CORE in the previous section, we can also customize the configuration EXTRACTION_CORE_VOICEOFCUSTOMER. Since Chinese is a non-whitespace language, we need to follow Sentiment Analysis Customization in Nonwhitespace Languages. What we want to extract is similar to the English case:

  • For “新年快乐”, “新年” means “new year” and “快乐” means “happy”.
  • For “恭喜发财”, “恭喜” means “congratulations” and “发财” means “prosperity” or something like that.

So, we can simply map these characters to sentiments and topics. Let’s give it a shot.

Step 1: Create the .hdbtextdict file. The XML structure is identical to the previous GREETING.hdbtextdict. However, for sentiment analysis we can only use the following five entity categories:

  • CustomTopic
  • CustomPositive
  • CustomNegative
  • CustomNeutral
  • CustomProblem

/wp-content/uploads/2015/02/7_2_650332.png

GREETING_VOC.hdbtextdict


<?xml version="1.0" encoding="UTF-8"?>
<dictionary xmlns="http://www.sap.com/ta/4.0">
  <entity_category name="CustomTopic">
    <entity_name standard_form="新年">
    </entity_name>
    <entity_name standard_form="发财">
    </entity_name>
  </entity_category>
  <entity_category name="CustomPositive">
    <entity_name standard_form="快乐">
    </entity_name>
    <entity_name standard_form="恭喜">
    </entity_name>
  </entity_category>
</dictionary>



Step 2: Create the .hdbtextconfig file. As with the previous .hdbtextconfig file, we first copy the content of SAP HANA’s standard EXTRACTION_CORE_VOICEOFCUSTOMER (sap.hana.ta.config::EXTRACTION_CORE_VOICEOFCUSTOMER) and then add the custom dictionary from step 1.

/wp-content/uploads/2015/02/7_3_650342.png

EXTRACTION_CORE_VOC_CUSTOM.hdbtextconfig


...
    <!-- List of Text Analysis extraction dictionaries for Sentiment Analysis. -->
    <property name="Dictionaries" type="string-list">
...
        <string-list-value>TACustom::GREETING_VOC.hdbtextdict</string-list-value>
    </property>
...



Step 3: Create our full text index using the custom .hdbtextconfig in step 2.


DROP FULLTEXT INDEX TA_INDEX;
CREATE FULLTEXT INDEX TA_INDEX ON TA_TABLE (CONTENT)
CONFIGURATION 'TACustom::EXTRACTION_CORE_VOC_CUSTOM'
LANGUAGE COLUMN LANG
TEXT ANALYSIS ON;
SELECT * FROM "$TA_TA_INDEX";



/wp-content/uploads/2015/02/8_1_650343.png

Although “CustomTopic” and “CustomPositive” now appear in the TA_TYPE column, the result is only partially correct and not satisfactory. What we want are well-defined topics and sentiments, just like the English result. So why is the result different from English? We only defined a custom dictionary, but sentiment analysis also depends on other factors such as the extraction rules, and most Chinese greetings do not match those rules as written. For simplicity, we can add one character to each Chinese greeting (很, “very”, and 你, “you”) so that it matches the extraction rules without changing its meaning.


INSERT INTO TA_TABLE (CONTENT, LANG) VALUES ('新年很快乐', 'ZH');
INSERT INTO TA_TABLE (CONTENT, LANG) VALUES ('恭喜你发财', 'ZH');
SELECT * FROM "$TA_TA_INDEX";



/wp-content/uploads/2015/02/9_1_650344.png

Now we have the same result as for the English greetings. SAP HANA detected sentiment in both Chinese greetings, and each greeting has a topic and a positive sentiment in TA_TYPE. That’s it!
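
To compare the two languages at a glance, the $TA rows can be grouped by the language of the source row. This is a minimal sketch that reuses the table and index names from this post and assumes that the $TA table exposes the source key column ID; the alias ENTITY_COUNT is only illustrative.

```sql
-- Count the extracted entities per language and entity type.
-- The language is taken from the LANG column of the source table.
SELECT T.LANG,
       A.TA_TYPE,
       COUNT(*) AS ENTITY_COUNT
FROM TA_TABLE AS T
INNER JOIN "$TA_TA_INDEX" AS A
        ON A.ID = T.ID
GROUP BY T.LANG, A.TA_TYPE
ORDER BY T.LANG, A.TA_TYPE;
```

If the customization works as described, the 'ZH' rows should show topic and sentiment types next to the 'EN' rows, although the exact type names depend on your configuration.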

Conclusion

In this post, we customized two text analysis configurations in SAP HANA: EXTRACTION_CORE and EXTRACTION_CORE_VOICEOFCUSTOMER. By customizing the text analysis extraction, we received greetings and positive sentiments from SAP HANA for Chinese New Year.


Hope you enjoyed reading my blog. Happy Chinese new year and wish you prosperity! 新年快乐,恭喜发财! 🙂

/wp-content/uploads/2015/02/happynewyear_650445.gif

/wp-content/uploads/2015/02/gongxi_650446.gif

Image source
