SAP HANA TA – Text Analysis
This blog is for colleagues and knowledge workers who want an overview of SAP HANA Text Analysis (TA) or want to build on what they already know.
My aim is only to supplement and simplify some of the existing concepts. Because this is an expansive space, I have tried to cover only the key topics. I have divided this blog into two sections:
Section I – Introduction of SAP HANA TA and development of custom use case in Eclipse
Section II – Develop an application using the capabilities of SAP HANA TA and SAP UI5 in SAP WEB IDE
Section I – Introduction
Text Data Processing – NLP
Text Data Processing can automatically identify the language of the input text and select language-specific dictionaries and rules for analysis. It extracts entities such as people, dates, places, and organizations in each supported language.
It also looks for patterns, activities, events, and relationships among entities and enables their extraction. Extracting such information tells you what the text is about. This information can be used within applications for information management, data integration, and data quality; business intelligence; query, analytics, and reporting; search, navigation, and document and content management; among other usage scenarios.
With the SAP HANA platform, you can gain real insights from your unstructured textual data. The platform provides search, text analysis, and text mining functionality for unstructured text sources. Full natural-language processing capabilities support linguistic analysis and entity and relationship extraction for your enterprise in-memory data. In addition, statistical algorithms let you detect patterns in large document collections, including key term identification and document categorization.
Having a grasp of the essentials is difficult enough, but having access to everything you need to know is nearly impossible. Or is it? What if there was someone in your company who knew everything, someone you could ask anything, someone who could explain not just the what but also the why? Now imagine communicating with your ERP system the way you would with this all-knowing friend. Sounds like sci-fi?
It is possible with Text Analysis with SAP HANA.
By a commonly cited estimate, 80% of enterprise-relevant information originates in unstructured data. With SAP HANA you can access huge volumes of data, including unstructured text data from different sources. SAP HANA enables native full-text analysis, so you can search unstructured text much as you would use a search engine on the internet. It is more than a keyword search, though: TA powered by SAP HANA applies full linguistic and statistical techniques to improve the accuracy of the entities returned, so you can get the right answer even if the spelling is wrong.
Text Analysis with SAP HANA also gives structure to unstructured textual content. It extracts and processes unstructured text data from various file formats and classifies entities such as people, companies, dates, and times. You can also identify domain facts such as sentiments, topics, and requests.
Combine structured and unstructured data, query it, analyze it, and visualize it. Thanks to SAP HANA in-memory technology, you can do it in real time!
Welcome to the future. Forget the complex IT systems of the past: SAP HANA helps you simplify your IT landscape. Since both the database and analytics run on a single platform, there is no need to duplicate data anymore.
No more converting, pre-aggregating, or indexing, and you significantly reduce cost of ownership. With the SAP HANA information access toolkit, you can also create your own search apps for mobile devices or browser-based deployments and design any number of user interfaces to fit your needs.
So everything you want to know is right at your fingertips. Go beyond today's technological boundaries and make sci-fi real.
In less than a second you can find the most critical piece of information hidden deep in the noise of internal and external databases. You can analyze social media chatter and send personalized messages to your customers to offer the right product at the right time, or you can automate contact center replies to text and mail service requests. Unleash the full power of unstructured data: just ask your all-knowing friend.
The illustration below shows unstructured data.
Text Analysis with SAP HANA
Text Analysis powered by SAP HANA applies full linguistic and statistical techniques to extract and classify unstructured text into entities and domains. Additionally, SAP HANA provides native, full-text search capabilities on text.
Note – you may have to set the encoding to UTF-8.
Change the default encoding in Eclipse to UTF-8 (Window -> Preferences -> General -> Workspace), then delete the project from Eclipse and check it out again from the HANA repository. This worked for me.
To use your own text analysis extraction dictionaries and extraction rules, you need to create a custom text analysis configuration.
Text analysis is the most detailed level of document analysis available within SAP HANA. It uses both linguistic and semantic analysis to extract entities and facts from unstructured text and store them in index-specific tables. These tables are the primary focus of our discussion, comprising the first section below, in which we apply text analysis to the hotel reviews dataset to extract linguistic details and customer sentiment. We follow this in the second section with a look at the built-in SAP HANA configurations that define how text analysis fills the output tables. Finally, we briefly consider custom configurations, dictionaries, and rule sets and how they can be applied to common situations such as the hotel review dataset.
Creating a full-text index with parameter TEXT ANALYSIS ON triggers the creation of a table named $TA_<indexname> containing linguistic or semantic analysis results. Table $TA lives in the same schema as the source data and is the foundation for all text analysis modeling and reporting.
The set of columns in the table $TA is always the same regardless of the text analysis configuration used with the full-text index. These columns are described in more detail in Table 1 below.
|Column|Description|
|---|---|
|Key columns from source table|The first columns in table $TA are a direct copy of the key columns from the source table. They allow joins back to the source table to retrieve additional data, related attributes, original text, etc.|
|TA_RULE|The rule that created the output. Generally, this is either LXP for linguistic analysis or Entity Extraction for entity and fact analysis.|
|TA_COUNTER|A unique sequential ID for each token extracted from the document. The counter generally matches the order in which the tokens appear in the document.|
|TA_TOKEN|The term, entity, or fact extracted from the document. In linguistic analysis, this is the tokenized term before stemming and normalization. In entity and fact extraction, it is the segment of the original text identified as an entity or sub-entity.|
|TA_LANGUAGE|The language of the document.|
|TA_TYPE|The type of the token. In linguistic analysis, this is the part of speech. In semantic analysis, it is the entity type or fact. Example values: 'noun', 'conjunction', 'LOCALITY', 'PERSON', 'MinorProblem', 'StrongPositiveSentiment'|
|TA_NORMALIZED|The normalized version of the token. Inflection is maintained, but capitalization and diacritics are removed. This column is null for entity extraction.|
|TA_STEM|The stemmed version of the token. This field is fully uninflected and normalized. If the stem is identical to the token, this column is null. It is also null for entity extraction.|
|TA_PARAGRAPH|The paragraph in the document that contains the token.|
|TA_SENTENCE|The sentence in the document that contains the token.|
|TA_CREATED_AT|Creation time of the record.|
|TA_OFFSET|Character offset from the beginning of the document.|
|TA_PARENT|The TA_COUNTER value of the parent of this token. This is usually seen in fact extraction where a fact consists of multiple sub-entities. This field is null for linguistic analysis.|
Table 1 Set of Columns in Table $TA
The contents of the $TA table vary significantly depending on whether the full-text index was created using a linguistic configuration or an entity extraction configuration. To demonstrate this difference, we created two indexes on the same set of hotel reviews. The first, created with the SQL statement below, only performs linguistic analysis.
CREATE FULLTEXT INDEX LING_FULL_REVIEW_INDEX
ON TA.HOTEL_REVIEWS (REVIEW)
CONFIGURATION 'LINGANALYSIS_FULL'
TEXT ANALYSIS ON
We then examine the results, as shown in Figure 1, by selecting from the table TA."$TA_LING_FULL_REVIEW_INDEX". This is the most detailed level of linguistic analysis available, including normalization and stemming. These stems, along with the part-of-speech information in the TA_TYPE column, are the most distinctive aspect of linguistic analysis, as they aren't available in entity extraction.
Figure 1 Table $TA for Linguistic Analysis
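To make the role of TA_STEM and TA_TYPE more concrete, here is a minimal sketch of the kind of query one might run against the linguistic $TA table. It is written in Python with an in-memory SQLite table standing in for SAP HANA, purely so that it runs anywhere. The sample rows and the single key column ID are made up for illustration; only the TA_ column names come from Table 1.

```python
import sqlite3

# In-memory stand-in for the $TA table (SQLite is used only so the sketch runs anywhere).
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE "$TA_LING_FULL_REVIEW_INDEX" ('
    "ID INTEGER, TA_COUNTER INTEGER, TA_TOKEN TEXT, "
    "TA_TYPE TEXT, TA_NORMALIZED TEXT, TA_STEM TEXT)"
)

# Made-up sample rows: (review ID, counter, token, part of speech, normalized, stem).
rows = [
    (1, 1, "Rooms", "noun", "rooms", "room"),
    (1, 2, "were", "verb", "were", "be"),
    (1, 3, "clean", "adjective", "clean", None),  # stem equals token, so TA_STEM is null
    (2, 1, "room", "noun", "room", None),
    (2, 2, "had", "verb", "had", "have"),
    (2, 3, "pool", "noun", "pool", None),
]
conn.executemany(
    'INSERT INTO "$TA_LING_FULL_REVIEW_INDEX" VALUES (?, ?, ?, ?, ?, ?)', rows
)

# Count nouns by stem. TA_STEM is null when the stem matches the token,
# so fall back to TA_NORMALIZED in that case.
query = """
    SELECT COALESCE(TA_STEM, TA_NORMALIZED) AS term, COUNT(*) AS occurrences
    FROM "$TA_LING_FULL_REVIEW_INDEX"
    WHERE TA_TYPE = 'noun'
    GROUP BY COALESCE(TA_STEM, TA_NORMALIZED)
    ORDER BY occurrences DESC, term
"""
noun_counts = conn.execute(query).fetchall()
print(noun_counts)
```

With these sample rows, "Rooms" and "room" are counted under the same stem. The SELECT statement uses only standard SQL, so it should be adaptable to the real TA."$TA_LING_FULL_REVIEW_INDEX" table, though that has not been verified here.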
For the second index we used the voice-of-customer entity extraction configuration. However, because each column can only have one index at a time, we first had to drop the linguistic index by running the following code:
DROP FULLTEXT INDEX TA.LING_FULL_REVIEW_INDEX
Then we had to create the new index with:
CREATE FULLTEXT INDEX EXT_VOC_HOTEL_REVIEW
ON TA.HOTEL_REVIEWS (REVIEW)
CONFIGURATION 'EXTRACTION_CORE_VOICEOFCUSTOMER'
TEXT ANALYSIS ON
Now we access the results, shown in Figure 2, by querying the TA."$TA_EXT_VOC_HOTEL_REVIEW" table. We selected the voice-of-customer configuration not just because it is one of the most useful, but because it makes clear the stark difference between linguistic and semantic analysis. Here we see a mix of sentiments and sub-sentiments, the latter identifiable by looking for rows with TA_PARENT populated. We also see multi-term tokens in TA_TOKEN, which are usually the result of several consecutive terms or entities matching a single rule.
Figure 2 Table $TA for Entity and Fact Extraction
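The parent and sub-entity relationship described above can be sketched with a self-join on TA_PARENT. As before, this is only an illustration: SQLite stands in for SAP HANA, the key column ID and the sample rows are made up, and the type names are examples of what the voice-of-customer configuration might return.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE "$TA_EXT_VOC_HOTEL_REVIEW" ('
    "ID INTEGER, TA_COUNTER INTEGER, TA_TOKEN TEXT, "
    "TA_TYPE TEXT, TA_PARENT INTEGER)"
)

# Made-up sample rows: one parent fact with two sub-entities, plus a standalone entity.
rows = [
    (1, 1, "the pool was fantastic", "Sentiment", None),
    (1, 2, "pool", "Topic", 1),
    (1, 3, "fantastic", "StrongPositiveSentiment", 1),
    (1, 4, "Boston", "LOCALITY", None),
]
conn.executemany(
    'INSERT INTO "$TA_EXT_VOC_HOTEL_REVIEW" VALUES (?, ?, ?, ?, ?)', rows
)

# Self-join: TA_PARENT of a sub-entity holds the TA_COUNTER of its parent
# within the same source document (same key column value).
query = """
    SELECT p.TA_TOKEN, c.TA_TYPE, c.TA_TOKEN
    FROM "$TA_EXT_VOC_HOTEL_REVIEW" AS c
    JOIN "$TA_EXT_VOC_HOTEL_REVIEW" AS p
      ON p.ID = c.ID AND p.TA_COUNTER = c.TA_PARENT
    ORDER BY c.ID, c.TA_COUNTER
"""
facts = conn.execute(query).fetchall()
for parent_text, sub_type, sub_text in facts:
    print(parent_text, "|", sub_type, "|", sub_text)
```

Rows without a parent, such as the LOCALITY entity, drop out of the inner join, which leaves only the sub-entities paired with the fact they belong to.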
Text and Sentiment Analysis
<PERSON>My NAME</PERSON> is a technology consultant of<ORGANIZATION@INDUSTRY>SAP</ORGANIZATION@INDUSTRY >
As mentioned above, SAP HANA uses configuration files to define the behavior and output of the text analysis engine. These configurations can be user-defined, but SAP includes seven built-in configurations that meet most extraction requirements without the need for customization:
- LINGANALYSIS_BASIC
Enables the most basic linguistic analysis. This tokenizes the document, but does not perform normalization or stemming. The TA_TYPE field will not identify the part of speech, and the TA_NORMALIZED and TA_STEM columns will be empty.
- LINGANALYSIS_STEMS
Normalizes and stems the tokens. The TA_TYPE field will still not contain the part of speech, but the normalized and stemmed forms will be populated.
- LINGANALYSIS_FULL
Performs full linguistic analysis. In addition to the normalized and stemmed forms, the TA_TYPE column will be populated with parts of speech. This is the most detailed level of linguistic data available.
- EXTRACTION_CORE
Extracts basic entities from the text, including people, places, firms, URLs, and other common terms.
- EXTRACTION_CORE_VOICEOFCUSTOMER
Extracts additional entities and facts beyond the core configuration to support sentiment and request analysis. This configuration is essential because it identifies positive and negative emotions associated with tokens, allowing us to gauge opinion within a corpus toward particular topics.
- EXTRACTION_CORE_ENTERPRISE
Provides extraction for enterprise data, such as mergers, acquisitions, organizational changes, and product releases. This configuration focuses on businesses and professional organizations and is often used to monitor public references to partners or competitors within an industry.
- EXTRACTION_CORE_PUBLIC_SECTOR
Extracts security-related data about public persons, events, and organizations. This data is of limited use for general analysis.
A screenshot is shown below:
In the $TA table examples above, we saw the results of the two most generally useful of these configurations, LINGANALYSIS_FULL and EXTRACTION_CORE_VOICEOFCUSTOMER. The only additional built-in configuration we recommend using is EXTRACTION_CORE_ENTERPRISE, as it provides some useful information about professional organizations. The remaining configurations produce results that are either more limited or less generally applicable.
In addition to the built-in configurations, SAP HANA allows us to create custom configurations, dictionaries, and rule files.
Creating a custom configuration requires familiarity with SAP HANA development workspaces, repositories, and shared projects. A detailed explanation of these topics is beyond the scope of this work, but more information is available in the SAP HANA Developer Guide for SAP HANA Studio (http://help.sap.com/hana/SAP_HANA_Search_Developer_Guide_en.pdf).
Custom configurations let us tweak some parameters affecting extraction, but more importantly they expose the ability to use custom dictionaries and rule sets.
Custom Dictionaries and Rule Sets
In addition to custom configurations, SAP HANA supports custom dictionaries and rule sets. These files are considerably more complicated than configurations, and a deep dive into their syntax is beyond the scope of this document. Instead, we will generally discuss their purposes and most common uses.
The built-in dictionaries included with SAP HANA are quite robust for most languages. As a result, custom dictionaries are most often used to expand the built-in dictionaries with enterprise-specific entities. Broadly, we define these new entities in a hierarchy:
- Entity category
The category (e.g., ORGANIZATION) to which the entity belongs. We may optionally specify one level of subcategory.
- Entity standard form
The complete, preferred form of the entity. For example, SAP HANA text analysis. SAP HANA allows wildcards in the standard and variant forms for broader matching.
- Variant forms
Alternate forms that map to the standard form and thus to the entity category. For example, HANA text analysis or SAP text analysis. Variant forms can also be auto-generated by SAP HANA or by custom variant generation rules.
Using this hierarchy, we could define a business name as a specific entity to disambiguate it from other organizations or people. Or, if an enterprise uses different terminology in internal and external documents, those terms can be mapped to the same entity and considered as a whole.
Within our hotel reviews, we could leverage custom dictionaries to recategorize tokens in a more useful way. For example, we could create an AMENITY entity type with members “pool,” “fitness,” “room service,” and “Internet.” Or we could change “Starwood” and “Hyatt” from ORGANIZATION to COMPETITOR.
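The category, standard form, and variant hierarchy can be sketched as a small Python helper that writes dictionary XML. Treat it as an assumption-laden illustration: the element names (dictionary, entity_category, entity_name, variant) and the namespace reflect my understanding of the SAP text analysis dictionary format and should be checked against the Extraction Customization Guide before use. The helper name build_dictionary and the variants "swimming pool", "wifi", and "wi-fi" are invented for this example.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace used by SAP text analysis dictionary files.
TA_NS = "http://www.sap.com/ta/4.0"


def build_dictionary(categories):
    """Build dictionary XML from {category: {standard_form: [variants]}}."""
    root = ET.Element("dictionary", {"xmlns": TA_NS})
    for category, entities in categories.items():
        cat_el = ET.SubElement(root, "entity_category", {"name": category})
        for standard_form, variants in entities.items():
            ent_el = ET.SubElement(
                cat_el, "entity_name", {"standard_form": standard_form}
            )
            for variant in variants:
                ET.SubElement(ent_el, "variant", {"name": variant})
    return ET.tostring(root, encoding="unicode")


# Hypothetical hotel-review dictionary: an AMENITY type and a COMPETITOR type.
xml_text = build_dictionary(
    {
        "AMENITY": {"pool": ["swimming pool"], "Internet": ["wifi", "wi-fi"]},
        "COMPETITOR": {"Starwood": [], "Hyatt": []},
    }
)
print(xml_text)
```

The nested dictionary mirrors the hierarchy directly: each variant maps to a standard form, and each standard form belongs to exactly one entity category.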
Custom Rule Sets
Custom rule sets are the most complex of the customization files. They use the custom language CGUL, which leverages tokens, linguistic attributes, entities, and regular expressions to match patterns in the text. These patterns provide semantic value by identifying relationships between tokens.
For example, we could identify public bankruptcies by looking for organizational entities followed by a verb followed by a form of the word “bankrupt.”
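The logic of that bankruptcy pattern can be sketched in plain Python. This is not CGUL and says nothing about CGUL syntax; it only illustrates the matching idea over a sequence of already-typed tokens. The function name, the token representation, and the sample data are all hypothetical.

```python
from typing import List, Tuple

# (text, type), where type is an entity type or a part of speech.
Token = Tuple[str, str]


def find_bankruptcies(tokens: List[Token]) -> List[str]:
    """Return organizations followed by a verb and then a form of 'bankrupt'."""
    hits = []
    for i in range(len(tokens) - 2):
        (org, org_type), (_, verb_type), (word, _) = tokens[i : i + 3]
        if (
            org_type == "ORGANIZATION"
            and verb_type == "verb"
            and word.lower().startswith("bankrupt")
        ):
            hits.append(org)
    return hits


# Made-up token stream for two short clauses.
sample = [
    ("Acme Corp", "ORGANIZATION"), ("declared", "verb"), ("bankruptcy", "noun"),
    ("while", "conjunction"),
    ("Globex", "ORGANIZATION"), ("opened", "verb"), ("stores", "noun"),
]
print(find_bankruptcies(sample))  # ['Acme Corp']
```

A real rule set would express the same idea declaratively and could also tolerate intervening words, which this fixed three-token window does not.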
CGUL is powerful but complex, and it is recommended to carefully read the SAP HANA Text Analysis Extraction Customization Guide (http://help.sap.com/saphelp_hanaplatform/helpdata/en/20/31dfe5e9754d0fb09b5ca24fd0329f/frameset.htm) for details on syntax and usage before attempting any custom rules. A more detailed discussion of this topic is outside the scope of this blog.
I explain some of the key concepts with the illustration below.
In the hdbtextconfig file, you can define the attributes as shown below:
The hdbtextrule file is shown below:
I created a custom dictionary (hdbtextdict) for my Olympics example.
The overall structure looks as highlighted below:
After completing all the steps, activate all the objects.
The next steps are shown below:
The schema structure is shown below:
View of Tables
If this count is 0, indexing is not enabled or is not working.
All of the above was implemented in Eclipse (SAP HANA Development).
Section II – Application Development in SAP Web IDE
In this section, I explain the steps carried out to develop a sample Survey app in SAP Web IDE using SAP UI5.
I used version 170330 of SAP Web IDE.
SAP HANA Web-based Development Workbench:
Editor – used to create, edit, execute, debug, and manage HANA repository artifacts
Catalog – used to create, edit, execute, and manage HANA DB SQL catalog artifacts
services.xsodata – the OData service where the HANA views are exposed
The metadata output is shown below:
Output in JSON format for one of the entities:
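For readers who want to reproduce the metadata and JSON outputs, the following sketch builds the two request URLs using standard OData conventions ($metadata, $format=json, $top). The host, package path, and entity set name are hypothetical placeholders, not the ones used in this app.

```python
from urllib.parse import urlencode


def odata_urls(base_url: str, service_path: str, entity_set: str, top: int = 5):
    """Return (metadata URL, JSON entity-set URL) for an OData service."""
    service = f"{base_url.rstrip('/')}/{service_path.strip('/')}"
    metadata_url = f"{service}/$metadata"
    # Keep the "$" of OData system query options unescaped for readability.
    query = urlencode({"$format": "json", "$top": top}, safe="$")
    entity_url = f"{service}/{entity_set}?{query}"
    return metadata_url, entity_url


# Hypothetical host, package path, and entity set.
metadata_url, entity_url = odata_urls(
    "https://hana.example.com:4300", "/survey/services.xsodata", "Sentiments"
)
print(metadata_url)
print(entity_url)
```

Opening the first URL in a browser would return the service metadata, and the second would return the first few rows of one entity set as JSON, assuming the service is reachable and the user is authorized.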
A simple use case demo of the app with SAP HANA TA sentiment analysis:
The app can be extended to include a digital assistant option, as shown below. After the user speaks, the comments automatically appear in the search bar:
Hope you find this useful.
Dilip Mamidela , SAP SDC India , BLR