SAP HANA Text Analysis

Esha1 · ‎02-01-2018

To explain the core concepts of SAP HANA text capabilities, we have divided the topic into two blogs:

“SAP HANA FULL TEXT SEARCH”, which covers EXACT, LINGUISTIC and FUZZY text search: https://blogs.sap.com/2018/02/01/sap-hana-text-capabilities/

“TEXT ANALYSIS”, which covers text analysis and the configurations available for it.

This second blog covers TEXT ANALYSIS.

Text Analysis

Text analysis is the process of analyzing unstructured data, extracting relevant information, and transforming that information into structured information that can be leveraged in different ways.

Its capabilities for extracting salient information from text range from basic tokenization and stemming to complex semantic analysis in the form of entity and fact extraction.

The extracted information can be used within applications for information management, data integration, and data quality analysis; business intelligence; query, analytics, and reporting; and search, navigation, document management, and content management.

Listed below are some of the use cases of Text Analysis:

Review Sites: A common use case is capturing the sentiment in reviews written by moviegoers and rating the movies based on that analysis.

Recruitment: Analyze a candidate's suitability for a job by running text analysis on resumes and matching the results against the keywords of the job requirement.

Customer Ticket Routing: Companies use text analytics to route requests to customer service representatives who handle tickets.

Text Analysis powered by SAP HANA uses the preprocessor server, which applies full linguistic and statistical techniques to extract and classify unstructured text into entities and domains. Figure 1 shows the architecture of the SAP HANA text capabilities.

Figure 1: SAP HANA Architecture Text Capabilities

SAP HANA stores the information in index-specific tables. Creating a full-text index with parameter TEXT ANALYSIS ON triggers the creation of a table named $TA_<indexname> containing linguistic or semantic analysis results.

Language modules use the following language processing technologies:

Linguistic analysis to handle natural language processing

Entity extraction to handle named entity extraction

Fact extraction to handle sentiment analysis, public sector events, and enterprise facts

Grammatical role analysis to handle functional syntactic roles in the sentence, such as subject or object. This is applied only to English.

The following types of text analysis configuration are available:

Linguistic Analysis

Basic (LINGANALYSIS_BASIC)

Stems (LINGANALYSIS_STEMS)

Full (LINGANALYSIS_FULL)

Entity and Fact Extraction

Core (EXTRACTION_CORE)

Core Voice of Customer (EXTRACTION_CORE_VOICEOFCUSTOMER)

Core Enterprise (EXTRACTION_CORE_ENTERPRISE)

Core Public Sector (EXTRACTION_CORE_PUBLIC_SECTOR)

Grammatical Role Analysis

Grammatical Role Identification (GRAMMATICAL_ROLE_ANALYSIS)

***********************************************************************

The first form of Linguistic Analysis is Basic.

This form tokenizes the document but does not perform stemming. The TA_TYPE column does not identify the part of speech, and the TA_STEM column is empty.

-- Source table holding the text to analyze
CREATE COLUMN TABLE "<schema_name>"."LING_BASIC"
( ID INTEGER PRIMARY KEY,
STRING NVARCHAR(200) );

INSERT INTO "<schema_name>"."LING_BASIC" VALUES (1, 'Bob likes working at SAP');
INSERT INTO "<schema_name>"."LING_BASIC" VALUES (2, 'Bob dislikes soccer');
INSERT INTO "<schema_name>"."LING_BASIC" VALUES (3, 'Bob really likes football');

-- The full-text index triggers text analysis and creates $TA_LING_BASIC_INDEX
CREATE FULLTEXT INDEX LING_BASIC_INDEX ON "<schema_name>"."LING_BASIC" ("STRING")
CONFIGURATION 'LINGANALYSIS_BASIC'
TEXT ANALYSIS ON;

-- Inspect the analysis results
SELECT * FROM "<schema_name>"."$TA_LING_BASIC_INDEX";

Figure 2 shows LXP as the rule in the TA_RULE column, along with the generated tokens.

Figure 2: Linguistic Analysis – Basic Form
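
Because SELECT * returns every column of the generated table, it can help to project only the columns discussed here. The following is a minimal sketch, not part of the original example. It assumes the usual $TA column layout (the source table's key column ID plus TA_COUNTER, TA_TOKEN, TA_TYPE, TA_STEM and TA_RULE) and that the placeholder <schema_name> has been replaced with a real schema:

```sql
-- Sketch: list the tokens of each row in analysis order.
-- Assumption: TA_COUNTER numbers the tokens within each source row.
SELECT ID, TA_COUNTER, TA_TOKEN, TA_TYPE, TA_STEM, TA_RULE
  FROM "<schema_name>"."$TA_LING_BASIC_INDEX"
 ORDER BY ID, TA_COUNTER;
```

With the Basic configuration, the TA_STEM column in this result should be empty, as described above.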

*************************************************************************************

The second form of Linguistic Analysis is Stems.

This form normalizes the tokens and stems them to obtain their base (dictionary) forms. The TA_TYPE column still does not contain the part of speech, but the normalized and stemmed forms are populated.

-- Source table holding the text to analyze
CREATE COLUMN TABLE "<schema_name>"."LING_STEMS"
( ID INTEGER PRIMARY KEY,
STRING NVARCHAR(200) );

INSERT INTO "<schema_name>"."LING_STEMS" VALUES (1, 'Bob likes working at SAP');
INSERT INTO "<schema_name>"."LING_STEMS" VALUES (2, 'Bob dislikes soccer');
INSERT INTO "<schema_name>"."LING_STEMS" VALUES (3, 'Bob really likes football');

-- The full-text index triggers text analysis and creates $TA_LING_STEMS_INDEX
CREATE FULLTEXT INDEX LING_STEMS_INDEX ON "<schema_name>"."LING_STEMS" ("STRING")
CONFIGURATION 'LINGANALYSIS_STEMS'
TEXT ANALYSIS ON;

-- Inspect the analysis results
SELECT * FROM "<schema_name>"."$TA_LING_STEMS_INDEX";

Figure 3 shows LXP as the rule in the TA_RULE column, along with the generated tokens and their stem information.

Figure 3: Linguistic Analysis – Stemming
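
Once stems are available, one simple use is a term-frequency count that treats inflected forms such as "likes" and "like" as the same word. The query below is a sketch under stated assumptions: it assumes the TA_STEM and TA_NORMALIZED columns of the standard $TA layout, and it assumes TA_STEM may be empty for some tokens, which is why it falls back to the normalized form and then to the raw token.

```sql
-- Sketch: count occurrences per stem across all rows.
-- COALESCE picks the first non-NULL value, so tokens without a stem
-- are counted under their normalized form or, failing that, the raw token.
SELECT COALESCE(TA_STEM, TA_NORMALIZED, TA_TOKEN) AS TERM,
       COUNT(*) AS OCCURRENCES
  FROM "<schema_name>"."$TA_LING_STEMS_INDEX"
 GROUP BY COALESCE(TA_STEM, TA_NORMALIZED, TA_TOKEN)
 ORDER BY OCCURRENCES DESC, TERM;
```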

************************************************************************************

The third form of Linguistic Analysis is Full.

This form supports tagging, segmentation, and stemming. In addition to the normalized and stemmed forms, the TA_TYPE column is populated with parts of speech. This is the most detailed level of linguistic data available.

-- Source table holding the text to analyze
CREATE COLUMN TABLE "<schema_name>"."LING_FULL"
( ID INTEGER PRIMARY KEY,
STRING NVARCHAR(200) );

INSERT INTO "<schema_name>"."LING_FULL" VALUES (1, 'Bob likes working at SAP');
INSERT INTO "<schema_name>"."LING_FULL" VALUES (2, 'Bob dislikes soccer');
INSERT INTO "<schema_name>"."LING_FULL" VALUES (3, 'Bob really likes football');

-- The full-text index triggers text analysis and creates $TA_LING_FULL_INDEX
CREATE FULLTEXT INDEX LING_FULL_INDEX ON "<schema_name>"."LING_FULL" ("STRING")
CONFIGURATION 'LINGANALYSIS_FULL'
TEXT ANALYSIS ON;

-- Inspect the analysis results
SELECT * FROM "<schema_name>"."$TA_LING_FULL_INDEX";

Figure 4 shows LXP as the rule in the TA_RULE column, along with the detailed linguistic data.

Figure 4: Linguistic Analysis – Full
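
Since the Full configuration fills TA_TYPE with the part of speech, a quick way to see which tags were assigned is to aggregate on that column. This is a minimal sketch that only relies on the TA_TYPE column mentioned above; the actual tag names depend on the language module and are not assumed here.

```sql
-- Sketch: how many tokens were tagged with each part of speech.
SELECT TA_TYPE, COUNT(*) AS TOKEN_COUNT
  FROM "<schema_name>"."$TA_LING_FULL_INDEX"
 GROUP BY TA_TYPE
 ORDER BY TOKEN_COUNT DESC, TA_TYPE;
```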

***********************************************************************************

The first form of Entity and Fact Extraction is Core.

This form extracts basic entities of interest from the text, including people, places, firms, URLs, and other common terms.

-- Source table holding the text to analyze
CREATE COLUMN TABLE "<schema_name>"."EXT_CORE"
( ID INTEGER PRIMARY KEY,
STRING NVARCHAR(200) );

INSERT INTO "<schema_name>"."EXT_CORE" VALUES (1, 'Bob likes working at SAP');
INSERT INTO "<schema_name>"."EXT_CORE" VALUES (2, 'Bob dislikes soccer');
INSERT INTO "<schema_name>"."EXT_CORE" VALUES (3, 'Bob really likes football');

-- The full-text index triggers text analysis and creates $TA_EXT_CORE_INDEX
CREATE FULLTEXT INDEX EXT_CORE_INDEX ON "<schema_name>"."EXT_CORE" ("STRING")
CONFIGURATION 'EXTRACTION_CORE'
TEXT ANALYSIS ON;

-- Inspect the analysis results
SELECT * FROM "<schema_name>"."$TA_EXT_CORE_INDEX";

Figure 5 shows Entity Extraction as the rule in the TA_RULE column, along with the basic entities that were found.

Figure 5: Entity Extraction – Core
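
To work with the extracted entities rather than browse the whole table, the result can be filtered on the rule and listed per source row. The following sketch assumes that the rule is stored in TA_RULE as the literal string 'Entity Extraction' (as the figure description suggests) and that the $TA table exposes a TA_OFFSET column giving the position of the match in the text:

```sql
-- Sketch: list each extracted entity with its type, in text order per row.
SELECT ID, TA_TYPE, TA_TOKEN
  FROM "<schema_name>"."$TA_EXT_CORE_INDEX"
 WHERE TA_RULE = 'Entity Extraction'
 ORDER BY ID, TA_OFFSET;
```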

**************************************************************************************

The second form of Entity and Fact Extraction is Core Voice of Customer.

This form extracts additional entities and facts beyond the core configuration to support sentiment and request analysis. Sentiment analysis is the process of using rules to retrieve specific information about customers' sentiments and requests when processing and analyzing text. The same rules also retrieve emoticons and profanities.

-- Source table holding the text to analyze
CREATE COLUMN TABLE "<schema_name>"."EXT_CORE_VOC"
( ID INTEGER PRIMARY KEY,
STRING NVARCHAR(200) );

INSERT INTO "<schema_name>"."EXT_CORE_VOC" VALUES (1, 'Bob likes working at SAP');
INSERT INTO "<schema_name>"."EXT_CORE_VOC" VALUES (2, 'Bob dislikes soccer');
INSERT INTO "<schema_name>"."EXT_CORE_VOC" VALUES (3, 'Bob really likes football');

-- The full-text index triggers text analysis and creates $TA_EXT_CORE_VOC_INDEX
CREATE FULLTEXT INDEX EXT_CORE_VOC_INDEX ON "<schema_name>"."EXT_CORE_VOC" ("STRING")
CONFIGURATION 'EXTRACTION_CORE_VOICEOFCUSTOMER'
TEXT ANALYSIS ON;

-- Inspect the analysis results
SELECT * FROM "<schema_name>"."$TA_EXT_CORE_VOC_INDEX";

Figure 6 shows Entity Extraction as the rule in the TA_RULE column, along with the additional entities and facts available from this configuration.

Figure 6: Entity Extraction – Core Voice of Customer
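
For a use case such as the review ratings mentioned earlier, the sentiment annotations are usually the rows of interest. The sketch below first lists the entity types that actually occur and then filters on them. The assumption that the sentiment types contain the word "Sentiment" in TA_TYPE is mine and should be checked against the output of the first query before relying on the second.

```sql
-- Step 1 (sketch): see which entity and fact types were produced.
SELECT DISTINCT TA_TYPE
  FROM "<schema_name>"."$TA_EXT_CORE_VOC_INDEX"
 ORDER BY TA_TYPE;

-- Step 2 (sketch): keep only sentiment annotations.
-- Assumption: sentiment type names contain 'Sentiment'; adjust the
-- pattern to match the values returned by step 1.
SELECT ID, TA_TYPE, TA_TOKEN
  FROM "<schema_name>"."$TA_EXT_CORE_VOC_INDEX"
 WHERE TA_TYPE LIKE '%Sentiment%'
 ORDER BY ID, TA_COUNTER;
```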

*************************************************************************************

The third form of Entity and Fact Extraction is Core Public Sector.

Public Sector fact extraction obtains facts related to the public sector, such as action and travel events; military units; person aliases, appearance, attributes, and relationships; spatial references; and domain-specific entities.

-- Source table holding the text to analyze
CREATE COLUMN TABLE "<schema>"."EXT_CORE_PUBLIC"
( ID INTEGER PRIMARY KEY,
STRING NVARCHAR(200) );

INSERT INTO "<schema>"."EXT_CORE_PUBLIC" VALUES (1, 'Bob likes working at SAP');
INSERT INTO "<schema>"."EXT_CORE_PUBLIC" VALUES (2, 'Bob dislikes soccer');
INSERT INTO "<schema>"."EXT_CORE_PUBLIC" VALUES (3, 'Bob really likes football');
INSERT INTO "<schema>"."EXT_CORE_PUBLIC" VALUES (4, 'YouTube division announced yesterday that it had acquired Directr');
INSERT INTO "<schema>"."EXT_CORE_PUBLIC" VALUES (5, 'The local soccer federation hired John as acting director');

-- The full-text index triggers text analysis and creates $TA_EXT_CORE_PUBLIC_INDEX
CREATE FULLTEXT INDEX EXT_CORE_PUBLIC_INDEX ON "<schema>"."EXT_CORE_PUBLIC" ("STRING")
CONFIGURATION 'EXTRACTION_CORE_PUBLIC_SECTOR'
TEXT ANALYSIS ON;

-- Inspect the analysis results
SELECT * FROM "<schema>"."$TA_EXT_CORE_PUBLIC_INDEX";

As an illustration, consider a sentence similar to the one inserted as row 5 and the hire event that is extracted from it:

Original:

The local soccer federation hired John Brown as acting director.

Extracted:

[Action_Hire_Active][Agent]The local soccer federation[/Agent] hired [Patient]John Brown[/Patient] as acting director.[/Action_Hire_Active]

Figure 7 shows Entity Extraction as the rule in the TA_RULE column, along with the additional entities and facts available from this configuration.

Figure 7: Entity Extraction – Public Sector

*********************************************************************************

The fourth form of Entity and Fact Extraction is Core Enterprise.

Enterprise fact extraction obtains facts that indicate relationships and events relevant to an enterprise, such as membership information, management changes, product releases, mergers and acquisitions, and organizational information.

This configuration focuses on businesses and professional organizations and is often used to monitor public references to partners or competitors within an industry.

-- Source table holding the text to analyze
CREATE COLUMN TABLE "<schema>"."EXT_CORE_ENTERPRISE"
( ID INTEGER PRIMARY KEY,
STRING NVARCHAR(200) );

INSERT INTO "<schema>"."EXT_CORE_ENTERPRISE" VALUES (1, 'Bob likes working at SAP');
INSERT INTO "<schema>"."EXT_CORE_ENTERPRISE" VALUES (2, 'Bob dislikes soccer');
INSERT INTO "<schema>"."EXT_CORE_ENTERPRISE" VALUES (3, 'Bob really likes football');
INSERT INTO "<schema>"."EXT_CORE_ENTERPRISE" VALUES (4, 'YouTube division announced yesterday that it had acquired Directr');
INSERT INTO "<schema>"."EXT_CORE_ENTERPRISE" VALUES (5, 'The local soccer federation hired John as acting director');

-- The full-text index triggers text analysis and creates $TA_EXT_CORE_ENTERPRISE_INDEX
CREATE FULLTEXT INDEX EXT_CORE_ENTERPRISE_INDEX ON "<schema>"."EXT_CORE_ENTERPRISE" ("STRING")
CONFIGURATION 'EXTRACTION_CORE_ENTERPRISE'
TEXT ANALYSIS ON;

-- Inspect the analysis results
SELECT * FROM "<schema>"."$TA_EXT_CORE_ENTERPRISE_INDEX";

As an illustration, the sentence inserted as row 4 yields a buy event:

Original:

YouTube division announced yesterday that it had acquired Directr.

Extracted:

[BuyEvent][OrganizationA]YouTube[/OrganizationA] division announced yesterday that it had [Action]acquired[/Action] [OrganizationB]Directr[/OrganizationB][/BuyEvent].

Figure 8 shows Entity Extraction as the rule in the TA_RULE column, along with the additional entities and facts available from this configuration.

Figure 8: Entity Extraction – Enterprise
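
A fact such as the buy event above consists of an enclosing annotation and the parts nested inside it. One way to reassemble that structure is a self-join of the $TA table. This is a sketch under an assumption that is not stated in this blog: that the TA_PARENT column of a nested annotation holds the TA_COUNTER value of its enclosing annotation within the same source row. The aliases f and p and the output column names are illustrative only.

```sql
-- Sketch: pair each fact (f) with the parts (p) nested inside it.
-- Assumption: p.TA_PARENT refers to f.TA_COUNTER for the same row and rule.
SELECT f.ID,
       f.TA_TYPE  AS FACT_TYPE,
       f.TA_TOKEN AS FACT_TEXT,
       p.TA_TYPE  AS PART_TYPE,
       p.TA_TOKEN AS PART_TEXT
  FROM "<schema>"."$TA_EXT_CORE_ENTERPRISE_INDEX" AS f
  JOIN "<schema>"."$TA_EXT_CORE_ENTERPRISE_INDEX" AS p
    ON p.ID = f.ID
   AND p.TA_RULE = f.TA_RULE
   AND p.TA_PARENT = f.TA_COUNTER
 ORDER BY f.ID, f.TA_COUNTER, p.TA_COUNTER;
```

If the assumption holds, the row for the YouTube sentence would list the buy event once per nested part, which makes it straightforward to load facts into a regular reporting table.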

***********************************************************************************

Another form is Grammatical Role Analysis. This configuration identifies syntactic (grammatical) relationships between the elements of an input sentence, for example subject, verb, and direct object.

CREATE COLUMN TABLE "<schema>"."GRAMMATICAL_ROLE"
( ID INTEGER PRIMARY KEY,
STRING NVARCHAR(200) );

INSERT INTO "<schema>"."GRAMMATICAL_ROLE" VALUES (1, 'Bob likes working at SAP');
INSERT INTO "<schema>"."GRAMMATICAL_ROLE" VALUES (2, 'Bob dislikes soccer');
INSERT INTO "<schema>"."GRAMMATICAL_ROLE" VALUES (3, 'Bob really likes football');
INSERT INTO "<schema>"."GRAMMATICAL_ROLE" VALUES (4, 'YouTube division announced yesterday that it had acquired Directr');

CREATE FULLTEXT INDEX "<schema>"."GRAMMATICAL_ROLE_INDEX" ON "<schema>"."GRAMMATICAL_ROLE" ("STRING")
CONFIGURATION 'GRAMMATICAL_ROLE_ANALYSIS'
TEXT ANALYSIS ON;

SELECT * FROM "<schema>"."$TA_GRAMMATICAL_ROLE_INDEX" ORDER BY ID;

Figure 9 shows both Entity Extraction and Grammatical Role as rules in the TA_RULE column, along with the functional relationships between elements that this configuration adds.

Figure 9: Grammatical Role Analysis

***********************************************************************************

To summarize, text analysis is the process of analyzing unstructured data, extracting relevant information, and transforming that information into structured information that can be leveraged in different ways, depending on the configuration used.

For details on the full-text index, see the blog [https://blogs.sap.com/2018/02/15/sap-hana-full-text-index/].

To create and use custom dictionaries, see the blog [https://blogs.sap.com/2018/03/15/custom-dictionaries-sap-hana-text-analysis/].
