Skip to Content

This is a blog on some of the options that are available in text analysis. Shore descriptions with example codes.

 

Text Analysis is the process of analyzing unstructured text, extracting relevant information and then transforming that information into structured information that can be leveraged in different ways.

 

Full Text Indexing:

When dealing with a small number of documents, it is possible for the full-text-search engine to directly scan the contents of the documents with each query, a strategy called “serial scanning.” This is what some rudimentary tools, such as grep, do when searching. However, when the number of documents to search is potentially large, the problem of full-text search is often divided into two tasks: indexing and searching. The indexing stage will scan the text of all the documents and build a list of search terms (often called an index). In the search stage, when performing a specific query, only the index is referenced, rather than the text of the original documents. The indexer will make an entry in the index for each term or word found in a document, and possibly note its relative position within the document. Conceptually, full-text indexes support searching on columns in the same way that indexes support searching through books.

 

CREATING A FULL TEXT INDEX:

 

  CREATE FULLTEXT INDEX "nameofindex" On <SCHEMA_NAME>."<table>"("<column>")
  TEXT ANALYSIS ON
  CONFIGURATION {'EXTRACTION_CORE_VOICEOFCUSTOMER' || 'EXTRACTION_CORE' || 'LINGANALYSIS_BASIC' || 'LINGANALYSIS_STEMS' ||   
   'LINGANALYSIS_FULL'};

 

                                                                  Configurations:

 

5 Predefined configurations. GROUPED INTO 3 categories.

LINGUISTIC ANALYSIS CONFIGURATIONS:

      Linguistic analysis

  • Segmentation–the separation of input text into its elements
  • Stemming–the identification of word stems, or dictionary forms
  • Tagging–the labeling of words’ parts of speech (POS)

 

  1. LINGANALYSIS_BASIC : Segmentation (tokenization).
  2. LINGANALYSIS_STEMS : BASIC + Stemming (identifying words in dictionary based form —  like Work is the stem of working/worked/works ).
  3. LINGANALYSIS_FULL : BASIC + STEM + TAGGING ( Labeling of words POS – Verb, noun,etc).

EXTRACTION CONFIGURATIONS:  ( to extract Entities & FACTS )

  1. EXTRACTION_CORE :  Only extracts Entity types (like name , organization , city, language, etc).
  2. EXTRACTION_CORE_VOICEOFCUSTOMER : extracts Entity types and relationship between them (sentiment analysis)

CUSTOM CONFIGURATIONS:

  1. Creating a custom configuration from existing configuration: (.hdbtextconfig)

Repositories->SAP->HANA->TA->Config. We can find all  5 config files(Ling analysis & extraction)

We can

  1. Include / exclude Analyzers ,
  2. increase /decrease the sample text for analyzing language,
  3. Turn ON / OFF the POS, stemming,Tokenizing
  4. Enable custom dictionaries  Should always be set to TRUE

Activate the text config file and use the file in the CONFIGURATION part of the index.

CREATE FULLTEXT INDEX "nameofindex" On <SCHEMA_NAME>."<table>"("<column>")
TEXT ANALYSIS ON
CONFIGURATION 'sap.hana.ta.config::<CUSTOMCONFIG>'

 

EXAMPLES:

 

ENTITY EXTRACTION (SENTIMENT ANALYSIS):

DROP TABLE LANGUAGE_DETECT;

Create column table language_detect (ID smallint Primary Key, content nvarchar(50), lang varchar(2));

insert into language_detect values(2,'JOHN LOVES TO PLAY FOOTBALL','EN');

DROP FULLTEXT INDEX LANG_INDEX;

CREATE FULLTEXT INDEX LANG_INDEX ON language_detect(CONTENT)

LANGUAGE COLUMN LANG

CONFIGURATION 'EXTRACTION_CORE_VOICEOFCUSTOMER'

TEXT ANALYSIS ON;

SELECT TOP 1000 * FROM "$TA_LANG_INDEX";

 

 

LINGUISTIC ANALYSIS :

 

DROP FULLTEXT INDEX LANG_INDEX;

CREATE FULLTEXT INDEX LANG_INDEX ON language_detect(CONTENT)

LANGUAGE COLUMN LANG

CONFIGURATION 'LINGANALYSIS_FULL'

TEXT ANALYSIS ON;
 

SELECT TOP 1000 * FROM "$TA_LANG_INDEX";

 

 

  LANGUAGE DETECTION PARAMETER:

 

select * from SYS.M_TEXT_ANALYSIS_LANGUAGES -- TO FIND THE LIST OF SUPPORTED LANGUAGES.

 

Insert the following texts into a table for testing the detection:

 

 

German : Ich mag Musik

English: I love Music

Chinese (traditional) : 我愛音樂

 

insert  into language_detect values(1,'我愛音樂');
insert into language_detect values(2,'I LOVE MUSIC');
insert into language_detect values(3,'Ich mag Musik');

 

Create Index without specifying the LANGUAGE DETECTION option:

 

CREATE FULLTEXT INDEX LANG_INDEX ON language_detect(CONTENT)
TEXT ANALYSIS ON
LANGUAGE DETECTION(‘ZH’,'EN')-- Analyse and display the results only for the specified two languages
CONFIGURATION 'EXTRACTION_CORE_VOICEOFCUSTOMER' ;
 

SELECT TOP 1000 * FROM "$TA_LANG_INDEX"


If no LANGUAGE DETECTION parameter is specified, then  only the Words which are in ENGLISH (Default language) are considered.

 

LANGUAGE COLUMN:

 

Speed up the analysis by eliminating the detection of language by explicitly having the language code in a separate column.

 

DROP TABLE LANGUAGE_DETECT;
 

Create column table language_detect (ID smallint Primary Key, content nvarchar(50), lang varchar(2));
 

insert into language_detect values(2,'I LOVE MUSIC','EN');
 

insert into language_detect values(3,'Ich mag Musik','DE');
 

DROP FULLTEXT INDEX LANG_INDEX;
 

CREATE FULLTEXT INDEX LANG_INDEX ON language_detect(CONTENT)
LANGUAGE COLUMN LANG --Specified the language column as LANG (To fetch the language code from this column)
CONFIGURATION 'EXTRACTION_CORE_VOICEOFCUSTOMER'

TEXT ANALYSIS ON;

 

SELECT TOP 1000 * FROM "$TA_LANG_INDEX";


MIME TYPES:

Used to specify MIME type (Format of the content) to improve the performance.

 

Supported mime types for text analysis:

 

SELECT * FROM SYS.M_TEXT_ANALYSIS_MIME_TYPES;

DROP TABLE mime_types;
Create column table mime_types (ID smallint Primary Key, content BLOB);

 

–NOTE: BLOB will take up more memory and hence no point in using BLOB to just load plain text files.

 

–Insert the data through any program (Java/Python,etc)

 

DROP FULLTEXT INDEX MIME_INDEX;
CREATE FULLTEXT INDEX MIME_INDEX ON mime_types(CONTENT)
CONFIGURATION 'EXTRACTION_CORE_VOICEOFCUSTOMER'
TEXT ANALYSIS ON
MIME TYPE 'application/pdf'; --Specify the type of file to be loaded to the BLOB column
 

SELECT TOP 1000 * FROM "$TA_MIME_INDEX";

 

 

                                                             TOKEN SEPARATORS:

 

Specify only the separators that are needed to split the words in a set of text.

 

DROP TABLE TOKEN_SEPARATOR;
Create column table TOKEN_SEPARATOR (ID smallint Primary Key, content nvarchar(200));
 

DROP FULLTEXT INDEX TOKEN_INDEX;
CREATE FULLTEXT INDEX TOKEN_INDEX ON
TEXT ANALYSIS ON
TOKEN SEPARATORS '/\:;"''[]_';

 

insert into TOKEN_SEPARATOR VALUES (1,’fellow-student writes semi-final’); — Treats ‘-‘ as a part of the string and doesn’t tokenize.

 

SELECT TOP 1000 * FROM "$TA_TOKEN_INDEX";

 

 

                                                          SYNCHRONIZATION :

 

We can set the time/document count for synchronizing the index table with the contents.

 

DROP TABLE SYNC;
Create column table SYNC (ID smallint Primary Key, content nvarchar(200));
 

DROP FULLTEXT INDEX SYNC_INDEX;
CREATE FULLTEXT INDEX SYNC_INDEX ON SYNC(CONTENT)
TEXT ANALYSIS ON
ASYNC FLUSH EVERY 1 MINUTES; --SYNCHRONIZES ONCE EVERY MINUTE (Only once every minute, index table entries are refreshed)
 

DROP FULLTEXT INDEX SYNC_INDEX;
CREATE FULLTEXT INDEX SYNC_INDEX ON SYNC(CONTENT)
TEXT ANALYSIS ON
ASYNC FLUSH AFTER 5 DOCUMENTS; --SYNCHRONIZES ONCE AFTER EVERY 5 RECORDS
(Index table entries are refreshed once in every 5 records inserted into the content table)
 

ALTER FULLTEXT INDEX SYNC_INDEX SUSPEND QUEUE; -- SUSPEND THE ANALYSIS (Index table entries are not refreshed);
 

ALTER FULLTEXT INDEX SYNC_INDEX ACTIVATE QUEUE; -- ACTIVATE THE ANALYSIS (Index table entries are refreshed);


 

CUSTOM DICTIONARY:

Step1: Create the list of words in a file in project explorer(.hdbtextdict)

Step2: Add the file to any of the “.hdbtextconfig” file.

.hdbtextdict:

       This file should be in XML format.

<?xml version=”1.0” encoding=”UTF-8”?>
<dictionary xmlns=http://sap.com/ta/4.0>
       <entity_category name=”IPL CRICKET TEAM”>
              <entity_name standard_form="Chennai Super Kings">
                     <variant name="CSK" type="ABBREV" />
                     <variant name="C.S.K" type="ABBREV" />
                     <variant name="super kings" />
                     <variant generation type ="standard" language="english" />
              </entity_name>
       </entity_category>
</dictionary>

 

 

.hdbtextconfig :

 

Create a custom configuration file (.hdbtextconfig) having any configuration as the base.

 

We will find the following at the bottom of any configuration file:

 

–create a .hdbtextconfig  file (“iplteamconfig.hdbtextconfig” )by editing the following in the existing configuration files.

 

<property name=”Dictionaries” type=”string-list”>
       <string-list-value> sap.hana.ta.config::iplteam.hdbtextdict</ string-list-value>
</property>

 

Use this in the Configuration while defining the INDEX creation:

 

DROP FULLTEXT INDEX DICT_INDEX;
CREATE FULLTEXT INDEX DICT_INDEX ON DICT(CONTENT)
CONFIGURATION 'sap.hana.ta.config::iplteam'; --The custom configuration file name
TEXT ANALYSIS ON;

 

                                                    REQUEST EXTRACTION:

 

General Requests : general req. by customers for enhancement , improvement, etc

Contact Requests : contacts given by customer like contact me at… call me at …

 

INSERT INTO prof_test VALUES (2,' CALL ME AT 8884484855 ');
INSERT INTO prof_test VALUES (10,' An additional key would be good on this keyboard');
INSERT INTO prof_test VALUES (3,'I want your customer care to contact me as soon as possible');
 

 

DROP FULLTEXT INDEX prof_index;
CREATE FULLTEXT INDEX prof_index ON prof_test(CONTENT)
CONFIGURATION 'EXTRACTION_CORE_VOICEOFCUSTOMER'
TEXT ANALYSIS ON;
 

SELECT   * FROM "$TA_PROF_INDEX";
 

 

 

 

                                                         EMOTICONS EXTRACTION:

 

DROP TABLE prof_test;
Create column table prof_test (ID smallint Primary Key, content nvarchar(300));
INSERT INTO prof_test VALUES (2,' I CLEARED MY CERTIFICATION :-D ');
INSERT INTO prof_test VALUES (10,' THE PARTY WAS GOOD :)');
INSERT INTO prof_test VALUES (3,'I HATE TRAFFIC :(((');
 

 

DROP FULLTEXT INDEX prof_index;
CREATE FULLTEXT INDEX prof_index ON prof_test(CONTENT)
CONFIGURATION 'EXTRACTION_CORE_VOICEOFCUSTOMER'
TEXT ANALYSIS ON;
 

SELECT   * FROM "$TA_PROF_INDEX";

 

To report this post you need to login first.

Be the first to leave a comment

You must be Logged on to comment or reply to a post.

Leave a Reply