
Text Search and Text Analysis with SAP HANA:-

Text Analysis:-

Text Analysis is the process of analyzing unstructured text, extracting relevant information and then transforming that information into structured information that can be queried and leveraged in different ways.

TEXT1.png

TEXT2.png

Hidden facts in Text:-

80% of enterprise information originates in unstructured data, making this a huge source of information. Unstructured data provides insights into customers’ perceptions of brands, products, marketing campaigns, and the like. Text analysis also enables request extraction, a method used to extract wishes or requests for improvement from customers.

TEXT3.png

TEXT4.png

SEO (Search Engine Optimization) Analytics:-

TEXT5.png

Text Data Processing

TEXT6.png


TEXT7.png

SAP HANA supports in-database Text Analysis (as of SPS05).

The main goal of this feature is to extract meaningful information from texts. In other words, companies can now process large volumes of data and extract meaningful information without having to read every single sentence.

TEXT8.png

SAP HANA’s full text analysis goes beyond simple keyword searches.

The table shows that SAP HANA’s text analysis includes entity extraction, sentiment analysis, and much more.

Various file formats, such as PDF, TXT, XML, and HTML can be loaded and analyzed in SAP HANA.

Terminology:

Normalization – transforming text into a single canonical form, e.g. “résumé” -> “resume”

Tokenization – decomposing a word sequence into individual tokens, e.g. “the quick brown fox” -> “the” “quick” “brown” “fox”

Stemming – reducing words to their base form, e.g. “flew” or “flying” -> “fly”

Part-of-speech tagging – e.g. “quick” -> adjective; “houses” -> noun – plural

Fuzzy Searching – approximate string searching
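The terms above can be illustrated with a minimal Python sketch. It is not how SAP HANA implements these steps internally; the function names and the toy stem table are assumptions made for illustration, and only the Python standard library is used.

```python
import difflib
import unicodedata


def normalize(text: str) -> str:
    """Lowercase the text and strip diacritics, e.g. 'résumé' -> 'resume'."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.lower()


def tokenize(text: str) -> list[str]:
    """Split a word sequence on whitespace (real tokenizers also handle punctuation)."""
    return text.split()


# Toy lookup table, purely illustrative; real stemmers use dictionaries or rules.
TOY_STEMS = {"flew": "fly", "flying": "fly"}


def stem(word: str) -> str:
    """Return the base form from the toy table, or the word unchanged."""
    return TOY_STEMS.get(word, word)


def fuzzy_match(term: str, candidates: list[str], cutoff: float = 0.8) -> list[str]:
    """Approximate string search: candidates whose similarity to term is >= cutoff."""
    return difflib.get_close_matches(term, candidates, n=3, cutoff=cutoff)
```

For example, `fuzzy_match("Acenture", ["Accenture", "SAP", "Oracle"])` finds "Accenture" despite the misspelling, which is the kind of tolerance a fuzzy search is meant to provide.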

TEXT9.png

Text analysis with SAP HANA requires that the unstructured data is of a supported file type and is loaded into a HANA table.

Text being loaded into HANA tables is saved in individual rows. These rows are called documents.

Each document must have an ID.

TEXT10.png

Configuration:-

A configuration tells SAP HANA which type of analysis the user wants to perform.

Configurations are saved in XML format and contain all the important text analysis options.

Users can access configurations through the HANA repository. There are five predefined configurations.

TEXT11.png


TEXT12.png

Loading PDF documents into SAP HANA

The easiest and quickest way to load binary documents into a HANA table is by using a Python script. The user can use the same script for multiple documents. The only parameters that have to be adjusted are:

  • HANA server connection information
  • Path of the binary document
  • Schema/table name
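A loading script along these lines might look like the sketch below. It is not the script from the video: the function name `load_document` is hypothetical, and the sketch assumes a DB-API 2.0 connection that accepts `?` placeholders and a target table with an ID column followed by a binary (BLOB) column.

```python
from pathlib import Path


def load_document(conn, table: str, doc_id: int, path: str) -> int:
    """Insert one binary file as one row (one "document"); return its size in bytes.

    conn  -- any DB-API 2.0 connection that uses '?' placeholders (assumption)
    table -- interpolated into the SQL text, so it must come from trusted input
    """
    content = Path(path).read_bytes()
    cursor = conn.cursor()
    try:
        # Column order is assumed to be (ID, binary content).
        cursor.execute(f"INSERT INTO {table} VALUES (?, ?)", (doc_id, content))
    finally:
        cursor.close()
    conn.commit()
    return len(content)


# Hypothetical usage with SAP's hdbcli driver. Host, port, credentials, the
# table "TXT"."PDF_DOCS", and the file path are placeholders, not values
# taken from this post:
#
#   from hdbcli import dbapi
#   conn = dbapi.connect(address="hana.example.com", port=30015,
#                        user="...", password="...")
#   load_document(conn, '"TXT"."PDF_DOCS"', 1, "/path/to/report.pdf")
```

Only the three parameters listed above change between runs: the connection details, the file path, and the schema/table name.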

Additional information on data provisioning of binary files into HANA can be retrieved at academy.saphana.com

The SAP HANA Academy provides a how-to video on loading data via a Python script:

Video available @ https://www.youtube.com/watch?v=CUZcDecMnxI

TEXT13.png

Unstructured data must be of a supported file type and is loaded into a HANA table.

Unstructured data is saved in a source table. Each file is a separate record and receives an ID. This ID will serve as a foreign key in the results table.

The user chooses what kind of analysis they want to perform (e.g. sentiment analysis, entity extraction, linguistic analysis).

The user creates a FULLTEXT INDEX.

Results are saved in a separate table with the prefix $TA_.


TEXT14.png

EXERCISE:

Creating Table for inserting text:-

CREATE SCHEMA TXT;

-- One row per document; ID identifies the document in the result table.
CREATE COLUMN TABLE "TXT"."DEMOTABLE"
(ID INTEGER PRIMARY KEY,
STRING NVARCHAR(200));

INSERT INTO "TXT"."DEMOTABLE" VALUES (1, 'Tom enjoys working at Accenture');
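A result table only appears once a full-text index with text analysis enabled exists on the text column. The Python sketch below assembles such a statement for the demo table. The index name DEMO_IDX and the helper names are hypothetical, and the configuration names are listed as I understand them from SAP documentation, so verify the names and the exact syntax against your HANA release before running the generated SQL.

```python
# Predefined configuration names (assumption; check your HANA repository).
KNOWN_CONFIGURATIONS = {
    "LINGANALYSIS_BASIC",
    "LINGANALYSIS_STEMS",
    "LINGANALYSIS_FULL",
    "EXTRACTION_CORE",
    "EXTRACTION_CORE_VOICEOFCUSTOMER",
}


def quote_ident(name: str) -> str:
    """Wrap an identifier in double quotes, doubling any embedded quote."""
    return '"' + name.replace('"', '""') + '"'


def fulltext_index_sql(schema: str, table: str, column: str,
                       index_name: str, configuration: str) -> str:
    """Build a CREATE FULLTEXT INDEX statement with text analysis switched on."""
    if configuration not in KNOWN_CONFIGURATIONS:
        raise ValueError(f"unknown configuration: {configuration}")
    return (
        f"CREATE FULLTEXT INDEX {quote_ident(schema)}.{quote_ident(index_name)} "
        f"ON {quote_ident(schema)}.{quote_ident(table)} ({quote_ident(column)}) "
        f"CONFIGURATION '{configuration}' "
        "TEXT ANALYSIS ON"
    )


def result_table_name(schema: str, index_name: str) -> str:
    """Name of the result table: the index name prefixed with $TA_."""
    return f'{quote_ident(schema)}.{quote_ident("$TA_" + index_name)}'


# Statement for the demo table, using the sentiment-oriented configuration.
statement = fulltext_index_sql(
    "TXT", "DEMOTABLE", "STRING", "DEMO_IDX", "EXTRACTION_CORE_VOICEOFCUSTOMER"
)
```

With these inputs, `result_table_name("TXT", "DEMO_IDX")` yields `"TXT"."$TA_DEMO_IDX"`, which is the table to query for the extracted entities and sentiments.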

Result Table in HANA

TEXT15.png

TEXT16.png

TEXT17.png

The TA_TYPE column specifies the type of entity extracted. For instance, PERSON usually refers to people.

Sentiments are divided into Positive-, WeakPositive-, StrongPositive-, Negative-, WeakNegative-, and StrongNegative sentiments.
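Once result rows have been fetched, the sentiment types can be grouped by polarity. The sketch below assumes that TA_TYPE values are spelled like WeakPositiveSentiment or StrongNegativeSentiment; that spelling and the sample rows are assumptions, not output copied from the exercise, so compare them with your own result table.

```python
from collections import Counter


def sentiment_polarity(ta_type: str):
    """Map a TA_TYPE value to 'positive' or 'negative'; None for other types."""
    if not ta_type.endswith("Sentiment"):
        return None
    if "Positive" in ta_type:
        return "positive"
    if "Negative" in ta_type:
        return "negative"
    return None


def summarize(rows):
    """Count polarities over (TA_TOKEN, TA_TYPE) pairs, skipping non-sentiment rows."""
    counts = Counter()
    for _token, ta_type in rows:
        polarity = sentiment_polarity(ta_type)
        if polarity is not None:
            counts[polarity] += 1
    return dict(counts)


# Hypothetical rows in (TA_TOKEN, TA_TYPE) form, for illustration only.
sample_rows = [
    ("Tom", "PERSON"),
    ("Accenture", "ORGANIZATION/COMMERCIAL"),
    ("enjoys", "WeakPositiveSentiment"),
]
summary = summarize(sample_rows)
```

Entity rows such as PERSON are ignored here, so the summary reflects only how many positive and negative sentiment tokens were extracted.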

TEXT18.png
