With the release of SPS05, SAP implemented a considerable number of changes and made available a handful of new features, some of them long awaited by modelers/developers, like the ability to debug procedures (finally!!). But from my perspective, one of the coolest new features in SPS05 is Text Analysis. The main goal of this new feature is to extract meaningful information from text. In other words, companies can now process big volumes of data sources and extract meaningful information without having to read every single sentence. Now, what is meaningful information? The extraction process identifies the “who”, “what”, “where”, “when” and “how much” (among other things) in unstructured data, and this way enables you to enrich your structured data.
If we want to understand how it works from a very high-level perspective, we just need to refer to the SAP HANA Developer Guide:
“The text analysis provides a vast number of possible entity types and analysis rules for many industries in 20 languages. However, you do not have to deal with this complexity when analyzing your individual set of documents. The language modules included with the software contain system dictionaries and provide an extensive set of predefined entity types. The extraction process can extract entities using these lists of specific entities. It can also discover new entities using linguistic models. Extraction classifies each extracted entity by entity type and presents this metadata in a standardized format.”
The most impressive thing about Text Analysis is how easy it is to implement. I’m going to walk you through the steps I followed to make it work:
First of all, I created a very simple column table that looks like this:
CREATE COLUMN TABLE "PRESS_RELEASES" (
"File_Name" NVARCHAR(20),
"File_Content" BLOB ST_MEMORY_LOB,
PRIMARY KEY ("File_Name"));
In this table I’m going to store a number of PDF files that contain information from the aviation industry in the US. In order to do so, I created a very simple python script that stores the PDF files in the BLOB field (“File_Content”) of my table:
from hdbcli import dbapi #SAP HANA Python driver

con = dbapi.connect('hanahost', 30015, 'SYSTEM', '********') #Open connection to SAP HANA
cur = con.cursor() #Open a cursor
file = open('doc.pdf', 'rb') #Open file in read-only and binary mode
content = file.read() #Save the content of the file in a variable
cur.execute("INSERT INTO PRESS_RELEASES VALUES(?,?)", ('doc.pdf', content)) #Save the content to the table
con.commit() #Persist the insert
file.close() #Close the file
cur.close() #Close the cursor
con.close() #Close the connection
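If you have a whole folder of PDF files to load, the same insert can be wrapped in a small helper. This is just a sketch of the approach generalized: the helper name `load_pdfs`, the folder argument, and the default table name are my own illustration, and the function accepts any DB-API cursor (such as the one opened with hdbcli above):

```python
import glob
import os

def load_pdfs(cur, folder, table="PRESS_RELEASES"):
    """Insert every PDF in `folder` into `table` as (File_Name, File_Content)."""
    loaded = []
    for path in sorted(glob.glob(os.path.join(folder, "*.pdf"))):
        with open(path, "rb") as f:  # read-only, binary mode
            content = f.read()
        name = os.path.basename(path)  # the file name becomes the primary key
        cur.execute("INSERT INTO %s VALUES(?,?)" % table, (name, content))
        loaded.append(name)
    return loaded
```

With hdbcli this would be called as `load_pdfs(con.cursor(), 'pdfs')`, followed by `con.commit()`.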
Now that I have my table loaded with unstructured data I’m ready to begin my text analysis process. The only thing I need to do is run the following statement:
CREATE FULLTEXT INDEX "PDF_FTI" ON "PRESS_RELEASES"("File_Content")
TEXT ANALYSIS ON
CONFIGURATION 'EXTRACTION_CORE';
So what I’m doing here is creating a full-text index called “PDF_FTI” (you can use any name) on the BLOB column “File_Content” of my table “PRESS_RELEASES”. I’m also saying that I would like to turn Text Analysis on and use the configuration called ‘EXTRACTION_CORE’ (you can refer to the SAP HANA Developer Guide for the different configurations). When this statement is executed, a new column table called $TA_PDF_FTI ($TA_&lt;Index_Name&gt;) is created that contains the result of my Text Analysis process. The structure of this table looks like this:
| Column Name | Key | Description | Data Type |
|---|---|---|---|
| File_Name | Yes | The primary key of my table. If your primary key has more than one column, the $TA table will include every single one of them | Same as in the source table; in this case NVARCHAR(20) |
| RULE | Yes | Stores the rule package that yielded the token. In my case: “Entity Extraction” | NVARCHAR(200) |
| COUNTER | Yes | A counter across all tokens in the document | BIGINT |
| TOKEN | No | The token that was extracted (the “who”, “what”, “where”, “when” and “how much”) | NVARCHAR(250) |
| LANGUAGE | No | You can either specify a language column when you create the full-text index, or the language can be derived from the text. In my case it was derived from the text and is English (en) | NVARCHAR(2) |
| TYPE | No | The token type, i.e. whether it is a “who”, a “what”, a “where”, etc. | NVARCHAR(100) |
| NORMALIZED | No | Stores a normalized representation of the token. This becomes relevant e.g. for German with umlauts, or ß/ss. Normalization of capitalization alone would not be important enough to justify this column | NVARCHAR(250) |
| STEM | No | Stores the linguistic stemming information, e.g. the singular nominative for nouns, or the indicative for verbs. If text analysis yields several stems, only the first stem is stored, assuming it to be the best match | NVARCHAR(300) |
| PARAGRAPH | No | The paragraph number where the token is located in the document | INTEGER |
| SENTENCE | No | The sentence number where the token is located in the document | INTEGER |
| CREATED_AT | No | Creation timestamp | TIMESTAMP |
This is a Data Preview of my $TA table:
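If you want to get a quick overview of the extraction results yourself, a simple aggregation over the $TA table works well. A sketch, assuming the index name from above (the column names are the ones listed in the table):

```sql
-- Most frequent entity types extracted across all documents
SELECT "TYPE", COUNT(*) AS "TOKEN_COUNT"
FROM "$TA_PDF_FTI"
GROUP BY "TYPE"
ORDER BY "TOKEN_COUNT" DESC;
```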
Now, let’s pretend I work in an airline, for example Delta Air Lines, and I would like to know whether the company I work for is mentioned in any of the press releases. I can easily find the answer by filtering the column TYPE = ‘ORGANIZATION/COMMERCIAL’ and TOKEN = ‘Delta Air Lines’. As you can see in the screenshot below Delta Air Lines is mentioned twice in the bts201001.pdf file (paragraphs 9 and 151), twice in bts201002.pdf (paragraphs 9 and 136) and so on.
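In SQL terms, the lookup I just described is a sketch like this (again assuming the index name $TA_PDF_FTI from above; the token must be spelled exactly as the extractor stored it):

```sql
-- Where is Delta Air Lines mentioned, and in which paragraph/sentence?
SELECT "File_Name", "PARAGRAPH", "SENTENCE"
FROM "$TA_PDF_FTI"
WHERE "TYPE" = 'ORGANIZATION/COMMERCIAL'
  AND "TOKEN" = 'Delta Air Lines'
ORDER BY "File_Name", "PARAGRAPH";
```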
I’m very skeptical by nature, so I wanted to verify that this is accurate. I built a very simple XS JavaScript service to show the content of one of the PDF files (bts201002.pdf) that looks like this:
var conn = $.db.getConnection();
try {
    var query = "SELECT \"File_Content\" FROM \"PRESS_RELEASES\" WHERE \"File_Name\" = 'bts201002.pdf'";
    var pstmt = conn.prepareStatement(query);
    var rs = pstmt.executeQuery();
    rs.next();
    // The header value must not repeat the header name
    $.response.headers.set("Content-Disposition", "attachment; filename=bts201002.pdf");
    $.response.contentType = 'application/pdf';
    $.response.setBody(rs.getBlob(1));
    $.response.status = $.net.http.OK;
} catch (e) {
    $.response.setBody("Error while downloading: " + e);
    $.response.status = 500;
}
conn.close();
Like my Text Analysis Process said, Delta Air Lines is mentioned twice in this particular file and it’s right where my $TA table said it would be:
Hope you find this post useful.
Follow me on Twitter: @LukiSpa