Technology Blogs by Members
Explore a vibrant mix of technical expertise, industry insights, and tech buzz in member blogs covering SAP products, technology, and events. Get in the mix!
cancel
Showing results for 
Search instead for 
Did you mean: 
Former Member

With the release of SPS05 SAP implemented a considerable amount of changes and made available a handful of new features, some of them long awaited by modelers/developers, like the ability to debug procedures (finally!!). But from my perspective, one of the coolest new features in SPS05 is Text Analysis. The main goal of this new feature is to extract meaningful information from texts. In other words, companies can now process big volumes of data sources and extract meaningful information without having to read every single sentence. Now, what is meaningful information? The extraction process will identify the “who”, “what”, “where”, “when” and “how much” (among other things) from unstructured data and this way will enable you to enrich your structure data.

If we want to understand how it works from a very high level perspective, we just need to refer to the SAP HANA Developer Guide:

“The text analysis provides a vast number of possible entity types and analysis rules for many industries in 20 languages. However, you do not have to deal with this complexity when analyzing your individual set of documents. The language modules included with the software contain system dictionaries and provide an extensive set of predefined entity types. The extraction process can extract entities using these lists of specific entities. It can also discover new entities using linguistic models. Extraction classifies each extracted entity by entity type and presents this metadata in a standardized format.”

The most impressive thing about Text Analysis is how easy it is to implement it. I’m going to walk you thru the steps I followed to make it work:

First of all, I created a very simple column table that looks like this:

CREATE COLUMN TABLE "PRESS_RELEASES" (

       "File_Name" NVARCHAR(20),

       "File_Content" BLOB ST_MEMORY_LOB,

       PRIMARY KEY ("File_Name"));

In this table I’m going to store a number of PDF files that contain information from the aviation industry in the US. In order to do so, I created a very simple python script that stores the PDF files in the BLOB field (“File_Content”) of my table:

con = dbapi.connect(‘hanahost', 30015, 'SYSTEM', '********') #Open connection to SAP HANA
cur = con.cursor() #Open a cursor
file = open('doc.pdf', 'rb') #Open file in read-only and binary
content = file.read() #Save the content of the file in a variable
cur.execute("INSERT INTO PRESS_RELEASES VALUES(?,?)", ('doc.pdf',content)) #Save the content to the table
file.close() #Close the file
cur.close() #Close the cursor
con.close() #Close the connection

Now that I have my table loaded with unstructured data I’m ready to begin my text analysis process. The only thing I need to do is run the following statement:

Create FullText Index "PDF_FTI" On "PRESS_RELEASES"("File_Content")

TEXT ANALYSIS ON

CONFIGURATION 'EXTRACTION_CORE';

So what I’m doing here is creating a full text index called “PDF_FTI” (you can use any name) on the BLOB column “File_Content” of my table “PRESS_RELEASES”. I’m also saying that I would like to turn Text Analysis on and that I would like to use the configuration called ‘EXTRACTION_CORE’ (you can refer to the SAP HANA Developer Guide for the different configurations). With the execution of this script a new column table is created called $TA_PDF_FTI ($TA_<Index_Name>) that contains the result of my Text Analysis Process. The structure of this table looks like this:

Column Name

Key

Description

Data Type

File_Name

Yes

This is the primary key of my table. If you have more than one column in your primary key, the $TA table will include every single column

Same as in source table. In this case: NVARCHAR(20)

RULE

Yes

Stores the rule package that yielded the token. In my case: “Entity Extraction”

NVARCHAR(200)

COUNTER

Yes

Counts all tokens across the document

BIGINT

TOKEN

No

The token that was extracted (the “who”, “what”, “where”, “when” and “how much”)

NVARCHAR(250)

LANGUAGE

No

You can either specify a language column when you create the fulltext index or it can be derived from the text. In my case it was derived from the text and is English (en)

NVARCHAR(2)

TYPE

No

The Token Type, whether it is a “who”, a “what”, a “where”, etc.

NVARCHAR(100)

NORMALIZED

No

Stores a normalized representation of the token. This becomes relevant e.g. for German with umlauts, or ß/ss. Normalization with regards to capitalization would not be as important as to justify this column.

NVARCHAR(250)

STEM

No

Stores the linguistic stemming information, e.g. the singular nominative for nouns, or the indicative for verbs. If text analysis yields several stems, only the first stem will be stored, assuming this to be the best match.

NVARCHAR(300)

PARAGRAPH

No

The paragraph number where my token is located in the document

INTEGER

SENTENCE

No

The sentence number where my token is located in the document

INTEGER

CREATED_AT

No

Creation timestamp

TIMESTAMP

This is a Data Preview of my $TA table:

Now, let’s pretend I work in an airline, for example Delta Air Lines, and I would like to know whether the company I work for is mentioned in any of the press releases. I can easily find the answer by filtering the column TYPE = ‘ORGANIZATION/COMMERCIAL’ and TOKEN = ‘Delta Air Lines’. As you can see in the screenshot below Delta Air Lines is mentioned twice in the bts201001.pdf file (paragraphs 9 and 151), twice in bts201002.pdf (paragraphs 9 and 136) and so on.

I’m very skeptic by nature, so I would like to see if this is accurate. So I built a very simple XS JavaScript script to show the content of one of the PDF files (bts201002.pdf) that looks like this:

var conn = $.db.getConnection();
try {
var query = "Select \"File_Content\" From \"PRESS_RELEASES\" Where \"File_Name\" = 'bts201002.pdf'";
          var pstmt = conn.prepareStatement(query);
          var rs = pstmt.executeQuery();
          rs.next();
$.response.headers.set("Content-Disposition", "Content-Disposition: attachment; filename=bts201002.pdf");
          $.response.contentType = 'application/pdf';
          $.response.setBody(rs.getBlob(1));
          $.response.status = $.net.http.OK;
} catch (e) {
          $.response.setBody("Error while downloading : "+e);
          $.response.status = 500;
}
conn.close();

Like my Text Analysis Process said, Delta Air Lines is mentioned twice in this particular file and it’s right where my $TA table said it would be:

Hope you find this post useful.

Follow me on Twitter: @LukiSpa

65 Comments
Labels in this area