With the release of SPS05, SAP implemented a considerable number of changes and made available a handful of new features, some of them long awaited by modelers/developers, like the ability to debug procedures (finally!!). But from my perspective, one of the coolest new features in SPS05 is Text Analysis. The main goal of this new feature is to extract meaningful information from text. In other words, companies can now process big volumes of data sources and extract meaningful information without having to read every single sentence. Now, what is meaningful information? The extraction process identifies the “who”, “what”, “where”, “when” and “how much” (among other things) in unstructured data, and this way enables you to enrich your structured data.
If we want to understand how it works from a very high-level perspective, we just need to refer to the SAP HANA Developer Guide:
“The text analysis provides a vast number of possible entity types and analysis rules for many industries in 20 languages. However, you do not have to deal with this complexity when analyzing your individual set of documents. The language modules included with the software contain system dictionaries and provide an extensive set of predefined entity types. The extraction process can extract entities using these lists of specific entities. It can also discover new entities using linguistic models. Extraction classifies each extracted entity by entity type and presents this metadata in a standardized format.”
The most impressive thing about Text Analysis is how easy it is to implement. I’m going to walk you through the steps I followed to make it work:
First of all, I created a very simple column table that looks like this:
CREATE COLUMN TABLE "PRESS_RELEASES" (
"File_Name" NVARCHAR(20),
"File_Content" BLOB ST_MEMORY_LOB,
PRIMARY KEY ("File_Name"));
In this table I’m going to store a number of PDF files that contain information from the aviation industry in the US. In order to do so, I created a very simple python script that stores the PDF files in the BLOB field (“File_Content”) of my table:
from hdbcli import dbapi #SAP HANA Python driver

con = dbapi.connect('hanahost', 30015, 'SYSTEM', '********') #Open connection to SAP HANA
cur = con.cursor() #Open a cursor
file = open('doc.pdf', 'rb') #Open file in read-only and binary mode
content = file.read() #Save the content of the file in a variable
cur.execute("INSERT INTO PRESS_RELEASES VALUES(?,?)", ('doc.pdf', content)) #Save the content to the table
con.commit() #Persist the insert
file.close() #Close the file
cur.close() #Close the cursor
con.close() #Close the connection
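If you have a whole folder of PDF files to load, the same insert can be wrapped in a small helper. This is just a sketch of the approach generalized: the helper name `load_pdfs`, the folder argument, and the default table name are my own illustration, and the function accepts any DB-API cursor (such as the one opened with hdbcli above):

```python
import glob
import os

def load_pdfs(cur, folder, table="PRESS_RELEASES"):
    """Insert every PDF in `folder` into `table` as (File_Name, File_Content)."""
    loaded = []
    for path in sorted(glob.glob(os.path.join(folder, "*.pdf"))):
        with open(path, "rb") as f:  # read-only, binary mode
            content = f.read()
        name = os.path.basename(path)  # the file name becomes the primary key
        cur.execute("INSERT INTO %s VALUES(?,?)" % table, (name, content))
        loaded.append(name)
    return loaded
```

With hdbcli this would be called as `load_pdfs(con.cursor(), 'pdfs')`, followed by `con.commit()`.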
Now that I have my table loaded with unstructured data I’m ready to begin my text analysis process. The only thing I need to do is run the following statement:
CREATE FULLTEXT INDEX "PDF_FTI" ON "PRESS_RELEASES"("File_Content")
TEXT ANALYSIS ON
CONFIGURATION 'EXTRACTION_CORE';
So what I’m doing here is creating a full-text index called “PDF_FTI” (you can use any name) on the BLOB column “File_Content” of my table “PRESS_RELEASES”. I’m also saying that I would like to turn Text Analysis on and use the configuration called ‘EXTRACTION_CORE’ (you can refer to the SAP HANA Developer Guide for the different configurations). When this statement is executed, a new column table called $TA_PDF_FTI ($TA_&lt;Index_Name&gt;) is created that contains the result of my Text Analysis process. The structure of this table looks like this:
| Column Name | Key | Description | Data Type |
|---|---|---|---|
| File_Name | Yes | The primary key of my table. If your primary key has more than one column, the $TA table will include every single one of them | Same as in the source table; in this case NVARCHAR(20) |
| RULE | Yes | Stores the rule package that yielded the token. In my case: “Entity Extraction” | NVARCHAR(200) |
| COUNTER | Yes | A counter across all tokens in the document | BIGINT |
| TOKEN | No | The token that was extracted (the “who”, “what”, “where”, “when” and “how much”) | NVARCHAR(250) |
| LANGUAGE | No | You can either specify a language column when you create the full-text index, or the language can be derived from the text. In my case it was derived from the text and is English (en) | NVARCHAR(2) |
| TYPE | No | The token type, i.e. whether it is a “who”, a “what”, a “where”, etc. | NVARCHAR(100) |
| NORMALIZED | No | Stores a normalized representation of the token. This becomes relevant e.g. for German with umlauts, or ß/ss. Normalization of capitalization alone would not be important enough to justify this column | NVARCHAR(250) |
| STEM | No | Stores the linguistic stemming information, e.g. the singular nominative for nouns, or the indicative for verbs. If text analysis yields several stems, only the first stem is stored, assuming it to be the best match | NVARCHAR(300) |
| PARAGRAPH | No | The paragraph number where the token is located in the document | INTEGER |
| SENTENCE | No | The sentence number where the token is located in the document | INTEGER |
| CREATED_AT | No | Creation timestamp | TIMESTAMP |
This is a Data Preview of my $TA table:
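If you want to get a quick overview of the extraction results yourself, a simple aggregation over the $TA table works well. A sketch, assuming the index name from above (the column names are the ones listed in the table):

```sql
-- Most frequent entity types extracted across all documents
SELECT "TYPE", COUNT(*) AS "TOKEN_COUNT"
FROM "$TA_PDF_FTI"
GROUP BY "TYPE"
ORDER BY "TOKEN_COUNT" DESC;
```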
Now, let’s pretend I work in an airline, for example Delta Air Lines, and I would like to know whether the company I work for is mentioned in any of the press releases. I can easily find the answer by filtering the column TYPE = ‘ORGANIZATION/COMMERCIAL’ and TOKEN = ‘Delta Air Lines’. As you can see in the screenshot below Delta Air Lines is mentioned twice in the bts201001.pdf file (paragraphs 9 and 151), twice in bts201002.pdf (paragraphs 9 and 136) and so on.
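In SQL terms, the lookup I just described is a sketch like this (again assuming the index name $TA_PDF_FTI from above; the token must be spelled exactly as the extractor stored it):

```sql
-- Where is Delta Air Lines mentioned, and in which paragraph/sentence?
SELECT "File_Name", "PARAGRAPH", "SENTENCE"
FROM "$TA_PDF_FTI"
WHERE "TYPE" = 'ORGANIZATION/COMMERCIAL'
  AND "TOKEN" = 'Delta Air Lines'
ORDER BY "File_Name", "PARAGRAPH";
```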
I’m very skeptical by nature, so I wanted to verify that this is accurate. I built a very simple XS JavaScript service to show the content of one of the PDF files (bts201002.pdf) that looks like this:
var conn = $.db.getConnection();
try {
    var query = "SELECT \"File_Content\" FROM \"PRESS_RELEASES\" WHERE \"File_Name\" = 'bts201002.pdf'";
    var pstmt = conn.prepareStatement(query);
    var rs = pstmt.executeQuery();
    rs.next();
    // The header value must not repeat the header name
    $.response.headers.set("Content-Disposition", "attachment; filename=bts201002.pdf");
    $.response.contentType = 'application/pdf';
    $.response.setBody(rs.getBlob(1));
    $.response.status = $.net.http.OK;
} catch (e) {
    $.response.setBody("Error while downloading: " + e);
    $.response.status = 500;
}
conn.close();
Like my Text Analysis Process said, Delta Air Lines is mentioned twice in this particular file and it’s right where my $TA table said it would be:
Hope you find this post useful.
Follow me on Twitter: @LukiSpa