SAP HANA Text Analysis
With the release of SPS05, SAP introduced a considerable number of changes and made available a handful of new features, some of them long awaited by modelers/developers, like the ability to debug procedures (finally!!). But from my perspective, one of the coolest new features in SPS05 is Text Analysis. The main goal of this feature is to extract meaningful information from text. In other words, companies can now process big volumes of data sources and extract meaningful information without having to read every single sentence. Now, what is meaningful information? The extraction process will identify the "who", "what", "where", "when" and "how much" (among other things) from unstructured data, and in this way will enable you to enrich your structured data.
If we want to understand how it works from a very high-level perspective, we just need to refer to the SAP HANA Developer Guide:
“The text analysis provides a vast number of possible entity types and analysis rules for many industries in 20 languages. However, you do not have to deal with this complexity when analyzing your individual set of documents. The language modules included with the software contain system dictionaries and provide an extensive set of predefined entity types. The extraction process can extract entities using these lists of specific entities. It can also discover new entities using linguistic models. Extraction classifies each extracted entity by entity type and presents this metadata in a standardized format.”
The most impressive thing about Text Analysis is how easy it is to implement. I'm going to walk you through the steps I followed to make it work:
First of all, I created a very simple column table that looks like this:
CREATE COLUMN TABLE "PRESS_RELEASES" (
"File_Name" NVARCHAR(20),
"File_Content" BLOB ST_MEMORY_LOB,
PRIMARY KEY ("File_Name"));
In this table I'm going to store a number of PDF files that contain information from the aviation industry in the US. In order to do so, I created a very simple Python script that stores the PDF files in the BLOB field ("File_Content") of my table:
import dbapi #Python driver shipped with the SAP HANA client

con = dbapi.connect('hanahost', 30015, 'SYSTEM', '********') #Open connection to SAP HANA
cur = con.cursor() #Open a cursor
file = open('doc.pdf', 'rb') #Open file in read-only and binary mode
content = file.read() #Save the content of the file in a variable
cur.execute("INSERT INTO PRESS_RELEASES VALUES(?,?)", ('doc.pdf',content)) #Save the content to the table
file.close() #Close the file
cur.close() #Close the cursor
con.close() #Close the connection
Now that I have my table loaded with unstructured data, I'm ready to begin my text analysis process. The only thing I need to do is run the following statement:
CREATE FULLTEXT INDEX "PDF_FTI" ON "PRESS_RELEASES"("File_Content")
TEXT ANALYSIS ON
CONFIGURATION 'EXTRACTION_CORE';
So what I'm doing here is creating a full-text index called "PDF_FTI" (you can use any name) on the BLOB column "File_Content" of my table "PRESS_RELEASES". I'm also saying that I would like to turn Text Analysis on and that I would like to use the configuration called 'EXTRACTION_CORE' (you can refer to the SAP HANA Developer Guide for the different configurations). With the execution of this statement, a new column table called $TA_PDF_FTI ($TA_<Index_Name>) is created that contains the result of my text analysis process. The structure of this table looks like this:
| Column Name | Key | Description | Data Type |
| --- | --- | --- | --- |
| File_Name | Yes | This is the primary key of my table. If you have more than one column in your primary key, the $TA table will include every single column. | Same as in the source table; in this case NVARCHAR(20) |
| RULE | Yes | Stores the rule package that yielded the token. In my case: "Entity Extraction" | NVARCHAR(200) |
| COUNTER | Yes | Counts all tokens across the document | BIGINT |
| TOKEN | No | The token that was extracted (the "who", "what", "where", "when" and "how much") | NVARCHAR(250) |
| LANGUAGE | No | You can either specify a language column when you create the full-text index, or it can be derived from the text. In my case it was derived from the text and is English (en) | NVARCHAR(2) |
| TYPE | No | The token type: whether it is a "who", a "what", a "where", etc. | NVARCHAR(100) |
| NORMALIZED | No | Stores a normalized representation of the token. This becomes relevant e.g. for German with umlauts, or ß/ss. Normalization with regard to capitalization alone would not be important enough to justify this column. | NVARCHAR(250) |
| STEM | No | Stores the linguistic stemming information, e.g. the singular nominative for nouns, or the indicative for verbs. If text analysis yields several stems, only the first stem is stored, assuming it to be the best match. | NVARCHAR(300) |
| PARAGRAPH | No | The paragraph number where the token is located in the document | INTEGER |
| SENTENCE | No | The sentence number where the token is located in the document | INTEGER |
| CREATED_AT | No | Creation timestamp | TIMESTAMP |
This is a Data Preview of my $TA table:
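If you prefer SQL to the Data Preview, a quick aggregation over the $TA table gives a similar overview of what was extracted (using the index name from above):

SELECT "TOKEN", "TYPE", COUNT(*) AS "OCCURRENCES"
FROM "$TA_PDF_FTI"
GROUP BY "TOKEN", "TYPE"
ORDER BY "OCCURRENCES" DESC;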
Now, let's pretend I work at an airline, for example Delta Air Lines, and I would like to know whether the company I work for is mentioned in any of the press releases. I can easily find the answer by filtering on the columns TYPE = 'ORGANIZATION/COMMERCIAL' and TOKEN = 'Delta Air Lines'. As you can see in the screenshot below, Delta Air Lines is mentioned twice in the bts201001.pdf file (paragraphs 9 and 151), twice in bts201002.pdf (paragraphs 9 and 136), and so on.
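That filter is just a select on the $TA table; a sketch of the query behind the screenshot:

SELECT "File_Name", "PARAGRAPH", "SENTENCE"
FROM "$TA_PDF_FTI"
WHERE "TYPE" = 'ORGANIZATION/COMMERCIAL'
AND "TOKEN" = 'Delta Air Lines'
ORDER BY "File_Name", "PARAGRAPH";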
I'm very skeptical by nature, so I would like to see whether this is accurate. So I built a very simple XS JavaScript service to show the content of one of the PDF files (bts201002.pdf); it looks like this:
var conn = $.db.getConnection();
try {
    var query = "SELECT \"File_Content\" FROM \"PRESS_RELEASES\" WHERE \"File_Name\" = 'bts201002.pdf'";
    var pstmt = conn.prepareStatement(query);
    var rs = pstmt.executeQuery();
    rs.next();
    // The header value must not repeat the header name
    $.response.headers.set("Content-Disposition", "attachment; filename=bts201002.pdf");
    $.response.contentType = 'application/pdf';
    $.response.setBody(rs.getBlob(1));
    $.response.status = $.net.http.OK;
} catch (e) {
    $.response.setBody("Error while downloading: " + e);
    $.response.status = 500;
}
conn.close();
Just as my text analysis process said, Delta Air Lines is mentioned twice in this particular file, and it's right where my $TA table said it would be:
Hope you find this post useful.
Follow me on Twitter: @LukiSpa
Wow!!! I do not have words to tell how perfect this article is. Lucas, if you have time, please write more articles like that 🙂
I'm really glad you liked it. I will!! 🙂
Superb. Like 😀 it.
I can create the index table $...,
but there is no data in the table.
Could you give me some advice?
Thanks.
Hi Holger, I'm glad you liked it :-). A couple of things to check; in particular, try specifying the language column explicitly when creating the index:
CREATE FULLTEXT INDEX <INDEX_NAME> ON <TAB_NAME>(<COL_NAME>)
TEXT ANALYSIS ON
CONFIGURATION 'EXTRACTION_CORE'
LANGUAGE COLUMN <LANG>
Let me know how it goes!
Cheers, Lucas.
This is amazing... thanks for sharing it.
But as Holger mentioned, the $TA table is blank for me too.
I am using the SYSTEM user and my document is in English.
Could you please let me know what else I need to take care of?
Regards
Vivek
Hi Vivek, you can take a look at the preprocessor trace file to see what is causing the $TA table to be blank. You should see an error message in the preprocessor trace file.
Hi Lucas
thanks for your input. I looked at the preprocessor trace file and found the info below. Can you help? Is there a problem with the configuration?
LOB END
[123820]{0}[0] 2013-01-29 20:56:42.862016 e preprocessor PreprocessorPool.cpp(00391) : PreprocessorPool::CreateConnection : Error while creating connection to server - TrexNet::Exception a protocol mismatch occured, returning to application with errorcode PREPROCESSOR_FAILED
[200111]{0}[0] 2013-01-29 23:03:29.297154 e TrexNet Request.cpp(01346) : ERROR INFO: cancelled Request on channel 225. Errors for this channel can be ignored.
[94490]{0}[0] 2013-01-29 23:03:30.050334 e TrexNet Channel.cpp(00326) : ERROR: reading from channel 225 to 0 <10.68.104.93:30303> failed with rc=7 (internal error)
[94490]{0}[0] 2013-01-29 23:03:30.050408 e TrexNetBuffe BufferedIO.cpp(01092) : channel 225 from <10.68.104.93:30303>: read from channel failed; resetting buffer
[94490]{0}[0] 2013-01-29 23:03:30.050613 e preprocessor PreprocessorPool.cpp(00391) : PreprocessorPool::CreateConnection : Error while creating connection to server - TrexNet::Exception an error occured while reading from the channel, returning to application with errorcode PREPROCESSOR_FAILED
[60634]{0}[0] 2013-01-31 19:39:24.567948 e preprocessor PreprocessorImpl.cpp(01484) : Text analysis error 'docmodel/document/src/AnnotationManager.cpp: Line 105: Error 0x0100001A: Too many annotations added to annotation manager.', number of errors: 1
[60634]{0}[0] 2013-01-31 19:39:24.568150 e preprocessor PreprocessorImpl.cpp(00377) : process(): Error in Document, theProcessResult was set to 6601
[60634]{0}[0] 2013-01-31 19:39:24.568213 e preprocessor Preprocessor.cpp(00664) : HANDLE: DISPATCH - Processing Document with key 'ZVIV:PRESS_RELEASESen.File_Content:7' failed, returning PREPROCESSOR_ACTIVITY_ERROR (Code 6601)
[60626]{0}[0] 2013-01-31 19:39:24.893542 e preprocessor PreprocessorImpl.cpp(01484) : Text analysis error 'docmodel/document/src/AnnotationManager.cpp: Line 105: Error 0x0100001A: Too many annotations added to annotation manager.', number of errors: 1
[60626]{0}[0] 2013-01-31 19:39:24.893622 e preprocessor PreprocessorImpl.cpp(00377) : process(): Error in Document, theProcessResult was set to 6601
[60626]{0}[0] 2013-01-31 19:39:24.893681 e preprocessor Preprocessor.cpp(00664) : HANDLE: DISPATCH - Processing Document with key 'ZVIV:PRESS_RELEASESen.File_Content:5' failed, returning PREPROCESSOR_ACTIVITY_ERROR (Code 6601)
[60634]{0}[0] 2013-01-31 20:02:03.464320 e preprocessor PreprocessorImpl.cpp(01484) : Text analysis error 'docmodel/document/src/AnnotationManager.cpp: Line 105: Error 0x0100001A: Too many annotations added to annotation manager.', number of errors: 1
[60634]{0}[0] 2013-01-31 20:02:03.464408 e preprocessor PreprocessorImpl.cpp(00377) : process(): Error in Document, theProcessResult was set to 6601
[60634]{0}[0] 2013-01-31 20:02:03.464463 e preprocessor Preprocessor.cpp(00664) : HANDLE: DISPATCH - Processing Document with key 'ZVIV:PRESS_RELEASESen.File_Content:9' failed, returning PREPROCESSOR_ACTIVITY_ERROR (Code 6601)
[60625]{0}[0] 2013-01-31 20:02:03.814165 e preprocessor PreprocessorImpl.cpp(01484) : Text analysis error 'docmodel/document/src/AnnotationManager.cpp: Line 105: Error 0x0100001A: Too many annotations added to annotation manager.', number of errors: 1
[60625]{0}[0] 2013-01-31 20:02:03.814228 e preprocessor PreprocessorImpl.cpp(00377) : process(): Error in Document, theProcessResult was set to 6601
[60625]{0}[0] 2013-01-31 20:02:03.814276 e preprocessor Preprocessor.cpp(00664) : HANDLE: DISPATCH - Processing Document with key 'ZVIV:PRESS_RELEASESen.File_Content:8' failed, returning PREPROCESSOR_ACTIVITY_ERROR (Code 6601)
[60625]{0}[0] 2013-01-31 20:06:24.079945 e preprocessor PreprocessorImpl.cpp(01484) : Text analysis error 'docmodel/document/src/AnnotationManager.cpp: Line 105: Error 0x0100001A: Too many annotations added to annotation manager.', number of errors: 1
[60625]{0}[0] 2013-01-31 20:06:24.080006 e preprocessor PreprocessorImpl.cpp(00377) : process(): Error in Document, theProcessResult was set to 6601
[60625]{0}[0] 2013-01-31 20:06:24.080059 e preprocessor Preprocessor.cpp(00664) : HANDLE: DISPATCH - Processing Document with key 'ZVIV:PRESS_RELEASESen.File_Content:11' failed, returning PREPROCESSOR_ACTIVITY_ERROR (Code 6601)
[60620]{0}[0] 2013-01-31 20:06:26.192253 e preprocessor PreprocessorImpl.cpp(01484) : Text analysis error 'docmodel/document/src/AnnotationManager.cpp: Line 105: Error 0x0100001A: Too many annotations added to annotation manager.', number of errors: 1
[60620]{0}[0] 2013-01-31 20:06:26.192322 e preprocessor PreprocessorImpl.cpp(00377) : process(): Error in Document, theProcessResult was set to 6601
[60620]{0}[0] 2013-01-31 20:06:26.192373 e preprocessor Preprocessor.cpp(00664) : HANDLE: DISPATCH - Processing Document with key 'ZVIV:PRESS_RELEASESen.File_Content:10' failed, returning PREPROCESSOR_ACTIVITY_ERROR (Code 6601)
[60622]{0}[0] 2013-01-31 20:11:38.499618 e preprocessor PreprocessorImpl.cpp(01484) : Text analysis error 'docmodel/document/src/AnnotationManager.cpp: Line 105: Error 0x0100001A: Too many annotations added to annotation manager.', number of errors: 1
[60622]{0}[0] 2013-01-31 20:11:38.499683 e preprocessor PreprocessorImpl.cpp(00377) : process(): Error in Document, theProcessResult was set to 6601
[60622]{0}[0] 2013-01-31 20:11:38.499724 e preprocessor Preprocessor.cpp(00664) : HANDLE: DISPATCH - Processing Document with key 'SYSTEM:PRESS_RELEASESen.File_Content:2' failed, returning PREPROCESSOR_ACTIVITY_ERROR (Code 6601)
[41419]{0}[0] 2013-01-31 22:49:22.390772 e preprocessor PreprocessorImpl.cpp(01484) : Text analysis error 'docmodel/document/src/AnnotationManager.cpp: Line 105: Error 0x0100001A: Too many annotations added to annotation manager.', number of errors: 1
[41419]{0}[0] 2013-01-31 22:49:22.390851 e preprocessor PreprocessorImpl.cpp(00377) : process(): Error in Document, theProcessResult was set to 6601
[41419]{0}[0] 2013-01-31 22:49:22.390938 e preprocessor Preprocessor.cpp(00664) : HANDLE: DISPATCH - Processing Document with key 'SYSTEM:PRESS_RELEASESen.File_Content:2' failed, returning PREPROCESSOR_ACTIVITY_ERROR (Code 6601)
[29254]{0}[0] 2013-02-01 09:03:36.853997 e TrexNet BufferedIO.cpp(03885) : ERROR on channel 481 <10.68.104.93:30303>: type mismatch read=EOF2 expected=String
[29254]{0}[0] 2013-02-01 09:03:36.854053 e TrexNet BufferedIO.cpp(03886) : Stacktrace for Context Name:Generic
1: 0x00007f75239d0545 in Execution::Context::dumpStackTrace(char const*, ltt::basic_ostream<char, ltt::char_traits<char> >&) const+0x31 at Context.cpp:1273 (libhdbbasis.so)
2: 0x00007f752a00fd46 in TrexTrace::Tracer::stack(char const*, unsigned int, TrexTrace::TraceLevel const&)+0xa2 at TrexTrace.cpp:1369 (libhdbbasement.so)
3: 0x00007f7529fc9975 in TrexNet::BufferedNetInput::operator>>(ltt::basic_string<char, ltt::char_traits<char> >&)+0xe1 at BufferedIO.cpp:3886 (libhdbbasement.so)
4: 0x0000000000470be2 in preprocessor::PreprocessorPool::CreateConnection(int&, ltt_adp::basic_string<char, ltt::char_traits<char>, ltt::integral_constant<bool, true> >&)+0x320 at PreprocessorPool.cpp:381 (hdbpreprocessor)
5: 0x0000000000471375 in preprocessor::PreprocessorPool::run(void*)+0x3d1 at PreprocessorPool.cpp:298 (hdbpreprocessor)
6: 0x00007f752a0046fa in TrexThreads::PoolThread::run()+0xac6 at PoolThread.cpp:258 (libhdbbasement.so)
7: 0x00007f752a005ea8 in TrexThreads::PoolThread::run(void*&)+0x14 at PoolThread.cpp:105 (libhdbbasement.so)
8: 0x00007f7523a08bf2 in Execution::Thread::staticMainImp(void**)+0x6a0 at Thread.cpp:455 (libhdbbasis.so)
9: 0x00007f7523a08ddd in Execution::Thread::staticMain(void*)+0x39 at Thread.cpp:526 (libhdbbasis.so)
[29254]{0}[0] 2013-02-01 09:03:36.873597 e TrexNetIO BufferedIO.cpp(01332) : dump of TrexNet buffer: size 65, current position 0x00007f73d71712cd (65 bytes), mode incoming, channel 481, peer 10.68.104.93:30303, content:
BLOB START (Addr: 0x00007f73d717128c Len: 65)
BLOB END
[29254]{0}[0] 2013-02-01 09:03:36.873930 e preprocessor PreprocessorPool.cpp(00391) : PreprocessorPool::CreateConnection : Error while creating connection to server - TrexNet::Exception a protocol mismatch occured, returning to application with errorcode PREPROCESSOR_FAILED
[58151]{0}[0] 2013-02-01 17:51:15.667975 e preprocessor PreprocessorImpl.cpp(01484) : Text analysis error 'docmodel/document/src/AnnotationManager.cpp: Line 105: Error 0x0100001A: Too many annotations added to annotation manager.', number of errors: 1
[58151]{0}[0] 2013-02-01 17:51:15.668100 e preprocessor PreprocessorImpl.cpp(00377) : process(): Error in Document, theProcessResult was set to 6601
[58151]{0}[0] 2013-02-01 17:51:15.668154 e preprocessor Preprocessor.cpp(00664) : HANDLE: DISPATCH - Processing Document with key 'SYSTEM:PRESS_RELEASESen.File_Content:8' failed, returning PREPROCESSOR_ACTIVITY_ERROR (Code 6601)
[58151]{0}[0] 2013-02-01 17:57:39.608469 e preprocessor PreprocessorImpl.cpp(01484) : Text analysis error 'docmodel/document/src/AnnotationManager.cpp: Line 105: Error 0x0100001A: Too many annotations added to annotation manager.', number of errors: 1
[58151]{0}[0] 2013-02-01 17:57:39.608542 e preprocessor PreprocessorImpl.cpp(00377) : process(): Error in Document, theProcessResult was set to 6601
[58151]{0}[0] 2013-02-01 17:57:39.608600 e preprocessor Preprocessor.cpp(00664) : HANDLE: DISPATCH - Processing Document with key 'SYSTEM:PRESS_RELEASESen.File_Content:3' failed, returning PREPROCESSOR_ACTIVITY_ERROR (Code 6601)
[196148]{0}[0] 2013-02-01 17:57:45.606893 e preprocessor PreprocessorImpl.cpp(01484) : Text analysis error 'docmodel/document/src/AnnotationManager.cpp: Line 105: Error 0x0100001A: Too many annotations added to annotation manager.', number of errors: 1
[196148]{0}[0] 2013-02-01 17:57:45.606961 e preprocessor PreprocessorImpl.cpp(00377) : process(): Error in Document, theProcessResult was set to 6601
[196148]{0}[0] 2013-02-01 17:57:45.607012 e preprocessor Preprocessor.cpp(00664) : HANDLE: DISPATCH - Processing Document with key 'SYSTEM:PRESS_RELEASESen.File_Content:9' failed, returning PREPROCESSOR_ACTIVITY_ERROR (Code 6601)
[198399]{0}[0] 2013-02-08 17:08:30.406229 e preprocessor PreprocessorImpl.cpp(01484) : Text analysis error 'docmodel/document/src/AnnotationManager.cpp: Line 105: Error 0x0100001A: Too many annotations added to annotation manager.', number of errors: 1
[198399]{0}[0] 2013-02-08 17:08:30.406321 e preprocessor PreprocessorImpl.cpp(00377) : process(): Error in Document, theProcessResult was set to 6601
[198399]{0}[0] 2013-02-08 17:08:30.406374 e preprocessor Preprocessor.cpp(00664) : HANDLE: DISPATCH - Processing Document with key 'SYSTEM:PRESS_RELEASESen.File_Content:14' failed, returning PREPROCESSOR_ACTIVITY_ERROR (Code 6601)
[198394]{0}[0] 2013-02-08 17:08:38.717108 e preprocessor PreprocessorImpl.cpp(01484) : Text analysis error 'docmodel/document/src/AnnotationManager.cpp: Line 105: Error 0x0100001A: Too many annotations added to annotation manager.', number of errors: 1
[198394]{0}[0] 2013-02-08 17:08:38.717179 e preprocessor PreprocessorImpl.cpp(00377) : process(): Error in Document, theProcessResult was set to 6601
[198394]{0}[0] 2013-02-08 17:08:38.717233 e preprocessor Preprocessor.cpp(00664) : HANDLE: DISPATCH - Processing Document with key 'ZVIV:PRESS_RELEASESen.File_Content:2' failed, returning PREPROCESSOR_ACTIVITY_ERROR (Code 6601)
[58151]{0}[0] 2013-02-08 17:18:20.099186 e preprocessor PreprocessorImpl.cpp(01484) : Text analysis error 'docmodel/document/src/AnnotationManager.cpp: Line 105: Error 0x0100001A: Too many annotations added to annotation manager.', number of errors: 1
[58151]{0}[0] 2013-02-08 17:18:20.099239 e preprocessor PreprocessorImpl.cpp(00377) : process(): Error in Document, theProcessResult was set to 6601
[58151]{0}[0] 2013-02-08 17:18:20.099283 e preprocessor Preprocessor.cpp(00664) : HANDLE: DISPATCH - Processing Document with key 'ZVIV:PRESS_RELEASESen.File_Content:2' failed, returning PREPROCESSOR_ACTIVITY_ERROR (Code 6601)
Hi Lucas
I figured it out.
It works for small PDF files, but I cannot do the same with bigger PDF files. Is there a configuration setting to increase the threshold value for annotations?
Regards
Vivek
Hi Vivek, sorry, you got me there, I'm not aware of such a configuration...
Lucas and Vivek,
I believe that HANA SPS05 limits each document to 1 million annotations. This limit is not configurable in SPS05.
Furthermore, if you are using the EXTRACTION_CORE configuration, this limit will apply to the total number of words plus semantic entities found in the document.
Hi Bill
Thanks for your reply. This totally explains the preprocessor trace error.
In case I have a large PDF file, do I need to split it into small documents and then load those into HANA so that the annotations and semantics can be handled?
Unfortunately yes. It's a shame too since one of the great features is automatic conversion from PDF to text. I did create an enhancement request in the HANA tracking system to provide a way for customers to increase the limit.
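A minimal sketch of that splitting approach, assuming the third-party pypdf library on the client side and the PRESS_RELEASES table from the post (the file names here are made up):

import io
import dbapi #HANA client driver, as in the post
from pypdf import PdfReader, PdfWriter #third-party library, installed separately

con = dbapi.connect('hanahost', 30015, 'SYSTEM', '********')
con.setautocommit(False) #LOB streaming is not allowed in auto-commit mode
cur = con.cursor()
reader = PdfReader('big_doc.pdf')
for i, page in enumerate(reader.pages):
    writer = PdfWriter() #one single-page PDF per source page
    writer.add_page(page)
    buf = io.BytesIO()
    writer.write(buf)
    cur.execute("INSERT INTO PRESS_RELEASES VALUES(?,?)",
                ('big_doc_p%03d.pdf' % i, buf.getvalue()))
con.commit()
cur.close()
con.close()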
Great blog!!
Basic question (sorry!): how do you execute the Python script in HANA?
Hi Abhijeet, I'm glad you liked the post. When you install the SAP HANA client on your computer, all the Python libraries are installed as well, so you can run the Python script locally once the client is installed. This is definitely true for the 64-bit version; I'm not sure whether it also holds for the 32-bit version.
Cheers, Lucas.
OMG...This totally works....Highly impressed... 🙂
A few pointers for SAP/HANA developers who have no idea about Python: Python is similar to Java. I followed this blog up to the connection test:
http://scn.sap.com/community/developer-center/hana/blog/2012/06/08/sap-hana-and-python-yes-sir
Once I had a good connection, it was pretty straightforward. Since I am using CloudShare, I had to prefix the table names with my schema name. Also, I had to add the Python home directory within the HANA client to my PATH environment variable.
Can't wait to explore further...
I feel I have learnt Python as well.
Kudos Lucas... 😎
Liked it... nope... loved it! Surely there will be multiple real-life use cases for this!
Vamsi
I'm glad you liked it! Thanks for your comment! 🙂
If somebody needs code for ODBC + .NET, you can use this snippet:
using System;
using System.Collections.Generic;
using System.Data.Odbc;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace Test_3_HANA {
    class Program {
        static void Main(string[] args) {
            // Connect via the HANA ODBC driver (the DSN must be configured beforehand)
            OdbcConnection cn = new OdbcConnection("dsn=XXX;UID=YYYYY;PWD=ZZZZZ");
            String ins = "INSERT INTO \"SYSTEM\".\"FileContent\" VALUES(?, ?)";
            OdbcCommand cmd = new OdbcCommand(ins, cn);
            String filePath = @"C:\Temp\Workflow.pdf";
            cn.Open();
            // Wrap the insert in a transaction; needed when streaming large LOBs
            OdbcTransaction trans = cn.BeginTransaction();
            cmd.Transaction = trans;
            // First parameter: the file name/path
            OdbcParameter param = new OdbcParameter("@name", OdbcType.Text);
            param.Value = filePath;
            param.Size = filePath.Length;
            cmd.Parameters.Add(param);
            // Second parameter: the file content as a byte array
            Byte[] fileContent = ReadFileByteArray(filePath);
            param = new OdbcParameter("@file", OdbcType.Binary);
            param.Value = fileContent;
            param.Size = fileContent.Length;
            cmd.Parameters.Add(param);
            cmd.ExecuteNonQuery();
            trans.Commit();
            cn.Close();
        }

        // Read the whole file into a byte array
        static public Byte[] ReadFileByteArray(String fileName) {
            FileStream fs = new FileStream(fileName, FileMode.Open, FileAccess.Read);
            Byte[] fileData = new Byte[fs.Length];
            fs.Read(fileData, 0, System.Convert.ToInt32(fs.Length));
            fs.Close();
            return fileData;
        }
    }
}
It is a dirty solution, but it works 🙂 Transactions are necessary for large files.
This is great! Thank you very much for sharing! 🙂
Hi Lucas,
I am getting this error while trying to execute the python code.
dbapi.Error: (596, 'LOB streaming is not permitted in auto-commit mode: ')
how to overcome this error? Please help.
Thanks,
Vishnu
Hi Vishnu, what you can do to make it work is set the auto-commit option to false right after you open the connection, and then manually commit the changes after the insert statement. Let me know how it goes!
con = dbapi.connect('localhost', 30015, 'SYSTEM', '********')
con.setautocommit(False)
...
cur.execute("INSERT INTO PRESS_RELEASES VALUES(?,?)", ('doc.pdf',content))
con.commit()
Thanks Lucas! it worked.. 🙂
Good info - thanks!
Hi Lucas,
Great post.
One question, though: you are using the 'EXTRACTION_CORE' configuration here.
I tried using 'EXTRACTION_CORE_VOICEOFCUSTOMER' (which I believe is meant for extracting sentiments) on a file which contained reviews for a product.
It did create the $TA table, but it was empty.
So is there any particular way to use 'EXTRACTION_CORE_VOICEOFCUSTOMER'?
Thank You.
Hi Achin, as Bill Miller pointed out in a previous comment, there's a limit on the size of the file, so maybe you can try it with a small file and see if that works. You can also look at the preprocessor trace file for the error that is causing your table to remain empty.
Thanks, Lucas.
Hi Lucas,
I'm trying to create a fuzzy search index with the following parameters:
CREATE FULLTEXT index text_analysis ON system.TMX(column_0)
fuzzy search index on
fast preprocess off
text analysis on
configuration 'EXTRACTION CORE';
but I get the following error message:
Could not execute 'create fulltext index text_analysis on system.TMX(column_0) fuzzy search index on fast preprocess ...' in 8 ms 422 µs .
SAP DBTech JDBC: [2048]: column store error: add DafAttribute failed: [366] Table has no keys, Table_TA cannot be created
If I leave the "text analysis on" option out of the create statement, the index is successfully created; otherwise it is not.
I don't understand why the $TA table cannot be created.
Could you help me please with this issue?
Thank you in advance,
Inna
Hi Inna, to be able to create the $TA table you need a primary key on the system.TMX table. Remember that the first column(s) of the $TA table are the primary key of the source table, in this case the TMX table. Let me know how it goes...
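A minimal sketch of that fix (the ID column is just an example; use whatever uniquely identifies your rows):

ALTER TABLE "SYSTEM"."TMX" ADD PRIMARY KEY ("ID");

After that, the CREATE FULLTEXT INDEX statement with TEXT ANALYSIS ON should be able to create the $TA table.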
Hi Lucas,
I got another issue while importing data from a local file, for example from CSV. My CSV file contains about 12,500 rows of textual data that I would like to import into the DB. When I use the HANA functionality for importing data from CSV, only the first 1,000 rows appear to be imported.
I also created a connection via Python ODBC and tried to write the data into the HANA DB via INSERT INTO, but I hit the same issue: just the first 1,000 entries appeared.
Could you help me with this issue?
Best regards,
Inna
Hi Inna, when you run a select statement in the SQL editor, only the first 1,000 rows are displayed, but that doesn't mean the rest are not stored in the table. Did you try counting the number of rows in the table?
SELECT COUNT(*) FROM <TABLE_NAME>;
Cheers, Lucas.
Hi Lucas,
thank you for your reply, it was helpful.
Cheers,
Inna
Hi Lucas,
Thanks for publishing the blog and explaining the concept with an understandable example.
Regards,
Pavan
Hi Lucas,
Excellent article on Text Analysis in SAP HANA. I tried to implement the same but ended up with the following errors:
[13427]{0}[0] 2013-04-03 03:33:24.556021 e preprocessor PreprocessorImpl.cpp(02072) : Error reading TA configuration file from the IS server: 'Dev Object null after reading from Repository'
[13427]{0}[0] 2013-04-03 03:33:25.556304 e preprocessor PreprocessorImpl.cpp(02121) : File read Error '/usr/sap/HDB/SYS/global/hdb/custom/config/lexicon//EXTRACTION_CORE', error='Storage object does not exist: $STORAGEOBJECT$'
[13427]{0}[0] 2013-04-03 03:33:25.556440 e preprocessor PreprocessorImpl.cpp(01134) : invalid TA_Pointer
I extracted the above information from the preprocessor trace file, as you had suggested.
Is there a way you could help me out with this particular error?
Hi Srijanth, I would try using a small PDF file and see if that works. Let me know how it goes!
Cheers, Lucas.
Hi Lucas,
I loaded a few PDF files into HANA. The original size of the files was 37.0 MB in sum.
I created the index and looked at the size of the table.
It was 1.7 MB in memory and 39.8 MB on disk.
After the delta merge I was surprised that the size increased:
it was 38.0 MB in memory and 44.7 MB on disk.
Moreover, the runtime of the SQL text-search statement increased from 15-30 ms before the delta merge to 90-120 ms after the delta merge.
Do you have an explanation? I thought delta merge should improve things in both areas. Or does this have to do with a kind of binary <-> text switch?
Best regards and thanks for this nice blog!!!
Christoph
P.S. I am working with revision 52.
Hi, what I'm thinking is that before the delta merge the documents were not loaded into memory. After running the delta merge all the documents were loaded into memory, and that explains the size increase in memory. Nevertheless, I'm not sure why the search query lost performance after the delta merge. It would be interesting to see what the memory consumption of the delta storage was before running the delta merge.
Cheers, Lucas.
Hi Lucas,
Great Post, Thanks for sharing the step by step info.
However, I am not able to see any data in the $TA table. I have not used any PDF files; I just loaded a column with customer feedback of at most 257 characters.
I investigated the preprocessor trace file, which looks like this:
[4558]{-1}[-1/-1] 2013-07-23 05:30:03.545715 e preprocessor PreprocessorPool.cpp(00438) : PreprocessorPool::CreateConnection : Error while creating connection to server - TrexNet::Exception a protocol mismatch occured, returning to application with errorcode PREPROCESSOR_FAILED
[4447]{-1}[-1/-1] 2013-07-23 05:33:45.240292 i Basis TraceStream.cpp(00383) : ==== Starting hdbpreprocessor, version 1.00.60.379371 (NewDB100_REL), build linuxx86_64 not set 2013-06-27 17:39:32 ld7270.wdf.sap.corp cc (SAP release 20130125, based on SUSE gcc43-4.3.4_20091019-0.22.17.5096.1.PTF.799300) 4.3.4 [gcc-4_3-branch revision 152973]
[4447]{-1}[-1/-1] 2013-07-23 05:33:45.240370 i Basis TraceStream.cpp(00385) : MaxOpenFiles: 1048576
[4447]{-1}[-1/-1] 2013-07-23 05:33:45.240385 i Memory MallocProxy.cpp(01222) : Installed malloc hooks
[4447]{-1}[-1/-1] 2013-07-23 05:33:45.240388 i Basis Timer.cpp(00567) : Using RDTSC for HR timer
[4447]{-1}[-1/-1] 2013-07-23 05:33:45.240390 i Memory AllocatorImpl.cpp(00835) : Allocators activated
[4447]{-1}[-1/-1] 2013-07-23 05:33:45.240391 i Memory AllocatorImpl.cpp(00851) : Using big block segment size 134217728
[4447]{-1}[-1/-1] 2013-07-23 05:33:45.240393 i Basis ProcessorInfo.cpp(00392) : Using GDT segment limit to determine current CPU ID
[4447]{-1}[-1/-1] 2013-07-23 05:33:45.240395 w Environment Environment.cpp(00284) : Changing environment set IMSLERRPATH=/usr/sap/HDB/HDB00/exe//
[4447]{-1}[-1/-1] 2013-07-23 05:33:45.240397 w Environment Environment.cpp(00284) : Changing environment set IMSLSERRPATH=/usr/sap/HDB/HDB00/exe//
[4447]{-1}[-1/-1] 2013-07-23 05:33:45.240398 w Environment Environment.cpp(00284) : Changing environment set NODNSSAPTRANSHOST=1
[4447]{-1}[-1/-1] 2013-07-23 05:33:45.240400 w Environment Environment.cpp(00263) : Changing environment set SAP_DIR_GLOBAL=/usr/sap/HDB/SYS/global
[6614]{-1}[-1/-1] 2013-07-23 06:11:55.303856 e preprocessor PreprocessorImpl.cpp(01233) : Text analysis error 'util/factory/src/FactoryImpl.cpp: Line 152: Error 0x0100003E: Unable to register text analysis classes in library /usr/sap/dsod_package//DSoD/bin/libdat-filters-isysconverter41.so.', number of errors: 1
[6614]{-1}[-1/-1] 2013-07-23 06:11:55.303918 e preprocessor PreprocessorImpl.cpp(00407) : process(): Error in Document, theProcessResult was set to 6618
[6614]{-1}[-1/-1] 2013-07-23 06:11:55.303948 e preprocessor Preprocessor.cpp(00651) : HANDLE: DISPATCH - Processing Document with key 'SAMPLE:SAMPLE_DATAen.Concern1:2' failed, returning PREPROCESSOR_ACTIVITY_ERROR (Code 6618)
[6615]{-1}[-1/-1] 2013-07-23 06:11:56.324973 e preprocessor PreprocessorImpl.cpp(01233) : Text analysis error 'docmodel/analyzer/src/AggregateAnalyzer.cpp: Line 104: Error 0x01000005: Request to create analyzer of unknown type:
SAP.TextAnalysis.DocumentAnalysis.FormatConversion.FormatConversionAnalyzer', number of errors: 1
[6615]{-1}[-1/-1] 2013-07-23 06:11:56.325002 e preprocessor PreprocessorImpl.cpp(00407) : process(): Error in Document, theProcessResult was set to 6604
[6615]{-1}[-1/-1] 2013-07-23 06:11:56.325020 e preprocessor Preprocessor.cpp(00651) : HANDLE: DISPATCH - Processing Document with key 'SAMPLE:SAMPLE_DATAen.Concern1:1' failed, returning PREPROCESSOR_ACTIVITY_ERROR (Code 6604)
[6613]{-1}[-1/-1] 2013-07-23 06:11:57.332431 e preprocessor PreprocessorImpl.cpp(01233) : Text analysis error 'docmodel/analyzer/src/AggregateAnalyzer.cpp: Line 104: Error 0x01000005: Request to create analyzer of unknown type:
SAP.TextAnalysis.DocumentAnalysis.FormatConversion.FormatConversionAnalyzer', number of errors: 1
[6613]{-1}[-1/-1] 2013-07-23 06:11:57.334338 e preprocessor PreprocessorImpl.cpp(00407) : process(): Error in Document, theProcessResult was set to 6604
[6613]{-1}[-1/-1] 2013-07-23 06:11:57.334357 e preprocessor Preprocessor.cpp(00651) : HANDLE: DISPATCH - Processing Document with key 'SAMPLE:SAMPLE_DATAen.Concern1:3' failed, returning PREPROCESSOR_ACTIVITY_ERROR (Code 6604)
[6615]{-1}[-1/-1] 2013-07-23 06:11:58.339067 e preprocessor PreprocessorImpl.cpp(01233) : Text analysis error 'docmodel/analyzer/src/AggregateAnalyzer.cpp: Line 104: Error 0x01000005: Request to create analyzer of unknown type:
SAP.TextAnalysis.DocumentAnalysis.FormatConversion.FormatConversionAnalyzer', number of errors: 1
Could you please let me know what is wrong?
Thanks,
Indu
Hi Indu,
I am the Product Owner for the Text Analysis SDK.
It appears that the wrong Text Analysis libraries are being loaded by the HANA Preprocessor. The error messages indicate that HANA is trying to load one of the TA libraries from this directory, which I think may be part of a Data Services installation:
/usr/sap/dsod_package//DSoD/bin/libdat-filters-isysconverter41.so
Is it possible that the preceding directory appears in your PATH ahead of the HANA directories?
-Bill
Bill, Thank you for your prompt reply.
Could you please give me more details? Do you think I need to import any other packages, and where do I check the mentioned directory path?
Regards,
Indu
Indu,
Sorry, but I don't think I can help you much more. You should probably consult someone with more in-depth knowledge of HANA and/or Data Services. All I can say is that the Preprocessor's library search path should not include the directory that I see in the Preprocessor trace, or at least the DS directory should not appear before the HANA installation directories. (BTW, I wrote PATH in my preceding reply, but I think LIBPATH is actually what you want to check.)
-Bill
Thanks for the information Bill.
It was indeed helpful. I will try to figure out the issue.
Regards,
Indu Nair
Hi Lucas,
while importing data from files stored locally using Python ODBC, I get the following error:
SQL = cursor.execute("""INSERT INTO "POLYGLOT"."POLYGLOT_SEARCH" ("ROWID", "ID", "Text", "Lang", "Description", "StrCat", "Platform") VALUES(?,CAST(? as NVARCHAR(1000)),?,?,?,?,?)""",(self.rowID, self.id, self.target, self.language, self.description, self.sc, self.platform)).rowcount
pyodbc.Error: ('S1000', '[S1000] [SAP AG][LIBODBCHDB DLL][HDB] General error;-10427 Conversion of parameter/column (2) from data type UCS2 (LE) to NSTRING failed (-10427) (SQLPutData)')
I have long elements in my data. The Python ODBC documentation says the following: SQLGetTypeInfo is used to determine when the LONG types are used; if not supported by the driver, VARCHAR and WVARCHAR will be 255...
I tried to use another data type in my HANA DB and it worked, but now I have a problem while querying the data. Trying to remove duplicates using "DISTINCT", I get the following error message:
Could not execute 'select distinct A."ID", A."ROWID", B."ROWID", B."Lang", B."Text", A."Text", A."Lang", score() as ...' in 6 ms 81 µs . SAP DBTech JDBC: [264] (at 18): invalid datatype: LOB type in distinct select clause: line 1 col 19 (at pos 18)
Do you have any idea how I can store my textual data so that I am able to process it?
Kind regards,
Inna
Hi Inna,
You would not be able to do a DISTINCT on a large object. You want to store the data in SHORTTEXT format, which creates a column of type NVARCHAR. Also, the table should be a column table. Why do you want to do a DISTINCT? What other processing do you want to do? For sampling and searching within text data there is also the FULLTEXT index.
Thanks,
Rupam Roy
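A sketch of what Rupam describes, with made-up table and column names (SHORTTEXT creates an NVARCHAR-backed column, so DISTINCT works on it, and it can carry full-text parameters inline):

CREATE COLUMN TABLE "POLYGLOT"."POLYGLOT_SEARCH2" (
"ROWID" INTEGER PRIMARY KEY,
"Text" SHORTTEXT(1000) FUZZY SEARCH INDEX ON
);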
Hi Lucas,
this is a great blog. I was able to get my PDFs into HANA. There was just one thing: The filenames cannot exceed a certain length. So I had to shorten them, which was a bit tedious.
Is there a workaround for longer filenames?
Thanks in advance,
Christoph
Hi Christoph, I'm not aware of any restrictions regarding the length of the file names. I'm using NVARCHAR(20) for the "File_Name" field in the table. Did you try expanding this to 500 or something?
Cheers
Hi Lucas,
Thanks for the post...it really helps get a quick start....
I had some questions for which your help will be appreciated:
1. What if I want two full-text configurations on the same column? How can this be achieved?
2. How can I extend the existing standard dictionaries? For instance, if a company "UNKNOWN LIMITED" is to be recognized under the entity ORGANIZATION/COMMERCIAL, how can the standard be enhanced? I know we can create custom dictionaries, but will the entity variant not overwrite the existing one? Also, how can I add this to the standard EXTRACTION_CORE, for instance?
Thanks in advance,
Cheers,
Jemin Tanna
Hi Jemin, regarding having two different configurations on the same column, this is not possible. What you can do to achieve this is create a generated column that has the same information as the original column and create a second full-text index on it. Regarding your second question, with SPS07 (rev 70) you can do this:
http://help.sap.com/hana/SAP_HANA_Text_Analysis_Extraction_Customization_Guide_en.pdf
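For the first point, a rough sketch of the generated-column workaround (the names are made up, and I have not verified that generated columns are supported for every data type, so treat this as a starting point):

ALTER TABLE "PRESS_RELEASES"
ADD ("File_Content_2" BLOB GENERATED ALWAYS AS "File_Content");

CREATE FULLTEXT INDEX "PDF_FTI_VOC" ON "PRESS_RELEASES"("File_Content_2")
TEXT ANALYSIS ON
CONFIGURATION 'EXTRACTION_CORE_VOICEOFCUSTOMER';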
Great blog Lucas... I enjoyed it. Can we use it to analyze social media, like Twitter and Facebook data? If yes, then how?
Regards,
Prakash
Hi Prakash, you can certainly use Text Analysis to analyze social media. There are different ways how you can get social data into HANA, for example using Data Services:
http://scn.sap.com/docs/DOC-8820
Or you can build your own code that pulls data from the Twitter API, for example using Python and a library called TwitterSearch; I can provide the code if you want. Once you have the data in HANA, you can trigger Text Analysis using a configuration called EXTRACTION_CORE_VOICEOFCUSTOMER. This configuration will not only extract the main entities but will also provide information about sentiments and requests, so you can use it to perform sentiment analysis.
Hope this helps.
Cheers, Lucas.
In case somebody is interested, here is the Python script to access the Twitter API. Note that before you can run this script you need to install the TwitterSearch library (ckoepp/TwitterSearch · GitHub).
import dbapi
from TwitterSearch import *

con = dbapi.connect('<hana_host>', 3<instance_number>15, '<user>', '<pwd>')
cur = con.cursor()
try:
    tso = TwitterSearchOrder()
    tso.setKeywords(['sap hana'])
    tso.setLanguage('en')
    tso.setIncludeEntities(False)
    ts = TwitterSearch(
        consumer_key = '<consumer_key>',
        consumer_secret = '<consumer_secret>',
        access_token = '<access_token>',
        access_token_secret = '<token_secret>'
    )
    for tweet in ts.searchTweetsIterable(tso):
        cur.execute("INSERT INTO TWITTER_TBL VALUES(?,?,?)", (tweet['id'], tweet['user']['screen_name'], tweet['text'])) #Save the content to the table
except TwitterSearchException as e:
    print(e)
cur.close()
con.close()
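For reference, a target table matching the script's INSERT could look like this (a sketch; the column names and lengths are assumptions), together with the VOICEOFCUSTOMER index mentioned above:

CREATE COLUMN TABLE "TWITTER_TBL" (
"ID" BIGINT PRIMARY KEY,
"SCREEN_NAME" NVARCHAR(100),
"TWEET" NVARCHAR(500)
);

CREATE FULLTEXT INDEX "TWEETS_FTI" ON "TWITTER_TBL"("TWEET")
TEXT ANALYSIS ON
CONFIGURATION 'EXTRACTION_CORE_VOICEOFCUSTOMER';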
Hi Lucas,
Thanks for sharing, nice article, simple and crisp!.
Best Regards,
Ranjit
Hi Lucas,
I can only repeat: great blog. It was very helpful.
One question: I read about the limit of 1,000,000 matched entities per file in an earlier comment. I am still experiencing that any higher number leads to the following error:
"Too many annotations added to annotation manager"
Is it still not possible to increase this limit?
Also, I read about the automatic conversion from PDF to text, but the $TA_TOKEN values of my analysed PDFs are still cryptic. Is there a special configuration for my full-text index that I have to take care of?
Thanks and Regards,
Lukas
nice!
Nice... Question: other than writing code or using SAP DS, is there any way to load unstructured data (say MS Word) into SAP HANA? Is there a GUI import utility, like the one that loads .csv files, for loading MS Word, PDF, etc. into HANA?
Hello Sam,
unfortunately, as of now (SPS 07 Rev. 70) there is no such tool. 🙁
Best wishes,
Christoph
Hi Christoph,
Will this script also work for MS Word documents?
Thanks,
Amir
Thanks ...Christoph.
One more question:
Based on my reading, it seems HANA SPS7 can directly search binary types, e.g. DOC, PPT, etc. However, what I could not find is where/how the file type is stored.
My understanding is that if I use SAP DS, the content of the source will be stored as TEXT in HANA irrespective of the source document type (DOC, PPT, etc.).
What am I missing?
Hello everyone... I tried to take TA further for a customer use case and used predictive modelling on top of it. Sharing it here, as I guess the audience would like it:
HANA Text Analysis Married to Structured Statistical Models, It's a Brochure!
Hi there,
I have a few queries:
1. Does SAS text analytics and sentiment analysis support the Arabic language?
2. Does it support text extraction into entities like person, country, organization, city, etc.?
3. Does it support a .NET SDK?
4. How can I get a trial version?
Looking forward for your reply.
Have a nice day.
I'm pretty sure this is not the right place for finding answers on SAS text analytics.
Maybe the SAS website would be a good place to look for that.
- Lars
I'm sorry... it's SAP HANA and not SAS!
- Prince
In that case, the documentation for the text analysis feature will answer your questions.
SAP HANA Advanced Data Processing – SAP Help Portal Page
- Lars
Thank you.
Hello Experts,
Once the PDF data is inserted into the HANA DB, is there any technique to extract whole paragraphs of the PDF content, not only some of the text?
In other words, can all of the PDF content be extracted directly in SAP HANA?
Thank,
Padmindra
Hi Lucas,
Thanks for the information provided, it is very helpful.
I have below query regarding HANA host.
1) Could the HANA DB host in the connect statement also be HCP, or should it be an on-premise HANA DB?
(con = dbapi.connect('hanahost'..........................)
Thanks,
Divya.
Hi,
I am able to load a local PDF file through pyodbc, but how do I load one from a web page? I am creating a simple web page with a button for uploading a file, which then gets stored in the HANA database. Following is my code, where I am giving the path of a file from my system; instead of that hard-coded file path, my requirement is to get the file path dynamically:
import pyodbc
conn = pyodbc.connect('DRIVER={HDBODBC};SERVERNODE=XXXX;SERVERDB=XXX;UID=XXXX;PWD=XXXXX')
cur = conn.cursor()
file = open('c:/pdf-sample.pdf','rb')
content = file.read()
cur.execute("INSERT INTO TEST.MY_TABLE_BLOB_TEST VALUES(?,?)", ('1',content))
cur.execute("COMMIT")
file.close()
cur.close()
conn.close()
How should I proceed with my requirement?