Skip to Content

With the release of SPS05 SAP implemented a considerable amount of changes and made available a handful of new features, some of them long awaited by modelers/developers, like the ability to debug procedures (finally!!). But from my perspective, one of the coolest new features in SPS05 is Text Analysis. The main goal of this new feature is to extract meaningful information from texts. In other words, companies can now process big volumes of data sources and extract meaningful information without having to read every single sentence. Now, what is meaningful information? The extraction process will identify the “who”, “what”, “where”, “when” and “how much” (among other things) from unstructured data and this way will enable you to enrich your structure data.

If we want to understand how it works from a very high level perspective, we just need to refer to the SAP HANA Developer Guide:

“The text analysis provides a vast number of possible entity types and analysis rules for many industries in 20 languages. However, you do not have to deal with this complexity when analyzing your individual set of documents. The language modules included with the software contain system dictionaries and provide an extensive set of predefined entity types. The extraction process can extract entities using these lists of specific entities. It can also discover new entities using linguistic models. Extraction classifies each extracted entity by entity type and presents this metadata in a standardized format.”

The most impressive thing about Text Analysis is how easy it is to implement it. I’m going to walk you thru the steps I followed to make it work:

First of all, I created a very simple column table that looks like this:

CREATE COLUMN TABLE “PRESS_RELEASES” (

       “File_Name” NVARCHAR(20),

       “File_Content” BLOB ST_MEMORY_LOB,

       PRIMARY KEY (“File_Name”));

In this table I’m going to store a number of PDF files that contain information from the aviation industry in the US. In order to do so, I created a very simple python script that stores the PDF files in the BLOB field (“File_Content”) of my table:

con = dbapi.connect(‘hanahost', 30015, 'SYSTEM', '********') #Open connection to SAP HANA
cur = con.cursor() #Open a cursor
file = open('doc.pdf', 'rb') #Open file in read-only and binary
content = file.read() #Save the content of the file in a variable
cur.execute("INSERT INTO PRESS_RELEASES VALUES(?,?)", ('doc.pdf',content)) #Save the content to the table
file.close() #Close the file
cur.close() #Close the cursor
con.close() #Close the connection

Now that I have my table loaded with unstructured data I’m ready to begin my text analysis process. The only thing I need to do is run the following statement:

Create FullText Index “PDF_FTI” On “PRESS_RELEASES”(“File_Content”)

TEXT ANALYSIS ON

CONFIGURATION ‘EXTRACTION_CORE’;

So what I’m doing here is creating a full text index called “PDF_FTI” (you can use any name) on the BLOB column “File_Content” of my table “PRESS_RELEASES”. I’m also saying that I would like to turn Text Analysis on and that I would like to use the configuration called ‘EXTRACTION_CORE’ (you can refer to the SAP HANA Developer Guide for the different configurations). With the execution of this script a new column table is created called $TA_PDF_FTI ($TA_<Index_Name>) that contains the result of my Text Analysis Process. The structure of this table looks like this:

Column Name

Key

Description

Data Type

File_Name

Yes

This is the primary key of my table. If you have more than one column in your primary key, the $TA table will include every single column

Same as in source table. In this case: NVARCHAR(20)

RULE

Yes

Stores the rule package that yielded the token. In my case: “Entity Extraction”

NVARCHAR(200)

COUNTER

Yes

Counts all tokens across the document

BIGINT

TOKEN

No

The token that was extracted (the “who”, “what”, “where”, “when” and “how much”)

NVARCHAR(250)

LANGUAGE

No

You can either specify a language column when you create the fulltext index or it can be derived from the text. In my case it was derived from the text and is English (en)

NVARCHAR(2)

TYPE

No

The Token Type, whether it is a “who”, a “what”, a “where”, etc.

NVARCHAR(100)

NORMALIZED

No

Stores a normalized representation of the token. This becomes relevant e.g. for German with umlauts, or ß/ss. Normalization with regards to capitalization would not be as important as to justify this column.

NVARCHAR(250)

STEM

No

Stores the linguistic stemming information, e.g. the singular nominative for nouns, or the indicative for verbs. If text analysis yields several stems, only the first stem will be stored, assuming this to be the best match.

NVARCHAR(300)

PARAGRAPH

No

The paragraph number where my token is located in the document

INTEGER

SENTENCE

No

The sentence number where my token is located in the document

INTEGER

CREATED_AT

No

Creation timestamp

TIMESTAMP

This is a Data Preview of my $TA table:

/wp-content/uploads/2013/01/1_171299.png

Now, let’s pretend I work in an airline, for example Delta Air Lines, and I would like to know whether the company I work for is mentioned in any of the press releases. I can easily find the answer by filtering the column TYPE = ‘ORGANIZATION/COMMERCIAL’ and TOKEN = ‘Delta Air Lines’. As you can see in the screenshot below Delta Air Lines is mentioned twice in the bts201001.pdf file (paragraphs 9 and 151), twice in bts201002.pdf (paragraphs 9 and 136) and so on.

/wp-content/uploads/2013/01/2_171300.png

I’m very skeptic by nature, so I would like to see if this is accurate. So I built a very simple XS JavaScript script to show the content of one of the PDF files (bts201002.pdf) that looks like this:

var conn = $.db.getConnection();
try {
var query = "Select \"File_Content\" From \"PRESS_RELEASES\" Where \"File_Name\" = 'bts201002.pdf'";
          var pstmt = conn.prepareStatement(query);
          var rs = pstmt.executeQuery();
          rs.next();
$.response.headers.set("Content-Disposition", "Content-Disposition: attachment; filename=bts201002.pdf");
          $.response.contentType = 'application/pdf';
          $.response.setBody(rs.getBlob(1));
          $.response.status = $.net.http.OK;
} catch (e) {
          $.response.setBody("Error while downloading : "+e);
          $.response.status = 500;
}
conn.close();

Like my Text Analysis Process said, Delta Air Lines is mentioned twice in this particular file and it’s right where my $TA table said it would be:

/wp-content/uploads/2013/01/3_171301.png

Hope you find this post useful.

Follow me on Twitter: @LukiSpa

To report this post you need to login first.

65 Comments

You must be Logged on to comment or reply to a post.

    1. Lucas Sparvieri Post author

      Hi Holger, I’m glad you liked it :-). A couple of questions:

      1. Are you using the SYSTEM user? If not, you could try using it to see if it’s a security problem
      2. In which language is the text you are analyzing? If it’s other than English, you should specify a Language Column, otherwise the process will default it to English.

        CREATE FULLTEXT INDEX <INDEX_NAME> ON <TAB_NAME>(<COL_NAME>)
        TEXT ANALYSIS ON
        CONFIGURATION ‘EXTRACTION_CORE’
        LANGUAGE COLUMN <LANG>

      Let me know how it goes!

      Cheers, Lucas.

      (0) 
  1. Vivek RR

    this is amazing ..thanks for sharing it . .

    but as Holger mentioned, the $TA table is blank for me too..

    I have system user and my document is in English ..

    Could you please let  me know what else I need to take care of ?

    Regards

    Vivek

    (0) 
    1. Lucas Sparvieri Post author

      Hi Vivek, you can take a look at the preprocessor trace file to see what is causing the $TA table to be blank. You should see an error message in the preprocessor trace file.

      (0) 
    2. Vivek RR

      Hi Lucas

      thanks for your input.. I looked @ the preprocessor trace file and had this info:..Can you help ? IS there a problem with configuration?

      LOB END

      [123820]{0}[0] 2013-01-29 20:56:42.862016 e preprocessor PreprocessorPool.cpp(00391) : PreprocessorPool::CreateConnection : Error while creating connection to  server – TrexNet::Exception a protocol mismatch occured, returning to application with errorcode PREPROCESSOR_FAILED

      [200111]{0}[0] 2013-01-29 23:03:29.297154 e TrexNet      Request.cpp(01346) : ERROR INFO: cancelled Request on channel 225. Errors for this channel can be ignored.

      [94490]{0}[0] 2013-01-29 23:03:30.050334 e TrexNet      Channel.cpp(00326) : ERROR: reading from channel 225 to 0 <10.68.104.93:30303> failed with rc=7 (internal error)

      [94490]{0}[0] 2013-01-29 23:03:30.050408 e TrexNetBuffe BufferedIO.cpp(01092) : channel 225 from <10.68.104.93:30303>: read from channel failed; resetting buffer

      [94490]{0}[0] 2013-01-29 23:03:30.050613 e preprocessor PreprocessorPool.cpp(00391) : PreprocessorPool::CreateConnection : Error while creating connection to  server – TrexNet::Exception an error occured while reading from the channel, returning to application with errorcode PREPROCESSOR_FAILED

      [60634]{0}[0] 2013-01-31 19:39:24.567948 e preprocessor PreprocessorImpl.cpp(01484) : Text analysis error ‘docmodel/document/src/AnnotationManager.cpp: Line 105: Error 0x0100001A: Too many annotations added to annotation manager.’, number of errors: 1

      [60634]{0}[0] 2013-01-31 19:39:24.568150 e preprocessor PreprocessorImpl.cpp(00377) : process(): Error in Document, theProcessResult was set to 6601

      [60634]{0}[0] 2013-01-31 19:39:24.568213 e preprocessor Preprocessor.cpp(00664) : HANDLE: DISPATCH – Processing Document with key ‘ZVIV:PRESS_RELEASESen.File_Content:7’ failed, returning PREPROCESSOR_ACTIVITY_ERROR (Code 6601)

      [60626]{0}[0] 2013-01-31 19:39:24.893542 e preprocessor PreprocessorImpl.cpp(01484) : Text analysis error ‘docmodel/document/src/AnnotationManager.cpp: Line 105: Error 0x0100001A: Too many annotations added to annotation manager.’, number of errors: 1

      [60626]{0}[0] 2013-01-31 19:39:24.893622 e preprocessor PreprocessorImpl.cpp(00377) : process(): Error in Document, theProcessResult was set to 6601

      [60626]{0}[0] 2013-01-31 19:39:24.893681 e preprocessor Preprocessor.cpp(00664) : HANDLE: DISPATCH – Processing Document with key ‘ZVIV:PRESS_RELEASESen.File_Content:5’ failed, returning PREPROCESSOR_ACTIVITY_ERROR (Code 6601)

      [60634]{0}[0] 2013-01-31 20:02:03.464320 e preprocessor PreprocessorImpl.cpp(01484) : Text analysis error ‘docmodel/document/src/AnnotationManager.cpp: Line 105: Error 0x0100001A: Too many annotations added to annotation manager.’, number of errors: 1

      [60634]{0}[0] 2013-01-31 20:02:03.464408 e preprocessor PreprocessorImpl.cpp(00377) : process(): Error in Document, theProcessResult was set to 6601

      [60634]{0}[0] 2013-01-31 20:02:03.464463 e preprocessor Preprocessor.cpp(00664) : HANDLE: DISPATCH – Processing Document with key ‘ZVIV:PRESS_RELEASESen.File_Content:9’ failed, returning PREPROCESSOR_ACTIVITY_ERROR (Code 6601)

      [60625]{0}[0] 2013-01-31 20:02:03.814165 e preprocessor PreprocessorImpl.cpp(01484) : Text analysis error ‘docmodel/document/src/AnnotationManager.cpp: Line 105: Error 0x0100001A: Too many annotations added to annotation manager.’, number of errors: 1

      [60625]{0}[0] 2013-01-31 20:02:03.814228 e preprocessor PreprocessorImpl.cpp(00377) : process(): Error in Document, theProcessResult was set to 6601

      [60625]{0}[0] 2013-01-31 20:02:03.814276 e preprocessor Preprocessor.cpp(00664) : HANDLE: DISPATCH – Processing Document with key ‘ZVIV:PRESS_RELEASESen.File_Content:8’ failed, returning PREPROCESSOR_ACTIVITY_ERROR (Code 6601)

      [60625]{0}[0] 2013-01-31 20:06:24.079945 e preprocessor PreprocessorImpl.cpp(01484) : Text analysis error ‘docmodel/document/src/AnnotationManager.cpp: Line 105: Error 0x0100001A: Too many annotations added to annotation manager.’, number of errors: 1

      [60625]{0}[0] 2013-01-31 20:06:24.080006 e preprocessor PreprocessorImpl.cpp(00377) : process(): Error in Document, theProcessResult was set to 6601

      [60625]{0}[0] 2013-01-31 20:06:24.080059 e preprocessor Preprocessor.cpp(00664) : HANDLE: DISPATCH – Processing Document with key ‘ZVIV:PRESS_RELEASESen.File_Content:11’ failed, returning PREPROCESSOR_ACTIVITY_ERROR (Code 6601)

      [60620]{0}[0] 2013-01-31 20:06:26.192253 e preprocessor PreprocessorImpl.cpp(01484) : Text analysis error ‘docmodel/document/src/AnnotationManager.cpp: Line 105: Error 0x0100001A: Too many annotations added to annotation manager.’, number of errors: 1

      [60620]{0}[0] 2013-01-31 20:06:26.192322 e preprocessor PreprocessorImpl.cpp(00377) : process(): Error in Document, theProcessResult was set to 6601

      [60620]{0}[0] 2013-01-31 20:06:26.192373 e preprocessor Preprocessor.cpp(00664) : HANDLE: DISPATCH – Processing Document with key ‘ZVIV:PRESS_RELEASESen.File_Content:10’ failed, returning PREPROCESSOR_ACTIVITY_ERROR (Code 6601)

      [60622]{0}[0] 2013-01-31 20:11:38.499618 e preprocessor PreprocessorImpl.cpp(01484) : Text analysis error ‘docmodel/document/src/AnnotationManager.cpp: Line 105: Error 0x0100001A: Too many annotations added to annotation manager.’, number of errors: 1

      [60622]{0}[0] 2013-01-31 20:11:38.499683 e preprocessor PreprocessorImpl.cpp(00377) : process(): Error in Document, theProcessResult was set to 6601

      [60622]{0}[0] 2013-01-31 20:11:38.499724 e preprocessor Preprocessor.cpp(00664) : HANDLE: DISPATCH – Processing Document with key ‘SYSTEM:PRESS_RELEASESen.File_Content:2’ failed, returning PREPROCESSOR_ACTIVITY_ERROR (Code 6601)

      [41419]{0}[0] 2013-01-31 22:49:22.390772 e preprocessor PreprocessorImpl.cpp(01484) : Text analysis error ‘docmodel/document/src/AnnotationManager.cpp: Line 105: Error 0x0100001A: Too many annotations added to annotation manager.’, number of errors: 1

      [41419]{0}[0] 2013-01-31 22:49:22.390851 e preprocessor PreprocessorImpl.cpp(00377) : process(): Error in Document, theProcessResult was set to 6601

      [41419]{0}[0] 2013-01-31 22:49:22.390938 e preprocessor Preprocessor.cpp(00664) : HANDLE: DISPATCH – Processing Document with key ‘SYSTEM:PRESS_RELEASESen.File_Content:2’ failed, returning PREPROCESSOR_ACTIVITY_ERROR (Code 6601)

      [29254]{0}[0] 2013-02-01 09:03:36.853997 e TrexNet      BufferedIO.cpp(03885) : ERROR on channel 481 <10.68.104.93:30303>: type mismatch read=EOF2 expected=String

      [29254]{0}[0] 2013-02-01 09:03:36.854053 e TrexNet      BufferedIO.cpp(03886) : Stacktrace for Context Name:Generic

      1: 0x00007f75239d0545 in Execution::Context::dumpStackTrace(char const*, ltt::basic_ostream<char, ltt::char_traits<char> >&) const+0x31 at Context.cpp:1273 (libhdbbasis.so)

      2: 0x00007f752a00fd46 in TrexTrace::Tracer::stack(char const*, unsigned int, TrexTrace::TraceLevel const&)+0xa2 at TrexTrace.cpp:1369 (libhdbbasement.so)

      3: 0x00007f7529fc9975 in TrexNet::BufferedNetInput::operator>>(ltt::basic_string<char, ltt::char_traits<char> >&)+0xe1 at BufferedIO.cpp:3886 (libhdbbasement.so)

      4: 0x0000000000470be2 in preprocessor::PreprocessorPool::CreateConnection(int&, ltt_adp::basic_string<char, ltt::char_traits<char>, ltt::integral_constant<bool, true> >&)+0x320 at PreprocessorPool.cpp:381 (hdbpreprocessor)

      5: 0x0000000000471375 in preprocessor::PreprocessorPool::run(void*)+0x3d1 at PreprocessorPool.cpp:298 (hdbpreprocessor)

      6: 0x00007f752a0046fa in TrexThreads::PoolThread::run()+0xac6 at PoolThread.cpp:258 (libhdbbasement.so)

      7: 0x00007f752a005ea8 in TrexThreads::PoolThread::run(void*&)+0x14 at PoolThread.cpp:105 (libhdbbasement.so)

      8: 0x00007f7523a08bf2 in Execution::Thread::staticMainImp(void**)+0x6a0 at Thread.cpp:455 (libhdbbasis.so)

      9: 0x00007f7523a08ddd in Execution::Thread::staticMain(void*)+0x39 at Thread.cpp:526 (libhdbbasis.so)

      [29254]{0}[0] 2013-02-01 09:03:36.873597 e TrexNetIO    BufferedIO.cpp(01332) : dump of TrexNet buffer: size 65, current position 0x00007f73d71712cd (65 bytes), mode incoming, channel 481, peer 10.68.104.93:30303, content:

      BLOB START (Addr: 0x00007f73d717128c Len: 65)

      BLOB END

      [29254]{0}[0] 2013-02-01 09:03:36.873930 e preprocessor PreprocessorPool.cpp(00391) : PreprocessorPool::CreateConnection : Error while creating connection to  server – TrexNet::Exception a protocol mismatch occured, returning to application with errorcode PREPROCESSOR_FAILED

      [58151]{0}[0] 2013-02-01 17:51:15.667975 e preprocessor PreprocessorImpl.cpp(01484) : Text analysis error ‘docmodel/document/src/AnnotationManager.cpp: Line 105: Error 0x0100001A: Too many annotations added to annotation manager.’, number of errors: 1

      [58151]{0}[0] 2013-02-01 17:51:15.668100 e preprocessor PreprocessorImpl.cpp(00377) : process(): Error in Document, theProcessResult was set to 6601

      [58151]{0}[0] 2013-02-01 17:51:15.668154 e preprocessor Preprocessor.cpp(00664) : HANDLE: DISPATCH – Processing Document with key ‘SYSTEM:PRESS_RELEASESen.File_Content:8’ failed, returning PREPROCESSOR_ACTIVITY_ERROR (Code 6601)

      [58151]{0}[0] 2013-02-01 17:57:39.608469 e preprocessor PreprocessorImpl.cpp(01484) : Text analysis error ‘docmodel/document/src/AnnotationManager.cpp: Line 105: Error 0x0100001A: Too many annotations added to annotation manager.’, number of errors: 1

      [58151]{0}[0] 2013-02-01 17:57:39.608542 e preprocessor PreprocessorImpl.cpp(00377) : process(): Error in Document, theProcessResult was set to 6601

      [58151]{0}[0] 2013-02-01 17:57:39.608600 e preprocessor Preprocessor.cpp(00664) : HANDLE: DISPATCH – Processing Document with key ‘SYSTEM:PRESS_RELEASESen.File_Content:3’ failed, returning PREPROCESSOR_ACTIVITY_ERROR (Code 6601)

      [196148]{0}[0] 2013-02-01 17:57:45.606893 e preprocessor PreprocessorImpl.cpp(01484) : Text analysis error ‘docmodel/document/src/AnnotationManager.cpp: Line 105: Error 0x0100001A: Too many annotations added to annotation manager.’, number of errors: 1

      [196148]{0}[0] 2013-02-01 17:57:45.606961 e preprocessor PreprocessorImpl.cpp(00377) : process(): Error in Document, theProcessResult was set to 6601

      [196148]{0}[0] 2013-02-01 17:57:45.607012 e preprocessor Preprocessor.cpp(00664) : HANDLE: DISPATCH – Processing Document with key ‘SYSTEM:PRESS_RELEASESen.File_Content:9’ failed, returning PREPROCESSOR_ACTIVITY_ERROR (Code 6601)

      [198399]{0}[0] 2013-02-08 17:08:30.406229 e preprocessor PreprocessorImpl.cpp(01484) : Text analysis error ‘docmodel/document/src/AnnotationManager.cpp: Line 105: Error 0x0100001A: Too many annotations added to annotation manager.’, number of errors: 1

      [198399]{0}[0] 2013-02-08 17:08:30.406321 e preprocessor PreprocessorImpl.cpp(00377) : process(): Error in Document, theProcessResult was set to 6601

      [198399]{0}[0] 2013-02-08 17:08:30.406374 e preprocessor Preprocessor.cpp(00664) : HANDLE: DISPATCH – Processing Document with key ‘SYSTEM:PRESS_RELEASESen.File_Content:14’ failed, returning PREPROCESSOR_ACTIVITY_ERROR (Code 6601)

      [198394]{0}[0] 2013-02-08 17:08:38.717108 e preprocessor PreprocessorImpl.cpp(01484) : Text analysis error ‘docmodel/document/src/AnnotationManager.cpp: Line 105: Error 0x0100001A: Too many annotations added to annotation manager.’, number of errors: 1

      [198394]{0}[0] 2013-02-08 17:08:38.717179 e preprocessor PreprocessorImpl.cpp(00377) : process(): Error in Document, theProcessResult was set to 6601

      [198394]{0}[0] 2013-02-08 17:08:38.717233 e preprocessor Preprocessor.cpp(00664) : HANDLE: DISPATCH – Processing Document with key ‘ZVIV:PRESS_RELEASESen.File_Content:2’ failed, returning PREPROCESSOR_ACTIVITY_ERROR (Code 6601)

      [58151]{0}[0] 2013-02-08 17:18:20.099186 e preprocessor PreprocessorImpl.cpp(01484) : Text analysis error ‘docmodel/document/src/AnnotationManager.cpp: Line 105: Error 0x0100001A: Too many annotations added to annotation manager.’, number of errors: 1

      [58151]{0}[0] 2013-02-08 17:18:20.099239 e preprocessor PreprocessorImpl.cpp(00377) : process(): Error in Document, theProcessResult was set to 6601

      [58151]{0}[0] 2013-02-08 17:18:20.099283 e preprocessor Preprocessor.cpp(00664) : HANDLE: DISPATCH – Processing Document with key ‘ZVIV:PRESS_RELEASESen.File_Content:2’ failed, returning PREPROCESSOR_ACTIVITY_ERROR (Code 6601)

      (0) 
      1. Vivek RR

        Hi Lucas

        I figured it out ..

        It works for small pdf files.. But I cannot do the same with bigger pdf files.. Is there a config which I need to take care of for increasing the threshold value for annotations ?

        Regards

        Vivek

        (0) 
          1. Bill Miller

            Lucas and Vivek,

            I believe that HANA SPS05 limits each document to 1 million annotations. This limit is not configurable in SPS05.

            Furthermore, if you are using the EXTRACTION_CORE configuration, this limit will apply to the total number of words plus semantic entities found in the document.

            (0) 
            1. Vivek RR

              Hi Bill

              Thanks for your reply.. This totally explains the perf trace error ..

              In  case if I have a large pdf file , then do I need split in to small documents and then load into HANA so that the  the annotations n semantics can be taken care ?

              (0) 
              1. Bill Miller

                Unfortunately yes. It’s a shame too since one of the great features is automatic conversion from PDF to text. I did create an enhancement request in the HANA tracking system to provide a way for customers to increase the limit.

                (0) 
    1. Lucas Sparvieri Post author

      Hi Abhijeet, I’m glad you liked the post. When you install the SAP HANA client on your computer, all the python libraries are installed as well, so you can run the Python script from your computer after you have installed the SAP HANA Client. This is for sure true for the 64 bit version, not sure if this is also valid for the 32 bit version.

      Cheers, Lucas.

      (0) 
  2. Abhijeet Jangam

    OMG…This totally works….Highly impressed… πŸ™‚

    Few pointers for SAP / HANA developers who have no idea about python…Python is similar to java…I followed this blog until connection test…

    http://scn.sap.com/community/developer-center/hana/blog/2012/06/08/sap-hana-and-python-yes-sir

    Once I have a good connection, then its pretty straight forward. Since I have using cloudshare, I had to mention my schema names before table tech names. Also, I had to setup PATH (environment variable) with python home directory within HANA client…

    can’t wait for further explore…

    I feel i have learnt python as well…

    Kudos Lucas… 😎

    (0) 
  3. Vladislav Volodin

    If somebody needs a code for ODBC + .NET, you can use this snippet:

    using System;

    using System.Collections.Generic;

    using System.Data.Odbc;

    using System.IO;

    using System.Linq;

    using System.Text;

    using System.Threading.Tasks;

    namespace Test_3_HANA {

        class Program {

            static void Main(string[] args) {           

                OdbcConnection cn = new OdbcConnection(“dsn=XXX;UID=YYYYY;PWD=ZZZZZ”);           

                String ins = “INSERT INTO \”SYSTEM\”.\”FileContent\” VALUES(?, ?)”;

                OdbcCommand cmd = new OdbcCommand(ins, cn);

                String filePath = @”C:\Temp\Workflow.pdf”;

                cn.Open();

                OdbcTransaction trans = cn.BeginTransaction();

               cmd.Transaction = trans;

                OdbcParameter param = new OdbcParameter(“@name”, OdbcType.Text);

               param.Value = filePath;

               param.Size = filePath.Length;

               cmd.Parameters.Add(param);

                Byte[] fileContent = ReadFileByteArray(filePath);

               param = new OdbcParameter(“@file”, OdbcType.Binary);

               param.Value = fileContent;

               param.Size = fileContent.Length;

               cmd.Parameters.Add(param);

               cmd.ExecuteNonQuery();

               trans.Commit();

               cn.Close();

            }

            static public Byte[] ReadFileByteArray(String fileName) {

                FileStream fs = new FileStream(fileName, FileMode.Open, FileAccess.Read);

                Byte[] fileData = new Byte[fs.Length];

               fs.Read(fileData, 0, System.Convert.ToInt32(fs.Length));

               fs.Close();

                return fileData;

            }

        }

    }

    It is a dirty solution, but it works πŸ™‚ Transactions are necessary for large files.

    (0) 
  4. Vishnu Kumar Jakhoria

    Hi Lucas,

    I am getting this error while trying to execute the python code.

    dbapi.Error: (596, ‘LOB streaming is not permitted in auto-commit mode: ‘)

    how to overcome this error? Please help.

    Thanks,

    Vishnu

    (0) 
    1. Lucas Sparvieri Post author

      Hi Vishnu, what you can do to make it work is set the Auto-Commit option to false right after you open the connection and then manually commit the changes after the insert statement. Let me know how it goes!

      con = dbapi.connect(‘localhost’, 30015, ‘SYSTEM’, ‘********’)

      con.setautocommit(False)

      cur.execute(“INSERT INTO PRESS_RELEASES VALUES(?,?)”, (‘doc.pdf’,content))

      con.commit()

      (0) 
  5. Achin Kimtee

    Hi Lucas,

    Great post.

    one question though

    you are using ‘EXTRACTION_CORE’ config here

    I tried using ‘EXTRACTION_CORE_VOICEOFCUSTOMER’ (which i believe is meant for extracting Sentiments) on a file which contained reviews for a product.

    It did create $TA table; but was empty.

    So is there any particular way to use ‘EXTRACTION_CORE_VOICEOFCUSTOMER’?

    Thank You.

    (0) 
    1. Lucas Sparvieri Post author

      Hi Achin, like Bill Miller pointed out in a previous comment, there’s a limit in the size of the file, so maybe you can try it using a small file and see if that works. You can also look at the preprocessor trace file and look for the error that is causing your table to remain empty.

      Thanks, Lucas.

      (0) 
  6. Inna Nickel

    Hi Lucas,

    I’m trying to create a fuzzy search index with following parameters

    CREATE FULLTEXT index text_analysis ON system.TMX(column_0) 

    fuzzy search index on

    fast preprocess off 

    text analysis on

    configuration ‘EXTRACTION CORE’;

    but I get the following erro message:

    Could not execute ‘create fulltext index text_analysis on system.TMX(column_0) fuzzy search index on fast preprocess …’ in 8 ms 422 µs .

    SAP DBTech JDBC: [2048]: column store error: add DafAttribute failed:  [366] Table has no keys, Table_TA cannot be created

    If I exclude the “text analysis on” option out of creation prompt, the filter is successfully created, otherwise not.

    I don’t understand why the $TA cannot be created.

    Could you help me please with this issue?

    Thank you in advance,

    Inna

    (0) 
    1. Lucas Sparvieri Post author

      Hi Inna, to be able to create the $TA table you need to create a primary key in the system.TMX table. Remember the first column(s) of the $TA table are the primary key of the source table, in this case, the TMX table. Let me know how it goes…

      (0) 
    2. Inna Nickel

      Hi Lucas,

      I got another issue while importing data from local file, for example from csv…. My CSV file contains about 12500 rows of textual data. I would like to import them into the DB…When I use HANA functionality  for import data from CSV only first 1000 have been imported….

      I also created the connection via python ODBC and tried via “insert into the table” to write the data into HANA DB, but the same issue, just first 1000  entries have been inserted.

      Could you help me with this issue?

      Best regards,

      Inna

      (0) 
      1. Lucas Sparvieri Post author

        Hi Inna, when you run a select statement in the SQL editor only the first 1000 rows will be displayed, but that doesn’t mean they are not stored in the table. Did you try counting the number of rows in the table:

        SELECT COUNT( * ) FROM <TABLE_NAME>;

        Cheers, Lucas.

        (0) 
  7. Srikanth Guna

    Hi Lucas,

    Excellent Article for Text Analysis on SAP HANA. I tried to implement the same but ended up with following Errors –

    [13427]{0}[0] 2013-04-03 03:33:24.556021 e preprocessor PreprocessorImpl.cpp(02072) : Error reading TA configuration file from the IS server: ‘Dev Object null after reading from Repository’

    [13427]{0}[0] 2013-04-03 03:33:25.556304 e preprocessor PreprocessorImpl.cpp(02121) : File read Error ‘/usr/sap/HDB/SYS/global/hdb/custom/config/lexicon//EXTRACTION_CORE’, error=’Storage object does not exist: $STORAGEOBJECT$’

    [13427]{0}[0] 2013-04-03 03:33:25.556440 e preprocessor PreprocessorImpl.cpp(01134) : invalid TA_Pointer

    Extracted above information from Preprocessor Trace File which you had suggested

    Is there a way that you could help me out on this particular Error.

    (0) 
  8. Christoph Kloppenburg

    Hi Lucas,

    I loaded a few PDF files into HANA. Orginial size of the files was 37,0 MB in sum.

    I created the index and looked at the size of the table.

    It was 1.7 MB in memory and 39.8 MB on disk.

    After delta merge I was suprised that the size increased:

    It was 38.0 MB in memory and 44.7 MB on disk.

    Moreover the speed of the SQL textsearch statment increased from 15..30 ms before delta merge to 90..120 ms after delta merge.

    Do you have an explanation? I thought delta merge should improve things in both areas. Or does this have to do with a kind of binary <-> text switch?

    Best regards and thanks for this nice blog!!!

    Christoph

    P.S. I am working with revision 52.

    (0) 
    1. Lucas Sparvieri Post author

      Hi, what I’m thinking is before running the delta merge the documents were not loaded to memory. So after running the delta merge all the documents were loaded to memory and that explains the size increasing in memory. Nevertheless, not sure why the search query lost performance after the delta merge. It would be interesting to see what was the memory consumption on the Delta Storage before running the delta merge.

      Cheers, Lucas.

      (0) 
  9. Indu Nair

    Hi Lucas,

    Great Post, Thanks for sharing the step by step info.

    However I am not able to see any data in the $TA table; I have not used any pdf files, just loaded a column with customer feedback with max of 257 characters.

    Investigated the pre-processor trace file which looks like this :

    [4558]{-1}[-1/-1] 2013-07-23 05:30:03.545715 e preprocessor     PreprocessorPool.cpp(00438) : PreprocessorPool::CreateConnection : Error while creating connection to  server – TrexNet::Exception a protocol mismatch occured, returning to application with errorcode PREPROCESSOR_FAILED

    [4447]{-1}[-1/-1] 2013-07-23 05:33:45.240292 i Basis            TraceStream.cpp(00383) : ==== Starting hdbpreprocessor, version 1.00.60.379371 (NewDB100_REL), build linuxx86_64 not set 2013-06-27 17:39:32 ld7270.wdf.sap.corp cc (SAP release 20130125, based on SUSE gcc43-4.3.4_20091019-0.22.17.5096.1.PTF.799300) 4.3.4 [gcc-4_3-branch revision 152973]

    [4447]{-1}[-1/-1] 2013-07-23 05:33:45.240370 i Basis            TraceStream.cpp(00385) : MaxOpenFiles: 1048576

    [4447]{-1}[-1/-1] 2013-07-23 05:33:45.240385 i Memory           MallocProxy.cpp(01222) : Installed malloc hooks

    [4447]{-1}[-1/-1] 2013-07-23 05:33:45.240388 i Basis            Timer.cpp(00567) : Using RDTSC for HR timer

    [4447]{-1}[-1/-1] 2013-07-23 05:33:45.240390 i Memory           AllocatorImpl.cpp(00835) : Allocators activated

    [4447]{-1}[-1/-1] 2013-07-23 05:33:45.240391 i Memory           AllocatorImpl.cpp(00851) : Using big block segment size 134217728

    [4447]{-1}[-1/-1] 2013-07-23 05:33:45.240393 i Basis            ProcessorInfo.cpp(00392) : Using GDT segment limit to determine current CPU ID

    [4447]{-1}[-1/-1] 2013-07-23 05:33:45.240395 w Environment      Environment.cpp(00284) : Changing environment set IMSLERRPATH=/usr/sap/HDB/HDB00/exe//

    [4447]{-1}[-1/-1] 2013-07-23 05:33:45.240397 w Environment      Environment.cpp(00284) : Changing environment set IMSLSERRPATH=/usr/sap/HDB/HDB00/exe//

    [4447]{-1}[-1/-1] 2013-07-23 05:33:45.240398 w Environment      Environment.cpp(00284) : Changing environment set NODNSSAPTRANSHOST=1

    [4447]{-1}[-1/-1] 2013-07-23 05:33:45.240400 w Environment      Environment.cpp(00263) : Changing environment set SAP_DIR_GLOBAL=/usr/sap/HDB/SYS/global

    [6614]{-1}[-1/-1] 2013-07-23 06:11:55.303856 e preprocessor     PreprocessorImpl.cpp(01233) : Text analysis error ‘util/factory/src/FactoryImpl.cpp: Line 152: Error 0x0100003E: Unable to register text analysis classes in library /usr/sap/dsod_package//DSoD/bin/libdat-filters-isysconverter41.so.’, number of errors: 1

    [6614]{-1}[-1/-1] 2013-07-23 06:11:55.303918 e preprocessor     PreprocessorImpl.cpp(00407) : process(): Error in Document, theProcessResult was set to 6618

    [6614]{-1}[-1/-1] 2013-07-23 06:11:55.303948 e preprocessor     Preprocessor.cpp(00651) : HANDLE: DISPATCH – Processing Document with key ‘SAMPLE:SAMPLE_DATAen.Concern1:2’ failed, returning PREPROCESSOR_ACTIVITY_ERROR (Code 6618)

    [6615]{-1}[-1/-1] 2013-07-23 06:11:56.324973 e preprocessor     PreprocessorImpl.cpp(01233) : Text analysis error ‘docmodel/analyzer/src/AggregateAnalyzer.cpp: Line 104: Error 0x01000005: Request to create analyzer of unknown type:

    SAP.TextAnalysis.DocumentAnalysis.FormatConversion.FormatConversionAnalyzer’, number of errors: 1

    [6615]{-1}[-1/-1] 2013-07-23 06:11:56.325002 e preprocessor     PreprocessorImpl.cpp(00407) : process(): Error in Document, theProcessResult was set to 6604

    [6615]{-1}[-1/-1] 2013-07-23 06:11:56.325020 e preprocessor     Preprocessor.cpp(00651) : HANDLE: DISPATCH – Processing Document with key ‘SAMPLE:SAMPLE_DATAen.Concern1:1’ failed, returning PREPROCESSOR_ACTIVITY_ERROR (Code 6604)

    [6613]{-1}[-1/-1] 2013-07-23 06:11:57.332431 e preprocessor     PreprocessorImpl.cpp(01233) : Text analysis error ‘docmodel/analyzer/src/AggregateAnalyzer.cpp: Line 104: Error 0x01000005: Request to create analyzer of unknown type:

    SAP.TextAnalysis.DocumentAnalysis.FormatConversion.FormatConversionAnalyzer’, number of errors: 1

    [6613]{-1}[-1/-1] 2013-07-23 06:11:57.334338 e preprocessor     PreprocessorImpl.cpp(00407) : process(): Error in Document, theProcessResult was set to 6604

    [6613]{-1}[-1/-1] 2013-07-23 06:11:57.334357 e preprocessor     Preprocessor.cpp(00651) : HANDLE: DISPATCH – Processing Document with key ‘SAMPLE:SAMPLE_DATAen.Concern1:3’ failed, returning PREPROCESSOR_ACTIVITY_ERROR (Code 6604)

    [6615]{-1}[-1/-1] 2013-07-23 06:11:58.339067 e preprocessor     PreprocessorImpl.cpp(01233) : Text analysis error ‘docmodel/analyzer/src/AggregateAnalyzer.cpp: Line 104: Error 0x01000005: Request to create analyzer of unknown type:

    SAP.TextAnalysis.DocumentAnalysis.FormatConversion.FormatConversionAnalyzer’, number of errors: 1

    Could you please let me know what is wrong?

    Thanks,

    Indu

    (0) 
    1. Bill Miller

      Hi Indu,

      I am the Product Owner for the Text Analysis SDK.

      It appears that the wrong Text Analysis libraries are being loaded by the HANA Preprocessor. The error messages indicate that HANA is trying to load one of the TA libraries from this directory, which I think may be part of a Data Services installation:

        /usr/sap/dsod_package//DSoD/bin/libdat-filters-isysconverter41.so

      Is it possible that the preceeding directory appears in your PATH ahead of the HANA directories?

      -Bill

      (0) 
      1. Indu Nair

        Bill, Thank you for your prompt reply.

        Could you please give me more details? Do you think I need to import any other packages and where do I check about the mentioned directory path.

        Regards,

        Indu

        (0) 
        1. Bill Miller

          Indu,

          Sorry, but I don’t think I can help you much more. You should probably consult someone with more in-depth knowledge of HANA and or Data Services. All I can say is that the Preprocessor’s library search path should not include the directory that I see in the Preprocessor trace, or at least the DS directory should not appear before the HANA installation directories. (BTW, I wrote PATH in my preceeding reply, but I think LIBPATH is actually what you want to check.)

          -Bill

          (0) 
  10. Inna Nickel

    Hi Lucas,

    while importing data from the files stored locally using python odb, I get the following error:

      SQL = cursor.execute(“””INSERT INTO “POLYGLOT”.”POLYGLOT_SEARCH” (“ROWID”, “ID”, “Text”, “Lang”, “Description”, “StrCat”, “Platform”) VALUES(?,CAST(? as NVARCHAR(1000)),?,?,?,?,?)”””,(self.rowID, self.id, self.target, self.language, self.description, self.sc, self.platform)).rowcount

    pyodbc.Error: (‘S1000’, ‘[S1000] [SAP AG][LIBODBCHDB DLL][HDB] General error;-10427 Conversion of parameter/column (2) from data type UCS2 (LE) to NSTRING failed (-10427) (SQLPutData)’)

    I have long elements in my data, according this Python odbc documentation says the follwing: SQLGetTypeInfo is used to determine when the LONG types are used. If not supported by the driver, VARCHAR and WVARCHAR will be 255

    I tried to use other data type in my HANA DB it worked, but now I have a problem while quering the data. Trying to remove dublicates using “DISTINCT” I get the following error message:

     

    Could not execute ‘select distinct A.”ID”, A.”ROWID”, B.”ROWID”, B.”Lang”, B.”Text”, A.”Text”, A.”Lang”, score() as …’ in 6 ms 81 µs . SAP DBTech JDBC: [264] (at 18): invalid datatype: LOB type in distinct select clause: line 1 col 19 (at pos 18)

    Do you have any idea how can I store my textual data to be able to process on it?

    Kind regards,

    Inna

    (0) 
    1. Rupam Roy

      Hi Inna,

      You would not be able to do a distinct on a large object. You want to store the data in SHORTTEXT format which creates a column of type NVARCHAR. Also the table should be a column table. Why do you want to do a distinct? What are the other processing you want to do? To better do sampling and search within text data there is something known as – FULLTEXT index

      Thanks,

      Rupam Roy

      (0) 
  11. Christoph Kloppenburg

    Hi Lucas,

    this is a great blog. I was able to get my PDFs into HANA. There was just one thing: The filenames cannot exceed a certain length. So I had to shorten them, which was a bit tedious.

    Is there a workaround for longer filenames?

    Thanks in advance,

    Christoph

    (0) 
    1. Lucas Sparvieri Post author

      Hi Christoph, I’m not aware of any restrictions regarding the length of the name of the files. I’m using an NVARCHAR 20 in the “File_Name” field on the table. Did you try expanding this to 500 or something?

      Cheers

      (0) 
  12. Jemin Tanna

    Hi Lucas,

    Thanks for the post…it really helps get a quick start….

    I had some questions for which your help will be appreciated:

    1. What if I want two fulltext configurations on same column? How can this be achieved?

    2. How can I extend the existing standard dictionaries…For instance if a company “UNKNOWN LIMITED” is to be achieved under entity ORGANIZATION/COMMERCIAL. How can this be enhanced on standard? I know we can create custom dictionaries..but will the entity variant not overwrite the existing…also how can I add this in stand EXTRACT_CORE for instance

    Thanks in advance,

    Cheers,

    Jemin Tanna

    (0) 
    1. Lucas Sparvieri Post author

      Hi Prakash, you can certainly use Text Analysis to analyze social media. There are different ways how you can get social data into HANA, for example using Data Services:

      http://scn.sap.com/docs/DOC-8820

      Or you can build your own code that pulls data from the Twitter API, for example using Python and a library called TwitterSearch, I can provide the code if you want. Once you have the data in HANA you can trigger Text Analysis using a configuration called EXTRACTION_CORE_VOICEOFCUSTOMER. This configuration will not only extract the main entities but it will also provide information about sentiments and requests. You can use this configuration to perform sentiment analysis.

      Hope this helps.

      Cheers, Lucas.

      (0) 
      1. Lucas Sparvieri Post author

        In case somebody is interested in the Python script to access the Twitter api. Note that before you can run this script you need to install the TwitterSearch Library (ckoepp/TwitterSearch · GitHub)

        import dbapi

        from TwitterSearch import *

        con = dbapi.connect(‘<hana_host>’, 3<instance_number>15, ‘<user>’, ‘<pwd>’)

        cur = con.cursor()

        try:

            tso = TwitterSearchOrder()

            tso.setKeywords([‘sap hana’])

            tso.setLanguage(‘en’)

            tso.setIncludeEntities(False)

           

            ts = TwitterSearch(

                consumer_key = ‘<consumer_key>’,

                consumer_secret = ‘<consumer_secret>’,

                access_token = ‘<access_token>’,

                access_token_secret = ‘<token_secret>’

             )

            for tweet in ts.searchTweetsIterable(tso):

                cur.execute(“INSERT INTO TWITTER_TBL VALUES(?,?,?)”, (tweet[‘id’],tweet[‘user’][‘screen_name’],tweet[‘text’])) #Save the content to the table

        except TwitterSearchException as e:

            print(e)

           

        cur.close() 

        con.close() 

        (0) 
  13. Lukas Carullo

    Hi Lucas,

    I can only repeat: great Blog. It was very helpful.

    One question: I read about the limit of 1.000.000 matched Entites per file in an earlier comment. I am still experiencing that any higher number leads to the following Error:

     

    “Too many annotations added to annotation manager”

    Is it still not possible to increases this limit?

    Although I read about automatic conversion from PDF to text, but my $TA_TOKENs of analysed PDFs are still cryptical. Is there a special configuration for my Fulltext index that I have to care about?

    Thanks and Regards,

    Lukas

    (0) 
  14. SAM CHAKRA

    Nice… Question : Other than writing code OR Using SAP DS is there any way to load unstructured data (say MS word) into SAP HANA ? My question “Is there a GUI” like Import utility which loads .csv file to load MS -WORD, PDF etc into HANA ?

    (0) 
  15. SAM CHAKRA

    Thanks …Christoph.

    1 more question..

    Based on my reading it seems HANA SP7 can directly search binary types e.g. DOC , PPT etc.However, what I could not find is where/how I am storing the file type ?

    My understanding is that if I use SAP DS , the content of the source will be stored as TEXT in HANA irrespective of the source document type DOC , PPT etc.

    What am I missing ?

    (0) 
  16. Prince Victor

    Hi there,

    I have few queries

    1. Does SAS text analytics and sentiment analysis supports Arabic lanugage?
    2. Does it support text extraction into entities like Person, country, organization and city etc?
    3. Does it support .NET SDK?
    4. How can I get a trail version?

    Looking forward for your reply.
    Have a nice day.

    (0) 
    1. Lars Breddemann

      I’m pretty sure this is not the right place for finding answers on SAS text analytics.

      Maybe the SAS website would be a good place to look for that.

      – Lars

      (0) 
  17. Padmindra Gurung

    Hello Experts,

    Once PDF data insert into HANA DB. Is there any technique to extract whole paragraph of the PDF content not only the some text.

    In another word, extraction of all pdf content directly in SAP HANA?

    Thank,
    Padmindra

    (0) 
  18. Naga Divya Haridasula

    Hi Lucas,

    Thanks for the information provided, it is very helpful.
    I have below query regarding HANA host.
    1) This HANA DB host in the connect statement, could be HCP also? or it should be Hana on Premise DB?
    (con = dbapi.connect(β€˜hanahost‘……………………..)

    Thanks,
    Divya.

    (0) 
  19. Sakshi Srivastava

    Hi,

    I am able to load a local PDF file through PYODBC but how to load from a web page? Creating a simple web page and having a button of uploading a file which gets stored in HANA database. Following is my code where I am giving the path of the file from my system instead of that file path , my requirement is to get the file path dynamically:

    import pyodbc
    conn = pyodbc.connect(‘DRIVER={HDBODBC};SERVERNODE=XXXX;SERVERDB=XXX;UID=XXXX;PWD=XXXXX’)
    cur = conn.cursor()
    file = open(‘c:/pdf-sample.pdf’,’rb’)
    content = file.read()
    cur.execute(“INSERT INTO TEST.MY_TABLE_BLOB_TEST VALUES(?,?)”, (‘1’,content))
    cur.execute(“COMMIT”)
    file.close()
    cur.close()
    conn.close()

    How should I proceed with my requirement?

    (0) 

Leave a Reply