Ok, so I was experimenting with HANA text analysis and found out that:

  1. With HANA Dictionaries we can find out what the documents are talking about
  2. With CGUL rules we can find out what is being said about them

CGUL rules are not in scope for this blog; however, I want to give a gist of when we should use them, with two examples:

  1. For removing ambiguity using part of speech. For instance, say you have a company called All Star, abbreviated AS. Now “as” also occurs as a non-noun, and far more frequently, so here you can write a CGUL rule to extract it only when it is a noun.
  2. For removing substring contradictions. For instance, we want to find the word “positive”, but “not positive” also contains “positive”, and extracting it would record the opposite of what the text says. So here too we can write a rule saying “positive and NOT not positive” 🙂
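
To make the second case concrete, here is a minimal sketch in plain Python. It is not CGUL syntax and does not use any HANA API; the pattern and the function name `find_positive_mentions` are made up for illustration. It only mirrors the intent of the rule: accept “positive” unless the word “not” sits directly in front of it.

```python
import re
from typing import List

# Match the standalone word "positive" unless it is directly preceded by
# "not" plus a single whitespace character. Plain Python regex, not CGUL.
POSITIVE_NOT_NEGATED = re.compile(r"(?<!\bnot\s)\bpositive\b", re.IGNORECASE)


def find_positive_mentions(text: str) -> List[str]:
    """Return every occurrence of 'positive' that is not negated by 'not'."""
    return POSITIVE_NOT_NEGATED.findall(text)


if __name__ == "__main__":
    print(find_positive_mentions("The test result was positive."))
    print(find_positive_mentions("The test result was not positive."))
```

A real rule would also have to deal with cases this sketch ignores, such as extra whitespace or other negations (“hardly positive”, “never positive”).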

So, now let's talk about what this blog is actually about:

Problem statement: There are thousands of documents loaded in the system. We want to classify these documents by type into:

  1. Brochure
  2. Manual
  3. Technical Data
  4. Argument Document
  5. MISC

Approach:

First run TA and identify the following:

  1. From TA, collect the following information:
    1. Number of different manufacturers mentioned in the document
      1. This, for example, could be higher in an argument document than in a brochure, where there should be only one
    2. Mentions of “technical data”, “DataSheet” or their synonyms
      1. This, for example, would normally be more frequent in a brochure than in an argument document
    3. Number of products (as maintained in the semantic net) from the same manufacturer
      1. This, for example, is more likely to be high in a technical document/spec sheet than in an argument document
    4. Number of occurrences of setup actions and help keywords such as: Installation, Instructions, Configuration, Fault, Maintenance
    5. Number of times a document type term or its synonym occurs on the first page, in the first paragraph, or within the first few lines
  2. Feed in learning documents for each category and record the relevant text analysis characteristics.
  3. Classify the remaining document set based on these points via a supervised learning algorithm for classification; here we use KNN from PAL.

So, first, let's see what we need for TA.

Dictionary:

We create a .hdbtextdict file in the XS project.

For example, it will have the following content:

<?xml version="1.0" encoding="UTF-8"?>
<dictionary xmlns="http://www.sap.com/ta/4.0">
   <entity_category name="DOC_IDENTIFIERS">
      <!-- Terms that name a document type -->
      <entity_name standard_form="DOC_TYPE_BROCHURE">
         <variant name="BROCHURE" />
         <variant name="Brochure" />
         <variant name="brochure" />
         <variant name="Prospekt" />
         <variant name="leaflet" />
         <variant name="offer" />
         <variant name="Druckluftkommentare" />
         <variant name="kommentare" />
         <variant name="comment" />
         <variant name="Angebotstext" />
         <variant name="angebotstext" />
      </entity_name>
      <entity_name standard_form="DOC_TYPE_ARGUMENT">
         <variant name="ARGUMENT" />
         <variant name="Argument" />
         <variant name="argument" />
      </entity_name>
      <entity_name standard_form="DOC_TYPE_TECHNICAL">
         <variant name="DATASHEET" />
         <variant name="DataSheet" />
         <variant name="datasheet" />
         <variant name="TECHNICAL DATA" />
         <variant name="Technical Data" />
         <variant name="technical data" />
      </entity_name>
      <!-- Setup and help keywords that point to a manual -->
      <entity_name standard_form="MANUAL_TOPICS">
         <variant name="INSTRUCTIONS" />
         <variant name="Instructions" />
         <variant name="instructions" />
         <variant name="CONFIGURATIONS" />
         <variant name="Configurations" />
         <variant name="configurations" />
         <variant name="FAULT" />
         <variant name="Fault" />
         <variant name="fault" />
         <variant name="MAINTENANCE" />
         <variant name="Maintenance" />
         <variant name="maintenance" />
      </entity_name>
   </entity_category>
</dictionary>

Then we reference this dictionary in a configuration file.

Create a .cfg file and copy the content from one of the standard .cfg files in HANA.

Modification:

Under

<property name="Dictionaries" type="string-list">

put

<string-list-value><package where you created your dictionary>::<filename>.hdbtextdict</string-list-value>

Then we use this configuration file in a full-text index over the binary content column of the document store table.

Example:

CREATE FULLTEXT INDEX "DOCUMENT_STORE_BIN_CONTENT" ON "DEMO"."come.test.ta::DOCUMENT_STORE" ("BIN_CONTENT")
  LANGUAGE COLUMN "LANG"
  MIME TYPE COLUMN "MIME_TYPE"
  CONFIGURATION '<package where you created the config file>::<name of your .cfg file without extension>' ASYNC
  LANGUAGE DETECTION ('en','de')
  PHRASE INDEX RATIO 0.000000
  FUZZY SEARCH INDEX OFF
  SEARCH ONLY OFF
  FAST PREPROCESS OFF
  TEXT MINING OFF
  TEXT ANALYSIS ON;

Check whether all documents have been indexed by querying this table:

select * from "SYS"."M_FULLTEXT_QUEUES";

Once the indexing is done, you can find the entities extracted from the documents in your $TA_* table.

[Image: $TA_TABLE.PNG]

So now that TA is done, let's marry it to statistical modelling.

We use KNN from PAL for this example.

First we create the vectors to feed into KNN: we create a view on top of this $TA table that returns, for every document, the counts of each distinct TA_NORMALIZED value.
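
The shape of that view can be sketched in plain Python. This is only an illustration under stated assumptions, not the HANA view itself: each input row is assumed to be a `(document id, TA_NORMALIZED)` pair already read from the $TA table, the feature names are the standard forms from the dictionary above, and `build_vectors` is a hypothetical helper name.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

# One vector position per standard form defined in the dictionary.
FEATURES = [
    "DOC_TYPE_BROCHURE",
    "DOC_TYPE_ARGUMENT",
    "DOC_TYPE_TECHNICAL",
    "MANUAL_TOPICS",
]


def build_vectors(ta_rows: Iterable[Tuple[str, str]]) -> Dict[str, List[int]]:
    """Turn (document id, TA_NORMALIZED) rows into one count vector per document."""
    counts = defaultdict(Counter)
    for doc_id, ta_normalized in ta_rows:
        counts[doc_id]  # register the document even if nothing matches
        if ta_normalized in FEATURES:
            counts[doc_id][ta_normalized] += 1
    # Fixed feature order, so position i means the same thing in every vector.
    return {doc_id: [c[name] for name in FEATURES] for doc_id, c in counts.items()}


if __name__ == "__main__":
    sample_rows = [  # made-up rows, for illustration only
        ("DOC1", "DOC_TYPE_BROCHURE"),
        ("DOC1", "DOC_TYPE_BROCHURE"),
        ("DOC1", "MANUAL_TOPICS"),
        ("DOC2", "DOC_TYPE_TECHNICAL"),
    ]
    print(build_vectors(sample_rows))
```

The other characteristics from the approach (number of manufacturers, number of products, first-page mentions) would become additional positions in the same vector.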

[Image: LearningVector.PNG]

Feed some of these documents, the ones whose types we already know for sure, into the KNN learning set. The more learning documents, the better.

NOTE: make sure the learning sets for all document types are the same size, otherwise KNN will be biased towards the larger ones.
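
One simple way to get equally sized learning sets is to downsample every document type to the size of the smallest one. The sketch below is an assumption about how this could be done, not something prescribed by PAL; `balance_learning_set` is a hypothetical helper, and each learning document is assumed to be a `(vector, label)` pair.

```python
import random
from collections import defaultdict
from typing import List, Sequence, Tuple

Labelled = Tuple[Sequence[float], str]


def balance_learning_set(labelled: Sequence[Labelled], seed: int = 0) -> List[Labelled]:
    """Downsample every document type to the size of the smallest type."""
    by_label = defaultdict(list)
    for vector, label in labelled:
        by_label[label].append(vector)
    if not by_label:
        return []
    size = min(len(vectors) for vectors in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    balanced = []
    for label in sorted(by_label):
        for vector in rng.sample(by_label[label], size):
            balanced.append((vector, label))
    return balanced
```

Downsampling throws learning documents away, so with small learning sets it may be better to label a few more documents for the under-represented types instead.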

And then pass all the remaining documents against this learning set and let KNN do its part!
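
For readers who have not used KNN before, here is what “its part” amounts to, as a minimal pure-Python sketch. It does not call PAL and says nothing about PAL's procedure signature; Euclidean distance and a plain majority vote are assumptions made for the illustration, and `knn_classify` and the sample vectors are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def knn_classify(
    learning_set: Sequence[Tuple[Sequence[float], str]],
    vector: Sequence[float],
    k: int,
) -> str:
    """Label a vector by majority vote among its k nearest learning documents."""
    # Sort learning documents by Euclidean distance to the new vector.
    nearest = sorted(learning_set, key=lambda item: math.dist(item[0], vector))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Made-up vectors: [brochure terms, argument terms, technical terms, manual topics]
    learning: List[Tuple[List[int], str]] = [
        ([5, 0, 0, 0], "Brochure"),
        ([4, 0, 1, 0], "Brochure"),
        ([0, 0, 1, 6], "Manual"),
        ([0, 1, 0, 5], "Manual"),
    ]
    print(knn_classify(learning, [3, 0, 0, 1], k=3))
```

Because every vote counts equally, a document type with many more learning documents than the others will win more votes, which is why the learning sets should be the same size.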

By the end of it, KNN will have classified the documents, and indeed for some brochures it will proudly say: it's a Brochure.

I tried this on a set of around 20,000 documents in around 5 languages, with learning sets of around 30 documents (this is a bit low) and a K of 75, as the clusters formed by the 30 learning documents were very tight.

The classification was accurate to around 90%+.

I hope this blog helps you see the potential of marrying text analysis with statistical modelling, and that you come up with even more interesting use cases.

Watch this space for a more detailed blog on the hows and whats of CGUL rules.

Cheers,

Jemin Tanna

Connect with me on linkedin: Jemin Tanna | LinkedIn


8 Comments


  1. Ranjit Alapati

    Hi Jemin,

    Thanks for posting. Does TA in HANA support attributes of an item, such as shape, color, style and material, so that these terms can be searched within the scope of TA?

    Best,

    Ranjit

    1. Jemin Tanna Post author

      Hi Ranjit,

      Thank you for the feedback 🙂

      The main premise of TA is natural language processing, so the attributes you mention are not exactly the target use case for TA.

      However, as post-processing or pre-processing, these are very strong use cases, and you never know, we might hear something from SAP on this too. Meanwhile, there are a lot of open source tools that help with this.

      Regards,

      Jemin

    1. Jemin Tanna Post author

      Hi Thomas,

      Sorry for the late reply. Yes, it can be applied to that problem statement. However, with HANA text mining in place, we could also run SQL-based queries for similar products. For more industry-specific use cases, though, this approach is still relevant.

      Thanks,

      Jemin

