HANA Text Analysis Married to Structured Statistical Models, It’s a Brochure!
Ok, so I was experimenting with HANA text analysis and found out that:
- With HANA dictionaries we can find out what the documents are talking about
- With CGUL rules we can find out how they are talking about it
CGUL rules are not in scope of this blog; however, I want to give a gist of when we should use them, with two examples:
- For removing ambiguity using part of speech: for instance, you have a company called All Star whose abbreviation is AS. Now "AS" can also occur as a non-noun value, and more frequently so; here you can write a CGUL rule to extract it only when it appears as a noun, for example.
- For removing substring contradictions: for instance, we want to find the word "positive", but "not positive" also contains "positive", and if extracted, the fact would be a contradiction. So here too we can write a rule saying "positive" AND NOT "not positive" 🙂
So, now let's talk about what this blog wants to talk about:
Problem statement: There are thousands of documents loaded in the system. We want to classify these documents by type into:
- Brochure
- Manual
- Technical Data
- Argument Document
- MISC
Approach:
First run TA and collect the following information from it:
- Number of different manufacturers mentioned in the document
  - For example, this could be higher in an argument document than in a brochure, where it should be only one
- Mentions of "technical data", "DataSheet" or their synonyms
  - For example, these could occur more often in a brochure than in an argument document
- Number of products (as maintained in the semantic net) from the same manufacturer
  - For example, this is more likely to occur in a technical document/spec sheet than in an argument document
- Number of occurrences of setup and help keywords such as: Installation, Instructions, Configuration, Fault, Maintenance
- Number of times a document type term or one of its synonyms occurs within the first page/paragraph/first few lines
- Feed in learning documents for each category and record their relevant text analysis characteristics.
- Classify the remaining documents based on these features via a supervised classification algorithm; here we use KNN from PAL (see the feature-vector sketch right after this list).
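To make this concrete, here is a minimal sketch of what such a per-document feature vector could look like. The schema, table and column names are hypothetical, chosen purely for illustration:

-- Hypothetical feature-vector table; one row per document.
-- Columns mirror the TA-derived counts listed above.
CREATE COLUMN TABLE "DEMO"."DOC_FEATURES" (
    "DOC_ID"             INTEGER PRIMARY KEY,
    "MANUFACTURER_CNT"   INTEGER,     -- distinct manufacturers mentioned
    "TECHDATA_HITS"      INTEGER,     -- "technical data"/"DataSheet" synonym hits
    "SAME_MFR_PRODUCTS"  INTEGER,     -- products from the same manufacturer
    "SETUP_KEYWORD_HITS" INTEGER,     -- Installation, Configuration, Fault, ...
    "DOCTYPE_TERM_EARLY" INTEGER,     -- doc-type term hits in the first page/paragraph
    "DOC_TYPE"           VARCHAR(20)  -- known label for the training subset, NULL otherwise
);

Each row is the "point" that KNN will later compare against the labelled training documents.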
So first, let's see what we need for TA.
Dictionary:
We create a .hdbtextdict file in an XS project.
This will, for example, have the following content:
<?xml version="1.0" encoding="UTF-8"?>
<dictionary xmlns="http://www.sap.com/ta/4.0">
  <entity_category name="DOC_IDENTIFIERS">
    <entity_name standard_form="DOC_TYPE_BROCHURE">
      <variant name="BROCHURE" />
      <variant name="Brochure" />
      <variant name="brochure" />
      <variant name="Prospekt" />
      <variant name="leaflet" />
      <variant name="offer" />
      <variant name="Druckluftkommentare" />
      <variant name="kommentare" />
      <variant name="comment" />
      <variant name="Angebotstext" />
      <variant name="angebotstext" />
    </entity_name>
    <entity_name standard_form="DOC_TYPE_ARGUMENT">
      <variant name="ARGUMENT" />
      <variant name="Argument" />
      <variant name="argument" />
    </entity_name>
    <entity_name standard_form="DOC_TYPE_TECHNICAL">
      <variant name="DATASHEET" />
      <variant name="DataSheet" />
      <variant name="datasheet" />
      <variant name="TECHNICAL DATA" />
      <variant name="Technical Data" />
      <variant name="technical data" />
    </entity_name>
    <entity_name standard_form="MANUAL_TOPICS">
      <variant name="INSTRUCTIONS" />
      <variant name="Instructions" />
      <variant name="instructions" />
      <variant name="CONFIGURATIONS" />
      <variant name="Configurations" />
      <variant name="configurations" />
      <variant name="FAULT" />
      <variant name="Fault" />
      <variant name="fault" />
      <variant name="MAINTENANCE" />
      <variant name="Maintenance" />
      <variant name="maintenance" />
    </entity_name>
  </entity_category>
</dictionary>
Then we use this dictionary in a configuration file.
Create a .cfg file and copy the content from a standard configuration file in HANA.
Modification: under

  <property name="Dictionaries" type="string-list">

add

  <string-list-value><package where you created your dictionary>::<filename>.hdbtextdict</string-list-value>
Then we use this configuration file in a full-text index over the binary content column of the document store table.
Example:

CREATE FULLTEXT INDEX "DOCUMENT_STORE_BIN_CONTENT" ON "DEMO"."come.test.ta::DOCUMENT_STORE" ("BIN_CONTENT")
  LANGUAGE COLUMN "LANG"
  MIME TYPE COLUMN "MIME_TYPE"
  CONFIGURATION '<package where you created your config file>::<name of your .cfg file without extension>' ASYNC
  LANGUAGE DETECTION ('en','de')
  PHRASE INDEX RATIO 0.000000
  FUZZY SEARCH INDEX OFF
  SEARCH ONLY OFF
  FAST PREPROCESS OFF
  TEXT MINING OFF
  TEXT ANALYSIS ON;
Check if all documents have been indexed by querying the table:

select * from "SYS"."M_FULLTEXT_QUEUES"
Once the indexing is done, you can find the entities extracted from the documents in your $TA_* table.
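For instance, a quick sanity check on the extractions could look like the query below. The $TA table is named after the full-text index created above, and "ID" is assumed to be the key column of the document store table, so adjust the names to your setup:

-- Dictionary entities extracted per document, grouped by normalized form
SELECT "ID", "TA_TYPE", "TA_NORMALIZED", COUNT(*) AS "HITS"
  FROM "DEMO"."$TA_DOCUMENT_STORE_BIN_CONTENT"
 WHERE "TA_TYPE" = 'DOC_IDENTIFIERS'
 GROUP BY "ID", "TA_TYPE", "TA_NORMALIZED"
 ORDER BY "ID", "HITS" DESC;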
So now that TA is done... let's marry it to statistical modelling.
We use KNN from PAL for this example.
So first we create vectors for feeding into KNN: we create a view on top of the $TA table to get distinct counts of TA_NORMALIZED for every document, as sketched below.
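A minimal sketch of such a view, assuming the document store key column is "ID" and using the standard forms from the dictionary above (the view name is illustrative):

CREATE VIEW "DEMO"."V_DOC_TA_VECTOR" AS
SELECT "ID" AS "DOC_ID",
       -- pivot the normalized dictionary hits into one count column per standard form
       SUM(CASE WHEN "TA_NORMALIZED" = 'DOC_TYPE_BROCHURE'  THEN 1 ELSE 0 END) AS "BROCHURE_HITS",
       SUM(CASE WHEN "TA_NORMALIZED" = 'DOC_TYPE_ARGUMENT'  THEN 1 ELSE 0 END) AS "ARGUMENT_HITS",
       SUM(CASE WHEN "TA_NORMALIZED" = 'DOC_TYPE_TECHNICAL' THEN 1 ELSE 0 END) AS "TECHNICAL_HITS",
       SUM(CASE WHEN "TA_NORMALIZED" = 'MANUAL_TOPICS'      THEN 1 ELSE 0 END) AS "MANUAL_HITS"
  FROM "DEMO"."$TA_DOCUMENT_STORE_BIN_CONTENT"
 GROUP BY "ID";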
Feed some of these documents, whose types we are already sure of, into the KNN training set; the more training documents, the better.
NOTE: make sure the training sets per document type are all of the same size, else the KNN will be biased.
And then run all documents against this training set and let KNN do its part! A rough sketch of the PAL call follows.
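On the PAL side, the exact wrapper-generation steps vary by HANA revision, and "PAL_KNN_PROC" plus the table names below are hypothetical placeholders; the parameter names, however, are the documented PAL KNN ones:

-- Parameter table for PAL KNN (types follow the usual PAL parameter-table layout)
CREATE COLUMN TABLE "DEMO"."KNN_PARAMS" (
    "NAME"         VARCHAR(50),
    "INT_VALUE"    INTEGER,
    "DOUBLE_VALUE" DOUBLE,
    "STRING_VALUE" VARCHAR(100)
);
INSERT INTO "DEMO"."KNN_PARAMS" VALUES ('K_NEAREST_NEIGHBOURS', 75, NULL, NULL);
INSERT INTO "DEMO"."KNN_PARAMS" VALUES ('VOTING_TYPE',           0, NULL, NULL); -- 0 = majority voting
INSERT INTO "DEMO"."KNN_PARAMS" VALUES ('THREAD_NUMBER',         4, NULL, NULL);

-- "PAL_KNN_PROC" stands for the KNN procedure generated via the AFL wrapper:
-- inputs are the labelled training vectors, the parameters and the unlabelled
-- vectors; the output is the predicted class per document.
CALL "DEMO"."PAL_KNN_PROC"("DEMO"."KNN_TRAIN", "DEMO"."KNN_PARAMS",
                           "DEMO"."KNN_CLASSIFY", "DEMO"."KNN_RESULT") WITH OVERVIEW;

Here KNN_TRAIN would hold the feature vectors of the hand-labelled documents and KNN_CLASSIFY the rest.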
By the end of it, it would have classified the documents... and indeed, for some brochures, it would proudly say... it's a Brochure!
I tried this on a document set of around 20,000 documents in around 5 languages, with training sets of around 30 per type (which is a bit low) and a K of 75, as the clusters of 30 were very tight.
The classification came out accurate to around 90%+.
Hope this blog helps you explore the potential of marrying text analysis with statistical modelling and come up with even more interesting use cases.
Watch this space for a more detailed blog on the hows and whats of CGUL rules.
Cheers,
Jemin Tanna
Connect with me on LinkedIn: Jemin Tanna | LinkedIn
Hi Jemin,
Thanks for posting. Does TA in HANA support attributes of an item like shape, color, style and material, to search for these terms within the scope of TA?
Best,
Ranjit
Hi Ranjit,
Thank you for the feedback 🙂
The main premise of TA is natural language processing, so the attributes you mention are not exactly the target use case for TA.
However, as post-processing or pre-processing these are very strong use cases... and you never know, we might hear something from SAP on this too... but meanwhile there are a lot of open source tools that help with this.
Regards,
Jemin
A new blog on pre/post processing for table data extraction is shared:
PRE and POST processing around HANA Text Analysis : PDF Table Extraction
New blog on CGUL rules:
To Be or Not To Be: HANA Text Analysis CGUL rules has the answer
Well explained and articulated.
Thank you for sharing the information!
Thanks Amita 🙂
Hi Jemin,
I know it's been some years since you posted this blog post, but please don't mind if I ask you this question.
Do you think the approach shown in this post can be applied to this problem: Classify Documents in HCP according to given information?
I am trying to solve this but can't really find an approach that works for my case.
Thanks in advance!
Hi Thomas,
Sorry for the late reply... yes, it can be applied to that problem statement... however, with HANA text mining in place, we could do SQL-based queries for similar products too... that said, for more industry-specific use cases this approach is still relevant.
Thanks,
Jemin