Ok, so I was experimenting with HANA text analysis and found out that:
CGUL rules are out of scope for this blog; however, I want to give a gist of when we should use them, with two examples:
So, now let's talk about what this blog wants to talk about:
Problem statement: There are thousands of documents loaded in the system. We want to classify these documents by type into
Approach:
First run TA and identify the following:
So, 1. Let's see what we need for TA (text analysis):
Dictionary:
We create a .hdbtextdict file in the XS project.
It will have, for example, the following content:
<?xml version="1.0" encoding="UTF-8"?>
<dictionary xmlns="http://www.sap.com/ta/4.0">
<entity_category name="DOC_IDENTIFIERS">
<entity_name standard_form="DOC_TYPE_BROCHURE">
<variant name="BROCHURE" />
<variant name="Brochure" />
<variant name="brochure" />
<variant name="Prospekt" />
<variant name="leaflet" />
<variant name="offer" />
<variant name="Druckluftkommentare" />
<variant name="kommentare" />
<variant name="comment" />
<variant name="Angebotstext" />
<variant name="angebotstext" />
</entity_name>
<entity_name standard_form="DOC_TYPE_ARGUMENT">
<variant name="ARGUMENT" />
<variant name="Argument" />
<variant name="argument" />
</entity_name>
<entity_name standard_form="DOC_TYPE_TECHNICAL">
<variant name="DATASHEET" />
<variant name="DataSheet" />
<variant name="datasheet" />
<variant name="TECHNICAL DATA"/>
<variant name="Technical Data"/>
<variant name="technical data"/>
</entity_name>
<entity_name standard_form="MANUAL_TOPICS">
<variant name="INSTRUCTIONS" />
<variant name="Instructions" />
<variant name="instructions" />
<variant name="CONFIGURATIONS" />
<variant name="Configurations" />
<variant name="configurations" />
<variant name="FAULT" />
<variant name="Fault" />
<variant name="fault" />
<variant name="MAINTENANCE" />
<variant name="Maintenance" />
<variant name="maintenance" />
</entity_name>
</entity_category>
</dictionary>
Then we use this dictionary in a configuration file.
Create a .cfg file and copy the content from one of the standard cfg files in HANA.
Modification:
under
<property name="Dictionaries" type="string-list">
put
<string-list-value><package where you created your dictionary>::<filename>.hdbtextdict</string-list-value>
Then we use this configuration file in a full-text index over the binary content column of the document store table.
Example:
CREATE FULLTEXT INDEX "DOCUMENT_STORE_BIN_CONTENT" ON "DEMO"."come.test.ta::DOCUMENT_STORE" ("BIN_CONTENT")
LANGUAGE COLUMN "LANG"
MIME TYPE COLUMN "MIME_TYPE"
CONFIGURATION '<package where you created the config file>::<name of your .cfg file without extension>' ASYNC
LANGUAGE DETECTION ('en','de')
PHRASE INDEX RATIO 0.000000
FUZZY SEARCH INDEX OFF
SEARCH ONLY OFF
FAST PREPROCESS OFF
TEXT MINING OFF
TEXT ANALYSIS ON;
Check whether all documents have been indexed by querying the table:
select * from "SYS"."M_FULLTEXT_QUEUES"
Once the indexing is done, you can find the entities extracted from the documents in your $TA_* table.
So now that TA is done, let's marry it to statistical modelling.
We use KNN (k-nearest neighbours) from PAL for this example.
First we create the vectors to feed into KNN: we create a view on top of this $TA table that gives, for every document, the count of each TA_NORMALIZED value.
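To make the idea of these vectors concrete, here is a minimal sketch in plain Python (not the HANA view itself). It assumes that TA_NORMALIZED carries the standard_form from the dictionary above; the document ids and the sample rows are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical rows as they might come out of the $TA table:
# (document id, TA_NORMALIZED). Ids and values are invented for illustration.
ta_rows = [
    ("doc1", "DOC_TYPE_BROCHURE"),
    ("doc1", "DOC_TYPE_BROCHURE"),
    ("doc1", "MANUAL_TOPICS"),
    ("doc2", "DOC_TYPE_TECHNICAL"),
    ("doc2", "MANUAL_TOPICS"),
    ("doc2", "MANUAL_TOPICS"),
]

# Fixed feature order: the standard forms defined in the dictionary.
FEATURES = [
    "DOC_TYPE_BROCHURE",
    "DOC_TYPE_ARGUMENT",
    "DOC_TYPE_TECHNICAL",
    "MANUAL_TOPICS",
]


def build_vectors(rows, features=FEATURES):
    """Count each normalized entity per document and lay the counts out as a vector."""
    counts = defaultdict(Counter)
    for doc_id, normalized in rows:
        counts[doc_id][normalized] += 1
    # A Counter returns 0 for entities that never occurred in a document.
    return {doc_id: [c[f] for f in features] for doc_id, c in counts.items()}


vectors = build_vectors(ta_rows)
```

Each document ends up as one row of counts in a fixed column order, which is the shape a KNN implementation expects.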
Feed some of these documents, the ones whose types we already know for certain, to KNN as the learning set; the larger the learning set, the better.
NOTE: make sure the learning sets for all document types are the same size, otherwise KNN will be biased.
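One simple way to get equally sized learning sets is to downsample every type to the size of the smallest one. The sketch below is plain Python and only illustrates that idea; the function name, the labels and the sample vectors are hypothetical, and it is not what PAL does internally.

```python
import random
from collections import defaultdict


def balance_learning_set(labelled, seed=0):
    """Downsample so that every document type has the same number of examples.

    labelled: list of (vector, doc_type) pairs with known types.
    """
    by_type = defaultdict(list)
    for vector, doc_type in labelled:
        by_type[doc_type].append((vector, doc_type))
    # The smallest group decides how many examples each type may keep.
    size = min(len(items) for items in by_type.values())
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    balanced = []
    for doc_type in sorted(by_type):
        balanced.extend(rng.sample(by_type[doc_type], size))
    return balanced


# Hypothetical labelled documents: three brochures, one technical document.
labelled = [
    ([5, 0, 0, 1], "BROCHURE"),
    ([4, 0, 0, 0], "BROCHURE"),
    ([6, 0, 1, 0], "BROCHURE"),
    ([0, 0, 6, 2], "TECHNICAL"),
]
balanced = balance_learning_set(labelled)
```

Downsampling throws labelled documents away, so in practice it is usually better to label more documents of the rare types than to cut the common ones.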
And then pass all documents against this learning set and let KNN do its part!
By the end of it, it will have classified the documents, and indeed for some brochures it will proudly say: it's a brochure.
I tried this on a document set of around 20,000 in around 5 languages, with learning sets of around 30 (this is a bit low) and a K of 75, as the group clusters of 30 were very tight.
The classification was accurate to around 90% or more.
I hope this blog helps you see the potential of marrying text analysis with statistical modelling, so that you can come up with even more interesting use cases.
Watch this space for a more detailed blog on the hows and whats of CGUL rules.
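For readers who have not used KNN before, the classification step boils down to "find the K closest learning documents and take a majority vote". The following is a minimal plain-Python sketch of that principle, assuming Euclidean distance; it is not the PAL procedure, and the labels and vectors are invented for illustration.

```python
import math
from collections import Counter


def knn_classify(vector, learning_set, k):
    """Return the majority document type among the k nearest learning vectors."""
    # Sort the learning set by Euclidean distance to the document being classified.
    nearest = sorted(learning_set, key=lambda item: math.dist(vector, item[0]))[:k]
    votes = Counter(doc_type for _, doc_type in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical balanced learning set: two documents per type.
learning_set = [
    ([5, 0, 0, 1], "BROCHURE"),
    ([4, 0, 0, 0], "BROCHURE"),
    ([0, 0, 6, 2], "TECHNICAL"),
    ([0, 0, 4, 3], "TECHNICAL"),
]

# An unclassified document that mostly mentions brochure terms.
predicted = knn_classify([3, 0, 0, 1], learning_set, k=3)
```

With k=3, two of the three nearest neighbours are brochures, so the vote goes to the brochure type.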
Cheers,
Jemin Tanna
Connect with me on LinkedIn: Jemin Tanna