Pre- and Post-Processing around HANA Text Analysis: PDF Table Extraction
With something as powerful as HANA TA, there is an increasing need to pre- and post-process various elements of a given document.
The first part, on post-processing to identify the document type, is covered in my previous blog: HANA Text Analysis Married to Structured Statistical Models, It’s a Brochure!
The scenario under discussion here is extracting a table from a document, say a PDF, and making sense of it for semantic search.
To make things clearer, assume we have a table like:
Now, with HANA TA, or with natural language processing in general, we cannot establish the relationship that Model GX 5 has a Power Consumption[kW] of 7.33.
So, I got a very good starting point from the blog: http://craiget.com/extracting-table-data-from-pdfs-with-ocr/
Here I had to adjust a few bits for 64-bit Windows and rework the cell recognition code. With that, it was almost ready to use.
The steps I followed are:
- Split the PDF into pages,
- Make a monochrome image from each page to remove the color bias,
- Negate the image where needed, an important change I had to make to handle tables with white borders,
- Identify cells based on the criss-cross of horizontal and vertical lines,
- Submit the cells to OCR, tesseract in this example.
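The monochrome, negation and cell identification steps can be sketched roughly as follows. This is a minimal pure-Python illustration and not the code from the referenced blog: the function names (`to_monochrome`, `find_lines`, `group_runs`, `cells_from_lines`), the threshold values and the tiny 7x7 test image are all assumptions made for the example, and it assumes the table lines run across the full width and height of the image.

```python
def to_monochrome(gray, threshold=128, negate=False):
    """Map a grayscale image (rows of 0-255 ints) to 1 = ink, 0 = background.

    negate=True flips the result, which is what a white-bordered table
    on a dark background needs.
    """
    ink = [[1 if px < threshold else 0 for px in row] for row in gray]
    if negate:
        ink = [[1 - px for px in row] for row in ink]
    return ink


def find_lines(ink, min_coverage=0.9):
    """Return (row indices, column indices) that are almost entirely ink."""
    height, width = len(ink), len(ink[0])
    ys = [y for y in range(height) if sum(ink[y]) >= min_coverage * width]
    xs = [x for x in range(width)
          if sum(ink[y][x] for y in range(height)) >= min_coverage * height]
    return ys, xs


def group_runs(indices):
    """Merge consecutive indices (a thick line) into (start, end) runs."""
    runs = []
    for i in indices:
        if runs and i == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], i)
        else:
            runs.append((i, i))
    return runs


def cells_from_lines(h_runs, v_runs):
    """Neighbouring lines bound a cell: (left, top, right, bottom), end-exclusive."""
    return [(v_runs[j][1] + 1, h_runs[i][1] + 1, v_runs[j + 1][0], h_runs[i + 1][0])
            for i in range(len(h_runs) - 1)
            for j in range(len(v_runs) - 1)]


def detect_cells(gray, negate=False):
    ink = to_monochrome(gray, negate=negate)
    ys, xs = find_lines(ink)
    return cells_from_lines(group_runs(ys), group_runs(xs))


# Made-up 7x7 page: black rules at rows/columns 0, 3 and 6 form a 2x2 table.
LINES = (0, 3, 6)
dark_rules = [[0 if (y in LINES or x in LINES) else 255 for x in range(7)]
              for y in range(7)]
# The same table with white rules on a dark background.
white_rules = [[255 - px for px in row] for row in dark_rules]

cells = detect_cells(dark_rules)
cells_negated = detect_cells(white_rules, negate=True)
print(cells)
```

Each box could then be cropped out of the page image and handed to the OCR engine, for instance with Pillow's `Image.crop` and `pytesseract.image_to_string` if those packages are installed.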
The result below is the outcome of running the Python script with the steps above on the PDF whose representation is given above.
As we can see, some lines are missed. This is due to noise in the image, i.e. the 3rd horizontal line has a lot of pixel noise; it is not black enough.
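To illustrate why a noisy line gets missed, here is a small sketch under assumed numbers. It treats a row as a table line only when the share of dark pixels reaches a coverage threshold; the helper names and the 0.9 and 0.6 thresholds are hypothetical, chosen just for this example.

```python
def row_coverage(ink):
    """Fraction of ink pixels per row (1 = ink, 0 = background)."""
    return [sum(row) / len(row) for row in ink]


def horizontal_lines(ink, min_coverage):
    """Rows whose ink coverage reaches the threshold count as table lines."""
    return [y for y, c in enumerate(row_coverage(ink)) if c >= min_coverage]


clean = [1] * 10                          # a solid rule
noisy = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]    # same rule, 3 pixels lost to noise
blank = [0] * 10

ink = [clean, blank, noisy, blank, clean]

strict = horizontal_lines(ink, min_coverage=0.9)    # misses the noisy rule
lenient = horizontal_lines(ink, min_coverage=0.6)   # recovers it
print(strict, lenient)
```

Lowering the threshold recovers the faint line, but set it too low and a dense row of text may be mistaken for a rule, so the value needs tuning per document type.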
Once we have this information, we can store it against the document and page in the graph and use it for search. Here we have used tesseract, which is open source and needs a clean, high-quality image; one could instead opt for a good licensed OCR engine or a licensed PDF reader and experiment with it.
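As a rough idea of what "storing against the document and page" might look like, the sketch below turns an OCR'd cell grid into flat records that tie a row, a column header and a value together. The record layout, the `grid_to_facts` name and the `brochure.pdf` document id are assumptions for illustration; how such records are loaded into the graph is left open.

```python
def grid_to_facts(grid, doc_id, page):
    """Flatten a table (first row = headers, first column = row key) into records."""
    header, *rows = grid
    facts = []
    for row in rows:
        subject = f"{header[0]} {row[0]}"          # e.g. "Model GX 5"
        for attribute, value in zip(header[1:], row[1:]):
            facts.append({
                "doc": doc_id,
                "page": page,
                "subject": subject,
                "attribute": attribute,
                "value": value,
            })
    return facts


# Cell texts as they might come back from OCR, using the example from this post.
grid = [
    ["Model", "Power Consumption[kW]"],
    ["GX 5", "7.33"],
]

facts = grid_to_facts(grid, doc_id="brochure.pdf", page=1)  # hypothetical doc id
print(facts)
```

With records like these, a search for the power consumption of Model GX 5 can be answered directly, which is the relationship plain text analysis could not establish.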
With this approach, metadata such as font size, tables, italics and bold text can now be taken into account to enrich the semantic net, as supplementary pre/post-processing around HANA TA.
I hope this blog helps you. I look forward to your feedback and to hearing about the use cases on your wish-list that this could help achieve.
New BLOG @CGUL rules:
To Be or Not To Be: HANA Text Analysis CGUL rules has the answer