Have you ever wanted to combine HANA’s many engines but didn’t know how to?
Perhaps you’re still waiting for the right use case?
Here we wanted to identify people or places within unstructured documents (PDF and Word files) and surface some intelligence from them, perhaps the relationships between the people.
For the purpose of this blog, we have taken some public documents – company annual reports, UK political manifestos and SAP documentation.
Could this be achieved within SAP HANA? The answer is of course yes 🙂
The process went like this.
- Acquire the unstructured documents (Smart Data Integration)
- Identify people within the documents (Text Analysis)
- Clean the people identified (Smart Data Quality)
- Identify linked people (HANA Graph)
- Search for relationships between documents and people (Text Search & Text Mining)
- Expose data for consumption (Calculation Views)
- Combine the Output (SAP Analytics Cloud, Analytics Designer)
1. Acquire the unstructured documents (HANA Smart Data Integration)
Opening the virtual table shows that we have three useful columns: the directory path, the document name and the document itself in a binary format.
The SDI flowgraph below creates a physical table from the virtual table and adds 3 further columns (CATEGORY, MIME, LANG) that will be useful later.
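As a rough illustration of what such an enrichment step does, the Python sketch below derives the three extra columns for one document row. The rules are assumptions made for this example only, not the flowgraph's actual logic: CATEGORY is taken from the last directory name, MIME is guessed from the file extension, and LANG is hard-coded to EN. The function name `enrich` and the sample path are hypothetical.

```python
import mimetypes
from pathlib import PurePosixPath


def enrich(path: str, name: str) -> dict:
    """Return the three derived columns for one document row.

    All three rules are illustrative assumptions, not the SDI flowgraph's logic.
    """
    mime, _encoding = mimetypes.guess_type(name)
    return {
        # Assumption: the last directory name doubles as the document category
        "CATEGORY": PurePosixPath(path).name or "UNKNOWN",
        # Assumption: MIME type is inferred from the file extension
        "MIME": mime or "application/octet-stream",
        # Assumption: every document is in English
        "LANG": "EN",
    }


row = enrich("/docs/annual_reports", "report_2018.pdf")
print(row)
```

Having the MIME type and language as explicit columns is handy later, because text processing generally needs to know how to parse each binary document and which language to analyse it in.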
2. Identify people in the documents (HANA Text Analysis)
Using HANA Text Analysis we can turn the unstructured documents into a structured form. The structured form of the document identifies many different types of entities, including people. To do this we need to create a full-text index, which creates the $TA table for us. Below we created a .hdbfulltextindex.
Looking at the data shows that Text Analysis has found a large number of people (13,996).
We should inspect some of the PERSON entities.
In the result below, we can see that some of the people are actual people, such as Bill McDermott, Gerhard Oswald, Bernd Leukert, Hasso Plattner and more familiar names, but some do not appear to be people at all: IDE, tion, ing, ment.
3. Clean the people identified (HANA Smart Data Quality)
As the above output shows, correctly extracting the people from the documents is not easy: some entities identified as people are not really people. A colleague, Remi ASTIER, recommended performing some data quality checks to clean up these names, which gives significantly better quality output.
Additionally, we can include further processing and specify quality rules for what is acceptable, e.g. requiring both a first name and a last name.
To do this we created the flowgraph below.
The output of the flowgraph is a table of the unique, cleaned-up people. We have added the number of occurrences of each person and the number of documents they appear in.
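To make the cleaning rule and the two counts concrete, here is a minimal Python sketch. It is not what Smart Data Quality does internally; it only applies one assumed rule (a name needs at least two alphabetic parts) and then counts occurrences and distinct documents per person. The sample rows and the helper `name_key` are hypothetical.

```python
from collections import defaultdict
from typing import Optional

# Hypothetical (document, extracted PERSON token) pairs
extracted = [
    ("report.pdf", "Bill McDermott"),
    ("report.pdf", "bill  mcdermott"),
    ("manifesto.pdf", "Hasso Plattner"),
    ("report.pdf", "tion"),
    ("guide.docx", "Bill McDermott"),
    ("guide.docx", "IDE"),
]


def name_key(token: str) -> Optional[str]:
    """Return a case-insensitive key, or None if the token fails the rule."""
    parts = token.split()
    is_word = lambda p: p.replace("-", "").replace("'", "").isalpha()
    # Assumed quality rule: at least a first name and a last name
    if len(parts) < 2 or not all(is_word(p) for p in parts):
        return None
    return " ".join(p.casefold() for p in parts)


occurrences = defaultdict(int)
documents = defaultdict(set)
display = {}
for doc, token in extracted:
    key = name_key(token)
    if key is None:
        continue  # rejected, e.g. "tion" or "IDE"
    display.setdefault(key, " ".join(token.split()))  # keep first spelling seen
    occurrences[key] += 1
    documents[key].add(doc)

# (person, number of occurrences, number of documents)
people = [(display[k], occurrences[k], len(documents[k])) for k in sorted(display)]
print(people)
```

A rule this simple would also reject legitimate single-name mentions, so in practice it trades recall for precision.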
4. Identify linked people (HANA Graph)
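A HANA graph workspace is defined over a vertex table and an edge table, so the people have to be turned into vertices and the links between them into edges. One plausible way to derive those links, shown here purely as an assumption, is to connect two people whenever they appear in the same document and to use the number of shared documents as the edge weight. The Python sketch below illustrates that idea with hypothetical sample data.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: document -> set of cleaned people found in it
people_by_doc = {
    "report.pdf": {"Bill McDermott", "Hasso Plattner", "Bernd Leukert"},
    "guide.docx": {"Bill McDermott", "Hasso Plattner"},
}

# Assumption: two people are linked if they co-occur in a document
edge_weights = Counter()
for doc, people_in_doc in people_by_doc.items():
    # sorted() gives each undirected pair one canonical orientation
    for a, b in combinations(sorted(people_in_doc), 2):
        edge_weights[(a, b)] += 1

# Rows for a vertex table and an edge table
vertices = sorted({p for ppl in people_by_doc.values() for p in ppl})
edges = [(a, b, w) for (a, b), w in sorted(edge_weights.items())]

for edge in edges:
    print(edge)
```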
5. Search for relationships between documents and people (Text Search & Text Mining)
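Fuzzy search means a misspelled search term can still find the right person or document. The Python sketch below illustrates the idea with the standard library's `difflib`; it is a stand-in for the concept, not HANA's fuzzy search algorithm, and the 0.8 threshold and the function name `fuzzy_search` are assumptions.

```python
from difflib import SequenceMatcher


def fuzzy_search(term, candidates, threshold=0.8):
    """Return candidates whose similarity to term meets the threshold, best first."""
    term_cf = term.casefold()
    scored = [
        (SequenceMatcher(None, term_cf, c.casefold()).ratio(), c)
        for c in candidates
    ]
    return [c for score, c in sorted(scored, reverse=True) if score >= threshold]


names = ["Hasso Plattner", "Bill McDermott", "Bernd Leukert"]
hits = fuzzy_search("Haso Platner", names)  # misspelled on purpose
print(hits)
```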
6. Expose data for consumption (HANA Calculation Views)
We can explore the data using hand-coded SQL, but that only works for those of us who understand SQL, and fixed SQL is not suitable if you want to offer this capability to users. Here we used a traditional calculation view, as we couldn’t get the Text Mining functions to work in the schema-less HDI container.
The relevant terms function first needs a document, which we then pass into the text mining function.
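To show the "pick a document, then ask for its relevant terms" pattern, the Python sketch below ranks the terms of one document by TF-IDF against a small corpus. TF-IDF is used here only as a familiar stand-in for term relevance; it is an assumption, not a description of how HANA Text Mining scores terms. The corpus and the function `relevant_terms` are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical mini corpus: document name -> text
corpus = {
    "report.pdf": "revenue growth cloud revenue customers",
    "manifesto.pdf": "health education economy growth",
    "guide.docx": "install cloud database install configure",
}


def tokens(text):
    return re.findall(r"[a-z]+", text.lower())


def relevant_terms(doc_name, top=3):
    """Rank the terms of one document by TF-IDF against the whole corpus."""
    tf = Counter(tokens(corpus[doc_name]))
    n_docs = len(corpus)
    scores = {}
    for term, count in tf.items():
        # Document frequency: in how many documents does the term occur?
        df = sum(term in tokens(text) for text in corpus.values())
        scores[term] = count * math.log(n_docs / df)
    # Highest score first; ties broken alphabetically for a stable order
    return sorted(scores, key=lambda t: (-scores[t], t))[:top]


top_terms = relevant_terms("guide.docx")
print(top_terms)
```

The document name plays the role of the input parameter: change it and the ranked terms change with it, which is the behaviour the calculation view exposes to the front end.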
7. Combine the Output (SAP Analytics Cloud, Analytics Designer)
Individually the HANA calculation views can provide some interesting insight. Perhaps if we combine Fuzzy Text Search, Relevant Terms, Related Terms, Related Documents, Relevant Documents and Text Analysis, we can expose something meaningful.
I tried combining the different calc views within a SAC story, but the user experience was incomplete: we could not find a way to pass a parameter from a row within a table to another chart/data provider. SAC linked analysis is very cool, but it couldn’t quite achieve what we needed. We have six calculation views, each with input parameters, and we want to pass values between them.
I had heard that Design Studio / Lumira Designer-like functionality is now available within SAC, released as Analytics Designer, and we were keen to give it a try. Using Analytics Designer usually requires a small amount of scripting to get all the components working together.
In SAC we created an Analytic Application which would bring the calculation views together.
Here we can see the 6 calculation views that were used.
The wireframe of the output is below. We decided to include 4 Text Mining functions, Fuzzy Text Search and the Text Analysis output. The top 3 elements are driven by the search term, and the bottom 3 relate to a specific document selected.
Within our application, we select the model and visualisations we require and then format them as required.
We now link the variables that are related. The top 3 visualisations all use the input SEARCH_TERM.
Next we wanted to be able to click on the Document Name in the table and pass it to the three visualisations below. The piece of code below does that, although setVariableValue is currently disabled in our tenant (2019.8.3).
When we launch the application we are prompted once for our search term.
After we enter our search term, the SAC Analytic Application is displayed.