
Putting HANA’s Multiple Engines to the test

Have you ever wanted to combine HANA’s many engines but didn’t know how to?
Perhaps you’re still waiting for the right use case?

Here we wanted to identify people or places within unstructured documents, such as PDF and Word files, and then surface some intelligence from them, perhaps the relationships between the people.

For the purpose of this blog, we have taken some public documents – company annual reports, UK political manifestos and SAP documentation.

Could this be achieved within SAP HANA? The answer is of course yes 🙂
The process went like this.

  1. Acquire the unstructured documents (Smart Data Integration)
  2. Identify people within the documents (Text Analysis)
  3. Clean the people identified (Smart Data Quality)
  4. Identify linked people (HANA Graph)
  5. Search for relationships between documents and people (Text Search & Text Mining)
  6. Expose data for consumption (Calculation Views)
  7. Combine the Output (SAP Analytics Cloud, Analytics Designer)

 

1. Acquire the unstructured documents (HANA Smart Data Integration)

We used the SDI Data Provisioning Agent and a flowgraph to load the unstructured PDFs into a physical table so that we can perform Text Analysis on the documents.
The FileAdapter remote source includes the FILE_LOADER virtual table, which gives us the raw binary document (BLOB).
We can see our Remote Sources when we connect Database Explorer to the database rather than the HDI container, as these are not currently managed via XSA.

 

Opening the virtual table shows that we have three useful columns, the directory path, the document name and the document itself in a binary format.

The SDI flowgraph below creates a physical table from the virtual table and adds 3 further columns (CATEGORY, MIME, LANG) that will be useful later.

2. Identify people in the documents (HANA Text Analysis)

Using HANA Text Analysis we can turn the unstructured documents into a structured form. The structured form identifies many different types of entities, including people. To do this we need to create a full-text index, which creates the $TA table for us. Below we created a .hdbfulltextindex.

Looking at the data shows that Text Analysis has found a large number of people (13,996).

 

We should inspect some of the PERSON entities.

In the result below, we can see that some of the entities are actual people, such as Bill McDermott, Gerhard Oswald, Bernd Leukert, Hasso Plattner and other familiar names, but there are some that are clearly not people: IDE, tion, ing, ment.

 

3. Clean the people identified (HANA Smart Data Quality)

As the above output shows, correctly extracting the people from the documents is not easy. Some entities identified as people are not really people. A colleague, Remi ASTIER, recommended performing some Data Quality checks to clean up these names, which gives significantly better quality output.

Additionally, we can include further processing and specify quality rules for what is acceptable, e.g. requiring both a first name and a last name.

To do this we created the flowgraph below.

 

The output of the flowgraph is a table containing the unique people cleaned up. We have added the number of occurrences of each person and the number of documents they appear in.
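
To make the idea of a quality rule plus the two counts more concrete, here is a minimal Python sketch. It is not what Smart Data Quality does internally; it is a simplified stand-in, and the sample rows, the function names and the "at least two alphabetic name parts" rule are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical rows of (document name, token tagged as PERSON).
# The real $TA table has more columns; these values are made up.
RAW_ENTITIES = [
    ("sap_annual_report.pdf", "Bill McDermott"),
    ("sap_annual_report.pdf", "bill  mcdermott"),
    ("sap_annual_report.pdf", "tion"),
    ("sap_integrated_report.pdf", "Bill McDermott"),
    ("sap_integrated_report.pdf", "Hasso Plattner"),
    ("sap_integrated_report.pdf", "IDE"),
]


def clean_key(token):
    """Return a case-insensitive key for a plausible person name, else None."""
    parts = token.split()
    # Illustrative rule: require at least a first name and a last name.
    if len(parts) < 2:
        return None
    # Illustrative rule: name parts may only contain letters, hyphens, apostrophes.
    if not all(p.replace("-", "").replace("'", "").isalpha() for p in parts):
        return None
    return " ".join(parts).casefold()


def summarise(rows):
    """Map each cleaned person to (occurrences, number of distinct documents)."""
    display = {}                      # key -> first spelling seen
    occurrences = defaultdict(int)    # key -> total mentions
    documents = defaultdict(set)      # key -> documents mentioning the person
    for doc, token in rows:
        key = clean_key(token)
        if key is None:
            continue  # rejected by the quality rules (e.g. "tion", "IDE")
        display.setdefault(key, " ".join(token.split()))
        occurrences[key] += 1
        documents[key].add(doc)
    return {display[k]: (occurrences[k], len(documents[k])) for k in display}


PEOPLE = summarise(RAW_ENTITIES)
for name, (count, docs) in PEOPLE.items():
    print(name, count, docs)
```

In this toy data the fragments "tion" and "IDE" are dropped because they have only one name part, and the two spellings of Bill McDermott collapse into a single person.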

 

4. Identify linked people (HANA Graph)

To use HANA Graph we need a table or view that holds our vertices and edges.  The vertices are the nodes and the edges are the connections between them.
Translated to our use case, our people will be the vertices and the documents will be our edges.
The flowgraph also created a table for use with the graph engine; this associates each person with the other people found in the same document.
The Graph definition is shown below.
After building the .hdbgraph we can find the graph workspace in the HDI container or the traditional schema.
Clicking the glasses allows us to explore the graph itself.  Our graph is not that interesting as we have only loaded a small number of documents (73).
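
The mapping of people to vertices and shared documents to edges can be sketched in a few lines of Python. This only illustrates the shape of the edge table, not the flowgraph we actually built; the sample pairs and the function name are assumptions.

```python
from itertools import combinations

# Hypothetical (person, document) pairs, as they might come out of the cleaned table.
PERSON_DOCS = [
    ("Bill McDermott", "annual_report.pdf"),
    ("Hasso Plattner", "annual_report.pdf"),
    ("Bernd Leukert", "annual_report.pdf"),
    ("Hasso Plattner", "integrated_report.pdf"),
    ("Gerhard Oswald", "integrated_report.pdf"),
]


def build_edges(pairs):
    """Create one edge per pair of people that appear in the same document."""
    people_by_doc = {}
    for person, doc in pairs:
        people_by_doc.setdefault(doc, set()).add(person)

    edges = []
    for doc in sorted(people_by_doc):
        # Sorting makes the edge direction deterministic (source < target).
        for source, target in combinations(sorted(people_by_doc[doc]), 2):
            edges.append((source, target, doc))
    return edges


VERTICES = sorted({person for person, _ in PERSON_DOCS})
EDGES = build_edges(PERSON_DOCS)

for edge in EDGES:
    print(edge)
```

Each edge row carries the document that links the two people, which is the kind of information a graph workspace needs alongside the vertex table.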

5. Search for relationships between documents and people (Text Search & Text Mining)

Once we visualise our graph it becomes apparent that we are missing a whole lot of intelligence around our documents. This is where Text Mining is great: it can tell us which documents are related, what the key terms of a document are, and which terms are related to each other. Text Mining was activated using the .hdbfulltextindex in step 2 above.
We thought HANA Text Search would be a better place to start, as everyone is familiar with searching.
Fuzzy search gives us fault tolerance for spelling errors and can find results that are close to the search criteria.
Searching with the CONTAINS() predicate identifies the document and can provide a text snippet with the search term highlighted with <b>.
Now that we know the document where the phrase occurs, we can ask Text Mining to tell us the related documents or the relevant terms from that document.
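
For readers who have not used fuzzy search before, the behaviour can be approximated in plain Python. This is only a conceptual sketch using the standard library's difflib; it is not HANA's fuzzy algorithm or scoring, and the sample documents, the function names and the 0.8 threshold are assumptions.

```python
import difflib
import re

# Hypothetical document texts keyed by document name.
DOCUMENTS = {
    "annual_report.pdf": "The Executive Board is led by Bill McDermott as chief executive.",
    "manifesto.pdf": "Our plan invests in schools and hospitals across the country.",
}


def similarity(a, b):
    """Case-insensitive similarity between two words, from 0.0 to 1.0."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_search(term, documents, threshold=0.8):
    """Return (document, best score, highlighted snippet) for close matches."""
    hits = []
    for name, text in documents.items():
        words = re.findall(r"\w+", text)
        best = max((similarity(term, w) for w in words), default=0.0)
        if best < threshold:
            continue  # nothing in this document is close enough

        def highlight(match):
            word = match.group(0)
            return f"<b>{word}</b>" if similarity(term, word) >= threshold else word

        hits.append((name, best, re.sub(r"\w+", highlight, text)))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# The search term is deliberately misspelled (one "t" missing).
HITS = fuzzy_search("McDermot", DOCUMENTS)
for name, score, snippet in HITS:
    print(name, round(score, 2), snippet)
```

The misspelled search term still finds the annual report, and the matching word comes back wrapped in <b> tags, which is the kind of fault tolerance and highlighting described above.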

6. Expose data for consumption (HANA Calculation Views)

We can explore the data using hand-coded SQL, but that is only good for those of us who understand SQL. Fixed SQL would not be suitable if you want to provide this capability to users. Here we used a traditional calculation view as we couldn’t get the Text Mining functions to work in the schema-less HDI container.

With the relevant terms function we first need a document, which we then pass into the text mining function.
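
The "document in, ranked terms out" shape of that call can be illustrated with a small TF-IDF sketch in Python. The weighting HANA Text Mining actually applies may differ; the corpus, the function names and the plain TF-IDF formula here are assumptions chosen only to show the idea.

```python
import math
import re
from collections import Counter

# Hypothetical, heavily shortened document contents.
CORPUS = {
    "annual_report.pdf": "revenue growth cloud revenue customers cloud cloud",
    "manifesto.pdf": "schools hospitals growth taxes schools",
    "hana_guide.pdf": "cloud database index database tables",
}


def tokenize(text):
    """Split text into lowercase alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())


def relevant_terms(doc_name, corpus, top=3):
    """Rank the terms of one document by term frequency * inverse document frequency."""
    doc_freq = Counter()
    for text in corpus.values():
        doc_freq.update(set(tokenize(text)))  # count each term once per document

    counts = Counter(tokenize(corpus[doc_name]))
    total_docs = len(corpus)
    scores = {
        term: count * math.log(total_docs / doc_freq[term])
        for term, count in counts.items()
    }
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda term: (-scores[term], term))[:top]


TERMS = relevant_terms("annual_report.pdf", CORPUS)
print(TERMS)
```

Terms that are frequent in the chosen document but rare elsewhere rise to the top, which is why "revenue" outranks "cloud" here even though "cloud" occurs more often.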

7. Combine the Output (SAP Analytics Cloud, Analytics Designer)

Individually the HANA calculation views can provide some interesting insight. Perhaps if we combine Fuzzy Text Search, Relevant Terms, Related Terms, Related Documents, Relevant Documents and Text Analysis we can expose something meaningful.

I tried combining the different calc views within a SAC story, but the user experience was not complete. We were not able to find a way to pass a parameter from a row within a table to another chart/data provider. The SAC linked analysis is very cool, but it couldn’t quite achieve what we needed. We have six calculation views, each with input parameters, and we want to pass values between them.

I heard that Design Studio / Lumira Designer-like functionality is now available within SAC, and we were keen to give this a try. The capability is released as Analytics Designer. Using Analytics Designer usually requires a small amount of scripting to get all the components working together.

In SAC we created an Analytic Application which would bring the calculation views together.

Here we can see the 6 calculation views that were used.

 

The wireframe of the output is below.  We decided to include 4 Text Mining functions, Fuzzy Text Search and the Text Analysis output.  The top 3 elements are driven by the search term, and the bottom 3 relate to a specific document selected.

Within our application, we select the model and visualisations we require and then format them as required.

We now link the variables that are related. The top 3 visualisations all use the input SEARCH_TERM.

Now we wanted to be able to click on the Document Name in the table and pass it to the three visualisations below. The piece of code below does that, although setVariableValue has currently been disabled in our tenant (2019.8.3).

When we launch the application we are prompted once for our search term.

After inputting our search term the SAC Analytic Application is displayed.

With that, we are able to explore our documents, search for people, places, anything we like and provide that insight to business users.  Thanks for reading and please provide your feedback in the comments below.