Take a Deep Breath and Think of Stars in the Sky on a Warm Still Night: Text Analysis Enterprise Configurations in SPS09
Enterprise. Now that word resonates. In this video Tahir Hussain Babar, aka Bob, of the SAP HANA Academy demonstrates how to use the Extraction_Core_Enterprise configuration for SAP HANA SPS09. This is such a contrast to yesterday, when I was writing about Text Analysis for the public sector. I am in a similar location but little things make a difference. Although I have many happy memories of working in Education, I have an innate dislike for the greyness which characterises much of the public sector. So with a lungful of breath at my disposal, armed with my laptop and access to vast, virtual clouds of information, I am ready to consider the stars, even though it’s still daylight and the sky is grey.
You may have guessed that I have watched the occasional episode of Star Trek and have even foisted its virtues on my unsuspecting children. Even doing my MCSE, which had an Enterprise module, could not corrupt my pristine image of seeking and experiencing. Every day I am taught new words by my children and see words I have come to know and love used, and in my opinion misused, in many ways. So for me Text Analysis, in a competitive, fast-moving environment where change is imposed by a multitude of internal and external factors, is exciting.
From the SAP perspective, ‘Text Analysis’ refers to the ability to do Natural Language Processing: to understand the text linguistically and apply statistical techniques to refine the results. This is an area of research which will always have huge scope for improvement, but for now let’s consider the potential that’s already out there with SPS09.
Bob starts as usual by outlining his tools for the day, which in this case are the Admin console for HANA Studio, within which he has two connections: one to a database running SPS08, the other to a database running SPS09. Neither schema has any tables in it at this stage.
As in the last video, Bob first shows you how the Extraction_Core_VOICEOFCUSTOMER configuration works in SPS08 so that he can contrast it with Extraction_Core_Enterprise in SPS09. He starts by creating a table in SQL with two columns: an integer primary key and a column to hold the text.
He then loads in four rows of data. The four rows are short sentences on SAP’s acquisition of Business Objects and subsequent events. Bob then previews the data loaded into the table.
He creates a full text index and then highlights a new feature for SPS09, which is the Extraction_Core_Enterprise configuration. In SPS08 there were five configurations, one of them being Extraction_Core_VOICEOFCUSTOMER.
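For anyone following along without the video, the SPS08 steps so far might look roughly like the sketch below. It is not Bob's actual script: the table, column and index names are hypothetical, the column length is a guess, and only the fourth sentence is quoted in this post, so that is the only row shown.

```sql
-- Hypothetical names throughout; run against the SPS08 connection.
CREATE COLUMN TABLE "TA_DEMO" (
  "ID"     INTEGER PRIMARY KEY,   -- key column, carried over into the results table
  "STRING" NVARCHAR(200)          -- assumed text column holding each sentence
);

-- The one sentence quoted in this post; the other three rows are not reproduced here.
INSERT INTO "TA_DEMO" VALUES (4, 'SAP AG releases SAP HANA SPS09 in 2014');

-- Full text index with text analysis switched on, using the SPS08 configuration.
CREATE FULLTEXT INDEX "TA_DEMO_IDX" ON "TA_DEMO" ("STRING")
  CONFIGURATION 'EXTRACTION_CORE_VOICEOFCUSTOMER'
  TEXT ANALYSIS ON;

-- HANA writes the extraction results to a generated table named $TA_<index name>.
SELECT * FROM "$TA_TA_DEMO_IDX";
```

The results table is where the TA_TYPE column discussed in the rest of this post lives.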
Bob discusses the output of the Text Analysis and Search (SPS08) and indicates that it does not really show the meaning of the data in this context. It does have its strengths: for example, it has found SAP HANA SPS09 as a product. Nevertheless, SPS09 with this particular configuration will greatly enrich the TA_TYPE values, or meanings, that can be extracted from that data.
Bob then demonstrates how the same output looks in SPS09 using the same scripts. The only difference is the configuration type, which you will notice has not been commented out.
Bob runs the code in this order: he creates the table, builds the index and then loads the data in batches so that he can discuss the meaning of the data. He does this by commenting out the lines he is not using.
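A sketch of what the SPS09 run might look like, in the order Bob describes. Again, the object names are my own placeholders rather than the ones in the video, and the earlier rows are left as a comment because their exact wording is not given in this post.

```sql
-- Hypothetical names throughout; run against the SPS09 connection.
CREATE COLUMN TABLE "TA_DEMO" (
  "ID"     INTEGER PRIMARY KEY,
  "STRING" NVARCHAR(200)          -- assumed text column
);

-- Same index statement as before; only the configuration line changes.
CREATE FULLTEXT INDEX "TA_DEMO_IDX" ON "TA_DEMO" ("STRING")
  -- CONFIGURATION 'EXTRACTION_CORE_VOICEOFCUSTOMER'
  CONFIGURATION 'EXTRACTION_CORE_ENTERPRISE'
  TEXT ANALYSIS ON;

-- Load one row at a time and inspect the results after each load.
-- Rows 1 to 3 would be inserted the same way before this one.
INSERT INTO "TA_DEMO" VALUES (4, 'SAP AG releases SAP HANA SPS09 in 2014');

SELECT * FROM "$TA_TA_DEMO_IDX";
```

Loading a row and querying straight away makes it easy to see which extracted entities belong to which sentence.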
Using the first line as an example, Bob discusses the difference between the outputs produced in SPS08 and SPS09 for the first row of data loaded. He focuses on TA_TYPE and contrasts it with the output for SPS08. You will notice that Bernard Liautaud is identified as a PERSON in SPS08, but in SPS09 this is made more specific by identifying him as an OrganisationFounder, because the text describes him as cofounding Business Objects.
Bob uses the same methodology with the second statement, highlighting that in SPS08 two lines of information were extracted compared to seven with SPS09. He then shows how WebI is a ProductRelease. This TA_TYPE assignment was made based on the Action “introduces”, which applied to the Product WebI.
On the third statement Bob shows how much more information has been gathered compared to SPS08, which found, amongst other things, only the organisation, currency and year. In SPS09, however, we can identify that it was a BuyEvent based on the start of the sentence, “SAP announced it would buy Business Objects”. SPS09 identifies the buyer and the company that was bought, indicated by OrganisationA and OrganisationB respectively, with the Action being “acquire”. SPS09 also identifies the stock price for Business Objects.
For the final statement, “SAP AG releases SAP HANA SPS09 in 2014”, in SPS08 you get the organisation, the product and the year, but in SPS09 you get a lot more information. This enrichment comes from the availability of more TA_TYPE values.
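If you want to make the same row-by-row comparison yourself, queries along these lines against the generated results table should do it. The index name TA_DEMO_IDX and the key column ID are hypothetical; substitute whatever your own table and index are called.

```sql
-- How many entities and facts were extracted per source row
-- (Bob notes two for the second statement in SPS08 against seven in SPS09).
SELECT "ID", COUNT(*) AS "EXTRACTED_ROWS"
FROM "$TA_TA_DEMO_IDX"
GROUP BY "ID"
ORDER BY "ID";

-- What was extracted from the final statement, and how each token was typed.
SELECT "ID", "TA_COUNTER", "TA_TOKEN", "TA_TYPE"
FROM "$TA_TA_DEMO_IDX"
WHERE "ID" = 4
ORDER BY "TA_COUNTER";
```

Running the same two queries on each connection gives a like-for-like view of what each configuration extracts.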
Bob concludes by noting that Extraction_Core_Enterprise works for English in SPS09 and that it contains rules for the extraction of entities and facts of particular interest to an enterprise domain: for example, membership information, affiliations, personnel or management changes, information about mergers, and organizational information such as location or contact information.
There has obviously been a leap from SPS08 to SPS09 and I suspect, as in previous versions, the number of supported languages will grow. However, given the complexity of the areas of language now being considered, this growth will require significant investment, which will be driven by languages and dialects that are identified with economic success. We have come a long way from when Sentiment Analysis, which helps determine the attitude of the author from the text, was considered pretty impressive.
To further complicate matters, there is always a debate around context in language. Up until now, Text Analysis in SAP HANA was characterised by its ability to identify the language, apply appropriate linguistic rules for that particular language and semantically interpret the data produced. For example, a company could analyze tweets in different languages, and HANA could classify each tweet according to the appropriate language rules and provide sentiment analysis like ‘strong positive’, ‘weak positive’ etc. However, in terms of languages supported, Sentiment Analysis has always lagged behind Text Analysis due to that maddening thing: context. This is particularly problematic in Japanese and Polish. I suspect that at some point, given the increasing support for a range of business cases which have extensive dictionaries and rules associated with them, SAP will have to make a choice between widening and deepening this technology. I suspect that they will adopt an extensible approach, one that lends itself to customization and will be configurable for evolutionary changes, which tries to widen and deepen at once.