Getting insight from text with Predictive Analytics 2.1 automated mode

PPaolo · ‎05-27-2015

A few days ago I came across an interesting dataset about patient hospitalizations: for each hospitalized patient a record was entered in the system.

The data was interesting because, among various attributes, there was a field containing the initial diagnosis for the patient, as a text description, and a field showing if the same patient had been readmitted to the hospital after being released a first time.

The question that came to mind was: ‘Is it possible to see if there is a relationship between the initial diagnosis for a patient and the need for re-hospitalization?’ In other words, if a correlation exists, which health problems are more likely to require readmission to the hospital?

SAP Predictive Analytics 2.1 (SAP PA 2.1) has a text analysis workflow in its Automated Analytics interface which helps answer this kind of question, so I gave it a try.

The goal was to quickly understand whether there was a relationship and what it was; there are probably ways to build a much better model than the one I am presenting here. This post is by no means meant to be a best-practices document; it is rather a quick overview of what is possible with the Text Analytics component of SAP PA 2.1.

In this test I wanted to keep things simple and just see whether some useful insight was there, without necessarily trying to build the best possible model. So, out of laziness, I used the default settings everywhere.

Let’s go!

Just a hint before we start: the pictures below don't always look good in a plain browser. Try the “View as PDF” option at the right of this page and you will see better pictures.

In the image below you can see an excerpt of the hospitalization file. Most entries were removed, and only the initial diagnosis (Primary_Diag) and readmission (Readmitted_Flag) fields, which serve this example, were kept.

In SAP PA 2.1 you find the text analysis module under [Data Manager]. When you click on [Perform a Text Analysis] you arrive at the first page, where you need to choose the appropriate action.

In our case we want to know whether or not a patient is likely to be readmitted, based on the diagnosis. This is a typical example of classification, where the output is a choice between two values, so you click on [Add a Classification/Regression].

You are now asked to select the file on the [Reference Data Set] page. Do so and click [Next].

The [Data Description] page appears. The first thing to do is to click the [Analyze] button. All the fields of the source file are displayed.

On this page you have to set the [Value] attribute of the fields you want to text-analyze to [Textual].

In the picture below you see that I set that value for the Primary_Diag field. This field contains text written by a doctor, which provides the first diagnosis for the patient.

The [Textual] setting tells the application that the field has to undergo a text analysis. In this case the string won’t be treated as a single-value field; instead, each word will be extracted and used as a separate attribute. In general you can set multiple text fields to [Textual]: all fields with that setting contribute to the list of words used in the Classification module.
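As a rough illustration of what "each word becomes a separate attribute" means, here is a minimal Python sketch. It is not how SAP PA implements the feature; the function name `words_to_attributes` and the two sample diagnosis strings are made up for this example.

```python
import re

def words_to_attributes(texts):
    """Turn each text into a dict of 0/1 flags, one flag per distinct word."""
    # Lowercase and keep runs of letters only (a deliberately simple tokenizer).
    tokenized = [re.findall(r"[a-z]+", text.lower()) for text in texts]
    # The vocabulary is every distinct word seen across all texts.
    vocabulary = sorted({word for words in tokenized for word in words})
    rows = []
    for words in tokenized:
        present = set(words)
        rows.append({word: int(word in present) for word in vocabulary})
    return vocabulary, rows

# Hypothetical diagnosis strings, for illustration only.
diagnoses = [
    "Closed fracture of neck of femur",
    "Acute respiratory failure",
]
vocabulary, rows = words_to_attributes(diagnoses)
print(vocabulary)
print(rows[0])
```

Each row now has one column per word, which is the kind of table a classification model can work with.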

You click on [Next] and arrive at the first page of [Text Coding Parameters Settings].

This page and the following one let you specify in great detail which words to consider for the analysis. You can, for example, define which words should be excluded (by default the application removes unnecessary words such as ‘and’, ‘is’, etc., and you can define your own), or you can group words under a single term (e.g. ‘laptop’, ‘desktop’ and ‘mainframe’ can be grouped under the same word ‘computer’). You can also specify how to calculate the weight of each word (just 1 or 0 depending on whether it appears or not, the number of occurrences in the texts, or other choices).
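The three options described above (excluded words, grouped words, and word weighting) can be sketched in a few lines of Python. This is only an illustration under assumptions: the stop list, the `encode` function and its `weighting` parameter are hypothetical names and do not correspond to actual SAP PA settings.

```python
import re
from collections import Counter

# Assumed stop list and synonym table, for illustration only.
STOP_WORDS = {"and", "is", "of", "the"}
SYNONYMS = {"laptop": "computer", "desktop": "computer", "mainframe": "computer"}

def encode(text, weighting="binary"):
    """Drop stop words, map synonyms to one term, then weight each term."""
    words = re.findall(r"[a-z]+", text.lower())
    words = [SYNONYMS.get(w, w) for w in words if w not in STOP_WORDS]
    counts = Counter(words)
    if weighting == "binary":
        # 1 if the term appears at all; absent terms are simply not listed.
        return {term: 1 for term in counts}
    if weighting == "count":
        # Number of occurrences of the term in the text.
        return dict(counts)
    raise ValueError(f"unknown weighting: {weighting}")

print(encode("The laptop and the desktop is slow", weighting="binary"))
print(encode("The laptop and the desktop is slow", weighting="count"))
```

With binary weighting the two grouped words collapse into a single flag, while count weighting keeps track of the fact that the grouped term occurred twice.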

For the sake of simplicity we are going to accept the default settings. The only thing we set is that we want the analysis to be performed in English, as shown in the next picture.

You click [Next] to arrive at the second [Text Coding Parameters Settings] page, which we leave as-is to accept the defaults, and click [Next] again.

Now SAP PA 2.1 analyzes the textual fields and retrieves all useful words from them. The next screen displays the list of words and their frequency in the dataset.

You can see that the application has extracted the root of each word (e.g. the root ‘local’ represents words such as ‘localized’, ‘localization’, ‘location’). The analysis is going to be performed using the root words. Click [Next].
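To give an idea of what root extraction does, here is a toy suffix-stripping function in Python. It is a deliberately naive sketch, not the stemming algorithm used by SAP PA (which is not described in this post); the suffix list and the name `naive_root` are assumptions made for the example.

```python
# Toy suffix list, ordered so that longer suffixes are tried first.
SUFFIXES = ("ization", "ized", "ize", "ing", "s")

def naive_root(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ("localized", "localization", "local"):
    print(word, "->", naive_root(word))
```

All three inputs map to the same root, so they count as one variable instead of three. A real stemmer handles far more cases than this sketch.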

On the following page you see all of the variables which have been generated for the analysis. Each root found in the Textual fields appears here with a “tc_” prefix.

As shown in the picture below, for our analysis we are going to exclude two variables: the CountInformation and EffectiveRoot generated from the field Primary_Diag. Those variables count the number of roots and are not useful for our model.

In the [Explanatory Variables Selected] list we keep all the word roots extracted from the Primary_Diag field.

Click [Next] and then [Generate] to build the model.

With my sample file I had a Ki of 0.7790, which is quite OK for a very simple model (Ki is an index showing the quality of the model; for more information about Ki, please read the product documentation).

We can now start checking whether there is any relationship between a readmission and some words in the initial diagnosis.

On the [Using the Model] page, click on [Contributions by variables] and you see a prioritized list of the roots which contribute most to the readmission of a patient into the hospital.

We see here that 'intertrochanter', 'respiratori' and 'local' are the three words which have the biggest influence on a readmission. Let's see now how they are impacting it.

If you double-click on the first entry (“intertrochanter”) you see that when the category is [1] (hence the root is in the Primary_Diag field) the influence is on the left side of the graph and is positive (hence there is a positive influence on the Readmitted_Flag field). This is shown in the image below.

In ‘business terms’, this means that if a word with the root “intertrochanter” appears in the initial diagnosis then there is a higher probability that the patient will be readmitted.

Looking now at the contribution of the root “local” we see exactly the opposite situation: the readmission is more likely when the root “local” is NOT in the initial diagnosis. This can be seen by the value 0 in the category at the left part of the graph as shown below.

This is already a good result: patients with ‘intertrochanter’ in the diagnosis of the dataset were more likely to be readmitted, while patients with the root word ‘local’ were more likely not to be readmitted.
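A simple way to sanity-check this kind of finding outside the tool is to compare the readmission rate when a root is present (category 1) with the rate when it is absent (category 0). The Python sketch below does just that. It is not how SAP PA computes contributions, the function name is hypothetical, and the six records are invented toy data, not values from my dataset.

```python
def readmission_rate_by_presence(records, root):
    """Return {0: rate when root is absent, 1: rate when root is present}.

    records is an iterable of (set_of_roots, readmitted_flag) pairs,
    where readmitted_flag is 0 or 1.
    """
    totals = {0: [0, 0], 1: [0, 0]}  # category -> [readmitted, total]
    for roots, readmitted in records:
        category = int(root in roots)
        totals[category][0] += readmitted
        totals[category][1] += 1
    # A category with no records has no defined rate, so report None.
    return {c: (r / n if n else None) for c, (r, n) in totals.items()}

# Invented toy records, for illustration only.
records = [
    ({"intertrochanter", "femur"}, 1),
    ({"intertrochanter", "neck"}, 1),
    ({"local"}, 0),
    ({"local", "respiratori"}, 0),
    ({"respiratori"}, 1),
    ({"femur"}, 0),
]
print(readmission_rate_by_presence(records, "intertrochanter"))
print(readmission_rate_by_presence(records, "local"))
```

In this toy data the rate is higher in category 1 for ‘intertrochanter’ and higher in category 0 for ‘local’, which mirrors the two situations described above.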

We can find additional information by going back to the Contributions by variables graph and hovering with the mouse pointer over the ‘intertrochanter’ bar, as shown in the figure below.

You can see that there is a correlation between the ‘intertrochanter’ root and other roots (‘neck’, ‘femur’). It is likely that those words are used together or bring a similar contribution to the readmission flag.
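One way to picture a correlation between roots is to compute the Pearson correlation between their 0/1 presence columns: roots that tend to appear in the same diagnoses get a value close to 1. The sketch below is an assumption about how one could measure this by hand, not a description of SAP PA's internals, and the indicator columns are invented toy data.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / sqrt(var_x * var_y)

# Invented presence flags for six diagnoses (1 = root appears in the text).
intertrochanter = [1, 1, 0, 0, 0, 0]
femur = [1, 1, 0, 0, 0, 1]
local = [0, 0, 1, 1, 0, 0]

print(round(pearson(intertrochanter, femur), 3))
print(round(pearson(intertrochanter, local), 3))
```

In this toy data ‘intertrochanter’ and ‘femur’ are positively correlated because they mostly occur together, while ‘intertrochanter’ and ‘local’ never share a diagnosis and come out negatively correlated.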

To summarize, we have seen that ‘intertrochanter’ has a strong influence on the readmission of a patient and, conversely, the word ‘local’ is less likely to be associated with a patient who needs readmission.

Moreover, we can see that those words relate to other expressions in the texts: going back to the field entries we see that the root ‘intertrochanter’ appears in diagnoses such as “Closed fracture of intertrochanteric section of neck of femur”.

This is a great insight for a few minutes of work.

Unfortunately I cannot share this file on the web but you can still run a whole text analysis workflow if you want.

First, if you don’t have it already, download SAP Predictive Analytics 2.1 for free from this link: www.sap.com/trypredictive

Then go to this page and look for the video showing how to perform a text analysis: http://scn.sap.com/docs/DOC-32651

The demo file used in the video is available in your installation of PA 2.1 under the Samples/KTC directory.

Or, better, try it out with a file of your own!

Have fun with PA 2.1 and, if you have any feedback on how SAP could improve the product, drop your suggestion in Ideaplace here: https://ideas.sap.com/PredictiveAnalytics
