Text mining in SAP HANA Cloud with Python
Why should you care?
If you are developing in SAP HANA Cloud, good news: you don't have to move your data in or out of SAP HANA Cloud to apply text mining techniques.
Would you like to enhance your advanced analytics reports with insights stemming from text? For example:
- Which spare part is mentioned most frequently in high-criticality issues reported by field engineers to the manufacturing plant?
- What are the top three operations performed each month by the maintenance department?
All you need is a connection to an SAP HANA Cloud instance. Consolidate your results in a calculation view, and your BI engineers can enhance their reports in SAP Analytics Cloud (SAC).
Last but not least, you can classify an input document against sets of categories.
This blog will be a series of two posts, starting with this one, which covers the very basics of the new functionality. We will use a Jupyter notebook and a tiny sample dataset with three columns:
- Document ID as a unique id per row
- Free text mocking an S/4 Plant Maintenance ticket notification.
- If you are not familiar with this S/4 module: a notification is how a maintenance department receives a request to check abnormal or exceptional situations in the technical objects of a plant. Among many other fields, the ticket includes a free-text description of the problem.
- Priority (High or Low) of a ticket.
The second blog post will show how we can use another SAP BTP component (Cloud Foundry or the Kyma runtime) to build an application on top of these functions and make them available to our business users.
Set up your Jupyter notebook
1. Create a Jupyter notebook and import the hana_ml library:
# SAP LIBRARIES
import hana_ml
import hana_ml.dataframe as dataframe
import hana_ml.text.tm as tmh

# EXTERNAL LIBRARIES
import pandas as pd

# SETTINGS
pd.options.display.max_columns = None
pd.options.display.max_rows = None
pd.set_option('display.max_colwidth', 100)
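The snippets below use a connection object called conn that is not shown in the listings. A minimal sketch of creating it with hana_ml's ConnectionContext, assuming placeholder host and credentials for your own SAP HANA Cloud instance:

# Minimal sketch of the `conn` object used throughout this blog.
# Replace the placeholder address, user and password with the values
# for your own SAP HANA Cloud instance.
conn = dataframe.ConnectionContext(
    address='<your-instance>.hanacloud.ondemand.com',
    port=443,
    user='<your-user>',
    password='<your-password>',
    encrypt=True
)
print(conn.hana_version())  # quick sanity check that the connection works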
2. Create two CSV files in the same directory.
First file: call it sample-text.csv, paste in the text below, and save it.
ID,NOTIFICATION_TEXT,PRIORITY
DOC1000,repair damaged pipe,HIGH
DOC2000,replace main valve,HIGH
DOC3000,repair oil leak,HIGH
DOC4000,paint doors,LOW
DOC5000,clean filters,LOW
DOC6000,maintain A/C filters,LOW
Second file: call it new-text.csv, paste in the text below, and save it. (Note that we are not including a third column with the priority; this dataset will be used to showcase the text classification function.)
ID,NOTIFICATION_TEXT
DOC7000,repair main valve
DOC8000,clean debris
Import CSV files to SAP HANA Cloud
table_name = "TM_SAMPLE_TEXT"
file_name = "sample-text.csv"
df_data = pd.read_csv(file_name, sep=',', encoding='unicode_escape')
dataframe.create_dataframe_from_pandas(connection_context = conn,
pandas_df = df_data,
table_name = table_name,
force = True,
replace = True)
df_remote = conn.table("TM_SAMPLE_TEXT")
df_remote.collect()
If all worked alright, it should look like this:
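(The exact pandas formatting may differ on your side, but the content is simply sample-text.csv echoed back from the table in SAP HANA Cloud.)

ID       NOTIFICATION_TEXT     PRIORITY
DOC1000  repair damaged pipe   HIGH
DOC2000  replace main valve    HIGH
DOC3000  repair oil leak       HIGH
DOC4000  paint doors           LOW
DOC5000  clean filters         LOW
DOC6000  maintain A/C filters  LOW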
Repeat for the second CSV file:
table_name = "TM_NEW_TEXT"
file_name = "new-text.csv"
df_data = pd.read_csv(file_name, sep=',', encoding='unicode_escape')
dataframe.create_dataframe_from_pandas(connection_context = conn,
pandas_df = df_data,
table_name = table_name,
force = True,
replace = True)
df_remote_new = conn.table("TM_NEW_TEXT")
df_remote_new.collect()
Tip 1: In real projects, avoid collecting all the data by calling .collect() on the dataframe as shown above. Remember that doing so pulls all the data into the Jupyter notebook! Instead, use df_remote.head(5).collect() to check your data and keep the heavy lifting in SAP HANA Cloud.
Now all the information is held in the df_remote object. Perform a term frequency analysis by simply passing the df_remote object as a parameter to the tf_analysis function:
tfidf = tmh.tf_analysis(df_remote)
tfidf[0].head(5).collect()
tfidf[1].head(5).collect()
tfidf[2].head(5).collect()
The returned object is a tuple of hana_ml dataframes; you can retrieve each one by index, from 0 to 2, if you want to inspect the calculated TF-IDF values and related tables.
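If you are unsure what each element of the tuple holds, a quick way to inspect them (a small sketch, not part of the original listing) is to print the column names exposed by the hana_ml dataframe's columns property:

# Print the index and column names of each dataframe in the tuple
for i, df in enumerate(tfidf):
    print(i, df.columns)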
Once you have the tfidf object initiated, it can also be used as input to other text mining functions. Let's see an example:
get_related_doc
This function returns the top-ranked related documents for a query document, based on a Term Frequency – Inverse Document Frequency (TF-IDF) result or reference data.
Get a sample query document from the new dataset. In this case we select just the first one, with the text "repair main valve", and pass this unseen document together with the tfidf object to get_related_doc.
#get one line from the unseen documents
new_text = conn.sql('SELECT "NOTIFICATION_TEXT" FROM ({}) LIMIT 1 '.format(df_remote_new.select_statement))
display(new_text.collect())
#run the get_related_doc
get_related_doc_rs = tmh.get_related_doc(new_text, tfidf)
#display the list of top N related documents with their matching score
display(get_related_doc_rs.collect())
#join the result with the initial table to show the text alongside the scores
_result = get_related_doc_rs.alias('A').join(
df_remote.alias('B'),
"A.ID = B.ID"
).sort("SCORE",desc="TRUE").deselect(["ID","PRIORITY"]).collect()
display(_result)
Tip 2: Note the recent addition to the hana_ml dataframe API that lets you deselect columns. No more selecting all the columns one by one just to skip the ones you don't need!
The results look like this:
The function has returned three documents, with the top one being “replace main valve” whereas the new document was “repair main valve”.
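As a trivial illustration of the deselect call mentioned in Tip 2, you could also drop just the PRIORITY column from the sample table (a quick sketch, not part of the original listing):

# Keep all columns except PRIORITY and peek at the first rows
df_remote.deselect(["PRIORITY"]).head(3).collect()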
get_suggested_term
This function returns the top-ranked terms that match an initial substring, based on a Term Frequency – Inverse Document Frequency (TF-IDF) result or reference data.
Let's hack our way to retrieving one term from a document. What we do below is retrieve just the leftmost three characters. The full string is "repair main valve", so we expect to get just "rep" by using an index of 3.
_index = 3
sub_term = conn.sql('SELECT LEFT(NOTIFICATION_TEXT,{}) as "SUB_STRING_" \
FROM ({}) LIMIT 1 '.format(_index,df_remote_new.select_statement))
display(sub_term.collect())
Let's use the get_suggested_term function to get some results.
_result = tmh.get_suggested_term(sub_term,df_remote)
_result.collect()
The results are two terms, "repair" and "replace", with the first one on top.
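You are not limited to prefixes extracted from existing documents; any one-row hana_ml dataframe holding a substring will do. A sketch, assuming you build the prefix directly from a SQL literal against the DUMMY table and keep the same column alias as above:

# Suggest terms for the arbitrary prefix "cle"
sub_term_cle = conn.sql('SELECT \'cle\' AS "SUB_STRING_" FROM DUMMY')
tmh.get_suggested_term(sub_term_cle, df_remote).collect()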
text_classification
Last but not least, the text classification function. This is where we need both of the dataframes created at the beginning of the blog. The function is based on the k-nearest neighbors algorithm; the default k is 1, so each new document is assigned the class of its nearest neighbor. Just pass the two dataframes to the text_classification function.
res = tmh.text_classification(df_remote_new, df_remote)
display(res[0].collect())
display(res[1].collect())
The returned object consists of two dataframes, which can be retrieved by index.
The first one contains the new documents with their assigned category; the second contains the matching distance and the matched document.
Let's run some basic SQL using the hana_ml join operators to see the actual text of the unseen documents next to the matched ones.
res[1].alias('A').join(
df_remote.alias('B'),
"A.TRAIN_ID = B.ID"
).select(
("TEST_ID","NEW_DOC_ID"),
("NOTIFICATION_TEXT","MATCHED_NOTIFICATION_TEXT"),
("TRAIN_ID","MATCHED_ID"),
("PRIORITY","MATCHED_PRIORITY"),
("DISTANCE")
).alias('C').join(
df_remote_new.alias("D"),
"C.NEW_DOC_ID = D.ID"
).select(
"NEW_DOC_ID",
"MATCHED_ID",
("NOTIFICATION_TEXT","NEW_NOTIFICATION_TEXT"),
"MATCHED_NOTIFICATION_TEXT",
"DISTANCE",
"MATCHED_PRIORITY"
).collect()
Outro
From a developer's point of view, it is a relief to know that once you have connected your Jupyter notebook to an SAP HANA Cloud instance, you have scalable and secure access to functions that can handle any data type.
From a business point of view, it is easier than ever to form a business hypothesis that requires data preparation, exploration and visualisation capabilities and to get a report back on the spot (fail fast, or make an impact faster).
Upgrade to the latest SAP HANA Cloud, start formulating business use cases, and significantly improve your traditional analytics approaches by combining all sorts of data (and data types).
Thanks for reading!
Alex
Alex, this is extremely helpful and the blog is extremely well written and informative. Thank you!
Many thanks Rafael!
Hi,
we get
Error: (328, 'invalid name of function or procedure: no procedure with name PAL_TF_ANALYSIS found: line 9 col 15 (at pos 280)')
Is there something to be installed on HANA?
I have AFLPAL installed, however.
Any help appreciated.
Regards
Marcus
Hello Marcus,
thanks for reaching out. I am assuming you are using a HANA Cloud instance.
Text mining (TM) functions were made available with the 2021_01_QRC HANA Cloud release via the PAL library.
Could you please check whether your HANA Cloud version is the latest one?
You can do this by executing the following command at the SQL Console:
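One query that returns this information (a sketch, shown here via the notebook's conn object rather than the SQL console) is a SELECT against the system view SYS.M_DATABASE:

# The VERSION column typically contains the build code in parentheses
conn.sql('SELECT VERSION FROM "SYS"."M_DATABASE"').collect()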
You will get back a single column containing a version string with a code similar to the one below.
fa/CE2021.8
You can use this link to check whether your version includes the TM functions.
Thanks!
Alex
Thanks for the reply.
It turned out that we needed to raise an OSS request to activate the script server on our HANA Cloud instance to get the text mining features ready.
Although other APL procedures were available, the text mining functions only became visible after this step.
Regards
Marcus