COPD study, explanation and interpretability with ...

raymond_yao · 12-16-2020

Chronic obstructive pulmonary disease (COPD) is a type of obstructive lung disease. Globally, an estimated 3.17 million deaths were caused by this disease in 2015. The key risk factors are exposure to indoor and outdoor air pollution, tobacco smoke (especially secondhand smoke), dusts and fumes. In this blog post, I'd like to introduce two new features of the Python machine learning client for SAP HANA, the dataset report and the model report, and use them to study COPD cases. These two features make it much more convenient for data scientists to analyze their data and their trained models. Let's go through a use case to see how they work.

The data comes from a PoC in China and has been desensitized. First, we use the hana_ml package to connect to an SAP HANA instance and upload the COPD data into a table called "COPD_DEMO". The columns whose names contain Chinese characters are then renamed.

import hana_ml
import pandas as pd

# Connect to the SAP HANA instance.
conn = hana_ml.ConnectionContext(address="lsvxc0103.sjc.sap.corp", port=30315, user="PAL_USER")
# Upload the CSV file into the table COPD_DEMO.
data = hana_ml.dataframe.create_dataframe_from_pandas(conn, table_name="COPD_DEMO", pandas_df=pd.read_csv("COPD.csv"), force=True, drop_exist_tab=True)
# Rename the unnamed index column and the columns with Chinese names.
data = data.rename_columns({"Unnamed: 0": "ID", "性别": "GENDER", "年龄": "AGE", "身高": "HEIGHT", "体重": "WEIGHT", "糖尿病": "DIABETES", "高血压": "HYPERTENSION", "住院史": "HOSPITALIZATION", "长期接触粉尘": "LONG DUST EXPOSURE"})
data.head(3).collect()

Let's explore the dataframe data with the dataset report.

from hana_ml.visualizers.dataset_report import DatasetReportBuilder

datasetReportBuilder = DatasetReportBuilder()
datasetReportBuilder.build(data, key="ID")

We can speed up the report build by providing a sampling method. In this blog post, we use the default settings.

The report can be rendered in the notebook via the generate_notebook_iframe_report() function.

datasetReportBuilder.generate_notebook_iframe_report()

The report is interactive. We can click the Overview button first.

We can move to the Variables tab to see the distribution and statistics of a given variable.

We can also look at the scatter matrix for this dataframe.
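The scatter matrix gives a visual impression of how pairs of numeric columns move together. As a rough, hand-rolled counterpart, the sketch below computes the Pearson correlation coefficient in plain Python. The function name pearson and the height and weight values are made up for illustration; they are not taken from the COPD data or from the report.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance up to a constant factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up values, only to exercise the function.
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 69, 80, 91]
r_height_weight = pearson(heights, weights)
print(round(r_height_weight, 3))
```

A value close to 1 or -1 corresponds to points that lie near a straight line in the scatter plot, while a value near 0 corresponds to a cloud with no linear trend.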

From the dataset report, we can view the statistics of all the columns, their distributions and their correlations. According to the report, the features "HEIGHT" and "WEIGHT" follow a normal distribution, and the features "DIABETES", "HYPERTENSION" and "LONG DUST EXPOSURE" are unbiased (their values are roughly balanced). As the next step, we will use the unified classification API to train a model that predicts COPD for new records.

from hana_ml.algorithms.pal import metrics
from hana_ml.algorithms.pal.unified_classification import UnifiedClassification, json2tab_for_reason_code
from hana_ml.algorithms.pal.model_selection import RandomSearchCV

# Use the hybrid gradient boosting tree as the classifier.
uc_hgbt = UnifiedClassification(func='hybridgradientboostingtree')
# Random search over the grid below with 5-fold cross-validation, scored by AUC.
rscv = RandomSearchCV(estimator=uc_hgbt,
                      param_grid={
                          "split_threshold": [1e-5, 1e-7],
                          "learning_rate": [0.1, 0.01, 0.5],
                          "n_estimators": [6, 10],
                          "max_depth": [10, 12]},
                      train_control={"fold_num": 5, "resampling_method": "cv", "random_search_times": 3},
                      scoring="auc"
                      )
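To make these search settings more concrete, here is a minimal pure-Python sketch of what a random search over this grid amounts to. It is not how PAL implements the search internally; the helper names all_combinations and random_candidates and the seed are hypothetical and only serve the illustration. The sketch shows that the grid holds 24 combinations, of which 3 are drawn to match random_search_times.

```python
import itertools
import random

# Same grid as passed to RandomSearchCV in this post.
param_grid = {
    "split_threshold": [1e-5, 1e-7],
    "learning_rate": [0.1, 0.01, 0.5],
    "n_estimators": [6, 10],
    "max_depth": [10, 12],
}


def all_combinations(grid):
    """Enumerate every parameter combination of the grid as a dict."""
    names = sorted(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in itertools.product(*value_lists)]


def random_candidates(grid, times, seed):
    """Draw `times` distinct combinations at random (hypothetical helper)."""
    rng = random.Random(seed)
    return rng.sample(all_combinations(grid), times)


combos = all_combinations(param_grid)
candidates = random_candidates(param_grid, times=3, seed=2)
print(len(combos), "combinations in the grid")
for candidate in candidates:
    print(candidate)
```

In this simplified picture, with fold_num set to 5, each of the 3 candidates is evaluated on 5 folds, so 15 models are fitted before the final one is trained.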



# Fit on a stratified 70% training partition; the rest is used for validation.
rscv.fit(data=data,
         key='ID',
         label='COPD',
         partition_method='stratified',
         stratified_column='COPD',
         partition_random_state=2,
         training_percent=0.7,
         ntiles=2)
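The partition settings in the fit call ask for a 70/30 split that preserves the share of each COPD class. A minimal sketch of that idea in plain Python follows. It only illustrates stratified partitioning in general, not the PAL implementation; the helper name stratified_split and the toy rows are assumptions made up for this example.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_of, training_percent, seed):
    """Split rows into train/validation, keeping the class shares in both parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)
    train, valid = [], []
    for label in sorted(by_label):
        group = list(by_label[label])
        rng.shuffle(group)
        # Take the same fraction from every class.
        cut = round(len(group) * training_percent)
        train.extend(group[:cut])
        valid.extend(group[cut:])
    return train, valid


# Toy data: 100 rows, 20 of them labelled "Y" (made up, not the COPD table).
rows = [{"ID": i, "COPD": "Y" if i % 5 == 0 else "N"} for i in range(100)]
train, valid = stratified_split(rows, lambda r: r["COPD"], training_percent=0.7, seed=2)
print(len(train), len(valid))
```

Because the split is done per class, the rarer class keeps the same 20% share in both the training and the validation part.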

We use the RandomSearchCV class to tune the hyper-parameters. After the fit, we call generate_notebook_iframe_report() on the estimator to debrief the model.

uc_hgbt.generate_notebook_iframe_report()

The model report provides detailed training and validation statistics, including precision, recall, F1-score, kappa and AUC, among others.

It also provides the confusion matrix, the variable importance and the ROC curve. The variable importance is shown as both a pie chart and a bar chart.
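To see how these statistics relate to the confusion matrix, the following sketch derives precision, recall, F1-score, accuracy and kappa from the four cells of a binary confusion matrix. The function name binary_metrics and the counts are invented for illustration and are not results from the COPD model. It also assumes that the kappa shown in the report is Cohen's kappa.

```python
def binary_metrics(tp, fp, fn, tn):
    """Derive common classification metrics from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    if total == 0 or tp + fp == 0 or tp + fn == 0:
        raise ValueError("metrics are undefined for these counts")
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    accuracy = (tp + tn) / total
    # Agreement expected by chance, from the row and column totals.
    p_yes = ((tp + fp) / total) * ((tp + fn) / total)
    p_no = ((fn + tn) / total) * ((fp + tn) / total)
    expected = p_yes + p_no
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    kappa = (accuracy - expected) / (1 - expected)
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy, "kappa": kappa}


# Made-up counts: 40 true positives, 10 false positives,
# 20 false negatives, 30 true negatives.
m = binary_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in m.items():
    print(name, round(value, 3))
```

Kappa corrects accuracy for the agreement that would occur by chance, which is why it is lower than the accuracy for the same counts.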

In the next step, we will make a prediction for a new candidate and use the reason code to understand the contribution of each feature.

We create a new HANA table "COPD_DEMO_PRED" with the new data via hana_ml and use the predict function to make the prediction.

pred = pd.DataFrame({"ID": [1],
                     "GENDER": ['M'],
                     "AGE": [60],
                     "HEIGHT": [160],
                     "WEIGHT": [100],
                     "DIABETES": ['Y'],
                     "HYPERTENSION": ['N'],
                     "HOSPITALIZATION": [12],
                     "LONG DUST EXPOSURE": ['Y']})
pred_data = hana_ml.dataframe.create_dataframe_from_pandas(conn, table_name="COPD_DEMO_PRED", pandas_df=pred, force=True)
pred_res = uc_hgbt.predict(pred_data, key='ID')
pred_res.collect()

Let's use the json2tab_for_reason_code() function to create a table view of the reason code.

json2tab_for_reason_code(pred_res).collect()
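Conceptually, this step flattens one JSON string per prediction into one row per feature. The sketch below mimics that with the standard json module. It assumes the reason code is a JSON array of objects with the keys attr, val and pct. This layout, the sample string and the helper name reason_code_rows are assumptions for illustration, so check the actual reason code column of pred_res before relying on it.

```python
import json


def reason_code_rows(record_id, reason_code_json):
    """Flatten one reason-code JSON string into (id, attribute, value, percent) rows."""
    entries = json.loads(reason_code_json)
    rows = [(record_id, e["attr"], e["val"], e["pct"]) for e in entries]
    # Show the strongest contributions first, regardless of sign.
    return sorted(rows, key=lambda row: abs(row[2]), reverse=True)


# Hypothetical reason code for the record with ID 1 (values are made up).
sample = (
    '[{"attr": "AGE", "val": 0.3, "pct": 30.0},'
    ' {"attr": "LONG DUST EXPOSURE", "val": 0.6, "pct": 60.0},'
    ' {"attr": "WEIGHT", "val": -0.1, "pct": 10.0}]'
)
rows = reason_code_rows(1, sample)
for row in rows:
    print(row)
```

Sorting by absolute value puts the features that pushed the prediction hardest at the top, whether they pushed towards or away from the predicted class.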

The Python machine learning client for SAP HANA provides not only a user-friendly machine learning interface but also very useful visualization tools for data analysis and investigation.

If you want to learn more about hana_ml and SAP HANA Predictive Analysis Library (PAL), please refer to the following links:

Outlier Detection using Statistical Tests in Python Machine Learning Client for SAP HANA

Outlier Detection by Clustering using Python Machine Learning Client for SAP HANA

Anomaly Detection in Time-Series using Seasonal Decomposition in Python Machine Learning Client for ...

Outlier Detection with One-class Classification using Python Machine Learning Client for SAP HANA

Learning from Labeled Anomalies for Efficient Anomaly Detection using Python Machine Learning Client...

Additive Model Time-series Analysis using Python Machine Learning Client for SAP HANA

Time-Series Modeling and Analysis using SAP HANA Predictive Analysis Library(PAL) through Python Mac...

Import multiple excel files into a single SAP HANA table

Weibull Analysis using Python machine learning client for SAP HANA