Weibull Analysis using Python machine learning cli...

raymond_yao · 01-06-2021

Weibull analysis is used to analyze and forecast the lifetime of products. In this blog post, I'd like to show how to use the Python machine learning client for SAP HANA (hana-ml) to perform a Weibull analysis.

The data comes from a PoC in China.

First, we import the required packages and connect to my SAP HANA instance.

import pandas as pd

from hana_ml.dataframe import create_dataframe_from_pandas, ConnectionContext, DataFrame



conn = ConnectionContext(address="lsvxc0103.sjc.sap.corp", port=30315, user="PAL_USER")

The Excel file can be loaded into an SAP HANA table via pandas and the hana-ml package. The Chinese column names can then be renamed with the rename_columns() function.

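# The file name means "exhaust pipe related failure examples"; the columns 生产日期 and 故障日期 mean "production date" and "failure date".

pf = pd.read_excel("./排气管相关故障示例.xlsx")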

wc_df = create_dataframe_from_pandas(conn,

                                     pandas_df=pf,

                                     table_name="wc_data",

                                     force=True,

                                     table_structure={"生产日期": "DATE", "故障日期": "DATE"})

wc_df = wc_df.rename_columns({"生产日期": "PRODUCTION_DATE", "故障日期": "FAULT_TIME"})

wc_df.head(5).collect()
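As an alternative to renaming on the SAP HANA side, the columns could also be renamed in pandas before the upload. The following is a minimal sketch under that assumption; the one-row sample frame is invented for illustration and stands in for the Excel data. If this route were taken, the keys passed to table_structure would have to use the English names as well.

```python
import pandas as pd

# Hypothetical one-row stand-in for the Excel sheet (real data not shown in the post).
pf_sample = pd.DataFrame({"生产日期": ["2020-01-01"], "故障日期": ["2020-03-01"]})

# Rename the Chinese headers to the English names used later in the post.
renamed = pf_sample.rename(columns={"生产日期": "PRODUCTION_DATE", "故障日期": "FAULT_TIME"})

# Parse both columns as dates so they map cleanly to a DATE column on upload.
renamed = renamed.apply(pd.to_datetime)
print(renamed.dtypes)
```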

Next, we construct an SAP HANA dataframe in the shape required by hana-ml's Weibull fit: the number of days between production and failure, repeated in two columns, L and R.

weibull_prepare = DataFrame(conn, "SELECT SURVIVAL_DAYS L, SURVIVAL_DAYS R FROM (SELECT DAYS_BETWEEN(PRODUCTION_DATE, FAULT_TIME) SURVIVAL_DAYS FROM ({}))".format(wc_df.select_statement))
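To check the transformation locally, the same preparation can be sketched in pandas. This is only an illustration: the sample dates are invented, and the column names are the English ones produced by rename_columns() above. The subtraction plays the role of DAYS_BETWEEN, and both L and R carry the same value, as in the SQL.

```python
import pandas as pd

# Hypothetical sample rows; the real data comes from the uploaded Excel file.
sample = pd.DataFrame({
    "PRODUCTION_DATE": pd.to_datetime(["2020-01-01", "2020-02-15", "2020-03-10"]),
    "FAULT_TIME": pd.to_datetime(["2020-03-01", "2020-02-25", "2021-03-10"]),
})

# Equivalent of DAYS_BETWEEN(PRODUCTION_DATE, FAULT_TIME) in the SQL above.
survival_days = (sample["FAULT_TIME"] - sample["PRODUCTION_DATE"]).dt.days

# Same two-column layout as the query: L and R both hold the survival days.
weibull_input = pd.DataFrame({"L": survival_days, "R": survival_days})
print(weibull_input)
```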

In the next step, we use the distribution_fit() function to fit a Weibull distribution to the data.

from hana_ml.algorithms.pal.stats import distribution_fit, cdf



fitted, _ = distribution_fit(weibull_prepare, distr_type='weibull', censored=True)

fitted.collect()
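To make the fitting step less of a black box, here is a minimal sketch of maximum-likelihood estimation of the Weibull shape and scale for fully observed (uncensored) failure times. It is not PAL's algorithm, which this post does not describe; the function name fit_weibull_mle, the bisection bounds, and the synthetic data are all assumptions made for illustration.

```python
import math
import random


def fit_weibull_mle(samples, lo=0.01, hi=50.0, tol=1e-9):
    """Return (shape, scale) maximizing the Weibull likelihood.

    Assumes all samples are positive and uncensored (illustrative helper,
    not part of hana-ml).
    """
    # Rescale by the maximum so powers stay in [0, 1] and cannot overflow;
    # the shape estimate is unaffected by rescaling.
    biggest = max(samples)
    scaled = [x / biggest for x in samples]
    logs = [math.log(x) for x in scaled]
    mean_log = sum(logs) / len(logs)

    def score(k):
        # Likelihood equation for the shape k; its root is the MLE.
        powered = [x ** k for x in scaled]
        weighted = sum(p * l for p, l in zip(powered, logs))
        return weighted / sum(powered) - 1.0 / k - mean_log

    # score() increases with k, so bisection converges to the root.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if score(mid) > 0:
            hi = mid
        else:
            lo = mid
    shape = (lo + hi) / 2.0
    # Closed-form scale estimate given the shape, mapped back to original units.
    scale = biggest * (sum(x ** shape for x in scaled) / len(scaled)) ** (1.0 / shape)
    return shape, scale


# Synthetic lifetimes with known parameters (scale=500 days, shape=1.5).
random.seed(0)
data = [random.weibullvariate(500.0, 1.5) for _ in range(5000)]
shape, scale = fit_weibull_mle(data)
print(round(shape, 2), round(scale, 1))
```

On synthetic data like this, the estimates should land close to the parameters used to generate it, which is a quick sanity check before trusting a fit on real failure records.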

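The survival curve can be computed with the cdf() function by passing complementary=True, which yields the complementary CDF. We then apply the dataframe's diff() function to take row-to-row differences of survival_curve; this differenced curve is what the plot further down labels as the hazard ratio.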

weibull_xaxis = create_dataframe_from_pandas(conn,

                                             pandas_df=pd.DataFrame({"Survival Days": list(range(1, 1000))}),

                                             table_name="#wc_weibull_test",

                                             force=True)

shape = float(fitted.filter("NAME='SHAPE'").collect().iat[0, 1])

scale = float(fitted.filter("NAME='SCALE'").collect().iat[0, 1])

survival_curve = cdf(weibull_xaxis, distr_info={'name':'weibull', 'shape':shape, 'scale':scale}, complementary=True)

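hazard_ratio = survival_curve.diff("Survival Days", -1).fillna(0).collect()

# Assumption: survival_curve_p, used by the plotting code, is the collected cdf() result indexed by its first column (the x values).

survival_curve_p = survival_curve.collect()

survival_curve_p = survival_curve_p.set_index(survival_curve_p.columns[0])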
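As a cross-check on these quantities, the Weibull survival function and hazard rate have closed forms: S(t) = exp(-(t/scale)^shape) and h(t) = (shape/scale) * (t/scale)^(shape-1). The sketch below evaluates them in plain Python. The shape and scale values are placeholders, not the fitted results, and the helper names are hypothetical. It also illustrates that a one-step difference of the survival curve, S(t) - S(t+1), is the probability of failing within that day, and that dividing it by S(t) gives a close approximation of the hazard rate.

```python
import math


def weibull_survival(t, shape, scale):
    """Survival probability S(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))


def weibull_hazard(t, shape, scale):
    """Hazard rate h(t) = (shape / scale) * (t / scale) ** (shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


# Placeholder parameters for illustration only (not the values fitted above).
shape, scale = 1.5, 500.0
days = list(range(1, 1000))

survival = [weibull_survival(t, shape, scale) for t in days]

# Probability of failing between day t and day t + 1.
step_failure = [
    weibull_survival(t, shape, scale) - weibull_survival(t + 1, shape, scale)
    for t in days
]

# Conditioning on survival up to day t approximates the hazard rate.
approx_hazard = [d / s for d, s in zip(step_failure, survival)]

t = 500
print(approx_hazard[t - 1], weibull_hazard(t, shape, scale))
```

With a shape above 1, as in this placeholder, the hazard rate increases over time, which corresponds to wear-out failures; a shape below 1 would give a decreasing hazard.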

Last but not least, the survival curve and hazard ratio can be visualized with Plotly.

import plotly.graph_objects as go

from plotly.subplots import make_subplots



fig = make_subplots(specs=[[{"secondary_y": True}]])



fig.add_trace(

    go.Scatter(x=survival_curve_p.index, y=survival_curve_p.values.flatten(), name="Survival Probability"),

    secondary_y=False,

)



fig.add_trace(

    go.Scatter(x=survival_curve_p.index, y=hazard_ratio.values.flatten(), name="Hazard Ratio"),

    secondary_y=True,

)



fig.update_layout(

    title_text="WC-Weibull"

)



fig.update_xaxes(title_text="Survival Days")



fig.update_yaxes(title_text="Survival Probability", secondary_y=False)

fig.update_yaxes(title_text="Hazard Ratio", secondary_y=True)



fig.write_html("./wc_weibull.html")

With the Python machine learning client for SAP HANA, data upload, distribution fitting, and survival curve calculation can all be done conveniently from Python.

If you want to learn more about hana_ml and SAP HANA Predictive Analysis Library (PAL), please refer to the following links:

Outlier Detection using Statistical Tests in Python Machine Learning Client for SAP HANA

Outlier Detection by Clustering using Python Machine Learning Client for SAP HANA

Anomaly Detection in Time-Series using Seasonal Decomposition in Python Machine Learning Client for ...

Outlier Detection with One-class Classification using Python Machine Learning Client for SAP HANA

Learning from Labeled Anomalies for Efficient Anomaly Detection using Python Machine Learning Client...

Additive Model Time-series Analysis using Python Machine Learning Client for SAP HANA

Time-Series Modeling and Analysis using SAP HANA Predictive Analysis Library(PAL) through Python Mac...

Import multiple excel files into a single SAP HANA table

COPD study, explanation and interpretability with Python machine learning client for SAP HANA