Weibull Analysis using Python machine learning client for SAP HANA
Weibull analysis is used to analyze and forecast product lifetimes. In this blog post, I'd like to show how to perform a Weibull analysis with the Python machine learning client for SAP HANA (hana-ml).
The data comes from a PoC in China.
First, we import the required packages and connect to the SAP HANA instance.
import pandas as pd
from hana_ml.dataframe import create_dataframe_from_pandas, ConnectionContext, DataFrame
conn = ConnectionContext(address="lsvxc0103.sjc.sap.corp", port=30315, user="PAL_USER")
The Excel file can be loaded into an SAP HANA table via pandas and the hana-ml package. The Chinese column names can then be renamed with the rename_columns() function.
# The file name means "exhaust pipe related fault examples".
pf = pd.read_excel("./排气管相关故障示例.xlsx")
# 生产日期 = production date, 故障日期 = fault date
wc_df = create_dataframe_from_pandas(conn,
                                     pandas_df=pf,
                                     table_name="wc_data",
                                     force=True,
                                     table_structure={"生产日期": "DATE", "故障日期": "DATE"})
wc_df = wc_df.rename_columns({"生产日期": "PRODUCTION_DATE", "故障日期": "FAULT_TIME"})
wc_df.head(5).collect()
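We then construct an SAP HANA dataframe in the shape that hana-ml's Weibull fit expects. With censored=True, distribution_fit() takes a left and a right data column; every row here is an observed failure, so both columns (L and R) hold the same survival time in days.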
weibull_prepare = DataFrame(conn, "SELECT SURVIVAL_DAYS L, SURVIVAL_DAYS R FROM (SELECT DAYS_BETWEEN(PRODUCTION_DATE, FAULT_TIME) SURVIVAL_DAYS FROM ({}))".format(wc_df.select_statement))
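To make the SQL transformation concrete, here is a minimal plain-Python sketch of the same idea, independent of SAP HANA. The helper name survival_rows and the sample dates are hypothetical; HANA's DAYS_BETWEEN is approximated by subtracting two datetime.date values.

```python
from datetime import date


def survival_rows(records):
    """Return (L, R) pairs of survival days for (production_date, fault_date) records.

    Hypothetical helper: mirrors the SQL above, where both columns carry the same value.
    """
    rows = []
    for production_date, fault_date in records:
        days = (fault_date - production_date).days
        rows.append((days, days))
    return rows


# Made-up sample records, for illustration only (not the PoC data).
sample = [
    (date(2019, 1, 1), date(2019, 3, 2)),
    (date(2019, 5, 10), date(2020, 5, 10)),
]
print(survival_rows(sample))
```

Keeping L and R identical marks each row as an exactly observed failure time rather than an interval.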
In the next step, we use the distribution_fit() function to fit a Weibull distribution to the data.
from hana_ml.algorithms.pal.stats import distribution_fit, cdf
fitted, _ = distribution_fit(weibull_prepare, distr_type='weibull', censored=True)
fitted.collect()
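For readers who want to see what a Weibull fit involves, the following is a rough sketch of maximum likelihood estimation for uncensored data in pure Python. It is not the algorithm PAL runs internally; the function name fit_weibull_mle, the bisection bracket, and the synthetic lifetimes are all assumptions made for illustration.

```python
import math
import random


def fit_weibull_mle(samples, tol=1e-9):
    """Estimate (shape, scale) of a Weibull distribution from uncensored, positive samples."""
    largest = max(samples)
    # Rescale to (0, 1] so that y ** k cannot overflow; the shape equation is scale-invariant.
    ys = [x / largest for x in samples]
    mean_log = sum(math.log(y) for y in ys) / len(ys)

    def score(k):
        # Likelihood equation for the shape k; it increases with k and its root is the MLE.
        powers = [y ** k for y in ys]
        weighted = sum(p * math.log(y) for p, y in zip(powers, ys))
        return weighted / sum(powers) - 1.0 / k - mean_log

    lo, hi = 1e-3, 100.0  # assumed search bracket for the shape
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            hi = mid
        else:
            lo = mid
    shape = 0.5 * (lo + hi)
    # Given the shape, the scale has a closed form: scale ** shape = mean(x ** shape).
    scale = largest * (sum(y ** shape for y in ys) / len(ys)) ** (1.0 / shape)
    return shape, scale


random.seed(0)
# Synthetic lifetimes, not the PoC data: true scale 500 days, true shape 1.8.
lifetimes = [random.weibullvariate(500.0, 1.8) for _ in range(5000)]
shape_hat, scale_hat = fit_weibull_mle(lifetimes)
print(shape_hat, scale_hat)
```

With a few thousand samples the estimates should land close to the true parameters. A shape above 1 indicates a failure rate that increases with age.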
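The survival curve can be computed with the cdf() function by setting complementary=True. To approximate how fast the survival probability drops per day (plotted later as the hazard ratio), we take the finite difference of survival_curve with the dataframe's diff() function.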
weibull_xaxis = create_dataframe_from_pandas(conn,
                                             pandas_df=pd.DataFrame({"Survival Days": list(range(1, 1000))}),
                                             table_name="#wc_weibull_test",
                                             force=True)
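shape = float(fitted.filter("NAME='SHAPE'").collect().iat[0, 1])
scale = float(fitted.filter("NAME='SCALE'").collect().iat[0, 1])
survival_curve = cdf(weibull_xaxis, distr_info={'name':'weibull', 'shape':shape, 'scale':scale}, complementary=True)
# Assumed definition: the plotting code uses survival_curve_p, which is not defined elsewhere in this post.
# Indexing by "Survival Days" leaves the probability as the only value column.
survival_curve_p = survival_curve.collect().set_index("Survival Days")
hazard_ratio = survival_curve.diff("Survival Days", -1).fillna(0).collect()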
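For reference, the Weibull survival and hazard functions have standard closed forms, sketched below in plain Python. The function names and parameter values are illustrative assumptions, not the values fitted in this post. The sketch also shows that a one-day difference of the survival curve approximates the hazard rate once it is divided by the survival probability.

```python
import math


def weibull_survival(t, shape, scale):
    """S(t) = exp(-(t / scale) ** shape): probability of surviving beyond t."""
    return math.exp(-((t / scale) ** shape))


def weibull_hazard(t, shape, scale):
    """h(t) = (shape / scale) * (t / scale) ** (shape - 1): instantaneous failure rate."""
    return (shape / scale) * (t / scale) ** (shape - 1)


# Illustrative parameters only, not the values fitted in this post.
shape, scale = 1.8, 500.0
t = 200
# One-day drop in survival, analogous to differencing the survival curve.
drop = weibull_survival(t, shape, scale) - weibull_survival(t + 1, shape, scale)
# Dividing by the survival probability turns the drop into an approximate hazard.
approx_hazard = drop / weibull_survival(t, shape, scale)
print(drop, approx_hazard, weibull_hazard(t, shape, scale))
```

Without the division, the difference is closer to the daily failure probability (the density) than to the hazard rate. The two are similar while the survival probability is still near 1 and diverge as it falls.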
Finally, the survival curve and hazard ratio can be visualized with plotly.
import plotly.graph_objects as go
from plotly.subplots import make_subplots
fig = make_subplots(specs=[[{"secondary_y": True}]])
fig.add_trace(
    go.Scatter(x=survival_curve_p.index, y=survival_curve_p.values.flatten(), name="Survival Probability"),
    secondary_y=False,
)
fig.add_trace(
    go.Scatter(x=survival_curve_p.index, y=hazard_ratio.values.flatten(), name="Hazard Ratio"),
    secondary_y=True,
)
fig.update_layout(
    title_text="WC-Weibull"
)
fig.update_xaxes(title_text="Survival Days")
fig.update_yaxes(title_text="Survival Probability", secondary_y=False)
fig.update_yaxes(title_text="Hazard Ratio", secondary_y=True)
fig.write_html("./wc_weibull.html")
Thanks to the Python machine learning client for SAP HANA, data upload, distribution fitting, and survival curve calculation can all be done conveniently from Python.
If you want to learn more about hana_ml and the SAP HANA Predictive Analysis Library (PAL), please refer to the following blog posts:
Outlier Detection using Statistical Tests in Python Machine Learning Client for SAP HANA
Outlier Detection by Clustering using Python Machine Learning Client for SAP HANA
Outlier Detection with One-class Classification using Python Machine Learning Client for SAP HANA
Additive Model Time-series Analysis using Python Machine Learning Client for SAP HANA
Import multiple excel files into a single SAP HANA table
COPD study, explanation and interpretability with Python machine learning client for SAP HANA