Model Storage with Python Machine Learning Client for SAP HANA
Machine learning models, e.g. classification and regression models, capture the relationships and patterns between features in the training dataset, which can then be applied to similar data in the future for prediction. Model training can be time consuming, so it is desirable to store (persist) a trained model for future use without having to retrain it.
In the Python Machine Learning Client for SAP HANA (hana-ml), we provide a ModelStorage class for persisting models such as classification, regression, clustering and time series models. In this blog post, you will learn:
- ModelStorage class and its methods.
- How to apply model storage and its methods in a use case.
1. ModelStorage Class
The ModelStorage class allows users to save, load, list and delete models. Internally, a model is stored as two parts:
- Metadata: contains the model identification (name, version, algorithm class) and the attributes of its Python model object that are required for re-instantiation. It is saved in a table named HANAML_MODEL_STORAGE by default.
- Back-end model: consists of the model returned by the SAP HANA Predictive Analysis Library (PAL). A model can be saved into different SAP HANA tables depending on the requirements of the algorithm.
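To make the two-part layout more concrete, here is a minimal pure-Python sketch of the idea. It is not how hana-ml is implemented: the `ToyClassifier` class, the `save` and `load` helpers, and the two dictionaries standing in for the metadata table and the back-end model tables are all hypothetical names chosen for illustration.

```python
import json

# Hypothetical stand-ins for the metadata table and the back-end model tables.
metadata_table = {}
backend_tables = {}


class ToyClassifier:
    """Toy model; its constructor arguments play the role of object attributes."""

    def __init__(self, max_iter=100, activation="TANH"):
        self.max_iter = max_iter
        self.activation = activation
        self.model_ = None  # set by fit(); stands in for the model returned by PAL

    def fit(self, rows):
        # Pretend "training": remember the most frequent label.
        labels = [label for _, label in rows]
        self.model_ = max(set(labels), key=labels.count)
        return self

    def predict(self, rows):
        return [self.model_ for _ in rows]


# Maps the class name stored in the metadata back to a Python class.
CLASS_REGISTRY = {"ToyClassifier": ToyClassifier}


def save(name, version, model):
    key = (name, version)
    # Part 1: metadata = identification plus attributes needed to rebuild the object.
    metadata_table[key] = json.dumps({
        "class": type(model).__name__,
        "attributes": {"max_iter": model.max_iter, "activation": model.activation},
    })
    # Part 2: the trained model content, stored separately.
    backend_tables[key] = model.model_


def load(name, version):
    key = (name, version)
    meta = json.loads(metadata_table[key])
    # Re-instantiate the Python object, then attach the stored model content.
    model = CLASS_REGISTRY[meta["class"]](**meta["attributes"])
    model.model_ = backend_tables[key]
    return model


trained = ToyClassifier(max_iter=1000).fit(
    [("Sunny", "Play"), ("Rain", "Play"), ("Sunny", "Do not Play")]
)
save("toy", 1, trained)
restored = load("toy", 1)
print(restored.max_iter, restored.predict([("Overcast", None)]))
```

The restored object is a new Python instance that behaves like the trained one, which is the point of keeping the re-instantiation attributes separate from the model content.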
Some important methods and descriptions are below:
- save_model(model, if_exists='upgrade')
The model is stored in SAP HANA tables in a schema specified by the user. A model is identified by its name and version. The parameter 'if_exists' provides three ways of handling the save if a model with the same name and version already exists:
- 'replace': the previous model will be overwritten.
- 'upgrade': the current model will be saved with an incremented version number.
- 'fail': an error is raised to indicate that a model with the same name and version already exists.
- load_model (name, version=None)
Load a model according to the model name. If the version is not provided, the latest version is loaded.
- delete_model (name, version)
Delete a model according to the name and version.
- list_models(name=None, version=None)
List all the existing models stored in SAP HANA.
- clean_up()
Delete all the models and the meta table in SAP HANA.
Please refer to the hana-ml ModelStorage documentation for the whole list of methods.
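The semantics of these methods can be sketched with a small in-memory stand-in. This is only an illustration under stated assumptions, not the hana-ml implementation: `ToyModelStorage` is a hypothetical class, it takes the name and version as explicit arguments rather than reading them from the model object, and it assumes that 'upgrade' means "highest existing version plus one".

```python
class ToyModelStorage:
    """In-memory sketch of save/load/delete/list semantics (hypothetical)."""

    def __init__(self):
        self._models = {}  # (name, version) -> model object

    def save_model(self, name, version, model, if_exists="upgrade"):
        if if_exists not in ("replace", "upgrade", "fail"):
            raise ValueError(f"Unknown if_exists option: {if_exists!r}")
        if (name, version) in self._models:
            if if_exists == "fail":
                raise ValueError(f"Model {name!r} version {version} already exists")
            if if_exists == "upgrade":
                # Assumption: store under the next free version number.
                version = max(v for n, v in self._models if n == name) + 1
            # 'replace' falls through and overwrites the existing entry.
        self._models[(name, version)] = model
        return version

    def load_model(self, name, version=None):
        if version is None:
            # No version given: pick the latest one stored under this name.
            versions = [v for n, v in self._models if n == name]
            if not versions:
                raise KeyError(name)
            version = max(versions)
        return self._models[(name, version)]

    def delete_model(self, name, version):
        del self._models[(name, version)]

    def list_models(self, name=None, version=None):
        # Optional filters mirror the name/version parameters described above.
        return sorted(
            (n, v) for n, v in self._models
            if (name is None or n == name) and (version is None or v == version)
        )


storage = ToyModelStorage()
storage.save_model("SVM", 1, "model-a")
storage.save_model("SVM", 1, "model-b", if_exists="upgrade")  # stored as version 2
storage.save_model("SVM", 1, "model-c", if_exists="replace")  # overwrites version 1
print(storage.list_models())      # [('SVM', 1), ('SVM', 2)]
print(storage.load_model("SVM"))  # model-b (latest version)
```

Saving the same name and version a third time with if_exists="fail" would raise an error in this sketch, matching the third option in the list above.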
Algorithms that have predict or transform functions are supported by ModelStorage. A partial list of supported algorithms is as follows:
- Classification: UnifiedClassification, MLPClassifier, RDTClassifier, HybridGradientBoostingClassifier, SVC, DecisionTreeClassifier, CRF, LogisticRegression, KNNClassifier, NaiveBayes…
- Regression: UnifiedRegression, LinearRegression, PolynomialRegression, GLM, ExponentialRegression, BiVariateGeometricRegression, BiVariateNaturalLogarithmicRegression, CoxProportionalHazardModel…
- Clustering: UnifiedClustering, DBSCAN, SOM…
- Time Series: ARIMA, OnlineARIMA, VectorARIMA, AutoARIMA, LSTM…
- Preprocessing: Imputer, KBinsDiscretizer…
2. Use Case
All source code in this section uses the Python machine learning client for SAP HANA Predictive Analysis Library (PAL).
We first need to create a connection to SAP HANA; then we can use the various functions of hana-ml for data analysis. The following is an example:
    import hana_ml
    from hana_ml import dataframe

    conn = dataframe.ConnectionContext('sysName', 'port', 'username', 'password')
A simple self-made dataset is used to show the usage of model storage for classification. The data is stored in two SAP HANA tables called DATA_TBL_FIT and DATA_TBL_PREDICT. Let's have a look at the dataset.
    df_fit = conn.table('DATA_TBL_FIT')
    df_predict = conn.table('DATA_TBL_PREDICT')
    print(df_fit.collect())
    print(df_predict.collect())
The result is shown below:
        ID   OUTLOOK  TEMP  HUMIDITY WINDY        CLASS
    0    0     Sunny    75      70.0   Yes         Play
    1    1     Sunny    77      90.0   Yes  Do not Play
    2    2     Sunny    85      79.0    No  Do not Play
    3    3     Sunny    72      95.0    No  Do not Play
    4    4     Sunny    88      70.0    No         Play
    5    5  Overcast    72      90.0   Yes         Play
    6    6  Overcast    83      78.0    No         Play
    7    7  Overcast    64      65.0   Yes         Play
    8    8  Overcast    81      75.0    No         Play
    9    9  Overcast    71      80.0   Yes  Do not Play
    10  10      Rain    65      70.0   Yes  Do not Play
    11  11      Rain    75      80.0    No         Play
    12  12      Rain    68      80.0    No         Play
    13  13      Rain    70      96.0    No         Play

       ID   OUTLOOK    TEMP  HUMIDITY WINDY
    0   0  Overcast      75  -10000.0   Yes
    1   1      Rain      78      70.0   Yes
    2   2     Sunny  -10000      78.0   Yes
    3   3     Sunny      69      70.0   Yes
    4   4      Rain      74      70.0   Yes
    5   5      Rain      70      70.0   Yes
    6   6       ***      70      70.0   Yes
Train one model per algorithm with the UnifiedClassification function, using the algorithms 'MLP', 'NaiveBayes', 'LogisticRegression', 'decisiontree', 'HybridGradientBoostingTree', 'RandomDecisionTree' and 'SVM', and save each of them:
    from hana_ml.algorithms.pal.unified_classification import UnifiedClassification
    from hana_ml.model_storage import ModelStorage

    ms = ModelStorage(conn)
    classification_algorithms = ['MLP', 'NaiveBayes', 'LogisticRegression', 'decisiontree',
                                 'HybridGradientBoostingTree', 'RandomDecisionTree', 'SVM']

    # Algorithm-specific parameters for the decision tree and the MLP.
    dt_param = dict(algorithm='c45')
    mlp_param = dict(hidden_layer_size=(10,), activation='TANH', output_activation='TANH',
                     training_style='batch', max_iter=1000, normalization='z-transform',
                     weight_init='normal', thread_ratio=1)

    for name in classification_algorithms:
        # Create the classifier, passing extra parameters where they are defined.
        if name == 'decisiontree':
            algorithm = UnifiedClassification(func=name, **dt_param)
        elif name == 'MLP':
            algorithm = UnifiedClassification(func=name, **mlp_param)
        else:
            algorithm = UnifiedClassification(func=name)

        # Logistic regression additionally needs the mapping of the two class labels.
        if name == 'LogisticRegression':
            algorithm.fit(data=df_fit, key='ID', class_map0='Play', class_map1='Do not Play')
        else:
            algorithm.fit(data=df_fit, key='ID')

        # A model is identified by its name and version when it is saved.
        algorithm.name = name
        algorithm.version = 1
        ms.save_model(model=algorithm, if_exists='replace')
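As a design alternative, the per-algorithm branching in the loop can be expressed with lookup tables. The following is a minimal sketch of that idea, not part of the original code: `build_calls`, `constructor_params` and `fit_params` are hypothetical names, and the sketch only assembles keyword arguments, so it runs without an SAP HANA connection.

```python
classification_algorithms = ['MLP', 'NaiveBayes', 'LogisticRegression', 'decisiontree',
                             'HybridGradientBoostingTree', 'RandomDecisionTree', 'SVM']

# Extra constructor parameters; algorithms without an entry use the defaults.
constructor_params = {
    'decisiontree': dict(algorithm='c45'),
    'MLP': dict(hidden_layer_size=(10,), activation='TANH', output_activation='TANH',
                training_style='batch', max_iter=1000, normalization='z-transform',
                weight_init='normal', thread_ratio=1),
}

# Extra fit parameters; only logistic regression needs the class mapping.
fit_params = {
    'LogisticRegression': dict(class_map0='Play', class_map1='Do not Play'),
}


def build_calls(names):
    """Return (name, constructor kwargs, fit kwargs) for each algorithm name."""
    calls = []
    for name in names:
        init_kwargs = {'func': name, **constructor_params.get(name, {})}
        fit_kwargs = {'key': 'ID', **fit_params.get(name, {})}
        calls.append((name, init_kwargs, fit_kwargs))
    return calls


for name, init_kwargs, fit_kwargs in build_calls(classification_algorithms):
    print(name, sorted(init_kwargs), sorted(fit_kwargs))
```

With a live connection, each tuple would presumably be used as `UnifiedClassification(**init_kwargs)` followed by `fit(data=df_fit, **fit_kwargs)`, which keeps the loop body identical for every algorithm.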
Use the list_models function to list all stored models. The returned table contains the name, version and other information of all 7 models:
    ms.list_models()
Let's select one model, 'SVM', and load it for prediction:
    new_model = ms.load_model(name='SVM', version=1)
    type(new_model)
new_model is an object of type UnifiedClassification. We can use this object for prediction:
    res = new_model.predict(df_predict, key='ID')
    print(res.collect())

       ID        SCORE  CONFIDENCE REASON_CODE
    0   0         Play    0.296441        None
    1   1         Play    0.505984        None
    2   2         Play    0.296441        None
    3   3         Play    0.595937        None
    4   4         Play    0.635761        None
    5   5  Do not Play    0.248283        None
    6   6  Do not Play    0.313294        None
To delete a model, for example the model 'SVM':
    ms.delete_model(name='SVM', version=1)
    ms.list_models()
We could also clean up all models at once:
    ms.clean_up()
In this blog post, we described the model storage of hana-ml and how to use its methods. If you want to learn more about hana-ml and SAP HANA PAL, please refer to our official documentation:
Python Machine Learning Client for SAP HANA
SAP HANA Predictive Analysis Library(PAL)
Other Useful Links:
hana-ml on PyPI.
We also provide an R API for SAP HANA PAL called hana.ml.r; please refer to its documentation for more information.
For other blog posts on hana-ml:
- Identification of Seasonality in Time Series with Python Machine Learning Client for SAP HANA
- Outlier Detection using Statistical Tests in Python Machine Learning Client for SAP HANA
- Outlier Detection by Clustering using Python Machine Learning Client for SAP HANA
- Anomaly Detection in Time-Series using Seasonal Decomposition in Python Machine Learning Client for SAP HANA
- Outlier Detection with One-class Classification using Python Machine Learning Client for SAP HANA
- Learning from Labeled Anomalies for Efficient Anomaly Detection using Python Machine Learning Client for SAP HANA
- Python Machine Learning Client for SAP HANA
- Import multiple excel files into a single SAP HANA table
- COPD study, explanation and interpretability with Python machine learning client for SAP HANA
- A Multivariate Time Series Modeling and Forecasting Guide with Python Machine Learning Client for SAP HANA
Hi Xin, great blog. Is there a way to set the schedule on the model storage without a connection_userkey?
I'm connecting to a HANA Cloud instance to run PAL/APL and don't have the userstore installed locally (under "Connect with Secure Password" https://blogs.sap.com/2019/11/05/hands-on-tutorial-machine-learning-push-down-to-sap-hana-with-python/)