Technical Articles
Diving into the HANA DataFrame: Python Integration – Part 2
This is the second part of my blog series, following on from Part 1.
In Part 1 we learnt:
- What the SAP HANA DataFrame is,
- Why it is valuable,
- How to use it.
In Part 2, we will now explore a highly topical application of the HANA DataFrame – Machine Learning.
This blog will go into detail on how the SAP HANA DataFrame provides Machine Learning capabilities to overcome some of the biggest challenges faced in this domain. It will also cover a specific algorithm and an explanation of how this can be applied.
In an enterprise setting with huge amounts of data, productivity is the biggest blocker: waiting for data to transfer and for models to train costs a lot of time and money. A more efficient and productive way of training algorithms is needed, so let’s revisit the Python Client API and, more specifically, the ML API.
The Machine Learning API
If you have used the well-known “Scikit-Learn” library before, this ML API will provide a very similar user experience. It provides a Pythonic way of calling the algorithms of the HANA Predictive Analysis Library (PAL) against a HANA DataFrame. The processing is delegated to HANA, and once the model has been trained, various methods are available to optimise it and understand the results.
There are 90+ algorithms that reside within PAL, and this list is ever increasing. These algorithms cover:
- Classification
- Regression
- Clustering
- Time Series Analysis
- Association Analysis
- Probability Distribution
- Outlier Detection
- Recommender Systems
- Link Prediction
- Statistical Functions
- Data Preparation
- Other
For detailed information on the library, see this documentation.
What’s the value?
- Similarly to Part 1, the reduction in data transfer cannot be overstated here. It means faster iteration when improving algorithm performance and greater productivity.
- Use syntax that your Data Scientists are comfortable with; there is no need to learn a proprietary programming language.
- SAP is opening up to the world, giving users the ability to combine open source packages with the Python API. We are not locking Data Scientists into specific tools; we are enabling them to do more by combining enterprise and open source tools, as the short sketch below illustrates.
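As a quick illustration, here is a minimal sketch of pushing a projection down to HANA and handing the small collected result to matplotlib. The connection details and table name are placeholders, exactly as used later in this post.

# A minimal sketch of combining hana_ml with open-source tooling.
# Connection details and table name are placeholders, as elsewhere in this blog.
import matplotlib.pyplot as plt
from hana_ml import dataframe
conn = dataframe.ConnectionContext("IP ADDRESS", "PORT", "USER", "PASSWORD")
df_train = conn.table("TABLE", schema="SCHEMA")
# Only a single column is collected to the client as a pandas DataFrame;
# the rest of the data stays in HANA.
survived = df_train.select("SURVIVED").collect()
# From here, any open source library can be applied to the pandas result
survived.SURVIVED.value_counts().plot.bar()
plt.title("Passenger counts by survival flag")
plt.show()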
How can I use it?
For instructions on how to set up the Python API, see this blog.
Assuming that you have set up your environment, I will take you through how to train a Decision Tree Classifier to predict ‘Survived’ or ‘Did Not Survive’ on the Titanic dataset.
This dataset contains 12 columns:
- Age – The Age of the passenger in years.
- Cabin – The Cabin number that the passenger was assigned to. If NaN, this means they had no cabin and perhaps were not assigned one due to the cost of their ticket.
- Embarked – Port of embarkation (S = Southampton, C = Cherbourg, Q = Queenstown).
- Fare – The fare of the ticket purchased by the passenger.
- Name – Full name and title of the passenger.
- Parch – Number of parents and children associated with the passenger aboard.
- PassengerId – Unique ID assigned to each passenger.
- Pclass – Class of ticket purchased (1 = 1st class, 2 = 2nd class, 3 = 3rd class).
- Gender – Gender of the passenger.
- SibSp – Number of siblings and spouses associated with the passenger aboard.
- Ticket – Ticket number.
- Survived – Survival flag of passenger.
Note:
- Pclass can be viewed as a proxy for socio-economic status (SES):
- 1st = Upper class
- 2nd = Middle class
- 3rd = Lower class
- Age is fractional if the passenger is less than 1 year old. If the age is estimated, it is in the form xx.5.
- SibSp & Parch: These 2 columns represent family relations, which are defined in the following ways:
- Sibling = Brother, sister, stepbrother, stepsister.
- Spouse = Husband, wife (mistresses and fiancés were ignored).
- Parent = Mother, father.
- Child = daughter, son, stepdaughter, stepson.
- Some children travelled only with a nanny, therefore Parch = 0 for them.
Credit for the dataset goes to: https://www.kaggle.com/c/titanic/data, where you can find more information.
First let’s import the HANA DataFrame, relevant algorithms and evaluation metrics from the hana_ml package.
- We have imported the entire library of tree based algorithms, of which we will be using the Decision Tree Classifier.
- We have also imported the metrics from PAL – in order to evaluate the performance of our model – of which we will be using the ROC AUC score.
# Import the Python Client API, Dataframe and Trees library
from hana_ml import dataframe
from hana_ml.algorithms.pal import trees
from hana_ml.algorithms.pal import metrics
Like in Part 1, let’s open the “ConnectionContext” and invoke a table() method to create our DataFrame:
# Instantiate the Connection Object (conn)
conn = dataframe.ConnectionContext("IP ADDRESS", "PORT", "USER", "PASSWORD")
# Create the HANA Dataframe (df_train) and point to the training table.
df_train = conn.table("TABLE", schema="SCHEMA")
df_train.head(5).collect()
| PASSENGER_ID | PCLASS | NAME | GENDER | AGE | SIBSP | PARCH | TICKET | FARE | CABIN | EMBARKED | SURVIVED |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S | Did Not Survive |
| 2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th… | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | Survived |
| 3 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S | Survived |
| 4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S | Survived |
| 5 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | | S | Did Not Survive |
Now that we have our HANA DataFrame, we want to think about applying an algorithm to it.
Note: Before applying an algorithm there are always intermediary steps, such as data preparation, feature engineering and exploratory data analysis. For the purposes of this blog I will assume these have all been completed and the data is ready for an algorithm.
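To give a flavour of what such intermediary steps can look like directly on a HANA DataFrame, here is a small sketch; the imputation value is arbitrary and method availability (for example fillna()) may vary with your hana_ml version, so treat it as illustrative only.

# A sketch of simple preparation steps performed on the HANA DataFrame itself,
# so that the data never leaves the database. fillna() is assumed to be
# available in your hana_ml version; equivalent SQL can be used otherwise.
# Keep only the columns we intend to feed to the algorithm
cols = ['PASSENGER_ID', 'PCLASS', 'GENDER', 'AGE', 'SIBSP', 'PARCH', 'FARE', 'SURVIVED']
df_prep = df_train.select(*cols)
# Impute missing ages with a constant (a crude choice, purely illustrative)
df_prep = df_prep.fillna(30, subset=['AGE'])
# Quick exploratory checks, both computed in HANA
print(df_prep.count())
df_prep.head(5).collect()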
- The PAL algorithm that we will be using is the Decision Tree Classifier. The aim of a Decision Tree is to create a “tree” that is able to classify observations by routing decisions down different paths of the tree. It builds a tree using one of the following criteria:
- Gini – a measure of impurity
- Entropy – used to compute information gain (the reduction in entropy after a split)
An overview of how this algorithm works is given below.
Note: The paths in this illustration are given only as an example. We will identify the exact tree that is built after we train the model.
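To make the split criteria concrete, here is a small, framework-independent sketch of how Gini impurity and entropy are computed for the labels at a candidate node. This is purely illustrative; PAL performs these calculations internally.

# Illustrative only: Gini impurity and entropy for the labels at one node
from collections import Counter
from math import log2

def gini(labels):
    # 1 minus the sum of squared class proportions (0 = perfectly pure node)
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    # Shannon entropy in bits (0 = pure node); information gain is the parent's
    # entropy minus the weighted entropy of its children after a split
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

node = ['Survived', 'Survived', 'Did Not Survive', 'Did Not Survive', 'Did Not Survive']
print(gini(node))     # 0.48
print(entropy(node))  # ~0.971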
Let’s instantiate the algorithm and train the model to our Titanic data.
# Create the DecisionTreeClassifier instance and specify the parameters
dtc = trees.DecisionTreeClassifier(conn_context=conn,
algorithm='c45',
min_records_of_parent=2,
min_records_of_leaf=1,
thread_ratio=0.4,
split_threshold=1e-5,
model_format='json',
output_rules=True)
# Store the necessary features in a list and invoke the fit method to train the algorithm
features = ['PCLASS', 'GENDER', 'AGE', 'SIBSP', 'PARCH', 'FARE']
dtc.fit(df_train, features=features, key='PASSENGER_ID', label='SURVIVED')
- Notice that HANA encodes categorical string values automatically. Manual encoding can be done if you want to follow a particular strategy; however, the automatic handling can save a Data Scientist a lot of time.
- Once we have a trained model, we want to evaluate its performance to assess if this model will perform well with new data.
- Creating models ready for production can be tricky; the correct evaluation criteria must be selected, and the model must be deployed in a way that allows it to be easily updated and scheduled.
Using HANA as one single source for data, algorithms, training and deployment makes the Data Science process incredibly streamlined and efficient.
Next, I will evaluate the decision rules of the trained model.
# Create a 'for' loop to print the decision rules learnt by the algorithm using the 'decision_rules_' attribute
n = 1
for i in dtc.decision_rules_.collect().RULES_CONTENT:
    print(n, "-", i)
    n += 1
>>> 1 - (GENDER=male) && (PCLASS>=1.5) && (AGE>=63.5) => Did Not Survive
>>> 2 - (GENDER=male) && (PCLASS<1.5) && (AGE<63.5) && (FARE<10.4812) => Did Not Survive
>>> 3 - (GENDER=male) && (PCLASS<1.5) && (AGE>=63.5) && (PARCH<0.5) => Did Not Survive
>>> 4 - (GENDER=male) && (PCLASS<1.5) && (AGE>=63.5) && (PARCH>=0.5) => Did Not Survive
>>> 5 - (GENDER=female) && (PCLASS<1.5) && (PARCH<0.5) && (10.4812<=FARE<74.375) => Survived
>>> 6 - (GENDER=female) && (PCLASS<1.5) && (PARCH<0.5) && (FARE>=74.375) => Survived
>>> 7 - (GENDER=female) && (PCLASS<1.5) && (PARCH>=0.5) && (10.4812<=FARE<74.375) => Survived
>>> 8 - (GENDER=female) && (PCLASS<1.5) && (PARCH>=0.5) && (FARE>=74.375) => Survived
>>> 9 - (GENDER=female) && (PCLASS>=1.5) && (SIBSP>=3.5) && (FARE<10.4812) => Survived
>>> 10 - (GENDER=female) && (PCLASS>=1.5) && (SIBSP>=3.5) && (10.4812<=FARE<74.375) => Did Not Survive
>>> 11 - (GENDER=male) && (PCLASS<1.5) && (AGE<63.5) && (10.4812<=FARE<74.375) && (PARCH<0.5) => Did Not Survive
>>> 12 - (GENDER=male) && (PCLASS<1.5) && (AGE<63.5) && (10.4812<=FARE<74.375) && (PARCH>=0.5) => Survived
>>> 13 - (GENDER=male) && (PCLASS<1.5) && (AGE<63.5) && (FARE>=74.375) && (PARCH<0.5) => Did Not Survive
>>> 14 - (GENDER=male) && (PCLASS<1.5) && (AGE<63.5) && (FARE>=74.375) && (PARCH>=0.5) => Did Not Survive
>>> 15 - (GENDER=male) && (PCLASS>=1.5) && (AGE<63.5) && (PARCH<0.5) && (FARE<10.4812) => Did Not Survive
>>> 16 - (GENDER=male) && (PCLASS>=1.5) && (AGE<63.5) && (PARCH<0.5) && (10.4812<=FARE<74.375) => Did Not Survive
>>> 17 - (GENDER=male) && (PCLASS>=1.5) && (AGE<63.5) && (PARCH>=0.5) && (FARE<10.4812) => Did Not Survive
>>> 18 - (GENDER=female) && (PCLASS>=1.5) && (SIBSP<3.5) && (FARE<10.4812) && (PARCH<0.5) => Survived
>>> 19 - (GENDER=female) && (PCLASS>=1.5) && (SIBSP<3.5) && (FARE<10.4812) && (PARCH>=0.5) => Did Not Survive
>>> 20 - (GENDER=female) && (PCLASS>=1.5) && (SIBSP<3.5) && (10.4812<=FARE<74.375) && (PARCH<0.5) => Survived
>>> 21 - (GENDER=female) && (PCLASS>=1.5) && (SIBSP<3.5) && (10.4812<=FARE<74.375) && (PARCH>=0.5) => Survived
>>> 22 - (GENDER=male) && (PCLASS>=1.5) && (AGE<63.5) && (PARCH>=0.5) && (10.4812<=FARE<74.375) && (SIBSP<3.5) => Did Not Survive
>>> 23 - (GENDER=male) && (PCLASS>=1.5) && (AGE<63.5) && (PARCH>=0.5) && (10.4812<=FARE<74.375) && (SIBSP>=3.5) => Did Not Survive
Here you can see that the tree has formulated 23 decision rules. These are the rules that the trained algorithm will use to make predictions on new data. Now let’s evaluate its performance using the classifier’s built-in score() method.
# Print the accuracy score of this trained model using the score method
print("Trained model accuracy score: {:.2f}". format(dtc.score(data=df_train, key='PASSENGER_ID')*100) + str('%'))
>>> Trained model accuracy score: 79.45%
# View the number of passengers that 'Survived' and 'Did Not Survive'
df_train.collect().SURVIVED.value_counts()
>>> Did Not Survive 423
>>> Survived 268
>>> Name: SURVIVED, dtype: int64
This accuracy score looks good at 79.45%, but the data is unbalanced, meaning that we have far fewer instances of ‘Survived’ than ‘Did Not Survive’. With unbalanced data like this, a simple accuracy score is not always a good measure of true performance.
For example, suppose you are trying to predict heart failure and it occurs only 1 time in 100 events. A model that correctly predicts the 99 occurrences where no failure happens, but misses the 1 time that it does fail, will still show 99% accuracy! However, the consequence of missing the heart failure is huge. Missing the event like this is a false negative, known as a Type II error, and depending on the use case it can have heavy consequences. Hence, a more appropriate measure of accuracy is needed, as the short illustration below shows.
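To make this concrete, here is a tiny, framework-independent illustration of the heart-failure example: a model that always predicts “no failure” scores 99% accuracy while completely missing the one event that matters.

# A tiny illustration of why plain accuracy misleads on unbalanced data
actual    = ['no failure'] * 99 + ['failure']
predicted = ['no failure'] * 100
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
recall_failure = sum(a == p == 'failure' for a, p in zip(actual, predicted)) / actual.count('failure')
print("Accuracy: {:.0%}".format(accuracy))                    # 99%
print("Recall on 'failure': {:.0%}".format(recall_failure))   # 0%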
Using a validation sample, I will apply the trained model with the predict() method and evaluate its performance using the PAL ROC AUC score instead.
# Create the HANA Dataframe (df_val) and point to the validation table.
df_val = conn.table(table="TABLE", schema="SCHEMA")
df_val.head(5).collect()
| PASSENGER_ID | PCLASS | NAME | GENDER | AGE | SIBSP | PARCH | TICKET | FARE | CABIN | EMBARKED | SURVIVED |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 692 | 3 | Karun, Miss. Manca | female | 4 | 0 | 1 | 349256 | 13.4167 | | C | Survived |
| 693 | 3 | Lam, Mr. Ali | male | 0 | 0 | 0 | 1601 | 56.4958 | | S | Survived |
| 694 | 3 | Saad, Mr. Khalil | male | 25 | 0 | 0 | 2672 | 7.225 | | C | Did Not Survive |
| 695 | 1 | Weir, Col. John | male | 60 | 0 | 0 | 113800 | 26.55 | | S | Did Not Survive |
| 696 | 2 | Chapman, Mr. Charles Henry | male | 52 | 0 | 0 | 248731 | 13.5 | | S | Did Not Survive |
# Invoke the prediction method ("predict()") of the trained algorithm and inspect the result
val_result = dtc.predict(df_val, features=features, key="PASSENGER_ID", verbose=False)
val_result.head(5).collect()
| PASSENGER_ID | SCORE | CONFIDENCE |
|---|---|---|
| 692 | Survived | 0.655738 |
| 693 | Did Not Survive | 0.846154 |
| 694 | Did Not Survive | 0.883249 |
| 695 | Did Not Survive | 0.636364 |
| 696 | Did Not Survive | 0.846154 |
# Create list of actual labels from the validation dataset
val_actual = list(df_val.collect().SURVIVED.values)
# Create list of probabilities from the validation dataset
val_proba = list(val_result.collect().CONFIDENCE)
# Calculate ROC AUC Score
auc = metrics.roc_auc_score(y_true=val_actual, y_score=val_proba)
print("ROC AUC Score: {:.2f}".format(auc))
>>> ROC AUC Score: 0.42
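One detail worth noting: the CONFIDENCE column returned by predict() is the confidence of whichever label was predicted, not the probability of the ‘Survived’ class. If you want the ROC score to be computed against the probability of the positive class, a small transformation can be applied first. The following is only a sketch, reusing the metrics.roc_auc_score call from above; the binary labels it creates (val_actual_binary) are also reused in the tuning loop further down.

# Sketch: derive a probability of the positive class ('Survived') from the
# SCORE/CONFIDENCE columns, plus binary ground-truth labels, before computing the AUC
val_df = val_result.collect()
val_proba_survived = [conf if score == 'Survived' else 1 - conf
                      for score, conf in zip(val_df.SCORE, val_df.CONFIDENCE)]
val_actual_binary = [1 if label == 'Survived' else 0 for label in df_val.collect().SURVIVED]
auc_survived = metrics.roc_auc_score(y_true=val_actual_binary, y_score=val_proba_survived)
print("ROC AUC Score (positive class = 'Survived'): {:.2f}".format(auc_survived))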
This model is not finished; we want to improve on this score until we are satisfied with the result.
Note: the ideal score is subjective based on the use case and should be defined before the models are evaluated.
- One way we can do this is by tuning hyper-parameters of the algorithm.
- Previously we used fairly standard values for the hyper-parameters.
- Now we can feed a grid of values to each hyper-parameter and select the combination that produces the best results.
# Optional flag that can be used to cut the hyperparameter search short (not used in the simple grid loop below)
quick = True
# Set up the grid of hyperparameters
algorithms = ['c45', 'chaid', 'cart']
min_records_of_parent = [2, 3, 4, 5]
min_records_of_leaf = [1, 2, 3, 4, 5]
# Use nested for loops to print the results of each combination
for a in algorithms:
    for b in min_records_of_parent:
        for c in min_records_of_leaf:
            dtc = trees.DecisionTreeClassifier(conn_context=conn,
                                               algorithm=a,
                                               min_records_of_parent=b,
                                               min_records_of_leaf=c,
                                               thread_ratio=0.8,
                                               split_threshold=1e-5,
                                               model_format='json',
                                               output_rules=True)
            dtc.fit(df_train, features=features, label='SURVIVED')
            val_result = dtc.predict(df_val, features=features, key="PASSENGER_ID")
            val_score = list(val_result.collect().CONFIDENCE.values)
            val_auc = metrics.roc_auc_score(val_actual_binary, val_score)
            print("Algorithm: ", a, " | min_records_of_parent: ", b, " | min_records_of_leaf: ", c,
                  " --- Accuracy: {:.2f}".format(val_auc * 100))
>>> Algorithm: c45 | min_records_of_parent: 2 | min_records_of_leaf: 1 --- Accuracy: 41.83
>>> Algorithm: c45 | min_records_of_parent: 2 | min_records_of_leaf: 2 --- Accuracy: 42.12
>>> Algorithm: c45 | min_records_of_parent: 2 | min_records_of_leaf: 3 --- Accuracy: 44.72
>>> Algorithm: c45 | min_records_of_parent: 2 | min_records_of_leaf: 4 --- Accuracy: 43.40
>>> Algorithm: c45 | min_records_of_parent: 2 | min_records_of_leaf: 5 --- Accuracy: 44.00
>>> Algorithm: c45 | min_records_of_parent: 3 | min_records_of_leaf: 1 --- Accuracy: 41.83
>>> Algorithm: c45 | min_records_of_parent: 3 | min_records_of_leaf: 2 --- Accuracy: 42.12
>>> Algorithm: c45 | min_records_of_parent: 3 | min_records_of_leaf: 3 --- Accuracy: 44.72
>>> Algorithm: c45 | min_records_of_parent: 3 | min_records_of_leaf: 4 --- Accuracy: 43.40
>>> Algorithm: c45 | min_records_of_parent: 3 | min_records_of_leaf: 5 --- Accuracy: 44.00
>>> Algorithm: c45 | min_records_of_parent: 4 | min_records_of_leaf: 1 --- Accuracy: 41.83
>>> Algorithm: c45 | min_records_of_parent: 4 | min_records_of_leaf: 2 --- Accuracy: 42.12
>>> Algorithm: c45 | min_records_of_parent: 4 | min_records_of_leaf: 3 --- Accuracy: 44.72
>>> Algorithm: c45 | min_records_of_parent: 4 | min_records_of_leaf: 4 --- Accuracy: 43.40
>>> Algorithm: c45 | min_records_of_parent: 4 | min_records_of_leaf: 5 --- Accuracy: 44.00
>>> Algorithm: c45 | min_records_of_parent: 5 | min_records_of_leaf: 1 --- Accuracy: 41.83
>>> Algorithm: c45 | min_records_of_parent: 5 | min_records_of_leaf: 2 --- Accuracy: 42.12
>>> Algorithm: c45 | min_records_of_parent: 5 | min_records_of_leaf: 3 --- Accuracy: 44.72
>>> Algorithm: c45 | min_records_of_parent: 5 | min_records_of_leaf: 4 --- Accuracy: 43.40
>>> Algorithm: c45 | min_records_of_parent: 5 | min_records_of_leaf: 5 --- Accuracy: 44.00
>>> Algorithm: chaid | min_records_of_parent: 2 | min_records_of_leaf: 1 --- Accuracy: 52.03
>>> Algorithm: chaid | min_records_of_parent: 2 | min_records_of_leaf: 2 --- Accuracy: 52.03
>>> Algorithm: chaid | min_records_of_parent: 2 | min_records_of_leaf: 3 --- Accuracy: 52.03
>>> Algorithm: chaid | min_records_of_parent: 2 | min_records_of_leaf: 4 --- Accuracy: 52.03
>>> Algorithm: chaid | min_records_of_parent: 2 | min_records_of_leaf: 5 --- Accuracy: 52.03
>>> Algorithm: chaid | min_records_of_parent: 3 | min_records_of_leaf: 1 --- Accuracy: 52.03
>>> Algorithm: chaid | min_records_of_parent: 3 | min_records_of_leaf: 2 --- Accuracy: 52.03
>>> Algorithm: chaid | min_records_of_parent: 3 | min_records_of_leaf: 3 --- Accuracy: 52.03
>>> Algorithm: chaid | min_records_of_parent: 3 | min_records_of_leaf: 4 --- Accuracy: 52.03
>>> Algorithm: chaid | min_records_of_parent: 3 | min_records_of_leaf: 5 --- Accuracy: 52.03
>>> Algorithm: chaid | min_records_of_parent: 4 | min_records_of_leaf: 1 --- Accuracy: 52.03
>>> Algorithm: chaid | min_records_of_parent: 4 | min_records_of_leaf: 2 --- Accuracy: 52.03
>>> Algorithm: chaid | min_records_of_parent: 4 | min_records_of_leaf: 3 --- Accuracy: 52.03
>>> Algorithm: chaid | min_records_of_parent: 4 | min_records_of_leaf: 4 --- Accuracy: 52.03
>>> Algorithm: chaid | min_records_of_parent: 4 | min_records_of_leaf: 5 --- Accuracy: 52.03
>>> Algorithm: chaid | min_records_of_parent: 5 | min_records_of_leaf: 1 --- Accuracy: 52.03
>>> Algorithm: chaid | min_records_of_parent: 5 | min_records_of_leaf: 2 --- Accuracy: 52.03
>>> Algorithm: chaid | min_records_of_parent: 5 | min_records_of_leaf: 3 --- Accuracy: 52.03
>>> Algorithm: chaid | min_records_of_parent: 5 | min_records_of_leaf: 4 --- Accuracy: 52.03
>>> Algorithm: chaid | min_records_of_parent: 5 | min_records_of_leaf: 5 --- Accuracy: 52.03
>>> Algorithm: cart | min_records_of_parent: 2 | min_records_of_leaf: 1 --- Accuracy: 48.50
>>> Algorithm: cart | min_records_of_parent: 2 | min_records_of_leaf: 2 --- Accuracy: 49.79
>>> Algorithm: cart | min_records_of_parent: 2 | min_records_of_leaf: 3 --- Accuracy: 51.26
>>> Algorithm: cart | min_records_of_parent: 2 | min_records_of_leaf: 4 --- Accuracy: 52.88
>>> Algorithm: cart | min_records_of_parent: 2 | min_records_of_leaf: 5 --- Accuracy: 50.68
>>> Algorithm: cart | min_records_of_parent: 3 | min_records_of_leaf: 1 --- Accuracy: 50.43
>>> Algorithm: cart | min_records_of_parent: 3 | min_records_of_leaf: 2 --- Accuracy: 49.79
>>> Algorithm: cart | min_records_of_parent: 3 | min_records_of_leaf: 3 --- Accuracy: 51.26
>>> Algorithm: cart | min_records_of_parent: 3 | min_records_of_leaf: 4 --- Accuracy: 52.88
>>> Algorithm: cart | min_records_of_parent: 3 | min_records_of_leaf: 5 --- Accuracy: 50.68
>>> Algorithm: cart | min_records_of_parent: 4 | min_records_of_leaf: 1 --- Accuracy: 48.98
>>> Algorithm: cart | min_records_of_parent: 4 | min_records_of_leaf: 2 --- Accuracy: 49.79
>>> Algorithm: cart | min_records_of_parent: 4 | min_records_of_leaf: 3 --- Accuracy: 51.26
>>> Algorithm: cart | min_records_of_parent: 4 | min_records_of_leaf: 4 --- Accuracy: 52.88
>>> Algorithm: cart | min_records_of_parent: 4 | min_records_of_leaf: 5 --- Accuracy: 50.68
>>> Algorithm: cart | min_records_of_parent: 5 | min_records_of_leaf: 1 --- Accuracy: 49.71
>>> Algorithm: cart | min_records_of_parent: 5 | min_records_of_leaf: 2 --- Accuracy: 50.78
>>> Algorithm: cart | min_records_of_parent: 5 | min_records_of_leaf: 3 --- Accuracy: 51.26
>>> Algorithm: cart | min_records_of_parent: 5 | min_records_of_leaf: 4 --- Accuracy: 52.88
>>> Algorithm: cart | min_records_of_parent: 5 | min_records_of_leaf: 5 --- Accuracy: 50.68
From the hyper-parameter tuning we can see the optimal combination of parameters (a small sketch of selecting it programmatically, rather than by eye, follows below). Now we can retrain the model with these parameters for the final application.
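As a side note, the grid search above can also record its results so that the optimum is selected programmatically instead of being read off the printout. This sketch repeats the loop body from above with only the bookkeeping changed.

# Sketch: record each (AUC, parameters) result and pick the best combination programmatically
results = []
for a in algorithms:
    for b in min_records_of_parent:
        for c in min_records_of_leaf:
            dtc = trees.DecisionTreeClassifier(conn_context=conn,
                                               algorithm=a,
                                               min_records_of_parent=b,
                                               min_records_of_leaf=c,
                                               thread_ratio=0.8,
                                               split_threshold=1e-5,
                                               model_format='json',
                                               output_rules=True)
            dtc.fit(df_train, features=features, label='SURVIVED')
            val_result = dtc.predict(df_val, features=features, key="PASSENGER_ID")
            val_score = list(val_result.collect().CONFIDENCE.values)
            results.append((metrics.roc_auc_score(val_actual_binary, val_score), a, b, c))
# Tuples sort by their first element, so max() returns the highest-AUC combination
best_auc, best_algorithm, best_parent, best_leaf = max(results)
print("Best: {} | min_records_of_parent: {} | min_records_of_leaf: {} --- AUC: {:.2f}".format(
      best_algorithm, best_parent, best_leaf, best_auc * 100))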
# Create the DecisionTreeClassifier instance and specify the newly identified optimum hyperparameter combination
dtc_optimised = trees.DecisionTreeClassifier(conn_context=conn,
algorithm='cart',
min_records_of_parent=4,
min_records_of_leaf=4,
thread_ratio=0.4,
split_threshold=1e-5,
model_format='json',
output_rules=True)
# Store the necessary features in a list and invoke the fit method on the training dataset
features = ['PCLASS', 'GENDER', 'AGE', 'SIBSP', 'PARCH', 'FARE']
dtc_optimised.fit(df_train, features=features, key='PASSENGER_ID', label='SURVIVED')
# Invoke the prediction method ("predict()") on the validation dataset
val_result = dtc_optimised.predict(df_val, features=features, key="PASSENGER_ID", verbose=False)
# Create list of actual labels from the validation dataset
val_actual = list(df_val.collect().SURVIVED.values)
# Create list of probabilities from the validation dataset
val_proba = list(val_result.collect().CONFIDENCE)
auc = metrics.roc_auc_score(y_true=val_actual, y_score=val_proba)
print("ROC AUC Score: {:.2f}".format(auc))
>>> ROC AUC Score: 0.53
When we are happy with the model, we can simulate production by applying it to the test sample. As you will notice, this sample is missing the target column ‘SURVIVED’, to simulate real data without labels.
# Create the HANA dataframe and point to the apply table
df_apply = conn.table(table="TABLE", schema="SCHEMA")
# Preview the dataframe before predicting
df_apply.head(5).collect()
| PASSENGER_ID | PCLASS | NAME | GENDER | AGE | SIBSP | PARCH | TICKET | FARE | CABIN | EMBARKED |
|---|---|---|---|---|---|---|---|---|---|---|
| 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | | Q |
| 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47 | 1 | 0 | 363272 | 7 | | S |
| 894 | 2 | Myles, Mr. Thomas Francis | male | 62 | 0 | 0 | 240276 | 9.6875 | | Q |
| 895 | 3 | Wirz, Mr. Albert | male | 27 | 0 | 0 | 315154 | 8.6625 | | S |
| 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22 | 1 | 1 | 3101298 | 12.2875 | | S |
# Invoke the prediction method ("predict()") on the application dataset and inspect the result
result = dtc_optimised.predict(df_apply, features=features, key="PASSENGER_ID", verbose=False)
result.head(5).collect()
| PASSENGER_ID | SCORE | CONFIDENCE |
|---|---|---|
| 892 | Did Not Survive | 1 |
| 893 | Did Not Survive | 0.833333 |
| 894 | Did Not Survive | 0.903226 |
| 895 | Did Not Survive | 1 |
| 896 | Did Not Survive | 0.583333 |
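In a real deployment you would typically keep these predictions inside HANA rather than collecting them to the client. Here is a sketch of persisting the result; the table name is hypothetical, and DataFrame.save() is assumed to be available in your hana_ml version (plain SQL achieves the same if it is not).

# Sketch: persist the predictions as a table in HANA for downstream consumption.
# 'TITANIC_PREDICTIONS' is a hypothetical table name; save() is assumed available.
result.save("TITANIC_PREDICTIONS")
# The predictions can later be picked up again as an ordinary HANA DataFrame
df_predictions = conn.table("TITANIC_PREDICTIONS")
print(df_predictions.head(5).collect())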
This completes the walk-through on how to leverage the power of the ML API alongside the SAP HANA DataFrame to optimise and improve the Data Science process. As mentioned previously, this blog was not intended to provide an exhaustive analysis of the Titanic dataset, simply to demonstrate how the ML API can help you to train, optimise and understand algorithms with extreme speed and efficiency.
Summary
The SAP HANA DataFrame provides an extensive library of Machine Learning algorithms with the bonus of being able to train, evaluate, optimise and deploy these models within HANA using your favourite Pythonic syntax. This greatly increases productivity of Data Scientists by reducing the time dedicated to model training and improvement. By shifting the focus from analysis to process, the Python Client API can help to convert a “Data Science Project” into an industrial machine learning project.
Download HANA Express Edition today and get started!
What do you think of the ML API? Please feel free to share, comment, like and follow!
Arun Godwin Patel
Appendix
- Part 1 – https://blogs.sap.com/2018/12/17/diving-into-the-hana-dataframe-python-integration-part-1/
- SAP HANA PAL Documentation – https://help.sap.com/viewer/2cfbc5cf2bc14f028cfbe2a2bba60a50/2.0.03/en-US/c9eeed704f3f4ec39441434db8a874ad.html
- Python Client API for Machine Learning in SAP HANA 2.0, Express Edition SPS 03, Revision 33 – https://blogs.sap.com/2018/10/29/python-client-api-for-machine-learning-in-sap-hana-2.0-express-edition-sps-03-revision-33/
- Install the SAP HANA Python Client API for Machine Learning Algorithms – https://developers.sap.com/tutorials/hxe-ua-install-python-ml-api.html
- Python Client API Documentation site – https://help.sap.com/doc/0172e3957b5946da85d3fde85ee8f33d/2.0.03/en-US/html/index.html
- Download link for HANA Express – https://developers.sap.com/topics/sap-hana-express.html
- How to set up a HANA Express Python Machine Learning API Demo VM – https://blogs.sap.com/2018/11/03/setting-up-a-hana-express-python-machine-learning-api-demo-vm/
Hi Arun,
great Blog - it helped me a lot.
I have implemented a similar scenario which worked fine. My question is how to bring it to production? I want to export/deploy the model from the Jupyter notebook to HANA so that I can execute it there and automate predictions on future productive data in the HANA database. Do you have any ideas?
BR
Robert
Hi Robert, for now you can save the model to HANA and pick it up from there. We are looking into how we can capture and package the generated SQL code, e.g. in a training procedure.
Best regards,
Christoph
Hi,
Nice blog; it's good to learn about an in-built HANA data structure that provides an abstraction similar to conventional pandas.
I was trying to follow your blog, did all the pre-requisite installation steps and was able to load the TITANIC table into a HANA dataframe.
I went through Hana dataframe APIs at https://help.sap.com/doc/0172e3957b5946da85d3fde85ee8f33d/2.0.03/en-US/html/hana_ml.dataframe.html#
My first impression is that the HANA dataframe isn't very suitable for wrangling, as the APIs are limited.
Suppose I wanted to replace the SURVIVED column's values based on a condition; with a pandas dataframe we can do this simply using numpy, for example:
df['SURVIVED'] = np.where(df['SURVIVED'] ==0,'Did not survive','Survived')
I was under the assumption that the same could be applied to a HANA dataframe, but it is failing.
Can we convert between hana dataframe and pandas ?
Thanks,
Ramana Mittapalli
Hi Ramana,
By using a ".collect()" method on a HANA DataFrame, this will return the data from HANA to the client and create a Pandas DataFrame for you to manipulate how you wish.
Of course, using ".collect()" on large datasets will mean long waiting times, so I suggest you try to keep the data within HANA as much as possible, or complete the data replacements within HANA first.
Hope this helps.
Best regards,
Arun
Thanks Arun for the reply, got your point.
I have another query: if I wrangle the data in pandas data frame format, is there a way to push it back to HANA or save it as a HANA data frame? I ask because I am more comfortable wrangling pandas data frames than prepping data through the HANA interface.
Thanks,
Ramana Mittapalli
you can use SQLAlchemy for that.
Hi Arun,
Thanks for the nice blog. Would you know if I can use current hana_ml to choose for a classification problem APL/ModelType='binary classification' (i.e. XGBoost) rather than 'regression/classification'? Took a look at robust_regression_base.py and saw that by default model_type is 'regression/classification' and that APL/ModelType is taken out from method set_params.
Thanks!
Daniel