Introduction


This blog post is about advanced feature importance analysis with the help of decision trees. We will take the famous WINE dataset (an easy multi-class classification dataset) and upload it to a HANA Cloud instance. On top of it we will build a dedicated class for analyzing feature importance.

Main Idea


Decision trees, and especially gradient boosting, are quite powerful algorithms for classification and regression tasks, and it is often useful to look at their feature importance. But let's add a variation: instead of only the basic importance, let's compute it for different values of max_depth. This can bring us additional insight, something like a level of importance per tree depth. Features that drive splits at max_depth = 1 are important on their own, while features that only gain importance at max_depth = 2 and beyond matter through interactions. So we can inspect all of this on a heatmap.
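For a quick local prototype of this idea, no HANA instance is needed; here is a minimal sketch using scikit-learn's GradientBoostingClassifier instead of the PAL version shown below (the depth range and parameters are illustrative choices):
# Local sketch of depth-wise importance, assuming only scikit-learn and pandas
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.ensemble import GradientBoostingClassifier

wine = load_wine(as_frame=True)
X, y = wine.data, wine.target

scores = {}
for depth in range(1, 6):  # sweep max_depth from 1 to 5
    model = GradientBoostingClassifier(n_estimators=10, max_depth=depth)
    model.fit(X, y)
    scores[f'IMPScore_d{depth}'] = model.feature_importances_

# One row per feature, one column per depth - ready for a heatmap
importance = pd.DataFrame(scores, index=X.columns)
print(importance.round(3))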

Realisation


Let's look at the code. As a first step, we need some libraries:
import hana_ml as hml
# HybridGradientBoostingClassifier is the class actually used below
from hana_ml.algorithms.pal.trees import HybridGradientBoostingClassifier
from sklearn.datasets import load_wine
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

After that, we need some data and a connection to HANA Cloud:
cc = hml.dataframe.ConnectionContext(address=...,
                                     port=443,
                                     user=...,
                                     password=...,
                                     encrypt=True)
data = load_wine(as_frame=True)
data.data['target'] = data.target  # append the label column to the feature frame
data.data.head()

Let's upload it to HANA Cloud:
hml.dataframe.create_dataframe_from_pandas(cc, data.data, 'WINE')
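If you rerun the notebook, the WINE table will already exist. create_dataframe_from_pandas accepts a force flag that drops and recreates the table; a small sketch:
# Replace an existing WINE table on reruns (force=True drops it first)
hml.dataframe.create_dataframe_from_pandas(cc, data.data, 'WINE', force=True)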


Now we can read this data back and check it:
df = hml.dataframe.DataFrame(cc,'SELECT * FROM WINE')
df.head().collect()
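As a quick sanity check, we can also count the rows on the server side; the WINE dataset has 178 samples:
print(df.count())    # row count computed in HANA; expected 178
print(df.columns)    # the 13 feature columns plus 'target'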

Now we can add a dedicated class for feature importance with a variable max_depth:
class FImpy:
    def __init__(self, data: hml.dataframe.DataFrame, target_name: str, max_depth: int):
        self.data = data
        self.target_name = target_name
        self.max_depth = max_depth + 1  # range() upper bound is exclusive
        self._calc_importance()

    def plot(self):
        # Heatmap: one row per feature, one column per max_depth
        plt.figure(figsize=(12, int(self.df.shape[0] / 1.65)))
        sns.heatmap(self.df[self.df.columns[1:]], linewidths=.5,
                    yticklabels=self.df['VARIABLE_NAME'], annot=True)

    def _calc_importance(self):
        # Start at depth 1, then left-join the scores of each deeper model
        self.df = self.get_importance(1)
        for depth in range(2, self.max_depth):
            self.df = self.df.merge(self.get_importance(depth),
                                    how='left', on='VARIABLE_NAME')

    def get_importance(self, max_depth: int):
        # Fit one HGBT model with the given depth and return its importance table
        params = {
            'n_estimators': 10,
            'fold_num': 1,
            'max_depth': max_depth
        }
        hgbc = HybridGradientBoostingClassifier(**params,
                                                evaluation_metric='error_rate',
                                                ref_metric=['auc'],
                                                calculate_importance=True)
        hgbc.fit(self.data, label=self.target_name)  # use the stored DataFrame, not a global
        hres = hgbc.feature_importances_.collect().sort_values('IMPORTANCE', ascending=False)
        fscore_name = f'IMPScore_d{max_depth}'
        hres = hres.rename(mapper={'IMPORTANCE': fscore_name}, axis=1)
        return hres

    def mean(self):
        # Average the per-depth scores into a single ranking
        cols = self.df.columns[1:]
        res = self.df[['VARIABLE_NAME']].copy()
        res['IMPScore_mean'] = self.df[cols].mean(axis=1)
        return res.sort_values('IMPScore_mean', ascending=False).style\
                  .bar(color='lightgreen')

Then we create a new instance for the feature importance analysis, where df is our data, 'target' is the name of the target column, and 5 is the maximum depth (depths 1 to 5 will be evaluated):
fimportance = FImpy(df,'target',5)

Looking at the mean importance is then as easy as:
fimportance.mean()


And if we want the full view, we call the plot() method:
fimportance.plot()
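Since plot() draws with matplotlib, the figure can also be saved right after the call; a small usage sketch:
fimportance.plot()
plt.savefig('importance_heatmap.png', dpi=150, bbox_inches='tight')  # persist the heatmap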



We can see that quite a few features are not very useful, and it is interesting that color_intensity only shines at depth 3 and above, so this feature needs interactions with others to contribute.

The End


Try it yourself and share your thoughts about this method.
