Introduction


This blog post is about advanced feature importance analysis with the help of decision trees. We will take the famous WINE dataset (an easy multi-class classification dataset) and upload it to a HANA Cloud instance. On top of it we will build a dedicated class for analyzing feature importance.

Main Idea


Decision trees, and especially gradient boosting, are quite powerful algorithms for classification and regression tasks, and it is often useful to look at their feature importance. But let's add a variation: instead of only the basic importance, let's compute it for different values of max_depth. This can bring us additional insight, something like a level of importance per tree depth. Features that drive splits at max_depth = 1 are important on their own, while features that only gain importance at max_depth = 2 and beyond matter through interactions. So we can inspect all of this on a heatmap.
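For a quick local prototype of this idea, no HANA instance is needed; here is a minimal sketch using scikit-learn's GradientBoostingClassifier instead of the PAL version shown below (the depth range and parameters are illustrative choices):
# Local sketch of depth-wise importance, assuming only scikit-learn and pandas
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.ensemble import GradientBoostingClassifier

wine = load_wine(as_frame=True)
X, y = wine.data, wine.target

scores = {}
for depth in range(1, 6):  # sweep max_depth from 1 to 5
    model = GradientBoostingClassifier(n_estimators=10, max_depth=depth)
    model.fit(X, y)
    scores[f'IMPScore_d{depth}'] = model.feature_importances_

# One row per feature, one column per depth - ready for a heatmap
importance = pd.DataFrame(scores, index=X.columns)
print(importance.round(3))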

Realisation


Let's look at the code. As a first step, we need some libraries:
import hana_ml as hml
# HybridGradientBoostingClassifier is the class actually used below
from hana_ml.algorithms.pal.trees import HybridGradientBoostingClassifier
from sklearn.datasets import load_wine
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

After that, we need some data and a connection to HANA Cloud:
cc = hml.dataframe.ConnectionContext(address=...,
                                     port=443,
                                     user=...,
                                     password=...,
                                     encrypt=True)
data = load_wine(as_frame=True)
data.data['target'] = data.target  # append the label column to the feature frame
data.data.head()

Let's upload it to HANA Cloud:
hml.dataframe.create_dataframe_from_pandas(cc, data.data, 'WINE')
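If you rerun the notebook, the WINE table will already exist. create_dataframe_from_pandas accepts a force flag that drops and recreates the table; a small sketch:
# Replace an existing WINE table on reruns (force=True drops it first)
hml.dataframe.create_dataframe_from_pandas(cc, data.data, 'WINE', force=True)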


Now we can read this data back and check it:
df = hml.dataframe.DataFrame(cc,'SELECT * FROM WINE')
df.head().collect()
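As a quick sanity check, we can also count the rows on the server side; the WINE dataset has 178 samples:
print(df.count())    # row count computed in HANA; expected 178
print(df.columns)    # the 13 feature columns plus 'target'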

Now we can add a dedicated class for feature importance with a variable max_depth:
class FImpy:
    def __init__(self, data: hml.dataframe.DataFrame, target_name: str, max_depth: int):
        self.data = data
        self.target_name = target_name
        self.max_depth = max_depth + 1  # range() upper bound is exclusive
        self._calc_importance()

    def plot(self):
        # Heatmap: one row per feature, one column per max_depth
        plt.figure(figsize=(12, int(self.df.shape[0] / 1.65)))
        sns.heatmap(self.df[self.df.columns[1:]], linewidths=.5,
                    yticklabels=self.df['VARIABLE_NAME'], annot=True)

    def _calc_importance(self):
        # Start at depth 1, then left-join the scores of each deeper model
        self.df = self.get_importance(1)
        for depth in range(2, self.max_depth):
            self.df = self.df.merge(self.get_importance(depth),
                                    how='left', on='VARIABLE_NAME')

    def get_importance(self, max_depth: int):
        # Fit one HGBT model with the given depth and return its importance table
        params = {
            'n_estimators': 10,
            'fold_num': 1,
            'max_depth': max_depth
        }
        hgbc = HybridGradientBoostingClassifier(**params,
                                                evaluation_metric='error_rate',
                                                ref_metric=['auc'],
                                                calculate_importance=True)
        hgbc.fit(self.data, label=self.target_name)  # use the stored DataFrame, not a global
        hres = hgbc.feature_importances_.collect().sort_values('IMPORTANCE', ascending=False)
        fscore_name = f'IMPScore_d{max_depth}'
        hres = hres.rename(mapper={'IMPORTANCE': fscore_name}, axis=1)
        return hres

    def mean(self):
        # Average the per-depth scores into a single ranking
        cols = self.df.columns[1:]
        res = self.df[['VARIABLE_NAME']].copy()
        res['IMPScore_mean'] = self.df[cols].mean(axis=1)
        return res.sort_values('IMPScore_mean', ascending=False).style\
                  .bar(color='lightgreen')

Then we create a new instance for the feature importance analysis, where df is our data, 'target' is the name of the target column, and 5 is the maximum depth (depths 1 to 5 will be evaluated):
fimportance = FImpy(df,'target',5)

Looking at the mean importance is then as easy as:
fimportance.mean()


And if we want the full view, we call the plot() method:
fimportance.plot()
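Since plot() draws with matplotlib, the figure can also be saved right after the call; a small usage sketch:
fimportance.plot()
plt.savefig('importance_heatmap.png', dpi=150, bbox_inches='tight')  # persist the heatmap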



We can see that quite a few features are not very useful, and it is interesting that color_intensity only shines at depth 3 and above, so this feature needs interactions with others to contribute.

The End


Try it yourself and share your thoughts about this method.
