Removing outlier using standard deviation in SAP HANA
In this blog post we will learn how to remove the outlier in the data-set using the standard deviation , We can have one sample data set with product sales for all the years.
Before moving into the topic we should know what is a outlier and why it used. The Outlier is the values that lies above or below form the particular range of values . For example consider the data set (20,10,15,40,200,50) So in this 200 is the outlier value, There are many technique adopted to remove the outlier but we are going to use standard deviation technique.
We already know that the standard deviation is square root of the variance. In HANA we have aggregation type STDDEV to calculate it.
i have created a sample data with a product and it no of sales during all the years.
Create a calculation view and use the table which you have created. Add the Product and sales value to semantics the sales is a integer data type convert the aggregation to the Standard deviation.(stddev). We also need to calculate the average value of sales, Create a calculated column as “sales_average” with aggregation type as Average in the semantics.
Now we have the Standard deviation value and the average value of the sales for the product.
By using this calculated values we have to frame the equation that can replace the outlier value with Zeros.
For this we have to create a another calculation view and use the same table in projection and the calculation view with standard deviation calculation in the other projection.
We have to join both by using the product field. Create an new calculated column as “NEW_SALES” and as the following expression.
if(“SALES”<=(“AVERAGE_SALES”+(2*”SALES_STD”)) and (“SALES”>=(“AVERAGE_SALES”+(-2*”SALES_STD”))),”SALES”,0)
By this it compares the value above or below the range if it doesn’t satisfies the condition it is replaced with zero
So now we can see that in these case the value of sales for 2003 is replaced with zero and it can be removed .
We have some data set which can used for training the predictive model. So we can use outliers to remove the values and can be used for effective training model.
Any Suggestion please comment below.