Predictive Analysis using SQL Functions of HANA
Problem Statement:
An online retailer wants to improve it’s shipping/delivery process. Currently, there are deviations noticed between the planned and the actual date of delivery. Causing customer dis-satisfaction.
Currently the order book has 360 pending orders. Retailer wants to know the following:
- What pecent of orders will get delivered within 3 days of the planned delivery date?
- Within how many days of planned date will 95% of orders be delivered?
Solution Approach:
Note: A basic knowledge of statistics and normal distribution is required.
Retailer stores the historic delivery data in the “DELIVERYDATA” table in HANA.
Table “DELIVERYDATA” has the following columns:
OrderNo, CustomerID, ProductID,PlanDelDate,ActDelDate
where PlanDelDate –> Planned Delivery Date
ActDelDate –> Actual Delivery Date
Step 1: Assuming that actual delivery dates follow a normal distribution. And delay is the difference between the Planned and Actual delivery date.
Step 2: Need to find the mean delay and standard deviation of delay for the historical delivery data. Here is the SQL written in HANA:
select avg((days_between(plandeldate,actdeldate))) as mean,
stddev((days_between(plandeldate,actdeldate))) as standard_deviation
from deliverydata
Mean came as 0.08 and standard deviation came as 3.5466. This is directly from the SQL above on HANA.
Step 3: Now the first problem. Need to find the percentage of delivery where the lag will be + or – 3 days. Used the following calculation on paper ( Manual ):
For “+3” days, the corresponding deviation comes as : ( 3 – 0.08 ) / 3.5466 = 0.8233. Looked into the standard deviation table manually, and the probability percentage comes as 29.39%. Please scroll to the end of this document to get the link for the standard deviation table.
In a similar way, for “-3″ day”, the probability percentage comes as 30.5%.
Combined together it comes = 29.39 + 30.5 = 59.89%
Company has 360 pending orders. 59.89% of 360 is ~ 216
So, 216 of these 360 orders are expected to be delivered within 3 days of the planned delivery date.
Step 4: Next problem. Need to know the delay range, by which 95% of the 360 orders will be delivered.
For normal distribution, twice the standard deviation (2 sigma) accounts for 95% of the distribution.
Mean delay was 0.08 days. As some orders reached late and some early. And standard deviation was 3.5467.
+ 2 sigma comes as : 2 x 3.5467 = 7.0934. Add it to the mean of 0.08. So the total comes as 7.1734. For practical purposes, lets take 7.
Similary for – 2 sigma : 2 x 3.5467 – 0.08 = 7.01. Roughly 7 days.
So 95% of the orders would reach within + or – 7 days of the planned delivery date.
Note: The attachment has the csv file to load data to the HANA “DELIVERYDATA” table.
Also one can find the normal standard distribution table in the below link:
http://www.mathsisfun.com/data/standard-normal-distribution-table.html
Thanks,
Kowshik
Hi Kowshik,
this is a simple and yet very powerful example of a helpful statistical analysis.
As for the first question, you could double check the result by simply doing
select count(*) from deliverydata where days_between(plandeldate,actdeldate) < 3
Which from your sample data (with 100 records only) was around 67%.
Hi Henrique,
My analysis was for " -3 <= dalay <= 3" days.
Now if we are trying to find the percentage for < 3 days, then here is what it's coming from statistics:
Steps 1: For tracing delay of less than 3 days, I am considering all deliveries having a delay of less than or equal to 2 days. We are not considering any fractional delay in our data.
Step 2: For <= 2 days, the deviation comes out as ( 2 - 0.08 ) / 3.5466 = 0.5413
Step 3: From normal distribution, the percentage comes as 70%. And from the actual data, it's coming 67 %. ( I am using the figure from your analysis. I am away from the system now )
So there is a deviation of 3% between actual and pure statistical output.
Best regards,
Kowshik
Considering this query:
select count(*) from test.deliverydata where days_between(plandeldate,actdeldate) <= 3 AND days_between(plandeldate,actdeldate) >= -3;
I got 52%.
Hi Henrique,
And by pure statistics it's coming 59.89%. That's what it came in the original solution. So a deviation of around 8% between actual and pure stat. Still this is high. It would be good if you can try with a higher data-set and post your findings.
Regards,
Kowshik
Good step by step to tryout - thanks!