# Predictive Analysis using SQL Functions of HANA

Problem Statement:

An online retailer wants to improve it’s shipping/delivery process. Currently, there are deviations noticed between the planned and the actual date of delivery. Causing customer dis-satisfaction.

Currently the order book has 360 pending orders. Retailer wants to know the following:

1. What pecent of orders will get delivered within 3 days of the planned delivery date?
2. Within how many days of planned date will 95% of orders be delivered?

Solution Approach:

Note: A basic knowledge of statistics and normal distribution is required.

Retailer stores the historic delivery data in the “DELIVERYDATA” table in HANA.

Table “DELIVERYDATA” has the following columns:

OrderNo, CustomerID, ProductID,PlanDelDate,ActDelDate

where PlanDelDate –> Planned Delivery Date

ActDelDate   –> Actual Delivery Date

Step 1:  Assuming that actual delivery dates follow a normal distribution. And delay is the difference between the Planned and Actual delivery date.

Step 2:  Need to find the mean delay and standard deviation of delay for the historical delivery data. Here is the SQL written in HANA:

select  avg((days_between(plandeldate,actdeldate)))     as   mean,

stddev((days_between(plandeldate,actdeldate))) as   standard_deviation

from deliverydata

Mean came as 0.08 and standard deviation came as 3.5466. This is directly from the SQL above on HANA.

Step 3: Now the first problem. Need to find the percentage of delivery where the lag will be + or – 3 days. Used the following calculation on paper ( Manual ):

For “+3” days, the corresponding deviation comes as : ( 3 – 0.08 ) / 3.5466 = 0.8233. Looked into the standard deviation table manually, and the probability percentage comes as 29.39%. Please scroll to the end of this document to get the link for the standard deviation table.

In a similar way, for “-3″ day”, the probability percentage comes as 30.5%.

Combined together it comes = 29.39 + 30.5 = 59.89%

Company has 360 pending orders. 59.89% of 360 is ~ 216

So, 216 of these 360 orders are expected to be delivered within 3 days of the planned delivery date.

Step 4: Next problem. Need to know the delay range, by which 95% of the 360 orders will be delivered.

For normal distribution, twice the standard deviation (2 sigma) accounts for 95% of the distribution.

Mean delay was 0.08 days. As some orders reached late and some early. And standard deviation was 3.5467.

+ 2 sigma comes as :  2 x 3.5467 = 7.0934. Add it to the mean of 0.08. So the total comes as 7.1734. For practical purposes, lets take 7.

Similary for – 2 sigma :  2 x 3.5467 – 0.08 = 7.01. Roughly 7 days.

So 95% of the orders would reach within + or – 7 days of the planned delivery date.

Note: The attachment has the csv file to load data to the HANA “DELIVERYDATA” table.

Also one can find the normal standard distribution table in the below link:

http://www.mathsisfun.com/data/standard-normal-distribution-table.html

Thanks,

Kowshik

### Assigned Tags

You must be Logged on to comment or reply to a post.

Hi Kowshik,

this is a simple and yet very powerful example of a helpful statistical analysis.

As for the first question, you could double check the result by simply doing

select count(*) from deliverydata where days_between(plandeldate,actdeldate) < 3

Which from your sample data (with 100 records only) was around 67%.

Former Member
Blog Post Author

Hi Henrique,

My analysis was for " -3 <= dalay <= 3" days.

Now if we are trying to find the percentage for < 3 days, then here is what it's coming from statistics:

Steps 1:  For tracing delay of less than 3 days, I am considering all deliveries having a  delay of less than or equal to 2 days. We are not considering any fractional delay in our data.

Step 2: For <= 2 days, the deviation comes out as ( 2 - 0.08 ) / 3.5466 = 0.5413

Step 3: From normal distribution, the percentage comes as 70%. And from the actual data, it's coming 67 %. ( I am using the figure from your analysis. I am away from the system now )

So there is a deviation of 3% between actual and pure statistical output.

Best regards,

Kowshik

Considering this query:

select count(*) from test.deliverydata where days_between(plandeldate,actdeldate) <= 3 AND days_between(plandeldate,actdeldate) >= -3;

I got 52%.

Former Member
Blog Post Author

Hi Henrique,

And by pure statistics it's coming 59.89%. That's what it came in the original solution. So a deviation of around 8% between actual and pure stat. Still this is high. It would be good if you can try with a higher data-set and post your findings.

Regards,

Kowshik

Good step by step to tryout - thanks!