Of groups of serial shoppers…
based on SAP HANA revision 74
>>> update 09.02.2016
>>> if you are trying to work on a similar problem, please make sure to also review the solutions posted in the discussion "Fetch the first record with 24 hr interval based on first record timestamp"
>>> The solutions posted there are much more efficient and solve the problem better in multiple ways.
>>> update 09.02.2016
Ok, I pondered about this one a while and I did not find a satisfying solution.
So I thought: let’s take this to the community and see what others can come up with.
But let’s start at the beginning.
A few days ago a colleague sent me an email asking for advice:
I inserted the colors and markers to make the problem a bit easier to understand.
Basically, the report should aggregate shopping transactions based on the time relation between each of them.
For every transaction of every customer, we need to check whether another transaction had happened within a time frame of 14 days before it.
If there is at least one such transaction, these two now belong to the same group.
This of course accumulates like a rolling sum, so whenever a transaction still ‘glues’ to the group via the 14 days range rule, it needs to be included in the group.
In the final result however, the dates of the very first and very last transactions in this group (or chain) of transactions need to be shown along with the SUM of all AMOUNTs.
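To make the chaining rule concrete, here is a minimal Python sketch (not the SQLScript solution discussed below) of the same logic, using made-up sample data; the function names and dates are purely illustrative:

```python
from datetime import date

# Hypothetical sample data: (cust_id, tx_date, amount), sorted by customer and date
txs = [
    (1, date(2014, 1, 1), 10.0),
    (1, date(2014, 1, 10), 20.0),   # within 14 days of Jan 1  -> same group
    (1, date(2014, 2, 20), 30.0),   # more than 14 days later  -> new group
    (2, date(2014, 1, 5), 40.0),    # new customer             -> new group
]

def group_transactions(txs, gap_days=14):
    """Assign a group id: a transaction chains to the previous group
    while it is less than gap_days after the previous transaction."""
    groups = []
    prev_cust, prev_date, group_id = None, None, -1
    for cust, tx_date, amount in txs:
        if cust != prev_cust or (tx_date - prev_date).days >= gap_days:
            group_id += 1
        groups.append((cust, group_id, tx_date, amount))
        prev_cust, prev_date = cust, tx_date
    return groups

def summarize(groups):
    """Per (cust_id, group_id): first date, last date, SUM of amounts."""
    out = {}
    for cust, gid, tx_date, amount in groups:
        first, last, total = out.get((cust, gid), (tx_date, tx_date, 0.0))
        out[(cust, gid)] = (min(first, tx_date), max(last, tx_date), total + amount)
    return out
```

Running `summarize(group_transactions(txs))` on the sample data collapses the first two transactions of customer 1 into one group while the February transaction starts a new one.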
It took me quite some thinking, but after a while I came up with nested window functions – something like this:
At this point I was nearly sure to have nailed it.
Just bring down the START_DATE to the NULLed columns and GROUP BY CUST_ID, START_DATE, right?
Wrong. Unfortunately 🙁
Unfortunately there is no way to get the first non-NULL value from the preceding records.
Without that, there is no combination of columns available to perform the grouping as required.
We’re toast!
So, the solution I found is this:
drop type customers_tt;
create type customers_tt as table
(cust_id integer, from_date date, to_date date, amount decimal(10,2));
create global temporary column table f14_custom_group ( cust_id integer
, group_id integer
, tx_date date
, amount decimal (10,2));
drop procedure forth_nightly_shopping;
create procedure forth_nightly_shopping ( OUT result customers_tt)
language sqlscript
as
begin
declare cur_group_start date := to_date('01-01-1950', 'DD-MM-YYYY');
declare cur_group int := -1;
declare cur_group_cust_id int := -1;
declare cur_cust_id int := -1;
declare cur_tx_date date;
declare last_tx_date date := to_date('01-01-1950', 'DD-MM-YYYY');
declare cur_amount decimal (10,2);
declare cursor c_pre_aggr for select cust_id, tx_date, sum(amount) as amount
from customers
group by cust_id, tx_date
order by cust_id, tx_date;
truncate table f14_custom_group;
for cur_row as c_pre_aggr do
cur_tx_date := cur_row.tx_date;
cur_cust_id := cur_row.cust_id;
cur_amount := cur_row.amount;
-- new group check: (new cust_id) or
-- (old cust_id but tx_date is more than 14 days away from last transaction date)
if cur_cust_id <> cur_group_cust_id then
-- new customer -> new group
-- increase the group id counter
-- set the current transaction date to be the start date for the group
-- remember the current customer id as the group's customer id
cur_group := cur_group + 1;
cur_group_cust_id := cur_cust_id;
cur_group_start := cur_tx_date;
elseif cur_cust_id = cur_group_cust_id
and cur_tx_date >= add_days (last_tx_date, 14) then
-- old customer
-- but new group
cur_group := cur_group + 1;
cur_group_start := cur_tx_date;
end if;
insert into f14_custom_group(cust_id , group_id , tx_date , amount)
values (cur_cust_id, cur_group, cur_tx_date, cur_amount);
last_tx_date := cur_tx_date;
end for;
result = select cust_id
, min (tx_date) as from_date
, max (tx_date) as to_date
, sum (amount) as amount
from f14_custom_group
group by group_id, cust_id
order by cust_id, min(tx_date);
truncate table f14_custom_group;
end;
call forth_nightly_shopping ( ?);
It does the trick and the performance might be acceptable (3.2 million records grouped and aggregated in 23 seconds).
However, the problems are obvious:
- Further parallelism will be difficult to build into this (maybe doable with SP8)
- the implementation is very inflexible
- Any option to add selection criteria will need to be included into the procedure and the call
- Massive copying of data required
Sooo… Does anybody have a better idea to tackle this one?
Cheers,
Lars
Hi Lars, took your initial query and developed it a bit further, think it should do the trick:
select cust_id, start_dt, max(tx_date), sum(amount) from
(select cust_id,
tx_date,
amount,
(select max(start_date) from
(select cust_id,
case
when days_since_last_tx >= 14
or days_since_last_tx is NULL then tx_date
else
NULL
end as start_date
from
(select cust_id,
tx_date,
days_between (lag (tx_date) over (partition by cust_id order by cust_id, tx_date), tx_date ) days_since_last_tx,
amount
from customers)
)
where start_date <= cs.tx_date and cust_id = cs.cust_id
group by cust_id) as start_dt
from customers cs)
group by cust_id, start_dt
Peter
Hi all.
Another approach, but much the same:
Hi Dmitry
I see you revised your initial version 🙂 .
Nice.
Can you provide the SQL as text, so that we can try it out?
- Lars
with a as (select cust_id, tx_date,
                  days_between(lag(tx_date, 1, tx_date) over (partition by cust_id order by cust_id, tx_date), tx_date) as dbw,
                  amount
           from CUSTOMERS),
b as (select cust_id, tx_date,
             case when dbw >= 14 or dbw = 0 then tx_date else null end as days_line,
             amount
      from a),
c as (select a.cust_id, a.tx_date, a.amount, max(days_line) as days_line
      from b, a
      where days_line <= a.tx_date and a.cust_id = b.cust_id
      group by a.cust_id, a.tx_date, a.amount)
select cust_id, days_line, max(tx_date), sum(amount) from c group by cust_id, days_line
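The set-based trick shared by Peter's and Dmitry's queries can be paraphrased in Python (a sketch with made-up data, not HANA code): first mark each transaction that starts a group (no predecessor, or a gap of 14+ days), then use the latest start date at or before each transaction as the group key:

```python
from datetime import date

# Hypothetical sample data: (cust_id, tx_date, amount), sorted by customer and date
txs = [
    (1, date(2014, 1, 1), 10.0),
    (1, date(2014, 1, 10), 20.0),
    (1, date(2014, 2, 20), 30.0),
    (2, date(2014, 1, 5), 40.0),
]

def label_with_start_date(txs, gap_days=14):
    """Step 1 mirrors LAG + days_between + CASE: collect group-start dates.
    Step 2 mirrors the correlated MAX(start_date): the group key of each
    transaction is the latest start date not after its own date."""
    starts, prev = [], {}
    for cust, tx_date, _ in txs:
        gap = (tx_date - prev[cust]).days if cust in prev else None
        if gap is None or gap >= gap_days:
            starts.append((cust, tx_date))   # this transaction opens a group
        prev[cust] = tx_date
    return [(cust,
             max(s for c, s in starts if c == cust and s <= tx_date),
             tx_date, amount)
            for cust, tx_date, amount in txs]
```

Grouping the labelled rows by (cust_id, start date) then gives the same groups as the cursor-based procedure, but without carrying state from row to row, which is what makes the SQL formulation possible.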
Thanks for that!
Did you ever try to run this on more data? Like a few million entries?
When I blow up the data set a little bit, the CTE query uses up the whole system and never returns... 🙁
Thanks anyway!
Lars
Lars, Good day.
54.399 seconds for 3,000,000 rows with my query.
Just a bit of tuning: build a calculation view (CV) with the same code:
/********* Begin Procedure Script ************/
BEGIN
var_a = select cust_id,tx_date,days_between(lag (tx_date,1,tx_date)
over (partition by cust_id order by cust_id, tx_date),tx_date) as dbw,amount from
customer;
var_b = select cust_id,tx_date,case
when dbw>=14 or dbw=0 then tx_date
else null end as days_line,
amount from :var_a;
var_c= select a.cust_id,a.tx_date,a.amount,max(days_line) as days_line
from :var_b as b,:var_a as a
where days_line <=a.tx_date and a.cust_id=b.cust_id
group by a.cust_id,a.tx_date,a.amount;
var_out = select cust_id,days_line,tx_date as tx_data,amount as amount
from :var_c;
--CUST_ID
--DAYS_LINE
--aggregation of tx_date - MAX (measure)
--aggregation of amount - SUM (measure)
END /********* End Procedure Script ************/
Statement 'SELECT "CUST_ID", "DAYS_LINE", sum("AMOUNT") AS "AMOUNT", max("TX_DATA") AS "TX_DATA" FROM ...'
successfully executed in 10.233 seconds (server processing time: 10.232 seconds)
Hi Lars,
Here's a general solution outline that I'd propose. As you know, SAP *may* at some point release actual features that support this line of thinking. Not sure if/when, though...
1) Create a generated column "DATE_PLUS_14" calculated as ADD_DAYS("DATE", 14).
2) Create a parent-child hierarchy column view, defined as a table self-join as follows:
SELECT NULL AS P_START, NULL AS P_END, '0001-01-01' AS C_START, '9999-12-31' AS C_END
FROM DUMMY
UNION ALL
SELECT PARENT.DATE AS P_START, PARENT.DATE_PLUS_14 AS P_END, CHILD.DATE AS C_START, CHILD.DATE_PLUS_14 AS C_END
FROM CUSTOMERS PARENT
INNER JOIN CUSTOMERS CHILD ON
CHILD.DATE BETWEEN PARENT.DATE AND PARENT.DATE_PLUS_14
3) Note that the above logic defines a hierarchy with a root node being "all of time". Also, each "branch" that extends from any level 1 node to the leaves of that hierarchy contains all relevant records.
4) Each node is defined by a pair of dates - i.e. DATE and DATE_PLUS_14. Additional "key" fields would further identify each node uniquely.
5) Note that the join condition could introduce duplicates/overlapping records. If only there were an elegant way to remove duplicates...
6) The solution would be to leverage some kind of function that aggregates values within a hierarchy - i.e. aggregate all sales figures for a particular "branch" of this hierarchy that starts at Level 1 (Level 0 being the root node) through to the leaf nodes. Group by Level 1 nodes.
Alas, I can continue to hope and wait for some cool features that would support lines of thinking like this....
Potential benefits:
*No iteration
*No row store operations
*Efficient, parallel execution of column-store hierarchical/graph algorithms
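As an illustration of the self-join in step 2 (a hypothetical Python sketch, not HANA code), pairing every transaction date with all dates inside its 14-day window shows both the hierarchy and the overlap problem noted above: a date can fall inside more than one parent's window.

```python
from datetime import date, timedelta

# Hypothetical transaction dates for one customer
dates = [date(2014, 1, 1), date(2014, 1, 10), date(2014, 2, 20)]

def window_pairs(dates, gap_days=14):
    """Self-join: (parent, child) pairs where the child date falls inside
    [parent, parent + gap_days] -- the generated DATE_PLUS_14 column."""
    pairs = []
    for p in dates:
        p_end = p + timedelta(days=gap_days)  # DATE_PLUS_14
        for c in dates:
            if p <= c <= p_end:
                pairs.append((p, c))
    return pairs
```

In the sample, Jan 10 shows up as a child of both Jan 1 and itself, which is exactly the duplicate/overlap issue the hoped-for hierarchy aggregation would need to handle.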