based on SAP HANA revision 74

>>> update 09.02.2016

>>> if you are trying to work on a similar problem, please make sure to also review the solutions posted in discussion Fetch the first record with 24 hr interval based on first record timestamp

>>> The solutions posted there are much more efficient and solve the problem better in multiple ways.

>>> update 09.02.2016

Ok, I pondered about this one a while and I did not find a satisfying solution.

So I thought: let’s take this to the community and see what others can come up with.

But let’s start at the beginning.

A few days ago a colleague sent me an email asking for advice:

[Screenshot: the colleague's email describing the reporting requirement]

I inserted the colors and markers to make the problem a bit easier to understand.

Basically, the report should aggregate shopping transactions based on the time relation between each of them.

For every transaction of every customer, it needs to be checked whether another transaction happened within a time frame of 14 days before it.

If there is at least one such transaction, these two now belong to the same group.

This of course accumulates like a rolling sum, so whenever a transaction still ‘glues’ to the group via the 14-day range rule, it needs to be included in the group.


In the final result, however, the dates of the very first and very last transaction in this group (or chain) of transactions need to be shown along with the SUM of all AMOUNTs.
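To make the grouping rule concrete, here is a minimal Python sketch with made-up data (the transaction values are purely illustrative, not from the actual report): walk the transactions sorted by customer and date, and start a new group whenever the customer changes or the gap to the previous transaction reaches 14 days.

```python
from datetime import date

# Illustrative data: (cust_id, tx_date, amount), sorted by customer and date.
transactions = [
    (1, date(2014, 1, 1), 10.0),
    (1, date(2014, 1, 10), 20.0),  # 9 days later  -> same group
    (1, date(2014, 2, 1), 5.0),    # 22 days later -> new group
    (2, date(2014, 1, 5), 7.5),    # new customer  -> new group
]

groups = []  # each entry: [cust_id, from_date, to_date, amount_sum]
for cust_id, tx_date, amount in transactions:
    g = groups[-1] if groups else None
    if g and g[0] == cust_id and (tx_date - g[2]).days < 14:
        # still within 14 days of the previous transaction: extend the group
        g[2] = tx_date
        g[3] += amount
    else:
        groups.append([cust_id, tx_date, tx_date, amount])

for g in groups:
    print(g)
```

The catch is that this walk carries state from row to row, which is exactly what plain GROUP BY cannot express.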


It took me quite some thinking, but after a while I came up with nested window functions – something like this:

[Screenshot: query using nested window functions]

At this point I was nearly sure to have nailed it.

Just bring down the START_DATE to the NULLed columns and GROUP BY CUST_ID, START_DATE, right?

Wrong, unfortunately 🙁

There is no way to get the first non-NULL value from the preceding records.


Without that, there is no combination of columns available to perform the grouping as required.

We’re toast!


So, the solution I found is this:

drop type customers_tt;
create type customers_tt as table
    (cust_id integer, from_date date, to_date date, amount decimal(10,2));

create global temporary column table f14_custom_group
    ( cust_id  integer
    , group_id integer
    , tx_date  date
    , amount   decimal(10,2));

drop procedure forth_nightly_shopping;
create procedure forth_nightly_shopping ( OUT result customers_tt)
language sqlscript
as
begin
declare cur_group_start date := to_date('01-01-1950', 'DD-MM-YYYY');
declare cur_group int := -1;
declare cur_group_cust_id int := -1;
declare cur_cust_id int := -1;
declare cur_tx_date date;
declare last_tx_date date := to_date('01-01-1950', 'DD-MM-YYYY');
declare cur_amount decimal(10,2);
declare cursor c_pre_aggr for select cust_id, tx_date, sum(amount) as amount
                                from customers
                               group by cust_id, tx_date
                               order by cust_id, tx_date;

truncate table f14_custom_group;

for cur_row as c_pre_aggr do
    cur_tx_date := cur_row.tx_date;
    cur_cust_id := cur_row.cust_id;
    cur_amount  := cur_row.amount;
    -- new group check: (new cust_id) or
    -- (old cust_id but tx_date is more than 14 days away from last transaction date)
    if cur_cust_id <> cur_group_cust_id then
        -- new customer -> new group:
        -- increase the group id counter,
        -- set the current transaction date to be the start date for the group,
        -- remember the current customer id
        cur_group := cur_group + 1;
        cur_group_cust_id := cur_cust_id;
        cur_group_start := cur_tx_date;
    elseif cur_cust_id = cur_group_cust_id
       and cur_tx_date >= add_days (last_tx_date, 14) then
        -- old customer, but new group
        cur_group := cur_group + 1;
        cur_group_start := cur_tx_date;
    end if;

    insert into f14_custom_group (cust_id, group_id, tx_date, amount)
                          values (cur_cust_id, cur_group, cur_tx_date, cur_amount);
    last_tx_date := cur_tx_date;
end for;

result = select cust_id
              , min (tx_date) as from_date
              , max (tx_date) as to_date
              , sum (amount)  as amount
           from f14_custom_group
          group by group_id, cust_id
          order by cust_id, min(tx_date);

truncate table f14_custom_group;
end;

call forth_nightly_shopping ( ?);

It does the trick and the performance might be acceptable (3.2 million records grouped and aggregated in 23 seconds).

[Screenshot: procedure execution time]

However, the problems are obvious:

  • Further parallelism will be difficult to build into this (maybe doable with SP8)
  • The implementation is very inflexible
  • Any option to add selection criteria will need to be included in the procedure and the call
  • Massive copying of data is required

Sooo… Does anybody have a better idea to tackle this one?

Cheers,

Lars


7 Comments


  1. Peter Murphy

    Hi Lars, I took your initial query and developed it a bit further; I think it should do the trick:


    select cust_id, start_dt, max(tx_date), sum(amount)
      from (select cust_id,
                   tx_date,
                   amount,
                   (select max(start_date)
                      from (select cust_id,
                                   case
                                     when days_since_last_tx >= 14
                                       or days_since_last_tx is NULL then tx_date
                                     else NULL
                                   end as start_date
                              from (select cust_id,
                                           tx_date,
                                           days_between (lag (tx_date) over (partition by cust_id order by cust_id, tx_date), tx_date) days_since_last_tx,
                                           amount
                                      from customers))
                     where start_date <= cs.tx_date and cust_id = cs.cust_id
                     group by cust_id) as start_dt
              from customers cs)
     group by cust_id, start_dt

    Peter

    1. Lars Breddemann Post author

      Hi Dmitry

      I see you revised your initial version 🙂 .

      Nice.

      Can you provide the SQL as text, so that we can try it out?

      – Lars

      1. Dmitriy Buslov

        with a as (select cust_id, tx_date,
                          days_between(lag (tx_date, 1, tx_date) over (partition by cust_id order by cust_id, tx_date), tx_date) as dbw,
                          amount
                     from CUSTOMERS),
             b as (select cust_id, tx_date,
                          case when dbw >= 14 or dbw = 0 then tx_date else null end as days_line,
                          amount
                     from a),
             c as (select a.cust_id, a.tx_date, a.amount, max(days_line) as days_line
                     from b, a
                    where days_line <= a.tx_date and a.cust_id = b.cust_id
                    group by a.cust_id, a.tx_date, a.amount)
        select cust_id, days_line, max(tx_date), sum(amount) from c group by cust_id, days_line
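The idea behind this query — mark each group-start row with its own date (first row, or a gap of 14+ days) and then pick the greatest start marker not after each row — can be sketched in Python for a single customer with illustrative dates:

```python
from datetime import date

# One customer's sorted transaction dates (illustrative values only).
tx_dates = [date(2014, 1, 1), date(2014, 1, 10), date(2014, 2, 1), date(2014, 2, 5)]

# Step "b": mark group starts (first row, or gap of 14+ days); others stay None,
# like the NULLs produced by the CASE expression.
starts = []
prev = None
for d in tx_dates:
    is_start = prev is None or (d - prev).days >= 14
    starts.append(d if is_start else None)
    prev = d

# Step "c": each row's group key is the greatest start marker not after it,
# which is what max(days_line) with the days_line <= tx_date condition computes.
group_keys = [max(s for s in starts if s is not None and s <= d) for d in tx_dates]
print(group_keys)
```

Grouping by that key then yields one row per 14-day chain, just as the final SELECT does per customer.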

        1. Lars Breddemann Post author

          Thanks for that!

          Did you ever try to run this on more data? Like a few million entries?

          When I blow up the data set a little bit, the CTE query uses up the whole system and never returns… 🙁

          Thanks anyway!

          Lars

          1. Dmitriy Buslov

            Lars, Good day.

            54.399 seconds for 3,000,000 rows with my query.

            Just a bit of tuning – build a CV (calculation view) with the same code:

            /********* Begin Procedure Script ************/
            BEGIN
               var_a = select cust_id, tx_date,
                              days_between(lag (tx_date, 1, tx_date) over (partition by cust_id order by cust_id, tx_date), tx_date) as dbw,
                              amount
                         from customer;

               var_b = select cust_id, tx_date,
                              case when dbw >= 14 or dbw = 0 then tx_date else null end as days_line,
                              amount
                         from :var_a;

               var_c = select a.cust_id, a.tx_date, a.amount, max(days_line) as days_line
                         from :var_b as b, :var_a as a
                        where days_line <= a.tx_date and a.cust_id = b.cust_id
                        group by a.cust_id, a.tx_date, a.amount;

               var_out = select cust_id, days_line, tx_date as tx_data, amount as amount
                           from :var_c;
               -- CUST_ID
               -- DAYS_LINE
               -- aggregation of tx_date - MAX (measure)
               -- aggregation of amount - SUM (measure)
            END
            /********* End Procedure Script ************/

            Statement 'SELECT "CUST_ID", "DAYS_LINE", sum("AMOUNT") AS "AMOUNT", max("TX_DATA") AS "TX_DATA" FROM ...'

            successfully executed in 10.233 seconds (server processing time: 10.232 seconds)

  2. Jody Hesch

    Hi Lars,

    Here’s a general solution outline that I’d propose. As you know, SAP *may* at some point release actual features that support this line of thinking. Not sure if/when, though…

    1) Create a generated column "DATE_PLUS_14" calculated as ADD_DAYS("DATE", 14).

    2) Create a parent-child hierarchy column view based on a table self-join, defined as follows:

    SELECT NULL AS P_START, NULL AS P_END, '0001-01-01' AS C_START, '9999-12-31' AS C_END
    FROM DUMMY
    UNION ALL
    SELECT PARENT.DATE AS P_START, PARENT.DATE_PLUS_14 AS P_END, CHILD.DATE AS C_START, CHILD.DATE_PLUS_14 AS C_END
    FROM CUSTOMERS PARENT
    INNER JOIN CUSTOMERS CHILD ON
         CHILD.DATE BETWEEN PARENT.DATE AND PARENT.DATE_PLUS_14

    3) Note that the above logic defines a hierarchy whose root node is "all of time". Also, each "branch" that extends from any level 1 node to the leaves of that hierarchy contains all relevant records.

    4) Each node is defined by a pair of dates, i.e. DATE and DATE_PLUS_14. Additional "key" fields would further identify each node uniquely.

    5) Note that the join condition could introduce duplicate/overlapping records. If only there were an elegant way to remove duplicates…

    6) The solution would be to leverage some kind of function that aggregates values within a hierarchy, i.e. aggregate all sales figures for a particular "branch" of this hierarchy that starts at Level 1 (Level 0 being the root node) through to the leaf nodes. Group by Level 1 nodes.

    Alas, I can continue to hope and wait for some cool features that would support lines of thinking like this….

    Potential benefits:

    *No iteration

    *No row store operations

    *Efficient, parallel execution of column-store hierarchical/graph algorithms
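    A minimal Python sketch of the self-join condition in the outline above, with illustrative single-customer dates, shows both the parent→child edges and the duplicate/overlap issue Jody mentions:

    ```python
    from datetime import date, timedelta

    # Illustrative dates; DATE_PLUS_14 from step 1 corresponds to p + WINDOW here.
    dates = [date(2014, 1, 1), date(2014, 1, 10), date(2014, 2, 1)]
    WINDOW = timedelta(days=14)

    # The self-join: a parent->child edge whenever the child's date falls
    # inside the parent's 14-day window (each node is also its own child).
    edges = [(p, c) for p in dates for c in dates if p <= c <= p + WINDOW]
    print(edges)
    ```

    Every date appears as its own child and overlapping windows share children, so any hierarchy aggregation over these edges would indeed need the duplicate handling described in the outline.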

