# The Journey of an Algorithm : Look-A-Like Customers with Social Patterns

As explained in this blog on Omni-channel consumer engagement, the focus is getting bigger on solutions that can solve unique but complex usecases to cater to the expectations of the Marketing Analysts or the E-Commerce applications, which is to make sense out of sheer volume of data screaming in from various channels (think 3Vs : Volume / Variety / Velocity ).

This is one area where HANA can be used innovatively to solve hitherto unknown problem domains. In this blog, I tried to trace thru one such problem that we solved based on the above-mentioned blog, how we expanded the usecase, how different approaches were experimented and how a unique algorithm was invented on HANA to meet the requirements.

### The Origin

The problem originated from a simple question: What if an e-Commerce site were to recommend to the user what people with similar “interests/Likes” have already bought. This looked like an easy problem to solve, but then, when we look at it realistically from a social media point of view, there can be hundreds of users for each similar Like, each having bought completely different product, we may end up recommending the entire product catalog to this user!

Then, after some more research, we realized that the behavior of the user can be closely matched if we were to combine more than one Like together and see which users overlap this combination the most, i.e, people with likes similar to Wildlife, National Geography and Camera might have more things in common in terms of choosing a lens for the camera, than individual Likes. This solves two problems: identify behaviorally similar set of people, and reduce the choice of recommendation in a rational way.

It finally resulted in the problem statement: “If current user x have y number of Likes (Lx1..Lxy), and if there are other users in the same domain with their own Likes (L11..Lmn), what could be the most overlapped group of Likes, considering both on the number of Likes in that group and the number people following that group.”

Following example gives an illustration: Have a look at the current user’s Likes, and how many other users in the target group share how many similar Likes.

### The problem

Out of the problem domain mentioned above, the key challenge to be solved is “How can we find out, out of the whole space, the (top most) combinations that are common”. Of course, in order to do that, we may need to calculate all combinations of the current user with each and every combination of every other user which would be permutationally taxing. So, we needed to look for a quicker and more performant approach.

Also, based on specific usecases, some would give more weightage to number of Likes in the group, or, some to the number users per group. Then we can extend the problem to support both metrics. Say, if there 5 users having 10 similar likes that of the current user, it would indicate a small number of users but a strong similarity (we call this Match). Or, there could be 100 users having 3 similar likes which indicates huge trend with a limited similarity (we call this Support). Based on the usecases and the Likes involved, it simply depends, and hence we need provide both metrics in a sorted order, so that an analyst can make a better choice.

The following diagram explains the entire workflow of the usecase we want to solve, from the bottom where have the raw data to the top where we have the product recommendations:

### Apriori Modified

We started looking into the Apriori algorithm as it tries to solve a similar problem in the domain of market basket analysis, where individual baskets are recorded as the users buy them, and it can predict the probability of the likely products to be bought.

Say, if we record the following purchases with Apriori

– {P1, P2, P3},

– {P2, P3},

– {P1,P2} )

Now, we can ask the probability of someone buying {P2} if they have already bought {P1} and calculate how much support and confidence we have in that.

In this example,

For {P1} -> {P2} => Support is 2, Confidence is 100%

i.e, people always buy P2 if they buy P1 based on 2 occurrences in the recordset.

But Apriori does not have a concept for the input set, as required by our problem (i.e, the likes of the current user and compare the rest of the sets for matchings), but if somehow we have to figure out a way to bring {OctoberFest, Ferrar, Photography, Nikon} on LHS, it can automatically spit out the necessary support combinations on RHS, which will solve our primary problem of bubbling up all the combinations. But it cannot happen as atleast one value out of that set has to be on RHS (see the Normal method in the below diagram)!

And, one trick we used to make that happen is to add a dummy like called “Target”, which will ensure we get all 4 like on LHS. From here, Apriori will take over and actually gives us all the support combination. In fact it will give all combination, but we can set a pre-defined RHS filter which is “Target”. It works but with a major disadvantage: still the algorithm has to run through all combinations beyond what is necessary for our purpose. And, we were even thinking about extending the Apriori itself at code level until we stuck upon a different path altogether.

### The Final Solution

As we were brainstorming for a better solution which is economical and performant, we thought why not solve the problem from the DB/SQL point of view rather than as a programmatic algorithm, which anyway suits us fine to showcase the capabilities of HANA and its number-crunching abilities.

So, we evolved base of this solution on the premise of providing a unique identity to each Like which will result in a unique identity to each combination, then simply count them. In the above example, let us map each of the input likes to L1, L2, L3 and L4, then the resulting calculations would look like below. With this we have reached calculating the direct combination quite simply and quickly.

Though this aggregation would give us the direct combinations of likes, the “hidden” likes needs to be found out: for example, if we have combinations of

{L1, L2,L3} = 10 occurrences

{L1, L2, L4} = 20 occurrences,

Then we have to effectively deduce that {L1, L2} = 30 occurrences. We overcame this problem using L-script within HANA with a binary comparison method.

### Performance and Summary

The key advantage of this approach is performance when especially run on a flat data model without joins. In our experiments on a typical development box, for 1 million likes transactions can be completed under a second to calculate the matches and supports (with L-Script for hidden matches obviously taking most of the time).

Once we reach this combination, it’s only a matter of getting the right products used by this combination and parameterize it so it can be calculated for high support and/or a high match.

So, the key take-away at least for us is, we can really take the seemingly simple problems but extend their scope in such a way that the problem itself is as innovating as the solution that can be made possible on HANA.

Your thoughts are welcome especially around additional usecases around this space both functionally and technically which can enrich the marketing domain…

Hi Balalji,

Good reading thru! you presented the very technical subject matter in a clear, concise and digestible format..

Regards, Pavan

Thanks for your comments, Pavan !

Hi Balaji,

Good blog describing the different approaches to solve this problem. Nicely documented and here are few calrifications that I have:

1. In the first diagram, felt the "support" and "match" is interchanged?

2. Second diagram, the support on the combinations , in the predict step , think should be 80%, 60%., 60% and 40%? Or did I get this wrong?

3. In the 3rd diagram did not understand how the support here is calculated.

4. 4th Diagram "Add Subset aggregation" could not understand how 19-1 came?

Thanks

Sijesh

Hi Sijesh,

Thanks for your comments, moreover, appreciate you pointing out the typos and anomalies in the diagrams. I've corrected all of them. Let me know if there something else missing.

Regarding 3 around Apriori, the important usage of that algorithm is just bring out the combinations, but Apriori provides two metrics, support (how many users bought P2 after buying P1) and confidence (P2 might have been bought without P1, thus reducing the confidence factor in such cases), I've added these in this diagram. Of course, in our tweak, support will be always 100% as we filter for 'Target' and confidence somewhat resembles Support in our usecase !

Thanks,

Balaji.

Hi Balaji,

Thanks. Now it looks perfect.

Assigning 1,2,4,8... and so on to solve a problem like this is cool.

Thanks,

Sijesh

Hi Balaji,

thanks for this detailed blog sharing a great new algorithm for key business problem, solved by you and the team.

This is a great example of putting HANA into effective use for solid digital use case that is very much a key requirement, the concept often being discussed as "mass personalization", where out of the big data, we can quickly narrow down based on personality traits to match with the right products to be recommended. While the focus was on the social media likes, it can be correlated to any other consumer attribute to be matched across millions of consumers and billions of consumer activities, using SAP HANA.

This has been well received by our CPG and Travel customers worldwide.

Great innovation coming in from SAP Labs Bangalore ðŸ™‚

Thanks

Muthu