Building an SCN Influencer Analysis App using SAP River, HANA and Lumira
Over on SAPHANA.com I posted a blog (it may not be live yet, bear with me!) about influencer analysis using SAP River, HANA and Lumira. That blog deals mostly with the analysis itself, whilst this one is about the making of the app.
Where did the idea come from?
After SAP River was released, I came to think about potential use cases and I really wanted to build an app that’s a bit more than the standard “movie casting” app that is in the developer notes. To do this, I needed an interesting data source and I was reminded of the beta SCN API which was created by Matthias Steiner and the SCN team. The SCN API is in beta for testing and legal reasons, so I can’t reveal the means to access it. But, it is largely based on the Jive REST API.
I figured I could reuse the Python code I wrote a few days ago, which integrates Python with River, to inject data into SAP River. I'd then tap into the power of the HANA platform by adding HANA Text Analysis for sentiment analysis, and expose the results using SAP Lumira. And then, to make it interesting, I gave myself one day to write the app and one day to create a story, document it and blog about it.
The whole point of SAP River is that it’s supposed to be easy to use and fast to develop, so this should be possible, right?
Building the River app
The process was a bit iterative, as I poked at the SCN API to find the data I wanted, but here's my RDL code. It's pretty simple and it describes SCN Spaces, Content and Authors. I decided to model both Blogs and Documents as Content, so I could easily aggregate across both. I defined contentType as an enumerated type, so when I insert records later, I specify which type of content I'm inserting into HANA.
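The original code was a screenshot, so here is an approximate reconstruction of its shape. This is a sketch from memory only: the application, entity and field names are my guesses, and RDL syntax varied between River releases, so treat it as illustrative rather than exact.

```
// Approximate RDL sketch – names and association syntax are illustrative
application SCNInfluencer {
    type ContentType : String enum { blog; document; }

    export entity Space {
        key element id : String;
        element name   : String;
    }

    export entity Author {
        key element id : String;
        element name   : String;
    }

    export entity Content {
        key element id      : String;
        element title       : String;
        element contentType : ContentType;
        element published   : UTCTimestamp;
        element text        : LargeString;
        association author of Author;   // link each piece of content
        association space  of Space;    // to its author and space
    }
}
```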
What's fantastic about RDL is that it then generates the HANA tables, views, entity relationships and OData services for you. Done. Now we can get on with loading data. Here's a sample table:
Loading data into HANA
This is pretty easy. I used Python as my language of choice, with Sublime Text for editing – thanks to DJ Adams and Brenton O'Callaghan for the advice there. Here's my code to load data into HANA. I'm sure there are better ways of doing this – I'm a hacker, not a programmer.
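The loading code itself was a screenshot in the original post, so below is a minimal sketch of the approach under stated assumptions: the OData URL, payload field names and the URL-to-ID scheme are hypothetical, and only the timestamp format and the json.dumps re-encoding come from the gotchas described in the text.

```python
import json
import urllib.request
from datetime import datetime, timezone

# Hypothetical endpoint for the OData service that River generates
RIVER_ODATA_URL = "https://myhana.example.com:8000/influencer/content.xsodata/Content"


def to_odata_millis(ts: str) -> str:
    """Convert an ISO-8601 timestamp from the SCN API into the OData
    '/Date(milliseconds-since-epoch)/' format that UTCTimestamp expects."""
    dt = datetime.fromisoformat(ts).replace(tzinfo=timezone.utc)
    return f"/Date({int(dt.timestamp() * 1000)})/"


def content_id_from_url(url: str) -> str:
    """Derive a stable ID from the blog URL (hypothetical scheme:
    just take the last path segment)."""
    return url.rstrip("/").rsplit("/", 1)[-1]


def build_payload(item: dict) -> bytes:
    """Build the JSON body for one piece of content. json.dumps
    re-encodes the UTF-8 content safely for the POST body."""
    payload = {
        "id": content_id_from_url(item["url"]),
        "title": item["title"],
        "contentType": item["type"],  # "blog" or "document"
        "published": to_odata_millis(item["published"]),
    }
    return json.dumps(payload).encode("utf-8")


def post_content(item: dict) -> None:
    """POST one record to the River-generated OData service
    (authentication omitted for brevity)."""
    req = urllib.request.Request(
        RIVER_ODATA_URL,
        data=build_payload(item),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)
```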
There are a few gotchas:
– The SAP River UTCTimestamp uses the OData format and requires dates in “milliseconds since the epoch” which is very frustrating. That’s the reason for the weird time conversion code. Blame Microsoft for this!
– You have to re-encode the SCN Content and other UTF-8 data in JSON, or it will fail, hence the json.dumps
– I do some funny work to turn the blog URL into an ID for later use
– These aren’t my real hostname, username or password 🙂
– I found for complex views (e.g. give me all the spaces I haven’t downloaded yet from SCN), it can be necessary to create HANA views and manual xsodata services. Not a big deal.
Enabling Text Search and Sentiment Analysis
That’s the best part – and this couldn’t be easier. It’s one command! Note that this uses the Voice of Customer configuration, which includes sentiment analysis as well as text extraction. You can define your own dictionaries if you want to, but I didn’t do this.
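For reference, that one command is a standard HANA CREATE FULLTEXT INDEX statement. The schema, table and column names below are my assumptions – substitute whatever names River generated for your project:

```sql
-- Index named VOICE, which is why HANA creates a table called $TA_VOICE.
-- Schema, table and column names here are placeholders.
CREATE FULLTEXT INDEX "VOICE" ON "SCN"."Content" ("text")
    CONFIGURATION 'EXTRACTION_CORE_VOICEOFCUSTOMER'
    TEXT ANALYSIS ON;
```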
Now, this actually creates a new database table called $TA_VOICE. It contains 1m text terms for my 40k pieces of content and it looks like this:
Yes, I filtered on “unambiguous profanity” 🙂
When the underlying table is updated, the text index is updated with it.
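As a quick illustration (not from the original post), the sentiment terms in the $TA table can be aggregated with plain SQL. The schema name is a placeholder, and the TA_TYPE values such as 'StrongPositiveSentiment' come from the Voice of Customer configuration:

```sql
-- Count sentiment tokens by type across all content
SELECT "TA_TYPE", COUNT(*) AS "TOKEN_COUNT"
FROM "SCN"."$TA_VOICE"
WHERE "TA_TYPE" LIKE '%Sentiment%'
GROUP BY "TA_TYPE"
ORDER BY "TOKEN_COUNT" DESC;
```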
Building the HANA Model
Note that I can also build the HANA model inside the SAP HANA Developer Perspective, right inside my RDL project. It’s advantageous to do this because I can keep all my developer artifacts in one place, and transport them together between systems.
I did this the regular HANA way – an Attribute View to join the Time Dimension, and then an Analytic View for my Content. This allows me to quickly aggregate and view data based on date, author, content and space. It takes 100ms to materialize the whole 40k-row table.
Now, because my Voice of Customer table is also a fact table, I need to create a Calculation View so I can have a single Information Model. I do it like this:
I now have one Information Model that can answer any question about the SCN data we choose to ask. Unfortunately, for API or privacy reasons, there are a number of things I wasn't able to extract, like company or country information, badges, and ratings. It's a shame, but such is life.
Connecting to HANA with Lumira
Now we can connect right on into HANA with Lumira.
Our Influencer Dataset is immediately available and we can see our attributes and measures:
And here’s a sample graphic – Top 20 Blog/Document writers over all of SCN for 2013, also ranked by number of likes and replies. Congratulations Tammy Powlas!
I hope this makes interesting reading – it was certainly very interesting to build. You can head over to SAPHANA.com if you want to see a more detailed influencer analysis; this is the "building of" blog. It's worth noting that I started at 9am on Monday, and it's now 2.30pm on Tuesday: the River app is built, the data is loaded and troubleshot (data is always the hardest part), the text analysis is complete and the HANA models are designed. The SAP Lumira analysis has been completed and two blogs have been written describing the process.
This is what I hoped to achieve and this is the point of SAP River!
In 2014, the SAP HANA Application Platform is clearly going to come of age, and the ability to quickly build transactional apps using SAP River and push Big Data into SAP HANA is a very powerful concept. The ability to then add text search and analysis, spatial, predictive and graph capabilities on top of those apps is very exciting.
A quick thanks to Matthias Steiner and his SCN API, to everyone who engaged with me on Twitter last night and gave me ideas to make this blog better, and to the SAP HANA, River and Lumira teams, all of whom are working with me right now to make the products even better.
Have a very Happy New Year, and I look forward to working with you all in 2014.