Web scraping and text analysis: test driving the HANA Express Edition
It’s that time of year again when great things happen. During SAP TechEd in Las Vegas, SAP introduced free HANA access for developers. See Rudi Leibbrandt introducing it here: Free access to SAP HANA, with SAP HANA, express edition.
In short, the HANA Express Edition gives easy access to the most powerful platform on the market (HANA naturally 😉 ) on your own system of choice. In my case, that is an Intel NUC.
I have been test driving HXE for a couple of hours now and would like to give you an idea of how to get content onto it and do some wicked analysis. I reused parts of previous blogs I’ve written to see how HXE compares to my previous AWS system and I can say, it works beautifully!
Some of my blogs I used as a reference:
Introducing the very first web to HANA extractor using import.io
The not so fuzzy “Fuzzy Search”
I am a huge fan of import.io. You can basically use it to mine the web and load the results into HANA. For this blog I decided to scrape the UI5 forum on Stack Overflow to see which questions are asked most frequently, or to be more specific: which words are used most often in questions. Just a small test to check the import possibilities of HXE and its text analysis capabilities.
First I created an extractor and told it to go 47 pages deep:
The great part of import.io is that you can even get the results back as OData, which in turn you can use to load HANA. For this blog I am going for a quick load using the import function in Eclipse:
After a split second the data is loaded:
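Before an import like this one, it can be worth a quick check that the export actually contains the text column you plan to analyse. The snippet below is a minimal sketch, assuming the scrape was saved as a CSV file with a header row. The column name "Excerpt description" is the one used in the index statement later on; the function name, the "Title" column and the sample content are made up for illustration.

```python
import csv
import io

# Column that the full-text index is created on later in this blog.
EXCERPT_COLUMN = "Excerpt description"


def read_excerpts(stream, column=EXCERPT_COLUMN):
    """Return the non-empty values of `column` from a CSV stream with a header row."""
    reader = csv.DictReader(stream)
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise ValueError(f"column {column!r} not found in CSV header")
    excerpts = []
    for row in reader:
        # Short rows yield None for missing fields, so normalise to "".
        value = (row.get(column) or "").strip()
        if value:
            excerpts.append(value)
    return excerpts


# Hypothetical sample standing in for the real export file.
sample_csv = io.StringIO(
    "Title,Excerpt description\n"
    "Question A,How do I bind a model to a table?\n"
    "Question B,\n"
)
print(read_excerpts(sample_csv))
```

Rows without an excerpt are skipped, since they would not contribute anything to the text analysis anyway.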
Doing some analysis
Using HXE’s text analysis capabilities gives insight into the types of questions that are asked in the UI5 forum.
CREATE FULLTEXT INDEX "nameofindex" ON "SYSTEM"."Stack2"("Excerpt description")
CONFIGURATION 'LINGANALYSIS_FULL'
TEXT ANALYSIS ON;
The LINGANALYSIS_FULL configuration goes through the questions posted in the forum and breaks them up into types of words (noun, pronoun, verb, ...).
The above command creates an index on my loaded table and a shadow table, $TA_nameofindex, holding the analysis results:
So what are the most frequently used words?
SELECT TOP 20 TA_TOKEN, COUNT(*) AS TOKEN_COUNT
FROM "SYSTEM"."$TA_nameofindex" WHERE TA_TYPE = 'noun' GROUP BY TA_TOKEN ORDER BY TOKEN_COUNT DESC;
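To make clear what the query above is doing, here is the same aggregation as a minimal Python sketch: keep only the tokens tagged as nouns, count them, and return the most frequent ones. It runs on a handful of made-up (TA_TOKEN, TA_TYPE) pairs rather than on the real shadow table, so the rows and the function name are assumptions for illustration only.

```python
from collections import Counter


def top_nouns(rows, limit=20):
    """Return the `limit` most frequent tokens whose type is 'noun'.

    `rows` is an iterable of (token, type) pairs, mirroring the
    TA_TOKEN and TA_TYPE columns of the shadow table.
    """
    counts = Counter(token for token, ta_type in rows if ta_type == "noun")
    return counts.most_common(limit)


# Made-up sample rows, not actual results from the forum data.
sample_rows = [
    ("table", "noun"),
    ("bind", "verb"),
    ("model", "noun"),
    ("table", "noun"),
    ("it", "pronoun"),
    ("model", "noun"),
    ("table", "noun"),
]

print(top_nouns(sample_rows, limit=2))  # [('table', 3), ('model', 2)]
```

In HANA the grouping and sorting happen inside the database, so there is no need to pull the rows out first; the sketch only illustrates the logic.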
Impressive performance on the Intel NUC Skull Canyon!
So based on this quick analysis, I guess SAP will need to put some extra effort into the UI5 documentation so the awesome guys answering the questions have an easier time ;-).
I’m looking at you UI5 rockstars:
See you at UI5CON in Eindhoven 🙂 !
Stay tuned for more and in case you did not get the vibe from this blog, HXE Rocks!
Tx for reading!
Nice to see the "skull" becoming more popular 😎 !
I'm also still very happy with the performance and the level of direct access compared to a "proper HANA server" you get from it.
One thing though, I believe you meant to write "web scraping" and not "web scaping". (hmm... thinking about it, it sounds like a portmanteau of web and escape, so maybe that is what you wanted to write after all... 😀 )
Tx Lars! I looked at your blog more than once ;-).
About the "escaping", I blame it on the jetlag ;-), tx for the headsup!
Makes me wanna buy an Intel NUC Skull Canyon