
For an overview of SAP HANA Vora, please check out:

SAP HANA Vora: An Overview

[SAP HANA Academy] Learn How to Install SAP HANA Vora on a Single Node

SAP HANA Vora – YouTube

SAP HANA Vora 1.0 – SAP Help Portal Page

With the introductions out of the way: Spark is fast becoming the de facto data processing engine for Hadoop. It's fast, flexible and operates “in-memory” (when the dataset fits). I like to think of HANA Vora as an add-on to Spark. It provides added business features as well as “best in class” integration with HANA databases.

Let's now dive right into the typical “hello world” style example for Hadoop: the simple word count.

How often is “Watson” referred to directly in “The Adventures of Sherlock Holmes”?

Is the answer:

A) 42

B) 81

C) 136

D) The Sum of the Above

Note: The answer is at the bottom.

Spark and Vora support several languages, such as Scala, Python and Java. Since Scala is still slightly ahead in popularity with Spark, I'll use that. To work with Vora you can use the Spark shell or a notebook application such as Zeppelin. In this example I use Zeppelin, which is also covered in the Vora installation steps and in the HANA Academy videos.

First, let's download a free copy of the book, strip out all special characters, collect the words, aggregate the results and finally store them as an “in-memory” resilient distributed dataset (RDD):

Scala: Process The File

import java.net.URL
import java.io.File
import org.apache.commons.io.FileUtils

//Load External File to HDFS
//Note: java.io.File treats this string as a plain path on the Spark driver's
//filesystem (it does not resolve the hdfs:// scheme), so the copy lands there.
val HDFS_NAMENODE = "107.20.0.138:8020"
val HDFS_DIR      = "/user/vora"
val tmpFile = new File(s"""hdfs://${HDFS_NAMENODE}${HDFS_DIR}/SherlockHolmes.txt""")
FileUtils.copyURLToFile(new URL("https://ia600300.us.archive.org/10/items/TheAdventuresOfHolmesSherlock/DoyleArthurConan-AdventuresOfSherlockHolmesThe.txt"), tmpFile)

//Read Files line as Array[String] into Spark RDD
val textFile = sc.parallelize(FileUtils.readLines(tmpFile).toArray.map(x => x.toString))
println("----------------------------------------")

//Print first 2 Lines of File
textFile.take(2).foreach(println)

//Rows
println("Rows in File: " + textFile.count())

//Perform full word count, strip out special characters
val word_counts = textFile.flatMap(line => line.replaceAll("[^\\p{L}\\p{Nd}\\s]+", "").toLowerCase.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

//Put results into a resilient distributed dataset (RDD)
case class WordCount(word: String, wordcount: Long)
val wcRDD = word_counts.map(t => WordCount(t._1, t._2))

//First 10 Rows of Word Count
println("----------------------------------------")
println("First 10 Rows of Word Count:")
wcRDD.take(10).foreach(println)
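One detail of the download step is worth a quick check: java.io.File knows nothing about the hdfs:// scheme, so it stores the string as an ordinary, relative local path. The sketch below only illustrates that behaviour; the `HdfsPathCheck` object and its `localView` helper are hypothetical names introduced for this illustration, and the remarks in the comments assume a Unix-like JVM.

```scala
import java.io.File

// Hypothetical helper, not part of the original example.
object HdfsPathCheck {
  // Returns the path string that java.io.File keeps after its own normalisation.
  def localView(uri: String): String = new File(uri).getPath
}

// Run in the Spark shell or a Zeppelin paragraph.
val normalised = HdfsPathCheck.localView("hdfs://107.20.0.138:8020/user/vora/SherlockHolmes.txt")
println(normalised)                      // the doubled slash after "hdfs:" has been collapsed
println(new File(normalised).isAbsolute) // false: resolved against the working directory
```

So the file ends up under the working directory of the process running the notebook rather than on the cluster. For a copy that really lives on HDFS, the usual route would be Hadoop's own FileSystem API for writing and sc.textFile with an hdfs:// URI for reading.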

In Zeppelin it appears as follows:

bl1.PNG

The Results of Executing are:

bl2.PNG
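To see what the cleaning and counting steps do without a cluster, here is a minimal sketch of the same logic on plain Scala collections. It reuses the regular expression and the single-space split from the Spark job; the `LocalWordCount` object and the two sample sentences are made up for illustration, and `groupBy` stands in for Spark's `reduceByKey`.

```scala
// Hypothetical local stand-in for the Spark word count above.
object LocalWordCount {
  // Keep only letters, digits and whitespace, lower-case the line, split on single spaces.
  def tokenize(line: String): Array[String] =
    line.replaceAll("[^\\p{L}\\p{Nd}\\s]+", "").toLowerCase.split(" ")

  // Count how often each token occurs across all lines.
  def count(lines: Seq[String]): Map[String, Long] =
    lines.flatMap(l => tokenize(l).toSeq)
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size.toLong) }
}

// Made-up sample lines, not taken from the book.
val sample = Seq("\"Watson!\" said Holmes.", "Dr. Watson, come here; Watson?")
println(LocalWordCount.count(sample)("watson")) // prints 3
```

Because punctuation is removed before splitting, “Watson!”, “Watson,” and “Watson?” all collapse to the same key. Note that splitting on a single space means runs of consecutive spaces produce empty-string tokens, which are counted like any other word.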

Next, let's use Vora to register the RDD as a temporary “in-memory” table and then run a SQL query to find how many times “Watson” appears in the book:

Scala: Use Vora to Register the RDD as a Temporary Table then Query Results

import org.apache.spark.sql._

val sapSqlContext = new SapSQLContext(sc)
val wordCountDataFrame = sapSqlContext.createDataFrame(wcRDD)
wordCountDataFrame.registerTempTable("wc")

val results = sapSqlContext.sql("SELECT word, wordcount FROM wc WHERE word = 'watson'").map {
  case Row(word: String, wordcount: Long) => word + "\t" + wordcount
}.collect()

Execute in Zeppelin:

bl3.PNG

Finally, let's use Zeppelin to visualise the results:

Visualise with Zeppelin
println("%table Word\tCount\n" + results.mkString("\n"))

Execute in Zeppelin:

bl4.PNG
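The println above relies on Zeppelin's table display format: output that starts with %table is rendered as a table, with the first line as the header, tabs between columns and newlines between rows. A small sketch of a helper that builds such a string follows; `ZeppelinTable` and `render` are hypothetical names, not part of Zeppelin or Vora.

```scala
// Hypothetical helper that formats rows for Zeppelin's %table display.
object ZeppelinTable {
  // First line is the header; columns are tab-separated, rows newline-separated.
  def render(header: Seq[String], rows: Seq[Seq[Any]]): String = {
    val lines = (header +: rows).map(_.mkString("\t"))
    "%table " + lines.mkString("\n")
  }
}

// Run in a Zeppelin paragraph; the single row mirrors the query result above.
println(ZeppelinTable.render(Seq("Word", "Count"), Seq(Seq("watson", 81))))
```

This produces the same string as the one-line println, but keeps the header and the rows separate, which is handy once a query returns more than one column pair.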

Note: Zeppelin's visualisation capabilities are better demonstrated with %vora SQL statements if the results have been stored to HDFS.

So the answer is B) 81.

Did you guess right?

At TechEd 2015, SAP HANA Vora was also demonstrated processing a petabyte of data for Intel, so hopefully some more challenging Vora examples will follow on SCN from Mr Appleby and co soon.

But in the meantime, it's always fun to use a sledgehammer to crack a nut. I hope you enjoyed it. 🙂


1 Comment


  1. Anil Kumar Karanam

    Dear Aron,

    Firstly, the blog was very helpful and informative.

    I tried to follow the same steps in my local Hadoop cluster.

    I could run all the steps until “//Read Files line as Array[String] into Spark RDD”

    I get “file not found” exception at this point. Please refer to the image below:

    2016-09-28 17_26_17-10.137.59.223 - PuTTY.png

    I noticed that a forward slash “/” is missing after I initialise tmpFile.

    I am not sure if this is causing the issue.

    Could you please share any thoughts?

    Thank you.

    Anil.
