By Tammy Powlas

Share the Knowledge – Big Data and the Real-Time Data Platform Including SAP HANA and Apache Hadoop

This is part of the SAP TechEd 2013 Share the Knowledge challenge from Christina Miller (deadline to submit is December 31 – share today!).

These are my notes (Part 1 only) from SAP TechEd session Big Data and the Real-Time Data Platform Including SAP HANA and Apache Hadoop – you can watch the recording yourself here.

Why is SAP working with Hadoop even though it has HANA?

Introduction to Big Data

What is new?


Figure 1: Source: SAP

Today your business may say "I want to take part of the Internet" and include it in the calculations.

What are people saying about this on eBay?  And that is before you even consider images, files, and other binary content.

The way people want to use data has changed – it has moved from analytics to exploration.  Storage used to be expensive; today storage is cheap, so we may simply say to store the raw data.

It leads to both explorative and predictive scenarios.

We used to store good data in IT.  Now it is possible to store all data – but is the data valuable?

Reporting once a month? Those days are gone.

You need agility to answer questions that you do not yet know you will ask.


Figure 2: Source: SAP

The SAP speaker said the obvious thing about big data is the volume – a petabyte is 1,000 TB.

The average data warehouse is 10 TB.

Data is coming in at high velocity, and you have to respond in sub-second time and do some event processing on it in real time.

Gas companies may want to combine data from the ERP system (the current price) with information from the shop floor showing the energy being generated, optimizing in real time.
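As a rough illustration of that scenario, the sketch below joins a current price from an ERP system with a stream of shop-floor meter readings to value the energy output as it arrives. All names and numbers are hypothetical; this is not an SAP API, just the pattern.

```python
# Hypothetical sketch: value streaming shop-floor readings using a
# current price pulled from the ERP system. Purely illustrative.

erp_current_price = 0.12  # assumed price per kWh from the ERP system

def value_readings(readings_kwh, price_per_kwh):
    """Attach a monetary value to each incoming meter reading."""
    for kwh in readings_kwh:
        yield kwh, kwh * price_per_kwh

# Simulated high-velocity stream of generated energy (kWh per interval)
stream = [5.0, 7.5, 6.2]
valued = list(value_readings(stream, erp_current_price))
total_value = sum(value for _, value in valued)
```

In a real deployment the stream would be processed by an event-processing engine rather than a Python list, but the join of "current price" with "live readings" is the same idea.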

On the right of Figure 2 are the different sources of this data, such as machines and sensors. The rise of the Internet – YouTube, discussion forums – is another big source of data.

Everyone has a mobile device, and it generates data.  It is linked to who you are and what you are doing, and it tells you about consumers and what they do.

Businesses expect IT to deal with this now.

There are a variety of things to deal with now as well.

Big data is a combination of various factors.

If you deal with binary data, you have more processing load; you write programs (not SQL) to tell you what, say, an image contains.

The value question – the drivers of big data are economic drivers.  There is a "store everything" mentality, even when you are not sure what to do with the data.  You may store data that you may never use, as long as the cost is low, so you want to optimize your infrastructure to keep the cost of storing data low.


Figure 3: Source: SAP

Figure 3 shows that in 2009 NYSE generated 1 TB of data a day and shows areas with massive volumes of data.

In two years, the number of devices will outnumber us two to one.

Key drivers are machine-created data and the Internet, and 80% of the data is unstructured – a "different kind of data".

Software as a Service allows for scaling of data.


Figure 4: Source: SAP

There are a lot of Big Data tools out there, as shown in Figure 4.

SAP HANA Platform for Big Data


Figure 5: Source: SAP

"Big Data" needs to be cheap, and there are cases where you may store data that is never used (e.g., sensor data).

Store aggregates in HANA, and store the details somewhere else that is big, cheap, and reliable – this is where Hadoop comes into the picture, with batch queries over massive datasets.

High value data goes into HANA, and relatively low value or undetermined data can go into Hadoop.
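The split described above can be sketched in a few lines: compact, high-value aggregates go to the fast in-memory store (HANA), while the raw detail records go to cheap bulk storage (Hadoop). Both stores are simulated here with plain Python collections – nothing below is an SAP or Hadoop API, just the data-placement idea.

```python
# Illustrative sketch of the HANA/Hadoop split: raw detail kept in
# cheap bulk storage, compact aggregates kept in the fast store.
from collections import defaultdict

raw_events = [
    {"sensor": "s1", "value": 10},
    {"sensor": "s1", "value": 14},
    {"sensor": "s2", "value": 7},
]

# "Hadoop": full detail, big and cheap, queried in batch
hadoop_store = list(raw_events)

# "HANA": small, high-value aggregates for sub-second queries
hana_aggregates = defaultdict(lambda: {"count": 0, "total": 0})
for event in raw_events:
    agg = hana_aggregates[event["sensor"]]
    agg["count"] += 1
    agg["total"] += event["value"]
```

The point is that the aggregate table stays tiny no matter how large the detail grows, which is what makes keeping it in memory affordable.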



Figure 6: Source: SAP

Figure 6 shows what Hadoop is; the important terms are in bold.  You don't have to buy super-expensive hardware – it runs on "anything".

Each machine is its own independent unit in the cluster.  Hadoop manages the resources of the cluster; it is distributed, and Hadoop handles the replication.

You need the reliability to be in the software – when something goes wrong, how do you recover?

It has to be scalable, so you can add machines to the cluster as you go.

It handles the hard tasks including partitioning.


Figure 7: Source: SAP

When people discuss Hadoop, they usually mean what is shown on the bottom of Figure 7

HDFS – the Hadoop Distributed File System – makes everything appear to be in one location.

MapReduce is the programming model used to write jobs on top.

HDFS and MapReduce contain large amounts of data such as CSV files
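To make the MapReduce pattern concrete, here is a minimal, self-contained sketch over a few CSV lines of the kind that would sit in HDFS. A real job would be distributed across the cluster (for example via Hadoop Streaming); here the map, shuffle, and reduce phases are simulated in-process.

```python
# Minimal sketch of MapReduce over CSV data: map emits (key, value)
# pairs, the shuffle groups them by key, and reduce aggregates each
# group. In-process simulation, not a distributed job.
from itertools import groupby

csv_lines = [
    "2013-10-01,clicks,3",
    "2013-10-01,clicks,5",
    "2013-10-02,views,7",
]

def mapper(line):
    """Emit (metric, count) pairs from one CSV line."""
    _, metric, count = line.split(",")
    yield metric, int(count)

# Map phase
pairs = [kv for line in csv_lines for kv in mapper(line)]
# Shuffle phase: bring equal keys together
pairs.sort(key=lambda kv: kv[0])
# Reduce phase: sum the values for each key
totals = {key: sum(v for _, v in group)
          for key, group in groupby(pairs, key=lambda kv: kv[0])}
```

The same mapper and reducer logic, written as standalone scripts reading stdin and writing stdout, is essentially what Hadoop Streaming runs on each node.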

A SQL interface is better because you can declare the data you want to retrieve.  Projects like Hive are popular because they let you work as if you had a database on top of Hadoop.

Business analysts will understand Hive.  Hive has the Hive Query Language (HiveQL), but it is slow – start-up alone takes a minimum of about 20 seconds, and a query can take up to 24 hours.

A trend is that people want a faster and more complete SQL interface, such as Stinger, Gryphon, or Impala.

HBase is a fast key-value store that works on Hadoop.  It works well if you need massive ingest and everything can be described by a single key.  It is not a general-purpose database with SQL and joins.  Another use case is sensor data with different attributes.
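The access pattern described here – every record reachable by a single row key, often a composite like `<sensor>|<timestamp>`, with rows free to carry different attributes – can be sketched with a plain dictionary. This is illustrative only and not the HBase API.

```python
# Sketch of the single-key access pattern described for HBase,
# modeled with a plain dict. Composite row keys make one lookup
# find one reading; rows may have different attributes (sparse).

def row_key(sensor_id, timestamp):
    """Build a composite row key: one key identifies one reading."""
    return f"{sensor_id}|{timestamp}"

store = {}
store[row_key("turbine-7", "2013-10-01T12:00:00")] = {"rpm": 1450, "temp_c": 61}
store[row_key("turbine-7", "2013-10-01T12:00:01")] = {"rpm": 1448}  # no temp_c

# Single-key read: fast, but there is no SQL and no joins across rows
reading = store[row_key("turbine-7", "2013-10-01T12:00:00")]
```

The trade-off is exactly the one in the notes: lookups by key are fast at massive ingest rates, but anything resembling a join has to be done by the application.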


Figure 8: Source: SAP

Figure 8 shows Hadoop is good at handling unstructured data and data that changes over time.  It is open source.

It is not good for everything – it is not good for small scale; if you only need 5 nodes, you could go with HANA instead.

It is not real time; HBase comes close, but it is not easy.  Real time is challenging for Hadoop.

You will not see a seamless transition to Hadoop – you cannot simply query a petabyte of data in Excel.

Governance and user roles are not there yet.


Figure 9: Source: SAP

Figure 9 shows good use cases for HANA + Hadoop, including batch processing when you don't care about response times.  An example is satellite image processing – if you run it at massive scale, you can use Hadoop.

Post-hoc analysis is when you want to mine data but you don't yet have a schema.

To be continued….


Try out Big Data yourself for FREE with HANA here: SAP HANA One Labs

I did this yesterday – the instructions were very good.

Happy New Year!
