This is part of the “SAP TechEd 2013: Share the Knowledge – deadline to submit is December 31! Share today!” challenge from Christina Miller.
These are my notes (Part 1 only) from the SAP TechEd session “Big Data and the Real-Time Data Platform Including SAP HANA and Apache Hadoop”; you can watch the recording yourself here.
Why is SAP working with Hadoop even though it has HANA?
Introduction to Big Data
What is new?
Figure 1: Source: SAP
Today your business may say, “I want to take part of the Internet” and include it in the calculations.
What are people saying about this on eBay? Never mind about images, files, binary content.
The way people want to use data has changed: it used to be analytics, and now it is exploratory. Storage used to be more expensive; today it is cheap, so we may decide to store the raw data.
It leads to both explorative and predictive scenarios.
We used to store good data in IT. Now it is possible to store all data – but is the data valuable?
Reporting once a month? Those days are gone
We need agility to answer questions that we do not yet know about.
Figure 2: Source: SAP
The SAP speaker said the obvious thing about big data is the volume – a Petabyte is 1000 TB
The average data warehouse is 10 TB.
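To put those two figures side by side, here is a quick back-of-the-envelope calculation. It uses only the numbers quoted in the session (a petabyte is 1000 TB, the average data warehouse is 10 TB); the variable names are my own.

```python
# Figures quoted in the session (decimal units).
TB_PER_PB = 1000        # a petabyte is 1000 TB
AVG_WAREHOUSE_TB = 10   # the average data warehouse is 10 TB

# How many average-sized warehouses add up to one petabyte?
warehouses_per_pb = TB_PER_PB // AVG_WAREHOUSE_TB
print(f"One petabyte is roughly {warehouses_per_pb} average data warehouses")
```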
Data arrives at high velocity, and you have to respond in sub-second time and do some event processing on it in real time.
Gas companies may want to combine data from the ERP system (current price) with information from the shop floor showing energy being generated and optimized in real time.
On the right of Figure 2 there are different sources of this data, such as machines (for example, sensors). The rise of the Internet, YouTube, and discussion forums has created big sources of data.
Everyone has a mobile device and it generates data. It links to who you are, what you are doing, and tells you about the consumers and what they do.
Businesses expect IT to deal with this now.
There are a variety of things to deal with now as well.
Big data is a combination of various factors.
If you deal with binary data, you have more processing load: you have to write programs (not SQL) to tell you what the image is telling you.
The value question: the drivers of big data are economic. There is a “store everything” mentality, even when you are not sure what to do with the data. You may store data that you never use, as long as the cost is low, so you want to optimize the infrastructure to keep the cost of storing data low.
Figure 3: Source: SAP
Figure 3 shows that in 2009 NYSE generated 1 TB of data a day and shows areas with massive volumes of data.
In 2 years the number of devices will outnumber us two to one
Key drivers are machine-created data and the Internet, and 80% of the data is unstructured: a “different kind of data”.
Software as a Service allows for scaling of data.
Figure 4: Source: SAP
There are a lot of Big Data tools out there as shown in Figure 4
SAP HANA Platform for Big Data
Figure 5: Source: SAP
“Big Data” needs to be cheap and there are cases where you may store data not used (sensor data)
Store aggregates in HANA and store details somewhere else that is big, cheap and reliable – this is where Hadoop comes into the picture with batch queries over massive datasets
High value data goes into HANA, and relatively low value or undetermined data can go into Hadoop.
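The split between aggregates and details can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how HANA or Hadoop are actually accessed: `detail_store` and `aggregate_store` are hypothetical stand-ins for the two tiers, and the sensor readings are made up.

```python
from collections import defaultdict

# Hypothetical raw sensor readings: (sensor_id, value).
raw_readings = [
    ("pump-1", 10.0),
    ("pump-1", 14.0),
    ("pump-2", 7.0),
    ("pump-2", 9.0),
    ("pump-2", 11.0),
]

# Stand-in for the big, cheap detail store (the role Hadoop plays in the talk):
# every raw reading is kept.
detail_store = list(raw_readings)


def aggregate(readings):
    """Reduce raw readings to one small summary row per sensor."""
    totals = defaultdict(lambda: {"count": 0, "sum": 0.0})
    for sensor_id, value in readings:
        totals[sensor_id]["count"] += 1
        totals[sensor_id]["sum"] += value
    return {
        sensor_id: {"count": t["count"], "avg": t["sum"] / t["count"]}
        for sensor_id, t in totals.items()
    }


# Stand-in for the fast, high-value store (the role HANA plays in the talk):
# only the aggregates live here.
aggregate_store = aggregate(detail_store)
print(aggregate_store)
```

The point of the sketch is the size difference: the aggregate tier holds one row per sensor, while the detail tier grows with every reading.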
Figure 6: Source: SAP
Figure 6 shows what Hadoop is, and in bold are the important terms. You don’t have to buy super expensive hardware and it runs on “anything”
Each machine is its own unit in the cluster and is independent. Hadoop manages the resources of the cluster. It is distributed, and Hadoop handles the replication.
You need reliability to be built into the software: when something goes wrong, how do you recover?
It has to be scalable, so you can add machines to the cluster as you go.
It handles the hard tasks including partitioning.
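To make partitioning and replication a little more concrete, here is a toy sketch in Python. It is not Hadoop's real placement policy: the function `place_block`, the node names, and the replication factor of 3 are all assumptions chosen for illustration. It only shows the general idea that each block of data lives on several independent machines, so losing one machine does not lose the data.

```python
import hashlib


def place_block(block_id, nodes, replicas=3):
    """Pick `replicas` distinct nodes for a block, starting from a hashed position."""
    if replicas > len(nodes):
        raise ValueError("not enough nodes for the requested number of replicas")
    digest = hashlib.sha256(block_id.encode("utf-8")).hexdigest()
    start = int(digest, 16) % len(nodes)
    # Walk around the node list so the replicas land on different machines.
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]


def readable_after_failure(placement, failed_node):
    """A block stays readable if at least one replica sits on a healthy node."""
    return {
        block: any(node != failed_node for node in holders)
        for block, holders in placement.items()
    }


nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
blocks = ["block-0", "block-1", "block-2"]
placement = {block: place_block(block, nodes) for block in blocks}

print(placement)
print(readable_after_failure(placement, "node-a"))
```

Adding a machine in this sketch is just appending to `nodes`. A real system would also have to rebalance existing blocks, which is one of the hard tasks handled for you.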
Figure 7: Source: SAP
When people discuss Hadoop, they usually mean what is shown on the bottom of Figure 7
HDFS is the Hadoop Distributed File System, which makes everything seem to be in one location
MapReduce is the programming framework that runs on top of it
HDFS stores, and MapReduce processes, large amounts of data such as CSV files
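For readers who have not seen the MapReduce style before, here is the classic word count written as a single-process Python sketch. It does not use Hadoop at all; the function names `map_phase`, `shuffle`, and `reduce_phase` are my own labels for the three steps, and in a real cluster each step would run in parallel across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


lines = ["big data is big", "data is cheap to store"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```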
A SQL interface is better so you can declare your data to retrieve. Projects like Hive are popular as it allows you to work as if you had a database on top of it.
Business analysts will understand Hive. Hive has its own Hive Query Language, but it is slow: there is a minimum start-up time of 20 seconds, and a query can take as long as 24 hours.
A trend is that people want a faster and more complete SQL interface, such as Stinger, Gryphon, or Impala
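To show what “declaring the data you want to retrieve” looks like, here is a small sketch that uses Python's built-in sqlite3 module as a stand-in. It is not Hive and says nothing about Hive's performance; the table `page_views` and its columns are made up. The takeaway is only that a query of this shape states what you want (totals per country) rather than how to compute it.

```python
import sqlite3

# In-memory database used purely as a stand-in for a SQL layer over files.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (country TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("US", 120), ("DE", 80), ("US", 30), ("IN", 50)],
)

# Declarative: describe the result, let the engine decide how to produce it.
rows = conn.execute(
    "SELECT country, SUM(views) AS total_views "
    "FROM page_views GROUP BY country ORDER BY country"
).fetchall()
conn.close()

print(rows)
```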
HBase is a fast key-value store that runs on Hadoop. It works well if you need massive ingest and everything can be described by a single key. It is not a general-purpose database with SQL and joins. Another use case is sensor data with different attributes.
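The key-value idea can be sketched with an ordinary Python dictionary. This is not the HBase API: `put`, `get`, and `scan_prefix` are hypothetical helpers, and the row-key layout (`sensor#timestamp`) is an assumption. The sketch shows the two points from the talk: everything is addressed by a single key, and different sensors can carry different attributes.

```python
from bisect import bisect_left

# Row key -> sparse set of attributes. No fixed columns, no joins.
store = {}


def put(row_key, **attributes):
    """Insert or update the attributes stored under one row key."""
    store.setdefault(row_key, {}).update(attributes)


def get(row_key):
    """Look up a single row by its key; returns None if it does not exist."""
    return store.get(row_key)


def scan_prefix(prefix):
    """Return all rows whose key starts with `prefix`, in key order."""
    keys = sorted(store)
    start = bisect_left(keys, prefix)
    result = []
    for key in keys[start:]:
        if not key.startswith(prefix):
            break
        result.append((key, store[key]))
    return result


# Different sensors, different attributes.
put("thermo-7#2013-12-30T10:00", temperature=21.5)
put("thermo-7#2013-12-30T10:01", temperature=21.7, battery=0.93)
put("gps-2#2013-12-30T10:00", lat=52.52, lon=13.40)

print(get("gps-2#2013-12-30T10:00"))
print(scan_prefix("thermo-7#"))
```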
Figure 8: Source: SAP
Figure 8 shows Hadoop is good at handling unstructured data and data that changes over time. It is open source.
It is not good for everything. It is not good for small-scale work: if you only need 5 nodes, you could go with HANA instead.
It is not real time; HBase is close but it is not easy. It is challenging for real time.
You will not see a seamless transition to Hadoop; you cannot query a petabyte of data in Excel.
Governance and user roles are not there yet.
Figure 9: Source: SAP
Figure 9 shows good use cases for HANA + Hadoop, including batch processing when you don’t care about the response times. An example is satellite image processing: if you run it at massive scale, you can use Hadoop.
Post-hoc analysis is when you want to mine data but you don’t have a schema
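One common way to read “mining data without a schema” is to keep the raw records as they arrived and only decide which fields matter when you ask a question. The Python sketch below illustrates that idea under my own assumptions: the log lines are invented, and `read_with_schema` is a hypothetical helper, not something shown in the session.

```python
import json

# Raw lines stored exactly as they arrived; some are incomplete or malformed.
raw_lines = [
    '{"user": "a", "action": "click", "ms": 120}',
    '{"user": "b", "action": "view"}',
    "not json at all",
    '{"user": "a", "action": "click", "ms": 80}',
]


def read_with_schema(lines, fields):
    """Apply a schema at read time: keep only `fields`, skip unparseable lines."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        # Missing fields become None instead of failing the whole read.
        yield {field: record.get(field) for field in fields}


# The question (and therefore the schema) is chosen after the data was stored.
clicks = [r for r in read_with_schema(raw_lines, ["user", "ms"]) if r["ms"] is not None]
print(clicks)
```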
To be continued….
Try out Big Data yourself for FREE with HANA here: SAP HANA One Labs
I did this yesterday – the instructions were very good.
Happy New Year!