Last week was very busy for attendees at SAP TechEd Las Vegas, so fortunately SAP has made recordings of some sessions available. Today I watched session EA204, An Integration of Apache Hadoop, SAP HANA, and SAP BusinessObjects, with SAP’s Anthony Waite.
Text Analysis came up a few times last week, and I am familiar with the Text Analysis features in Data Services. First, a review of how text mining fits in.
Figure 1: Source: SAP
Figure 1 shows that we have “lots of unstructured data”: 80% of data is unstructured. This is data that you cannot run through a business process, according to Anthony Waite.
Unstructured data gets messy; think of an MS Word doc file: can you run that through your system or process? Some say content management.
The hot topic is analyzing social networks for sentiment.
Customer preferences can be mined, as an example.
Figure 2: Source: SAP
Anthony said you typically don’t go into your BI tool and search for unstructured information.
It is challenging; it is intensive to process and analyze. Text mining is CPU intensive; there is a lot of “noise” in that data.
Introduction to Hadoop
Figure 3: Source: SAP
Hadoop is a big open source framework for using economical boxes in a distributed manner
HDFS (Hadoop Distributed File System) and the MapReduce programming model are the essence
Figure 4: Source: SAP
HDFS distributes and replicates data across the machines. The programming model takes advantage of parallelization.
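To make the programming model a little more concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps in plain Python. It is not Hadoop code and was not shown in the session; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are my own, and on a real cluster the map and reduce calls would run in parallel across the machines that hold the data.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    # Each line could be mapped on a different machine; here it is sequential.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["basketball music fashion", "music basketball", "stock market"]
    print(word_count(sample))
```

Because each map call only sees its own line and each reduce call only sees its own key, the work can be split across boxes without the pieces needing to talk to each other.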
HBase is a schemaless database
Hive is a SQL interface for data warehousing
It is harder to find experts in Hadoop; you cannot equate Hive to a regular relational database, as it offers only a subset of that functionality
Pig is the scripting language for data flow
Mahout is for machine learning – you build feature vectors and a training set, then run data through classification
Figure 5: Source: SAP
The advantage is distributed data storage, allowing for scalability – you just add a box to the cluster. It is also reliable and has libraries available
The disadvantages are that it is not real time and it is slow; it is a batch-oriented environment, though there are open source projects to make this more efficient
Figure 6: Source: SAP
This is showing where an end user can find a single interest
The data set is predominantly male; most users are 20-30 years old, have the lowest income, and are interested in music, basketball, and fashion.
The older market is interested in the stock market, but the younger market is interested in music and basketball
Figure 7: Source: SAP
Figure 7 suggests using HANA for performance and high-value data, Hadoop for high data volumes with noise, and BI for the ability to visualize.
Figure 8 shows you can use Data Services to load into SAP HANA.
The example they looked at was analyzing user behavior based on web site visits
Use Hadoop at the front end for unstructured data, since it is considered low-value data at that point
When you pull data into HANA it is higher value; you could use Hadoop, but it depends on where you want to do the work
He suggested taking advantage of SQLScript and the Predictive Analysis Library (PAL)
Figure 9: Source: SAP
Figure 9 shows getting information about user behavior on a web site
Figure 9 shows using Hadoop to train and classify (machine learning)
Fetch the text in Hadoop, segment it into words, create feature vectors, train via machine learning, and then run the feature vectors through the classifier
You tie that in with the interest of the URL, such as “basketball”, “health”, or “books”
Then you transfer the data to HANA; all you use HANA for is analyzing.
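The train-and-classify steps described for Figure 9 can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the session used Hadoop with Mahout, while this sketch swaps in a small bag-of-words naive Bayes classifier written from scratch. The training texts, labels, and function names (`featurize`, `train`, `classify`) are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def featurize(text):
    """Segment text into words and return a bag-of-words feature vector."""
    return Counter(text.lower().split())


def train(labeled_texts):
    """Collect per-label word counts and label frequencies from a training set."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labeled_texts:
        word_counts[label].update(featurize(text))
        label_counts[label] += 1
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocabulary


def classify(text, model):
    """Return the label with the highest naive Bayes log-probability."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    features = featurize(text)
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        label_total = sum(word_counts[label].values())
        for word, count in features.items():
            # Laplace smoothing so unseen words do not zero out the score.
            p = (word_counts[label][word] + 1) / (label_total + len(vocabulary))
            score += count * math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    # Hypothetical training set: page text labeled with an interest.
    training_set = [
        ("nba playoffs score dunk", "basketball"),
        ("team score basketball game", "basketball"),
        ("diet exercise sleep wellness", "health"),
        ("vitamin diet doctor", "health"),
        ("novel author chapter", "books"),
        ("author library novel reading", "books"),
    ]
    model = train(training_set)
    print(classify("basketball game score", model))
```

The predicted label is what gets tied to the URL, so each visited page ends up tagged with an interest before the results are moved on for analysis.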
Figure 10: Source: SAP
This solution uses Hadoop less: Hadoop creates the features and HANA does more of the work. You do the modeling and training in HANA
Classifying uses two methods, SQLScript and PAL, to show the difference in performance
Figure 11: Source: SAP
Figure 11 shows the performance comparison
Hadoop has 10 nodes, with 8 cores and 16 GB on each node
HANA was running on a single appliance with 32 cores
Hadoop has cheaper hardware
The amount of data was 180 million rows per day; this was analyzed by day, with 10 M unique users and 66 interest numbers
Solution 1 in Hadoop takes 285 seconds – Hadoop is doing training and classifying here
Solution 2 – training & classifying is in HANA – time is less – 130 seconds (SQLScript for classifying – URLs are mapped to interests and the calculation is run based on interest)
Solution 3 – training & classification takes less time using PAL
Hadoop only was 24 hours
What do you want to do in Hadoop? What do you want to do in HANA? It depends on your environment
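The classifying step mentioned for Solution 2, where URLs are mapped to interests and a calculation is run per interest, can be illustrated with a small Python sketch. In the session this was done with SQLScript inside HANA; the code below only shows the shape of the logic, and the URL table, user IDs, and function names (`interest_profile`, `top_interest`) are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical lookup table: URL -> interest label.
URL_INTEREST = {
    "example.com/nba": "basketball",
    "example.com/charts": "music",
    "example.com/markets": "stock market",
}


def interest_profile(visits, url_interest):
    """Count visits per interest for each user; unmapped URLs are skipped."""
    profile = defaultdict(Counter)
    for user_id, url in visits:
        interest = url_interest.get(url)
        if interest is not None:
            profile[user_id][interest] += 1
    return profile


def top_interest(profile):
    """Pick the most frequently visited interest for each user."""
    return {user: counts.most_common(1)[0][0] for user, counts in profile.items()}


if __name__ == "__main__":
    visits = [
        ("u1", "example.com/nba"),
        ("u1", "example.com/nba"),
        ("u1", "example.com/charts"),
        ("u2", "example.com/markets"),
        ("u2", "example.com/unknown"),
    ]
    print(top_interest(interest_profile(visits, URL_INTEREST)))
```

This is essentially a join followed by a group-by and count, which is the kind of work that fits naturally in a SQL engine.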
Figure 12: Source: SAP
This is looking at other technologies and doing sentiment analysis