Big Data Tools (Like Hadoop Framework) Integration with SAP
Hadoop, Big Data, and SAP HANA are some of the buzzwords in the data management and enterprise data warehousing space.
SAP has been working to ensure solid integration between SAP analytical tools and big data frameworks like Hadoop.
For a POC, we are trying out various integration options between SAP and Hadoop, and in this document I would like to share the options we have seen so far.
Our developments and configurations are still at a very early stage, so I will mostly take you through the theory. The actual lessons learned, the real pain points, the challenges, and the limitations will be added to the document wherever possible at later stages.
I would certainly like to hear from you as well if you find more integration options.
I am a lazy person, so I have included various links along the way that explain each topic in more detail.
So, let’s look at the options in detail:
1) Using BODS (Business Objects Data services)
SAP HANA Academy – Using Data Services to import data from a HADOOP system — https://www.youtube.com/watch?v=ls_MGp8R7Yk
The above HANA Academy video explains the connectivity in detail.
In BODS, there is a file format named HDFS Files.
We first provide the name node host and name node port, and then the root directory and file name.
Name Node –> The Name Node is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.
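As a rough illustration of how the four values entered in the HDFS file format fit together, the sketch below assembles them into a standard `hdfs://` URI. The function name `hdfs_uri`, the host name, the port 8020, and the paths are placeholders chosen for this example; they are not values taken from BODS or from our landscape.

```python
from posixpath import join as path_join


def hdfs_uri(namenode_host: str, namenode_port: int, root_dir: str, file_name: str) -> str:
    """Combine name node host/port, root directory and file name into an hdfs:// URI."""
    # HDFS paths are POSIX-style, regardless of the client operating system.
    full_path = path_join("/", root_dir.strip("/"), file_name)
    return f"hdfs://{namenode_host}:{namenode_port}{full_path}"


# Placeholder values (assumptions, not taken from a real cluster).
print(hdfs_uri("namenode.example.com", 8020, "/user/poc/sales", "orders.csv"))
```

The name node only resolves the path to block locations; the file contents themselves are then read from the data nodes.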
We have some Pig scripting options as well.
Pig is a high-level scripting language that is used with Apache Hadoop. Pig enables data workers to write complex data transformations without knowing Java. Pig’s simple SQL-like scripting language is called Pig Latin, and it appeals to developers already familiar with scripting languages and SQL.
Sample Use case:
In BODS, we basically create a project –> create a job –> create a dataflow –> drag in an HDFS file as the source –> add a Query transform –> create a HANA datastore as the target.
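The flow above (file source, Query transform, target) can be pictured with a small plain-Python analogy. This is not BODS code and does not call any SAP API; the sample rows, the column names, and the helper names `read_source`, `query_transform`, and `load_target` are all invented for illustration, and an in-memory list stands in for the HANA target table.

```python
import csv
import io

# Stand-in for a delimited file stored in HDFS (invented sample data).
SAMPLE_FILE = "order_id,country,amount\n1,DE,120.50\n2,IN,80.00\n3,DE,42.25\n"


def read_source(text):
    """Source step: parse the delimited file into one dict per row."""
    return list(csv.DictReader(io.StringIO(text)))


def query_transform(rows, country):
    """Query step: filter rows and map columns, loosely like a Query transform."""
    return [
        {"ORDER_ID": int(r["order_id"]), "AMOUNT": float(r["amount"])}
        for r in rows
        if r["country"] == country
    ]


def load_target(rows, target):
    """Target step: append rows to the stand-in target table and return the row count."""
    target.extend(rows)
    return len(rows)


hana_table = []  # in-memory placeholder for the HANA target table
loaded = load_target(query_transform(read_source(SAMPLE_FILE), "DE"), hana_table)
print(loaded, hana_table)
```

In the real job, each of these steps is configured graphically in the Designer rather than written as code.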
2) SAP VORA
SAP HANA Vora is a new in-memory query engine which leverages and extends the Apache Spark execution framework to provide enriched interactive analytics on Hadoop.
John Appleby has given a lot of information in the following blog:
SAP HANA Vora includes the following key features and capabilities:
• An in-memory query engine that runs on Apache Spark execution framework
• Compiled queries for accelerated processing across Hadoop Distributed File System (HDFS) nodes
• Enhanced Spark SQL semantics that include hierarchies to enable OLAP and drill-down analysis
• Enhanced mashup application programming interface (API) for easier access to enterprise application data for machine learning workloads
• Support for all Hadoop distributions
• An open development interface.
Many Vora topics are covered in the HANA Academy videos:
More details can be found in this document:
3) Universe IDT –> Connection to Hadoop JDBC Drivers
The following are three wonderful documents (though a little old) that explain this integration in detail:
In the following wiki, Jacqueline Rahn has explained the connection of Hadoop Hive with IDT quite extensively:
Obviously, once we reach the universe level, we can take the same data further into the various BO reporting tools and dashboards.
4) Hadoop connectivity using SDA
SDA (Smart Data Access) is a new SAP HANA method for accessing data stored in remote data sources.
*Here we can see an adapter with the name HADOOP (ODBC)
Leo has explained the details in the following blog:
Debajit has explained SDA access using Hive/Hadoop in detail in the following document:
Many Hadoop/Hive/Spark/SDA topics are covered in the HANA Academy videos:
5) Hadoop connectivity using Lumira
Please find some useful links here:
The following document shows the connection using the “Open with SQL” method.
These days, a direct connection to Hadoop is also available:
This is a collaborative document. My humble request to all of you is to add more points and options to it. Let us work together to make this a useful repository that covers the ins and outs of integrating Big Data frameworks like Hadoop with our SAP tools.
Once again, thanks a lot for reading my document.
Like Always, a very interesting read.
This gives us the basics to explore more on the SAP-Bigdata integration.
Thanks a ton...
Apache's Hadoop project has become nearly synonymous with Big Data. It has grown to become an entire ecosystem of open source tools for highly scalable distributed computing. Operating System: Windows, Linux, OS X.
Part of the Hadoop ecosystem, this Apache project offers an intuitive Web-based interface for provisioning, managing, and monitoring Hadoop clusters. It also provides RESTful APIs for developers who want to integrate Ambari's capabilities into their own applications. Operating System: Windows, Linux, OS X.
This Apache project provides a data serialization system with rich data structures and a compact format. Schemas are defined with JSON and it integrates easily with dynamic languages. Operating System: OS Independent.
Cascading is an application development platform based on Hadoop. Commercial support and training are available. Operating System: OS Independent.
Based on Hadoop, Chukwa collects data from large distributed systems for monitoring purposes. It also includes tools for analyzing and displaying the data. Operating System: Linux, OS X.
Flume collects log data from other applications and delivers them into Hadoop. The website boasts, "It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms." Operating System: Linux, OS X.
Designed for very large tables with billions of rows and millions of columns, HBase is a distributed database that provides random real-time read/write access to big data. It is somewhat similar to Google's Bigtable, but built on top of Hadoop and HDFS. Operating System: OS Independent.
8. Hadoop Distributed File System
HDFS is the file system for Hadoop, but it can also be used as a standalone distributed file system. It's Java-based, fault-tolerant, highly scalable and highly configurable. Operating System: Windows, Linux, OS X.
Apache Hive is the data warehouse for the Hadoop ecosystem. It allows users to query and manage big data using HiveQL, a language that is similar to SQL. Operating System: OS Independent.
Hivemall is a collection of machine learning algorithms for Hive. It includes highly scalable algorithms for classification, regression, recommendation, k-nearest neighbor, anomaly detection and feature hashing. Operating System: OS Independent.
According to its website, the Mahout project's goal is "to build an environment for quickly creating scalable performant machine learning applications." It includes a variety of algorithms for doing data mining on Hadoop MapReduce, as well as some newer algorithms for Scala and Spark environments. Operating System: OS Independent.
An integral part of Hadoop, MapReduce is a programming model that provides a way to process large distributed datasets. It was originally developed by Google, and it is also used by several other big data tools on our list, including CouchDB, MongoDB and Riak. Operating System: OS Independent.
This workflow scheduler is specifically designed to manage Hadoop jobs. It can trigger jobs by time or by data availability, and it integrates with MapReduce, Pig, Hive, Sqoop and many other related tools. Operating System: Linux, OS X.
Apache Pig is a platform for distributed big data analysis. It relies on a programming language called Pig Latin, which boasts simplified parallel programming, optimization and extensibility. Operating System: OS Independent.
Enterprises frequently need to transfer data between their relational databases and Hadoop, and Sqoop is one tool that gets the job done. It can import data to Hive or HBase and export from Hadoop to RDBMSes. Operating System: OS Independent.
An alternative to MapReduce, Spark is a data-processing engine. It claims to be up to 100 times faster than MapReduce when used in memory or 10 times faster when used on disk. It can be used alongside Hadoop, with Apache Mesos, or on its own. Operating System: Windows, Linux, OS X.
Built on top of Apache Hadoop YARN, Tez is "an application framework which allows for a complex directed-acyclic-graph of tasks for processing data." It allows Hive and Pig to simplify complicated jobs that would otherwise take multiple steps. Operating System: Windows, Linux, OS X.
This administrative big data tool describes itself as "a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services." It allows nodes within a Hadoop cluster to coordinate with each other. Operating System: Linux, Windows (development only), OS X (development only).
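The MapReduce entry in the list above describes a programming model for processing large distributed datasets. A minimal single-process sketch of its three phases follows; it runs in plain Python rather than on a Hadoop cluster, and the function names and sample sentences are made up for this example.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


documents = ["big data tools", "big data search"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle_phase(pairs))
print(word_counts)
```

On a real cluster the map and reduce functions run in parallel on many nodes, and the framework performs the shuffle between them.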
Big Data Analysis Platforms and Tools
Originally developed by Nokia, Disco is a distributed computing framework that, like Hadoop, is based on MapReduce. It includes a distributed filesystem and a database that supports billions of keys and values. Operating System: Linux, OS X.
An alternative to Hadoop, HPCC is a big data platform that promises very fast speeds and exceptional scalability. In addition to the free community version, HPCC Systems offers a paid enterprise version, paid modules, training, consulting and other services. Operating System: Linux.
Owned by Altamira, which is known for its national security technologies, Lumify is an open source big data integration, analytics and visualization platform. You can see it in action by trying the demo at Try.Lumify.io. Operating System: Linux.
The Pandas project includes data structures and data analysis tools based on the Python programming language. It allows organizations to use Python as an alternative to R for big data analysis projects. Operating System: Windows, Linux, OS X.
Now an Apache project, Storm offers real-time processing of big data (unlike Hadoop, which only provides batch processing). Its users include Twitter, The Weather Channel, WebMD, Alibaba, Yelp, Yahoo! Japan, Spotify, Group, Flipboard and many other companies. Operating System: Linux.
Formerly known as "Bigdata," Blazegraph is a highly scalable, high-performance database. It is available under an open source or a commercial license. Operating System: OS Independent.
Originally developed by Facebook, this NoSQL database is used by more than 1500 organizations, including Apple, CERN, Comcast, eBay, GitHub, GoDaddy, Hulu, Instagram, Intuit, Netflix, Reddit and others. It can support incredibly large clusters; for example, Apple's deployment includes more than 75,000 nodes with more than 10 PB of data. Operating System: OS Independent.
Developed by Twitter, FlockDB is a very fast, very scalable graph database that is good at storing social networking data. While it is still available for download, the open source version of this project has not been updated in quite a while. Operating System: OS Independent.
This Erlang-based project describes itself as "a distributed, ordered key-value store with strong consistency guarantee." It was first developed by Gemini Mobile Technologies and is used by several telecommunications carriers in Europe and Asia. Operating System: OS Independent.
Used by eBay, Baidu, Groupon, Yelp and many other Internet companies, Hypertable is a Hadoop-compatible big data database that promises fast performance. Commercial support is available. Operating System: Linux, OS X.
Cloudera claims that its SQL-based Impala database is "the leading open source analytic database for Apache Hadoop." It can be downloaded as a standalone product and is also part of Cloudera's commercial big data products. Operating System: Linux, OS X.
31. InfoBright Community Edition
Designed for analytics, InfoBright is a column-oriented database with a high compression rate. InfoBright.com offers paid, supported products based on the same code. Operating System: Windows, Linux.
Downloaded more than 10 million times, MongoDB is an extremely popular NoSQL database. An enterprise version, support, training and related products and services are available at MongoDB.com. Operating system: Windows, Linux, OS X, Solaris.
Calling itself the "fastest and most scalable native graph database," Neo4j promises massive scalability, fast Cypher query performance and improved developer productivity. Users include eBay, Pitney Bowes, Walmart, Lufthansa and CrunchBase. Operating System: Windows, Linux.
This multi-model database combines some of the capabilities of a graph database with some of the capabilities of a document database. Paid support, training and consulting are available. Operating system: OS Independent.
35. Pivotal Greenplum Database
Pivotal boasts that Greenplum is a "best-in-class, enterprise-grade analytical database" that can perform powerful analytics on very large volumes of data very quickly. It's part of the Pivotal Big Data Suite. Operating System: Windows, Linux, OS X.
"Full of great stuff," Riak comes in two versions: KV is the distributed NoSQL database, and S2 provides object storage for the cloud. It's available in open source or commercial editions, with add-ons for Spark, Redis and Solr. Operating System: Linux, OS X.
Now sponsored by Pivotal, Redis is a key-value cache and store. Paid support is available. Note that while the project doesn't officially support Windows, Microsoft has a Windows fork on GitHub. Operating System: Linux.
38. Talend Open Studio
Downloaded more than 2 million times, Talend's open source software offers data integration capabilities. The company also makes paid big data, cloud, data integration, application integration and master data management tools. It counts organizations like AIG, Comcast, eBay, GE, Samsung, Ticketmaster and Verizon among its users. Operating System: Windows, Linux, OS X.
Used by organizations like Groupon, CA Technologies, USDA, Ericsson, Time Warner Cable, Olympic Steel, The University of Nebraska and General Dynamics, Jaspersoft offers flexible, embeddable BI tools. In addition to the open source community edition, it comes in paid reporting, AWS, professional and enterprise versions. Operating System: OS Independent.
Owned by Hitachi Data Systems, Pentaho offers a variety of data integration and business analytics tools. A free community version is available; see Pentaho.com for information on paid, supported versions. Operating System: Windows, Linux, OS X.
Called an "open source leader" by market analysts, Spago offers BI, middleware and quality assurance software, as well as a Java EE application development framework. The software is all 100% free and open source, but paid support, consulting, training and other services are available. Operating System: OS Independent.
Short for "Konstanz Information Miner," KNIME is an open source analytics and reporting platform. Several commercial and open source extensions are available to increase its capabilities. Operating System: Windows, Linux, OS X.
BIRT stands for "Business Intelligence and Reporting Tools." It offers a platform for creating visualizations and reports that can be embedded into applications and websites. It is part of the Eclipse community and is supported by Actuate, IBM and Innovent Solutions. Operating System: OS Independent.
The successor to jHepWork, DataMelt can do mathematical computation, data mining, statistical analysis and data visualization. It supports Java and related programming languages including Jython, Groovy, JRuby and Beanshell. Operating System: OS Independent.
Short for "Knowledge Extraction based on Evolutionary Learning," KEEL is a Java-based machine learning tool that provides algorithms for a variety of big data tasks. It's also helpful for assessing the effectiveness of algorithms for regression, classification, clustering, pattern mining and similar tasks. Operating System: OS Independent.
Orange believes data mining should be "fruitful and fun," whether you have years of experience or are just getting started in the discipline. It offers visual programming and Python scripting tools for data visualizations and analysis. Operating System: Windows, Linux, OS X.
RapidMiner boasts more than 250,000 users, including PayPal, Deloitte, eBay, Cisco and Volkswagen. It offers a wide range of open source and paid versions, but note that the free, open source versions only support data in CSV or Excel formats. Operating System: OS Independent.
Rattle stands for "R Analytical Tool To Learn Easily." It provides a graphical interface for the R programming language, simplifying the processes of creating statistical or visual summaries of data, creating models and performing data transformations. Operating System: Windows, Linux, OS X.
SPMF now includes 93 algorithms for sequential pattern mining, association rule mining, itemset mining, sequential rule mining and clustering. It can be used on its own or incorporated into other Java-based programs. Operating System: OS Independent.
The Waikato Environment for Knowledge Analysis, or Weka, is a set of Java-based machine-learning algorithms for data mining. It can perform data pre-processing, classification, regression, clustering, association rules and visualization. Operating System: Windows, Linux, OS X.
This Apache project allows users to query Hadoop, NoSQL databases and cloud storage services using SQL-based queries. It can be used for data mining and ad hoc queries, and it supports a wide variety of databases, including HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage and Swift. Operating System: Windows, Linux, OS X.
Similar to the S language and environment, R was designed to handle statistical computing and graphics. It includes an integrated suite of big data tools for manipulation, calculation and visualization. Operating System: Windows, Linux, OS X.
Enterprise Control Language, or ECL, is the language developers use for creating big data applications on the HPCC platform. An IDE, tutorials and a variety of related tools for working with the language are available on the HPCC Systems website. Operating System: Linux.
Big Data Search
Java-based Lucene performs full-text searches very quickly. According to the website, it can index more than 150GB per hour on modern hardware, and it includes powerful and efficient search algorithms. Development is sponsored by the Apache Software Foundation. Operating System: OS Independent.
Based on Apache Lucene, Solr is a highly reliable and scalable enterprise search platform. Well-known users include eHarmony, Sears, StubHub, Zappos, Best Buy, AT&T, Instagram, Netflix, Bloomberg and Travelocity. Operating System: OS Independent.
This Apache project describes itself as "a high-performance, integrated and distributed in-memory platform for computing and transacting on large-scale data sets in real-time, orders of magnitude faster than possible with traditional disk-based or flash technologies." The platform includes data grid, compute grid, service grid, streaming, Hadoop acceleration, advanced clustering, file system, messaging, events and data structure capabilities. Operating System: OS Independent.
Calling its BigMemory technology "the world's premier in-memory data management platform," Terracotta boasts 2.1 million developers and 2.5 million deployments of its software. The company also offers commercial versions of its software, plus support, consulting and training services. Operating System: OS Independent.
58. Pivotal GemFire/Geode
Earlier this year, Pivotal announced that it would be open-sourcing key components of its Big Data Suite, including the GemFire in-memory NoSQL database. It has submitted a proposal to the Apache Software Foundation to manage the core engine for the GemFire database under the name "Geode." A commercial version of the software is also available. Operating System: Windows, Linux.
Powered by Apache Ignite, GridGain offers in-memory data fabric for fast processing of big data and a Hadoop Accelerator based on the same technology. It comes in a paid enterprise version and a free community edition, which includes free basic support. Operating System: Windows, Linux, OS X.
A Red Hat JBoss project, Java-based Infinispan is a distributed in-memory data grid. It can be used as a cache, as a high-performance NoSQL database, or to add clustering capabilities to frameworks. Operating System: OS Independent.
Nice start!!! I'll add my inputs as well