The buzz around Big Data has hardly begun to fade, and it remains the driving force behind many ongoing waves of digital transformation, including artificial intelligence, data science, and the Internet of Things (IoT). SAP has been developing data management solutions over the past decades, and among them is SAP Vora (formerly known as SAP HANA Vora), which provides enriched interactive analytics on Big Data stored in Hadoop. Vora is basically a query engine with in-memory capabilities; it plugs into the Apache Spark execution framework and helps combine Big Data with enterprise data in a fast and simple manner.

Most of us are now more than familiar with terms like Big Data, Hadoop, Spark, NoSQL, Hive, Cloud, and SAP Vora. Coming from an SAP ECC background, these concepts were totally new to me, and in the beginning I found it a bit difficult to navigate this ocean of concepts: Big Data is not a single technology, but rather an umbrella term for the combination of all the technologies needed to gather, organize, process, and draw insights from large datasets. The motivation behind this blog is to provide a single document for SAP developers who are new to Big Data and Hadoop concepts and interested in understanding more about analytics on both enterprise and Hadoop data using SAP Vora. I have tried to keep this document simple; it aims to explain Big Data at a fundamental level from an SAP perspective, give a high-level look at SAP Vora, and define common concepts you might come across while researching the subject.

The document will cover the following topics in brief.

  • What is Big Data?

  • How is Big Data generated?

  • How ‘big’ is Big Data?

  • How is Big Data stored?

  • Hadoop

  • HDFS and its key features

  • Apache Spark and other Hadoop Tools

  • Why SAP Vora?

  • Evolution of SAP Vora

  • Key features of SAP Vora

  • Is Vora the only way to integrate HANA and Hadoop?

  • How is Vora different from other SAP tools?

  • Who will benefit from using SAP Vora?

  • Basic Hands-on

    • Requisites

    • Cloud-based SAP HANA Vora developer edition running on Amazon AWS

    • Apache Ambari

    • Apache Zeppelin

    • Jupyter Notebook

    • Vora Tools

    • Data browser

    • SQL editor

    • Modeler



  • Major improvements with Vora Version 1.3

    • Bringing tables/views from HANA and Hadoop

    • Creating OLAP model on both data sources



  • Major improvements with Vora Version 1.4

    • Time-series engine

    • Graph

    • Document store

    • Disk engine




All images are courtesy of SAP and property of SAP.

First we talk about Big Data.

What is Big Data?

‘Big Data’, as the name implies, refers to the big volumes of data that inundate a business every minute of every day. It can be in both structured and unstructured formats. Big Data can be analyzed for insights that lead to better decisions and strategic business moves, and it will fundamentally change the way businesses compete and operate.

How is Big Data generated?

Most digital processes and social media exchanges produce Big Data. New kinds of data signals have emerged in the past 10+ years: every mobile phone, sensor, website, and application across the Internet generates massive amounts of data every minute.

  • Social data comes from the likes, tweets and retweets, comments, video uploads, and general media uploaded and shared via the world’s favorite social media platforms. This kind of data provides invaluable insights into consumer behavior and sentiment and can be enormously influential in marketing analytics. The public web is another good source of social data, and tools like Google Trends can be used to good effect to increase the volume of Big Data.

  • Corporate data, such as server logs. Companies run a large number of data centers, and it is important for them to capture these logs so that they can anticipate what may happen and try to resolve problems, such as a node failure, a disk failure, or a memory failure, before a catastrophe actually happens in the data center.

  • Machine data is information generated by industrial equipment, sensors installed in machinery, and even web logs that track user behavior. This type of data is expected to grow exponentially as the Internet of Things grows ever more pervasive and expands around the world. Sensors such as medical devices, smart meters, road cameras, satellites, and games, and the rapidly growing Internet of Things, generate data of high velocity, value, volume, and variety.

  • Geospatial data comes from the handheld devices we all carry, which give out information about the “where” aspect: the latitude and longitude of what we are doing and where. This information is very important for a lot of companies, whether retail companies or other businesses, to be able to look at what a consumer is doing and provide the best services he or she needs, depending on where they are at that point in time.

Companies apply Big Data analytics to these huge data sets: the process of examining large and varied data to uncover hidden patterns, unknown correlations, market trends, customer preferences, and other useful information. On a broad scale, data analytics technologies and techniques help organizations make more informed business decisions, which leads to various business benefits, including new revenue opportunities, improved operational efficiency, more effective strategic marketing, better customer service, and competitive advantages over rivals. In short, anyone who makes or sells anything can take advantage of Big Data analytics to make their manufacturing and production processes more efficient and their marketing more targeted and cost-effective.

How ‘big’ is Big Data?

According to IBM, 2.5 billion gigabytes of data were generated every day in 2012. Facebook’s Hive data warehouse alone held 300 PB of data, with an incoming daily rate of about 600 TB, back in 2014. This is really big by any standard. And 90% of the data in the world today has been created in the last two years! This data comes from sensors, social media posts, digital pictures and videos, shopping records, GPS signals, and so on. About 75% of it is in unstructured form, coming from sources such as text, voice, and video.

How is Big Data stored?

Most traditional data was structured, i.e. neatly organized in databases. All these new kinds of 'Big Data' add to that data deluge, and they cannot be stored in a traditional database because most of the generated data is in unstructured format. Information that does not have a pre-defined data model or format is referred to as unstructured data. Traditional data processing systems (e.g. a relational data warehouse) may handle large volumes of rigid relational data, but they are not flexible enough to process semi-structured or unstructured data. These data sets are now so large and complex that we need new tools and approaches to make the most of them, so large businesses have to look at a much simpler and more cost-efficient way of storing this data. That's where Hadoop comes into play, providing the Hadoop file system layer where you can store the data in its raw format.



Many businesses today use a hybrid approach in which their smaller structured data remains in relational databases, while large unstructured datasets are stored in Hadoop, in distributed SQL databases like SAP HANA, Google F1 (built on top of Spanner), or Facebook Presto, or in NoSQL databases like Cassandra, MongoDB, CouchDB, or Neo4j. They all handle vast amounts of data efficiently, but they are used for different purposes. For example, Cassandra is a NoSQL data store based on a key-value pairing system, where the value is further structured into a columnar-like store, and it is best suited for search-engine-style scenarios. If the data will be used in an analysis to find connections between data sets, the best solution is a graph database; Neo4j is an example of a graph database. MongoDB and CouchDB are NoSQL data stores based on a key-value pairing system where the value is a JSON document. MongoDB has its own unique querying language, whereas CouchDB uses a combination of HTTP, JavaScript, and map-reduce for querying. Hadoop is not a single component; it is an ecosystem of integrated distributed computing tools, including its own columnar (HBase) and SQL-like (Hive) data storage platforms, which makes it much bigger in scope. Companies use one, or several, of these technologies to achieve different tasks; Hadoop, for example, is just one of many Big Data technologies employed at Facebook. We focus on Hadoop in this document because SAP HANA Vora is integrated with Hadoop.

Hadoop



Hadoop is not a database, nor is it a substitute for one. It is an entire ecosystem of integrated distributed computing tools, with the Hadoop Distributed File System (HDFS) as the storage part and the MapReduce programming model as the processing part. Hadoop MapReduce is a processing technique and programming model for easily writing applications that process vast amounts of data in parallel on thousands of nodes of commodity hardware in a reliable, fault-tolerant manner; in other words, MapReduce is an API for processing all the data stored in Hadoop.

Like other open-source projects, Hadoop comes in various flavors backed by enterprise providers such as Cloudera, Hortonworks, Amazon Web Services Elastic MapReduce, Microsoft, MapR, IBM InfoSphere Insights, etc. SAP also provides a Big Data solution that uniquely includes Hadoop and Spark operations services. (Don't confuse this with SAP Vora. SAP Vora is an in-memory query engine that provides enriched interactive analytics on data stored in Hadoop, whereas SAP Cloud Platform Big Data Services are comprehensive services, including deployment, automated operations management, and proactive support, featuring the key Big Data capabilities – Apache Hadoop, Apache Spark, Apache Hive, and Apache Pig – and also supporting third-party applications like H2O, Alation, AtScale, and more.)

SAP Vora supports the major Hadoop vendors, including Cloudera, MapR, Hortonworks, and SAP Cloud Platform Big Data Services.

As a software framework, Hadoop is composed of numerous modules. At a minimum, Hadoop uses:

  • Hadoop Common - A kernel to provide the framework's essential collection of common utilities and libraries.

  • HDFS - Hadoop Distributed File System is the core of Hadoop and its storage part. It provides scalable and reliable data storage that spans thousands of commodity servers.

  • Hadoop YARN - ‘Yet Another Resource Negotiator’, basically a resource manager that allows multiple data processing engines, such as interactive SQL, real-time streaming, batch processing, and data science workloads, to run on the data stored in HDFS.

  • Hadoop MapReduce – The API for processing data stored in Hadoop.


Some of the key features of HDFS are:

  • Variety and volume of data - HDFS can store huge volumes of data, i.e. terabytes and petabytes, and it can store any type of data, whether structured or unstructured.

  • Cost-effective and scalable - Hadoop is very economical, as HDFS is deployed on commodity (normal) hardware. We can scale the cluster simply by adding more nodes.

  • Reliability and fault tolerance - HDFS is very reliable and fault tolerant, as it divides the given data into blocks, replicates them, and stores them in a distributed fashion across the Hadoop cluster.


We can extend the basic capabilities of Hadoop by using the different tools available in its related projects; in simple words, Hadoop itself is a collection of many tools.



We have tools provided by different Hadoop vendors for monitoring and management, like Ambari, Cloudera Manager, or the MapR Control System.

Depending on the kind of workload you are running, you can use the respective project.



Hive is a data warehouse that facilitates ad-hoc queries and summarization of large datasets stored in HDFS via an SQL-like interface. Pig allows you to write complex data transformations without knowing Java. Pig Latin is an SQL-like scripting language that is very simple and very useful for developers already familiar with SQL and scripting languages. If you're doing NoSQL, you use things like HBase or Accumulo. If you want to stream data, you use Storm, or Spark Streaming within Apache Spark. If you want to do machine learning, Spark supports lots of machine learning algorithms that are easy to implement.
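As a taste of what that SQL-like interface looks like, here is a small Hive sketch; the table, columns, and file location are invented for illustration:

    -- Define a Hive table over raw, tab-separated log files that
    -- already sit in HDFS; no data is moved or converted.
    CREATE EXTERNAL TABLE web_logs (
      log_ts  STRING,
      user_id STRING,
      url     STRING,
      status  INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/web_logs';

    -- Ad-hoc summarization in plain SQL; Hive compiles this into
    -- distributed jobs behind the scenes.
    SELECT status, COUNT(*) AS hits
    FROM web_logs
    GROUP BY status;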



We can use Kafka, Flume, or Sqoop to ingest data into Hadoop from your transactional systems, whether it's IoT, a sensor, or any other kind of system. And for scheduling Hadoop jobs we can use Oozie.

In terms of security, we can use open-source projects like Ranger or Sentry. And for governance and integration, for example, data lifecycle management and governance, we have projects like Falcon and Atlas.

And finally, once we have the data, we can use Zeppelin, a developer-friendly tool that lets you run Scala or Python code on top of that data and enables interactive data analytics.

 

Why SAP Vora?

Most industries already run on very distributed data landscapes, which might include Hadoop, distributed SQL databases, SAP HANA, and so on. Customers who were using HANA were at the same time looking at extending their data platform to run petabytes of data. Or they want the business coherency that is required across enterprise data and Big Data, especially when trying to build mash-ups of enterprise data, which could be sitting in an SAP system or a non-SAP system. They have their enterprise data, and then they have these new kinds of data signals, which are stored in a Hadoop landscape. They need to combine the two to make sense of what's happening in the social circles of their consumer environment and tie that into what's happening with their point of sale.

Or it could be a real-time scenario where enterprise data and Big Data have to be correlated. A simple example: while processing a sales order, the business wants to give special discounts to customers based on their transaction history (say, the last 30 years). The recent transaction data is available in HANA, while the very old data has been moved to Hadoop. The analysis has to correlate the recent data residing in HANA with the old data residing in Hadoop, and it has to be processed on the fly, quickly. Being an in-memory database, HANA runs very fast. SAP Vora is likewise an in-memory query engine, and it can access data in Hadoop and process it in a fast way.

The products SAP had at that point in time did not allow customers to scale this data processing to data spanning tens of thousands of nodes. And though Hadoop offered low-cost storage for large data, enterprises were reluctant to adopt it, because it is hard to deal with the unstructured data in data lakes. That is one of the things Vora addresses: SAP developed Vora as a way to tackle specific business cases involving Big Data.

There were other open-source products available in the market, but some customers did not feel comfortable putting them into their production systems. That's where SAP came in and provided Vora, which allows you to process the data within those open-source Hadoop and Apache Spark frameworks.

SAP Vora builds structured data hierarchies over the unstructured data in Hadoop and integrates it with data from HANA; through the Apache Spark SQL interface it then enables OLAP-style in-memory analysis on the combined data. Vora serves as a mediator when compatibility questions arise between SAP HANA and Hadoop. Spark is not well integrated with HANA systems and HANA cloud offerings, so SAP built something that follows the Spark framework and also has HANA adapters for data connectivity. Typically, the data in Hadoop is unstructured and SQL cannot be run immediately on top of it. That's where Vora adds value and also acts as a bridge between HANA and Hadoop.

Evolution of SAP Vora

Before going directly to Vora, some insight into its evolution will be very useful. Back in 2012, SAP introduced a fast connection from SAP HANA into Hadoop, provided through virtual tables to Hive. In Hadoop, a relational structure is built on structured data sitting in HDFS, and you call these Hive tables. SAP provided Smart Data Access in SPS 06, which allowed these Hive tables to be exposed as virtual tables to be joined with data in HANA. Calc views are then created which combine the data from Hadoop and HANA and make it available for visualizations or for your applications. With SPS 07, this Hive connectivity was enhanced with a remote caching feature, which allows you to materialize the data on the Hive side.
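As a rough illustration of this workflow, the HANA-side SQL for exposing a Hive table as a virtual table looks something like the sketch below. The remote source name, DSN, credentials, and table names are made up, and the adapter configuration varies by landscape and HANA revision:

    -- Register the Hadoop/Hive system as a remote source in HANA
    -- (hiveodbc is the SDA adapter for Hive via ODBC).
    CREATE REMOTE SOURCE "MY_HADOOP" ADAPTER "hiveodbc"
      CONFIGURATION 'DSN=HIVE_DSN'
      WITH CREDENTIAL TYPE 'PASSWORD' USING 'user=hive;password=secret';

    -- Expose a Hive table as a virtual table in a HANA schema
    -- (the four-part name depends on the remote catalog layout).
    CREATE VIRTUAL TABLE "MYSCHEMA"."V_WEB_LOGS"
      AT "MY_HADOOP"."HIVE"."default"."web_logs";

    -- The virtual table can now be joined with native HANA tables,
    -- for example inside a calculation view.
    SELECT c.customer_name, COUNT(*) AS clicks
    FROM "MYSCHEMA"."V_WEB_LOGS" l
    JOIN "MYSCHEMA"."CUSTOMERS" c ON c.user_id = l.user_id
    GROUP BY c.customer_name;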

The data in HANA is in main memory, and the data in Hadoop back then was mostly on disk, so when you executed a join between HANA data and Hadoop data as part of a BI query, a MapReduce job was executed on the fly. This was very slow and time-consuming, so SAP provided the ability to materialize the Hadoop data as a Hive table and use that Hive table to join with HANA when building the calc view. This overcame some of the performance issues, but customers still noticed a slight delay when joining data over remote resources. Most customers were okay with that; in some cases they had to physically ETL the data from Hadoop into HANA. In SPS 08 and SPS 09 of HANA, SAP provided these MapReduce functions as part of a virtual user-defined function: when you're building your calc view, you create a virtual user-defined function which invokes a MapReduce job on the Hadoop side, bypassing the Hive layer.

This was roughly late 2013, when Apache Spark started receiving a lot more interest. With Spark, you can provide real-time access to the data in Hadoop without going through batch mode, and you can run some real-time functions since Spark allows you to store the data in memory. SAP introduced this connectivity to Spark using Apache Shark, the predecessor of Spark SQL, with a third-party ODBC driver offered by Simba to connect to Shark and reach the data in Spark. But with SPS 10, SAP created its own Spark controller, written in Scala, to provide the connectivity between HANA and Hadoop. It eliminates the requirement to install third-party ODBC drivers on the HANA side and also avoids bottlenecks caused by one of the core processes within HANA, the index server, being destabilized by non-SAP-delivered, third-party ODBC drivers.

Some of the things SAP provided as part of the Spark controller: It is installed on the Hadoop nodes. It interacts with the Hive metastore, so whether you have data stored in the Hive metastore or in the Spark catalog, you can expose it as virtual tables. From a workflow standpoint, you continue to use the HANA modeler, with either the studio or the Web-based modeling tool (Web IDE), to model your data irrespective of whether it is in HANA or in Hadoop, and then make it available for visualization. The key benefits are that it provides deep integration for both storage and processing, and that with the Spark controller SAP went beyond the read-only access it had with ODBC and Hive (Hive didn't have insert/update capabilities back in those days, though it provides them now). The Spark controller also provides the ability to write data into Hadoop, using a tool called data lifecycle management. SAP also provided unified administration by enabling the HANA cockpit to display tiles for your Hadoop administration, for example Ambari: as part of your HANA cockpit you can add a new tile showing your Hadoop nodes and monitor them from there. So it's a unified administration within the SAP HANA cockpit for managing not only your HANA assets but also your Hadoop assets. SAP went through several layers, several interfaces, and multiple iterative innovations in the connectivity between HANA and Hadoop, and finally settled on the Spark-specific integration using the Spark controller.

The Spark controller simplifies access to Hadoop: you can browse through the Hadoop data structures, go through the Hive metastore, and look at all the different tables in it. It enables the creation of virtual tables through the remote source interface, so you could be exposing a Hive table, a Spark RDD (Resilient Distributed Dataset, a fundamental data structure of Spark), or a data frame. All of those become available as virtual tables to be consumed from the HANA side. This provides a unidirectional connection: data in Hadoop, consumed from HANA.

Now let's switch to how Vora enhances this.

Vora takes this to the next level: where SDA stops, Vora takes over to provide much more optimized connectivity between HANA and Hadoop.

Vora is a combination of Hadoop/YARN and Spark (an in-memory query engine), and it has enhanced Spark SQL to handle OLAP analysis and hierarchical queries very well. Vora runs either on premise or in the cloud and provides features like drilldowns on HDFS. Vora can exist on a standalone basis on one of the Hadoop nodes, but it can also integrate with HANA. Classic HANA integration needs its own infrastructure and of course incurs infrastructure cost, whereas Hadoop integration requires no dedicated hardware and should cost next to nothing in terms of infrastructure.

The key features of SAP Vora are:

  • An open development interface

  • An in-memory query engine that runs on Apache Spark framework

  • Support for major Hadoop distributions

  • Compiled queries for accelerated processing across Hadoop Distributed File System nodes

  • Enhanced Spark SQL semantics that include hierarchies, enabling OLAP and drill-down analysis

  • An enhanced mash-up application programming interface for easier access to enterprise application data for machine learning workloads

  • Bidirectional connectivity between HANA and Hadoop


Is Vora the only way to integrate HANA and Hadoop?

No, we have other options.

We have the following methods to integrate the two, and the choice basically depends on your use case.

•ETL tools - such as SAP BODS. Using Spark or other libraries, the unstructured data in Hadoop is processed and stored as structured data. This is then used as a source for BODS via Hive adapters, exposed as Hive tables, and the structured data is loaded into SAP HANA. But loading large data sets into HANA results in memory load and more expense.


•Smart Data Access – We can use SAP HANA Smart Data Access (SDA) to read data out of Hadoop. SDA is widely used in hybrid model scenarios, i.e. SAP HANA + SAP NetWeaver BW powered by SAP HANA. We can access a “table” in a different repository from SAP HANA without actually having to bring the data over to SAP HANA. So you could have some data in SAP HANA and some data in Hadoop, and with SDA a simple UNION would bring data from both “tables” together. With Smart Data Access, however, we can only consume the Hadoop data from HANA; we cannot consume the SAP or HANA data from Hadoop.


•SAP BusinessObjects Universe – If your requirement is to report on Hadoop data out of the SAP BusinessObjects suite, you can combine data from any source using the SAP BusinessObjects semantic layer, the Universe. We can set up rules, relationships, etc.


•SAP Lumira – If the data handling and transformation are less complex and you only need front-end integration, you can access and combine data from Hadoop (an HDFS data set, a Hive or Impala data set, or an SAP Vora data set) and SAP HANA using SAP Lumira, which is basically a data visualization tool.


 

How is Vora different from other SAP tools?

SAP Vora allows you to process enterprise ‘hot’ data (structured data residing in databases) and ‘cold’ Big Data (structured/unstructured data in Hadoop) for real-time business applications and analytics, providing enterprise-class, drill-down insight into raw data in a very cost-effective manner. It allows you to combine business data with external data sources, blending incoming data from customers, partners, and smart devices into enterprise processes, thus helping the business make better decisions through greater context.

The word HANA in "HANA Vora" is misleading, because Vora is actually a stand-alone product that does not need HANA to run. With the release of Vora 1.4 in March 2017, SAP officially renamed it from ‘SAP HANA Vora’ to ‘SAP Vora’.

Vora is an extension of Apache Spark and allows you to process data from HDFS in memory. Although SAP Vora does not rely on SAP HANA, one of its key features is that it integrates well with HANA: it can join its local tables with tables from HANA, and vice versa. This helps correlate SAP HANA and Hadoop data for quick insight, supporting contextually aware decisions, with processing done either on Hadoop or in SAP HANA.

The following features make Vora different from other similar SAP tools.

First, Vora runs natively on your Hadoop nodes, so it is a first-class citizen in the Hadoop infrastructure. That means you use the Hadoop administration tools to install and monitor Vora. Vora also knows the data locality of the individual data pieces across the Hadoop nodes, so its SQL engine, with LLVM (compiler framework) optimization, can process that data much faster and then join the data coming from the distributed nodes.

SAP Vora provides enriched features for the Spark execution framework, such as data hierarchies that enable drill-down processing of raw data. To make data easier to use in business settings, Vora also provides features such as currency conversion, unit-of-measure conversion and time-dependent hierarchies.

Vora provides bidirectional connectivity by enhancing Smart Data Access. With Smart Data Access, we could only consume the Hadoop data from HANA. With Vora, you can also consume the SAP or HANA data from Hadoop or Spark using Vora's capabilities. Before Vora, for ‘outside in’ scenarios (meaning a user coming in from Hadoop, Spark, or Vora who wants access to data sitting in the HANA system), the only way to get access to the SAP data was to use ETL technologies to physically move the data from HANA into Hadoop, and then have your data scientists execute whatever algorithms they were building on top of Spark. Now, with Vora, you can expose this as a virtual artifact, which means you don't have to worry about delta loads or data consistency as new data keeps arriving. The data is not physically moved between systems. If you were doing ETL, you had to do the ETL data movement and delta loads into Hadoop as well; providing virtualized access means you don't have to worry about any of that. The view is always available to Vora or Spark, and it can consume the data as and when the data is updated on the ERP side or the HANA side. It could be an S/4HANA application (an ERP application) or SAP BW. In a BW scenario, if you have InfoCubes on your HANA system, you can expose those InfoCubes as a calc view, and the calc view can then be consumed by Vora using the data source API.
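Conceptually, the ‘outside in’ direction looks like the following sketch, which registers a HANA calculation view as a Vora relation through the com.sap.spark.hana data source. The connection details are placeholders, and the option names may differ slightly between Vora versions:

    -- Register a HANA calc view as a relation visible to Vora/Spark
    -- (host, instance, credentials, and view path are placeholders).
    CREATE TABLE hana_sales
    USING com.sap.spark.hana
    OPTIONS (
      host      "hana-host",
      instance  "00",
      user      "VORA_USER",
      passwd    "***",
      dbschema  "_SYS_BIC",
      tablepath "mypackage/CV_SALES"
    );

    -- The HANA data can now be queried and joined with
    -- Hadoop-resident tables without physically moving it.
    SELECT region, SUM(amount) AS total
    FROM hana_sales
    GROUP BY region;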

Another major benefit of Vora is that it replaces the existing ODBC and JDBC connectivity with direct access to Hadoop and Spark through the Spark controller. When third-party ODBC drivers were used, they would sometimes cause inconsistencies in the HANA index server process, one of the core processes in HANA. Vora provides deeper integration with Hadoop because its engine runs natively on each of the Hadoop and Spark nodes. There is tight integration between Spark and Vora: on the Vora side, SAP created its own SQL context. Just as you have the Spark SQL context, there is an SAP SQL context which allows you to use the Vora SQL engine to process data stored in HDFS, S3, or any of the supported data formats. Future integrations to Hadoop and Spark from SAP HANA will be driven through Vora. Vora delivers features for data consumption from both Hadoop and Spark, and from SAP HANA natively using calculation views, whether it's a CDS view or a BW InfoCube exposed as a calc view. And this extends the platform to data scientists.

Again, in the ‘outside in’ scenario, data scientists can consume SAP data without having to physically load it into the Hadoop layer. Customers who have been using SDA to connect to Hadoop can continue using Smart Data Access and the Spark controller, but if they're looking for deeper, more optimized integration, the recommendation is to look at SAP HANA Vora and what it offers. Beyond HANA integration, Vora also delivers OLAP-style reporting on Hadoop by taking advantage of enterprise analytics features like hierarchies and supporting native Hadoop file formats like Parquet and ORC. Vora also delivers performance optimizations: since it runs natively in the Hadoop ecosystem, it takes advantage of technologies like LLVM to translate or convert SQL code into C code. That, at a high level, is Vora's integration connectivity.

SAP HANA Vora provides a distributed computing framework at enterprise scale to accelerate, innovate, and simplify your data. By bringing HANA and Hadoop together, Vora provides that layer on top of Hadoop, enabling high-performing and flexible analytics and delivering enterprise-grade, advanced analytic capabilities. Enterprise-grade analytics means not just running basic reporting SQL on top of the data, but bringing in complex functions.

You can use Vora to write Hadoop data to HANA but "write to Hadoop" is a roadmap feature planned for future releases.



Who will benefit from using SAP Vora?

  • Data scientists can try new modelling techniques with a combination of business and Hadoop data to discover patterns. They can do this without duplicating data within data lakes.

  • Business analysts can use interactive queries across both business and Hadoop data, to perform root cause analysis to better understand business context.

  • Software developers can use their familiar programming tools to deploy a query engine within applications that can span enterprise and Hadoop systems.


 

Now let's get more familiar with the Vora application and its related tools. We need the following Vora services installed and configured in order to run Vora 1.2.

  1. SAP HANA Vora V2Server: SAP HANA Vora engine

  2. SAP HANA Vora Base: All libraries and binaries required by SAP HANA Vora

  3. SAP HANA Vora Catalog: A distributed metadata store for SAP HANA Vora

  4. SAP HANA Vora Discovery: HashiCorp's discovery service, which manages service registration and installs on all nodes.

  5. SAP HANA Vora Distributed Log: A distributed log manager providing persistence for the SAP HANA Vora catalog

  6. SAP HANA Vora Thriftserver: A gateway compatible with the Hive JDBC Driver

  7. SAP HANA Vora Tools: A web-based user interface with a SQL editor and OLAP modeler.


In order to enable developers to test-drive SAP HANA Vora, SAP provides access to a cloud-based developer edition running on Amazon AWS, which offers a simple way of provisioning your own Vora cluster running on top of Hortonworks HDP 2.3 and Apache Spark 1.4.1.

SAP Vora, developer edition, on SAP Cloud Appliance Library (CAL) comes with SAP Vora Tools pre-installed. To open the Tools web UI, click Connect on your SAP Vora instance in CAL. The popup displays application links such as Ambari, Vora web tools, and Zeppelin.

Apache Ambari provides an easy-to-use web UI to manage and monitor Apache Hadoop clusters and components, including the SAP Vora software. From a developer's perspective we don't have to deal much with monitoring via Ambari. If you want to have a look, just click the Ambari application link and log on using the user admin and the master password you defined during creation of the instance in CAL.



In the Ambari screen, you can see SAP Vora components along with other services. From this interface, you can start/stop cluster components if needed during operations or troubleshooting.



Vora Web tools will be discussed below in detail. Another option in the above CAL instance popup is Apache Zeppelin.

We have different choices for connecting to the Vora 1.2 infrastructure and coding with it: Apache Zeppelin, the Spark shell, and Jupyter Notebook are some of them. Apache Zeppelin, an open-source interactive browser-based notebook, is shipped along with SAP HANA Vora and is used in conjunction with a Spark shell to create SAP HANA Vora tables.

(Interactive browser-based notebooks enable data scientists, data engineers, and data analysts to develop, organize, execute, and share data code and visualize results. Using notebooks, data workers don't need to resort to the command line or cluster details, and notebooks enable them to work interactively with long workflows.) Apache Zeppelin brings create, load, explore, visualize, share, and collaborate features to Hadoop, Spark, and SAP Vora, using the various engine capabilities such as the time series, document, graph, and disk engines, and it can also access external data sources such as SAP HANA.

Apache Zeppelin is pre-installed on SAP Vora, developer edition, and you can try it by opening http://IP_ADDRESS:9099 in any browser.
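For example, a Vora table over a CSV file in HDFS can be created and queried from a notebook roughly as follows; the table name, columns, and path are illustrative, and the syntax follows the Vora 1.x Spark data source:

    -- Register a Vora table over a CSV file stored in HDFS.
    CREATE TABLE sales_orders (
      order_id INT,
      region   STRING,
      amount   DOUBLE
    )
    USING com.sap.spark.vora
    OPTIONS (
      tableName "sales_orders",
      paths     "/user/vora/sales_orders.csv"
    );

    -- Query it like any other relation.
    SELECT region, SUM(amount) AS total
    FROM sales_orders
    GROUP BY region;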



Scala is a widely used functional programming language, used in some Hadoop-ecosystem components like Apache Spark and Kafka. It is not possible to execute Scala code from the Vora SQL editor, so if you want to run Scala code, you can use the Spark shell.



Jupyter Notebook is also a web UI to manage metadata, share documents, run data cleaning and transformation, and import libraries, and it is integrated with Spark and SAP HANA Vora. You can also use Python and Scala to code with it, or you can use the Spark shell, which ships natively with the Spark installation. To use the Spark shell, you wrap your Vora SQL in Scala code.



 

Now let's check what the Vora Modeler looks like.

In the SAP Vora developer edition in CAL, the port of the Vora Tools web UI has been preconfigured and opened as one of the default access points.

As mentioned earlier, we don't need to install any separate software to get the modeler. It is part of Vora Tools, one of the components installed during your Vora installation. You can access the modeler via your browser using the IP address of the node where you installed Vora Tools, on port 9225.

The Vora modeler consists of three main components:

  • Data browser

  • SQL editor

  • Modeler


 

Now I will show you how to connect to the Vora modeler.

From the Vora instance on CAL (SAP Cloud Appliance Library), copy the IP address and paste it into a browser with port 9225 to connect to the Vora modeler, or use the Vora application link from the CAL instance popup.



Once Vora Tools opens in the browser, check whether the connection status is OK (by pointing at the respective icon in the top right of the screen) before working with any of its tools.



The Vora modeler consists of three main perspectives: data browser, SQL editor, and modeler. Each tool can be opened either from the home screen or from the shortcut icons on the left.



 

Data browser

The data browser allows you to quickly display relations such as tables, views, and cubes in SAP HANA Vora without having to write a query.



In the data browser, you can select a table or view from the navigation pane, and it will show you the first 1,000 rows in a tabular data view. You have access to both the data and the schema of the table: the names of the columns, their types, and the constraints on them. The browser allows you to filter or sort the displayed data, hide columns via the settings menu, and export the displayed data as a CSV file. The data browser becomes very handy when developing data scenarios, as you can go back and forth between your models to check the schemas without writing any code.

 

SQL Editor

In the SQL editor, you can execute SQL only; it is not possible to execute Scala code from the editor. To run Scala code, as mentioned, you can use a Spark shell. What you run in the SQL editor is saved in the local storage of the browser, so it cannot be shared between different browsers, and obviously you lose the content if you delete your browser's local storage. You can navigate to the SQL editor from the Vora modeler home screen, type in SQL queries, and execute them by selecting the respective query and clicking the execute button.



In the SQL editor, on the right-hand side, you can see the interpreter's messages and results, such as whether execution completed successfully or an error occurred. If you select a relation, you can see its content and save it as a CSV. And if you hover over a table name, the table schema pops up with information about the column names and types.

 

Modeler

Data modeling refers to the activity of refining or dividing data in database tables by creating views that portray a business scenario. You can query these views and use the query results for reporting and decision-making purposes.

The views can be simple dimensions or cubes. Creating views models entities (such as customer, product, sales, and so on) and their relationships. The SAP HANA Vora data modeler provides capabilities that help you create and edit views.

The Vora modeling perspective gives you access to your relations and allows you to create three types of views: SQL, dimension, and cube. The modeler lets you build joins and unions, create calculated columns, add aliases and annotations to tables and columns, and assign semantics. For nested queries, you can use subselect artifacts. Whatever you create in the modeler becomes SQL, and you can run that SQL query to get the result. You can learn more about the modeler from https://www.sap.com/developer/tutorials/vora-modeler-getting-started.html

Below you can see a view in the Modeler.



After you've created your Vora view or table, the relation persists in the Vora catalog. You can then use the created view in a visualization and analytics tool such as Lumira or Zeppelin, and continue with further data scenarios built on top of your SQL or OLAP modeling.

 

Major improvements with Vora Version 1.3

SAP released Vora 1.3 in December 2016, featuring major improvements over the previous 1.2 version. The Vora Modeler offers a range of new features introduced with Vora 1.3; we cover just the major ones here.

  • From 1.3 onwards you can also call HANA tables/views from the Vora modeler, with the Vora and HANA tables listed on the left-hand side of the modeler.




  • Creating an OLAP model on both data sources. Such a model brings together data from a HANA source and HDFS, showing how enterprise data sitting in an SAP or non-SAP system can be merged and combined with data lake scenarios in Hadoop. Vora offers many functions that Spark does not, and Vora is more performant than Spark engine processing.




  • Building hierarchies: a level-based hierarchy or a parent-child hierarchy can be created. Creating hierarchies in the Vora modeler is very easy, and it offers many functions to create and query a hierarchy at any level; see the sketch below.
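For instance, a parent-child hierarchy query in Vora SQL looks roughly like the sketch below. The source table h_src with pred/succ columns is invented, and the exact HIERARCHY clause may vary between Vora versions:

    -- Build a parent-child hierarchy from a table whose rows carry
    -- a predecessor (pred) and successor (succ) node id, then ask
    -- for the root nodes.
    SELECT name
    FROM HIERARCHY (
      USING h_src AS v
      JOIN PARENT u ON v.pred = u.succ
      START WHERE pred IS NULL
      SET node
    ) AS H
    WHERE IS_ROOT(node);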


 

Major improvements with Vora Version 1.4

In March 2017, with the release of version 1.4, SAP ‘HANA’ Vora became SAP Vora. The major improvements are:

Up to 1.3 there was only the relational in-memory support, and 1.3 started supporting multiple engines. Version 1.4 removed the standalone relational in-memory support and added it back as an engine. Except for the disk engine, which lifts data into memory for processing and drops it afterwards, the other engines store and keep their data in memory.

  • Time-series engine


This engine is optimized to store and analyze time-series data. It also supports compression of time series, partitioning, histograms, and much more.

As an example, a table definition using this engine can create a table called sales with a time series of equidistant points at an interval of 3 hours.
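A rough sketch of such a definition follows; the exact Vora 1.4 time-series clauses, the file path, and the column names are approximations:

    -- Time-series table with equidistant points every 3 hours
    -- (clause names are approximate; consult the Vora 1.4 docs).
    CREATE TABLE sales (
      ts      TIMESTAMP,
      revenue DOUBLE
    )
    SERIES (
      PERIOD FOR SERIES ts
      START TIMESTAMP '2017-01-01 00:00:00'
      END TIMESTAMP '2017-12-31 21:00:00'
      EQUIDISTANT INCREMENT BY 3 HOUR
    )
    USING com.sap.spark.engines
    OPTIONS (files "/user/vora/sales.csv");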



 

The table data can then be queried with further SQL.
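A query against it is then ordinary SQL, for example:

    -- Read a slice of the time series.
    SELECT ts, revenue
    FROM sales
    WHERE ts BETWEEN TIMESTAMP '2017-01-01 00:00:00'
                 AND TIMESTAMP '2017-01-07 00:00:00'
    ORDER BY ts;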



  • Graph


The graph engine allows you to run graph operations, like degree and distance, on data stored in Vora. Graphs in Vora have nodes and edges. Nodes have a type, incoming and/or outgoing edges, and optional properties. Edges have only a type and no properties.

The data must be provided as a JSG file in HDFS. JSG files are line-based JSON files in which each line contains a full JSON record. You have to convert your existing data into the JSG file format to use it.

As an example, a graph named Brands can be created in Vora, backed by a file called brands.jsg in HDFS.
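A sketch of such a definition follows; the CREATE GRAPH form and the option name are approximations:

    -- Graph backed by a line-based JSON (.jsg) file in HDFS.
    CREATE GRAPH Brands
    USING com.sap.spark.engines
    OPTIONS (files "/user/vora/brands.jsg");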



A query against it can then return, for example, a list of all brands that have more than one connection to other brands.
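Such a query might look like the sketch below; DEGREE follows the graph operations named above, and the exact syntax may differ:

    -- Brands connected to more than one other brand.
    SELECT b.name
    FROM Brands b
    WHERE DEGREE(b) > 1;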



 

  • Document store


This is a document-oriented, JSON-based SQL store.

The source data needs to be JSON files with one JSON document per line.



The data can then be queried using SQL, and nested data can be accessed using dot notation.
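A sketch of defining and querying a collection follows; the CREATE COLLECTION form is an approximation, and the collection name, path, and fields are invented:

    -- Document collection backed by line-based JSON files in HDFS.
    CREATE COLLECTION customers
    USING com.sap.spark.engines
    OPTIONS (files "/user/vora/customers.json");

    -- Nested fields are addressed with dot notation.
    SELECT c.name, c.address.city
    FROM customers c
    WHERE c.status = 'active';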

  • Disk engine


In version 1.3, the disk engine was only available on a single node. Version 1.4 removed this restriction to accelerate data access, and its features are now the same as those of the relational in-memory engine. We can push data loaded from sources like HDFS into the disk engine. The disk engine is in fact a columnar relational store, and querying data from this optimized store is faster than reading and interpreting data from the original data source.
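A sketch of creating a relation in the disk engine follows; the engine identifier and options are assumptions modeled on the other engine examples:

    -- Columnar relational table held by the disk engine; data is
    -- lifted into memory only while a query runs.
    CREATE TABLE sales_archive (
      order_id INT,
      region   STRING,
      amount   DOUBLE
    )
    USING com.sap.spark.engines.disk
    OPTIONS (files "/user/vora/sales_archive.csv");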