SAP Hana Vora 1.3 was released last week.  It contains some major improvements over the previous version.  You can read all about them here.  I’d like to describe one of these features – the introduction of specialized analysis engines – in more detail

One of the key benefits of Hadoop is the ability to store any type of data – structured, semi-structured and unstructured – in your data lake.  This means, for example, that you can store time series data streamed from IoT applications alongside web log and clickstream data collected from your customer facing web site and applications, alongside user profile, request and usage data from your mobile and web applications.

Why is all this data collected?  There are a few reasons.  In some cases, it’s often required to keep this data in the case where you need to adhere to regulatory and legal requirements, in other cases it is collected for a specific purpose (eg. machine learning, or security tracking), and in still many other cases, it is collected for some unknown future purpose – enterprises choosing to store the data until they can figure out how to extract value from it.  The problem arises when it comes time to actually query the data based on a legal request, or simply explore the data to look for trends, troubleshoot problems, test a new algorithm, or otherwise try to derive some value from it.  Enterprises face a significant IT investment and learning curve to enable analysis of each type of data.  In many cases the technology required to perform this analysis requires separate hardware, different sets of (sometimes conflicting ( libraries and services to install and configure, and supports different OS versions.  All of these things make it a nightmare for the IT group to set up and manage from an infrastructure perspective.  Even if using cloud services, the different client tools and programming languages required place many new and often significant learning curves on your development teams.

This is where the new Vora analysis engines come in.  A major new feature of Vora 1.3, the introduction of different data analysis engines allow you expand the breadth of data in your data lake that Vora can ingest and analyze, without exponentially increasing the management requirements of your IT group or the programming knowledge of your developers.  All of the new engines run as Vora services and can be deployed into your existing Hadoop and Spark infrastructure, and all the engines are accessed via one of the most common data querying languages – SQL.

Additionally, it is important to note that all of the engines are distributed.  You can run an engine on as many nodes as you need to provide the resources required to load and analyse the data, and define the way you wish to partition your data across those nodes.  The Vora transaction co-ordinator provides distributed query processing capabilities to ensure queries get to the right nodes, and a complete result set is returned from your query.

The basic architecture of Vora as it relates to the new engines can be seen below:

In short, you deploy the engine services you wish to use to the nodes in your cluster, and access them via the same Vora access layer as the existing in-memory relational engine.  The Vora distributed processing layer takes care of routing your queries, gathering the results and sending them back to the client.

We have introduced 4 new engines in this release.

Time Series Engine

The time series engine provides specific processing and functionality for time series data, which is any measurement taken at a regular time interval.  The engine provides significant data compression in-memory, even if the time intervals do not have to be equidistant (taken at exactly the same time interval), which means you can often analyse significantly more raw time series data than you might expect.  In addition, the time series engine provides specialized functions to examine things like correlations and trends in the data, approximate values to fill in or extrapolate the series, group and count values and even change the granularity of the time series measurements.

To use the time series engine, you deploy it to one or more nodes in your cluster using the Vora Manager service.  Once deployed, you can access it using SQL via your favorite dev. tool to create partitions and define time series tables.  You can read more about the details of the syntax in the Vora developer’s guide (https://help.sap.com/hana_vora), but here is an example the creates a time series table from an ORC formatted file using a predefined partition rule, and an explicitly declared series definition.

 

CREATE TABLE my_ts_table (
    ts TIMESTAMP,
    measure double,
    comment varchar(1024)
)
SERIES ( PERIOD FOR SERIES ts START TIMESTAMP '2015-01-01 00:00:00' 
        END TIMESTAMP '2016-01-01 00:00:00'
EQUIDISTANT INCREMENT BY 30 MINUTE DEFAULT COMPRESSION use (APCA error 3.0 percent)
    compression on (measure) use (SDT error 5.0 percent) )
PARTITION BY PS1( ts )
USING com.sap.spark.engines
OPTIONS (
    files "/user/hdfsuser /data.orc",
    format “orc”,
    storagebackend "hdfs"
)

Once created, you are free to access the table using standard SQL, and the time series-specific functions we provide.  However, the time series engine is not completely integrated into Spark SQL at this point, so to execute your statements, you must access it as a ‘Raw’ datasource.  This simply means that you must enclose any statements you execute against the time series engine in back quotes ( `, ascii 0x60 ).  For example, here is a select statement as you might issue it in the spark shell.

import org.apache.spark.sql._
val sqlContext = new SapSQLContext( sc )
sqlContext.sql("""``SELECT TREND(measure)
     | FROM my_ts_table
     | WHERE PERIOD BETWEEN TIMESTAMP '2015-01-01 00:00:00'
     |     AND TIMESTAMP '2016-01-01 00:00:00'``
     | USING com.sap.spark.engines""".stripMargin).show()

 

Graph Engine

The graph engine provides analysis functions that examine and analyse the relationships between entities.  Examples of the type of data used in graph analysis include network topologies, social networks and knowledge graphs.  Typical graph analysis functions provided by the graph engine include the degree of a node (how many connections it has), the (shortest) distance between two nodes, and whether or not two nodes are connected.

Similar to the time series engine, you deploy the graph engine service to one or more nodes in your cluster, and then access it using SQL.  For example, here is how a table is created (details on the syntax can be found in the Vora developer’s guide):

 

CREATE GRAPH MOVIES PARTITION BY PS1(NODEID)
    USING com.sap.spark.engines
    OPTIONS (
        files "/user/<duser>/movies.jsg",
        storagebackend "hdfs")

As with time series, the graph engine is not completely integrated into Spark SQL, and so must be accessed using the ‘raw’ SQL syntax.  In addition, in Vora 1.3, in order for the graph engine to be able to load graph data, it must be provided in JSG format (JSG is a common semi-structured JSON based file format for describing graph data).  Here is a sample query as it might be executed from a spark shell:

 

import org.apache.spark.sql._
val sqlContext = new SapSQLContext( sc )
sqlContext.sql("``SELECT DEGREE(A) FROM ACTOR A USING GRAPH MOVIES``
     | USING com.sap.spark.engines").show()

Document Store Engine

The document store engine stores JSON documents in collections in-memory and provides SQL-based analysis of this data.  The JSON documents in general are semi-structured – meaning that each record in the document can contain different sets of fields.  Once loaded, you can use standard SQL to query this data, and return relational or json formatted result sets.  An additional clause ‘IS MISSING’ allows you to search for documents where fields are missing from a document. In addition, you can use nested dot notation to refer to fields in a document, as well as bracket notation to access arrays within JSON documents.

For example, here is how you might create schema in the document store:

 

CREATE COLLECTION t
PARTITION BY PS2("state")
USING com.sap.spark.engines
OPTIONS (files 'mydata.json')

For more traditional SQL/RDBMS users, you can think of a collection loosely as a table with variable schema, and a JSON document as a row in that table, with each field in the document representing a column. As with the previous 2 engines, you use raw SQL to access the document store engine.  Here is what a query would look like if you were retrieving in a standard SQL result set format.  Notice the dot notation and array indicators in the where clause that help to specify which part of the document is being compared.

 

import org.apache.spark.sql._
val sqlContext = new SapSQLContext( sc )
sqlContext.sql("``SELECT ct as city, population as pop FROM <collection>
    |        WHERE employeecollection.address[1].street = ‘Maple’
    |            OR author.address IS MISSING``
    | USING com.sap.spark.engines").show()

This will generate a standard SQL result set, with a column for each field and a row for each document in the collection.  To return a JSON document instead of a SQL result set, you simply need to change the select list as follows:

sqlContext.sql(““SELECT {“ct”: “city”, “population”:”pop”} …

Note that the JSON document is returned as a single row, single column long varchar field.

Disk Engine

The disk engine provides an alternative way to analyse data when you do not have enough memory available. You can load the data into the disk engine, which processes and integrates the data into the Vora distributed query processing system.  This ensures you can analyse the data using Vora, join results with other engines if required and to ensure good query performance.  The disk engine is fully integrated into Spark SQL and can be accessed in the same way as the in-memory relational tables.

Cross Engine Querying

What makes this environment even more powerful is the fact that you can join query results from different engines to provide a more comprehensive analysis and generate new insights.

For example, the following is an example of how we might join a table from the time series engine with a table from the in-memory relational engine.

 

SELECT * FROM
        (``select "state_id", trend(measure) from time_series_table order by "_id" asc `` 
              USING com.sap.spark.engines) S1,
        (select state_id, state_name from region ) S2
WHERE S2.state_id = S1.state_id AND S2.state_name = 'MA'

Aside from joining result sets from different analysis engines, if you encounter a requirement for a built-in function that is not yet supported by a particular engine, you can use another engine to provide that functionality.  For example, the time series engine does not support the concat() function, so in order to use it, we combine our time series query with a standard Spark SQL query to get the desired result set:

 

SELECT concat( cast(BIN_START as char(8)), ' TO ', cast(BIN_END as char(8)) ) as BIN_ID, 
        FREQUENCY 
FROM (``SELECT * FROM HISTOGRAM ( SERIES ts_series, 5, DESCRIPTOR( ts_measure ) ) HIST``
            USING com.sap.spark.engines) AS T1

In conclusion, the new Vora 1.3 analysis engines provide a way to analyse many new types of data in a unified querying environment using SQL, opening the door to deeper analysis and new insights from your data lake, without requiring significantly more IT investment and developer training.

 

To report this post you need to login first.

2 Comments

You must be Logged on to comment or reply to a post.

  1. Aman Singh

    thanks Jason for the overview of the engines. Can you shed more light on the Disk Engine ? How is the data stored in a ‘disk table’? does it reside in HDFS ? where is the meta data stored ? and how if the performance of this engine vs the ‘in memory’ engine of VORA ? It would be great if you can show some lab results to compare the performance of the two.

    (0) 
  2. Jason Hinsperger Post author

    Hi Aman,

    The disk engine stores the data in a format optimized for use with Vora in a file on disk.  Right now, the data resides on the local node where the instance of the disk engine is running, guaranteeing data locality for performance, but at the cost of using up local disk. As far as disk engine performance compared to the in-memory performance, it always depends on the scenario, but generally the in-memory engine is significantly faster.  I am preparing some blog posts now that look at Vora performance.  I hope to start publishing them in the next few weeks.  Stay tuned!

    (0) 

Leave a Reply