Many of you are aware that SAP set a Guinness record in 2014 for the World’s Largest Audited Data Warehouse at 12PB, with SAP IQ and SAP HANA.  During the course of that effort, we determined that an enhancement to the IQ query optimizer would result in radically improved performance for certain kinds of queries on hash partitioned or hash-range partitioned tables.  The particular queries I am referring to are applicable to Big Data scenarios, such as analytics on large amounts of sensor data.  Here, you might be collecting sensor data at many geographic locations, and want to analyze the data from one particular location only.  The data modeling for this use case lends itself to data partitioning – where each partition is mapped to a particular site location and the data in that partition represents the sensor data associated with that location.  When doing analytics on sensor data localized to a particular site, database queries that quickly eliminate all partitions but the one of interest will perform dramatically better than a standard query using indexes alone.

The query engine optimization we implemented in IQ 16 SP08 PL3 leverages the information available in a hash-partitioned or a hash-range-partitioned table to improve the performance of some equality conditions.   The optimization is applied when the SQL statement contains a condition of the form:  <column> = <constant-expression>, and the column is the sole basis for the hash partitioning of the table.  When these conditions are true, then the query optimizer will infer an additional condition of the form:

      
HashPartitionNumber( <column> ) = <integer-constant>

where the integer constant is the hash partition number for the constant expression in the original condition.  This condition will appear in the query plan like any other inferred condition.  By executing this inferred condition before the original equality condition, the original equality condition needs to scan only the cell values for rows contained within that one hash partition.

We have gathered some impressive performance statistics on large volumes of hash partitioned data with this query optimizer enhancement.  Here is a definition of a hash partitioned table:

create table sensor_data (

     measurement_datetime   datetime        NULL,

     rig_id                 VARCHAR(20)     NULL,

     sensor_id              VARCHAR(20)     NULL,

     measurement_value      integer

)

PARTITION BY HASH (rig_id);

No additional secondary indices were added to this table.  The table was loaded with 1.8 trillion records, and the following test queries were executed, before and after applying the query optimization enhancement:

  • select count(*) from sensor_data where rig_id=’rig_29′
  • select sensor_id, measurement_value, measurement_datetime from sensor_data where
    rig_id=’rig_101′
  • select sum(measurement_value) from sensor_data where rig_id=’rig_102′

All of these queries executed between 50 and 200x faster after the query optimizer improvement was applied. 

SAP has been investing significant R&D in IQ 16 scale out improvements.  The query optimizer improvement described here was released as part of IQ 16 SP08 PL3.   It is a testament to the solid design and architecture of IQ that the engineering team was able to implement this enhancement in the span of about 1 month.  Note that we have other scale out improvements planned for IQ 16 – most noteworthy is an IQ shared nothing MPP configuration.

We are excited about continuing innovations in SAP IQ, targeted at very large scale data analytics. You can read more about SAP IQ on our SCN site: http://scn.sap.com/community/iq

To report this post you need to login first.

4 Comments

You must be Logged on to comment or reply to a post.

  1. Rey Wang

    Can I do update to the hash key here?

    Such as :

    update sensor_data set rig_id = ‘a new value’;

    Could this be moving partition data to a new partition?

    (0) 
    1. Mark Mumy

      Unfortunately, no. Most database engines dont allow changing the partition key columns data in hash or range partitioned tables.

      Mark

      (0) 
      1. Rey Wang

        1. Partition by hash ( year_month, account_id)

        or

        2. Partition by hash ( account_id, year_month)


        Assume number of distinct for year_month (500) & account_id (10,000), which partition key is better here?

        (0) 
        1. Courtney Claussen Post author

          From our engineering team:

              The current partition elimination optimization is limited to hash partitioned and hash-range partitioned tables which have a single column as the hash partitioning basis, and it applies only to queries containing a local condition on that partition basis column.

              For tables with multiple columns as the hash partition basis, it does not matter which order those columns are listed within the CREATE TABLE partitioning clause.  The partitioning mechanics ensure that the partitions are the same regardless of that ordering, and all the hash partitioned query algorithms can handle any ordering.

          (0) 

Leave a Reply