Index Strategy for SAP HANA Cloud, Data Lake (and SAP IQ)
The SAP HANA Cloud, data lake was released in early 2020. However, the technology has been in use for quite some time. If you are familiar with SAP IQ, then you already know quite a bit about the SAP HANA Cloud, data lake as it is based on SAP IQ.
While most of the concepts of SAP IQ transfer to the SAP HANA Cloud, data lake, not all do. This blog will focus on the indexing strategies that are available in the SAP HANA Cloud, data lake (SAP IQ) and when to use them.
As SAP IQ is the underlying technology for SAP HANA Cloud, data lake, SAP IQ will be used when describing the technical features.
When creating a table, a default index is placed on every column. This index is referred to as a Fast Projection (FP) index and has built-in compression. NBit is the compression scheme employed. It uses <n> bits to index the dictionary where the data is stored and is enabled on every column except LOB and BIT column types.
Besides compression, the FP index also employs a feature called zone maps that can improve performance when executing queries.
A zone map, also known as a storage index, data skipping, or constraint exclusion, optimizes scan processing by ignoring complete pages of values if the server can guarantee that there is no value of interest to the predicate(s) being searched on that page. A zone map is a data structure containing the min/max values of each page of column data, and is stored as part of the FP index.
A zone map is most effective if no other index (besides an FP index) exists on a column or the values in a column are arranged in a sequential order (for example, by transaction date). A zone map is less effective for tables with few rows, or for columns that have a uniform distribution of values across all pages. Zone maps require that the values in the predicate(s) be fixed, not derived from a function, and they support four classes of predicates: equality (=), inequality (<, >, !=), BETWEEN, and NOT BETWEEN.
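To make this concrete, here is a minimal sketch of the two predicate situations; the sales_fact table and its columns are hypothetical:

```sql
-- Hypothetical fact table loaded in tx_date order, so each page
-- holds a narrow min/max range of dates in its zone map.
SELECT SUM(amount)
  FROM sales_fact
 WHERE tx_date BETWEEN '2019-01-01' AND '2019-01-31';
-- The scan can skip every page whose min/max range lies
-- entirely outside January 2019.

-- A derived predicate cannot use the zone map, because the stored
-- min/max values cannot be compared against a function result:
SELECT SUM(amount)
  FROM sales_fact
 WHERE YEAR(tx_date) = 2019;
```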
In addition to the basic FP indexes, SAP IQ supports these indexes:
- High Group (HG) – used for join columns and searchable columns whose data has more than 2,000 distinct values
- Word (WD) – used for whole word searching of character-based data
- Text (TEXT) – used for full text searching of character-based data
- Date (DATE) – used for fields that have a DATE data type
- Time (TIME) – used for fields that have a TIME data type
- Datetime (DTTM) – used for fields that have a DATETIME/TIMESTAMP data type
- Compare (CMP) – Indexes the relationship (<, >, =) between any two distinct columns with identical data types, precision, and scale
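The syntax for creating these indexes follows a common pattern; the table, column, and index names below are illustrative:

```sql
-- HG index on a join/search column with more than 2,000 distinct values
CREATE HG INDEX product_id_HG ON line_items ( product_id );

-- DATE index on a DATE column
CREATE DATE INDEX sale_date_DATE ON line_items ( sale_date );

-- CMP index comparing two columns of identical data type, precision, and scale
CREATE CMP INDEX ship_vs_order_CMP ON line_items ( ship_date, order_date );
```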
The following indexes are included only for the reference of customers that have migrated from SAP IQ. Both of these indexes have been deprecated and should be replaced by the HG index as it offers better performance.
- Low Fast (LF) – used for join columns and searchable columns whose data has fewer than 2,000 distinct values
- High Non Group (HNG) – used for range searches (except for date, time, and datetime data types) and for single column aggregations
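For migrated systems, a deprecated LF or HNG index can be replaced with an HG index on the same column. A sketch, with illustrative names (DROP INDEX syntax may vary slightly by version):

```sql
-- Replace a deprecated LF index with an HG index on the same column
DROP INDEX orders.status_LF;
CREATE HG INDEX status_HG ON orders ( status );
```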
When Are Indexes Used?
In most relational database engines, the optimizer chooses the best single index for a given table in the query. SAP IQ takes the optimization process to the next level by looking not only at each column in the query, but at all joins and search arguments.
For example, take this query:
select a, b, c
from table_a, table_b
where table_a.date_field = table_b.date_field
and table_a.date_field between '2019-01-01' and '2019-02-01'
and month( table_a.date_field ) = 1
The optimizer could choose up to three different indexes for the date_field column on table_a. First, an index for the join to table_b would be needed. Second, an index for the range search in the BETWEEN statement would be needed. Lastly, the date function MONTH() can also use an index for the comparison to 1 (January).
Given the above index types and their use, the optimizer would prefer to use the HG index on the date field for the join. It would also prefer to use the DATE index to search on the month datepart as well as the BETWEEN range search.
In short, the optimizer will look at each join, where clause, aggregation, etc. to determine the best index (including zone maps) for each operation. Not just one index per table and not just one index per column. But rather, one index for each use of each column in the query!
In traditional relational database engines, this means that we design an indexing strategy that best traverses the table in most or all cases, often by using indexes that cover multiple columns, or multi-column indexes. As mentioned, SAP IQ will choose the best index for each operation in the WHERE clause and does not carry the overhead of many indexes that most database engines do. Consequently, we don’t see nearly the need for multi-column indexes in SAP IQ. Just because we would use a multi-column index in another technology does not mean that we should apply the same logic to SAP IQ; it is a different engine that does not conform to that norm.
While SAP IQ does support multi-column indexes, the general recommendation is that they not be used or be used sparingly. More often than not, SAP IQ performance when using single column indexes will outperform having a multi-column index.
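For example, where another engine might call for a composite index, SAP IQ is generally better served by separate single-column indexes, each usable on its own for any join or predicate. The table and index names here are illustrative:

```sql
-- Typical row-store approach: one composite index
-- CREATE INDEX orders_cust_date ON orders ( customer_id, order_date );

-- Preferred SAP IQ approach: independent single-column indexes
CREATE HG INDEX customer_id_HG ON orders ( customer_id );
CREATE HG INDEX order_date_HG ON orders ( order_date );
CREATE DATE INDEX order_date_DATE ON orders ( order_date );
```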
Additionally, we need to consider the data manipulation ramifications. Having a multi-column index means that SAP IQ will consume more storage and will take more time and resources when loading, updating, or deleting data.
The approach that has worked in SAP IQ for quite some time, is to index the individual columns, using single column indexes. After monitoring and tuning, only then should we start to look at a multi-column index to solve a performance issue. Very rarely do we ever get to the second step of testing and using multi-column indexes. This is due to the speed and performance that SAP IQ brings to market.
Should We Be Concerned With Over Indexing?
I used to say that we should not be worried about over indexing. This has changed with the release of SAP IQ 16.0 and SAP IQ 16.1 (as well as the HANA Cloud, data lake).
As it relates to query activity only, having more indexes never hurts. We don’t live in a query-only world, though. We have to load data and maintain the system, and it is in these non-query areas where over indexing can be felt.
We do need to be concerned with over indexing outside of queries. It increases the size of the table (and database), which in turn increases maintenance activities like backups and consistency checks. It also increases load times, as we have to maintain another structure. The good news is that SAP IQ is quite efficient, so the impact of this is minor compared to other similar database engines.
My Index Opinion
We will get into this more in the following section, but my philosophy is that we do need indexes in certain circumstances (referential integrity, joins, date fields, possibly group by or order by columns). These are what I call the primary use of indexes, and they likely cover 75-80% of the indexes ever needed for good-performing queries. Generally, this should add no more than 3-5 indexes to a table. Then we get into what I call secondary indexes. Secondary because their use is not easily quantifiable. A column that is used infrequently, as opposed to a join column that is used all the time, is likely one that will not benefit from having an index.
The IQ index advisor makes many recommendations on index creation. What it cannot quantify, though, is the impact that an index will have. In fact, our documentation rightly states that adding an index may have no noticeable effect on a query. This is because the index advisor bases its recommendations on query patterns and its costing algorithms: if an index will save any amount of work, however negligible, it is recommended. That negligible amount may save 1 ms of runtime. You must balance that against the impact on the non-query areas (loads, backups, etc.) and decide whether following the advice is good for your business. My strong recommendation is that you do not blindly follow the index advisor. Be frugal with your IQ resources and make an informed decision before placing those indexes in production.
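If you do want to review the advisor's suggestions before deciding, SAP IQ exposes a database option that surfaces them in query plan output. The option name shown here is as commonly documented for SAP IQ 16; verify it against your version's documentation:

```sql
-- Enable index advisor messages for this connection only
SET TEMPORARY OPTION Index_Advisor = 'ON';
```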
On the flip side, we also have the workload monitor, which will tell you which indexes (among other things) are or are not used. It does not, however, tell you how effective an index was when it was used.
This is all a long way of saying that the features that tell you when to add indexes, or when indexes are used, do nothing to tell you how effective those indexes actually were. This is where you need to test the change. You then need to decide whether the query improvement is worth the overhead that ripples through the system in the form of longer load times and a larger database object.
I will say, though, that the optimizer and fundamental changes in SAP IQ 16.1 do a really good job of traversing your data without needing everything to be indexed. In fact, having 3 to 10 indexes per table will likely suffice. Any indexes on the table should be ones that are used by 80% or more of all queries accessing that table. If an index is infrequently used, it is likely not beneficial in the long term.
When to Use Indexes?
Now that we have some basics about indexes, the big question is when to use them. Here are some simple guidelines that will help cover every situation. The general idea is to ask ourselves questions about how the data will be used to search (or join). The answers to the questions will dictate whether or not indexes are needed and which one(s) are best.
As the technology has matured, the way that indexes are used and how they are thought of has changed. Historically, we would simply index any field in the database that would be used for searching or aggregation. With modern systems growing into the hundreds of terabytes and petabyte scale, this leads to significantly larger systems. A 10% growth in database size due to additional indexes is significant when dealing with very large systems. Additionally, the optimizer and default storage structures became significantly smarter and allowed for data to be unindexed and still perform as if it were.
As a result, the idea of when to use indexes became significantly easier and more refined. First, I prefer to focus on only those columns that I feel will be used more than 50% to 75% of the time, preferably only those columns used to search or join in more than 75% of the queries. I do this to reduce the overhead for infrequently used columns and indexes. Additionally, with the NBit and zone map features, not having an additional index on a column is not as detrimental as it once was, since the default storage does offer a level of indexing for all columns.
Second, if the column has a data type of DATE, TIME, TIMESTAMP, or DATETIME, we need to add a DATE, TIME, or DTTM index in addition to an HG index. In my opinion, all date/time/datetime fields should have an HG index and one of the date indexes, as systems quite frequently use date fields for searching, filtering, and aggregations.
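Following that guideline, a TIMESTAMP column would carry both indexes; the table and index names here are illustrative:

```sql
-- HG index supplies detailed statistics for joins and searches;
-- DTTM index serves datepart and range searches on the same column
CREATE HG INDEX event_ts_HG ON events ( event_ts );
CREATE DTTM INDEX event_ts_DTTM ON events ( event_ts );
```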
If we simply stopped here, well over 90% of all use cases would be covered.
The remaining indexes have specific use cases: WD, TEXT, and CMP indexes. To cover these indexes and their use, consider these questions:
- Will whole word searching (word containment) be done on the character data?
- NO: Ignore the column
- YES: Use WD (WORD) index. No HG index is needed if word searching is the only use of this column.
- Will text searching be done on the character data?
- NO: Ignore the column
- YES: Use a TEXT index. No HG index is needed if text searching is the only use of this column.
- Will this column be compared to another column in the same table?
- NO: Ignore the column
- YES: Use a CMP (compare) index that includes both columns
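Sketches of each case follow; the table and column names are hypothetical, and a TEXT index may also accept a text configuration object, omitted here:

```sql
-- Whole word searching on character data
CREATE WD INDEX notes_WD ON tickets ( notes );

-- Full text searching on character data
CREATE TEXT INDEX notes_TEXT ON tickets ( notes );

-- Comparing two columns of identical data type, precision, and scale
CREATE CMP INDEX actual_vs_budget_CMP ON budgets ( actual_amt, budget_amt );
```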
To provide a bit more on the rationale of the indexing scheme described, we have to consider that there are two ways of classifying the indexes in SAP IQ: those that have detailed statistics and those that have minimal statistics. The HG, TEXT, and WD indexes contain detailed statistics. The CMP, DATE, TIME, and DTTM indexes contain minimal statistics.
The optimizer is greatly helped by having detailed statistics to make any decisions. Consequently, the HG index should be on any column that is used a majority of the time (50-75%). If we consider the previous example of a column that is a DATE data type, using a DATE index is appropriate. However, the DATE index does not have the detailed statistics for the optimizer, so adding an HG index is strongly encouraged. In short, any time a column is used in the WHERE clause, we want that column to have an HG index on it to provide detailed statistics for the optimizer.
Of course, the caveat to this would be the WD and TEXT indexes, as they contain the statistics necessary for the optimizer. Additionally, these fields are generally wide text fields with a significant percentage of unique values. Having an HG index on those fields would be expensive in terms of space and time to load the data. If using a WD or TEXT index, do not add an HG index unless the column is also used in other types of searching.
The process to choose indexes may seem daunting especially if the data model has a significant number of tables and columns. The process outlined above assumes that very little is known about the application and its use of the data. This is, generally, not the case as most DBAs and data modelers do know more about the data and its use.
Should you know more about the model and data, the process can be simplified. I prefer to look at columns that are known to be used in WHERE clauses of most queries and reports. Indexes on join and search criteria are the single biggest factor in performance, so focusing solely on those will allow IQ to be indexed for nearly every situation.
Certainly some columns and indexes will be missed with this approach. However, since the focus was on columns that are heavily used, the impact of missing a few minor, or less frequently used, columns would be relatively low.
How to Create Indexes
With SAP HANA Cloud, data lake, you can gain access to the underlying data lake instance by using the REMOTE_EXECUTE procedure. Using the above SQL query as an example, indexes would be created like this:
call "SYSRDL#CG".REMOTE_EXECUTE('
create HG index table_a_date_field_HG on table_a ( date_field );
create HG index table_b_date_field_HG on table_b ( date_field );
create DTTM index table_a_date_field_DTTM on table_a ( date_field );
create DTTM index table_b_date_field_DTTM on table_b ( date_field );
');
The underlying SAP IQ technology used in the SAP HANA Cloud, data lake is quite forgiving when it comes to indexes. We don’t need to be completely accurate to begin with. We can simply focus on those columns we know will be used a majority of the time. This approach allows for minimal effort to achieve a near-maximum outcome: a finely tuned data lake.
Continue on to this blog to read about how to optimize the index build process for the data lake.