Indexes and Index Usage for SAP IQ and SAP HANA Cloud, data lake
Are you new to SAP IQ? Have you used the product but are still not clear on what indexes to use or when to use them? Have you just migrated to SAP IQ 16 and are wondering if there are any indexing changes? This blog will discuss indexes and strategies for SAP IQ as well as for SAP HANA Cloud, data lake (based on SAP IQ technology).
When creating a table, a default index is placed on every column. This index is referred to as a Fast Projection (FP) index and has built in compression. NBit is the compression scheme employed. It uses <n> bits to index the dictionary where the data is stored and is enabled on every column except LOB and BIT column types.
Besides compression, the FP index also employs a feature called zone maps that can improve performance when executing queries.
Also known as a storage index, data skipping, or constraint exclusion, it optimizes scan processing by ignoring complete pages of values if the server can guarantee that there is no value of interest to the predicate(s) being searched on that page. A zone map is a data structure containing the min/max values of each page of column data, and is stored as part of the FP index.
A zone map is most effective if no other index (besides an FP index) exists on a column or the values in a column are arranged in a sequential order (for example, by transaction date). A zone map is less effective for tables with few rows, or for columns that have a uniform distribution of values across all pages. Zone maps require the values in the predicate(s) be fixed; not derived from a function, and supports four classes of predicates: equality (=), inequality (<, >, !=), between, and not between.
In addition to the basic FP indexes, SAP IQ supports these indexes:
- High Group (HG) – used for join columns and searchable columns whose data has more than 2,000 distinct values
- Word (WD) – Use for whole word searching of character based data
- Text (TEXT)– Use for full text searching of character based data
- Date (DATE) – Use for fields that have a DATE data type
- Time (TIME) – Use for fields that have a TIME data type
- Datetime (DTTM) – Use for fields that have a DATETIME/TIMESTAMP data type
- Compare (CMP) – Indexes the relationship (<, >, =) between any two distinct columns with identical data types, precision, and scale
The following indexes are included only for the reference of customers that have migrated from SAP IQ. Both of these indexes should be replaced by the HG index as it offers better performance.
- Low Fast (LF) – used for join columns and searchable columns whose data has fewer than 2,000 distinct values
- High Non Group (HNG) – used for range searches (except for date, time, and datetime data types) and for single column aggregations
When Are Indexes Used?
In most relational database engines, the optimizer chooses the best single index for a given table in the query. SAP IQ takes the optimization process to the next level by looking not only at each column in the query, but at all joins and search arguments.
For example, take this query:
Select a, b, c from table_a, table_b where table_a.date_field = table_b.date_field and table_a.date_field between ‘2020-01-01’ and ‘2020-02-01’ and month( table_a.date_field ) = 1
The optimizer will choose three different access paths to the date_field column in table_a. The optimizer could choose up to three different indexes for the date_field column on table_a. First, an index for the join to table_b would be needed. Second, an index for the range search in the BETWEEN statement would be needed. Lastly, the date function MONTH() can also use an index for the comparison to 1 (January).
Given the above index types and their use, the optimizer would prefer to use the HG index on the date field for the join. It would also prefer to use the DATE index to search on the month datepart as well as the BETWEEN range search.
In short, the optimizer will look at each join, where clause, aggregation, etc. to determine the best index (including zone maps) for each operation. Not just one index per table and not just one index per column. But rather, one index for each use of each column in the query!
Should We Be Concerned With Over Indexing?
The short answer is no! As mentioned previously, typical relational database engines pick the best single index to access data in a table. Second, when adding indexes to most database engines the load times tend to increase due to how data is processed for the additional indexes. Third, most database engines store data and indexes in different structures so that as more indexes are added the table size swells. All of these lead to trying to minimize the total number of indexes on a table in most engines.
SAP IQ is not prone to these limitations. We pick as many indexes as needed to process the known set of queries or expected use of the data. Indexes in SAP IQ are not as large as the data being indexed. Each of the indexes employs various levels of vectors, bitmaps, data representations, and compression to dramatically decrease the size. Lastly, when loading data, SAP IQ can consume all CPU resources on the machine that is processing the load.
These features lead to an engine that can handle as many indexes as necessary for all query (canned or ad hoc) activity. It is quite common to have 30-60% of all columns indexed with the advanced indexes (in some cases, multiple indexes per column).
With all this said, though, SAP IQ has a tremendous number of features that allow us to reduce the number of indexes over previous versions. One thing that I impress upon SAP IQ users is that you can actually reduce the number of indexes and not lose a significant amount of query performance. In fact, you may gain performance in other areas such as data loading. Additionally, fewer indexes means a smaller database, which then means much shorter maintenance tasks like consistency checks and backups.
When to Use Indexes?
Now that we have some basics about indexes, the big question is when to use them. Here are some simple guidelines that will help cover every situation. The general idea is to ask a series of questions about each column. The answers to the questions will dictate whether or not indexes are needed and which one(s) are best.
As the technology has matured, the way that indexes are used and how they are thought of has changed. Historically, we would simply index any field in the database that would be used for searching or aggregation. With modern systems growing into the hundreds of terabytes and petabyte scale, this leads to significantly larger systems. A 10% growth in database size due to additional indexes is significant when dealing with very large systems. Additionally, the optimizer and default storage structures became significantly smarter and allowed for data to be unindexed and still perform as if it were.
As a result, the idea of when to use indexes became significantly easier and more refined. First, if we think that the column be used in a significant percentage of the queries (more than 75% of the time), this it should be indexed for optimal performance. If the query is infrequently used, an index would only add overhead (data loading and storage) and remain relatively unused. Additionally, with the NBit and zone map features, not having an additional index on a column is not as detrimental as the default storage does some level of indexing for all columns.
Secondly, if the column has a data type of DATE, TIME, TIMESTAMP, or DATETIME, we need to add a DATE, TIME, or DTTM index in addition to the HG index. All date/time/datetime fields should have an HG index and one of the date indexes as systems quite frequently use date fields for searching, filtering, and aggregations.
If we simply stopped here, well over 90%, or more, of all use cases would be covered.
The remaining indexes have specific use cases: WD, TEXT, and CMP indexes. To cover these indexes and their use, consider these questions:
- Will whole word searching (word containment) be done on the character data?
- NO: Ignore the column
- YES: Use WD (WORD) index. No HG index is needed if word searching is the only use of this column.
- Will text searching be done on the character data?
- NO: Ignore the column
- YES: Use a TEXT index. No HG index is needed if text searching is the only use of this column.
- Will this column be compared to another column in the same table?
- NO: Ignore the column
- YES: Use a CMP (compare) index that includes both columns
To provide a bit more on the rationale of the indexing scheme described, we have to consider that there are two ways of classifying the indexes in SAP IQ: those that have detailed statistics and those that have minimal statistics. The HG, TEXT, and WD indexes contain detailed statistics. The CMP, DATE, TIME, and DTTM indexes contain minimal statistics.
The optimizer is greatly helped by having detailed statistics to make any decisions. Consequently, the HG index should be on any column that is used a majority of the time (more than 75% of the time).
The latest features of SAP IQ in the default index structure are much more tailored to efficiently handle situations where HG indexes do not exist. In fact, in many use cases we have seen that the HG index and the default index perform close to the same. They each attack the problem in very different ways with the net result being much better performance.
If we consider the previous example of a column that is a DATE data type, using a DATE index is appropriate. The DATE index does not have the detailed statistics for the optimizer, so adding an HG index is strongly encouraged. In short, any time a column is used in the WHERE clause, we want that column to have an HG index on it to provide detailed statistics for the optimizer.
Of course, the caveat to this would be the WD and TEXT indexes as they contain the statistics necessary for the optimizer. Additionally, these fields are generally wide text fields with a significant percentage of unique values. Having an HG index on those fields would be expensive in terms of space and time to load the data. If using a WD or TEXT index, do not use an HG index unless the field unless it is used in other types of searching.
The process to choose indexes may seem daunting especially if the data model has a significant number of tables and columns. The process outlined above assumes that very little is known about the application and its use of the data. This is, generally, not the case as most DBAs and data modelers do know more about the data and its use.
Should you know more about the model and data, the process can be simplified. I prefer to look at columns that are known to be used in WHERE clauses of most queries and reports. Indexes on join and search criteria are the single biggest factor in performance, so focusing solely on those will allow IQ to be indexed for nearly every situation.
Certainly some columns and indexes will be missed with this approach. However, since the focus was on columns that are heavily used, the impact of missing a few minor, or less frequently used, columns would be relatively low.
Lastly, know that IQ provides a safety net for us! IQ has an index advisor that can be enabled by setting the option Index_Advisor to ON.
set option public.Index_Advisor=’on’;
Once on, IQ will print index advice to query plans (if they are on) as well as to a system table that can be accessed via the stored procedure sp_iqindexadvice. The index advisor output will note under what circumstance the index is needed. The stored procedure will also aggregate the number of times the advice was made. Should that procedure be used to trend data over a period of time (days, weeks, even months) the number of times the same advice was made can help focus efforts on the most recommended advice.