From the Archives: Analyzing Clustered Indexes

JasonHinsperger · ‎02-11-2015

In this post, originally written by Glenn Paulley and posted to sybase.com in September of 2010, Glenn talks about how the optimizer takes advantage of clustered indices and describes some of the statistics associated with clustered indices. Understanding this information can help you to determine whether or not a declared clustered index is being used effectively and to decide whether a rebuild of the index is a good idea to improve performance.

SQL Anywhere has offered support for clustered indexes since the 8.0.2 release. There is no physical difference between clustered and non-clustered indexes in SQL Anywhere; any index can be marked as clustered, including those indexes implicitly created for the maintenance of referential integrity constraints. Moreover, one can apply or remove the "clustering" attribute of an index using the ALTER INDEX statement. Since clustering is a hint, sorting is still required if a query contains an ORDER BY clause.

If an index is marked as clustered - and there can be at most one clustered index per table - then the server will attempt to ensure that the physical ordering of the rows in the table's pages matches, as closely as possible, the ordering of the corresponding index entries in that clustered index when the rows are first inserted. The advantage, of course, of a clustered index is that the server can take advantage of its clustering property during range scans, as clustered indexed retrieval during query processing should require reading the minimal number of table pages.

Clustering statistics

Beginning with the SQL Anywhere 10 release, the server maintains clustering statistics about each index in the database regardless of whether or not the index is declared CLUSTERED. These statistics can be found in the SYS.SYSPHYSIDX system view, and can be queried using a query such as the one below:



SELECT tbl.creator, tbl.table_name, ix.table_id, ix.index_name,

      COALESCE((IF tbl.clustered_index_id = ix.index_id THEN 'Y'

                ELSE 'N' ENDIF), 'N') AS clustered,

      pix.depth, pix.seq_transitions,

      pix.rand_transitions, pix.rand_distance,

      pix.key_value_count, pix.leaf_page_count

FROM SYS.SYSIDX ix JOIN SYS.SYSPHYSIDX pix ON (ix.table_id = pix.table_id

            AND ix.phys_index_id = pix.phys_index_id )

    JOIN SYS.SYSTAB tbl on (tbl.table_id = ix.table_id)

WHERE tbl.creator = 1 AND pix.depth > 1

ORDER BY tbl.table_name;

The statistics in the SYS.SYSPHYSIDX system view are correct as of the last database checkpoint.

The values depth, key_value_count and leaf_page_count are straightforward; these represent the number of levels in the index, the number of distinct key values, and the number of leaf pages in the index respectively. The other values queried above, namely seq_transitions, rand_transitions, and rand_distance, are defined as follows.

Consider any two adjacent (in lexicographic order) index entries in an index. If the two base table rows corresponding to these index entries are on the same table page, then the index entries are said to involve zero transitions. If the two base table rows are on adjacent table pages, then these two rows constitute a sequential transition. If they are further apart than adjacent pages, the two rows constitute a random transition. The number of table pages between the rows is termed the random distance.

With these statistics, the query optimizer can assess the clustering characteristics of any index, declared "clustered" or not, and adjust its cost estimates accordingly. Here's an example:

The indexes on lines 4, 5, and 6 in the display above are over the same base table, which has approximately five million rows. The index on line 4, which contains 4,956,540 distinct key values, is quite clustered even though it isn't declared as such: there are 41,747 sequential transitions and only 5,134 random ones (the missing number is the number of zero transitions). With these statistics, the query optimizer will assume the clustering property holds at execution time, resulting in the minimal number of table page retrievals when accessing the data through that index.

In contrast, consider the index on line 6, which is declared CLUSTERED. This index is problematic; it has only 6,123 sequential transitions and 4,806,149 random ones. Hence this index isn't really clustered at all; if we take the random distance amount and average it over all of the key values, each pair of adjacent index entries point to base table rows 2 pages apart. This extra retrieval cost is taken into account by the optimizer when determining the cost of an access plan that uses this index. As such, the DBA may determine that this index is a candidate for reorganization, either via the REORGANIZE INDEX statement or via the ALTER INDEX REBUILD statement.

The index on line 5 has approximately equal numbers of sequential and random transitions, indicating an equal mix of clustered and non-clustered rows. Reorganization may be beneficial in this case as well, though not to the same degree as the index displayed on line 6.

Index density and skew

The index statistics described above are readily available from the SYS.SYSPHYSIDX system view, as they are computed on-the-fly during index maintenance operations. There are three other statistics that can be retrieved from a SQL Anywhere server, but these involve scanning index and/or table pages in real time to determine their values.

The first is the output of "dbinfo -u", the dbinfo utility. If you specify "-u" on the dbinfo command line, dbinfo will compute page usage statistics for the entire database, which can give the DBA an idea of the amount of internal page fragmentation that the database contains.

The second and third statistics are index density and index skew, which are returned by the sa_index_density() system procedure. Calling this procedure results in a complete index scan and the procedure returns a result set containing the index density and skew values for the specified index.

Index density is the fraction (between 0 and 1) of the amount of space utilized in each index page. For B+tree indexes a density of 0.60 to 0.80 is typical, depending on a variety of factors including database page size, the size (and variance of the size) of the key, and the amount of key compression that can be attained.

Index skew is a measure of the "balance" of an index. Like many other B+tree index implementations, in SQL Anywhere indexes are not re-balanced upon delete operations. Consequently, in a table with significant "churn" it is possible for portions of the index to become relatively sparse, whereas index pages for other ranges of values may be nearly 100% full. The skew measure that is returned by sa_index_density() is the standard deviation of the natural logarithm of the number of index entries per page. By definition, skew cannot be less than 1.0.

Here's an example:

In the above example, we called sa_index_density() with no parameters so that density and skew statistics were returned for every index in the database.

Line 259 in the display above corresponds to the CLUSTERED index described above (line 6). The density is 0.54, meaning that approximately 50% of each index page is being utilized, which is below average. On the positive side, the skew is approximately 1.27, which is not too bad - recall that the natural logarithm of 100 is 4.605 and of 500 is 6.21, so a standard deviation of 1.27 isn't entirely unreasonable.

However, line 260 in this display represents the same unclustered index as on line 5 above, which has merely 24 unique values in five million rows. Here, the density is also low - 0.54 - but more importantly the skew is 2.105, indicative of an index that is more unbalanced and perhaps requiring reorganization.