From the Archives: Self-Managing Index Clustering Statistics
This post was originally written by Glenn Paulley and posted to sybase.com in September of 2008.
One of the improvements that my team has made to self-managing statistics maintenance in SQL Anywhere is the automatic maintenance of index clustering statistics. This means that the server’s query optimizer can now make better index selection choices, by analyzing the clustering characteristics of the index and adjusting the physical access cost accordingly. It is important to note, however, that this increased accuracy helps for all kinds of indexes, not simply clustered ones. I’ll illustrate the optimizer’s use of these clustering statistics with a real-world customer example. Here is the example query:
SELECT * FROM TJ_TransactionJournal LEFT OUTER JOIN TJ_TransactionJournalDetail WHERE trn_endtime > '2008-06-01' and tli_lit_fk = 1
Two notes about the query: First, the outer join condition between the two tables is implied (a SQL Anywhere feature) by a PK-FK relationship between the two tables. Second, the
OUTER JOIN is rewritten by the optimizer to an inner joins, due to the null-intolerant predicate on the
The TJ_TransactionJournal table contains 6.77 million rows, the TJ_TransactionJournalDetail table just over 122 million rows. At first glance, a strategy utilizing a clustered index on the
trn_endtime column appears quite useful, since the selectivity of the predicate is estimated at 2.36% (actual is 1.85%). Here is the graphical plan for the query, showing the Details pane with these numbers:
Note, however, that rather than a clustered index scan, the optimizer has chosen a sequential scan over the TJ_TransactionJournal table instead. Why? Here is the Advanced Details pane for the TJ_TransactionJournal quantifier:
The Advanced Details pane lists each physical access path assessed by the query optimizer for this quantifier. In this example, the first physical access path listed is a sequential scan. The second in the series represents the index on the
trn_endtime column, which is denoted as a clustered index. Here are the values in that section:
Selectivity 2.36310% Index name trn_endtime Clustered index yes Depth 3 Estimated leaf pages 11773 Sequential Transitions 28597 Random Transitions 6329390 Key Values 368170
“Key values” (line 11) are the number of unique keys in this index – there are 368K unique values in this trn_endtime index (we do not know from this whether or not this index is composite – that is, over multiple columns).
“Sequential transitions” (line 9) are the number of key-value pairs where adjacent keys in the index point to rows that lie on adjacent table pages. The server does not increment any statistic where adjacent index entries point to rows on the same page. Residence on the same page, or adjacent pages, of rows for adjacent index entries is a primary characteristic of clustering. Hence for a truly clustered index, the “Sequential transitions” value should be very small – this statistic would be incremented only for index entries at the end of each leaf page.
The value of “Random transitions” (line 10) denotes index entries where the rows pointed at by adjacent index entries are separated by more than one table page. In a truly clustered index, this value should be zero.
Precisely how “random” the values are distributed across the rows is given by a third statistic, “Random Distance” (not currently displayed in a graphical plan). This is a cumulative sum of the number of table pages between rows pointed at by adjacent index entries. Its maximum value is the number of rows in the table, multiplied by the number of table pages. All three statistics are maintained by the server and are available in the catalog in the SYSPHYSIDX system view: the column names are seq_transitions, rand_transitions, and rand_distance.
Now to the problem at hand. The index on
trn_endtime did not get used by the optimizer because the number of random transitions (6329390) is extremely large, denoting a very badly clustered index. Due to the lack of clustering, the optimizer estimated that scanning the TJ_TransactionJournal table sequentially would be cheaper than accessing the pages randomly, since the index’s clustering was so poor. There is no evidence from this single graphical plan as to why this index is not clustered, but the easiest way is to simply annotate the index as “clustered”, using the
ALTER INDEX statement, without following up with reorganizing the physical layout of the rows in the table. One way to accomplish the necessary reorganization is to issue a
REORGANIZE TABLE statement.
At present, the transition and distance values for indexes are documented in the documentation for the SYSPHYSIDX catalog table as “system use only”. I’ll be working with our documentation team to correct this oversight in our next maintenance release. In addition, we’ll correct the omission of the “Random Distance” metric from the graphical plan’s Advanced Details tab.