simon_ogden

The premise behind this document is to attempt to explain the calculations in ASE 15.x/16.x behind the estimated row count used in the optimizer costing evaluation (the same row count observed in 'set statistics plancost on' output). This is an area that has never really had a reliable explanation.

I've been trying to work out a definitive explanation for years; most of the time my model matched the observed figures, but there were certain use cases where it fell over. I only recently managed to plug those gaps, so I thought I'd share my findings.


With this knowledge and (some) understanding of these calculations, it becomes much easier to understand:


  • Which statistics are used.
  • Which statistics might be critical.
  • How data skew could drastically affect plan selection.
  • What can be done to massage statistics to help your queries.
  • Why it is necessary (and unavoidable) to force specific queries.
  • How fundamentally different the calculations are to previous versions of ASE.


This document covers only a simple single-table query with multiple search arguments. I will expand it to joins and queries with additional clauses/constructs in time.

It's also not for the faint-hearted!


We start with a base table, GENERIC_TABLE, which is a renamed copy of a real business data reference table.


The extracts in this document are taken from outputs generated with optdiag and the following session settings:


set option show long                  -- full optimizer diagnostics, including the LogProps sections
set option show_missing_stats on      -- report 'NO STATS' messages for columns/densities the optimizer looked for
set statistics time, io, plancost on  -- timings, I/O counts and the plancost operator tree (estimated vs actual)
set showplan on
set show_sqltext on
set switch on 3604                    -- send trace output to the client session
set statement_cache off               -- ensure the query is optimized on every run


I haven't included the full outputs as there's enough here to keep you entertained for a while! Apologies if the formatting isn't quite right (and for the length; there's no really simple way of summarising this). If the formulae are a little small, sorry, that was as big as the editor would let me insert them.

Query 1

select * from GENERIC_TABLE
where sarg1 = 10000010
  and sarg2 = 'USD'
  and sarg3 = 'KS'

We want to see how the various available statistics affect estimates and highlight inconsistencies. There are also diagnostic limitations when a clustered index is in place, so I have removed it from the schema (I make mention of this at the relevant point).

The table initially has 4 non-clustered indices (note that the clustered index has been dropped for demonstration purposes, but the data is still in CI order, clustered on keyd):

GENERIC_TABLE_001 (keya, sarg3)
GENERIC_TABLE_002 (keyb, sarg1, sarg2, keyc)
GENERIC_TABLE_003 (keyd, keyb, sarg1, sarg2)
GENERIC_TABLE_004 (keye)

The remainder of this table's schema is not relevant for the purpose of this document. The query plan of this query is also not relevant; the idea is to highlight the calculations along the way.

Tests

Test  1

Regular update statistics creating histograms on lead columns only and column group stats.

Test  2

Update index statistics creating histograms on all indexed columns and column group stats.

---

Now we want to start manipulating what stats actually get used; the intention here is to show that the optimizer can and will use statistics that will never be reported as 'missing'.

The index prefix subset  {sarg1, sarg2}  is available via GENERIC_TABLE_002  or GENERIC_TABLE_003.

Test 3

Index statistics but then deletion of (keyd, keyb, sarg1, sarg2)  group stats

Test  4

Index statistics but then deletion of  (keyb, sarg1, sarg2) group stats

Test  5

Index statistics but then creation of group densities for  (sarg1, sarg2, sarg3)

Test 6 (see test 4 for details)

Index statistics but then deletion of (keyb, sarg1, sarg2) and (keyb, sarg1, sarg2, keyc)  group stats.
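For reference, the statistics manipulation for these tests was done with commands along these lines (an illustrative sketch rather than the exact scripts used):

-- Tests 1 and 2: baseline statistics
update statistics GENERIC_TABLE        -- histograms on lead index columns only, plus column group densities
update index statistics GENERIC_TABLE  -- histograms on all indexed columns, plus column group densities

-- Tests 3, 4 and 6: remove specific column group density statistics
delete statistics GENERIC_TABLE (keyd, keyb, sarg1, sarg2)
delete statistics GENERIC_TABLE (keyb, sarg1, sarg2)
delete statistics GENERIC_TABLE (keyb, sarg1, sarg2, keyc)

-- Test 5: create group densities covering the predicate columns
update statistics GENERIC_TABLE (sarg1, sarg2, sarg3)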

Test 1 Query 1

Regular update statistics creating histograms on lead columns only and column group statistics.



Observations and explanations

Plancost

/

IndexScan

GENERIC_TABLE_003

(VA = 0)

r:0 er:665

l:12216 el:18032

p:0 ep:2283

  • During initial statistics creation we have missing column stats, as expected; none of our search arguments is the lead key in an index.

NO STATS on column GENERIC_DB..GENERIC_TABLE.sarg2

NO STATS on column GENERIC_DB..GENERIC_TABLE.sarg1

NO STATS on column GENERIC_DB..GENERIC_TABLE.sarg3

  • Almost immediately you will see the LogProps section; this is only displayed under 'set option show long' or 'set option show_logprops on'.

NO STATS on density set for GENERIC_DB..GENERIC_TABLE={sarg2, sarg1, sarg3}

TreeLogProps({0}) - Proj: {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68} JoinsToDo: {} UsfOrd: {}

                UsfPartLst ->

                UsfPartNodeLst ->

Statistics:

Estimated rows =664.605, Average row size =377.0283, Page Count =125.1061

Largest Partition Page Count =125.1061, Partition Count =1, Concurrency Threshold =-1

RID-oriented Logical Properties

                TreeLogProps({0}) - Proj: {0} JoinsToDo: {0} UsfOrd: {0}

                                UsfPartLst ->

                                UsfPartNodeLst ->                

               

                

  • Notice this is where it looks for the column group statistics that cover the 3 columns in the predicate. It is in this LogProps section that all the row count and selectivity calculations take place. We're going to concentrate on row counts here; LIO and PIO are costed separately using data page cluster ratios, index page cluster ratios, group densities etc.
  • What this section is saying (and the equivalent section will appear in the initial I/O costing as well as in the search engine) is: given my predicate search values, how many rows do I expect back from this scan? This will take into account the number of rows from the outer table in a join, either via a histogram merge or by a simple product of the rows from the outer table multiplied by the filter selectivity of the inner table.

  • In this instance the row count is straightforward as we have no statistics available for the 3 filter predicate columns.

The table (Allpages) has 664605 rows, 137615 pages,

Data Page Cluster Ratio 1.0000000

    sarg1 = 10000010

    Estimated selectivity for sarg1,

        selectivity = 0.1,

    sarg2 = 'USD'

    Estimated selectivity for sarg2,

        selectivity = 0.1,

    sarg3 = 'KS'

    Estimated selectivity for sarg3,

        selectivity = 0.1,

    Search argument selectivity is 0.001.

    using table prefetch (size 16K I/O)

    Large IO selected: The number of leaf pages qualified is > MIN_PREFETCH pages

    in data cache 'generic_data_cache' (cacheid 2) with LRU replacement

  • Note that this section ONLY appears when there is no clustered index; this can lead to circumstances whereby the final selectivity of a table that was used to estimate a row count appears nowhere in the diagnostics (but it can be calculated from what is there).
  • You will notice an LIO costing for all indices of the table (that have a key referenced in the query), despite none of them having a lead column that is in the predicate. This is because versions from 15.x onwards by default cost all potential indices, even if it means a full scan. The reason for this is that if there is low selectivity on any of the minor attributes, this could well reduce the number of actual data pages that are scanned, i.e. a full index scan with good filtering on minor attributes may well take less LIO than a table scan. Full index scans being costed is a fundamental behaviour change over 12.5.4. In addition, for DOL tables there is an additional feature called intelligent index scan adjustment, which will adjust the costings to reflect an alternate method of using the index, essentially multiple filtering lookups using the distinct values for the lead index key.
  • For all non-matching scans you will always see a scan selectivity of 1, as it knows it has to scan every leaf page; the filter selectivity will be <= 1 if there are search arguments on that index's minor attributes.

Estimating selectivity of index 'GENERIC_DB..GENERIC_TABLE.GENERIC_TABLE_002', indid 3

    sarg1 = 10000010

    sarg2 = 'USD'

    Estimated selectivity for sarg1,

        selectivity = 0.1,

    Estimated selectivity for sarg2,

        selectivity = 0.1,NO STATS on density set for GENERIC_DB..GENERIC_TABLE={sarg2, sarg1}

    scan selectivity 1, filter selectivity 0.01

    664605 rows, 17423 pages

    Data Row Cluster Ratio 0.1834779

    Index Page Cluster Ratio 0.9956707

    Data Page Cluster Ratio 0.09864684

    using index prefetch (size 16K I/O)

    Large IO selected: The number of leaf pages qualified is > MIN_PREFETCH pages

    in index cache 'generic_data_cache' (cacheid 2) with LRU replacement

  • Note you can see a NO STATS message here on a column grouping; more on this later.
  • The final row count estimate here is simply 0.1 * 0.1 * 0.1 * 664605 ~ 665 rows.
  • This is the same conclusion 12.5.4 would come to with no statistics.
  • The key difference in this scenario is that 15.x will pick a full index scan of GENERIC_TABLE_003. This is in fact considerably cheaper than a table scan and will cost (per scan) the number of leaf pages, plus the index height, plus the data page LIO cost (which is the estimated number of data pages that will need to be read for a scan using that index, based on the search arguments).

( PopRidJoin ( PopIndScan GENERIC_TABLE_003 GENERIC_DB..GENERIC_TABLE ) ) cost:159727.8 T(L18031.05,P2282.89,C665934.2) O(L6646.05,P830.7562,C1329.21) tempdb:0 order: none

Estimating selectivity of index 'GENERIC_DB..GENERIC_TABLE.GENERIC_TABLE_003', indid 4

  sarg1 = 10000010

  sarg2 = 'USD'

  Estimated selectivity for sarg1,

                selectivity = 0.1,

  Estimated selectivity for sarg2,

                 selectivity = 0.1,

         scan selectivity 1, filter selectivity 0.01

  664605 rows, 11382 pages

….

  Data Page LIO for 'GENERIC_TABLE_003' on table 'GENERIC_DB..GENERIC_TABLE' = 6646.05

/

IndexScan

GENERIC_TABLE_003

(VA = 0)

r:0 er:665

l:12216 el:18032

p:0 ep:2283

11382 (leaf pages) + 3 (index height) + 6646.05 (data page LIO) = 18031.05

  • Note that the final estimated row count is nowhere near the actual, which is zero.
  • There are many reasons why row count estimates don't match actuals; it could be:

A.      Out of date statistics.

B.      Missing statistics.

C.      No possible statistic in the current optimizer costing framework could give an accurate estimate. Skewed data falls into this category. If searching across multiple attributes, relative skew is almost a certainty.

In this instance, it is C). No matter what statistics are created (and some will help a little), the optimizer cannot know that there are no rows that have sarg1=10000010, sarg2='USD' AND sarg3='KS'. It can't even begin to have a clue, mainly because sarg3 is not in a composite index with any of the other arguments. It has zero knowledge of the (in)dependence of these search arguments.

The only way of accurately costing this query would be to have knowledge of distributions across multiple attributes. This would require an alternate way of generating and using statistics and may happen at some point in the future of ASE. It may require a different approach to histogramming/data representation.

  • When estimates are comparable to actuals, but the query is far more expensive than an alternative plan, this can be an indication of a bug rather than a costing limitation, i.e. it has picked a plan that it KNOWS is expensive over a cheaper one.

  • This raises the question of why it bothers doing the complex costing you'll see in the later examples, given it has no real clue about the relationship between the search arguments.

Test 2 Query 1

Update index statistics creating histograms on all indexed columns and column group stats.

This time we have index statistics in place; things get interesting.

Plancost

/

IndexScan

GENERIC_TABLE_003

(VA = 0)

r:0 er:305

l:12216 el:25689

p:0 ep:3241

Observations and explanations

  • Now the estimate is less than half of what it was before, so we're actually a little closer to the actual of zero. The reason for this is that the optimizer has some additional statistics available to it.

Statistics:

Estimated rows =304.3119, Average row size =377.0283, Page Count =57.72479

Largest Partition Page Count =57.72479, Partition Count =1, Concurrency Threshold =-1

RID-oriented Logical Properties

        TreeLogProps({0}) - Proj: {0} JoinsToDo: {0} UsfOrd: {0}

                        UsfPartLst ->

                        UsfPartNodeLst ->

       

        Statistics:

        Estimated rows =304.3119, Average row size =377.0283, Page Count =57.72479

        Largest Partition Page Count =57.72479, Partition Count =1, Concurrency Threshold =-1

  • LogProps now has 304 rows, but how has it come up with that figure? Let's look for the section with 'search argument selectivity' again.

Search argument selectivity is 0.0004578839.

0.0004578839 *  664605 =  304.3119 rows

  • So the logic stands from the previous test; the question is how has it come up with this figure?

  • If we do a diff between the Test 1 and Test 2 outputs, the only NO STATS messages which have disappeared are the single-column ones.

  • The missing density set statistics are still reported:

NO STATS on density set for GENERIC_DB..GENERIC_TABLE={sarg2, sarg1, sarg3}

NO STATS on density set for GENERIC_DB..GENERIC_TABLE={sarg2, sarg1}

This is correct but a little misleading; the implication of these messages is that the column statistics are the only stats it has used in addition to the previous run, but this is not the case.

  • If we perform a product of the selectivities we would get the following:

The table (Allpages) has 664605 rows, 137615 pages,

Data Page Cluster Ratio 1.0000000

    sarg1 = 10000010

    Estimated selectivity for sarg1,

        selectivity = 0.02296552,

    sarg2 = 'USD'

    Estimated selectivity for sarg2,

        selectivity = 0.3687574,

    sarg3 = 'KS'

    Estimated selectivity for sarg3,

        selectivity = 0.0212758,

0.02296552 * 0.3687574 * 0.0212758 =  0.0001801784833034970784, but this is nowhere near our 0.0004578839 figure.

  • If we had no column group density statistics, it would generate the selectivity estimation this way (as per 12.5.x).

  • We now come to another fundamental difference of the 15.x optimizer: complex multi-attribute costing.

The premise behind this feature is that the column group densities (in combination with histograms for all columns concerned) can go some way to try to estimate the (in)dependence of search arguments against multiple keys of an index (or unrelated attributes if the stats are created).

To put it simply, if column C1 has an overall density of 0.5 and column C2 has an overall density of 0.5, then if these two columns were entirely independent the column group density would be 0.25. If they were entirely dependent, the column group density would be 0.5. A good example of this can be found in the patent documentation here.

If you searched for State=California and City =Los Angeles then a product of the selectivities would be an incorrect estimate.

If you searched for State=California and Gender='F' a product would be the correct selectivity.

In reality the answer is somewhere in between and the column group densities can help in working out on average what is the dependence of two search arguments.

The formula it uses to do this is iterative and is applied against each minor attribute search argument in turn:

  • Now it gets a little convoluted and I want to try to not get overly mathematical.

On each iteration, it assesses which density is the dominant density, i.e. which is the most selective. Once this is determined, it massages the submissive density so that weight is added to the selectivity being multiplied in, depending on how (in)dependent the attributes are.

The dominant selectivity is the respective selectivity for the dominant density.

  •         Newsel = selectivity to be passed onto the next iteration.
  •         Dsel = dominant selectivity (which is either the selectivity of the new search argument or the selectivity passed in from the previous iteration)
  •         Ssel = submissive selectivity (which is either the selectivity of the new search argument or the selectivity passed in from the previous iteration)
  •         Currden = current column group density inclusive of the attribute being evaluated (this will be the higher of range cell density or total density)
  •         Dden = dominant density (which will either be the previous column group density or the current attribute density)
  •        Sden = submissive density (which will either be the previous column group density or the current attribute density)

        You should be able to see that if currden = Dden, then this represents total dependence, so the formula simplifies to:

          newsel = Dsel * (Ssel + ((1 - Ssel) * 1))

          newsel = Dsel * 1, i.e. no additional selectivity.

If currden = Sden * Dden, then this represents total independence, so the formula simplifies to:

          newsel = Dsel * (Ssel + ((1 - Ssel) * 0))

          newsel = Dsel * Ssel, i.e. a product.

The scale in between the two is linear. It's not fundamentally clear why these densities were considered to be the ones that should play a part in this scaling, but the general premise is the selectivities are massaged based on fixed consistent criteria relating to the relative attribute densities.
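Putting that into a single expression (this is my own reconstruction rather than anything documented, but it is consistent with the two simplified cases above and with the worked figures later on):

newsel = Dsel * (Ssel + ((1 - Ssel) * ((currden - (Sden * Dden)) / (Dden - (Sden * Dden)))))

The scaling term collapses to 1 when currden = Dden (total dependence) and to 0 when currden = Sden * Dden (total independence), matching the two cases above.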

Anyway let's pull some figures out and see what this actually means. Sarg1 and sarg2 are in the same index so this costing can be done based on column group density statistics despite the fact that there are no statistics available for (sarg1 , sarg2) as a standalone group.

There are statistics for (keyb, sarg1) and (keyb, sarg1, sarg2), so there are statistics that hold some representation of the addition of sarg2 to sarg1.

sarg1 selectivity for 10000010 = 0.02296552

sarg1 density = 0.0260784927385415

sarg2 selectivity for 'USD' = 0.36875740

sarg2 density = 0.2116585948776906

(keyb, sarg1) range cell density = 0.0000024221914100

(keyb, sarg1)  total density = 0.0000015587470359

(keyb, sarg1, sarg2) range cell density 0.0000022319660415

(keyb, sarg1, sarg2) total density 0.0000015382730277

sarg3 selectivity = 0.0212758

The figures in bold play a part in the selectivity calculations in this instance; the latter selectivity for sarg3 can only be included as a product, as there are no usable column group densities.

Again, this makes you wonder why it needs to bother if all it can do with sarg3 is multiply it.

So, the first selectivity is 0.02296552 for sarg1; we need to see which is the dominating (most selective) density for the first iteration of the formula.

(keyb, sarg1)  density = 0.0000024221914100

(keyb, sarg1, sarg2)  density = 0.0000022319660415

So (keyb, sarg1, sarg2) density is Dden

sarg2 selectivity for 'USD' = 0.36875740

sarg1 selectivity for 10000010 = 0.02296552

So sarg1 selectivity is Dsel

Note here that this formula only fits if we actually DON'T use the respective selectivity for the dominant density.  So it appears that at least for this use case there is some sort of heuristic added.

The first iteration of this formula is in fact using the column group density starting with the index prefix keyb. As we have no search argument for this column the first iteration is not required as selectivity is 1. If we had a 6 key composite index with 6 search arguments this formula would end up being called 5 times.
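Plugging the figures into the reconstructed formula, taking the (keyb, sarg1) range cell density as the previous density, the sarg2 column density as the new attribute's density, and the (keyb, sarg1, sarg2) range cell density as currden (an assumption on my part, but one that reproduces the observed figure to rounding):

newsel = 0.02296552 * (0.3687574 + ((1 - 0.3687574) * ((0.0000022319660415 - (0.2116585948776906 * 0.0000024221914100)) / (0.0000024221914100 - (0.2116585948776906 * 0.0000024221914100)))))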

Newsel =  0.02152135032694444868032387965327

So what we see here is that as currden is almost equal to the previous group density, the addition of sarg2 introduces almost no additional selectivity.

Now, let's verify the figures by going back to the outputs.

Estimating selectivity of index 'GENERIC_DB..GENERIC_TABLE.GENERIC_TABLE_002', indid 3

   sarg1 = 10000010

   sarg2 = 'USD'

   Estimated selectivity for sarg1,

        selectivity = 0.02296552,

    Estimated selectivity for sarg2,

        selectivity = 0.3687574,NO STATS on density set for    GENERIC_DB..GENERIC_TABLE={sarg2, sarg1}

    scan selectivity 1, filter selectivity 0.02152135

    664605 rows, 17423 pages

    Data Row Cluster Ratio 0.1834779

    Index Page Cluster Ratio 0.9956707

    Data Page Cluster Ratio 0.09864684

    using index prefetch (size 16K I/O)

    Large IO selected: The number of leaf pages qualified is > MIN_PREFETCH pages

    in index cache 'generic_data_cache' (cacheid 2) with LRU replacement

    using table prefetch (size 16K I/O)

    Large IO selected: The number of leaf pages qualified is > MIN_PREFETCH pages

    in data cache 'generic_data_cache' (cacheid 2) with LRU replacement

    Data Page LIO for 'GENERIC_TABLE_002' on table 'GENERIC_DB..GENERIC_TABLE' = 13448.03

Estimating selectivity of index 'GENERIC_DB..GENERIC_TABLE.GENERIC_TABLE_003', indid 4

    sarg1 = 10000010

    sarg2 = 'USD'

    Estimated selectivity for sarg1,

        selectivity = 0.02296552,

    Estimated selectivity for sarg2,

        selectivity = 0.3687574,

    scan selectivity 1, filter selectivity 0.02152135

    664605 rows, 11382 pages

    Data Row Cluster Ratio 1

    Index Page Cluster Ratio 0.9970881

    Data Page Cluster Ratio 1

    using index prefetch (size 16K I/O)

    Large IO selected: The number of leaf pages qualified is > MIN_PREFETCH pages

    in index cache 'generic_data_cache' (cacheid 2) with LRU replacement

    using table prefetch (size 16K I/O)

    Large IO selected: The number of leaf pages qualified is > MIN_PREFETCH pages

    in data cache 'generic_data_cache' (cacheid 2) with LRU replacement

    Data Page LIO for 'GENERIC_TABLE_003' on table 'GENERIC_DB..GENERIC_TABLE' = 14303.2

  • There is nothing at all in any optimizer trace to indicate how this figure was calculated or which statistics were used.
  • We now need to see how that final 'search argument selectivity' was calculated. It is in fact the filter selectivity from the complex costing of the first two search arguments, multiplied by the sarg3 selectivity.

The table (Allpages) has 664605 rows, 137615 pages,

Data Page Cluster Ratio 1.0000000

    sarg1 = 10000010

    Estimated selectivity for sarg1,

        selectivity = 0.02296552,

    sarg2 = 'USD'

    Estimated selectivity for sarg2,

        selectivity = 0.3687574,

    sarg3 = 'KS'

    Estimated selectivity for sarg3,

       selectivity = 0.0212758,

    Search argument selectivity is 0.0004578839

     0.02152135 *  0.0212758 = 0.00045788393833

     0.00045788393833 *  664605 =  304.3119 rows

  • The statistics used here are used to calculate the filter selectivity of both GENERIC_TABLE_003 and GENERIC_TABLE_002 indices.

  • This seems like overkill to calculate selectivity on a combination of search arguments whose dependence could never be known based on the statistical model.

Test 3 Query 1

  • On this test we remove statistics on (keyd, keyb, sarg1, sarg2)
  • Intuitively we know (or think) that this statistic is not used so we don't expect any difference over test 2.

  • Plancost

/

IndexScan

GENERIC_TABLE_003

(VA = 0)

r:0 er:305

l:12216 el:14398

p:0 ep:1829

  • OK, so the row count is reassuring, but the LIO costing is significantly different. The reason for this is that the column group density statistics we have just deleted are those for the full GENERIC_TABLE_003 index. As the overall density is now no longer available, it cannot come up with as accurate an estimate for the LIO costing (the data page LIO costing needs revisiting, but not here).

Test 4 Query 1

  • On this test we remove the alternate set of statistics that we believe it would use if the previous column group density stats were not available.

          delete statistics GENERIC_TABLE (keyb, sarg1, sarg2), so now we believe the only column group density statistics it could use are (keyd, keyb, sarg1, sarg2).

  • Plancost

        /

IndexScan

GENERIC_TABLE_003

(VA = 0)

r:0 er:227

l:12216 el:22025

p:0 ep:2783

Statistics:

Estimated rows =226.3541, Average row size =377.0283, Page Count =43.14529

  • So now the row count estimate is 227.
  • If we look at the (keyd, keyb, sarg1, sarg2) density statistics in comparison to (keyd, keyb, sarg1) we see:

Statistics for column group:            "keyd", "keyb", "sarg1"

Last update of column statistics:       Jul  3 2014 11:00:40:190AM

Range cell density:                0.0000015046531398

Total density:                     0.0000015046531398

       Statistics for column group:            "keyd", "keyb", "sarg1", "sarg2"

       Last update of column statistics:       Jul  3 2014 11:00:40:190AM

             Range cell density:                0.0000015046531398

             Total density:                     0.0000015046531398

  • So, we can see here that with respect to this index the addition of sarg2 introduces no more selectivity into the equation so we should expect that the final search argument selectivity is that of sarg1 * sarg3.

The table (Allpages) has 664605 rows, 137615 pages,

     Data Page Cluster Ratio 1.0000000

               sarg1 = 10000010

        Estimated selectivity for sarg1,

                       selectivity = 0.02296552,

               sarg2 = 'USD'

       Estimated selectivity for sarg2,

                       selectivity = 0.3687574,

               sarg3 = 'KS'

       Estimated selectivity for sarg3,

                       selectivity = 0.0212758,

       Search argument selectivity is 0.0003405845.

       using table prefetch (size 16K I/O)

        Large IO selected: The number of leaf pages qualified is > MIN_PREFETCH pages

        in data cache 'generic_data_cache' (cacheid 2) with LRU replacement

  • The filter selectivity of the non-matching index scans should be the selectivity of sarg1.

  Estimating selectivity of index 'GENERIC_DB..GENERIC_TABLE.GENERIC_TABLE_003', indid 4

        sarg1 = 10000010

        sarg2 = 'USD'

  Estimated selectivity for sarg1,

       selectivity = 0.02296552,

  Estimated selectivity for sarg2,

       selectivity = 0.3687574,

   scan selectivity 1, filter selectivity 0.01600807

  • However, we can clearly see that the filter selectivity has been made slightly better by the addition of sarg2. Why this is the case is not totally clear as we have no other statistic that holds any information about the density of this column combination.

We do have this statistic:-

Statistics for column group:            "keyb", "sarg1", "sarg2", "keyc"

Last update of column statistics:       Jul  3 2014 11:00:40:190AM

     Range cell density:                0.0000015055760839

     Total density:                     0.0000015057595070

  • But surely it can't use this, as we have no statistics available for ("keyb", "sarg1", "sarg2"); we deleted them?
  • In reality it DOES use this statistic, and it uses it to calculate the final filter selectivity, passing in the density of ("keyb", "sarg1") as the previous density.
  • I believe this is a bug, as this statistic does not hold anything with respect to the dependency of sarg1 and sarg2; it holds information about the dependency between (sarg1, sarg2) and keyc, so it cannot possibly know whether any extra selectivity has come from the addition of sarg2 to sarg1. I'm reasonably confident there could well be some bugs lurking when there are multiple column group densities available containing our required combination. Either that, or there are a lot of conditions which determine which one of the many densities to use.

  • To prove it uses it we can do two things: first, verify the formula holds true:
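Using the reconstructed formula with the same assignment as before, but with the (keyb, sarg1, sarg2, keyc) total density as currden (again my own reading, but it reproduces the figure below to rounding):

newsel = 0.02296552 * (0.3687574 + ((1 - 0.3687574) * ((0.0000015057595070 - (0.2116585948776906 * 0.0000024221914100)) / (0.0000024221914100 - (0.2116585948776906 * 0.0000024221914100)))))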

           newsel = 0.01600807

          So this ties in with the diagnostic output.

          0.01600807 * 0.0212758 (sarg3) * 664605 = 226.35415876868613

  • Next we could delete this statistic entirely, and then surely it MUST use the other statistic; true enough, it does. You can clearly see the addition of sarg2 introduces no more selectivity. This is Test 6, Query 1.

Statistics:

Estimated rows =324.7325, Average row size =377.0283, Page Count =61.54381

/

IndexScan

GENERIC_TABLE_003

(VA = 0)

r:0 er:325

l:12216 el:26648

p:0 ep:3361

Estimating selectivity of index 'GENERIC_DB..GENERIC_TABLE.GENERIC_TABLE_002', indid 4

   sarg1 = 10000010

   sarg2 = 'USD'

Estimated selectivity for sarg1,

       selectivity = 0.02296552,

Estimated selectivity for sarg2,

       selectivity = 0.3687574,

   scan selectivity 1, filter selectivity 0.02296552

Estimated selectivity for sarg3,

       selectivity = 0.0212758,

Search argument selectivity is 0.0004886097.

  • True enough, it does; you can see no extra selectivity.

Test 5 query 1

  • This time we create the explicit column group statistics required for the optimizer to attempt to come up with the best possible estimate. Now we have density stats on all the predicate sets, and finally we lose our missing stats messages.

Statistics for column group:            "sarg1", "sarg2"

Last update of column statistics:       Jul  3 2014 11:04:43:186AM

           Range cell density:                0.0005493546926016

           Total density:                     0.0227888252879490

           Range selectivity:                 default used (0.33)

           In between selectivity:            default used (0.25)

           Unique range values:               0.0001288883832435

           Unique total values:               0.0016949152542373

     Average column width:              default used (2.00)

     Statistics version:                4

Statistics for column group:            "sarg1", "sarg2", "sarg3"

Last update of column statistics:       Jul  3 2014 11:04:43:186AM

           Range cell density:                0.0005447374914165

           Total density:                     0.0223011172588974

           Range selectivity:                 default used (0.33)

           In between selectivity:            default used (0.25)

           Unique range values:               0.0001271583378309

           Unique total values:               0.0012642225031606

           Average column width:              default used

           Statistics version:                4

  • So now, what has become fairly clear is that on average sarg3 introduces almost no more selectivity into the equation. It also makes you question whether the previous use of the column group statistics with the keyb prefix is a good idea or not. In this scenario it actually makes little difference, as on average sarg1 and sarg2 look fairly dependent.

  • However, picture this scenario:

          TableA

C1  C2  C3
1   A   1
2   A   2
1   A   2
1   A   2
2   B   2

          

C1 density is ((3*0.6)+(2*0.4))/5= 0.52 (easiest way to calculate this is 0.6² + 0.4²)

C2 density is ((4*0.8)+(1*0.2))/5 = 0.68

C3 density is also 0.68

C1,C2 density is ((3*0.6)+(1*0.2)+(1*0.2))/5 = 0.44

C1,C2,C3 density is ((1*0.2)+(1*0.2)+(2*0.4)+(1*0.2))/5 = 0.28

C2,C3 density is ((1*0.2)+(3*0.6)+(1*0.2))/5=0.44

Now imagine you have a query searching on C2 and C3. If the only statistics you had were on C2, C3 and (C1,C2,C3), then you would have 0.68, 0.68 and 0.28 for the group density, so it would believe that the introduction of C3 introduces a lot more selectivity. With the actual densities for the columns and the defined (C2,C3) group, you would have 0.68, 0.68 and 0.44, which differs substantially: half of the additional selectivity comes via the relationship between C1 and C3 (and we have no C1 in our query). That said, whilst this method is less precise it must give an inkling with regards to the relationship, so it should help estimation. It also shows that the (C2,C3) density can be more selective than the product of the C2 and C3 densities. These boundary limits are (hopefully) taken care of in the calculations.
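As an aside, a column group's total density can be sanity-checked directly against the data with a query along these lines (an illustrative sketch using the little TableA above; it just computes the sum of squared value frequencies over the squared row count):

select sum(convert(numeric(30,15), cnt) * cnt)
       / (convert(numeric(30,15), sum(cnt)) * sum(cnt)) as total_density
from (select count(*) as cnt
        from TableA
       group by C2, C3) grp

For the five-row example this returns 0.44, matching the (C2,C3) figure above.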

The use of the column density statistics where our required columns are in the middle of the column group is something else and I can't see how that could benefit the estimations.

  • Let's go back to the optimizer traces:

The table (Allpages) has 664605 rows, 137615 pages,

Data Page Cluster Ratio 1.0000000

    sarg1 = 10000010

    Estimated selectivity for sarg1,

        selectivity = 0.02296552,

    sarg2 = 'USD'

    Estimated selectivity for sarg2,

        selectivity = 0.3687574,

    sarg3 = 'KS'

    Estimated selectivity for sarg3,

        selectivity = 0.0212758,

    Search argument selectivity is 0.02064584.

    using table prefetch (size 16K I/O)

    Large IO selected: The number of leaf pages qualified is > MIN_PREFETCH pages

    in data cache 'generic_data_cache' (cacheid 2) with LRU replacement

  • And the plancost output.

/

IndexScan

GENERIC_TABLE_003

(VA = 0)

r:0 er:13722

l:12216 el:25107

     p:0 ep:3168

  • So now, due to the statistics demonstrating dependency across these 3 attributes, the estimate is actually far worse than when we didn't have such accurate statistics!

Estimating selectivity for table 'GENERIC_DB..GENERIC_TABLE'

  Table scan cost is 664605 rows, 137615 pages,

The table (Allpages) has 664605 rows, 137615 pages,

Data Page Cluster Ratio 1.0000000

    sarg1 = 10000010

    Estimated selectivity for sarg1,

        selectivity = 0.02296552,

    sarg2 = 'USD'

    Estimated selectivity for sarg2,

        selectivity = 0.3687574,

    sarg3 = 'KS'

    Estimated selectivity for sarg3,

        selectivity = 0.0212758,

    Search argument selectivity is 0.0004392567.

    using table prefetch (size 16K I/O)

    Large IO selected: The number of leaf pages qualified is > MIN_PREFETCH pages

    in data cache 'generic_data_cache' (cacheid 2) with LRU replacement

     Statistics:

     Estimated rows =291.9322, Average row size =377.0283, Page Count =55.40957

     Largest Partition Page Count =55.40957, Partition Count =1, Concurrency Threshold =-1

  • Now we have to iterate through our formula twice (sorry!) to come up with the final figure, as we have all the statistics required to do a full complex multi-attribute costing. This is the idea behind the recommendation, drilled into the customer base, to run 'update index statistics'. If we had a composite index on all three search arguments then we would have these statistics available (assuming index statistics had been run).

          sarg1 selectivity for 10000010 = 0.02296552

          sarg1 density = 0.0260784927385415

          sarg2 selectivity for 'USD' = 0.36875740

          sarg2 density = 0.2116585948776906

          sarg3 selectivity for 'KS' = 0.02127580

          sarg3 density = 0.0225914891446323

          (sarg1, sarg2) range cell density = 0.0005493546926016

          (sarg1, sarg2) total density = 0.0227888252879490

          (sarg1, sarg2, sarg3) range cell density = 0.0005447374914165

          (sarg1, sarg2, sarg3) total density = 0.0223011172588974

     

          So, initial selectivity (for sarg1) is

               0.02296552

      

          Next we have to take sarg2 into the equation.

     

           So:
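Using the reconstructed formula, with the sarg1 column density as the dominant density, the sarg2 column density as the submissive density, and the (sarg1, sarg2) total density as currden:

newsel = 0.02296552 * (0.3687574 + ((1 - 0.3687574) * ((0.0227888252879490 - (0.2116585948776906 * 0.0260784927385415)) / (0.0260784927385415 - (0.2116585948776906 * 0.0260784927385415)))))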

              

      

          newsel = 0.02064584

      

          Now  we factor in the final search argument on sarg3.
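Here the sarg3 column density (0.0225914891446323) is lower than the (sarg1, sarg2) group density (0.0227888252879490), so it becomes Dden and the sarg3 selectivity becomes Dsel; the (sarg1, sarg2) density and the selectivity carried forward from the previous iteration become Sden and Ssel, and the (sarg1, sarg2, sarg3) total density is currden (again, my reconstruction):

Final sel = 0.0212758 * (0.02064584 + ((1 - 0.02064584) * ((0.0223011172588974 - (0.0227888252879490 * 0.0225914891446323)) / (0.0225914891446323 - (0.0227888252879490 * 0.0225914891446323)))))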

            

     

          Final sel = 0.0210017

Oh dear, it appears to now be less selective. This is one of the conditions that is accounted for: the selectivity from the first iteration is passed forward, and this figure is ignored.

The reason for this is that the dominant density does not in fact have the dominant selectivity; we're looking at such tiny variations here. Using these values contradicts the behaviour in Test 2. There is no combination of values that can be plugged into the formula that gives the final selectivity we see, so either this contradiction is by design or unintended; which of the two I don't know.

The dominant density is 0.0225914891446323 for sarg3, as the previous multi-attribute density for (sarg1, sarg2) was 0.0227888252879490. However, the previous selectivity was 0.02064584 and the current selectivity for sarg3 is 0.02127580, so the previous selectivity was in fact the dominant one. We can test what would happen if the dominant density were the previous density by stats modification; let's adjust the total density for sarg3 to be a fraction higher than 0.0227888252879490.

sp_modifystats GENERIC_TABLE, sarg3, MODIFY_DENSITY, total, absolute, "0.0228"

/

IndexScan

GENERIC_TABLE_003

(VA = 0)

r:0 er:13428

l:12216 el:14276

p:0 ep:1814

  • Notice the row count has now reduced ever so slightly.

Estimating selectivity of index 'GENERIC_DB..GENERIC_TABLE.GENERIC_TABLE_002', indid 4

    sarg1 = 10000010

    sarg2 = 'USD'

    Estimated selectivity for sarg1,

        selectivity = 0.02296552,

    Estimated selectivity for sarg2,

        selectivity = 0.3687574,

    scan selectivity 1, filter selectivity 0.02064584

    664605 rows, 17423 pages

    Data Row Cluster Ratio 0.1834779

    Index Page Cluster Ratio 0.9956707

    Data Page Cluster Ratio 0.09864684

    using index prefetch (size 16K I/O)

    Large IO selected: The number of leaf pages qualified is > MIN_PREFETCH pages

    in index cache 'generic_data_cache' (cacheid 2) with LRU replacement

    using table prefetch (size 16K I/O)

    Large IO selected: The number of leaf pages qualified is > MIN_PREFETCH pages

    in data cache 'generic_data_cache' (cacheid 2) with LRU replacement

    Data Page LIO for 'GENERIC_TABLE_002' on table 'GENERIC_DB..GENERIC_TABLE' = 11726.52

Estimating selectivity for table 'GENERIC_DB..GENERIC_TABLE'

    Table scan cost is 664605 rows, 137615 pages,

The table (Allpages) has 664605 rows, 137615 pages,

Data Page Cluster Ratio 1.0000000

    sarg1 = 10000010

    Estimated selectivity for sarg1,

        selectivity = 0.02296552,

    sarg2 = 'USD'

    Estimated selectivity for sarg2,

        selectivity = 0.3687574,

    sarg3 = 'KS'

    Statistics for this column have been edited.

    Estimated selectivity for sarg3,

        selectivity = 0.0212758,

    Search argument selectivity is 0.02020331.

  • If we consider what we've done here, we've said that overall sarg3 is less unique than it was previously (by increasing total density), yet we've actually made this query more selective.
  • To validate the calculations: for our final iteration the dominant density and selectivity have switched places.
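With the sarg3 total density nudged up to 0.0228, the (sarg1, sarg2) group density (0.0227888252879490) is now the lower of the two, so it becomes Dden with the previous selectivity (0.02064584) as Dsel, while the modified sarg3 density and the sarg3 selectivity become Sden and Ssel (still using my reconstructed formula, which reproduces the figure below to rounding):

newsel = 0.02064584 * (0.0212758 + ((1 - 0.0212758) * ((0.0223011172588974 - (0.0228 * 0.0227888252879490)) / (0.0227888252879490 - (0.0228 * 0.0227888252879490)))))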

              

newsel = 0.02020330523666930073836352966023

Statistics:

Estimated rows =13427.22, Average row size =372.8475, Page Count =2484.102

Largest Partition Page Count =2484.102, Partition Count =1, Concurrency Threshold =-1

RID-oriented Logical Properties

                TreeLogProps({0}) - Proj: {0} JoinsToDo: {0} UsfOrd: {0}

                                UsfPartLst ->

                                UsfPartNodeLst ->

               

                Statistics:

                Estimated rows =13427.22, Average row size =372.8475, Page Count =2484.102

                 Largest Partition Page Count =2484.102, Partition Count =1, Concurrency Threshold =-1

0.02020331 * 664605 = 13427.22

-----------------------

  • There is no ultimate conclusion for this one apart from one of bafflement! It still amazes me the lengths it goes to in order to calculate exceptionally specific estimates when, in most circumstances, it cannot really know whether they will be accurate or not.
  • The optimizer could conceivably use any column group density statistic that holds any combination or sub-set of search arguments as specified in the query.
  • The objective is really to try to help you understand how ASE 15/16 can use the statistics available. The observations are detailed along the way.
  • The complex costing can be disabled, reverting back to product-based selectivities, by setting multi_attribute_density_costing to on (or turning on traceflag 437). This is only available from 15.5 ESD#2 / 15.0.3 ESD#4 onwards. NOTE: it cannot currently be set in an abstract plan.

set multi_attribute_density_costing on --yes it is meant to be 'on' to disable the newer calculation method, i.e. revert to old method.

  • If you have a query that you think should be using a particular index, and you think the reason it is not being selected is that it is costed too highly, you might like to try turning this option on.

Good luck!
