The original intention of this feature was to help in the case of a date-based search argument where additional rows have made it into a table but there hasn’t yet been a window to update the statistics: business_date = <today>, for example.
It has been around since 15.0.2 ESD#2.
If there were many rows matching <today> but the statistics hadn’t yet been updated to reflect this, then the estimated cost of this search argument would have been as good as nothing.
This could lead to inefficient plan selection.
I was recently asked about this feature after one of our customers had been looking into some costing output printed by the ASE optimizer; it wasn’t clear how we calculated the figures. To be specific, the output produced by:
set option show_lio_costing on
This will print information with regards to the logical i/o costing for the various available methods (logical operators) of accessing tables in a query.
In essence, it gives the estimated cost, in terms of data pages, of a particular scan for a particular access method at a particular node of the plan.
It will provide a similar output to TF 302 as used in versions of ASE prior to 15.0.
Out of range histogram adjustment takes place for search arguments (=, >=, >) whose value falls outside the upper bound of the last cell of a histogram.
It will apply to pretty much any data type and any data distribution.
I’m not going to go into any detail in this post about the wider calculations. I’ll concentrate on some out of range information and how it is calculated.
The idea is that the out of range functionality assigns some selectivity to the search argument.
In this particular instance they were asking about the figures generated when there was an IN clause with 2 entries.
The query in question was equivalent to:-
select * from a where a1 in ('T', 'Y')
for this set of data (as printed by optdiag)
Out of range Histogram Adjustment is DEFAULT.
Step  Weight      Value
1     0.00000000  < "C "
2     0.01000000  = "C "
3     0.00000000  < "N "
4     0.99000001  = "N "
Note that out of range adjustment is on by default.
The lio_costing for one of these search arguments (the query with the IN clause) was:-
Beginning selection of qualifying indexes for table 'a',
Estimating selectivity of index 'a.ix1', indid 2
a1 = 'T'
Estimated selectivity for a1,
Out of range histogram adjustment,
selectivity = 0.1666667,
scan selectivity 0.1666667, filter selectivity 0.1666667
1666.667 rows, 10 pages
Index covers query.
Data Row Cluster Ratio 0.9954797
Index Page Cluster Ratio 1
Data Page Cluster Ratio 0.9285714
using no index prefetch (size 2K I/O)
in index cache ‘default data cache’ (cacheid 0) with LRU replacement
The plan estimates that about 33% of the table will match the two OR terms (2*0.1666667).
Clearly if there are fewer rows than that, this estimate is not going to be accurate.
In this instance there were zero matching rows. The impact was a plan involving a reformat, because the optimizer expected 33% of a largish table to qualify.
Entirely inefficient when in fact zero rows qualify.
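The arithmetic behind that 33% can be checked directly. A quick sketch, assuming (as the printed estimate of 1666.667 rows implies, though the post never states it) a table of roughly 10,000 rows:

```python
# Each OOR-adjusted equality term gets selectivity 0.1666667 (derived below
# in the post), and the two disjoint OR terms are simply summed.
per_term = 1 / 6            # 0.1666667, as printed by show_lio_costing
combined = 2 * per_term     # ~0.3333 of the table expected to qualify

# The printed "1666.667 rows" implies ~10,000 rows in the table (assumption).
table_rows = 10_000
print(round(per_term * table_rows, 3))   # 1666.667
print(round(combined, 4))                # 0.3333
```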
The out of range (OOR) adjustment applies to any value beyond the top end of the histogram, and it is primarily based on
the number of unique values in the histogram (1/'Unique total values').
The (non)uniformity of the domain, skew, or level of cardinality don’t really come into it (at least when deciding whether to apply OOR).
It applies whether the histogram has one entry or a thousand.
The OOR histogram adjustment is done entirely in memory and is not stored in sysstatistics or anywhere else.
If the two values in the histogram above were evenly weighted at 50% each, the OOR histogram cell would be the same as the one calculated
with the 99%/1% split.
We are dealing with an OR operator here, so we have multiple additional histogram entries to create in-memory.
This results in a normalized histogram that has to include the other OR terms.
So if the clause were a single equality search argument (a1 = ‘T’) then we would have an OOR selectivity of 0.25.
This is calculated as follows.
The initial out of range weight is calculated as 1/(unique values + 1) = 0.33, where the 1 represents the number of out of range clauses.
We then need to adjust the weight of each cell (*including the newly created one*) so that, mathematically, the histogram still adds up to 1.
(1/(2+1)) / (1+ (1/(2+1)) ) = 0.25
0.33 / 1.33 = 0.25 -- this becomes our out of range selectivity.
So in essence what we have done there is take an out of range weight of 0.33 and then scale every cell down evenly, including the 0.33 itself.
This leaves 0.75 of the table represented by the other two values, so the full in-memory normalized histogram works out to roughly 0.0075 for 'C', 0.7425 for 'N', and 0.25 for the new out of range cell.
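That normalization step can be sketched in a few lines. The weights below are the two equality cells from the optdiag output above; this is just the arithmetic, not ASE’s actual code:

```python
# In-memory OOR adjustment for a single out-of-range equality (a1 = 'T').
unique_values = 2                  # 'C' and 'N' in the histogram
weights = [0.01, 0.99]             # existing equality-cell weights

oor_raw = 1 / (unique_values + 1)  # initial out-of-range weight: 0.3333
cells = weights + [oor_raw]        # append the in-memory OOR cell

total = sum(cells)                 # 1.3333; histogram must sum to 1 again
normalized = [w / total for w in cells]
print([round(w, 4) for w in normalized])   # [0.0075, 0.7425, 0.25]
```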
Now, when we come to add additional OR terms, we have an extra histogram step that needs to be considered
for each OR term.
So if we have:-
in ('T','Y') -- shorthand for (<column> = 'T' OR <column> = 'Y')
We now have:-
1/(unique values + 2) = 0.25. The 2 represents the number of out of range clauses.
Which then normalizes to:-
(1/(2+2)) / (1+ (1/(2+2))+(1/(2+2)) ) = 0.16667
0.25 / 1.5 = 0.16667 -- we add two weights of 0.25 to the denominator, as we have two additional values to consider.
So the full normalized histogram then becomes roughly 0.0067 for 'C', 0.66 for 'N', and 0.16667 for each of the two out of range cells.
With 3 OR terms we then end up with 0.125, and so on:
(1/(2+3)) / (1+ (1/(2+3))+(1/(2+3))+(1/(2+3))) = 0.125
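The pattern generalizes: with k out of range OR terms against a histogram of n unique values, each term’s raw weight is 1/(n+k), and re-normalizing divides by 1 + k/(n+k). A hypothetical helper (my naming, not an ASE API) to check the figures above:

```python
def oor_selectivity(unique_values: int, k_terms: int) -> float:
    """Per-term selectivity after out-of-range adjustment (sketch)."""
    raw = 1 / (unique_values + k_terms)   # initial weight of each OOR cell
    return raw / (1 + k_terms * raw)      # re-normalize so histogram sums to 1

print(round(oor_selectivity(2, 1), 4))   # 0.25    (single equality)
print(round(oor_selectivity(2, 2), 5))   # 0.16667 (IN with two values)
print(round(oor_selectivity(2, 3), 3))   # 0.125   (three OR terms)
```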
It’s clear that in this circumstance out of range adjustment does not help.
This can be flipped by running:-
update statistics a (a1) using out_of_range off -- must specify a column in the command
Note that this does NOT actually update the statistics; it merely flips a status bit in sysstatistics that says whether to use
out of range adjustments at optimization time. The setting is sticky, meaning subsequent update statistics runs will not reset
the OOR off status.
TF 15355 will disable the feature but individual columns can still be flipped to use out of range adjustment.
Check out the c1 column in sysstatistics for formatid=100 if you’re really keen on seeing what changes with OOR on/off.
Next post is about data skew with a juicy example of the impact it can have and the circumstances when you can or can’t do something about it. That’ll be fun 🙂