[Oracle] Mess up CBO cardinality estimates by using DBMS_STATS.COPY_TABLE_STATS with one distinct value in first partition column
This is a short blog post about a nasty bug (#14607573) in PL/SQL procedure DBMS_STATS.COPY_TABLE_STATS, that i troubleshot in a non SAP application environment at client site last week.
It is finally fixed by an one-off patch since 126.96.36.199 or included in Oracle patch set 188.8.131.52, but unfortunately there is no patch available for the DBMS_STATS.COPY_TABLE_STATS backport on Oracle 10g. This means a manual hack is necessary in consequence 🙂
PL/SQL procedure DBMS_STATS.COPY_TABLE_STATS is usually used for large partitioned tables, if you add a new partition and want to avoid an immediate DBMS_STATS.GATHER_TABLE_STATS call or the data in that new partition is not representative at that point in time. However it is pretty easy to run into out-of-range or insufficient statistic issues, if you don’t do anything at all after adding the new partition and using it in SQL statements. The blog post “Oracle Optimizer Blog – Maintaining statistics on large partitioned tables” by Maria Colgan (check reference section for details) describes one of these situations very well.
“Out of Range” means that the value supplied in a where clause predicate is outside the domain of values represented by the [minimum, maximum] column statistics. The optimizer prorates the selectivity based on the distance between the predicate value and the maximum value (assuming the value is higher than the max), that is, the farther the value is from the maximum value, the lower the selectivity will be. This situation occurs most frequently in tables that are range partitioned by a date column, a new partition is added, and then queried while rows are still being loaded in the new partition. The partition statistics will be stale very quickly due to the continuous trickle feed load even if the statistics get refreshed periodically. The maximum value known to the optimizer is not correct leading to the “Out of Range” condition. The under-estimation of selectivity often leads the query optimizer to pick a sub optimal plan. For example, the query optimizer would pick an index access path while a full scan is a better choice.
The “Out of Range” condition can be prevented by using the new copy table statistics procedure available in Oracle Database 10.2.0.4 and 11g. This procedure copies the statistics of the source [sub] partition to the destination [sub] partition. It also copies the statistics of the dependent objects: columns, local (partitioned) indexes etc. It adjusts the minimum and maximum values of the partitioning column as follows; it uses the high bound partitioning value as the maximum value of the first partitioning column (it is possible to have concatenated partition columns) and high bound partitioning value of the previous partition as the minimum value of the first partitioning column for range partitioned table.
The last part of the blog quote is also the problematic / buggy one at the same time. It works pretty well, if the partition bound span over several distinct values, but it is buggy, if the partition column has only one distinct value (NDV=1). This is usually the case in the following scenarios: date range partition, each partition holds data for a day, the partition lower bound matches to the low and high value. The cost based optimizer estimates a cardinality of 1 (by filtering on that partition key column) no matter of the other relevant statistics (that may represent the real world data), which results in heavily underestimates and possible insufficient execution plans (e.g. wrong join method / order).
Demo of the defect / anomaly
The following demo and work-around was run on an Oracle database (10.2.0.5.0) on OEL 6.4 (2.6.39-400.109.1).
Creating the base table and data
SQL> create table PARTTEST (QID NUMBER, TESTNUM NUMBER, TEXT VARCHAR2(20)) partition by range (QID) (partition P1_L2 values less than (2), partition P2_L3 values less than (3), partition P3_L4 values less than (4));
SQL> begin for i in 1 .. 1000 loop insert into PARTTEST values (1,i,'TEST'); insert into PARTTEST values (2,i,'TEST'); insert into PARTTEST values (3,i,'TEST'); end loop; commit; end; /
SQL> exec DBMS_STATS.GATHER_TABLE_STATS(USER,'PARTTEST', method_opt => 'FOR ALL COLUMNS SIZE 1');
SQL> alter session set events '10053 trace name context forever, level 1'; SQL> select /*+ gather_plan_statistics HARD PARSE ME */ count(*) from PARTTEST where QID = 3;
A cardinality of 1000 rows based on the predicate “QID = 3” is calculated (formula: 1/<NDV> x <NUM_ROWS> = 1/1 x 1000 = 1000), which is absolutely in the right ball park. The partition and (partition) column statistics “Min”/”Max”=3/3 and NDV=1 are correct as well. Now let’s add a new partition, copy the local partition statistics from partition P3_L4 to the new one called “P4_L5” and fill it with the same amount of data.
SQL> alter table PARTTEST add partition P4_L5 values less than (5); SQL> exec DBMS_STATS.COPY_TABLE_STATS(USER, 'PARTTEST', 'P3_L4', 'P4_L5');
Footnote: Global statistics are not modified / adjusted by DBMS_STATS.COPY_TABLE_STATS, but that is not the topic right here as the (local) partition statistics are evaluated and used in this example.
SQL> begin for i in 1 .. 1000 loop insert into PARTTEST values (4,i,'TEST'); end loop; commit; end; /
SQL> alter session set events '10053 trace name context forever, level 1'; SQL> select /*+ gather_plan_statistics HARD PARSE ME */ count(*) from PARTTEST where QID = 4;
The impact of Oracle bug #14607573 should be obvious right now. In reality the execution plan step “PARTITION RANGE SINGLE / TABLE ACCESS FULL” still returns 1000 rows, but the cost based optimizer has adjusted the estimated cardinality to 1.
We can spot the statistic issue / anomaly, if we look closely at the (partition) column statistics “Min”/”Max” and NDV. Ask yourself – “How is it possible that a column with only one distinct value (in our case 4) can span a value range of 2 (from 4 to 5)?” This is not possible in a logical or mathematical (disregarding final truncation) way. In addition it seems like the cost based optimizer is not able to handle that contradictory / inconsistent information and adjusts the cardinality to 1.
So what is the solution for this bug, if we run on an Oracle version <= 184.108.40.206 and are not able to apply the one-off patch or patch set? We hack it manually 😉
We manually adjust the column statistics (low and high value) with help of DBMS_STATS.SET_COLUMN_STATS as the partition column can include only one value per definition (high bound difference of one between previous and current partition). Unfortunately there is another bug (#11786774) in PL/SQL procedure DBMS_STATS.SET_COLUMN_STATS, which may lead to an invalid histogram on the column with the adjusted statistics. We can not avoid that invalid histogram with Oracle 10g (if that bug kicks in), but we can avoid subsequent copy errors by re-setting the number of buckets and column endpoint values every time.
SQL> DECLARE va_srec DBMS_STATS.STATREC; va_distinct NUMBER; va_density NUMBER; va_nullcnt NUMBER; va_avglen NUMBER; va_numvals DBMS_STATS.NUMARRAY; BEGIN DBMS_STATS.DELETE_TABLE_STATS(USER, 'PARTTEST','P4_L5'); DBMS_STATS.COPY_TABLE_STATS(USER, 'PARTTEST', 'P3_L4', 'P4_L5'); DBMS_STATS.GET_COLUMN_STATS(USER,'PARTTEST','QID','P3_L4',NULL,NULL,va_distinct, va_density,va_nullcnt,va_srec,va_avglen); va_numvals := DBMS_STATS.NUMARRAY(4,4); va_srec.epc := 2; va_srec.bkvals := NULL; DBMS_STATS.PREPARE_COLUMN_VALUES (va_srec, va_numvals); DBMS_STATS.SET_COLUMN_STATS(USER,'PARTTEST','QID','P4_L5',NULL,NULL,va_distinct, va_density,va_nullcnt,va_srec,va_avglen); commit; END; /
SQL> alter session set events '10053 trace name context forever, level 1'; SQL> select /*+ gather_plan_statistics HARD PARSE ME AGAIN */ count(*) from PARTTEST where QID = 4;
The cost based optimizer calculates the correct cardinality for that new partition as well with help of the copied and manually adjusted partition column statistics.
If you have any further questions – please feel free to ask or get in contact directly, if you need assistance by troubleshooting Oracle database (performance) issues.