Statistics are very important for the Sybase optimizer to create a good plan. Sometimes you update the statistics and the plans are still bad. That doesn't mean the optimizer is making a wrong choice: its choice is based on the information provided to it. Sometimes that information is not enough, or the statistics on a column don't give a correct picture of the data. As a result, the optimizer can produce a bad plan.

One such case is when a column's data is skewed and we have used the default (or too small a) number of histogram steps.

Let us build the tables I will need here.

set nocount on
go

-- rebuild the helper table: 5000 sequential numbers
drop table rnumtab
drop table mytab
go

create table rnumtab (rnum int identity not null primary key, col1 varchar(1) null)

insert into rnumtab (col1) select top 5000 '1' from syscolumns

-- target table: col2 will be heavily skewed (value n appears n times)
create table mytab (col1 int identity not null primary key, col2 int not null, col3 char(120) not null, col4 datetime not null)

insert into mytab (col2, col3, col4)
select a.rnum, convert(varchar(120), newid(1)), dateadd(mi, -1 * (b.rnum * rand()), getdate())
from rnumtab a inner join rnumtab b on a.rnum >= b.rnum

-- make the already-frequent high values of col2 even more frequent
insert into mytab (col2, col3, col4) select col2, col3, col4 from mytab where col2 between 4990 and 5000

drop index mytab.idx_1
create index idx_1 on mytab(col2)

drop index mytab.idx_2
create index idx_2 on mytab(col4)

select count(*) from mytab
go

I looked at the optdiag output for col2 and could see that there are 5000 steps in the histogram, which is great: my data is skewed, and more steps means better statistics.
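If you want to reproduce this, the histogram can be dumped with the optdiag utility from the command line. A minimal sketch, where the database name, login, server name, and output file are placeholders for your own values:

optdiag statistics mydb..mytab.col2 -Usa -Smyserver -o mytab_col2_stats.txt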

My server has histogram tuning factor = 20.
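For completeness, you can check that setting (and, with sa_role, change it) via sp_configure:

sp_configure "histogram tuning factor"
go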

Now, like everyone else, I want my stats to be up to date, so I run update statistics on the index:

update index statistics mytab idx_1
go

Now check the optdiag output: the histogram has just 31 steps. What happened here? I guess it is a bug, as update statistics shouldn't reduce the histogram steps unless told to explicitly.
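As a side note, the step count can be pinned explicitly with the using ... values clause, the same fix we will apply near the end of this post:

update index statistics mytab idx_1 using 5000 values
go

For now, though, let us stay with the 31-step histogram that the plain command produced.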

Anyway, I will take the above up with someone at Sybase, but that is not the point of this discussion. Now I have 31 steps and my data is quite skewed. Let us run a sample query.

set showplan on
set statistics io,time,plancost on
go

select * from mytab where col2 = 10

Here are the plan and stats.

==================== Lava Operator Tree ====================

            
            Emit
            (VA = 1)
            r:10 er:3380
            cpu: 0

/
IndexScan
idx_1
(VA = 0)
r:10 er:3380
l:5 el:263
p:0 ep:263

============================================================
Table: mytab scan count 1, logical reads: (regular=5 apf=0 total=5), physical reads: (regular=0 apf=0 total=0), apf IOs used=0
Total writes for this command: 0

Execution Time 0.
Adaptive Server cpu time: 0 ms.  Adaptive Server elapsed time: 0 ms.

Look at the estimated rows (3380) vs the actual rows (10). This is not good, even though our stats are up to date. For this particular query it did just fine, since it still picked the correct plan. Things can go bad, however, when this table is joined with other tables: the inflated estimate could cause a merge or hash join where a nested-loop join would have been chosen for 10 rows. There can be other issues caused by this wrong estimation as well; today I will show one such example.
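As a sketch of that join risk (othertab, its id column, and descr are hypothetical here, not part of the setup above): with 10 actual rows, a nested-loop join into an index on othertab.id is ideal, but a 3380-row estimate may push the optimizer toward a merge or hash join instead.

select m.col1, o.descr
from mytab m inner join othertab o on o.id = m.col1
where m.col2 = 10

Back to our example.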

set showplan on
set statistics io,time,plancost on
go

select top 1 * from mytab where col2 = 10
order by col4 desc

Let us think about what kind of plan the optimizer should choose for this query. We have indexes on both col2 (idx_1) and col4 (idx_2). However, col2 is our SARG, and we have an order by on col4.

We are sure it should pick the index on col2. The work would be: bring all the matching rows using the index on col2, sort that result set on col4, and return the topmost row. That is what I expected, and what most of us would expect.
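As an aside, if you want to compare the optimizer's choice against this expected plan, ASE lets you force a particular index with an index hint; a minimal sketch:

select top 1 * from mytab (index idx_1)
where col2 = 10
order by col4 desc

Now let us see the plan the optimizer actually chose.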

==================== Lava Operator Tree ====================

                        
                        Emit
                        (VA = 2)
                        r:1 er:3380
                        cpu: 100

             /
            Top
            (VA = 1)
            r:1 er:3380
            l:0 el:297
            p:0 ep:297

/
IndexScan
idx_2
(VA = 0)
r:1 er:3380
l:29210 el:297
p:0 ep:297

============================================================
Table: mytab scan count 1, logical reads: (regular=29210 apf=0 total=29210), physical reads: (regular=0 apf=0 total=0), apf IOs used=0
Total writes for this command: 0

Execution Time 1.
Adaptive Server cpu time: 100 ms.  Adaptive Server elapsed time: 113 ms.

What happened here? The optimizer picked the index on col4 instead of the index on col2. It looks strange, and everyone on my team thought it was a Sybase bug. However, I think it is the beauty of the Sybase optimizer that it chose the index on col4.

I will explain why it picked the index on col4. Suppose it had picked the index on col2. Then it would have had to bring 3380 (estimated) rows into a worktable, sort them on col4, and return the topmost row.

Suppose instead it picks the index on col4. It then does a backward scan on the col4 index, and as soon as it hits a row with col2 = 10 it stops processing and returns the data. The optimizer knows there are about 12.8 million rows, of which around 3380 (estimated) have col2 = 10. So it reasons that roughly every 12887151/3380 ≈ 3813th row could have col2 = 10, which means there is a high chance of hitting such a row within the first 3813 rows of the scan. Yes, that 3813 is more than the 3380 rows it would have read using the index on col2, but this way there is no sort, hence no CPU cost and no worktable memory, and that is why this index choice looks better to the optimizer.
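A quick sanity check of that arithmetic against the table itself (the exact count on your run will differ slightly, because the setup uses rand()):

-- expected backward-scan depth: total rows / estimated matching rows
select count(*) / 3380 from mytab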

Now, this is pure luck: col2 = 10 could sit near the very start of the backward scan of the index, or at the very end. In the case above it came relatively early, which is why the query finished after 29210 logical reads instead of scanning all 12.8 million. Had the match been at the very end, the whole table would have been read before a single row was returned, and that would have been a disaster. That is why I say it just depends on luck: the chosen plan could be perfect or a disaster, depending on where the data sits in that particular index.
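You can get a feel for how lucky a given value will be by checking how close its newest col4 sits to the top of the index:

-- the closer the first result is to the second, the sooner a backward
-- scan on idx_2 finds a qualifying row
select max(col4) from mytab where col2 = 10
select max(col4) from mytab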

Now let me run the same query, but against a value whose rows sit at the very end of the backward scan, or one for which there is no data at all. I am trying col2 = 0:

set showplan on
set statistics io,time,plancost on
go

select top 1 * from mytab where col2 = 0
order by col4 desc

Plan and stats are below.

==================== Lava Operator Tree ====================

                        
                        Emit
                        (VA = 2)
                        r:0 er:3327
                        cpu: 59200

             /
            Top
            (VA = 1)
            r:0 er:3327
            l:0 el:309
            p:0 ep:309

/
IndexScan
idx_2
(VA = 0)
r:0 er:3327
l:1.281e+07 el:309
p:0 ep:309

============================================================
Table: mytab scan count 1, logical reads: (regular=12805777 apf=0 total=12805777), physical reads: (regular=0 apf=0 total=0), apf IOs used=0
Total writes for this command: 0

Execution Time 592.
Adaptive Server cpu time: 59200 ms.  Adaptive Server elapsed time: 59186 ms.

Here it has read the full table, and through an index at that, which is a very costly operation. If the table were much larger it would have been even worse, and we would have dragged unnecessary data into the cache, hurting overall system performance.

So what is wrong here? Is the optimizer wrong? No; I'd say it is our statistics that are feeding wrong information to the optimizer. If the stats and estimates were correct, it would have used the index on col2 for values with just a few rows in the table, and the col4 index for values that occur very frequently.

Let us update the statistics. I want at least 5000 steps, and if the optimizer wants more it can create them.

update statistics mytab (col2) using 5000 values
go

Let us run the query with col2 = 0 again. It should pick the index on col2.

set showplan on
set statistics io,time,plancost on
go

select top 1 * from mytab where col2 = 0
order by col4 desc

==================== Lava Operator Tree ====================

                                    
                                    Emit
                                    (VA = 3)
                                    r:0 er:1568
                                    cpu: 0

                         /
                        Top
                        (VA = 2)
                        r:0 er:1568
                        l:0 el:131
                        p:0 ep:252

             /
            Sort
            (VA = 1)
            r:0 er:1568
            l:6 el:131
            p:0 ep:252
            cpu: 0 bufct: 24
/
IndexScan
idx_1
(VA = 0)
r:0 er:1568
l:4 el:5
p:0 ep:5

============================================================
Table: Worktable1 scan count 1, logical reads: (regular=6 apf=0 total=6), physical reads: (regular=0 apf=0 total=0), apf IOs used=0
Table: mytab scan count 1, logical reads: (regular=4 apf=0 total=4), physical reads: (regular=0 apf=0 total=0), apf IOs used=0
Total writes for this command: 0

Execution Time 0.
Adaptive Server cpu time: 0 ms.  Adaptive Server elapsed time: 10 ms.

Now the estimated row count is 1568. That is not perfectly correct (the actual is 0), but it is much better than before, and for other values these estimates should be very good. As hoped, the plan now scans idx_1 and sorts, doing a handful of logical reads instead of 12.8 million.

Now let us run the query with a col2 value that is quite frequent, say 4900. It should pick the index on col4.

set showplan on
set statistics io,time,plancost on
go

select top 1 * from mytab where col2 = 4900
order by col4 desc

==================== Lava Operator Tree ====================

                        
                        Emit
                        (VA = 2)
                        r:1 er:4900
                        cpu: 0

             /
            Top
            (VA = 1)
            r:1 er:4900
            l:0 el:206
            p:0 ep:206

/
IndexScan
idx_2
(VA = 0)
r:1 er:4900
l:751 el:206
p:0 ep:206

============================================================
Table: mytab scan count 1, logical reads: (regular=751 apf=0 total=751), physical reads: (regular=0 apf=0 total=0), apf IOs used=0
Total writes for this command: 0

Execution Time 0.
Adaptive Server cpu time: 0 ms.  Adaptive Server elapsed time: 6 ms.

It is using the index on col4, and this time the estimate is exactly right (er:4900 for a value that occurs 4900 times), so the backward-scan choice is well founded: it took only 751 logical reads.

So: keep your stats up to date, and make sure you use a proper number of histogram steps, especially where the data is heavily skewed. In the case above, an index on (col2, col4) would have been best of all; but create new indexes only after careful analysis of their impact on your DML statements as well as on your maintenance jobs.
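For completeness, a sketch of that composite index (idx_3 is just an illustrative name):

create index idx_3 on mytab(col2, col4)

With it, the optimizer can satisfy the SARG on col2 and the order by on col4 from a single index: no sort, and no backward-scan gamble.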

Thanks
