VLDB’s….it seems when discussing these, everyone focuses on the data volume aspect and the operational concerns.  These, of course, are very valid points.  However, when working with VLDB’s, we also often are working with “Very Large Hardware” – machines with very high core counts – or at least very large amounts of memory …and this can lead to some not-so-entertaining pain points if the DBA’s are not aware of what this means – and can mean that the VLDB implementation suffers – even more than the challenges with respect to backup and recovery, creating indexes online, etc.

Last week I was at a customer site (I don’t get to do this much any more – so it was a very enjoyable experience) and the lead DBA manager made a comment to me after looking at the results of some performance tests we were running.  To paraphrase, what he said was:

“In the old days, we focused on tuning physical IO’s – it looks like now we need to focus on spinlocks.  Back then, with low [available] memory, we did a lot of physical reads.  Today, on the larger memory footprints, spinlocks and memory contention is a bigger problem”

He made that observation after we had changed several different spinlock configuration values as well as changed some cache configurations and bindings to also help eliminate spinlock contention.  The system we were testing was 40 cores/80 threads and ~450+GB of memory in which all the current day’s data (which was being analyzed) easily fit into memory with no physical reads (the remaining days data, of course, on disk).  This is typical – how many of our systems do we proudly state have 95-98% cache hit ratios with only 1/10th of the memory of the full database size???  Probably most.

At this customer – like many others – a massively parallel operation ran at least periodically during the day (multiple times per hour I suspect).  The reason for massive parallelism was simple – the volume of data to be processed simply precluded a single threaded process.  I have seen this all too often – with healthcare claims and billing processes as well as FSI close market processes, etc. all  running high 100’s or 1000’s of concurrent sessions all doing the same logical unit of work – but dividing the workload among the concurrent sessions.  Simply put, single threaded solutions don’t work in VLDB – and the application developers quickly resort to massive parallelism to try to get all the work done within the desired timeframe.  The quicker the better.

As was this customer’s case.  His problem was that as he ramped ASE 15.7 up beyond 24 engines, severe spinlock contention (>50%) resulted in exponentially increasing response times.  Fortunately, we were running ASE 15.7 sp110 which had monSpinlockActivity – which makes it much, much easier to diagnose spinlock issues.  If you have never used this table…..you are really missing something – it has become a key table for me – much like monOpenObjectActivity.

Net result was that a process that should have ran in 10-20 seconds wasn’t finishing in well over 5 minutes when running 64 cores and scaling the number of concurrent sessions….in fact it was taking so long they were simply aborting the test.

Ouch.

Unfortunately, we were not running ASE 16.

Why do I say that??  Because at least two of the spinlocks causing problems were the DES and IDES spinlocks.  However, the first spinlock that was causing contention reminded me why VLDB’s should be using DOL locking – datarows preferably.  That spinlock was the Address Lock spinlock.  Yes, this is configurable via “lock address spinlock ratio” ….which mystifyingly is set to 100 by default.  Now, that may be fine for pubs2, but any decent sized database with a large number of tables and things get ugly quickly.  Specifically, this spinlock is used for indexes on APL tables.  Now, I hear quite often about the supposed penalties of DOL locking – with memcopy, etc. – but one trade-off that you need to consider when it comes to concurrency and contention:

  1. is it better to make it tunable to reduce the contention, or…
  2. is it better to avoid the contention to begin with by using local copies

AHA!  Never thought of that did you?  Yes, APL tables (in my mind) have a much bigger scalability issue in that you can’t avoid address locks.  Hmmmm….yeah, we can set “lock address spinlock ratio” to 5…but what exactly does that do??  Well a quick doc search reveals:

For Adaptive Servers running with multiple engines, the address lock spinlock ratio sets the number of rows in the internal address locks hash table that are protected by one spinlock.

Adaptive Server manages the acquiring and releasing of address locks using an internal hash table with 1031 rows (known as hash buckets). This table can use one or more spinlocks to serialize access between processes running on different engines.

Adaptive Server’s default value for address lock spinlock ratio is 100, which defines 11 spinlocks for the address locks hash table. The first 10 spinlocks protect 100 rows each, and the eleventh spinlock protects the remaining 31 rows. If you specify a value of 1031 or greater for address lock spinlock ratio, Adaptive Server uses only 1 spinlock for the entire table.

Don’t ask me why the docs cite the last example when in reality we are always trying to LOWER this value vs. raise it.  Soooo…when we set it to 5, we get 1031/5=206.2 or 207 spinlocks (rounding up) – 206 guarding 5 rows each and 1 spinlock on the last hashtable row.  It helped….although occasionally, we still saw periods of 2% or 5% spinlock contention.  Could we set it to 1?  Theoretically.  Might use quite a few resources on all those spinlocks though.  However, remember, the spinlocks are on the hash buckets…the address locks are in the hash chain so even at 1 you still could have contention if the address locks you are after are in the same hash bucket.  BTW, DOL locked tables don’t use address locks on the indexes – so the whole problem is avoided – reason – address locks are part of the lock manager and the implementation for DOL avoids that part….I wonder if not due to the memcopy aspect or the index latching…hmmmm….  Getting back to the supposed penalty for DOL locking…when is a penalty not a penalty???

When the alternative has an even bigger side effect.

For other considerations on DOL locking, see my friend and colleague Cory Sane’s blog on DOL locking at http://scn.sap.com/docs/DOC-52752.

How much of a difference did this make???  Well, we dropped from a never-ending soon number at 64 engines to something less than 5 minutes.

Now, back to the DES and IDES.  Similar to address locks, these also can be tuned via “open object spinlock ratio”, “open index spinlock ratio” and “open index hash spinlock ratio”.  Of course, if using partitions, one should also consider “partition spinlock ratio”.  In true P&T characteristic, as soon as we eliminated the address lock spinlock problem, these guys who were hiding suddenly popped up at 30-40% spinlock contention.  Remember – 10% spinlock contention usually triggers a search and destroy mission ….sooo…even though 30-40% was less than the 50%, we are still suffering mightily.  Once again, we simply set the ratios to 5 …or was it 10….sigh…getting older…memory fail…

Anyhow, you can find a full description of what these things do in the white paper “Managing Workloads withASE: Techniques for OLTP Scaling…” …which admittedly is a bit dated as it was based on 15.5, but it has a very good description of ASE metadata and some of the components in a descriptor.  Of course, it was plagerized heavily from an even earlier work by David Wein who was very instrumental in the ASE kernel….but the advantage of being co-workers is plagerism is allowed as long as we give credit where credit is due…..

This is where ASE 16 shines.  The descriptors in memory are NOT used for query compilation or optimization.  More critically, the descriptors contain address locations for where indexes are in memory, etc. along with OAM and GAM pages – which makes finding them faster.  In addition, they also are used to track concurrency such as keep counts and performance metrics – an example of the latter is that the IDES is the location of the metrics that monOpenObjectActivity collects.  Now, in pre-ASE 16, any time any of these counters were modified, ASE grabbed the spinlock.  Which spinlock???  Oh, yeah – the one that not only protects the current object descriptor – but because of the open object spinlock ratio also protects 99 more.

Ouch.

If I had different users running concurrent queries against different objects that just soooo happened to be protected by the same spinlock…. welllllllll…. hmmm…. that is what was happening.  ASE 16 does this a lot better.  Instead of using spinlocks as a concurrency implementation, ASE 16 leverages several different lockless structures such as “Compare & Swap” that are now actually implemented in modern CPU’s.  In addition, some metadata cache optimizations, such as pre-caching systypes and other similar system table information helps reduce systables lookups…more on this in a minute.  For more on some of these changes, you maywant to watch the replay of Stefan Karlsson’s webcast  “Achieving Linear Performance Scalability in Large Multi-core Systems with SAP ASE 16” @ http://event.on24.com/r.htm?e=775357&s=1&k=223D966FF0E5C249A1FD8ACECA0B91F7.  I know he and the SPEED team also have a white paper on this coming out – I am reviewing it at the moment and when he comes back in August from vacation, he will likely shake his head at all my recommended changes….  So look for it in September maybe.

Back to our performance test.  We were now down to something like 3 minutes.  Yayyyyyyy!!!

But here is where we hit a harder nut to crack – data cache spinlock contention.

Yes, the cache was partitioned. But…..

So the first thing we did is a common trick – move the system tables to a separate named cache.  A lot of folks think (mistakenly so) that the metadata cache is a system table cache.  NOT.  Query parsing/normalization often involves looking up column names, datatypes, etc.  Query optimization often involves reading systabstats & sysstatistics.  Both are even slower if we have to read those tables from disk…and by default, they are in default data cache and treated just like any other table….which means that big wonking table scan just bumped them out of cache.

If you bind the system tables to a separate cache with a relaxed cache strategy and then look at monDataCache for LogicalReads or CacheSearches, you will get a very interesting picture of how many LRU -> MRU relinkages just the system tables were causing in default data cache.  In some runs, we were seeing a high of 100,000 over ~2 minutes execution time – which is ~1000/sec – or 1 every millisecond.  Not claiming they take a millisecond….but it
is a bit of work that can be saved and reduces cache contention on cache spinlocks.  Of course, binding systypes in ASE 16 might be less of a concern due to the aforementioned fact that now it is fully cached as part of the metadata cache…but….why not?

By this time, we were down to 1.5 minutes of execution at 64 engines.  A huge drop from the starting point.  We then noticed that one of the key tables was in a separate named cache, but the others were cluttered along with everything else in default data cache.  To prove a point that it was LRU->MRU relinkages that was helping to drive the contention, we moved the tables to another named cache with a relaxed cache strategy.  This dropped us down to 1 minute total
execution.  However, we still noticed 10%-15% cache spinlock contention even on the relaxed cache strategy.

While we are still not where we would like to be, it proved several important considerations for VLDB management:

  • DOL tables have an advantage over APL in reducing address lock spinlock contention – which can be severe in high engine count configurations.
  • pre-ASE 16 requires a lot more tuning around object and index descriptor spinlocks – which helps – whereas ASE 16’s lockless modifications avoid many of these issues
  • LRU->MRU relinkage plays a role in data cache contention – even when partitioned.  Using separate caches splits the load and using relaxed cache strategy where possible helps even more.  As the customer noted, perhaps more careful consideration for the sp_sysmon recommendations for using ‘relaxed cache strategy’ on named caches other than default data cache is deserved vs. ignoring it as in the past.

There still is some contention with this app keeping us from hitting the perfect state – but again, I think to reduce it further will be a SQL change from a slightly unoptimal query that is the root cause of the rest of the cache contention (yes – table scans DO cause other problem…not the problem here, but a 1 million page table scan means 1 million LRU -> MRU relinkages and the associated spinlock issues…sometimes ya just gotta fix the problem vs. tuning around it)….but it was an interesting segue into how some of the future changes such as transactional memory in a future release of ASE will help.

The lesson…when running VLDB’s on VLH…you might want to consider upgrading to ASE 16 sooner rather than later….in the meanwhile, you might want to look at your memory tuning/named caches and spinlock ratios.  That “everything” in one big default data cache…..that is soooo 1990’s.

To report this post you need to login first.

2 Comments

You must be Logged on to comment or reply to a post.

  1. Grant Fidler

    Very good article Jeff, it mirrors exactly what we have seen when trying to scale Murex on 15.0 / 15.7 running 32+ cores – even when PS came on site they couldn’t make things happen as we’d have liked.  Now my next trick will need to be convincing Murex to certify ASE16.  I won’t hold my breath… 

    As a side reference, your Managing Workloads with ASE document has been helping me for years now and I would recommend it to anyone – any thoughts of an updated version?

    Edit – for info, the sweet spot for engines / contention seemed to be 12-15, depending on the queries.  Diminishing returns past that point.

    (0) 

Leave a Reply