ASE Cluster Edition 16sp01 with RDMA and ASE 16sp02 pl02 NV Cache
We all like to get presents for the holidays. This year, ASE engineering came through with two that should make a lot of ASE users quite happy.
The first was ASE Cluster Edition 16sp01 on Linux. While the main thrust of this release was to bring ASE Cluster Edition up to the ASE 16 level of functionality (and performance enhancements), there also was one of those “magic go faster” buttons added – via RDMA support.
One of the bottlenecks in any Shared Disk Cluster (SDC) is the need for the different cluster nodes to communicate with each other. Often this coordination must be done in advance – e.g. logical or physical locks – and there is a LOT of it in any cluster. Tons of it. Mind boggling amounts of it. Loosely, for ASE/CE, we referred to any and all communications between the nodes as CIPC – or Cluster Interconnect Protocol Communications. In the past, ASE/CE used UDP over IP as a slightly faster protocol for CIPC vs. TCP/IP. However, whether TCP or UDP, both are packet framing protocols in which network operations move through multiple layers before finally being sent. For ASE/CE implementations, this was often the bane of existence. First, few system admins bothered to tune the network settings – and if they did, it was merely the OS kernel memory and often just for TCP rather than UDP. Neglected was the ability to increase the hardware/NIC queue depth using ifconfig. Worse yet, very often the private interconnects were not even set up correctly.
Enter RDMA. RDMA is to network IO as DMA is to disk IO. In other words, rather than copying memory contents repeatedly as happens with packet framing, the memory contents are mapped directly to the IO request. The difference is enormous. Whereas we usually measure network latency using TCP/IP in milliseconds, RDMA latency is often in the low tens of microseconds. In some cases that is short enough that, with CPU threading, the process doesn’t even need to yield the CPU. RDMA can be implemented over InfiniBand or Ethernet – the latter, of course, much cheaper. Currently, ASE/CE 16sp01 only supports RDMA over Ethernet.
However, the good news for most is that for very little investment (<$10K US), you can upgrade the NICs used for CIPC and any switches from older hardware without RDMA support to new 10GbE with RDMA support. Current testing in engineering has shown a 35% improvement in application performance when in cluster mode, with badly partitioned applications showing nearly 200% performance gains.
Now the bad news:
1) This isn’t to imply that ASE/CE should or could be used for horizontal scaling. There simply are too many SDC related bottlenecks for this to work for the OLTP applications we aim ASE at.
2) ASE/CE 16sp01 was only released on Linux – before you ask, I don’t have dates for those of you running on AIX, Solaris SPARC or HPUX on HPIA/64.
For more information about what is new in ASE Cluster Edition 16sp01, download the ASE 16sp01 What’s New guide that was published in December from http://help.sap.com/ase1601 (http://help.sap.com/Download/Multimedia/zip-ase1601/SAP_ASE_Whats_New_en.pdf)
The second present under the tree from engineering was ASE 16sp02 NV Cache. While ASE 16sp02 was GA’d in October, there were two features in restricted release due to finalizing some usability aspects. One of course is the HADR feature in the Always-On option – which should be unrestricted with the release of ASE 16sp02 pl03. However, in ASE 16sp02 pl02, we moved the NV Cache feature from restricted to fully available.
The NV Cache feature leverages SSD devices to extend main memory through an industry standard technique called “SSD cache extension”. Whereas competitive offerings were limited to clean pages only and also didn’t differentiate between cache hits, ASE engineering built a “smarter SSD cache”. In the case of ASE, the named caches (or default data cache) are divided into two page chains – Single Access Cache Chain (SACC) and Multi-Access Cache Chain (MACC). The first time a page is read, it is appended to the SACC. If the page is read a second time while still in cache, it is then appended to the MACC. When pages from the SACC hit the wash marker, they are not moved to the NV Cache – instead they are simply discarded (if clean – if not, a write is posted and the page is discarded once the write completes). However, when a page in the MACC hits the wash marker, it is copied to the NV Cache. If the page is re-read, it simply is read (and removed) from the NV Cache and placed back into the standard cache (on the MACC list obviously). As a result, the NV Cache is more effective as it isn’t filled with useless read-once-and-discard data.
In addition, the other advantage of ASE’s NV Cache over competitive offerings is that all writes are done via the NV Cache. The rationale is a bit interesting. Have you ever run a query with ‘set statistics io on’ and seen a lot of writes – in particular the writes increasing or fluctuating with increased concurrency on the system? The answer is actually in the documentation.
While it *could* be sort spills or other worktable related query processing – it also *could* be the simple fact that reading pages from disk into memory requires other pages to be pushed out – e.g. normal MRU-LRU processing. As pages are brought into memory, your cache changes may push dirty pages already in cache through the wash marker, resulting in disk writes on tables/indexes you are not even affecting with your query – disk writes of dirty pages written to by other users. Depending on the size of your cache (more specifically the wash size), you may have had to wait for these writes to complete before there were clean pages available for you to read pages from disk into – called ‘cache stalls’. Admittedly these are infrequent, but they do happen. More to the point, when writing to the transaction log, any ULC flushes or commits had to wait for the actual writes to disk – which often was the biggest cause of log semaphore contention, as was evident in the ‘delayed commit’ option implementation.
When a database is bound to an NV Cache, all writes instead happen first to the NV Cache – and then ‘lazy cleaner’ threads write the dirty pages to disk. The number of lazy cleaners is configurable using sp_configure, and the NV Cache supports a ‘journal’ which tracks the write sequence to ensure recoverability (and to denote which pages need to be written by the lazy cleaners).
Once again, early testing by our partner EMC using their XtremIO(TM) All Flash Array has shown that using these devices as NV Cache provides nearly the same performance as an all-SSD hosted database – despite using much slower HDDs for actual storage. The actual performance numbers will be reported via a whitepaper that my colleague Andrew Neugebauer is working on with EMC. More information on the NV Cache can be found in the ASE 16sp02 System Administration Guide volume 2, Chap 4 – Configuring Data Caches, Section 4.19 – Managing NV Cache.
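For those who prefer to see the SACC/MACC idea in code, here is a rough Python sketch of the read path described above. To be clear, this is a toy model and not how ASE implements it: the class name TwoChainCache, the three size parameters and the oldest-first eviction from the NV Cache are all my own assumptions for illustration, the wash marker is reduced to "the page fell off the old end of its chain", and dirty pages, the journal and the lazy cleaners are left out entirely.

```python
from collections import OrderedDict


class TwoChainCache:
    """Toy model of a data cache split into SACC and MACC with an NV cache behind it.

    Hypothetical illustration only: names, sizes and eviction details are assumptions.
    """

    def __init__(self, sacc_size, macc_size, nv_size):
        self.sacc = OrderedDict()  # pages read once so far, oldest first
        self.macc = OrderedDict()  # pages read more than once, oldest first
        self.nv = OrderedDict()    # stand-in for the SSD-backed NV cache
        self.sacc_size = sacc_size
        self.macc_size = macc_size
        self.nv_size = nv_size
        self.disk_reads = 0

    def read(self, page_id):
        """Read a page and return where it was found: 'macc', 'sacc', 'nv' or 'disk'."""
        if page_id in self.macc:
            # Assumption: a hit on the MACC moves the page back to the young end.
            self.macc.move_to_end(page_id)
            return "macc"
        if page_id in self.sacc:
            # Second read while still in cache: the page moves to the MACC.
            del self.sacc[page_id]
            self._append_to_macc(page_id)
            return "sacc"
        if page_id in self.nv:
            # Re-read: the page is removed from the NV cache and put back on the MACC.
            del self.nv[page_id]
            self._append_to_macc(page_id)
            return "nv"
        # First read: fetch from disk and append to the SACC.
        self.disk_reads += 1
        self.sacc[page_id] = True
        if len(self.sacc) > self.sacc_size:
            # A read-once page ages out: it is discarded, never copied to the NV cache.
            self.sacc.popitem(last=False)
        return "disk"

    def _append_to_macc(self, page_id):
        self.macc[page_id] = True
        if len(self.macc) > self.macc_size:
            # A multi-access page ages out: it is copied to the NV cache.
            aged_page, _ = self.macc.popitem(last=False)
            self.nv[aged_page] = True
            if len(self.nv) > self.nv_size:
                # Assumption: the NV cache drops its oldest page when full.
                self.nv.popitem(last=False)
```

Reading a page twice and then letting it age out of the MACC leaves it in the nv dictionary, so the next read of that page comes back as 'nv' rather than 'disk'. A page read only once and then pushed out of the SACC is simply gone, and the next read of it counts as another disk read.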
….and based on the roadmap, all I can say is that next year’s presents look equally attractive to speed freaks such as myself….only 340-odd days to go….