Do MetroClusters Make Sense?
Note: opinions and statements made in this blog are mine and not necessarily the official position of SAP.
My colleague and ASE übergeek Peter Thawley recently sent me an announcement from EMC titled “Oracle Real Application Clusters (RAC) on Extended Distance Clusters with EMC© VPLEX™ Metro”. The genesis of the white paper was how to set up Oracle RAC in a “stretch cluster” configuration. For those unfamiliar with the concept, it involves two sets of cluster nodes and two sets of disks in separate locations – but operating as a single cluster. The theory is that this improves availability over a single-location cluster for the simple reason that if the primary location goes down, the bunker facility can take over. Now, we all understand that there are 3 R’s in High Availability. The two most often cited are Recovery Point Objective (RPO) and Recovery Time Objective (RTO). Synchronous disk replication provides zero data loss, achieving a perfect RPO, and by extending SDC clustering to the DR site, the normally long RTO of disk replicated solutions due to cold restarts is minimized to just the seconds of an SDC cluster failover – achieving a near perfect RTO. The third R – present in every solution – is ROI. No one ever wants to have expensive hardware sitting idle. So to that extent, EMC’s approach seems to have the perfect trifecta – RPO, RTO and ROI. The secret sauce here is EMC’s VPLEX™ Metro technology – which is the first truly viable bidirectional disk replication solution.
My half tongue-in-cheek response was “How quaint, they are still using disks!”
But the real question is “Is this the smart way to implement high availability?”
Before I begin, let me first say that I like EMC. They build industry leading disk-based SAN storage systems. However, I must be honest and say that I find the performance degradation from disk based storage (SAN or DASD) to be so severe that I question its use in mission critical systems – quite simply, SSD storage in my opinion is cheaper in the long run and there are quasi-SAN SSD based solutions available (e.g. Texas Memory Systems). So….my feelings are a bit mixed about seeing a great technological breakthrough hamstrung by reliance on arcane technology. And what a breakthrough it is…..what EMC is doing with VPLEX™ Metro is something customers have long asked for and is really worth considering.
However, first I should state some interesting observations that EMC made in the paper, summarize the setup, offer some insightful observations I made and (unfortunately) expose the one problem I saw with the benchmark. First, EMC defines VPLEX™ Metro as synchronous block level replication with less than or up to 5ms round trip time (RTT) with block level read/write access on either side of the SAN. Now that is definitely cool. Did you get that??? You can write to the exact same mirrored blocks on either SAN frame – remote or local – simultaneously (or nearly so). That raised a couple of questions immediately in my mind we will get to in a bit. So, to set up their test, EMC simulated (emphasis all mine) a 100km distance by using an Empirix PacketSphere Network Emulator to induce a delay of 1ms.
STOP!!!! 100km is only 1ms??? Hmmmm….now the speed of light is 299,792,458 meters per second – or 299,792km/sec…which is a mere 300km/ms or so – so we are claiming 1/3rd the speed of light as our latency??? That is the one problem I had with this benchmark. Yeah, I know – supposedly light travels through fiber at 2/3rd the speed of light…so a round trip is 2x the distance and thus 1/3rd the speed of light. But that ignores the impact of networking equipment in between. Do yourself a favor – go to one of your hosts and try to ping someone 100km away and tell me what the latency is. I just tried to ping a host 20 miles away and got 10ms latency. 2000 miles away is 50ms….so the long haul stuff isn’t the cause. I have often argued that it is not the distance but the amount of equipment that causes the problem – and given we are all likely leasing infrastructure installed several years ago and designed before that, I doubt we have any single link >10km in a metro area, but more importantly, the amount of network equipment locally to attach to the long haul links is probably more than on the long haul links themselves. The only way around it is to rent a backhoe and dig yourself a long trench. Hopefully your fellow commuters won’t mind you taking out a few streets.
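For anyone who wants to redo the speed-of-light arithmetic, here is a minimal sketch. The 2/3 fiber factor comes from the paragraph above; the function name and the `equipment_ms` lump sum are my own illustrative assumptions, not anything from the EMC paper.

```python
# Speed of light in vacuum, expressed in km per millisecond (~299.8 km/ms).
C_VACUUM_KM_PER_MS = 299_792.458 / 1000.0


def fiber_rtt_ms(distance_km, fiber_fraction=2.0 / 3.0, equipment_ms=0.0):
    """Round-trip time over fiber in milliseconds.

    fiber_fraction: assumed ratio of light speed in fiber to vacuum (2/3).
    equipment_ms:   hypothetical lump-sum delay for switches, routers, VPN, etc.
    """
    one_way_ms = distance_km / (C_VACUUM_KM_PER_MS * fiber_fraction)
    return 2.0 * one_way_ms + equipment_ms


if __name__ == "__main__":
    for km in (10, 100, 500):
        print(f"{km:>3} km: {fiber_rtt_ms(km):.2f} ms RTT, propagation only")
```

With `equipment_ms=0.0` the 100km case lands almost exactly on 1ms, which suggests the emulated delay accounts for propagation and nothing else. Any non-zero equipment delay pushes the result above that.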
Looking a bit closer, I noted that they also discussed 10Gb networks between the sites. Nice. Let me think. Nope, not realistic either. First of all, they also made mention of VPN over that 10Gb network, which immediately in my mind adds some overhead – and they don’t mention if they did VPN during the test. Yes, you can (and should) use hardware VPN for such an implementation, but my thinking is that encryption probably adds more overhead than all the network switching gear which has to route and retransmit the packets plus the decryption on the other end….and we have more likely 10’s of milliseconds in 10’s of km….not 1 over 100. They then took unreality (okay – “highly idealistic and unlikely conditions”) and stretched it even further by claiming that VPLEX™ Metro could reach 500km and still be within the 5ms RTT threshold. Darn! So why is it so many of my bulk materializations lately require an airplane and removable disk packs???
Real life is a lot different. Recently on an internal test using sio for device speed testing, one of our internal systems achieved 50MB/s throughput and stopped scaling at 4-5 concurrent writes on EMC VMax’s – and I was quite surprised at that low of a number….I mean, shoot, I’ve played on EMC systems for years – and never seen anything that slow. We later found out that it was synchronously disk block replicated 10km between locations. Turning off synchronous disk block replication allowed it to scale to 200MB/s (16 concurrent writers) and it was still going higher on the limited write concurrency tests we asked for. I got to thinking about that a second. At 50MB/s, if we do a bit of simple math, that is 400Mb/s…which is sort of close to 512Mb/s – which is half of 1Gb/s. Now, given that SAP is a large global corporation, I can believe we have a 1Gb link between the campuses. 10Gb is even maybe doable at those distances if you are willing to bury dark fiber. Hmmm….my suspicion is that as with all WAN traffic, the network was provisioned and the disk block replication was allowed 512Mb/s….and that is the second thing. Given older and existing infrastructures, a 10Gb network between sites is a bit of a stretch unless you upgrade the infrastructure (at tremendous cost).
…my point. I really don’t think 1ms for 100km is even achievable. 10km….much more likely. 20km….with your own dedicated link. I think the answer really is more like 0.2-0.3ms/km latency on leased infrastructure. You might get better speed if you dig your own trench – but that also eliminates the need for VPN and eliminates half the network gear. And even then, considering bandwidth sharing with other systems, provisioning for other uses, etc., the result will be a flattening limit on the total throughput. The reason we were doing the tests was backup/restore speed….and 50MB/s for a multi-TB database yields some recovery times that are best not experienced unless you have to (~6 hours/TB). More importantly, it represents a scalability bottleneck that no additional cluster nodes will overcome. More on that in a second.
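The throughput numbers above are easy to sanity-check. This is only a sketch: it assumes decimal units (1TB = 1,000,000MB) and a perfectly sustained transfer rate, and the function names are mine.

```python
def mbytes_to_mbits(mb_per_s):
    """Convert megabytes/sec to megabits/sec (8 bits per byte)."""
    return mb_per_s * 8


def hours_per_tb(mb_per_s, mb_per_tb=1_000_000):
    """Hours to move one terabyte at a sustained rate (decimal TB assumed)."""
    return mb_per_tb / mb_per_s / 3600.0


if __name__ == "__main__":
    for rate in (50, 200):
        print(f"{rate} MB/s = {mbytes_to_mbits(rate)} Mb/s, "
              f"{hours_per_tb(rate):.1f} hours per TB")
```

At 50MB/s that works out to a bit over five and a half hours per TB, in line with the ~6 hours/TB quoted above. At the unreplicated 200MB/s it drops to under an hour and a half.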
And then I got thinking about a common bidirectional replication problem and EMC’s claim of block level access on either side. I wondered – how does EMC know what the database has modified in cache??? Remember, modern databases do not flush all their data changes to disk immediately. Instead, we tend to use data caches and some form of write-ahead logging to ensure recoverability. The problem as I saw it was that two people, one on each node, could theoretically try to make changes to the same block – perhaps different changes (overlapping vs. conflicting) – and since those changes were still in DBMS data cache, the SAN would never know. Then based on different read patterns or whatever, the writes could be staged to disk at different times by the checkpoint, cache MRU-LRU, housekeeper, whatever….and the result in my mind was similar to what commonly happens in bidirectional replication using the mythical most common form of conflict resolution – the last guy to write wins. Problem: the other guy’s changes are lost.
But then I realized that EMC was protected from this by Oracle RAC. A key aspect of Shared Disk Clusters (SDC) is the notion of cache coherency/cache synchronization. So…if two users attempt to write to the same block, in Oracle – whether RAC or not – they will be sequenced via the ITL list and due to cache synchronization, they will proceed serially (we will avoid a long discussion here on cluster locking and distributed lock management) – so writer #2 will get a copy of the (dirty) block with writer #1’s modifications. In other words, don’t try this at home with cluster software that doesn’t do cache synchronization.
A second question was how they dealt with the “split brain” problem of SDC. The term “split brain” is derived from the situation in which at least one node of the cluster has lost contact with the others and has its own thoughts about cache synchronization and locking states – which can result in corrupted disk blocks if it tries to write to disks that the remainder of the cluster believes it is controlling at that moment. ASE Cluster Edition prevented this problem by using the SCSI3 Persistent Group Reservation (PGR) to control which processes on which hosts were allowed to write to the data volumes. As a result, split brain was prevented by allowing the cluster to simply deregister nodes that have gotten isolated. The question is could PGR be implemented between two different frames on two different sets of physical disks?? Given that SANs play a part in PGR and given EMC’s technical prowess, I could see this being achieved somehow…but they don’t mention it. I wish they had…enquiring minds would like to know.
But then I remembered something. Oracle eschewed the standards in this case and implemented their own device in Oracle ASM™ – so once again, EMC was spared finding a solution to a difficult challenge. Not belittling what they have done – it is remarkable….but the task would have been much more difficult or nearly impossible if they would have had to coordinate host memory or host process states.
Now then, EMC went on to make the case for cluster scalability with this solution. In charts near the back of the paper, they showed a 4 node (0 distance) RAC cluster scaling from what looks like 7500 tpmC using a single node to 20000tpmC with all four nodes (66% scalability). They then repeated the test using their highly “idealistic” setup and showed the same four nodes with 2 sets 100km apart scaling from what appears to be 7000 tpmC (slightly lower) to 17000 tpmC. In other words, minimal impact – a mere 15% degradation for a final 57% scalability. Those charts will be guaranteed to sell lots of this in many boardrooms as it looks as if the magic bullet has been found.
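To make the percentages explicit, here is a small sketch of the scalability arithmetic. The tpmC figures are the ones I eyeballed from EMC’s charts, so treat them as approximate; the function name is just for illustration.

```python
def scaling_efficiency(cluster_tpmc, single_node_tpmc, nodes):
    """Fraction of perfect linear scaling actually achieved by the cluster."""
    return cluster_tpmc / (single_node_tpmc * nodes)


if __name__ == "__main__":
    local = scaling_efficiency(20_000, 7_500, 4)      # 0 distance
    stretched = scaling_efficiency(17_000, 7_500, 4)  # 100km, vs. the same baseline
    degradation = 1.0 - 17_000 / 20_000
    print(f"local:       {local:.0%}")
    print(f"stretched:   {stretched:.0%}")
    print(f"degradation: {degradation:.0%}")
```

Note that the 57% figure measures the stretched cluster against the 0 distance single-node rate of 7500 tpmC. Measured against its own 7000 tpmC single-node rate it would come out closer to 61%.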
Or has it?
I remember a customer benchmark of a single system test with four 6-core Intel Xeon processors that hit 3000 inserts/sec with an EMC frame and 100000 inserts/sec with a FusionIO SSD. Now, with the 66% scalability that EMC demonstrated with RAC at 0 distance (4×7500=30000tpmC vs. 20000 tpmC actual), to reach 100000 inserts/sec on the EMC frame we would need ~50 nodes. OUCH!!! I was also reminded of a startling statistic about Oracle Exadata™ Database machines – they often only have 2 nodes of RAC and a plethora of (okay, okay – 14) IO processing nodes, with as many or more cores doing IO processing as database query execution. And that was on scads of SSD’s (5TB+ and counting). In other words, if you scale out horizontally – how much of that horizontal scaling is being sucked up by IO processing??? And forget 100km – at that rate of 100000 inserts/sec, the customer’s desired HA solution became a bit of a challenge. Across the room – doable. Between buildings – probable. Long distance…uh oh…ouch…uh..uh…hmmmm. Problem. We needed to implement a different HA strategy that was a lot more complex than the ease with which disk block technologies are implemented. It would have involved duplicating market feeds and then selectively replicating local changes to reduce the volume to a level sustainable over long haul communications. Not the simple solution the customer wanted to hear….but then it is the only achievable solution….
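The ~50 node estimate can be reproduced with a one-line formula. This is a rough sketch that assumes (optimistically) that the 66% efficiency measured at four nodes holds at any node count, and that each node contributes the disk-bound 3000 inserts/sec; the function name is hypothetical.

```python
def nodes_needed(target_rate, per_node_rate, efficiency):
    """Nodes required to hit target_rate if every node delivers
    per_node_rate scaled by a constant efficiency (a big assumption)."""
    return target_rate / (per_node_rate * efficiency)


if __name__ == "__main__":
    n = nodes_needed(100_000, 3_000, 2.0 / 3.0)
    print(f"about {n:.0f} nodes to match one SSD-backed server")
```

In practice, efficiency tends to drop as nodes are added, so the real number would likely be worse.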
This all simply continues to prove that the slowest hardware in any computer is the disk drive and IO simply gets in the way of real work being done. So, you have three choices for scalability –
1) try to scale out using shared disks…and lots and lots and lots of hardware; or
2) replace the disks with SSD and make the IO tons faster, dramatically reduce the need to scale out; or
3) (gasp!) In-memory databases.
Hmmm….now lest everyone think I am late to jump on the HANA bandwagon…nope. I previously ranted on the SY blogs (“The Lurking Surprise” – July 2010) about the demise of disks and rise of in-memory databases. If you were aware and remember, prior to the acquisition by SAP, Sybase was developing a grid-based ASE solution that would have been in-memory and allowed a scale-out – and used Sybase Replication Server (Next Generation) to copy disk blocks and transactions to do so. Of course SAP had developed HANA which was a bit further along by that point with respect to in-memory computing and scaling. Great minds thinking alike. Of course, if using in-memory databases….disk block replication simply is no longer a viable availability solution.
Now as with anything magic, we all know in reality there is a “distraction” and then a bit of “sleight of hand”. In other words, the magician distracts you from what he/she is doing long enough to make the trick work. So…what is the distraction here??? Ah yes, that scalability claim along with that carrot of leveraging offsite resources. Good carrot!!! But, let’s focus on the real problem – high availability. It is true that such a solution would tremendously reduce the RTO and improve ROI of an availability scenario over SRDF™ alone – but is it really protection??? I am forced to remember the Chase Bank credit card system failure a few years ago in which their Oracle RAC cluster was down for 3 days due to a corrupted disk. Yep – they had a disk block replicated copy…..which, of course, was also corrupted as it was a faithfully and dutifully replicated image of the primary. Ouch. And therein lies the Achilles heel of disk block replication solutions – you only ever have a single disk image. Period. One.
Back when we were discussing ASE Cluster Edition with customers, I commonly would depict a “stretch cluster” with two cluster nodes and Sybase Replication Server in a peer-to-peer bidirectional setup. Yes, it does take a lot more effort to set up than disk replication (largely due to the peer-to-peer aspects), but you also gain increased availability as you now have protection from the myriad of problems that afflict single disk images – database corruptions, schema changes, DBMS major upgrades (vs. rolling minor upgrades), the proverbial “oops” user transactions, wide-spread geographic disasters, etc. Never mind the fact that over 90% of all system availability issues are planned downtimes. Think about it – a 2 hour maintenance window each day immediately translates into 91.7% availability – far below the oft touted 5 9’s availability goal. And with a single disk image – planned downtime is still downtime.
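The maintenance-window math is worth spelling out, since the gap to five 9’s is bigger than most people expect. A minimal sketch, with function names of my own choosing and a 365-day year assumed:

```python
def availability(downtime_hours_per_day):
    """Availability as a fraction, given a fixed daily outage window."""
    return 1.0 - downtime_hours_per_day / 24.0


def allowed_downtime_minutes_per_year(target, days=365):
    """Total downtime budget per year for a given availability target."""
    return (1.0 - target) * days * 24 * 60


if __name__ == "__main__":
    print(f"2h/day window: {availability(2):.2%} available")
    print(f"five 9's budget: "
          f"{allowed_downtime_minutes_per_year(0.99999):.2f} minutes/year")
```

Put differently, a single 2 hour window burns through more than twenty years of a five 9’s downtime budget.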
Also, as I noted earlier, a single disk image (even at the highly idealistic limit of 100km) promotes a highly centralized architecture. In other words, as with most global corporations – users from all over the globe connect to a pair of systems only a short distance apart. The result is those systems must have a higher availability due to different time zones and must support higher transactional rates due to the aggregated volume. And these two considerations have a cumulative negative impact on system availability. If the system is located in EMEA (as many of ours are), the nightly batch processing impacts NAO/LAO or APO daytime users. Think backups for instance. I worked with one customer at which the nightly BCV copy/split operation for system backups saturated the SAN write cache so much that physical reads from disk were blocked for over 2 hours. Of course, the database got blamed for poor performance…at least until we demonstrated the real cause. In addition, such systems have to support 10’s of thousands of users with multiple peak periods. One global trading system I know has to contend with 20 market closures per day. As a result, the hardware has to be much larger than an equivalent distributed system. For example, if we use the 80:20 rule of thumb, we can assume 80% reads and 20% writes. So…at 20000 tpmC we would have 4000 writes and 16000 reads. Now if we divided the system into a fully distributed system across 3 geos (EMEA, Americas, APO) and if we assumed evenly distributed workload, each site would only have to support 4000 write and 5333 read tpmC. You might think – wait a sec – 4000 writes divided by 3 geos is 1333 writes per geo…and it is. But in a pure peer-to-peer (aka shared primary) configuration, we need to have each node have all the data (unlike a split primary), so each write will occur at each site. The result, however, is that each of the nodes only needs to sustain about 9333 tpmC – not much more than the single node 7500tpmC transaction rate.
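Here is the per-site load calculation as a sketch. It assumes the 80:20 read/write split and evenly distributed workload described above, with every write applied at every site (shared primary); the function name is mine.

```python
def per_site_tpmc(total_tpmc, write_fraction, sites):
    """Load each site must sustain in a shared-primary (peer-to-peer) setup.

    Reads are split across sites; every write is replicated to every site,
    so each site still applies the full write volume.
    """
    writes = total_tpmc * write_fraction
    reads = total_tpmc - writes
    return writes + reads / sites


if __name__ == "__main__":
    print(f"3 geos: {per_site_tpmc(20_000, 0.2, 3):.0f} tpmC per site")
    print(f"1 site: {per_site_tpmc(20_000, 0.2, 1):.0f} tpmC")
```

The more read-heavy the workload, the bigger the saving from distributing it. A write-heavy workload gains far less, because the replicated writes do not shrink with more sites.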
Consequently, we can get away with smaller nodes than a highly centralized architecture proposes – reducing hardware acquisition costs and gaining completely autonomous operations – not to mention improved application response times due to reduced WAN latency. In addition, much of the batch processing and availability requirements would be localized….think about it – when do you schedule major outages for global systems? There simply aren’t too many global 3-day holiday weekends….especially for a retail bank or retailer who is hoping that all those holiday folks are out shopping.
The net?? If we believe the highly unrealistic network times, yes, MetroClusters do improve the RTO of a disk based availability solution. I think it would work great over a short distance (5-10km). However, it doesn’t really improve availability as the same most common single disk image failures are still unprotected and it leads to higher acquisition and infrastructure costs due to increased hardware and availability infrastructure for a highly centralized implementation.
Real protection requires replication.