Master of the obvious… that’s me!!!
One thing I hadn’t realized until I started researching HA/DR incidents for recent stats is just how often failures happen. Maybe I am a bit isolated thanks to the excellent services our IT org provides… but here are some interesting tidbits I learned:
- Data center failures happen on average about twice per year per customer. Ouch. Now I – like many others – was aware of the quarterly DC downtime for maintenance, but wasn’t quite aware that we tended to have so many unplanned outages. Turns out (not surprisingly) that the biggest cause of DC outages is power related. Second place was water. The first I understand – that’s the downside of all those cores & spindles. The second makes sense when you realize that most DCs are at ground level… and those leaking pipes above… or that ground-water drainage problem… and suddenly we have H2O in the HPs. The other one I was looking for was backhoes, as I know they habitually take DCs down when they dig up the cable…
- The longest downtimes are due to hacking. Now that one makes sense when you think about it… forensic analysis takes a while, and you can’t put the system back online until you are sure all the bad stuff is gone.
- Human error is the leading cause… gee… given the complexities of today’s systems, this one is almost a gimme.
- Another leading cause is over-capacity/surge. Those of you who tried to get those $1 airfares from a certain low-cost airline recently understand the problem. And they aren’t alone – a well-known discount department store suffered two major outages when internet orders flooded their systems on the first day new clothing lines from certain fashion brands launched.
- The most common failure time is right after an upgrade – either hardware or software. I was amazed at how often network upgrades were the culprit. I assumed application upgrades would rank high, and they did – but network upgrade failures tied application upgrades as a cause of major outages.
- It can happen a lot… one leading internet auction site had 14 major outages in 2014 alone. They got hacked… ran out of IPv4 addresses – you name it.
I would have thought that by now, as smart as all of us IT folks are, outages would be diminishing… or at least much less frequent. Hardware reliability is fairly high compared to a decade (or two) ago, especially with SSDs eliminating moving parts… But, alas, we still have software to blame things on.
The reason I was doing this research is that in my job as a Product Manager for ASE, I was tasked with focusing on the ASE Always-On option that we are releasing with ASE 16sp02. While it has been under development for some time, it was fun (and sometimes exasperating) to get my hands on it and take it for a test spin. One of the considerations behind Always-On is to address two of the problems above – data center failures and upgrades.
With regard to the former, Always-On is based on typical HADR synchronous replication. This allows you to take a hit on the primary and fail over to the standby with minimal latency and zero data loss. While zero data loss is the key requirement, the latency was the real surprise – even at higher volumes. The solution is based on SAP Replication Server, and engineering has done such a good job of pre-tuning it that, even under heavy load, one customer was able to get ~20K rows/sec from primary to standby with <5 seconds of latency. Internally (and with better HW) we were able to do much better… And to think I remember categorizing 5K rows/sec as high volume a few years ago…
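To get a rough feel for what those throughput numbers imply, here is a back-of-the-envelope sketch. The arithmetic and all the rates in it are my own illustrative assumptions, not benchmark results: if the primary briefly commits rows faster than the standby can apply them, a backlog builds, and the time to drain it depends on the headroom left over the steady-state load.

```java
// Back-of-the-envelope replication backlog math. All numbers are illustrative
// assumptions for the example, not measured ASE/SRS figures.
public class ReplicationBacklog {

    // Assumed standby apply capacity, loosely inspired by the ~20K rows/sec figure.
    static final double APPLY_RATE = 20_000.0;

    /**
     * Seconds needed to drain a burst: the backlog accumulated while the burst
     * outran the apply rate, divided by the spare capacity left at steady state.
     */
    static double catchUpSeconds(double burstRate, double burstSeconds, double steadyRate) {
        double backlog = Math.max(0, (burstRate - APPLY_RATE) * burstSeconds);
        double spareCapacity = APPLY_RATE - steadyRate; // headroom available to drain it
        if (spareCapacity <= 0) {
            return Double.POSITIVE_INFINITY; // standby never catches up
        }
        return backlog / spareCapacity;
    }

    public static void main(String[] args) {
        // A 10-second burst at 30K rows/sec over a 5K rows/sec steady load:
        // backlog = 10K rows/sec * 10 s = 100K rows; spare capacity = 15K rows/sec.
        System.out.printf("catch-up time: %.1f s%n", catchUpSeconds(30_000, 10, 5_000));
    }
}
```

The point of the sketch is simply that latency under load is governed by apply-rate headroom, which is why the pre-tuning matters.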
With respect to the upgrades, ASE Always-On will be just that – always on – even during upgrades. Yes, even when you upgrade to ASE 17 or whatever the next major release is. The installer installs the new software and then manages the failovers/failbacks in such a way that overall processing is never interrupted.
Not only that, but failover times were quite impressive. I spent a lot of time recently doing `kill -9`s on my dataservers to simulate a crash. When the project originally started, some fairly long times were discussed as a possibility (which influenced the JDBC API a bit). However, under testing (including customer tests), failover proved much faster than OS clustering and competitive with ASE Cluster Edition in some cases.
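For the curious, the measurement approach is simple enough to sketch. This is a hypothetical stand-in for that kind of test, not product code: poll the server with a cheap probe (in a real test, a trivial query from a JDBC client) and time the gap between the first failed probe and the first success after it. The probe below is faked so the sketch is self-contained.

```java
import java.util.function.BooleanSupplier;

// Sketch of a failover-time measurement harness. The probe is injected so the
// example runs standalone; a real probe would execute a trivial query over JDBC.
public class FailoverTimer {

    /**
     * Polls `probe` every pollMillis. Returns the observed downtime in milliseconds
     * (first failure to first subsequent success), or -1 if no failure was seen
     * within maxPolls attempts.
     */
    static long measureDowntimeMillis(BooleanSupplier probe, long pollMillis, long maxPolls)
            throws InterruptedException {
        long downSince = -1; // nanoTime of the first failed probe, -1 while still up
        for (long i = 0; i < maxPolls; i++) {
            boolean up = probe.getAsBoolean();
            if (!up && downSince < 0) {
                downSince = System.nanoTime(); // outage begins
            }
            if (up && downSince >= 0) {
                return (System.nanoTime() - downSince) / 1_000_000; // outage over
            }
            Thread.sleep(pollMillis);
        }
        return -1;
    }

    public static void main(String[] args) throws InterruptedException {
        // Fake probe: "down" for the first 5 polls, then back up.
        final int[] calls = {0};
        BooleanSupplier probe = () -> ++calls[0] > 5;
        long downtime = measureDowntimeMillis(probe, 10, 100);
        System.out.println("observed downtime ~" + downtime + " ms");
    }
}
```

In practice you would run something like this in a loop from a client box while killing the primary, so the timing includes what the application actually experiences – detection, failover, and reconnect.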
Now, doing this did take a few changes to both ASE and SRS. For example, we’re all familiar with “near zero” downtime approaches in which you get ready to fail over and, at the appointed time, you kill all the applications (or have them log out), flip to the new system, and restart the applications. With luck and enough staff, it can be done in minutes. But ASE 16sp02 adds a new soft quiesce feature that eliminates even that – as a result, planned outages can happen without disrupting the applications.
There were some other modifications made as well. Next Wednesday, I am hosting a webcast on the ASE 16sp02 Always-On option. If you would like to hear more – join me.
Join us for a webcast September 30th, 10:00 EDT: SAP ASE 16 Always-On – The ultimate in zero downtime/zero data loss. I’ll be discussing the new SAP ASE 16 Always-On feature, which leverages synchronous log replication and SAP’s leading database replication technologies to create a comprehensive solution for customers’ high availability (HA) and disaster recovery (DR) requirements. This new feature allows zero downtime during major upgrades to SAP ASE or key hardware components (such as a complete storage upgrade) while being impervious to data storage failures. Offering zero data loss in an HA configuration, it helps reduce application RTO while providing transparent application failover for both planned and unplanned outages.