High availability means keeping your system up as long as possible. This in turn means having all parts of your system available redundantly, more or less. Seen from a “simple minds” perspective (we want to be politically correct, don’t we?) this is just having two servers instead of one, so that the second server can take over if the first one fails. Unfortunately it is not that easy. As you might have read in my first “landscape for idiots” blog, even what we call a server in general has several parts that may be redundant. To clarify this we have to introduce the “single point of failure” (SPOF): any unit that brings down the system when it is not available. Your system is full of Spofs, you think? You might be very correct! The first step towards redundancy usually is to have more than one server (again this spooky word), which usually means you install a second or further Application Server, or Dialog Instance (both mean the same thing). Now, this is really redundant. But is that all?
In Search of SPOFs
Well, in case you are an expert, at this point you might see me concentrating on the software alone and be ready to give me some advice and show these guys from Walldorf they are not always perfect. While I’m always ready to learn, this time you have to wait: Of course your hardware is absolutely a matter of high availability too, and the work is not done by keeping the machines redundant. You have to think about a redundant network, including cables, network cards, routers and whatever else. If going to the internet, please keep in mind that the internet itself was designed to survive thermonuclear strikes, but your provider may not be. But back to NetWeaver specifics.
The Usual Suspects
Even if you have a whole (software) cluster of SAP Web Application Servers, it still contains Spofs. Two units exist only once in the system: our famous Message Service and Enqueue Service (sometimes also called Servers, but we want to keep it simple). Both are quite important for the system, and their failure would stop the system over time (not necessarily immediately!). To protect them we need a “hardware cluster”, also called a “switchover system”.
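To make the Spof hunt a bit more tangible, here is a minimal sketch in Python. It is not an SAP tool; the landscape list, the host names and the function `find_spofs` are made up for illustration. It simply treats every unit type that runs on fewer than two different hosts as a Spof.

```python
from collections import defaultdict


def find_spofs(deployed_units):
    """Return the sorted unit types that run on fewer than two distinct hosts."""
    hosts_by_type = defaultdict(set)
    for unit_type, host in deployed_units:
        hosts_by_type[unit_type].add(host)
    return sorted(t for t, hosts in hosts_by_type.items() if len(hosts) < 2)


# Hypothetical landscape as (unit type, host) pairs; all names are assumptions.
landscape = [
    ("application server", "host-a"),
    ("application server", "host-b"),
    ("message service", "host-a"),
    ("enqueue service", "host-a"),
    ("database", "host-c"),
    ("network switch", "rack-1"),
    ("network switch", "rack-2"),
]

print(find_spofs(landscape))
```

In this toy landscape the application servers and switches are redundant, while the message service, the enqueue service and the database are reported as Spofs. Note that two copies on the same host do not count as redundant here, since the host itself would then be the single point.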
The Big Trick
A switchover system works the following way. There are two computers, or hosts. In case one goes down the other one takes over. Simple, isn’t it? Oh, you’re so clever! Of course it isn’t. A lot of questions come up. How does the machine taking over tell its clients it is taking over? How does it know it has to take over?
As we now know how it’s done, we have to admit there are two ways this is done in NetWeaver, due to its two heart chambers, ABAP and Java. The traditional way was to run our two Spofs inside an instance called the “central instance”. Because they are bound to a running application server there, this whole instance has to be switched in case of failure. This is no big deal, as the ABAP instance itself can be switched quite fast. Switching is done in three steps:
- stop the server
- switch to the redundant machine
- start the server again
Why stop it? Remember, we detect a failure by asking the system if it is still there. If we don’t stop it, it may continue to run in part, causing unpredictable chaos. For an ABAP central instance the switch can be done in two to five minutes, depending on the hardware. For Java this is different and may take longer than fifteen minutes. “Slow” Java again? Stupid! As you might know, ABAP loads programs from the database only on demand (the stopwatch in the lower left corner tells you). Java loads all applications on startup. That obviously takes more time. Because of that, Java was delivered in a new way from the beginning.
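The three switching steps, plus the “are you still there?” check, can be sketched in a few lines of Python. This is only a toy model under my own assumptions: the `Host` class, `switch_over` and `monitor_once` are invented names, not part of any SAP or cluster product, and how clients find the new machine is deliberately left open.

```python
class Host:
    """Hypothetical stand-in for one machine of the switchover pair."""

    def __init__(self, name):
        self.name = name
        self.running = False  # is the central instance started here?
        self.healthy = True   # does it still answer correctly?

    def answers_heartbeat(self):
        # A half-dead instance is running but no longer answers.
        return self.running and self.healthy


def switch_over(failed, standby):
    """Perform the three steps and return (new_active, new_standby, steps)."""
    steps = []
    # Step 1: stop explicitly, so a half-alive instance cannot keep working.
    failed.running = False
    steps.append(f"stop the server on {failed.name}")
    # Step 2: assumption: clients get redirected somehow; mechanism not modelled.
    steps.append(f"switch to {standby.name}")
    # Step 3: start the instance on the redundant machine.
    standby.running = True
    steps.append(f"start the server on {standby.name}")
    return standby, failed, steps


def monitor_once(active, standby):
    """One monitoring round: switch only if the active host stays silent."""
    if active.answers_heartbeat():
        return active, standby, []
    return switch_over(active, standby)


host_a, host_b = Host("host-a"), Host("host-b")
host_a.running = True
host_a.healthy = False  # simulate an instance that hangs but has not exited

active, standby, steps = monitor_once(host_a, host_b)
for step in steps:
    print(step)
print("active host is now", active.name)
```

The simulated failure is exactly the nasty case from above: the instance is still running but no longer answers, which is why the first step forces it down before anything is started elsewhere.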
The Conjoined Way
Message and Enqueue Service are delivered in an extra instance called “server central services” (SCS) and run independently of the Java part itself. With that, the Java server is in a new position. When switching a system we always lose the current session content anyway (for standard applications; we’ll check later what that means). That means if we have a bunch of redundant application servers, it is not worth the effort to switch them. Switching the SCS makes the Spofs available again (and therefore makes them non-Spofs), and users who lost their application server can reconnect to any of the other redundant servers. Because the SCS contains only two very small processes that just have to be started again, we are much faster doing that. Actually the SCS can be back in far less than a minute. However, whether that is of benefit for real systems is under investigation, as we have to factor in some more details.
What about ABAP? Is it possible to have this advantage there as well? Absolutely! For one, it was always possible to put Message and Enqueue in an extra instance by setting this up manually. Second, there is some good news: from ERP 2005 (which is based on minor release NetWeaver 2004s) on, the installation supports this step and installs an ASCS instance for your system. From then on there will always be an SCS and an ASCS instance. Well, as long as we don’t decide to put them together in one, as the services are only kept in different instances so that the two worlds (ABAP and Java) do not influence each other. As the services are made from the same code, they could serve both worlds in one instance, once we are sure nothing unexpected will happen.
No SAP system runs without a database. Does this qualify as a Spof? Absolutely. As it is not a single DBMS we are supporting, there are at least as many ways to solve this problem as there are database vendors. Even more, since some vendors have specialised in cross-system high availability. Most basically there are two ways to secure your database:
- Shadowing your database by replication
- Clustering your database
The most used method today is the first one, as it is easier to do. Your DB just writes every arriving transaction to a second instance of the DB as well, so it is written twice. In case of failure, WebAS only needs to connect to the second DB. Some customers prefer this because you can artificially delay the replication. Now, why should you do this? What good is a system that survives any kind of error if everything is alright from a software view, but not from a human view? Meaning, if the consistency of the data is disrupted, this is not checked at every possible point. Such a case is usually noticed quite fast, and if your shadow runs half an hour behind reality, it is easy to stop it and switch back.
Clustering the database does not mean the same as a switchover or hardware cluster. It means there are two or more instances of the DB process running on the same DB files. In case of failure your application may switch to a second instance and continue immediately. Almost. Your current transaction is surely gone.
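The idea of a shadow database that deliberately runs behind reality is easier to see in a small simulation. The following Python sketch is a toy model and not how any real DBMS implements replication; the class `DelayedShadow`, its methods and the minute-based clock are all assumptions made for illustration.

```python
class DelayedShadow:
    """Toy shadow database that applies replicated changes only after a lag."""

    def __init__(self, lag_minutes):
        self.lag_minutes = lag_minutes
        self.pending = []    # (commit_minute, key, value) in commit order
        self.data = {}       # what the shadow has applied so far
        self.frozen = False  # set once a human spots bad data on the primary

    def replicate(self, commit_minute, key, value):
        """Receive a committed change from the primary; do not apply it yet."""
        self.pending.append((commit_minute, key, value))

    def advance_to(self, now_minute):
        """Apply every pending change that is at least lag_minutes old."""
        if self.frozen:
            return
        still_pending = []
        for commit_minute, key, value in self.pending:
            if commit_minute + self.lag_minutes <= now_minute:
                self.data[key] = value
            else:
                still_pending.append((commit_minute, key, value))
        self.pending = still_pending

    def freeze(self):
        """Stop applying changes so the shadow keeps the last good state."""
        self.frozen = True


shadow = DelayedShadow(lag_minutes=30)
shadow.replicate(0, "order-1", "ok")
shadow.replicate(50, "order-1", "corrupted")  # logically wrong, technically fine

shadow.advance_to(60)   # the bad change is only 10 minutes old, so it waits
shadow.freeze()         # somebody notices the problem at minute 60
shadow.advance_to(120)  # no effect any more

print(shadow.data)
```

Because the corrupting change was still waiting out its half hour when the shadow was frozen, the shadow keeps the last good value and can be used to switch back.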
SDM, Batch, Spool?
The world seems bright and shiny, and suddenly you think about these services: SDM for Java, and Spool, Batch and some others for ABAP. What happens to them? First the more complicated one, SDM. The Software Deployment Manager (SDM) is in fact a Spof, but not for the complete Java system. It only affects the system’s ability to deploy new applications or upgrade them. Therefore we do not consider it to be system critical. If it is for you, you are thrown back to the state where you need to secure the SDM too. At least until next year, when the SDM itself will be distributed across the application servers… The same is true for the ABAP services that usually sit on the central instance. They have to be distributed across all available application servers.
Don’t forget WebDispatcher
And then there was this other part, WebDispatcher (software load balancing, if you want it short). Of course it is a Spof too. Many of our customers prefer to keep the WebDispatcher on the hardware clustered machines, as these are usually the more expensive ones and because they are able to switch over. Unfortunately this only works as long as you are working inside the firewall. As soon as you need to go through a firewall there is no way around having WebDispatcher on its own machine, and therefore on its own hardware cluster (which adds up to two machines). There are alternatives to WebDispatcher, of course. But none of them would solve the issue. The only advantage you have over the application server is that WebDispatcher does not keep state, which means it can run on both machines at once and does not need to be restarted or carefully brought down or whatever.