More virtual than real, part 2
Last weekend was the biggest hardware change in my tenure at my current job, coinciding with the biggest snowstorm on record in the Baltimore area. The teams that executed this change had plenty of practice, a lot of training, and good vendor support to fall back on. I’ll tell the bigger story later, once I gather more trending data, but for now, here’s a short story comparing the Solution Manager and EarlyWatch tools with freeware and homegrown ones.
In the previous blog, I talked about virtual memory. But that’s been around for decades, with the DEC (no pun intended) VMS operating system one of the early successes. We were one of the few shops that started running SAP R/3 on that operating system, only to find it wasn’t widely adopted, so we moved to a more conventional OS (Tru64). Within the past decade, improvements in hardware stability and architecture, as well as in operating system and hypervisor software, have brought about virtual CPUs. Nothing new, although many companies we talked to were just beginning to implement virtual-based applications, particularly in production use. It might be good to be a risk taker, but losing access to your ERP system can be devastating.
We felt the technology was mature enough in this hardware and software generation to combine production and non-production systems in the same cabinet, splitting the load between 2 data centers, with a mix of larger and smaller systems (the biggest is a 24-way behemoth, and the smallest are 8-way powerhouses). Having larger and smaller systems became an exercise in what we called “Tetris” sizing, trying to estimate how the shared systems would fit together most compactly.
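For readers who want to play with the “Tetris” idea, here is a minimal sketch of one way to approach it: a first-fit decreasing packing, which places the biggest partitions first and puts each one in the first frame that still has room. This is not the method we actually used, just an illustration. The frame sizes echo the 24-way and 8-way boxes above, but the partition names and CPU counts are made up.

```python
def pack_first_fit_decreasing(partitions, frames):
    """Place partitions on frames, largest first.

    partitions: dict of partition name -> CPUs needed (hypothetical values)
    frames:     dict of frame name -> CPU capacity
    Returns (placement, free) where placement maps frame -> list of
    partition names and free maps frame -> CPUs left over.
    """
    free = dict(frames)
    placement = {name: [] for name in frames}
    # Biggest pieces first, like dropping the long Tetris blocks early.
    ordered = sorted(partitions.items(), key=lambda kv: kv[1], reverse=True)
    for part, need in ordered:
        for frame in frames:
            if free[frame] >= need:
                placement[frame].append(part)
                free[frame] -= need
                break
        else:
            raise ValueError(f"no frame has room for {part} ({need} CPUs)")
    return placement, free


if __name__ == "__main__":
    # Illustrative numbers only; not our real landscape.
    frames = {"frame_a": 24, "frame_b": 8}
    partitions = {"db": 8, "app1": 4, "app2": 4, "qa": 6, "dev": 2, "sandbox": 2}
    placement, free = pack_first_fit_decreasing(partitions, frames)
    for frame, parts in placement.items():
        print(frame, parts, "free:", free[frame])
```

First-fit decreasing is a simple heuristic and will not always find the tightest fit. It also treats CPU entitlement as a fixed number, which shared-processor partitions are not, so take it as a starting point rather than a sizing tool.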
The new hardware allows us to move running systems between frames (IBM calls this Live Partition Mobility, a PowerVM feature), meaning that if we need to load balance or take systems offline for power-down maintenance, users will see less, or no, service interruption. We’ve used that feature in testing. Trying it on productive systems might be a few more months away, until we’ve practiced it again and again.
What about metrics? With partitions sharing a frame, the old rules about 4-way and 8-way (or larger) CPU database or application servers change. Formerly, we’d assign the databases to the biggest boxes and the application servers to the smaller ones, and when either began to run out of CPU, we’d need to add more or move them to new boxes. Having capacity on demand simplified that process somewhat, but we were still locked into cabinet size. The first chart below might look familiar to those who view EarlyWatch reports, generated from Solution Manager. This is a capacity view, where the busiest hour of each week is reported, or something to that effect. I’ve quoted the report below. As long as the saposcol process is running, we get data (note we lost a week in September 2009).
Part of my job is to spot the anomalies, like the near-100% values near the beginning of December 2009. Can you tell the difference between a legitimate long-running process and a bogus one? It can be tricky sometimes. Having a visible spike at least gives a clue there is something worth looking into. You may see that the app servers ranged between 40 and 75 percent, while the database server had a smaller range, between 40 and 60 percent. This resulted from more CPUs on the database server, which tends to smooth out the averages. As our “old” app servers had only 2 CPUs (4-way SMT), it didn’t take too many batch processes to rack up a high average.
Last week was the first full week of our Power6 virtual CPUs running SAP R/3, and EarlyWatch tells me that all the systems ran at close to 100% CPU. Except that I watched them myself, and this is incorrect.
For the first comparison, here is data from OS06, daily averages for the past several weeks. The blank day is when we shifted hardware platforms; database and application servers are both shown. Low values are on weekends.
The next (and last for now) chart shows CPU at the micropartition level, aggregated to the frame. As this chart from RRDtool auto-scales, it does not show that there are 24 CPUs available. The large green area is the R3 database server, which appears to be continually using at least 2 CPUs once data were available on Monday. Coming back to runaway processes: unless you knew this average was higher than it should be, you might not look any further. We have a vendor process which doesn’t seem to be behaving well, and probably needs to be terminated. As there is plenty of headroom, there is no urgency.
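To make the idea behind that chart concrete, here is a small sketch of rolling per-partition CPU consumption up to the frame and finding each partition’s floor, meaning the least it ever used over the period. A partition whose floor never drops near zero, like the database server sitting at 2 CPUs, is a candidate for a runaway process. The partition names, sample values, and the 2-CPU threshold are assumptions for illustration; this is not how RRDtool stores or graphs the data.

```python
def frame_usage(samples):
    """Summarize CPU use across partitions on one frame.

    samples: list of dicts, one per interval, mapping a partition name to
             the physical CPUs it consumed in that interval (made-up data).
    Returns (frame_totals, floors): the frame-wide total per interval, and
    the minimum each partition consumed across all intervals.
    """
    frame_totals = [sum(s.values()) for s in samples]
    names = set().union(*samples)
    # A partition missing from an interval counts as using 0 CPUs.
    floors = {n: min(s.get(n, 0.0) for s in samples) for n in names}
    return frame_totals, floors


def never_idle(floors, threshold):
    """Return partitions whose usage never fell below threshold CPUs."""
    return sorted(n for n, low in floors.items() if low >= threshold)


if __name__ == "__main__":
    samples = [
        {"r3db": 2.5, "app1": 0.5},
        {"r3db": 2.25, "app1": 0.25},
        {"r3db": 3.0, "app1": 1.0},
    ]
    totals, floors = frame_usage(samples)
    print("frame totals:", totals)
    print("never below 2 CPUs:", never_idle(floors, 2.0))
```

The floor is a blunt instrument. A busy database can legitimately stay above a threshold all day, so a hit here is a reason to look closer, not proof of a bad process.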
There are 10 systems running together in the 24-way box, each of which would have required at least 2 CPUs in the non-virtual world. We have room for more, as well as expansion capacity of 8 more CPUs sometime in the future. The current target is 20 partitions, not counting the 4 virtual I/O partitions used for, well, I/O.
EarlyWatch metrics explanation:
This analysis focuses on the workload during the peak working hours (9-11, 13) and is based on the hourly averages collected by SAPOSCOL. For information about the definition of peak working hours, refer to SAP Note 1251291.
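As a rough illustration of why a peak-hours figure and a daily average can tell different stories, here is a sketch that computes both from one day of hourly averages. The choice of hours 9, 10, 11 and 13 is only my reading of “(9-11, 13)” in the quoted text, not the definition from the SAP Note, and the sample numbers are invented.

```python
# Assumption: one possible reading of "(9-11, 13)"; check the SAP Note
# for the real definition of peak working hours.
PEAK_HOURS = (9, 10, 11, 13)


def peak_average(hourly, peak_hours=PEAK_HOURS):
    """Average CPU percent over the peak hours only.

    hourly: dict mapping hour of day (0-23) to average CPU percent.
    Returns None if no peak hour has data (for example, the collector
    was not running).
    """
    values = [hourly[h] for h in peak_hours if h in hourly]
    if not values:
        return None
    return sum(values) / len(values)


def daily_average(hourly):
    """Average CPU percent over every hour that has data."""
    if not hourly:
        return None
    return sum(hourly.values()) / len(hourly)


if __name__ == "__main__":
    # Invented day: quiet at 10 percent except during working hours.
    hourly = {h: 10.0 for h in range(24)}
    hourly.update({9: 80.0, 10: 60.0, 11: 70.0, 13: 30.0})
    print("peak average: ", peak_average(hourly))
    print("daily average:", round(daily_average(hourly), 1))
```

With these made-up numbers the peak-hours figure is several times the daily average, which is why the two kinds of chart in this post should not be compared directly.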
I can’t expect to find “high CPU” processes just by looking for daily averages above 20 or 30 percent anymore. It will be normal to see averages of 60 or 70%, since the virtual landscape dynamically shifts CPU allocations to match workload. I’m going to be learning new measuring and sizing methods to find the poorly performing code and make sure we spend just enough to run the business.