[UPDATED 24-Jul-2009 – See below]
Ten years ago, we set up enterprise monitoring of our SAP and non-SAP systems. The architects who described the vision(s) typically drew data repositories as cylinders, like a (number 10) can of corn, and agents at the operating system level as tiny triangles (and there were a lot of them even then), with lots of lines connecting the agents, the agent handlers, and the filtering root cause analysis steampunk engines. The highest level, the Manager of Managers, was also depicted as a triangle, but more like the one on the back top part of a US dollar bill. Where are we today? I’d say we’re pretty close to the same vision, though with more agents being run, and more lines connecting the agent networks. Oh, and we now have Java to monitor, or to help us monitor.
Let’s look at a few moving parts of the SAP Solution Manager infrastructure.
| Agent | Purpose | Notes |
|---|---|---|
| saposcol | Basic operating system data collection | ‘a legacy’ |
| sapccm4x | CCMS agent (level 2) | See my earlier blogs |
| sapccmsr | CCMS agent (level 2) | … 7606 7607 7725 9886 |
| ccmsping | CCMS agent (level 2) | … “Agent Health” in addendum 1 |
| smdagent | Solution Manager Diagnostics | Runs under Java |
One of the truisms of my job is that I’d better find out if an agent is no longer running, because no news is not always good news. Over time, we’ve adopted measures to ensure we’re not looking back at an incident and discovering a giant blank spot in the archives. Depending on an operator to view a screen to see if all of the lights are green might seem the simplest approach, but suppose the system behind that screen has a meltdown, locking itself in the ‘fail safe’ mode? I generally advocate running fire drills at random intervals, where we trigger an error condition and see if the alert gets escalated properly.
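The “no news is not always good news” idea can be sketched as a staleness check: record when each agent last reported, and flag any that have been silent too long. This is only an illustration; the function name, the ten-minute threshold, and the sample timestamps are my assumptions, not anything taken from our tool set.

```python
from datetime import datetime, timedelta


def silent_agents(last_seen, now, max_silence=timedelta(minutes=10)):
    """Return the names of agents whose last report is older than max_silence."""
    return sorted(
        name for name, seen in last_seen.items() if now - seen > max_silence
    )


# Made-up sample data: two agents reported recently, one went quiet hours ago.
now = datetime(2009, 7, 24, 12, 0)
last_seen = {
    "saposcol": datetime(2009, 7, 24, 11, 58),
    "sapccm4x": datetime(2009, 7, 24, 11, 59),
    "smdagent": datetime(2009, 7, 24, 9, 30),
}

print(silent_agents(last_seen, now))
```

The point of the design is that silence itself raises the flag, so the check does not depend on the agent being healthy enough to send an error.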
As we’ve recently transitioned from one monitoring tool set to another (and neither was “SAP Solution Manager,” by the way), I lost some features and gained others. During the transition period I built a few “temporary” monitoring practices in order to prove that both tools gave equivalent answers, by comparing them to a third opinion. One way was to use the SAP OS06 transaction to pull data together, as collected by the saposcol agent. Every now and again, a system was bounced and, for some reason, that agent didn’t restart. Holes in the history record.
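Holes like these are easy to find after the fact if you have the sample timestamps. Here is a rough sketch, assuming the collected samples can be exported as a list of timestamps at a known interval; the hourly interval, the tolerance factor, and the sample data are invented for illustration and say nothing about how saposcol or OS06 actually store their history.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, expected_interval, tolerance=2.0):
    """Return (start, end) pairs where consecutive samples are further
    apart than expected_interval * tolerance."""
    ordered = sorted(timestamps)
    limit = expected_interval * tolerance
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > limit]


# Made-up hourly samples with nothing recorded between 02:00 and 06:00.
samples = [datetime(2009, 7, 1, hour) for hour in (0, 1, 2, 6, 7)]

for start, end in find_gaps(samples, timedelta(hours=1)):
    print(f"no data between {start} and {end}")
```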
Also, as we upgraded our BW system to 7.0, we needed to roll out new agents to each node. That was generally smooth, but given the number of different systems, stress during cutover times, etc., we missed a couple of patches. More holes in Albert Hall.
My normal routine in getting up to speed on monitoring agent architecture is to go into the OS level (Windows or UNIX) and look at running processes. That worked for the earlier generations, where the agent was autonomous, single purpose. Today with Solution Manager and the various organ donations that occur between or among vendors, it’s a bit harder to tell if all of your agents are in the field, not to mention returning you the hard data, and not chicken feed.
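For what it’s worth, the “look at running processes” routine can be scripted. The sketch below assumes a UNIX host with a POSIX `ps`, and it matches agent names against full command lines rather than process names, because an agent hosted in Java would show up as `java`. The agent names come from the table earlier in this post; the sample command lines are made up, and the real ones on your hosts will look different.

```python
import subprocess

# Names from the table above; adjust the list for each host.
EXPECTED_AGENTS = ("saposcol", "sapccm4x", "sapccmsr", "smdagent")


def command_lines():
    """Return the full command line of every process (POSIX ps, UNIX only)."""
    result = subprocess.run(
        ["ps", "-e", "-o", "args="], capture_output=True, text=True, check=True
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def missing_agents(cmdlines, expected=EXPECTED_AGENTS):
    """Return the expected agent names that appear in no command line."""
    return [name for name in expected if not any(name in line for line in cmdlines)]


# Hypothetical command lines, for illustration only.
sample = [
    "/made/up/path/saposcol",
    "/made/up/path/sapccm4x",
    "java -cp /made/up/path smdagent",
]

print(missing_agents(sample))
```

On a live host you would call `missing_agents(command_lines())` instead of using the sample list. Substring matching is crude and can produce false positives (a `grep saposcol` would count), so treat a clean result as a hint, not proof.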
Let’s look at one screen I just found in the Solution Manager Root Cause Analysis menu tree. I reached this via the EP1 Work Centers, which launched web browser pages.
That’s a rather optimistic menu title, I think, but it could happen. I snapped this screen shot as it tells me 5 out of 5 systems we have set up look good. Green lights all around. All warm and fuzzy on level 1.
The next shot is on the second tab, agent connectivity. This says everything is connected via message server. Again, all green, or at least, checked.
Third shot is a tab near the end of the row, “agent credentials.” Cool, now we’re getting to the passwords, signs and countersigns part of the spy business. Better make sure our agents only talk to us. Hmm, three are running as smd_admin, while two are running as j2ee_admin. Do we have different spy agencies, like MI5 and MI6 or something (“better tell the SIS to keep out of sight” – Mick Jagger)?
Now here’s something pretty useful for agent health checks – a log viewer. I poked around each of these. Nothing dramatic, but as usual, it’s a best practice to see what the systems look like under nominal conditions, as the astronauts say, the better to find out what’s broken when conditions are abnormal.
SLD, SMD? System Landscape Directory, Solution Manager Diagnostics. Deeper into this quagmire another day. This is under a different tab – the “SLD Agent Candidates”. Maybe a place for training future operatives?
The bad news on this last view is we have a red triangle as well as a broken connector symbol. Also, we seem to have six systems, where before we had five. More paperwork required here, I’m sure, to get everything shipshape.
Finally, just to show that these agents are reporting data that is valuable to the republic, here’s a view of day one after our BW 7.0 upgrade. Good job, team.
Now, can I have some more memory? That’s the next battle, for another day, and maybe for another cadre of agents.
In the System Log (SM21) I saw periodic ABAP dumps, so we have now applied this note:
- 1334507 – Dumps CREATE_DATA_UNKNOWN_TYPE in SMD_DATA_LOADER100