Alerts, the good, the bad, the ugly…
Ever work in an environment where everyone is all for monitoring, but as soon as you have things running, everyone creates filters for the alerts and then takes no action on them; even when they specifically had you create those alerts based on their criteria?
I have set up various monitoring systems (SolarWinds, Nagios, Solution Manager), and while they do have different ways of obtaining their data, it always comes down to alerts being the worst part of configuring the monitoring system.
So before I continue, I would love to see feedback on issues/solutions or good/bad/ugly situations you have had with alerting. 😀
- Saving time – avoid some of those morning/midday/afternoon redundant checks, like buffers/memory/high CPU/number of users logged in/etc.
- Transactional issues – know about a problem with your tRFCs/qRFCs/IDocs, typically caused by bad code or a user, before you get dragged in to resolve it.
- Knowing that your communications are functioning – RFCs; for some failures here it would have been easier if the destination system had just gone down.
- User specifics – number of users on a particular node/instance by Dialog/HTTP/RFC type, perhaps system accounts being locked, or even when any account is locked.
- Integration – alerts can be set to generate a ticket in the ITSM and, in turn, a ChaRM request to resolve the particular alert.
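The routine checks in the list above boil down to comparing current metric values against agreed limits. A minimal sketch of that idea follows; the metric names and limits are made up for illustration, and real tools like SolarWinds or Nagios configure this declaratively rather than in code:

```python
# Minimal sketch of a threshold-based alert check.
# Metric names and limits are hypothetical, not from any real system.

def evaluate(metrics, thresholds):
    """Return alert strings for metrics whose value breaches the limit."""
    alerts = []
    for name, value in metrics.items():
        limit = thresholds.get(name)
        if limit is not None and value >= limit:
            alerts.append(f"ALERT: {name}={value} (limit {limit})")
    return alerts

# Example: CPU is over its limit, dialog user count is not.
current = {"cpu_pct": 92, "dialog_users": 140}
limits = {"cpu_pct": 85, "dialog_users": 500}
print(evaluate(current, limits))  # ['ALERT: cpu_pct=92 (limit 85)']
```

The whole point of the "saving time" bullet is that this comparison runs on a schedule so nobody has to eyeball the numbers three times a day.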
- Alerts are misconfigured – missed a zero or added one too many; whatever the cause, the incorrect value produces a false positive or, worse yet, no alert, and the system or business process fails.
- All talk – "yeah, we should set up alerts for x, y, and z; I'm tired of teams a, b, and c coming down here every time something is broken." Result: alert flooding, and either you have to turn it off or they don't tell you they set up filters to just delete the notifications.
- ITSM or third-party ticket system – "I don't want a ticket created, as it will just add to the email (which I am deleting via rules) and require me to log into a system to close the ticket that was generated for the issue."
- Functional co-workers disabling metrics – co-workers with the ability to log into the host OS disable monitoring agents/collectors because Task Manager showed CPUs spiked for less than a minute, then inform you they would just turn it back on if there were a problem with the system.
- Metrics stop working – odd, no alerts over the last few days, wonder wha…holy mother of *beep beep beep*, and you spend the rest of your day hoping to resolve every issue before an end user or functional person catches it. Then you try to figure out how to re-enable metric collection so you never have to go through that again. 😉
- Lack of requirements – which could be linked to a lack of enforcement: what should be monitored, and at what point does value X become a problem (threshold settings)? And then everyone's opinions get in the way.
- Old school settings – I'm all for reliable, but what I find ugly are the folks who refuse to listen to, or even review, newer methods that perform the same function just because they're "new". sapccms4x/sapccmsr/ccmsping <– I understand you had to work with what you were given, but these are way more complicated than they need to be…frankly, monitoring via SNMP could not have been that difficult, and it's just as confusing! 😆
- Sending template descriptions – taking the time to print a template to a PDF (in most cases a large PDF) to send off to people for review, then waiting to hear back for at least a mention of one useful metric…
- Lack of familiarity – co-workers who recommend enabling alerting on any SM21 log entry with a status of red…for those newer to the land of SAP: this would send an email for every lost/disconnected SAP GUI session, and you would quickly disable the alert.
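That last bullet is really a filtering problem: "alert on anything red" floods the inbox, while filtering known-noisy message types first keeps the alert useful. A rough sketch of the idea, with placeholder message IDs (these are not real SM21 message codes):

```python
# Sketch: suppress known-noisy syslog entries before alerting.
# Message IDs below are placeholders, not actual SM21 codes.

NOISY_IDS = {"XX1"}  # e.g. lost/disconnected SAP GUI sessions

def should_alert(entry):
    """Alert only on red entries whose message ID is not known noise."""
    return entry["severity"] == "red" and entry["msg_id"] not in NOISY_IDS

log = [
    {"msg_id": "XX1", "severity": "red"},    # GUI session lost -> suppress
    {"msg_id": "AB2", "severity": "red"},    # real problem -> alert
    {"msg_id": "CD3", "severity": "green"},  # fine -> ignore
]
print([e["msg_id"] for e in log if should_alert(e)])  # ['AB2']
```

The noise list is the part that needs actual SAP familiarity; the filtering itself is trivial once someone who knows the system identifies which red entries are routine.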