SAP Intelligent Operations Control Center Topics and Examples (Part I)
Previous blogs describe the Operations Control Center (OCC) as an integral part of SAP’s best practice operational model for IT organizations. It is responsible for services such as IT monitoring, alerting, reporting/analytics, dashboards/transparency as well as root cause analysis of the hybrid system and solution landscape, covering information systems on premise and in the cloud. Its main processes are Event Management and Continuous Improvement, embedded into the value chain Detect to Correct. While the concept may be tried and tested, the Operations Control Center is continuously improved in all aspects – from organizational to procedural to technical. The following offers some examples from one specific perspective, Intelligent Collaboration and Procedure Automation, which evolves an Operations Control Center deployed on classic technology into an Intelligent Operations Control Center utilizing modern, state-of-the-art collaboration and automation software. A brief summary of different perspectives is found here.
Attended (supported) and unattended (automated) event and alert reaction
Automated and supported event reaction procedures are the key to the Detect to Correct value chain, especially the Event Management process. Through this process, alerts or events, i.e. automatically generated warnings and errors of predefined severity and rating, are traditionally addressed by human actors on shift duty in the Operations Control Center. Human operators (role as per the Organizational Model of the Operations Control Center, see OCC Whitepaper for details) follow previously documented procedures and practices to address each individual alert/event and its underlying issue in the system and solution landscape, so that operations return to within the (again predefined) thresholds and boundaries as quickly as possible. The normal operational behavior of any component and object in the system and solution landscape (the so-called monitored object) for which an event/alert type has been defined, configured, and activated has to be known by the OCC’s Technical Administrator and Configurator (role as per the Organizational Model of the Operations Control Center) in order to set the event/alert parametrization accordingly*. In the intelligent OCC, this classic configuration is supported by more dynamic approaches such as machine learning and artificial intelligence. Please find references to articles describing this aspect below in the further readings section.
Each event/alert type also needs a predefined procedure (part of its configuration) to address the issue or issues for which the event(s)/alert(s) are triggered. This procedure is also defined and documented at configuration time (the analysis and resolution is not an ad hoc activity at run time!). The Operations Control Center receives the event/alert resolution procedure input and requirements together with the event/alert type via the Request and Change Management Process of the Operations Control Center from requestors in the project teams (Build teams) and technical and functional support teams (Run teams) of the Center of Expertise, i.e. the IT organization’s subject matter experts throughout the application life cycle.
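The relationship between an event/alert type, its thresholds, and its predefined procedure can be sketched as follows. This is a minimal illustration only, not the data model of any SAP product: the class names, the field names, the threshold values, and the procedure identifier "PROC-FS-FULL" are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AlertType:
    """Configuration-time definition of one event/alert type (hypothetical model)."""
    name: str
    metric: str
    warning_threshold: float
    critical_threshold: float
    procedure_id: str  # reference to the documented reaction procedure


@dataclass(frozen=True)
class Alert:
    """Run-time instance raised for one monitored object."""
    alert_type: AlertType
    monitored_object: str
    value: float
    severity: str  # "WARNING" or "CRITICAL"


def evaluate(alert_type: AlertType, monitored_object: str, value: float) -> Optional[Alert]:
    """Return an Alert if the measured value breaches a threshold, else None."""
    if value >= alert_type.critical_threshold:
        return Alert(alert_type, monitored_object, value, "CRITICAL")
    if value >= alert_type.warning_threshold:
        return Alert(alert_type, monitored_object, value, "WARNING")
    return None


# Example values only; real thresholds come from the Build and Run teams.
FS_FULL = AlertType("File System Full", "fs_used_percent", 85.0, 95.0, "PROC-FS-FULL")
```

The point of the sketch is that the procedure reference is part of the alert type itself, so every alert raised at run time already knows which documented procedure applies to it.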
These procedures, once defined and documented (for ongoing maintenance and improvement), have traditionally been incorporated in software systems that guide the human operators to execute an instance of the procedure whenever an event/alert of the corresponding type has been triggered. This means the documented procedure (content) is configured in a guidance system (configuration). This is part of the event/alert configuration and activation exercise. In SAP Solution Manager 7.2 and similar SAP Application Lifecycle Management (ALM) software products, for example, the Alert Inbox lists the events/alerts which are (to be) processed by the human operator on shift and offers a jump into a guided procedure, which holds the steps and activities (i.e. tasks or sub-processes) that the human operator executes as soon as he or she processes an event/alert. This not only guides the operator to do precisely what is expected to address the underlying issue or issues, but also logs the execution of event/alert processing for reporting and improvement. Please find details here.
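To make the two roles of a guided procedure concrete (guiding the operator through ordered steps and logging the execution), here is a small sketch. It does not reproduce how SAP Solution Manager implements guided procedures; the classes Step and GuidedProcedure and the log fields are hypothetical names chosen for this illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List


@dataclass
class Step:
    description: str
    action: Callable[[], bool]  # returns True if the step succeeded


@dataclass
class GuidedProcedure:
    procedure_id: str
    steps: List[Step]
    log: List[dict] = field(default_factory=list)

    def run(self, operator: str) -> bool:
        """Execute steps in order, log each one, and stop at the first failure."""
        for number, step in enumerate(self.steps, start=1):
            ok = step.action()
            self.log.append({
                "procedure": self.procedure_id,
                "step": number,
                "description": step.description,
                "operator": operator,
                "ok": ok,
                "at": datetime.now(timezone.utc).isoformat(),
            })
            if not ok:
                return False
        return True
```

The log collected here is what later feeds reporting and continuous improvement: it records who executed which step, when, and with what result.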
The intelligent OCC addresses some challenges of this traditional approach, i.e. it aspires to
- reduce time to process alert reaction procedures e.g. the execution in and via guided procedures or equivalent,
- increase the immediacy and precision of alert reaction procedure processing,
- streamline the documentation and maintenance of and for alert reaction procedures.
Following the traditional approach, human operators assign themselves to an event/alert in a managing system, identify the corresponding procedure (which optimally is present in the event/alert context), and execute an instance of the procedure, which can be time-consuming as it might involve several execution steps in managed objects/systems.
Enter the intelligent OCC. It applies Intelligent Robotic Process Automation software, i.e. Robotic Process Automation (RPA) for automation of procedures and Conversational Artificial Intelligence (CAI) for interaction with objects, adding non-human operators (aka bots or operobots) to the event management process.
Equivalent to guided procedures for event/alert types for human processing and execution, procedure scripts are created for automatic or semi-automatic execution, using the design time of a Robotic Process Automation/Conversational Artificial Intelligence software of choice**. During runtime, these operobots, i.e. bot executions, support the OCC operators either by automatically reacting to and addressing the event/alert by executing the entire procedure (unattended mode), or by semi-automatically addressing some of the steps (i.e. tasks or sub-processes) of the procedure while collaborating and interacting with the human operators via a conversational AI user experience, e.g. a chatbot, copilot or equivalent (attended mode). Hybrid approaches are possible, too. For example, operators can manually execute these procedure scripts in a semi-attended mode as part of a classic guided procedure or as a command in a modern collaboration application.
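The difference between unattended and attended mode can be sketched in a few lines. This is an assumption-laden illustration rather than the behavior of a specific RPA or CAI product: the Mode enum, the react function, and the confirm callback (standing in for the chatbot dialog with the operator) are all hypothetical.

```python
from enum import Enum
from typing import Callable, Dict, List


class Mode(Enum):
    UNATTENDED = "unattended"
    ATTENDED = "attended"


def react(
    steps: Dict[str, Callable[[], str]],
    mode: Mode,
    confirm: Callable[[str], bool],
) -> List[str]:
    """Run the procedure script for one alert.

    UNATTENDED: the bot runs every step without asking.
    ATTENDED: the bot asks the operator (via `confirm`) before each step
    and leaves declined steps to the human operator.
    """
    results = []
    for name, action in steps.items():
        if mode is Mode.ATTENDED and not confirm(name):
            results.append(f"{name}: left to operator")
            continue
        results.append(f"{name}: {action()}")
    return results
```

The same procedure script serves both modes; only the interaction with the operator differs, which is why one documented procedure can back a guided procedure, an attended bot, and an unattended bot alike.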
For the Operations Control Center, the benefits of using intelligent technologies like robotic process automation, conversational artificial intelligence, and business process management like intelligent work flow are fast, reliable, immediate IT Event Management procedures. This will support compliance with IT service level agreements and with transitions to operations, e.g. if a more agile mindset and organization is to be incorporated into the ICC and OCC parts of the IT organization.
Examples of attended (supported) and unattended (automated) event and alert reaction
Two examples illustrate the author’s point of view. The examples are kept simple in order to explain the concept and options rather than to serve as a technical specification or immediate recommendation.
Attended (supported) event/alert reaction processing
Example: Conversational AI Use Case Check System or Service Configuration/Coding Changes
Issue: “But it was working fine yesterday!” How often does a little change in a system’s configuration affect the performance or availability of an important component (or the entire system or service itself)? Knowing, for example, which parameters have changed or whether any changes were imported into the system or service runtime can help to narrow down the reason for a sudden change in system behavior.
Objective: The goal is to avoid a manual search for configuration changes in the managed components, which in the worst case means navigating into each system or service runtime’s configuration stores to check. If an event/alert is consumed in the CAI UI, the operator can ask the operobot to run the check, offloading this part of the event/alert reaction procedure to the operobot.
Bot Skill: CAI4OCC is a chatbot that can react to questions about changes in a managed system and can provide a list of, for example, all changed parameters or imported transports. The chatbot can also answer questions about overall changes and guide the operator to the relevant tool to investigate the changes further.
Benefit Case: Save time when searching for changes.
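The core of such a bot skill is a comparison of two configuration snapshots. The following sketch assumes the snapshots are already available as simple name/value dictionaries; how they are read from a managed system is out of scope here, and the function names and the answer format are hypothetical.

```python
from typing import Dict, Optional, Tuple

Change = Tuple[Optional[str], Optional[str]]  # (old value, new value)


def changed_parameters(before: Dict[str, str], after: Dict[str, str]) -> Dict[str, Change]:
    """Return every parameter that was added, removed, or modified."""
    changes: Dict[str, Change] = {}
    for key in sorted(before.keys() | after.keys()):
        old, new = before.get(key), after.get(key)
        if old != new:
            changes[key] = (old, new)
    return changes


def answer(before: Dict[str, str], after: Dict[str, str]) -> str:
    """Format the comparison as a chat reply for the operator."""
    changes = changed_parameters(before, after)
    if not changes:
        return "No parameter changes found."
    lines = [f"{name}: {old!r} -> {new!r}" for name, (old, new) in changes.items()]
    return "Changed parameters:\n" + "\n".join(lines)
```

A value of None on either side marks a parameter that was added or removed between the two snapshots.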
Unattended (automated) event/alert reaction processing
Example: RPA Use Case Reaction to File System Full Alerts
Issue: A critical file system reaching its capacity can lead to a system or database outage. However, it is rarely immediately clear what led to the system standstill, and precious time is wasted on the analysis of other related issues.
Objective: The goal is to avoid these kinds of situations altogether and react quickly and accurately to file system full alerts in a fully automated way.
Bot Skill: RPA4OCC is a bot that reacts automatically and immediately to alerts/notifications for File System Full events. The exact auto-reaction depends on the implementation. A possible and safe implementation would be to move certain files to a dedicated overflow location until the specific file system can be either cleaned up or extended (dynamically).
Benefit Case: Avoid spending time and effort to check for this possible root cause. Apply the resolution automatically without giving OCC operators increased authorization/access to your hosts.
To summarize: Adopting Robotic Process Automation and Conversational Artificial Intelligence for the Operations Control Center Event Management process makes inherent sense. It is a logical evolution of the Guided Procedure (aka “Runbook”) via Alert Inbox concept and technology, leveraging (semi-)automation and technical operobots assisting human operators to safeguard the IT system and solution landscape. The author wants to emphasize that this does not invalidate the classic approaches and technology; rather, it complements and extends the OCC’s capabilities. Moreover, these intelligent capabilities do not relieve – at least at present – the IT organization from documenting and maintaining any relevant event/alert type together with its resolution procedure first, before this content and configuration time object can be coded or transformed into a runtime object for attended or unattended execution by the OCC.
In combination with managing systems as collectors of data from the managed hybrid system and solution landscape and creators of events/alerts, RPA and CAI runtimes and user interfaces offer powerful execution environments: in unattended mode through operobots triggered by events/alerts, and in attended mode through user interfaces that offer the operator real-time interaction with other operators and operobots to get the job done, replacing, for example, a classic Alert Inbox as the environment and experience for processing alerts.
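The "move files to an overflow location" reaction described above could look roughly like the sketch below. It is one possible implementation under stated assumptions, not the RPA4OCC bot itself: the function names, the 95 percent threshold, and the "*.log" file pattern are hypothetical, and a real reaction would also have to make sure the files are not in use and that the overflow location sits on a different file system.

```python
import shutil
from pathlib import Path
from typing import List


def used_percent(path: str) -> float:
    """Percentage of the file system containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def move_to_overflow(source: Path, overflow: Path, pattern: str = "*.log") -> List[Path]:
    """Move files matching `pattern` from `source` to `overflow`; return the new paths."""
    overflow.mkdir(parents=True, exist_ok=True)
    moved = []
    for file in sorted(source.glob(pattern)):
        if file.is_file():
            target = overflow / file.name
            shutil.move(str(file), str(target))
            moved.append(target)
    return moved


def react_to_fs_full(source: Path, overflow: Path, threshold: float = 95.0) -> List[Path]:
    """Unattended reaction: relieve the file system only if the threshold is breached."""
    if used_percent(str(source)) < threshold:
        return []
    return move_to_overflow(source, overflow)
```

Because the bot runs with its own technical user, the operators on shift do not need file system access on the host for this reaction.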
* These parameters include the metrics, measures, and threshold values for metrics measuring managed runtime objects like systems, services, tenants, databases, servers, etc., but also automation/background processing, API/interface processing, business process executions, data consistency evaluations, etc. Threshold breaches trigger the events/alerts to be processed.
** SAP offers SAP Intelligent Robotic Process Automation as part of SAP Business Technology Platform.