Technology Blogs by Members
namratharaj
Introduction

In the dynamic landscape of SAP CPI integration, it's essential to have efficient error handling and incident management to keep integrations running smoothly. We can encounter various errors, ranging from data inconsistencies to connectivity issues or simple configuration mistakes—all of which have the potential to disrupt critical business processes.

So, what's the solution? We need a robust framework that can effectively address these errors and manage incidents, allowing operations to run seamlessly. That's precisely what we'll explore today. This blog will delve into why error handling and incident management are pivotal in SAP CPI and provide a step-by-step guide on how to tackle these challenges head-on.

Architecture

The proposed solution for error handling and incident management in SAP CPI integrates with various tools and platforms to provide a unified approach.


The key components of our design include:




  1. Centralized Logging and Monitoring: SAP CPI is integrated with Splunk to provide centralized logging and monitoring of errors encountered during interface execution. Splunk's powerful search capabilities facilitate quick identification and troubleshooting of errors. Furthermore, Splunk's indexing and correlation features create a structured container for error data, streamlining the error detection and management process.

  2. Alert Notification: Splunk alerts are configured to run regularly and notify business users via email when errors occur. This ensures that stakeholders are promptly aware of errors, allowing for quick resolution.

  3. Automated Incident Creation: The solution leverages the integration of SAP APIM and Azure Event Hub to facilitate deduplication of error logs. Error logs are transmitted from SAP CPI to Azure Event Hub, which utilizes the Radar application for deduplication. Subsequently, the integration with ServiceNow enables the automatic creation of incidents for each unique error, ensuring efficient incident management and resolution.

  4. Scalability and Customization: The proposed solution is built to scale easily, thanks to Splunk's ability to handle growing amounts of data. It also lets you customize alert timings and send notifications to specific business groups. By routing incidents to the support teams responsible for each integration, the framework makes it easier to resolve issues quickly and holds everyone accountable.


 

Interface Design

The Generic Error Handling Interface serves as a centralized hub for managing exceptions across all iFlows within the tenant. Centralizing this logic keeps changes easy to manage and simplifies any future enhancements.


 

Step 1: Main iFlow Configuration

All iFlows within the tenant are configured to send error details and additional log properties to the Generic Error Handling iFlow through a dedicated JMS queue.

1.1 Exception subprocess of Main iFlow

When an exception occurs in the interface, the Exception subprocess is invoked. For connectivity errors, a retry mechanism is initiated to reprocess the data a specified number of times. If all retries are exhausted and the issue persists, an error log is sent to a generic flow within CPI, which then forwards it to Splunk for further analysis and monitoring.




  1. Store Exception properties

  2. The Camel Exception object in the exception subprocess of the main iFlow is used to read the error code from the receiver adapter and set it in the Splunk properties.

  3. JMS Adapter configuration


Note: Retry via JMS is applicable only for asynchronous flows. For all other cases, error logs are sent directly to the generic flow without retrying.
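The routing decision in the exception subprocess can be sketched as follows. This is a minimal Python sketch of the logic only, not CPI code; in the iFlow this decision is made by a router step, and the names `is_async`, `retry_count`, and `max_retries` are illustrative.

```python
def route_exception(is_async: bool, retry_count: int, max_retries: int) -> str:
    """Decide what happens to a failed message in the exception subprocess.

    Asynchronous flows are retried via the JMS queue up to a limit; once
    retries are exhausted (or for synchronous flows), the error log is
    sent straight to the Generic Error Handling flow.
    """
    if is_async and retry_count < max_retries:
        return "RETRY_VIA_JMS"
    return "SEND_ERROR_LOG_TO_GENERIC_FLOW"

# First failure of an async flow with a limit of 3 retries: retry again.
print(route_exception(True, 0, 3))
# Retries exhausted: hand the error log to the generic flow.
print(route_exception(True, 3, 3))
```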


Further Reading: For JMS retry configuration, you can refer to this SAP blog.

Step 2: Generic Error Handling Process Flow to Replicate MPL Logs in Splunk


 

Once the error data is received from the JMS queue in the Generic flow, a series of actions are taken to effectively manage the error. These include:

  • Retry Check: A router condition checks for potential retries in case the Error Handling iFlow itself fails due to exceptions such as Splunk being unavailable. If the retry limit is reached, the data is stored in a data store using the WRITE operation. (A separate iFlow can then read the logs from the data store and send them to Splunk when it becomes available.)

  • Error Message Character Limit: A Groovy script checks the length of the error message. Given that Splunk has a character limit of 5000, messages exceeding this limit are trimmed.

  • SFTP Manual Retry Check: This step assesses whether the payload should be stored in SFTP for a possible manual retry, particularly when the source system lacks the capability to resend failed messages. Where SFTP is needed, the payload is archived in SAP SFTP and logs are sent to Splunk in parallel; if SFTP isn't required, logs go directly to Splunk. This decision is based on the property "SFTPRequiredForManualRetry", which is sent from the main iFlow.
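The three checks above can be sketched together as follows. This is an illustrative Python sketch of the decision logic, not the iFlow itself: in CPI the trim is done by a Groovy script and the routing by router steps, and the function and parameter names here are assumptions.

```python
SPLUNK_CHAR_LIMIT = 5000  # character limit mentioned above

def trim_error_message(msg: str, limit: int = SPLUNK_CHAR_LIMIT) -> str:
    """Trim error messages that exceed Splunk's character limit."""
    return msg if len(msg) <= limit else msg[:limit]

def route_error(retry_count: int, max_retries: int, sftp_required: bool) -> list:
    """Decide where a received error log goes.

    Retry limit reached -> park the log in the data store for a later
    resend; otherwise archive to SFTP (when manual retry is needed) in
    parallel with sending to Splunk, or send to Splunk alone.
    """
    if retry_count >= max_retries:
        return ["DATA_STORE_WRITE"]
    targets = []
    if sftp_required:  # driven by the SFTPRequiredForManualRetry property
        targets.append("SFTP_ARCHIVE")
    targets.append("SPLUNK")
    return targets
```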


Step 3: Integration with Splunk

Splunk integration setup details can be found in this SAP blog: https://blogs.sap.com/2023/06/23/sap-integration-suite-external-logging-to-splunk/
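For orientation, sending an error log to Splunk's HTTP Event Collector (HEC) looks roughly like the sketch below. The `/services/collector/event` endpoint and the `Authorization: Splunk <token>` header are HEC's documented interface; the host, token, and field names are placeholders, and in this design the actual call is made by the CPI receiver adapter rather than custom code.

```python
import json
import urllib.request

def build_hec_request(hec_url: str, token: str, error_log: dict) -> urllib.request.Request:
    """Build an HTTP Event Collector request carrying one error-log event."""
    body = json.dumps({
        "event": error_log,
        "sourcetype": "_json",
        "source": "sap-cpi-error-handler",  # illustrative source name
    }).encode("utf-8")
    return urllib.request.Request(
        hec_url + "/services/collector/event",
        data=body,
        headers={"Authorization": "Splunk " + token,
                 "Content-Type": "application/json"},
        method="POST",
    )

# Placeholder host and token; urlopen(req) would deliver the event.
req = build_hec_request("https://splunk.example.com:8088", "<hec-token>",
                        {"iflow": "OrderInterface", "errorCode": "HTTP 503"})
```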


     

Splunk Alert

  1. Navigate to Splunk and, on the Search page, enter the search query for which you want to generate an alert.

  2. To create an alert, select Save As and then click Alert.

  3. You will be presented with a dialog where you can enter additional details:


Title: Name of the alert

Description: Additional details of the alert

Permissions: Options are Private and Shared in App. Private means that only you can view and edit the alert; it is not visible to other users. Shared in App makes the alert available to other users of the Search & Reporting app, and depending on their permissions, those users can also edit the alert.

Alert Type: Scheduled

Trigger Conditions: The condition that triggers the alert.

Trigger alert when: The condition under which the alert fires; you can also add criteria such as the number of results being greater or less than a particular value.

Trigger: Whether the alert should trigger once, or every time the criteria are met.

Throttle: This feature allows us to control the frequency of alert triggering. It is useful in scenarios where alerts may trigger frequently due to similar search results or frequent scheduling. By setting throttling controls, you can suppress alert notifications for a specific time period, reducing the number of alerts you receive.

For scheduled searches that run frequently, you can set longer throttling periods to avoid being notified for every result generated. In the case of real-time searches, if the alert triggers for each result, configuring throttling can help suppress additional alerts and prevent your inbox from being overwhelmed with notification emails. The default throttling period is 60 seconds, but you have the flexibility to adjust it according to your needs.

Trigger Actions: In this section, we have the option to customize the e-mail subject and body for the alert. We can use tokens to add specific information to the alert. For example, the $type$ token will be replaced with the name of the search when the alert is sent.

We also have the option to include information from the trigger search results by using the $result.count$ token. This allows you to dynamically incorporate the count of the error log into the alert.
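The effect of such tokens can be illustrated with a small sketch. This is not Splunk's own expansion code, just a Python illustration of the substitution mechanism; the template text and token values are made up.

```python
import re

def expand_tokens(template: str, values: dict) -> str:
    """Replace $token$ placeholders in an alert email template.

    Illustrates how tokens such as $name$ or $result.count$ are
    substituted with alert-specific values when the email is sent.
    """
    return re.sub(r"\$([\w.]+)\$",
                  lambda m: str(values.get(m.group(1), m.group(0))),
                  template)

body = expand_tokens("Alert $name$ fired with $result.count$ errors.",
                     {"name": "CPI Error Alert", "result.count": 7})
print(body)
```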


Once you save the alert, a popup appears where you can make changes to the permissions.

Email Alert from Splunk


 

Step 4: Splunk Webhook to CPI

  1. From the Search page in the Search and Reporting app, select Save As > Alert. Enter alert details and configure triggering and throttling as needed.


  2. In the Add Actions menu, select Webhook.


  3. Type a URL for the webhook. The target URL must be on the webhook allow list. For more information, see Configure webhook allow list.


  4. Click Save.
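On the CPI side, the webhook arrives as an HTTP POST whose JSON body includes the alert's search name, the search id (sid), a link to the results, and the first result row. A minimal sketch of extracting those fields is shown below; the sample values are made up, and in this design the parsing would happen inside the receiving iFlow rather than in standalone code.

```python
import json

def parse_splunk_webhook(body: bytes) -> dict:
    """Extract the fields CPI needs from a Splunk webhook alert payload."""
    payload = json.loads(body)
    return {
        "alert": payload.get("search_name"),
        "sid": payload.get("sid"),
        "results_link": payload.get("results_link"),
        "first_result": payload.get("result", {}),
    }

# Made-up sample payload in the shape Splunk's webhook action sends.
sample = json.dumps({
    "search_name": "CPI Error Alert",
    "sid": "rt_123",
    "results_link": "https://splunk.example.com/app/search/...",
    "result": {"errorCode": "HTTP 503"},
}).encode("utf-8")
parsed = parse_splunk_webhook(sample)
```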


 

Step 5: Integration with Azure Event Hub

The AMQP adapter is used to integrate with Azure Event Hub.


We have integrated with Event Hub to leverage the deduplication process, which helps ensure data accuracy and efficiency. By using Event Hub, Logic Apps and ELK can efficiently transfer messages through the pipeline, with alerts being sent to the ELK stack. The ELK stack then uses the message content to create incidents in ServiceNow.

Deduplication is valuable when multiple alerts for the same issue could lead to duplicate Incidents, causing confusion and wasting resources. With deduplication, only one Incident is created, optimizing resource allocation and simplifying incident resolution, ultimately enhancing workflow efficiency.
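The deduplication idea can be sketched as follows: alerts sharing the same interface name and error message within a time window collapse into a single incident. This is an illustrative Python sketch of the concept, not the Radar application's actual logic; the fingerprint fields and window length are assumptions.

```python
import hashlib
import time

class Deduplicator:
    """Suppress repeat alerts for the same error within a time window."""

    def __init__(self, window_seconds: int = 3600):
        self.window = window_seconds
        self.seen = {}  # fingerprint -> time the last incident was created

    def fingerprint(self, interface: str, error_message: str) -> str:
        """Stable key identifying 'the same' error across alerts."""
        raw = f"{interface}|{error_message}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def should_create_incident(self, interface: str, error_message: str,
                               now: float = None) -> bool:
        """True if this alert warrants a new ServiceNow incident."""
        now = time.time() if now is None else now
        key = self.fingerprint(interface, error_message)
        last = self.seen.get(key)
        if last is not None and now - last < self.window:
            return False  # duplicate within the window: suppress
        self.seen[key] = now
        return True
```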

Step 6: Incident Creation in ServiceNow

