Former Member

Resolving problems in complex environments, such as SAP PI, can be very challenging. In this article I will give you some pointers to help you effectively troubleshoot a problem. I wrote the article with interfaces in SAP PI in mind, but most of it can be applied to troubleshooting in general.

The big four

Troubleshooting a problem boils down to answering these four questions:

- What is the problem?

- What is the cause of the problem?

- How can we resolve the current error situation?

- How can we prevent it from happening again?

I will illustrate this using a simple example: a customer complains they are no longer receiving order responses. They normally receive at least two every workday but yesterday they did not receive any.

The problem stack

The problem seems very obvious: the customer is not receiving order responses. A quick look in SAP PI confirms this: there are no order responses for this customer.

Now where do we start troubleshooting? You always need an understanding of the desired situation before you can effectively troubleshoot a problem. Start your investigation at a point in the process that you know does not work (NOK) and work your way back until you find a point that does work (OK).

In our example, we know the order response is normally triggered by the receipt of an order. Another quick look in SAP PI shows that no orders have been received from this customer. This means we need to rephrase the problem from "the customer is not receiving order responses" to "the customer is not receiving order responses because we are not receiving customer orders".

This gives us the first part of what I call the problem stack, a drill-down into the problem until we find the root cause:

The customer is not receiving order responses because

    We are not receiving customer orders because

        ...

Finding the root of the problem means we have to repeat this process until we find a step in the process that is still working without problems. To illustrate this, let's try to fill in the next line in our problem stack.

Assume nothing

Before the problem occurred, the receipt of orders was triggered by the customer sending us orders, so our next step is to check whether the customer actually did send us an order.

It is tempting to assume they did: why else would they complain about not receiving order responses? So, we can skip this step, right?

Not at all. This would be a potentially very costly mistake to make. You won't be the first to spend days investigating a problem that turns out not to be a problem at all. Instead we ask the customer for a quick verification and they send us a log file showing that indeed they sent us an order.

This presents us with a problem: one step in the process works fine (the sending of the orders) and the next step (the receipt of orders in SAP PI) does not, but we have not yet found a step that actually shows an error and can be marked as the place to look for the cause of the problem.

It's in the details

The fact that we have not found the actual cause means only one thing: the steps we looked at are not fine-grained enough for us to find the cause. We need to 'zoom in' on the area between the 'OK' and 'NOK' parts of the process. If you zoom in far enough, you will find the root cause.

In our example: after the customer sends an order, it travels over the internet, through a firewall, and through the AS/2 adapter into SAP PI. Investigation shows that the AS/2 adapter is rejecting the order because the customer's digital certificate is not trusted.

The customer is not receiving order responses because

   We are not receiving the customer orders because

      The AS/2 adapter is rejecting the transaction because

         The customer's digital certificate is not trusted

Of course it's not an option to start your investigation at the lowest level of detail; you will waste vast amounts of time investigating steps that have nothing to do with the problem.
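
The coarse-then-fine approach can be sketched in a few lines of Python. This is only an illustration under assumptions: the step names are taken from the example, but the function `find_boundary` and the hard-coded check results are hypothetical stand-ins for the real verification you would do (log files, monitoring, asking the customer). The function walks back from the end of the process until it finds a step that is OK, and only the gap it reports is then broken down into finer steps.

```python
from typing import Callable, List, Optional, Tuple

# A step is a name plus a check that returns True (OK) or False (NOK).
Step = Tuple[str, Callable[[], bool]]


def find_boundary(steps: List[Step]) -> Tuple[Optional[str], Optional[str]]:
    """Walk upstream from the end of the flow until a step is OK.

    Returns (last OK step, first NOK step after it). Either may be None.
    """
    first_nok = None
    for name, check in reversed(steps):
        if check():
            return name, first_nok
        first_nok = name
    return None, first_nok


# Hard-coded results standing in for real checks (hypothetical).
coarse_steps: List[Step] = [
    ("customer sends the order", lambda: True),
    ("order is received in SAP PI", lambda: False),
    ("order response is sent to the customer", lambda: False),
]

# Only the gap between the coarse OK and NOK steps is broken down further.
fine_steps: List[Step] = [
    ("order travels over the internet", lambda: True),
    ("order passes the firewall", lambda: True),
    ("AS/2 adapter accepts the order", lambda: False),
]

print(find_boundary(coarse_steps))
print(find_boundary(fine_steps))
```

The first call narrows the problem to the area between sending and receiving; the second call, run only on that area, points at the adapter.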


Someone touched it

We finally made it to the root problem. We still do not have the slightest clue, however, as to what caused it. One day the interface works, the next day it doesn't. The only thing we know for sure is that something changed. In this case, a certificate that used to be trusted is not trusted anymore. In the example we find evidence in the log files that a user deleted a certificate from the system. This completes the problem stack:

The customer is not receiving order responses because

   We are not receiving the customer orders because

      The AS/2 adapter is rejecting the transaction because

         The customer's digital certificate is not trusted because

            A user deleted a certificate from the system
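
If you want to keep the problem stack in a ticket or a script, a plain list is enough. The sketch below is one possible way to do that; the function name `render_problem_stack` and the three-space indent are my own choices, not part of any tool. Each entry is one level of the drill-down, and every level except the last is followed by "because".

```python
from typing import List


def render_problem_stack(causes: List[str], indent: int = 3) -> str:
    """Render a list of causes as an indented 'because' chain."""
    lines = []
    for depth, cause in enumerate(causes):
        # Every level except the deepest one leads into the next cause.
        suffix = " because" if depth < len(causes) - 1 else ""
        lines.append(" " * (indent * depth) + cause + suffix)
    return "\n".join(lines)


problem_stack = [
    "The customer is not receiving order responses",
    "We are not receiving the customer orders",
    "The AS/2 adapter is rejecting the transaction",
    "The customer's digital certificate is not trusted",
    "A user deleted a certificate from the system",
]

print(render_problem_stack(problem_stack))
```

Appending a new entry to the list is the equivalent of filling in the next line of the stack.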

Note that our user, when asked, claims that nothing has been changed on the system for ages. Even worse, this user is genuinely convinced that nothing changed, because he deleted the certificate by accident. Asking people whether they did anything to the system is a very good starting point, but a negative answer is no guarantee that they really didn't.
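
Because recollection is unreliable, comparing a known-good state with the current state is a more dependable way to find out what changed. The following is a minimal sketch, assuming you can obtain two lists of trusted certificate entries, one from before the problem and one from now. The entry names and the function `diff_snapshots` are hypothetical; how you export such a list depends on your system.

```python
from typing import Dict, List, Set


def diff_snapshots(before: Set[str], after: Set[str]) -> Dict[str, List[str]]:
    """Report which entries disappeared and which are new."""
    return {
        "removed": sorted(before - after),
        "added": sorted(after - before),
    }


# Hypothetical snapshots of the trusted certificate entries.
trusted_last_week = {"customer_a_cert", "customer_b_cert", "partner_root_ca"}
trusted_today = {"customer_b_cert", "partner_root_ca"}

changes = diff_snapshots(trusted_last_week, trusted_today)
print(changes)
```

An entry under "removed" does not tell you who changed it or why, but it does tell you where to look in the log files.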

Depending on the urgency of the problem, you can decide not to look for the root cause and jump straight into resolving the error situation. Be aware, however, that resolving errors can rob you of the opportunity to find the root cause. Without the root cause, you have no way to prevent the problem from happening again.

Resolve and prevent

In our case: change the configuration so that the certificate is trusted again, ask the customer to re-send the orders, and restrict access to the certificate storage. In most cases it's as simple as that: when the problem stack is complete, the resolution and possible ways to prevent this problem from happening again are obvious enough. Note that a possible conclusion is that there is no way to prevent the problem from happening again.

Summary

For every problem that occurs in an interface that used to work fine, there are several certainties: something has changed to cause the problem, there is a root cause to the problem, and it can be fixed. If you drill down into the grey area between the OK and NOK parts of the process, you will find the root of the problem and be able to resolve it. Don't forget that something has changed to cause the problem. All you need to do is find and revert that something.
