My Tips about how to handle complex and tricky issues
- Symptoms of complex & tricky issues
- 1. The issue needs complex steps to reproduce
- 2. Different software components involved
- 3. The issue could only be reproduced in customer production system
- 4. The issue could only be reproduced in background job execution mode but not in online mode
- 5. The issue could only be reproduced in normal execution, but when you debug the program, everything works perfectly
- An example of how to resolve such kind of issue
- Update on 2014-5-14 to correct some mistake
- Further reading
Symptoms of complex & tricky issues
During my seven years working at SAP China, I have resolved hundreds of internal and customer tickets. Some kinds of tickets always give me a headache:
1. The issue needs complex steps to reproduce
For example, to reproduce one customer ticket I had to (1) create a new sales order, (2) create a new customer demand based on the sales order, (3) create a pick list and (4) release the generated delivery note. The issue could only be reproduced when releasing the delivery note, so for every debugging attempt I had to repeat the lengthy steps (1) to (3) before I could debug the release in step (4).
2. Different software components involved
I bet most of you know this feeling: if the issue occurs purely within the component you are responsible for, you are confident that it will be resolved sooner or later, since you own the API and are quite familiar with it. However, if your API is called by another software component, or from another system with a complex context, you have to spend more time to get a basic understanding of the whole story, to find out how your API is called, and to analyze whether the original design of your API can cope with a scenario you never thought about before.
3. The issue could only be reproduced in customer production system
In most of the cases I have met, the reason is the data setup. For example, the test data in the customer test system is not well set up, so the erroneous code never gets a chance to be executed there. Sometimes, because of technical limitations or other reasons, it is impossible to ask the customer to set up exactly the same data in the test system as in the production system. The worst situation is when the issue occurs during a write operation, for example when the pricing calculation is wrong as a business document is saved. In this case you cannot simply debug the save process, as it would influence the customer's business. You have to coordinate with the customer on how to proceed.
4. The issue could only be reproduced in background job execution mode but not in online mode
The first step in checking such an issue is to find out whether the program uses any FMs or methods that must not be used during background execution, when no presentation server is attached.
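This first check can be partly automated. The following is only a minimal sketch in Python, not an SAP tool: it scans ABAP source text for calls to function modules on a deny-list. The function name `find_frontend_calls` is made up for illustration, and the deny-list holds just two well-known frontend-dependent FMs (GUI_UPLOAD and GUI_DOWNLOAD) as examples; it is not exhaustive and you would have to extend it for your own system.

```python
import re

# Illustrative deny-list only (assumption): extend it with the
# frontend-dependent function modules relevant to your own system.
FRONTEND_ONLY_FMS = {"GUI_UPLOAD", "GUI_DOWNLOAD"}

# Matches: CALL FUNCTION 'NAME'
CALL_FUNCTION = re.compile(r"CALL\s+FUNCTION\s+'([^']+)'", re.IGNORECASE)


def find_frontend_calls(source):
    """Return (line number, FM name) for every call to a deny-listed FM."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        if line.startswith("*"):
            # Full-line ABAP comment, ignore it.
            continue
        for match in CALL_FUNCTION.finditer(line):
            name = match.group(1).upper()
            if name in FRONTEND_ONLY_FMS:
                hits.append((line_no, name))
    return hits
```

The sketch only catches static calls with a literal FM name. Dynamic calls and class-based frontend services would still need a manual look.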
5. The issue could only be reproduced in normal execution, but when you debug the program, everything works perfectly
Every day I use the debugger to fight against bugs. When a bug cannot be found via debugging although it definitely exists, I feel helpless, since this powerful weapon cannot help me this time. Then I have to read and analyze the code and run it in my brain. In most cases the issue finally turns out to be related to time-dependent processing in the program.
As ABAPers we are lucky, since we do not often have to struggle with such time-dependent issues. When I was developing an Android application for SAP CRM customer briefing in 2012, I suffered a lot from these kinds of issues. Just two examples:
a. When you touch the Android tablet with a single finger and swipe, 5 or 6 different kinds of events are triggered sequentially. My event handlers registered for these events work with the coordinates at which the events occurred. Those coordinates become invalid if the code is stopped in the debugger, so I had to write many System.out.println statements to print the coordinates to the console for analysis.
b. Deadlocks in multi-threading. Such issues are hard to reproduce via debugging.
In fact, some issues do not fall into just one of the categories listed above but combine several of them. I have never encountered a customer issue that has all five features above, and I pray I will NEVER meet one.
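When stopping in the debugger changes the behavior, recording what happened and reading it afterwards is often the only option. Here is a minimal sketch of that idea in Python (not the Android code I actually wrote): events are appended to an in-memory buffer together with a timestamp, and formatted only after the run. The class name `EventTrace` and the event labels in the usage note are illustrative assumptions.

```python
import time
from collections import deque


class EventTrace:
    """Keep the most recent events in memory for inspection after the run."""

    def __init__(self, capacity=1000, clock=time.monotonic):
        # A bounded deque drops the oldest events once capacity is reached.
        self._events = deque(maxlen=capacity)
        self._clock = clock

    def record(self, name, **data):
        """Store the event name, its data and the current clock value."""
        self._events.append((self._clock(), name, data))

    def dump(self):
        """Return one line per event, with time relative to the first event."""
        if not self._events:
            return []
        start = self._events[0][0]
        return [
            f"{stamp - start:8.3f}s {name} {data}"
            for stamp, name, data in self._events
        ]
```

Inside an event handler you would call something like `trace.record("ACTION_MOVE", x=12, y=25)` and print `trace.dump()` once the gesture is over. Recording is cheap, so it disturbs the timing far less than a breakpoint does.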
An example of how to resolve such kind of issue
Recently I worked on one ticket which took me almost 10 hours in total to resolve. I will share how I analyzed this issue step by step.
I am the owner of the SAP CRM IBASE component CRM-MD-INB. The issue: my colleagues in the Solution Management development team complained that when they create a new IBASE component, delete it afterwards in the same session and then save, an ST22 dump occurs in the middleware processing.
I knew nothing about solution management development before.
This issue could only be reproduced in background execution. (The program is designed to be executed only in background.)
The issue is not always reproducible. 囧
Step1 Understand how and when my API is called
I quickly went through the solution manager program, which has tens of thousands of lines of code. I set breakpoints inside my API (the IBASE create, update and delete function modules) and then identified all calling places and the content of the importing parameters.
Step2 Write simulation report and ensure the issue could be reproduced via it
As the scenario is really complex, with CRM, SOL and Middleware involved, I spent one hour debugging without finding any hint. Judging purely at code level, there are too many factors which influence the program. In order to concentrate on my API, I planned to develop a simulation report which also calls IBASE create, update and delete and then performs a save. The idea is to decouple the API calls from the solution manager logic. If the issue can also be reproduced in my simulation report, then life is easier: I only have to work on the simulation report, which contains only 200 lines.
I spent another hour finishing the simulation report. Unfortunately I could not reproduce the issue with it. After checking again with the issue reporter, I realized that the report did not simulate the IBASE operations of the real program 100%, and I changed it to close the gap.
The simulation report is uploaded as an attachment to this blog.
Step3 Identify the core code in simulation report which is related to the dump
Since the simulation report is owned by me, it is very convenient to change it for issue analysis.
a. Comment out all IBASE related code.
b. Uncomment the IBASE component creation FM and execute the report: no dump.
c. Also uncomment the IBASE component change FM and execute: no dump.
d. Also uncomment the IBASE component deletion FM and execute: dumps!!!
So now I know this issue is related to IBASE deletion.
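The procedure in steps a to d can be written down as a small loop. This is only a sketch in Python with made-up stand-ins for the real function module calls; `first_failing_step`, `create`, `change`, `delete` and `run` are hypothetical names, and the `RuntimeError` merely simulates the dump.

```python
def first_failing_step(steps, run):
    """Enable the steps one at a time, in order, and rerun from scratch.

    Returns the name of the step whose addition first makes `run` raise,
    or None if the run succeeds with all steps enabled.
    """
    enabled = []
    for name, action in steps:
        enabled.append((name, action))
        try:
            run(enabled)
        except Exception:
            return name
    return None


# Hypothetical stand-ins for the real function module calls.
def create(state):
    state["component"] = "created"


def change(state):
    state["component"] = "changed"


def delete(state):
    # Simulates the dump seen once the deletion FM was enabled.
    raise RuntimeError("simulated dump")


def run(enabled_steps):
    """Execute the enabled steps on a fresh state, like one report run."""
    state = {}
    for _name, action in enabled_steps:
        action(state)


steps = [("create", create), ("change", change), ("delete", delete)]
print(first_failing_step(steps, run))  # prints "delete"
```

Every iteration starts from a fresh state on purpose: each attempt in my analysis was also a complete new execution of the report.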
Step4 Investigation on ST22 dump
Now the issue could be reproduced during normal execution of the simulation report, but everything worked perfectly during debugging.
My previous experience told me that it might be caused by some time-dependent processing logic in the code. I then checked the position in the code which raises the error (line 103) and found lots of time-related logic in the same include.
The aim of this include is to find the IBASE and fill it into es_ibinadm. It first checks the buffer table gt_ibinadm_by_in_guid; if that fails, it tries the FM in line 91 (in the first screenshot of this blog) as the last line of defense. Normally es_ibinadm is expected to be filled, but in this issue the last defense also fails, so the X message is raised. I set a breakpoint in this include; however, during debugging the variable es_ibinadm was successfully filled in line 54 and everything worked perfectly. Yet the dump was indeed there when I executed the report directly.
So I ran the report once again and went to ST22 right away. The dump is then “fresh”, and the Debugger button is only available in this fresh state, so I could inspect the variable values in the debugger as they were when the dump occurred.
I soon found the root cause: the valfr and valto of the buffer entry are the same, so during normal execution the check in line 53 fails and the code has to fall back to the last defense, the call of FM CRM_IBASE_COMP_GET_DETAIL. In this case it is expected behavior that the FM raises an X message, since the entry should have been returned from the buffer table. When the code is executed in the debugger, the time spent stepping through the code means that the deletion happens later than the creation, so valto is always greater than valfr and the code directly returns the buffer entry to its caller without calling FM CRM_IBASE_COMP_GET_DETAIL.
I will not go deep into the IBASE valfr and valto generation logic, as it is CRM specific and I am not an expert on it either. (By default, creating an IBASE component sets its valid-to timestamp to a date that never expires (99991231235959). The comparison timestamp is set as the valid-from timestamp.)
After I added code which forces the valid-to timestamp to be one second later than the valid-from timestamp, so that the check in line 53 always succeeds, the issue was resolved: no more dump in background job execution.
I guess it would also work if the “<” were changed to “<=” in line 53. However, this code is owned by the Middleware software component and I cannot change it; maybe I can discuss it with the responsible colleague.
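To make the effect of the strict comparison concrete, here is a small Python sketch. It is not the middleware code: I assume that the check in line 53 amounts to valfr < valto and that the deletion sets valto to the deletion timestamp. The function name `lookup_in_buffer`, the dictionary layout and the key "guid-1" are illustrative.

```python
def lookup_in_buffer(buffer, guid, inclusive=False):
    """Return the buffered entry if its validity interval passes the check.

    inclusive=False mirrors the assumed strict check (valfr < valto),
    inclusive=True mirrors the alternative (valfr <= valto).
    Timestamps are integers in YYYYMMDDhhmmss format (second resolution).
    """
    entry = buffer.get(guid)
    if entry is None:
        return None
    if inclusive:
        valid = entry["valfr"] <= entry["valto"]
    else:
        valid = entry["valfr"] < entry["valto"]
    return entry if valid else None


# Created and deleted within the same second (normal execution).
same_second = {"guid-1": {"valfr": 20140514180000, "valto": 20140514180000}}

# Deleted one second later (for example while stepping in the debugger).
one_second_later = {"guid-1": {"valfr": 20140514180000, "valto": 20140514180001}}

print(lookup_in_buffer(same_second, "guid-1"))            # None: buffer miss
print(lookup_in_buffer(one_second_later, "guid-1"))       # entry is returned
print(lookup_in_buffer(same_second, "guid-1", inclusive=True))  # entry is returned
```

With second resolution, a fast program produces two equal timestamps and the strict check misses the buffer. A human in the debugger is never that fast, which is why the issue disappears there.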
1. Benefit of simulation report
Although it took me one hour to develop the simulation report, I think it was definitely worth it: it saved me from spending lots of time and effort debugging the unfamiliar solution management program and enabled me to concentrate on the core code which might be related to the dump.
Sometimes you have some findings and, for verification, need to change the code which calls your API, but you cannot really do this since the code is not owned by you. In this case the simulation report plays its role! You can change it at will for verification.
2. The Mini-System methodology for issue-isolation
In the first decade of the 21st century it was very popular in China to assemble a PC ourselves, DIY style. We bought the CPU, memory, hard disk, motherboard and other parts from different hardware manufacturers and put them together. The most common issue was that after assembly the computer would not boot at all. Then we used the “Mini-System” approach for troubleshooting: as a first step we tried to boot the computer with only the LEAST necessary hardware (CPU + power supply + motherboard: these three components constitute a so-called “Mini-System”). If the first attempt succeeded, we added further components, but only ONE new component in EACH step. This iteration lets us find which piece of hardware makes the boot fail.
Compared with a computer system, our ABAP programs are much more complex, so issue isolation is all the more necessary for root cause investigation.
In my issue processing I used the “Mini-System” methodology to finally identify that the dump is related to the incorrect call of the IBASE delete function module.
3. Try to gain a perspective of overall situation of the issue
In this issue processing I spent quite a lot of time in the beginning debugging why function module CRM_IBASE_COMP_GET_DETAIL raises an X message.
Inside this FM some deeper APIs are called which are not owned by me, so I wasted lots of time trying to understand their logic. Later, after I read the whole source code of the include where CRM_IBASE_COMP_GET_DETAIL is called, I asked myself: should it be called at all? Why is it called in normal execution to get data from the DB, although the entry is already available in the buffer?
Do not think in isolation, think holistically
It makes sense to spend time and effort debugging the code where the exception is raised in order to understand why. It makes MORE sense to investigate the code holistically and to analyze the call stack and the execution context of the code. If the code (method or function module) fails to generate the output you expect, ask yourself: should it be called at all?
I hope this blog helps with your issue analysis. You are also welcome to share your tips and experience regarding tough issue processing 🙂
Update on 2014-5-14 to correct some mistake
The solution to force the valid_to timestamp to be 1 second later than the valid_from timestamp via the following ABAP code is wrong:
lv_valid_to = ls_comp_det-valfr + 1.
Suppose the component creation and deletion are both done at timestamp 20140514180000 (for simplicity, ignore the time zone).
Then lv_datlo (20140514) and lv_timlo (180001) are passed into the delete function module. However, the design of this function module does not support deletion in the future.
So currently the only available solution is to add a WAIT UP TO 1 SECONDS before the call of the delete function module, which ensures that the deletion always occurs 1 second later than the creation. Of course this degrades performance; fortunately, it is impossible for customers to run into this issue in the UI, since they cannot create the component, save it and delete it within the same second.
The dump only occurs if the following two prerequisites are met at the same time:
1. The customer or partner develops their own code which calls the creation, save and deletion FMs sequentially.
2. The system performance is so good that the three FM calls (create, save and delete) are completed within the same second.
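For comparison, here is a variation of the wait idea as a Python sketch. It is not the fix I implemented in ABAP (that one is a fixed WAIT UP TO 1 SECONDS): the sketch polls only until the clock has moved on to the next whole second after the creation, so the delay is never longer than needed. The function name `wait_for_next_second` and its parameters are illustrative assumptions.

```python
import time


def wait_for_next_second(since, clock=time.time, sleep=time.sleep):
    """Block until the clock's whole-second value is later than `since`'s.

    `since` is the creation time in seconds since the epoch. The clock and
    sleep functions are injectable so the logic can be tested without waiting.
    """
    now = clock()
    while int(now) <= int(since):
        sleep(0.05)
        now = clock()
    return now
```

A caller would remember `created_at = time.time()` right after the creation and call `wait_for_next_second(created_at)` just before the deletion. If creation and deletion are already in different seconds, the function returns immediately.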
Further reading
There is a very famous book by guru Jon Bentley, <<Programming Pearls>>. Section 5.10 gives several brief but useful debugging tips and also describes a very interesting bug: “A programmer had recently installed a new workstation. All was fine when he was sitting down, but he couldn’t log in to the system when he was standing up. That behavior was one hundred percent repeatable: he could always log in when sitting and never when standing. ” Would you like to know the root cause of this mysterious bug? Go read that book 🙂