The ABAP Detective and the Heisenbergar Inequality Case
Besides the short cases that end tidily, sometimes an ABAP Detective works an investigation that takes longer and may not end with an explicable or explicit case solved/message processed. This adventure had me booking tickets for a software beta test using a distributed quality assurance suite that had an implicit design bias toward black-and-white answers (or red pill/blue pill, or team divisions).
Some tests returned ambiguous results. How can a software algorithm vary, you ask? If you flip a coin, it will usually land on one side or the other: heads or tails. It is feasible, in some cases, that the coin lands on edge and doesn’t fall. Or a pigeon flies off with it and your flip never lands. It’s like that in quality assurance.
Originally, I was hunting the most wanted: a regression result, a test case that worked in the past but doesn’t now. Those bug reports are their own reward, in a way, as fixes can usually be applied once the deviation from prior results is researched. I found none of those grails, holy or otherwise, and eventually recognized a pattern where some tests failed only some of the time.
The entire software test suite has dozens, if not hundreds, of test cases. Some are built as defensive bulwarks (“this routine always returns a positive result”), and others as recognized bug alerts (“once this test passes, the previously discovered bug is either fixed, or the test no longer works right”). On slower machines (like remote-sensing system-on-chip boards, e.g. Raspberry Pi), the total run can take hours, making it hard to catch a rare failure in the act.
I also set this case up as a frequently running cron job. The failures occurred in 5 to 10 percent of the runs.
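Putting a number on “sometimes” is just tallying pass/fail outcomes over many runs. A minimal sketch in Python, not the actual harness: the test command, run counts, and outcomes below are invented; only the percentage arithmetic is the point.

```python
import subprocess

def run_once(cmd):
    """Run one test invocation; True means the test passed (exit status 0)."""
    return subprocess.run(cmd, capture_output=True).returncode == 0

def failure_rate(results):
    """Percentage of failed runs in a list of pass/fail booleans."""
    if not results:
        return 0.0
    return 100.0 * results.count(False) / len(results)

# Recorded outcomes instead of live runs: 2 failures in 20 runs,
# in the 5-10 percent band the cron job observed.
observed = [True] * 18 + [False] * 2
print(f"failure rate: {failure_rate(observed):.1f}%")  # prints "failure rate: 10.0%"
```

A cron entry then only needs to append each run's outcome to a log that a tally like this can read later.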
Here, a tangent. When I was in the enterprise seat, I learned the optimal problem report includes a reproducible case, or “how to reproduce” the error. For the cases that produce different results each run, providing steps to cause a failure might not be possible.
As these failure rates were very small, I initially named the case “Iota” (as in, not one iota). Then, feeling publican, I put out a poll:
The quasars and pulsars come from the space sciences; I made up Heisenbergars from the Uncertainty Principle (channeling our Schrödinger for quantum duplicity). The last term comes from the purported successor to the current test suite (a research tale for another time). Kyuant is supposed to seem like Quant. I guess.
So, now the Heisenbergar Case is named. How to proceed with this insight? After setting up the initial frequently scheduled run of a single test, I set up a series of single-test runs. That way, each run is short enough to gather useful data in a few days, or weeks at the most. Because the target machines had wide speed variations, the schedule for each was adjusted so that test runs didn’t overlap each other and there were no long periods without tests running.
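The staggering itself is just arithmetic. A minimal sketch, with made-up host names and per-run durations in minutes: each machine gets a start offset so the runs chain back-to-back within a repeating cycle, with neither overlap nor long idle stretches.

```python
def stagger(durations_min, cycle_min=60):
    """Assign each host a start offset (minutes into the cycle) so the
    runs follow one another back-to-back without overlapping."""
    offsets, t = {}, 0
    for host, dur in durations_min.items():
        if t + dur > cycle_min:
            raise ValueError("runs do not fit in one cycle")
        offsets[host] = t
        t += dur
    return offsets

# Hypothetical per-host run times in minutes (slowest board first):
hosts = {"pi0": 25, "pi3": 12, "pi4": 6}
print(stagger(hosts))  # {'pi0': 0, 'pi3': 25, 'pi4': 37}
```

The offsets translate directly into the minute field of each machine's crontab entry.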
Well, side bar:
After a few days, I had sufficient case results to begin the process of putting together the basic parts of a PR, the specific test, environment, and results as a failure percentage.
My field notes tell me I found:
I used multiple architectures (arm and amd64), finding some tests fail between 10% and 100% of the time (no failures gets you kicked out of the club), some from 1% to 10%, and even some that fail less than 1% of the time.
Essentially, I broke down the results into orders of magnitude (powers of 10), representing fault infrequency. I am not suggesting that tests with similar fault rates are correlated, but I figured any clue could be helpful at this stage.
I then decided to open tickets beginning with the rarest faults, thinking that since they take the longest to see, I should look there first. Trying to be responsible and not just “dump core,” I attempted to read the test case source and look for priors (cases with similar errors, not felons). That was as close as I could get to supplying do-over steps.
Cut To The Chase
One set of tests failed intermittently with memory shortages (short dumps, as it were). On first inspection you’d think a test run would act the same way each repetition. With outside interference ruled out, what else would vary the test run’s memory footprint? I won’t go all the way back down the rabbit hole I Aliced; suffice it to say that no code ever has enough comments.
* Our metric of success is crash / no crash.
Thus I started, not with a murder case, but with the inaptly named ping-of-death test, from the group having under one percent recorded faults.
Another case left evidence in plain sight (system logs), once you looked there of course.
[ 1498596.274645] UVM: pid 25241 (t_exhaust), uid 1000 killed: out of swap
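Log lines like that one can be picked out mechanically. A minimal sketch; the regular expression is my own guess at the shape of the UVM kill message, not a documented format.

```python
import re

# Guessed pattern for a UVM out-of-swap kill message like the one above.
UVM_KILL = re.compile(
    r"UVM: pid (?P<pid>\d+) \((?P<name>[^)]+)\), "
    r"uid (?P<uid>\d+) killed: (?P<reason>.+)"
)

line = "[ 1498596.274645] UVM: pid 25241 (t_exhaust), uid 1000 killed: out of swap"
m = UVM_KILL.search(line)
if m:
    print(m.group("name"), "-", m.group("reason"))  # t_exhaust - out of swap
```

Running a scan like this across the system logs ties each kill back to the test process that triggered it.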
Analyzing failures across system sizes also revealed a trend: the smaller the memory, the more frequent the faults:
pi4:arm64 8GB   - 100% pass
pi3:arm64 1GB   -  90% pass
pi0:arm32 512MB -  10% pass
Case findings like that allow me to file the investigation into the “solved” board, or at least “explicable.”
Others have built test cases to determine whether race conditions occur in specific threaded software areas. These case names popped out of my results:
On analyzing the first run cycles, the latter of the two failed 100% of the runs. I tossed it back into the cold case file, as it didn’t fit this effort of reporting infrequent failures.
I filed the case with this error, though I felt sheepish that I had no other evidence to tie in:
tc-se:Fail: regexp WARNING: ThreadSanitizer: data race on vptr \(ctor/dtor vs virtual call\) not in stderr
I don’t know if I fully agree with this comment posted in the test code: “Ignore the ones with ‘race’ in the name, they fail from time to time and win the race in other cases.”
Space and Time
The tests that broke with memory shortages had at least a recognized failure mode, a modus operandi if you will. The out-of-swap limit can be overcome (as can the “don’t write to SD cards much” fault), but tests that run out of a time limit are a different kettle of fish.
One I reported had a 5-minute term limit, a very low failure rate, and no failures at all on some testbeds. Still a mystery to me, probably even more so after reading the source code. Beyond that, the log messages don’t seem to match the timestamps on some failures. More data to digest; this might be two bugs linked to one test. I found background about grace periods that could be revelatory.
I have a handful of more-frequent failures that are ready for problem reports; some of them I’ve decided don’t have enough detail or analysis to make the effort worthwhile. Unfixed but reported bugs are sometimes necessary.