Application Development Blog Posts
Learn and share on deeper, cross technology development topics such as integration and connectivity, automation, cloud extensibility, developing at scale, and security.
JimSpath
Testing hypotheses? What type of detective work is that? Sounds like scientist work, not my set.

Recently, as the world has changed yet again, I found myself doing piece work for an operating system conglomerate. A small non-profit, so I wasn't going to get rich. I'm not sure I was going to get paid in any coin of the realm except the highly discounted "exposure." Operative, yeah, that's it.

Despite using this OS for decades, I had managed to avoid the back alleys where the unit testers, the user acceptance folk, and the even more ominous automated testers hang out. It turns out a sophisticated quality assurance system has been out there all along, like the Rock of Gibraltar, and I was blissfully unaware. After completing somewhat pedestrian installation compatibility and benchmark testing, I signed myself up as a relic hunter. Relic as in ancient bugs. Here's one of the older code sections I found (dated April 1st, 1982, and not an April Fools' prank):
/*
* @(#)table.c 1.1 (Berkeley) 4/1/82
*/

My first plan was to run the full built-in suite across various implementations, from a PC-sized system on down to tiny systems on chips. The immediate clue was that some tests work only with escalated privileges, which doubled the number of planned test runs, since unprivileged tests are also important.
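For context on how a test declares that requirement: ATF-based test programs mark privilege needs as metadata, and the runner skips the case when the requirement isn't met. A minimal sketch of my own (not a test from the NetBSD tree) looks roughly like this:

/* Illustrative only: a trivial atf-c test case that declares it needs root. */
#include <atf-c.h>
#include <unistd.h>

ATF_TC(needs_root);
ATF_TC_HEAD(needs_root, tc)
{
    atf_tc_set_md_var(tc, "descr", "Example of a privileged test case");
    /* The test runner skips this case when not executed as root. */
    atf_tc_set_md_var(tc, "require.user", "root");
}
ATF_TC_BODY(needs_root, tc)
{
    ATF_REQUIRE_EQ(geteuid(), 0);
}

ATF_TP_ADD_TCS(tp)
{
    ATF_TP_ADD_TC(tp, needs_root);
    return atf_no_error();
}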

Once the full suite completed, I could look at the results and determine the significance of each failure, apparent failure, or bogus "pass" result. What resulted was a multi-faceted dive into bug reports and how to make them better. The findings included tests that needed work, tests that showed something missing, and tests that were either inconclusive or ambiguous.

The golden rule: whenever possible, write good, repeatable steps for the issues you find.

Testing bugs


If a test run fails, does that mean there is a bug in the code, or is the test itself buggy? One of the first issues I unearthed was in an environmental monitoring function. Apparently, when it was written, either there was a dearth of configurations to test against or the coder was only familiar with a subset.

That code was rewritten/patched with a more comprehensive approach. It still doesn't help if the sensor code doesn't detect all available sensors, but that's a cold case trail for another saga.

Clearly, initial test designs may suffer from lack of experience/knowledge, or there might be ongoing developments that render the original assumptions moot. When later testers read the original test case documentation (or sometimes just the code), are the assumptions and protocols clear?

One unsolved mystery has excellently commented code, at least for those who are capable of reading with understanding. An example:
if (is_tcp) {
        /*
         * Get the write socket buffer size so that we can set
         * the read socket buffer size to something larger
         * later on.
         */
        socklen_t slen = sizeof(sndbufsize);
        ATF_REQUIRE(getsockopt(writefd, SOL_SOCKET,
            SO_SNDBUF, &sndbufsize, &slen) == 0);


/*
*
* The above code is taken from the t_empty.c test on a NetBSD system,
* which requires the copyright and source link be shared, as follows:
*
*/


/*-
* Copyright (c) 2021 The NetBSD Foundation, Inc.
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
*
* THIS SOFTWARE IS PROVIDED BY THE NETBSD FOUNDATION, INC. AND CONTRIBUTORS
* ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
* TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
* PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE FOUNDATION OR CONTRIBUTORS
* BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
* CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
* SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
* INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
* CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
* ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
* POSSIBILITY OF SUCH DAMAGE.
*/


/*
* code source, of sorts:
* https://github.com/NetBSD/src/blob/trunk/tests/kernel/kqueue/t_empty.c
*/

Typo?


One issue turned out to result from incorrect syntax used to express the desired test logic. The code was not the sort a compiler or other checker would warn about, so the glitch was only noticeable by examining the output.

Some might contend that code-checkers would find these "human" errors and flag them. Maybe, in some cases, though totally valid logic in code may not reflect the designer's intent.
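A hypothetical illustration of the species (not the actual code I found): strcmp() returns zero on a match, so the first assertion below compiles cleanly, draws no warnings, and "passes" exactly when the strings differ.

#include <atf-c.h>
#include <string.h>

ATF_TC_WITHOUT_HEAD(greeting);
ATF_TC_BODY(greeting, tc)
{
    const char *expected = "hello";
    const char *actual = "hello";

    /* Inverted logic: this requires a MISMATCH, yet it reads plausibly. */
    ATF_REQUIRE(strcmp(expected, actual));

    /* What was presumably intended: require that the strings are equal. */
    ATF_REQUIRE(strcmp(expected, actual) == 0);
}

ATF_TP_ADD_TCS(tp)
{
    ATF_TP_ADD_TC(tp, greeting);
    return atf_no_error();
}

Only the output (a failure on the first assertion, for strings that obviously match) gives the game away.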

Write your assumptions into your bug reports so that other analysts can determine whether they made the same assumptions, such as character encoding, expected outputs, and known interfering conditions.

This is the Kobayashi Maru moment. The test itself must be changed; as the Captain said, "I don't like to lose."

Timelines, Timelines, Timeliness


"What I tell you three times is true" (Stand on Zanzibar). Wait; one of those is not the same! My advice to you rookies, when you take down statements, er, case notes, record the time; and when you put together an incident time line, keep it neat and tidy. Just the important steps, no extras.

In a low-level debugging crime, time is not always as it appears. Like Alice's trip through the mirror (looking glass), distortions and disturbances may occur. Check out, for example, this tale about time warps. Multithreaded processes might have time gaps while waiting for locks, for instance.

The simple detective story case note format I recommend has time, observation, and comments.

Time | Observation | Comments

Pull time and observations directly from event logs if possible, and include metadata or limited speculation in the comments ("I think someone turned off the power."). Keep everything in chronological order.
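For those who would rather keep case notes next to the code than in a spreadsheet, the three columns reduce to a very small record. A sketch, with names of my own invention:

#include <stdio.h>
#include <time.h>

/* One row of the detective's timeline: time, observation, comments. */
struct case_note {
    time_t      when;           /* ideally lifted straight from the event log */
    const char  *observation;   /* what the log or test output actually said */
    const char  *comment;       /* metadata, or clearly labeled speculation */
};

static void
print_note(const struct case_note *n)
{
    char stamp[32];

    strftime(stamp, sizeof(stamp), "%Y-%m-%d %H:%M:%S", localtime(&n->when));
    printf("%s | %s | %s\n", stamp, n->observation, n->comment);
}

int
main(void)
{
    struct case_note note = {
        .when = time(NULL),
        .observation = "servent test: observed output does not match reference",
        .comment = "I think someone turned off the power."
    };

    print_note(&note);
    return 0;
}

Whatever the format, the chronological ordering does the heavy lifting.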

True bugs?


Among the errors and omissions in the test protocols, I might even stumble on a likely suspect showing a repeatable bug. In this investigation, one test result I looked at failed on a compile step that should produce an application with embedded instrumentation for profiling data (not the civil-rights-violation type of profiling; the statistical analysis kind).

The observations told me something was amiss.
tc-end: 1680560733.140145, hello_profile, failed, atf-check failed; see the output of the test for details
tp-end: 1680560733.159462, usr.bin/cc/t_hello

A widely distributed sample program ("hello world") is as basic as it gets. In this case, the test creator added command-line flags that caused build errors in my test run.

In writing the ticket, er, case report, I included an isolated command that also fails outside the test suite, to show myself and future readers basic, repeatable steps.
$ cc -o testcase_pg -pg  testcase.c
ld: /tmp//ccxyF2mE.o: in function `main':
testcase.c:(.text+0xc): undefined reference to `__gnu_mcount_nc'
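For completeness, the testcase.c behind that command is assumed to be nothing more exotic than the classic greeting, so the failure is plainly about the -pg instrumentation and not about the program itself:

/* testcase.c: the assumed minimal program; any hello-world variant will do. */
#include <stdio.h>

int
main(void)
{
    printf("hello, world\n");
    return 0;
}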

The third pillar of test reporting is to speculate, wisely, on a root cause. If you have no idea, maybe say so. If you found the suspected root cause, write it that way.

The root cause for this next error was not obvious initially.
<so>Lowering kern.maxvnodes to 2000</so>
<so>Executing command [ sysctl -w kern.maxvnodes=2000 ]</so>
<se>Fail: incorrect exit status: 1, expected: 0</se>

Further investigation revealed an invalid test assumption: that 2,000 vnodes were not already active. Whether this test can be altered to provide valuable feedback remains to be determined.
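One way to make that assumption explicit is to probe the precondition and skip, rather than fail, when it does not hold. A rough sketch of the pattern, assuming an atf-c style test (the real test may be structured quite differently, and a fuller check would need the count of vnodes already in use, whose sysctl name I won't guess here):

#include <sys/types.h>
#include <sys/sysctl.h>
#include <atf-c.h>

ATF_TC_WITHOUT_HEAD(lower_maxvnodes);
ATF_TC_BODY(lower_maxvnodes, tc)
{
    int target = 2000;          /* assuming an int-sized sysctl node */
    int maxvnodes;
    size_t len = sizeof(maxvnodes);

    ATF_REQUIRE(sysctlbyname("kern.maxvnodes", &maxvnodes, &len,
        NULL, 0) == 0);

    /* State the assumption in code: skip when it does not hold. */
    if (maxvnodes <= target)
        atf_tc_skip("kern.maxvnodes already at or below %d", target);

    /* Lowering the limit needs the escalated privileges noted earlier. */
    ATF_REQUIRE(sysctlbyname("kern.maxvnodes", NULL, NULL,
        &target, sizeof(target)) == 0);
}

ATF_TP_ADD_TCS(tp)
{
    ATF_TP_ADD_TC(tp, lower_maxvnodes);
    return atf_no_error();
}

A skipped case with a reason string tells the next detective far more than a bare failure.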

Tracking back


Looking through test suite summaries to find useful gems in a sea of standard output and errors can be daunting, even with programmed helpers. In the beginning, I naively looked for "failure", then learned to classify "expected failure" results differently. Some tests produce minimal logs by design, and more detail might need to be added to glean possible root causes or evidence of system state.
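A sketch of the kind of programmed helper I mean, keyed to the tc-end lines quoted throughout this post; the result keywords are the ones I have seen, so treat the list as incomplete rather than authoritative:

/* Tally test-case results from "tc-end:" lines read on standard input. */
#include <stdio.h>
#include <string.h>

int
main(void)
{
    char line[1024];
    int passed = 0, failed = 0, xfail = 0, skipped = 0, other = 0;

    while (fgets(line, sizeof(line), stdin) != NULL) {
        if (strncmp(line, "tc-end:", 7) != 0)
            continue;
        if (strstr(line, ", expected_failure") != NULL)
            xfail++;        /* not the same animal as a real failure */
        else if (strstr(line, ", failed") != NULL)
            failed++;
        else if (strstr(line, ", skipped") != NULL)
            skipped++;
        else if (strstr(line, ", passed") != NULL)
            passed++;
        else
            other++;
    }
    printf("passed %d, failed %d, expected-failure %d, skipped %d, unclassified %d\n",
        passed, failed, xfail, skipped, other);
    return 0;
}

Feed it a saved test log; the "unclassified" bucket will tell you which keywords I missed.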

One valid failure I found indicated a discrepancy (two output streams did not match).
tc-end: 1680881450.833599, servent, failed, Observed output does not match reference output 
tp-end: 1680881450.836324, lib/libc/net/t_servent

After I filed my case report, others realized that certain steps appended to a database without removing stale entries. Not solved yet, but my detective work sufficed to generate ideas for fixes.

The Spice of Life: Variety


In a well-established code base (think the SAP ERP kernel, for instance) there might be some ancient tropes that, given their ubiquity, won't be going away. After learning more about one embedded database (services), I looked in the same location and found three different database architectures in just three files:
locate.database: data
man.db: SQLite 3.x database, user version 20190518, last written using SQLite version 3026000, file counter ...
services.cdb: NetBSD Constant Database, version 1, for 'services(5)', datasize 205871, ...

The first is possibly the oldest design, the second uses SQLite, and the third uses a descendant (I think) of the classic Berkeley "db" design. As an aside, I wondered why the locate database file was not identified by the handy "file" command.

"A fast filename search facility for UNIX is presented."

Reference: https://www2.eecs.berkeley.edu/Pubs/TechRpts/1983/5392.html

I designed test runs to cover multiple conditions where possible, such as the privilege levels noted above and, where feasible, multiple interfaces. One function might work great standalone; adding more variety might show an unexpected discrepancy. I found one as-yet-unresolved test bug by turning Wi-Fi on and off. You might also consider testing where internet services are unavailable, to see if there are hidden external calls (or worse).

Temperance of Patience


Sometimes, test results are inconsistent. Here, multiple runs with varying background states might be in order. For benchmarking or scalability tests, having a "clean" running system is important. For these regression tests at the unit level, the expectation is that the results are the same no matter what the system load or running processes might be. I am not trying to do integration tests, volume tests, or scalability tests, other than having a gamut of processor power available, from finger-sized Raspberry Pis to multi-core AMD boards with multiple fans running. Incidentally, one of the NetBSD code tree's claims to fame is the enormous set of processor family targets that run the most recent release I'm testing: 8 Tier 1 and 49 Tier 2, of which only 2 (like Itanium) are not distributed as source and binary (VAX, anyone?).

Reporting tests without consistent results is more art than science, more guidelines than actual rules. As a developer or database administrator, the phrase you don't want to hear is "sometimes it fails". It is hard to rig a test for something that doesn't behave logically.

What do you do when repeated runs return ambiguous results?

  1. Don't Panic. Even if you want instant gratification, know this is not always possible.

  2. Set up repetitive runs. Hold all variables except one constant, if possible.

  3. Document, thoroughly, your conscious assumptions (you have unconscious ones too, so wait for insights to reveal themselves at random times, like while bathing ("Eureka!") or after a good night's sleep ("Aha!")).


After an initial full test suite run, I isolated specific failures and tried to repeat those tests individually. Generally they fail again once I recreate the same state. On occasion, a failed test passed, and then later failed again. Other tests failed in the main run, but then I could not make them fail again. Back burner!

The random fail of the day:
tc-end: 1680978077.609410, sock_tcp, failed, /usr/src/tests/kernel/kqueue/t_empty.c:167: (readsock = accept(readsock, (struct sockaddr *)&sin, &slen)) != -1 not met

So far, I cannot understand the possible reasons for this test to fail intermittently, so I have delayed opening a problem report. How I am approaching further detective work:

  • Automate this unit test; for now, there is a cron job that reports a success/fail code to an external tracking database (a minimal sketch of such a wrapper follows this list).

  • Observe potentially unrelated conditions. Since we don't know what the trigger is, note as much as possible so that future tests might reveal useful clues.

  • Document. Like this post!

  • Shut down apps and retry.
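As promised above, a sketch of the wrapper behind that cron job; the command and log path are placeholders, and a flat file stands in for the external tracking database:

#include <sys/wait.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int
main(void)
{
    /* Placeholders: substitute the real test invocation and log location. */
    const char *cmd = "/path/to/run-flaky-test";
    const char *logpath = "/var/tmp/flaky.log";
    char stamp[32];
    time_t now;
    FILE *log;
    int status, fail;

    now = time(NULL);
    strftime(stamp, sizeof(stamp), "%Y-%m-%dT%H:%M:%S", localtime(&now));

    status = system(cmd);
    /* Fail = 1, Pass = 0, matching the chart below. */
    fail = (status == -1 || !WIFEXITED(status) ||
        WEXITSTATUS(status) != 0) ? 1 : 0;

    log = fopen(logpath, "a");
    if (log == NULL)
        return 1;
    fprintf(log, "%s %d\n", stamp, fail);
    fclose(log);

    /* Propagate the result as our own exit code for the scheduler. */
    return fail;
}

Run from cron once per minute, a wrapper along these lines yields the kind of record summarized next.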


In just under 24 hours, running once per minute, this case failed 50% of the time. Alas, it was not a consistent pattern of pass-fail-pass-fail, which might imply a possible root cause, but more like a coin flip with unpredictable individual results.


(Chart of the minute-by-minute results: Fail = 1; Pass = 0)



To do:


Try to find the unpublished memo titled "Webster's Second on the Head of a Pin" by R. Morris and K. Thompson. You may have heard of one or both of these ancestors (Ken is still with us).

File the next PR (problem report). Subsequent detective tasks will be to run parallel tests, run after a reboot with minimal process load, compare wired versus wireless, and whatever strikes a chord at the time.

Try to repeat any tests that did not pass or fail 100% of the time on all available systems (reload current systems with a minimal install and attempt to repeat, preserving the flaky image).

Read the source. Suggest documentation improvements.

Check bsd42.sourceforge.net for interesting archival code.

Test. And test again.

Supporting Cast of Operatives:



  • Mma Precious Ramotswe

  • Guy Noir

  • Dixon Hill

  • Sherlock Hemlock

  • Hemlock Stones

  • The BSD family, with this rift healed ("first!", "no, me first!"). Looking at the full family tree on record, you might see the Mac OS X origins between FreeBSD and NetBSD. The email link content shows but a snippet.

    • The Jolitzes, who published about 386BSD 0.1 in Dr. Dobb's Journal, which I grabbed by floppy disk and ran before the NetBSD and FreeBSD distributions began (30 years ago now).

    • core

    • toor




 

Case study:


 
Summary for 944 test programs:
10508 passed test cases.
14 failed test cases.
76 expected failed test cases.
559 skipped test cases.