The ABAP Detective Melts in the Heat
As a one-bit detective in an ABAP world, troubleshooting is my business. I use a test suite like some might use a ruler. And when data goes astray it’s my case to solve.
Previously, I’ve looked into collecting environmental data points from local or remote sensors and drawing conclusions about energy use and impact on a micro scale. One finding that stood out was outlying data points that looked and felt wrong, creating a false history and obscuring the true picture of heat flows.
Here is an image showing one spike:
Here is a more recent observation, where the temperature spikes down instead of up:
The second chart shows readings from a different set of sensors than the first, but the pattern is similar: readings follow a narrow range, jump out of that range, then drop back in. It is highly unlikely there was a microburst of freezing temperatures here in the middle of the summer. No ice was involved.
To research the root cause and try to prevent the data pollution, I went back to the source. Code, but not ABAP exactly. The sensor supplier provides Python libraries for some, but not all, of their devices. To fit into the back-office data aggregation, I could use shell scripts, Python, or some combination. Because one of the sensors I was testing exposes multiple parameters, I thought I’d combine the logic into one module using a switch/case statement. Alas, the purity of Python grammar long meant that basic algorithmic building block was forbidden. A pile of if/elif/else blocks hurts my eyes, not to mention being hard to trace. My case file search turned up the “match/case” statement introduced with Python 3.10.
Code snippets follow.
#!/usr/bin/python3.10
import time
import statistics
import qwiic_bme280
# ... (sensor setup elided)
channel = 'temperature_celsius'
readings = []
counters = 0
while counters < 3000:
    match channel:
        case 'temperature_celsius':
            readings.append(mySensor.temperature_celsius)
        # ... (other channels elided)
        case _:
            print("Ow\n")
    counters = counters + 1
    time.sleep(0.2)
print("%.2f" % statistics.fmean(readings))
Don’t shoot me for the inelegance of this sample; it is a one-off to find outliers! By looping 3,000 times and sleeping for a fifth of a second between readings, I was able to reproduce the data error. A sample from the logs:
2022-07-25 21:12:46.011 INFO Compensated temperature: 26.99 *C
2022-07-25 21:12:46.214 INFO Compensated temperature: 26.90 *C
2022-07-25 21:12:46.417 INFO Compensated temperature: -7.09 *C
2022-07-25 21:12:46.619 INFO Compensated temperature: 27.10 *C
2022-07-25 21:12:46.822 INFO Compensated temperature: 26.93 *C
2022-07-25 21:12:47.025 INFO Compensated temperature: 26.92 *C
2022-07-25 21:12:47.228 INFO Compensated temperature: 27.09 *C
2022-07-25 21:12:47.432 INFO Compensated temperature: 27.00 *C
2022-07-25 21:12:47.635 INFO Compensated temperature: 26.93 *C
2022-07-25 21:12:47.837 INFO Compensated temperature: 27.02 *C
2022-07-25 21:12:48.040 INFO Compensated temperature: 26.93 *C
The loop structure works as intended, returning 4 or 5 sample values per second. This extract highlights one out-of-range value, dropping more than 30 degrees and recovering within half a second. Clearly a “bogon,” or bogus result.
How bad is the error rate? If too frequent, setting up an algorithm to clean out bad values is more daunting. Word count results:
$ wc -l /tmp/bme280.txt
3004 /tmp/bme280.txt
$ grep " -" /tmp/bme280.txt | wc -l
30
30 bad values out of 3,000 (not counting 4 header records) is a low frequency: one percent. Now that I had a suspect list, I wanted to see how the errors occurred over time. Did they happen at random intervals? Were they bunched around certain times that might indicate a root cause, such as too many processes trying to run at the same time? Rather than an elegant high-level data visualization process, I did the usual thing and threw the times and temperatures into a spreadsheet, then a chart.
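Before reaching for the spreadsheet, the same spacing question can be answered with a few lines of standard-library Python. A minimal sketch, not the code I ran: the helper name is mine, and treating any negative reading as an outlier simply mirrors the grep for " -" above.

```python
import re

# Matches the logger's line format shown in the extracts above.
LOG_LINE = re.compile(r"INFO\s+Compensated temperature: (-?\d+\.\d+) \*C")

def outlier_gaps(lines, floor=0.0):
    """Count in-range readings between successive outliers (temp < floor)."""
    gaps = []
    since_last = None  # None until the first outlier is seen
    for line in lines:
        m = LOG_LINE.search(line)
        if not m:
            continue  # skip header records and other noise
        if float(m.group(1)) < floor:
            if since_last is not None:
                gaps.append(since_last)
            since_last = 0
        elif since_last is not None:
            since_last += 1
    return gaps

# Usage against the capture file:
# with open("/tmp/bme280.txt") as fh:
#     print(outlier_gaps(fh))
```

A short list of gaps means the bogons cluster; uniformly large gaps mean they are scattered.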
The scale might not be apparent, but clearly there are repetitive groups of outliers. To understand the pattern a little more, I zoomed in on several outliers to see how often they happened compared to the in-range readings. Too close together, and it becomes harder to separate good from evil.
Well, that’s a relief that there are at least 10 or so good values between the bad values. Further testing showed similar results.
3,000 readings span around 10 minutes, which was much too long for the automated data capture I set up. More readings would make the statistical sampling more reliable (I don’t know what the ideal sample size is, as I’m four decades past my last stats class). I tried cutting the run down to around 30 seconds, but the Zabbix framework rejected that trial with a “timeout while executing a shell script” error message. While the documentation suggests 30 seconds is the cut-off, trial-and-error changes were hampered by the automatic item-disabling logic, so I dialed the loop back to 15 samples in around 3 seconds.
Running more frequent sampling shows repetitive data, but also the duration of the glitches.
2022-07-29 13:15:03.013 INFO Compensated temperature: 27.32 *C
2022-07-29 13:15:03.046 INFO Compensated temperature: 27.23 *C
2022-07-29 13:15:03.078 INFO Compensated temperature: -9.39 *C
2022-07-29 13:15:03.111 INFO Compensated temperature: -9.39 *C
2022-07-29 13:15:03.143 INFO Compensated temperature: -9.39 *C
2022-07-29 13:15:03.176 INFO Compensated temperature: -9.48 *C
2022-07-29 13:15:03.208 INFO Compensated temperature: -9.39 *C
2022-07-29 13:15:03.241 INFO Compensated temperature: -9.39 *C
2022-07-29 13:15:03.273 INFO Compensated temperature: 27.17 *C
2022-07-29 13:15:03.305 INFO Compensated temperature: 27.17 *C
2022-07-29 13:15:03.338 INFO Compensated temperature: 27.26 *C
2022-07-29 13:15:03.370 INFO Compensated temperature: 27.26 *C
2022-07-29 13:15:03.403 INFO Compensated temperature: 27.26 *C
2022-07-29 13:15:03.435 INFO Compensated temperature: 27.26 *C
2022-07-29 13:15:03.467 INFO Compensated temperature: 27.26 *C
2022-07-29 13:15:03.500 INFO Compensated temperature: 27.17 *C
Within half a second there are 6 bad values out of 16, showing both that the errors are too frequent and that the overall sampling window is too short.
I started down the Python numeric library path but hit an unexpected error running basic tests.
File "pandas/_libs/interval.pyx", line 1, in init pandas._libs.interval
ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 48 from C header, got 40 from PyObject
After I observed suspicious sensor readings, I reviewed the trends to determine how significant the outliers were, how frequent, and how far apart errors occurred. By grabbing more frequent samples, I determined the incorrect readings started and stopped rapidly. By guided trial and error, I introduced improved sampling code that drops the outliers.
The new solution is not ideal, though, as it simply tests against the now-known bad values. The better solution will be to deploy an algorithm that computes the mean and standard deviation, rejecting values outside a defined acceptable range (some multiple of the standard deviation).
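For illustration, here is a minimal sketch of that stopgap idea, not the exact code deployed: the retry-on-bad-value approach, the 0 °C sanity floor (every observed glitch was negative), and the helper names are my assumptions.

```python
import time

SANITY_FLOOR = 0.0  # assumed: indoor summer readings never dip below freezing
MAX_RETRIES = 3

def read_temperature(sensor):
    """Read the sensor, resampling when a value fails the plausibility test."""
    for _ in range(MAX_RETRIES):
        value = sensor.temperature_celsius
        if value > SANITY_FLOOR:
            return value
        time.sleep(0.05)  # brief pause before resampling; the glitches clear fast
    return None  # caller decides what to do with a persistent failure
```

Since the captures show at least 10 good readings between bad bursts, a couple of retries is usually enough to step over a glitch.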
For that improved fix, I need to either write my own routines based on the statistics library that supplies mean and standard deviation, or debug the errors in the more advanced “numpy” library.
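The standard-library route might look like the following sketch; the 3-sigma cut-off is my choice, to be tuned against the sensor’s real noise.

```python
import statistics

def reject_outliers(readings, k=3.0):
    """Drop values more than k standard deviations from the mean."""
    if len(readings) < 2:
        return list(readings)  # not enough data to compute a spread
    mean = statistics.fmean(readings)
    stdev = statistics.stdev(readings)
    if stdev == 0:
        return list(readings)  # all values identical; nothing to reject
    return [r for r in readings if abs(r - mean) <= k * stdev]
```

One caveat: with heavy contamination, like the 6-bad-of-16 burst above, the outliers inflate both the mean and the deviation enough that nothing gets rejected; a median-based cut-off is more robust for short, glitch-heavy windows.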
I’m rolling out Python 3.10 on every node that I can, if only to muck with claims such as “But, in Python, there is no case statement by default.”