The ABAP Detective Takes The Heat, or How To Cleanse Data
First, I must apologize for classifying this case as “Big Data” when the focus is the total opposite, “Little Data” but I could not find a better tag, nor is this a total stretch since Big Data is invariably made up of a lot of Little Data, and if the latter has bias or wrong values, the former may be suspect.
In my personal “back office” I set up a Raspberry Pi, running the FreeBSD operating system, and deployed the Zabbix monitoring suite there. In doing so, I uncovered many useful comparisons and suggestions applicable to larger networks of monitoring tools, consoles, and alert management systems. While it may seem trivial, I connected an environmental monitor to another Pi, and have been observing the ambient temperatures mostly, for a few weeks. Below I’ll link my personal post on how I built out Zabbix; for an SAP audience I want to focus on the data collection, storage, analysis, and resulting required decisions. I’d certainly recommend this tool for many uses, given my experience so far (and note that any data collection architecture is only as secure as you make it, and probably less if you skimp).
Data collection display
Here is a chart captured from the Zabbix console, showing room temperature in degrees Fahrenheit (the base values are actually in Centigrade, but I prefer my accustomed scale).
Notice the “spike” early in the timeframe, and the Y-axis scale going from 40F to 120F. Details in the last/min/avg/max values look like one might expect, except one of the ranges showing a maximum near 100F (38C). As an aside, the monitor itself has 2 distinct temperature sensors; there is a further complication that those readings are taken very close to the Pi chipset, meaning higher than room temperature as a background. I’ve adjusted for that, but that topic is off-topic from the post-measurement analysis. Let’s stipulate the values are reasonable, other than the ones in the spike.
How do we account for the spikes? Somehow, this inexpensive part does not supply valid output 100% of the time, not even mentioning power outages, reboots, or other interference. At first I didn’t even see any spikes so I had not considered what effect they would have on the data presentation and usefulness. I did get a taste when working up the offset calculations, but start-up blips are easy enough to ignore.
In this example, there is more than one “wrong value” displayed, though without knowing the sampling frequency directly, we could expect 1 or 2 per minute, if the ramp-up is any indication. But in a business or science application where the underlying values are expected to be correct, what does the keen observer do? I guess in business you call for an audit, and in science you do a data quality review where you need to determine if the outliers are due to incorrect recording (noise on the line, say), or are due to a previously unconsidered cause. The history of science is littered with notes about mistakes people made in making data fit prior assumptions. But here, given the nature of the measurements, and the actuality of having a backup value recorded nearly simultaneously, let’s assume it is wise to “throw out” these data.
Great, how does that happen? Again, in my home office I can just purge or refresh the base data tables any way I like. But in a business setting you’d need to know the rules and privileges necessary to cleanse data. In some roles, I was the sole data proprietor and in other roles, needed to file change notices or other documentation for data cleanup [data volume management/archiving fits here a bit but I’m focusing more on the recording validation algorithm/thought process].
With a few searches about Zabbix data housekeeping, and some database sleuthing, I came up with the specific places these values are stored. At least the initial capture; more on trends later. Besides being open source and having a very user friendly interface, the Zabbix tools have a pretty clear entity model, meaning it only took a few minutes to zero in on the aberrations. Other tools I’ve used include HP OpenView, BMC Patrol/Enterprise Manager, some CA products. and, alas, SolarWinds, all of which would have been more opaque (in general, my opinion) in this endeavor.
Example code snippet 1
select itemid, clock, value from history where itemid -- and ( value > 74 OR value <= 0 ) order BY clock ;
I left out irrelevant details isolating which data set contains the bad data. The range of values I’m expecting is between 0F and 74F in this example, and that is where the thinking needs to happen. As it is winter when I write this, 74F/23C is abnormal, in summer it would not be. If these data were more critical, I’d have to spend more time correlating the flaky sensor values with the less flaky one.
So far, I have not automated the data cleanup as it’s quick enough to find them and purge them if I cared to, compared to setting up a scheduled process. With Zabbix, it might be possible to include this in the base configuration; for me, though, writing a database procedure is not overly challenging, again giving credit to the Zabbix data model(s).
delete from history where itemid = -- and clock in ( 1645947811, 1645947931, 1645947691 ) ;
In a wider array of measurements, for example, background chemical levels, or biological traces, recording triggers such as these, later audits can be performed, and quality adjustments made, ensuring the confidence in conclusions drawn or actions taken as a result. Referring back to water quality sampling data from an earlier role I had, the metadata about sampling frequency, procedures, data collection chain of custody and other useful comments could/should be tied to the detailed data.
|itemid||clock||to_timestamp(CAST(clock as bigint))||value|
After the rows shown above were purged from the history table, the refreshed Zabbix interface showed the new data range without those.
Now that looks more reasonable! When we zoom out to a longer time period, the ambient temperature patterns are what I’d expect.
The dip near the middle of the plot from 65F(18C) down to 50F (10C) is a result of an open door allowing colder outdoor air to mix with indoor heated air; later dips are less drastic because of shorter heat leaks/loss. If the latter chart is what you present to management or decision makers, you are much better off than if you had overlooked the earlier data corruption without explanation or comments. The dip shown has a larger dimension than the earlier chart, now that the scale is reduced to the usual “room temperature” number range.
An even further look back in time showing diurnal cycles.
The lighter background are the default “working hours”, not quite the same as daylight, of which more in a future case.
Is our case air-tight (pun intended)? Not quite. Here are the “before and after” temperature ranges, for slightly different time periods.
The first row from the SHTC3 sensor has a maximum of 89.7F (32C), which drops to 70.9F (21.6C) after the data purge. The last row shows another 89.7F maximum, presumably due to other, uncorrected bad values, and a minimum of a Siberian negative 45 F (-42C, close to the -40 crossover between C and F).
The dilemma now apparent is that the auto-scaled temperate range on the Y-axis, and the data plots themselves, do not show any values above 80F or below 50F. What happened? If you had presented this chart without noticing the discrepancy, someone else might, with possibly negative outcomes. The first approach is to add a note, somehow, explaining that the data displayed in the graph has a different source than the data in the range box. You could speculate the root cause is one presentation uses history data directly, while the other pulls from trending or derived data.
Why are the trends incorrect? We’d need to look into how the internal data housekeeping operates, and see if a recalculation is even feasible. Or perhaps reset the history retention periods such that the derived data gets recalculated to better match the raw. Myself, I am unconcerned, now that I understand where the bad gets in, how to address it in the main, and what remnants of in imperfect world remain in place.
The Moral Of The Story
In a rendition of one of my favorite songs from the late great John Prine, he says this before the ending verse: “a good story should have a darn good ending, and if you don’t have a good ending, at least have a moral to the story. Here’s the moral.”
- Understand your data collection algorithms, transmission techniques, and storage mechanisms.
- Know what makes up any data visualizations; if they are invalid find out why and get that fixed.
- You might change what is recorded in history, but truth will prevail.