The ABAP Detective Takes The Heat, or How To Cleanse Data
First, I must apologize for classifying this case as “Big Data” when the focus is the total opposite: “Little Data.” I could not find a better tag, and it is not a total stretch either, since Big Data is invariably made up of a lot of Little Data; if the latter has bias or wrong values, the former may be suspect.
In my personal “back office” I set up a Raspberry Pi, running the FreeBSD operating system, and deployed the Zabbix monitoring suite there. In doing so, I uncovered many useful comparisons and suggestions applicable to larger networks of monitoring tools, consoles, and alert management systems. While it may seem trivial, I connected an environmental monitor to another Pi and have been observing mostly ambient temperatures for a few weeks. Below I’ll link my personal post on how I built out Zabbix; for an SAP audience I want to focus on the data collection, storage, analysis, and resulting required decisions. I’d certainly recommend this tool for many uses, given my experience so far (and note that any data collection architecture is only as secure as you make it, and probably less if you skimp).
Data collection display
Here is a chart captured from the Zabbix console, showing room temperature in degrees Fahrenheit (the base values are actually in Centigrade, but I prefer my accustomed scale).
Notice the “spike” early in the timeframe, and the Y-axis scale running from 40F to 120F. The last/min/avg/max values look like one might expect, except for one of the ranges, which shows a maximum near 100F (38C). As an aside, the monitor itself has 2 distinct temperature sensors; a further complication is that those readings are taken very close to the Pi chipset, meaning they run higher than room temperature as a background. I’ve adjusted for that, but it is separate from the post-measurement analysis. Let’s stipulate the values are reasonable, other than the ones in the spike.
How do we account for the spikes? Somehow, this inexpensive part does not supply valid output 100% of the time, to say nothing of power outages, reboots, or other interference. At first I didn’t see any spikes, so I had not considered what effect they would have on the data presentation and usefulness. I did get a taste when working up the offset calculations, but start-up blips are easy enough to ignore.
In this example, more than one “wrong value” is displayed; without knowing the sampling frequency directly, the ramp-up suggests 1 or 2 bad readings per minute. But in a business or science application where the underlying values are expected to be correct, what does the keen observer do? In business, I suppose, you call for an audit; in science you do a data quality review, determining whether the outliers are due to incorrect recording (noise on the line, say) or to a previously unconsidered cause. The history of science is littered with notes about mistakes people made in making data fit prior assumptions. But here, given the nature of the measurements, and the actuality of having a backup value recorded nearly simultaneously, let’s assume it is wise to “throw out” these data.
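The “throw out” decision can be sketched in a few lines. This is a minimal sanity check, assuming readings arrive as (clock, value) pairs; the 0–74F window and the sample values are illustrative, not real data from my sensors.

```python
# Minimal plausible-range check; thresholds and sample values are illustrative.
def flag_outliers(readings, low=0.0, high=74.0):
    """Partition (clock, value) samples into keepers and suspects."""
    keep, suspect = [], []
    for clock, value in readings:
        (keep if low <= value <= high else suspect).append((clock, value))
    return keep, suspect

samples = [(1645947751, 68.2), (1645947811, 99.9), (1645947871, 68.4)]
keep, suspect = flag_outliers(samples)
```

The suspects then get the human review described above before anything is purged.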
Great, how does that happen? Again, in my home office I can just purge or refresh the base data tables any way I like. But in a business setting you’d need to know the rules and privileges necessary to cleanse data. In some roles, I was the sole data proprietor and in other roles, needed to file change notices or other documentation for data cleanup [data volume management/archiving fits here a bit but I’m focusing more on the recording validation algorithm/thought process].
With a few searches about Zabbix data housekeeping, and some database sleuthing, I came up with the specific places these values are stored. At least the initial capture; more on trends later. Besides being open source and having a very user friendly interface, the Zabbix tools have a pretty clear entity model, meaning it only took a few minutes to zero in on the aberrations. Other tools I’ve used include HP OpenView, BMC Patrol/Enterprise Manager, some CA products, and, alas, SolarWinds, all of which (in general, in my opinion) would have been more opaque in this endeavor.
Example code snippet 1
SELECT itemid, clock, value
  FROM history
 WHERE itemid = ...  -- itemid elided
   AND ( value > 74 OR value <= 0 )
 ORDER BY clock;
I left out irrelevant details isolating which data set contains the bad data. The range of values I’m expecting is between 0F and 74F in this example, and that is where the thinking needs to happen. As it is winter when I write this, 74F/23C is abnormal, in summer it would not be. If these data were more critical, I’d have to spend more time correlating the flaky sensor values with the less flaky one.
So far, I have not automated the data cleanup, as it is quick enough to find and purge the bad rows by hand, compared to setting up a scheduled process. With Zabbix, it might be possible to include this in the base configuration; for me, though, writing a database procedure would not be overly challenging, again giving credit to the Zabbix data model(s).
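If I did automate it, the scheduled job would boil down to something like the sketch below. Here sqlite3 stands in for Zabbix’s real PostgreSQL/MySQL backend, the schema is simplified to the three columns I queried, and itemid 44351 is invented for illustration.

```python
import sqlite3

# Stand-in for a scheduled cleanup against a Zabbix-style history table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE history (itemid INTEGER, clock INTEGER, value REAL)")
conn.executemany("INSERT INTO history VALUES (?, ?, ?)", [
    (44351, 1645947691, 101.3),   # spike
    (44351, 1645947751, 68.2),    # plausible reading
    (44351, 1645947811, 99.9),    # spike
])
# Same predicate as the manual query: purge values outside the 0-74F window.
deleted = conn.execute(
    "DELETE FROM history WHERE itemid = ? AND (value > 74 OR value <= 0)",
    (44351,)).rowcount
conn.commit()
```

In production you would wrap this in a transaction, log what was deleted, and keep the thresholds somewhere reviewable rather than hard-coded.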
DELETE FROM history
 WHERE itemid = ...  -- itemid elided
   AND clock IN ( 1645947811, 1645947931, 1645947691 );
For a wider array of measurements, for example background chemical levels or biological traces, recording triggers such as these allows later audits to be performed and quality adjustments to be made, ensuring confidence in the conclusions drawn or actions taken as a result. Referring back to water quality sampling data from an earlier role of mine, the metadata about sampling frequency, procedures, data collection chain of custody, and other useful comments could/should be tied to the detailed data.
(Result columns of the verification query: itemid | clock | to_timestamp(CAST(clock AS bigint)) | value)
After the rows shown above were purged from the history table, the refreshed Zabbix interface showed the new data range without those.
Now that looks more reasonable! When we zoom out to a longer time period, the ambient temperature patterns are what I’d expect.
The dip near the middle of the plot, from 65F (18C) down to 50F (10C), is the result of an open door allowing colder outdoor air to mix with indoor heated air; later dips are less drastic because of shorter heat leaks/losses. If the latter chart is what you present to management or decision makers, you are much better off than if you had overlooked the earlier data corruption without explanation or comment. The dip also appears larger than in the earlier chart, now that the scale is reduced to the usual “room temperature” range.
An even further look back in time showing diurnal cycles.
The lighter background marks the default “working hours”, not quite the same as daylight, of which more in a future case.
Is our case air-tight (pun intended)? Not quite. Here are the “before and after” temperature ranges, for slightly different time periods.
The first row from the SHTC3 sensor has a maximum of 89.7F (32C), which drops to 70.9F (21.6C) after the data purge. The last row shows another 89.7F maximum, presumably due to other, uncorrected bad values, and a minimum of a Siberian negative 45 F (-42C, close to the -40 crossover between C and F).
The dilemma now apparent is that the auto-scaled temperature range on the Y-axis, and the data plots themselves, do not show any values above 80F or below 50F. What happened? If you had presented this chart without noticing the discrepancy, someone else might have, with possibly negative outcomes. The first approach is to add a note, somehow, explaining that the data displayed in the graph has a different source than the data in the range box. You could speculate that the root cause is one presentation using history data directly, while the other pulls from trending or derived data.
Why are the trends incorrect? We’d need to look into how the internal data housekeeping operates, and see if a recalculation is even feasible. Or perhaps reset the history retention periods such that the derived data gets recalculated to better match the raw. Myself, I am unconcerned, now that I understand where the bad gets in, how to address it in the main, and what remnants of an imperfect world remain in place.
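For anyone who does care, a recalculation is conceptually simple: regroup the cleansed history rows into hourly aggregates. A sketch, again with sqlite3 standing in for the real backend; the aggregate names mimic the Zabbix trends columns (num, value_min, value_avg, value_max), but verify your version’s actual schema before touching real data, and itemid 44351 is invented.

```python
import sqlite3

# Rebuild an hourly trend-style aggregate from cleaned history rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE history (itemid INTEGER, clock INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO history VALUES (?, ?, ?)",
    [(44351, 1645945200 + i * 60, 68.0 + i * 0.1) for i in range(5)])
itemid, hour, num, vmin, vavg, vmax = conn.execute("""
    SELECT itemid, (clock / 3600) * 3600, COUNT(*),
           MIN(value), AVG(value), MAX(value)
      FROM history
     GROUP BY itemid, clock / 3600
""").fetchone()
```

Bucketing the epoch clock with integer division by 3600 gives the top-of-hour timestamp each aggregate row belongs to.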
The Moral Of The Story
In a rendition of one of my favorite songs from the late great John Prine, he says this before the ending verse: “a good story should have a darn good ending, and if you don’t have a good ending, at least have a moral to the story. Here’s the moral.”
- Understand your data collection algorithms, transmission techniques, and storage mechanisms.
- Know what makes up any data visualizations; if they are invalid find out why and get that fixed.
- You might change what is recorded in history, but truth will prevail.
I've been doing similar things, monitoring temperatures around my house with the intention of trying to analyse that data and so hopefully optimise my heating system at least a little. For the most part I'm using different technologies - still Raspberry Pi based, but using Home Assistant as the data collection platform and more traditional smart home devices and sensors (smart radiator valves and Zigbee temperature sensors). I have hit very similar data issues, and spikes in temperature readings play havoc with trend analysis. In my case, much of my data analysis is done outside of Home Assistant, using Python/pandas code I've written myself, and so I've been able to QA the data and ignore obviously wrong data points in the analysis code. But you have to be careful even there - just ignoring data points affects the analysis.
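The "careful even there" point can be made concrete with a toy example (all temperatures invented): dropping a spike shrinks the sample, while interpolating across it keeps the sample size, and the two give slightly different summary statistics.

```python
# Toy series with one obvious spike; values are invented for illustration.
temps = [68.0, 68.2, 99.9, 68.4, 68.1]

mean_raw = sum(temps) / len(temps)        # the spike drags the mean above 74F
dropped = [t for t in temps if t < 74]    # ignoring the point shrinks the sample
mean_dropped = sum(dropped) / len(dropped)
patched = temps[:]                        # interpolating keeps the sample size
patched[2] = (temps[1] + temps[3]) / 2
mean_patched = sum(patched) / len(patched)
```

Either correction beats leaving the spike in, but which one is right depends on whether downstream analysis assumes evenly spaced samples.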
I do find it interesting that even small home projects encounter similar issues to large enterprise projects, and need similar skills to find working solutions. It has been fun working on my "little" home project and I know I've learned things that will be useful in larger "work" projects I get involved in. Definitely educational as well as fun.
By the way, when I say "little" - the database holding the history of my smart home devices currently has over 17 million rows, and has been running for only 5 months so far. I don't know if this counts as "big data" but it certainly isn't "small data" any more...
As an aside, the Raspberry Pi is an amazing platform for this stuff. I have so much running on a single Pi 4 that currently costs £74 (the 8G RAM version), and it still rarely breaks a sweat!
Thank you for the reply. We've touched base on the Pi topics before and seem to be running on parallel but not identical paths. I don't have a smart thermostat beyond programmed start and stop settings, and am undoubtedly not benefitting from more controls.
That point (pun intended) is exactly my view above. If you did science like Trofim Lysenko did you could do humanity great harm, on the big data scale. It's important to know the small details before extrapolating to big trends. Let's say your sensors measured dissolved oxygen in a water body. If the numbers are close to zero, or even zero, that's a bad sign for the environment (particularly if you're a fish), but if the numbers come back below zero, something is amiss.
While I'm still on the Perl train myself, I can say the tendency of Python coders to include execution time in algorithm discussions is helpful in building applications that scale. As a minor example, I searched for examples to descramble words (for the old newspaper word game, not that new one). I found some that don't work anymore (Python 2 syntax) and two that work, one more efficient than the other. If you need to clean millions or billions (or more) of data points, efficiency is crucial.
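For a flavor of the efficiency gap I mean, here is a sketch (the five-word dictionary is a stand-in): the brute-force version tries every permutation of the scrambled letters, which grows factorially with word length, while indexing the word list by its sorted letters turns each lookup into a single dictionary hit.

```python
from collections import defaultdict
from itertools import permutations

WORDS = {"stop", "post", "spot", "tops", "opts"}  # stand-in word list

def descramble_slow(scrambled):
    """O(n!): generate every arrangement and intersect with the word list."""
    return {"".join(p) for p in permutations(scrambled)} & WORDS

# One pass to index words by their sorted letters, then near-O(1) lookups.
INDEX = defaultdict(set)
for word in WORDS:
    INDEX["".join(sorted(word))].add(word)

def descramble_fast(scrambled):
    return INDEX["".join(sorted(scrambled))]
```

Both return the same answers; only the fast one stays usable as the letter count and the dictionary grow.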
Likewise, my internal monitoring scheme has generated nearly 4 million rows in just around a month, scanning under a dozen platforms. I appreciate one of the key metrics that Zabbix includes ("Required server performance, new values per second") to show at a glance the complexity of the data mesh. Understanding the impact of retention time on responsiveness is important.
With inexpensive data collectors such as the Pi Zero shown above (at ~1/10 the cost of the Pi 4), data volumes continue to increase with greater need to corral the results. I could see a few hundred sensors being used at once for water quality metrics such as temperature, oxygen, turbidity and various pollutants, sweeping through water bodies very cheaply.
In the 1970s, I worked with organizations that used "STORET" as a water quality data repository; that design lasted until just a couple years ago when a replacement was deployed.
Now, it's "R" and friends on the analytical side [ http://usgs-r.github.io/EGRET/ ]. Pretty cool page there with graphs of data from a river I've been on near here. "A companion package EGRETci implements a set of approaches to the analysis of uncertainty associated with WRTDS trend analysis" references the uncertainty resulting from data quality issues.