The ABAP Detective Resolves a Data Retrace

JimSpath · ‎10-22-2022

This case is about data more than code. In the cold case files are stacks of evidence, observations for possible future use. Could be environmental or climate data, could be traffic or road conditions, could be inventory information. Digging through archives can be rewarding as long as there is some type of index or metadata, and as long as current technology can still read the previously collected digital bits. Think floppy drives, magnetic tape, or even optical storage. The content on those media is useless if you can't read it.

What happens when the "cloud" storage repository goes away? That is the crux of this current case: geotagged data that was uploaded to a site that <big cloud provider> decided to shut down. When I learned hundreds of images I contributed were to be inaccessible, I made a withdrawal, intending to move the content to a new repository.

Here's where the metadata, the data dictionary, and sizing requirements come in. In my case, I recorded many ground-level images, doing field survey/service work. Over 10 years ago, the ability to capture GPS data was more limited (and expensive) than it is now, so the primary data had no location information. On upload, locations were manually tagged on a map. Doing so added metadata. However, because of site rules, the images were degraded (reduced in resolution). That means my original data was better in some ways, and worse in others.

Why crowd-source data collection? The easiest example is traffic; hundreds (or billions) of GPS-enabled location records allow sites to produce real-time condition maps, offer alternate routes, and capture evolving trends. Even if you don't realize you're contributing, you are, for which we thank you. For voluntary projects like biodiversity analyses, having people collect data on forms can be burdensome, so the easier the set-up is, the more data gets gathered. Sites can offer incentives, whether basic recognition ("Four Star Reviewer!") or some swag ("Oooh, T shirts!"), or future dividends such as data retrieval. See below under "also".

Field Survey

Frog Hollow and the Yellow Trail

Here is one example of a field data capture. On the left side of the inset image is a walking trail; on the right side is a stream. A primary use of these data is quality control on path suggestions, such as flagging a stream crossing made hazardous by increased flow exacerbated by upstream "development".

The saved online file is missing metadata, and was shrunk from the original capture. The "jhead" utility program tells us:

File name    : panoramio-24364336.jpg
File size    : 139933 bytes
File date    : 2022:10:07 15:32:29
Resolution   : 800 x 600
JPEG Quality : 75

Not to mention, the date shows when it was downloaded, not when it was captured or uploaded. The "takeout" process, where you can preserve your original contributions, shows metadata in a separate related file:

 139823 Nov  8  2018 2018-09-05/panoramio-24364336.jpg
    874 Nov  8  2018 2018-09-05/panoramio-24364336.jpg.json

The image size is the reduced quality (drat), but there are lat/lon points, and the original capture date:

 "formatted": "Jul 11, 2009, 6:49:30 PM UTC"

Some data reconstitution could be done from these parts, but the higher-resolution imagery could not be recovered from them. Long digging steps skipped here for brevity.
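For anyone wanting to put that capture date to work, here is a minimal sketch that turns the "formatted" string shown above into a timezone-aware timestamp. The function name parse_takeout_time and the format pattern are my own illustration, inferred only from the one sample value; other takeout files may use a different layout, and the parsing assumes an English locale.

```python
from datetime import datetime, timezone

# Pattern inferred from the single sample value in the sidecar file.
# Assumes an English locale so "Jul" (%b) and "PM" (%p) are recognized.
TAKEOUT_FORMAT = "%b %d, %Y, %I:%M:%S %p UTC"


def parse_takeout_time(formatted):
    """Return a timezone-aware UTC datetime from a takeout "formatted" string."""
    naive = datetime.strptime(formatted, TAKEOUT_FORMAT)
    # The string says UTC explicitly, so attach that zone rather than guessing.
    return naive.replace(tzinfo=timezone.utc)


captured = parse_takeout_time("Jul 11, 2009, 6:49:30 PM UTC")
print(captured.isoformat())
```

Having the value as a real timestamp makes it possible to sort and compare against camera file dates, bearing in mind the camera clock may not have been set correctly.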

Ah, one of these may be the original:

 1302599 Jul 11  2009 DSCN7671.JPG
 1308378 Jul 11  2009 DSCN7672.JPG
 1297094 Jul 11  2009 DSCN7673.JPG

File name    : DSCN7671.JPG
File size    : 1302599 bytes
File date    : 2009:07:11 13:22:50
Date/Time    : 2009:07:11 09:22:51
Resolution   : 2560 x 1920
...
JPEG Quality : 84

JPEG quality is higher and image size larger; the only problems are which image was preserved online, and what to do about the remainder. Turned out to be #7675. My search used timestamps, though these were not entirely helpful for 2 reasons: one being the camera not having accurate time and date (less of an issue if the sequence stays in order), the other being that the file transfer date was recorded rather than the capture moment. The latter causes headaches, resulting in fewer good archive retrievals.

A side-by-side resolution comparison indicates the value of recovering the clearer images. If not apparent in the default post rendering, click/zoom to verify.

High resolution

Low resolution

Another issue is the quantity of original data points compared to those sent to the cloud repository. Perhaps some sites take (slurp) all your data, but curated sites might either reject some portion, or create barriers like transfer speed or content size limits (e.g. "< 10 megapixels"). As these geotags were done manually, the leftover images had no recorded locations other than my sparse field notes (or blog posts done at the time); recreating that level of detail after an appreciable time was tough. Some records were captured at obvious places with visible landmarks, some not so much. Think of field data capture such as valve or joint condition by physical inspection, maybe bridge piers or safety rails. More frequent image capture, even by non-professional structural engineers, can assist in safety reviews.

From a set of 5 dates, my archives had roughly 500 still images (and a few movies) from 2009 to 2010. Recently, with a 128GB data card, I took 700 images in an hour or two, just to show the potential explosive growth of baseline data. Fewer than 100 images had been uploaded to the Panoramio site, and only a portion of those, like the Frog Hollow one shown above, remain visible.

"Not selected for Google Earth or Google Maps after a second review"

Goal

I had a goal to at least re-post the 100-ish data records from the closed site to a different host. I'll skip the details of picking a target, as there aren't too many I've found that I could feel confident were going to exist for the foreseeable future. If I could share over half of the 500 historic images I'd be a happy camper, as we say.

I created spreadsheets to review the data inventories (though for a larger scale I'd go with a database design), which I filled in using output from the tools jhead and exiftool. Geotagged images were required for successful upload; if I missed a tag the placemark would show up, literally, "off the reservation", necessitating a do-over, or perhaps record removal. The QGIS software came in handy for local checking of locations prior to site submission.
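As a rough idea of how tool output could feed such an inventory, the sketch below parses jhead-style "Label : value" text into a dictionary and writes one spreadsheet row as CSV. The helper names parse_jhead and to_csv are hypothetical, the sample text is the DSCN7671.JPG listing from earlier in this post, and the "GPS Latitude" column label is an assumption about what a tagged file would report; it is not how I actually built my sheets.

```python
import csv
import io

# Sample text copied from the jhead listing for DSCN7671.JPG shown earlier.
SAMPLE = """\
File name    : DSCN7671.JPG
File size    : 1302599 bytes
File date    : 2009:07:11 13:22:50
Date/Time    : 2009:07:11 09:22:51
Resolution   : 2560 x 1920
JPEG Quality : 84
"""


def parse_jhead(text):
    """Turn jhead-style "Label : value" lines into a dict of strings."""
    record = {}
    for line in text.splitlines():
        # Split on the first " : " only, so colons inside dates survive.
        label, sep, value = line.partition(" : ")
        if sep:
            record[label.strip()] = value.strip()
    return record


def to_csv(records, columns):
    """Render records as CSV text; columns missing from a record stay blank."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


record = parse_jhead(SAMPLE)
# "GPS Latitude" is an assumed label; it is absent here, so the cell is empty,
# which is exactly the kind of gap the inventory is meant to expose.
print(to_csv([record], ["File name", "Date/Time", "Resolution", "GPS Latitude"]))
```

A blank location cell is a quick visual cue for which images still need a tag before upload.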

That's where quality expectations come in. I have standards, and try to share content that's correct to the best of my ability. I know the accuracy of a basic consumer GPS circuit in a camera or phone is a few feet at least (sorry, a meter at least), and terrain conditions such as tree cover worsen that to tens of meters or no data at all.

Did I meet the goal? Well, from under 100 original records to almost 200, not as many as I wanted but a respectable repair record. And the newly shared images are typically at least 10X larger.

Interpolation

In some cases, one data point without metadata was found between two data points having location metadata. If the time interval was short enough, I felt confident in finding the midpoint between them and assigning it manually to the third. Crowd-sourced Python to the rescue.

#!/usr/bin/python
# https://code.activestate.com/recipes/577713-midpoint-of-two-gps-points/
import math

def midpoint(x1, y1, x2, y2):
    # Input values as degrees.
    # Note the argument order used here: lat1, lat2, lon1, lon2.
    # Convert to radians
    lat1 = math.radians(x1)
    lon1 = math.radians(x2)
    lat2 = math.radians(y1)
    lon2 = math.radians(y2)
    # etc.

print(midpoint(39.690999, 39.690402, -76.269761, -76.270604))

Many examples of this "great circle" math can be found online; after testing the above simple approach I added command line parameters to get the results. An expansion would be adding this function into the spreadsheet/database inventory with Excel or SQL.
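For readers who want something to run as-is, here is a self-contained sketch of the standard great-circle midpoint formula. It is not a copy of the linked recipe: the argument order is my own choice of (lat1, lon1, lat2, lon2), which differs from the excerpt above where the two latitudes come first, and the rounding is only for display.

```python
import math


def midpoint(lat1, lon1, lat2, lon2):
    """Great-circle midpoint of two points given in decimal degrees.

    Argument order is (lat1, lon1, lat2, lon2), an assumption of this
    sketch rather than the order used in the excerpt above.
    """
    phi1, lam1 = math.radians(lat1), math.radians(lon1)
    phi2, lam2 = math.radians(lat2), math.radians(lon2)
    # Components of the second point relative to the first point's meridian.
    bx = math.cos(phi2) * math.cos(lam2 - lam1)
    by = math.cos(phi2) * math.sin(lam2 - lam1)
    phi3 = math.atan2(math.sin(phi1) + math.sin(phi2),
                      math.hypot(math.cos(phi1) + bx, by))
    lam3 = lam1 + math.atan2(by, math.cos(phi1) + bx)
    return math.degrees(phi3), math.degrees(lam3)


# Same two points as the call in the excerpt, regrouped as lat/lon pairs.
lat, lon = midpoint(39.690999, -76.269761, 39.690402, -76.270604)
print(round(lat, 6), round(lon, 6))
```

For points this close together the result is practically the plain average of the coordinates; the spherical math only starts to matter over much longer distances.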

Benchmarks

Data markers are key to many data-based initiatives. Running a GC/Mass-Spec analysis with known published standard reference sources can calibrate an instrument so the results can be used in medical or court settings or whatever. In location work, benchmarks are landmarks perhaps installed by government agencies, perhaps private entities. One reference called this "ground truth."

Of course, this topic deviates from software and hardware benchmarks (less of an impact now than when I penned the SD speed post "fast is not a number").

Benchmark

How close?

Benchmark on trail

The above image shows a map in the lower corner. Even with several readings the discrepancy between the found location and the actual site is noticeable. The "jitter" for location spots that should have been on the dotted path looks like I was in the water more than once (I wasn't). It's too bad this particular rendering excludes a scale. I guess that's where the API comes in?
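Lacking a scale on the map, one way to put a number on "how close" is the haversine distance between a recorded reading and a reference point. The sketch below is a generic version assuming a spherical Earth with a mean radius of 6,371 km; the function name haversine_m is my own, and the sample coordinates are simply the two points borrowed from the interpolation example, not actual benchmark readings.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in meters (spherical assumption)


def haversine_m(lat1, lon1, lat2, lon2):
    """Approximate distance in meters between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Points reused from the interpolation example, for illustration only.
gap = haversine_m(39.690999, -76.269761, 39.690402, -76.270604)
print(f"{gap:.0f} m apart")
```

Comparing that figure to the tens of meters of error expected under tree cover gives a quick sense of whether a reading is plausible jitter or a bad tag.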

What happens now?

3D? In the new home, my images are shared with a creative-commons license, and the sequences seem to be oriented toward building virtual imagery. I noted bad spotting on some data (invalid metadata) and odd side effects with other content, such as reporting distance traveled between 2 or more seemingly disconnected sites. There are ways to upload dashcam (or GoPro-type) camera recordings. Have to check the detective's benevolent society funding levels.

Maybe a future round of field observations will be via a fleet of drones that go along the trail markers at chest height, recording higher and higher resolution imagery. These can be morphed into the historic records, with the right algorithms.

One reason to look back at old data is to find gaps where nothing was collected, or perhaps records were destroyed or degraded. I set my next sights on locating historic benchmarks, which I've found via NOAA (US National Oceanic and Atmospheric Administration):

Not found on last search: JV4815

Last found in 1998: JV6715

I have definitely located additional physical benchmarks that don't appear in online searches, opening a new line of detective search: are these points discredited due to being on private property, or were they never memorialized, as the phrase goes?

Also:

Revealing a crowd-data-sourcing idea gone sour, a thread in the geocaching space talks about community rewards given even if the field data report said "object not found".