Technology Blogs by SAP
Learn how to extend and personalize SAP applications. Follow the SAP technology blog for insights into SAP BTP, ABAP, SAP Analytics Cloud, SAP HANA, and more.
This is the second article in a series of posts on Worker and People Behavioral Analytics and the "Connected Workforce". The series started with an introductory post (link to the post is here), which covered the following fundamental questions:

Q1) WHERE IS THE “CONNECTED WORKFORCE” APPLICABLE?

Q2) WHAT MAKES THE “CONNECTED WORKFORCE” VALUABLE?

Q3) WHAT MAKES THE “CONNECTED WORKFORCE” CHALLENGING?

Q4) BUT THEN, WHAT MAKES THE “CONNECTED WORKFORCE” POSSIBLE?

Q5) WHAT WILL WE SEE IN THE NEXT POSTS OF THE SERIES?

As promised in the introductory article, a specific case study will be presented. In this second article, the datasets will be presented first, followed by a description of the algorithms (starting with the overstay detection algorithm). The third article of the series has been posted too; it addresses the question: “Can data modeling be enhanced by incorporating business knowledge?” (link to the post is here). So, let us start this second article, which will again be organized into question-driven sections:

 

Q6) WHAT IS THE OVERALL STRUCTURE OF OUR REAL-WORLD DATASET?

Following our first meetings with the client, we were given a pilot dataset containing a subset of their collected data. This sample originally covered roughly one month of a specific year, and was later extended to two months. Furthermore, it contained data from 200 specific workers, a subset of the thousands of workers at the industrial plant we were addressing. The basic mandate agreed upon was to provide analytics, concentrating on shift duration as well as overstay detection (i.e. whether workers were staying beyond the maximum allowed 13 hours per shift; especially as there were reports that some workers were also using the plant's area as a temporary residence, which according to our client was both illegal and dangerous).

So, how did we start looking into our datasets?

First, in terms of data schemata: two large tables were given to us as .csv (comma-separated values) spreadsheets: the gate entry/exit log (X1) and the tracker log (X2). In our case, the tracker log consisted not of computer-vision-based tracker data, but of registrations of RFID tags worn by workers whenever they were within range of a number of stationary beacons.

 

The Entry/Exit Log (X1) had the following relevant columns (with roughly 20K rows for 30 days):

Person ID (200 workers, each with a unique ID)

Event Timestamp (Range: 10-September of a given year after midnight, to 11-October evening)

Event Type (One of four types: Gate In / Gate Out / Biometric In / Biometric Out)

Gate ID (329 Unique Values; this is a very large plant with kilometers radius, indoor & outdoor areas)

 

The Tracker Log (X2), with roughly 1 million rows for one month, contained columns such as:

Person ID (200 unique)

Event Timestamp

Beacon ID (more than 1K different unique beacons, with typical 10m radius coverage)

Plant ID (about 100 different sub-plants, such as "Pellet Plant", "Procurement" etc.)

Longitude

Latitude
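To make the two schemata concrete, here is a minimal sketch of loading such logs with pandas. The column names below are hypothetical stand-ins (the real CSV headers were client-specific), and the inline sample merely mimics the structure of the gate log (X1):

```python
import io
import pandas as pd

# Hypothetical column names -- the real CSV headers were client-specific
X1_COLS = ["person_id", "event_ts", "event_type", "gate_id"]
X2_COLS = ["person_id", "event_ts", "beacon_id", "plant_id", "longitude", "latitude"]

def load_log(csv_source, columns):
    """Load one of the two logs, parsing timestamps and sorting per worker."""
    df = pd.read_csv(csv_source, names=columns, header=0, parse_dates=["event_ts"])
    return df.sort_values(["person_id", "event_ts"]).reset_index(drop=True)

# Tiny inline sample standing in for the real gate entry/exit log (X1)
sample = io.StringIO(
    "person_id,event_ts,event_type,gate_id\n"
    "W001,2021-09-10 06:55:00,Gate In,G12\n"
    "W001,2021-09-10 18:40:00,Gate Out,G12\n"
)
x1 = load_log(sample, X1_COLS)
print(x1.dtypes)
```

Sorting per worker and timestamp up front is what makes the later sequence segmentation straightforward.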

 

Through a first examination of the tracker log, two important initial observations were made:

O1) Beacon IDs have the finest granularity, but each beacon has a fixed correspondence to a specific Plant ID, which thus provides coarser granularity, if desired. Of course, one can try to group various Beacon IDs into categories finer-grained than Plant IDs, or choose any other partitioning, hierarchically "fitting" the Plant IDs (or not), as desired.

O2) Unfortunately, in our dataset, the (latitude, longitude) values were not determined precisely for every Beacon ID; rather, only 30 specific (latitude, longitude) pairs existed, and one of these was assigned to each beacon, i.e. multiple beacons shared the same (latitude, longitude) pair - which of course was not their true location. Thus, two-dimensional localization of the thousands of beacons with adequate precision for anything practical (absolute or even relative localization) became nearly impossible; one had to rely primarily on Beacon IDs and their sequences.

But what other, more detailed observations were made before we could start devising the shift duration and overstay detection algorithms? As you shall see, and as mentioned in question Q3 of the introductory post of this blog series (Q3: WHAT MAKES THE “CONNECTED WORKFORCE” CHALLENGING?), the main problem was that although we had very few errors in our dataset, we had very large amounts of missing data: missing entries and exits from the plant as a whole (in log X1), and large periods of time as well as parts of space without any tracker data. Not only were the beacons sparsely located (O3) and not 2-D localizable due to observation O2 above, but the time intervals in which they produced tracking information were also very irregular, often with whole hours without any tracking (O4). Still, as we shall show, there were ways to get adequate estimates!

But before introducing our algorithms, let us first motivate them through further investigation of the dataset, using graphical and statistical methods, among others:

 

Q7) HOW TO EXAMINE THE ENTRY/EXIT LOG (X1) IN FURTHER DETAIL?

First, we briefly looked at data quality, starting with missing values and outliers. The dataset covered a period of roughly one month, from a day in September 2021 to the corresponding day in October 2021. Fortunately, out of roughly 20K rows, only two clearly erroneous timestamps were found: 25-Jan-2080 06:32:37 and 26-Jan-2080 01:53:10. These were easily identified through max/min and inequality testing in the descriptive statistics, and were removed with little damage to the dataset, given that 2 out of 20000 is just 0.01% data loss. Of course, some of the other columns might have contained erroneous values; but fortunately, all values were confined to the "allowable" categories and ranges: for example, the "event type" column contained only one of the four allowable types (Gate In, Gate Out, Biometric In, Biometric Out). Regarding missing values, no "cells" of the timestamps or other important columns were left empty; everything was filled in with one of the allowable values.
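This kind of outlier removal can be sketched in a few lines: flag any timestamp outside the covered period via a simple min/max inequality test. The exact period bounds below are illustrative:

```python
import pandas as pd

# The covered period ran roughly 10-Sep to 11-Oct of the given year;
# the exact bounds here are illustrative
PERIOD_START = pd.Timestamp("2021-09-10")
PERIOD_END = pd.Timestamp("2021-10-12")

events = pd.DataFrame({
    "person_id": ["W001", "W002", "W003"],
    "event_ts": pd.to_datetime([
        "2021-09-15 06:32:00",
        "2080-01-25 06:32:37",   # clearly erroneous: decades in the future
        "2021-10-01 18:10:00",
    ]),
})

# Anything outside the covered period is flagged as an outlier and dropped
mask = events["event_ts"].between(PERIOD_START, PERIOD_END)
outliers = events[~mask]
clean = events[mask]
print(f"removed {len(outliers)} of {len(events)} rows")
```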

However, this is no guarantee that, in a wider sense, no data is missing; it might well be the case that whole entry or exit events were not registered at all, either because the respective sensing devices (gate and/or biometric loggers) were not working, or because, for any other reason, these events did not end up in the dataset. So, effectively, we could have whole rows missing (corresponding to entry or exit events), not just single empty cells. Upon closer inspection, this was often exactly the case; for example, we had several nice "complete" entry-exit pairs, such as:


EXAMPLE 1: Simple Complete Entry + Exit Pair


But we also had "incomplete" pairs; for example, a morning entry which is not followed by any exit on the same day, but rather by one more entry the next morning, as shown below:


EXAMPLE 2: Incomplete case: Entry in the morning through gate (06:55) and also biometric ID (06:58), but then NO exit in that day ever logged; just entry again the next morning (07 Oct, 06:53)


And there are more complex cases. For example, there are cases of short exits for doing an errand outside the plant during working hours:


EXAMPLE 3: Short-duration leave (exit-entry) to plant during a normal day shift


And many more variations: night shifts, short visits, and so on. And in all cases, we might have Gate events alone, Gate events followed/preceded by Biometric log events, or Biometric log events alone.


EXAMPLE 4: An example of a night shift


But how frequent are these special cases? And what do they mean for us, in terms of achieving our first goals? (shift analytics and overstay detection)

Let's start with the first question. A simple event count for the first month, classified across event types (Gate vs. Biometric, In vs. Out), clearly indicates the asymmetries, but at this level of analysis they seem slight: cumulative In vs. Out counts deviate by only roughly 1% (9546 vs. 9452). But a deeper analysis shows a much more problematic situation, as we shall see!


Simple count of entry/exit event types: Note the slight asymmetries


But then, how often do cases like the above (such as examples 2 and 4) really take place? And how many of them are the "problematic" ones (such as the "incomplete" example 2, where there is no exit reported, only a re-entry after too long a period for it to be the same shift)?

If we start "segmenting" our sequences using every "Gate In" as a prefix marker, this is what we get:

 

Subsequence Distribution:

51% Class 1 = Complete = {Gate In, Biometric In, Biometric Out, Gate Out}

11% Class 2 = CompleteButNoBiometric = {Gate In, Gate Out}

5% Class 3 = CompleteButNoBiometricIn = {Gate In, Biometric Out, Gate Out}

10% Class 4 = CompleteButNoBiometricOut = {Gate In, Biometric In, Gate Out}

3% Class 5 = Incomplete = {Gate In, (n1 times Biometric In)}, where n1 >= 0

20% Other
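The segmentation and classification above can be sketched as follows. This is a simplified reconstruction (the real implementation is not shown in the post): split each worker's event-type sequence at every "Gate In", then match each subsequence against the class patterns:

```python
# Segment a worker's event-type sequence after every "Gate In" prefix marker
def segment_after_gate_in(event_types):
    """Split a list of event types into subsequences, each starting at a Gate In."""
    segments, current = [], []
    for ev in event_types:
        if ev == "Gate In" and current:
            segments.append(current)
            current = []
        current.append(ev)
    if current:
        segments.append(current)
    return segments

# Classify a subsequence; class names follow the distribution above
def classify(segment):
    s = tuple(segment)
    if s == ("Gate In", "Biometric In", "Biometric Out", "Gate Out"):
        return "Complete"
    if s == ("Gate In", "Gate Out"):
        return "CompleteButNoBiometric"
    if s == ("Gate In", "Biometric Out", "Gate Out"):
        return "CompleteButNoBiometricIn"
    if s == ("Gate In", "Biometric In", "Gate Out"):
        return "CompleteButNoBiometricOut"
    if s[0] == "Gate In" and all(ev == "Biometric In" for ev in s[1:]):
        return "Incomplete"  # n1 >= 0 Biometric Ins, no exit ever logged
    return "Other"

seq = ["Gate In", "Biometric In", "Biometric Out", "Gate Out",
       "Gate In", "Biometric In",          # incomplete: no exit that day
       "Gate In", "Gate Out"]
labels = [classify(seg) for seg in segment_after_gate_in(seq)]
print(labels)  # ['Complete', 'Incomplete', 'CompleteButNoBiometric']
```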

 

Thus, at this level of analysis, it is clear that only roughly three quarters (77% = 51+11+5+10) of the cases seem to be complete, with at least 3% of the cases clearly incomplete (and thus problematic), and the remaining 20% having more complex patterns which might be problematic too. But instead of examining this in more detail, two observations can now clearly be made:

  • Even if a case is complete (belongs to Classes 1-4, i.e. starts with Gate In and ends with Gate Out, without a further Gate In in between the two), it might well be the case that there is apparent overstay - it all depends on the total apparent duration between the In and the Out, and it might be that a whole Out-In pair between the remaining In and Out was never registered (O5)

  • Our choice of using Gate In as a sequence segmentation prefix marker was arbitrary; we could just as well have chosen Biometric In, and indeed there exist cases where a shift is marked only by biometric events and not gate events, such as the one below (O6); in fact, specific people never use gate events, only biometric ones:



EXAMPLE 5: A worker who enters/exits primarily via biometric events, and only very rarely via gate events


 

Thus, through observation O6 above, the need arose to temporarily collapse the four event types into two new composite event types:

IN = (Gate In or Biometric In)

OUT = (Gate Out or Biometric Out)

Through this simple change, things started making much more sense. The final conceptual step was the introduction and precise definition of a new term, which we call the "Episode":

 

DEFINITION OF "EPISODE":

An EPISODE is a subsequence starting with an IN event (firstIN) and ending with an OUT event (lastOUT). If further IN events occur after the first registered IN (firstIN), they are ignored, as long as no OUT has yet occurred in the sequence. If multiple OUT events exist after the first OUT, we keep only the last OUT event (lastOUT).
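A minimal sketch of this definition, assuming the events have already been collapsed into IN/OUT and sorted chronologically per worker (the function name and data shape are ours, for illustration):

```python
def extract_episodes(events):
    """events: chronologically sorted (kind, timestamp) pairs, kind in {'IN', 'OUT'}.
    Returns a list of (firstIN_ts, lastOUT_ts) episodes, per the EPISODE definition."""
    episodes = []
    first_in = last_out = None
    for kind, ts in events:
        if kind == "IN":
            if last_out is not None:        # an OUT was seen: previous episode closes here
                episodes.append((first_in, last_out))
                first_in, last_out = None, None
            if first_in is None:            # ignore repeated INs before any OUT
                first_in = ts
        else:  # OUT
            if first_in is not None:        # orphan OUTs before any IN are skipped
                last_out = ts               # keep only the last OUT
    if first_in is not None and last_out is not None:
        episodes.append((first_in, last_out))
    return episodes

# Repeated INs collapse to the first; repeated OUTs collapse to the last
events = [("IN", 1), ("IN", 2), ("OUT", 10), ("OUT", 11), ("IN", 30), ("OUT", 40)]
print(extract_episodes(events))  # [(1, 11), (30, 40)]
```

Note that this sketch also naturally skips the "orphan" OUT-only events at the dataset's start, discussed further below.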

At this stage, we were almost ready to start drafting (and trying out) the first part of our algorithm. We just needed some business logic and threshold information from our client; we asked for it, and this is what we got:

 

CLIENT INPUT FOR BUSINESS LOGIC AND THRESHOLDS:

BL1) Any short-duration leave (i.e. a small "exit-entry" during a shift; see EXAMPLE 3 above) of less than 2 hours (let's call this Threshold 1, i.e. θ1) should be counted as time inside the plant.

BL2) Any total stay of more than 13 hours duration (Threshold 2, i.e. θ2) should be counted as overstay, irrespective of night or day shift.

Thus, with the above, we were ready to implement and evaluate the first part of our algorithm (towards shift analytics and overstay detection), which functioned even with the "imperfections" of the dataset analyzed above:

 

ALGORITHM PART 1:

STEP1) Collapse Event Type Categories (from the original 4 to 2 composite categories):

{Gate In, Biometric In} => “IN”
{Gate Out, Biometric Out} => ”OUT”

STEP2) Turn multiple consecutive Entries/Exits into just one, i.e.

Concatenate “IN … IN” into a single “IN”, keeping the datetime of the first IN in the run
Concatenate “OUT … OUT” into a single “OUT”, keeping the datetime of the last OUT in the run

STEP3) Count short-duration leaves as time spent inside the plant, as per the client's BL1 rule, i.e.

If a break is shorter than θ1 = 2 hours, then eliminate the in-between “OUT” – “IN” pair:
[IN OUT IN OUT] => [IN OUT], keeping only the first entry and last exit time
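Steps 1-3 can be sketched together as a small pipeline over one worker's chronologically sorted events. This is our illustrative reconstruction, not the client project's actual code:

```python
from datetime import datetime, timedelta

THETA_1 = timedelta(hours=2)  # BL1: leaves shorter than this count as time inside

# STEP 1 mapping: collapse the four event types into two composite types
COLLAPSE = {"Gate In": "IN", "Biometric In": "IN",
            "Gate Out": "OUT", "Biometric Out": "OUT"}

def steps_1_to_3(raw_events):
    """raw_events: chronologically sorted (event_type, timestamp) pairs for one worker."""
    # STEP 1: collapse Gate/Biometric into IN / OUT
    seq = [(COLLAPSE[ev], ts) for ev, ts in raw_events]
    # STEP 2: merge runs -- keep first of consecutive INs, last of consecutive OUTs
    merged = []
    for kind, ts in seq:
        if merged and merged[-1][0] == kind:
            if kind == "OUT":
                merged[-1] = (kind, ts)   # keep the latest OUT of the run
            # for an IN run, the earliest is already stored: do nothing
        else:
            merged.append((kind, ts))
    # STEP 3: eliminate OUT-IN pairs shorter than theta_1 (short-duration leaves)
    result = []
    for kind, ts in merged:
        if (kind == "IN" and result and result[-1][0] == "OUT"
                and ts - result[-1][1] < THETA_1):
            result.pop()                  # drop the in-between OUT ... IN pair
        else:
            result.append((kind, ts))
    return result

day = [("Gate In", datetime(2021, 9, 10, 6, 55)),
       ("Biometric In", datetime(2021, 9, 10, 6, 58)),
       ("Gate Out", datetime(2021, 9, 10, 12, 0)),   # short errand outside
       ("Gate In", datetime(2021, 9, 10, 13, 0)),
       ("Biometric Out", datetime(2021, 9, 10, 18, 30)),
       ("Gate Out", datetime(2021, 9, 10, 18, 35))]
print(steps_1_to_3(day))  # collapses to a single [IN, OUT] pair
```

On the example day, the 1-hour errand (12:00-13:00) is below θ1 and is absorbed, leaving one episode from 06:55 to 18:35.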

STEP 4) Count how many episodes of each type exist at this stage:

[IN OUT]: 4781 of 4865 = 98%, which break down into:
85% G-G + 5% B-B + 2% G-B + 6% B-G, where G=Gate, B=Biometric. (The remaining episodes are orphans: some orphan [OUT]-only episodes on the first day of the one-month period, whose corresponding IN occurred before the start datetime of the whole dataset and is thus missing; and correspondingly, some orphan [IN]-only episodes on the last day of the dataset, whose corresponding OUT occurred after the dataset's last datetime and was thus not included.)

Note, however, that the fact that we now have 98% "complete" episodes does not mean that there are no missing IN/OUTs - these could have generated very long (more than 24- or 48-hour) "episodes", as we shall see! Thus:

STEP 5) Estimate the probability distributions of episode duration, and interpret the findings:

By aggregating across all 200 workers and creating histograms of episode durations (showing 0...24 hours; but note that, as we shall see, we have quite a few "episodes" with durations all the way up to 100 hours or more!), this is what we get:


Histogram count of episodes as per duration in hours (capped at 24hrs)


 

So, an analysis of counts by episode duration (across all episodes, aggregated across subjects) now shows:

84% Below 13 hours, 16% Above 13 hours

And this 16% breaks down to:
12% below 24 hours, and 4% above 24 hours (max 619h)

Thus, at this stage, we have three types of episodes:

84% Episodes with duration clearly below θ2=13 hours (certain no overstay violation)

12% Episodes with duration between θ2=13 hours and 24 hours (certain overstay violation, as no second shift can start within 24 hours, as per our client's input)

4% Episodes with duration above 24 hours, which need further investigation as to whether they indeed indicate overstay or merely arose from "missing" in/out events - which is exactly what the second part of our algorithm will determine, on the basis of the tracker log (X2), as we shall see below!
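This three-way split reduces to two thresholds on episode duration (θ2 = 13 h for overstay, plus the 24-hour cap, given that no second shift can start within 24 hours). A minimal sketch, with illustrative durations:

```python
# Three-way episode classification by apparent duration (in hours):
# theta_2 = 13 h overstay threshold, 24 h minimum gap between shifts
def classify_episode(duration_hours):
    if duration_hours < 13:
        return "certain no overstay violation"
    if duration_hours <= 24:
        return "certain overstay violation"     # too long for one shift, too short to hide a missed exit
    return "needs further investigation"        # may be real overstay, or a missing IN/OUT event

durations = [8.5, 13.7, 23.9, 96.0, 619.0]      # illustrative values
for d in durations:
    print(f"{d:6.1f} h -> {classify_episode(d)}")
```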

So, the question arises: How can the tracker data help us investigate these cases?

In the next blog of this series, we answer this question! The link to the next blog can be found here.

 

ANALYSIS ACROSS INDIVIDUAL WORKERS

But let us temporarily move away from aggregates over people, and also provide an individual-worker-level analysis, to further explore the situation created by the Entry/Exit log X1:


Per individual worker, how often do we have easy vs. difficult episode cases?


Thus, from the above it becomes clear that if "quality" is defined in the simplistic manner above (just the fraction of type-1 "certain no violation" episodes among all of a worker's episodes), only 60% of the workers have quality above 90%; there does exist a sizeable number of cases requiring investigation.
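The per-worker quality metric just described can be sketched with a groupby; the episode table and durations below are invented for illustration:

```python
import pandas as pd

# Illustrative per-worker quality: fraction of "certain no violation"
# episodes (apparent duration below theta_2 = 13 h) among all episodes
episodes = pd.DataFrame({
    "person_id": ["W001", "W001", "W001", "W002", "W002"],
    "duration_h": [8.0, 9.5, 30.0, 12.0, 11.0],   # invented durations
})

quality = (episodes["duration_h"] < 13).groupby(episodes["person_id"]).mean()
share_above_90 = (quality > 0.9).mean()  # fraction of workers with quality > 90%
print(quality.to_dict(), share_above_90)
```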

So again, the question arises: "How can the tracker data help us investigate these cases?", which we answer in the next blog in the series!

 

RECAP - WHAT WE HAVE SEEN IN THIS ARTICLE

In this article, which is the second in the blog post series (links to the other articles can be found in its introduction), we have discussed the following questions:

Q6) WHAT IS THE OVERALL STRUCTURE OF OUR REAL-WORLD DATASET?

Where we introduced the two main logs: the Entry/Exit log (X1), and the Tracker log (X2). And then, we asked:

Q7) HOW TO EXAMINE THE ENTRY/EXIT LOG (X1) IN FURTHER DETAIL?

And to answer this, we touched upon missing and erroneous values, missing rows (i.e. measurements that did not take place or were not registered), distributions of entry/exit event types, and different cases of entry/exit subsequences. Then, we introduced and precisely defined "EPISODES", and examined three different episode types with respect to overstay (certain no violation, certain violation, needs further investigation). Finally, and most importantly, after motivating it and describing the business logic and thresholds our client gave us, we introduced the first five steps of our basic processing algorithm (ALGORITHM PART 1), and also examined the distribution of apparent episodes by duration.

In order to further classify the episodes of the third type ("needs further investigation"), the next article in this series (which you can find here) will utilize the Tracker log (X2), after examining it, to classify these episodes into "most probably violation" and "most probably no violation".