This is part 2 of the Person of Interest series – you can find part one here: The Predictive Science Behind TV’s “Person of Interest”
“You are being watched
The government has a secret system – a “machine”.
It spies on you every hour of every day
I know, because I built it.
I designed the machine to detect acts of terror
But it sees everything
Violent crimes involving ordinary people
People like you.
Crimes the government considered irrelevant
They wouldn’t act, so I decided I would.
But I needed a partner
Someone with the skills to intervene.
Hunted by the authorities, we work in secret
You will never find us.
But if your number is up, we will find you.”
Eyes and Ears Everywhere
In my first post in this series (The Predictive Science Behind TV’s “Person of Interest”), I revealed that the “machine” has access to an immense amount of data – it can tap into many forms of communication, look up data from thousands of sources, and of course track you everywhere you go.
It looks for schemers, plotters, malicious intent, and suspicious transactions – but it has to analyze everything because individual event data doesn’t say anything about what data is relevant or why.
When he started building the “machine”, Finch had to teach it to collect information from any source without deciding whether the information is useful or not. This is very different from the world of analytics we live in, where we usually do not have the option to collect “everything” due to cost, privacy, or other issues. Even in Person of Interest, it is unrealistic for “the machine” to store every byte collected on every one of the city’s 8 million inhabitants.
The Secret To Better Predictive Analytics
So how does Finch’s “machine” work its magic without actually storing and then trying to process all of that data on the fly? Metadata. Finch taught his machine to perform image and facial recognition, voice-to-text transcription, and textual sentiment analysis to create metadata – data about the data.
This turns a single image of a person into a name, a birthdate, work history, GPS location, even metadata about other people visible in the same picture. An intercepted email contains entities, what they think, what they have done, and who they know. The machine continuously processes everything it collects and automatically derives the additional metadata.
In many cases, this metadata is all that needs to be kept – for example, if the machine has identified all the people in a photograph, where and when it was taken, and any other extractable metadata such as sentiment, there may be no other reason to keep the original binary image. The same metadata processing happens for all audio, video, and any other data the machine collects – so while the machine collects terabytes of data per second, it doesn’t need to store the raw binary streams.
POI Techniques In The Real World
Hopefully this gives you some ideas – you don’t need to do hyper-speed text analysis to learn more about a customer. If you are a retail operation, is your customer buying items from men’s and women’s departments or just their own gender’s? If you are in a services industry, did the customer phone into the call center recently? What was the nature of their call, and was their sentiment positive or negative? Maybe you want to run speech-to-text and capture the transcript of every call instead of storing all the audio. These extra pieces of metadata could be useful in determining buyer behavior or understanding if a customer is happy with you or not.
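To make the call center idea concrete, here is a minimal Python sketch of reducing a call transcript to a small metadata record. Everything in it is an assumption for illustration: the word lists, the CallMetadata fields, and the summarize_call helper are hypothetical, and a real system would use a trained sentiment model rather than keyword counts.

```python
from dataclasses import dataclass

# Hypothetical keyword lists, for illustration only.
POSITIVE = {"thanks", "great", "happy", "resolved", "perfect"}
NEGATIVE = {"angry", "cancel", "broken", "refund", "terrible"}


@dataclass
class CallMetadata:
    customer_id: str
    word_count: int
    sentiment: str  # "positive", "negative", or "neutral"


def summarize_call(customer_id: str, transcript: str) -> CallMetadata:
    """Reduce a transcript to the few fields we want to keep."""
    # Normalize each token: strip surrounding punctuation and lowercase it.
    words = [w.strip(".,!?").lower() for w in transcript.split()]
    # Naive score: positive keyword hits minus negative keyword hits.
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        sentiment = "positive"
    elif score < 0:
        sentiment = "negative"
    else:
        sentiment = "neutral"
    return CallMetadata(customer_id, len(words), sentiment)


record = summarize_call("c1", "My order arrived broken and I want a refund!")
print(record)
```

The point of the sketch is the shape of the output: a few small, queryable fields per call that can be stored and joined to customer records long after the audio itself is gone.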
What metadata should you collect? Anything and everything you can. This is what the “machine” does, and what every good data scientist wishes they could do. Whether you store the data in your SQL-based data warehouse, an SAP HANA system, or your own Hadoop cluster, it doesn’t matter – you can always transfer or blend your data later once you know what you want to do with it. I’ve met some customers who think this is “overkill”, and for some organizations it might be – the problem is you never know until it’s too late. There is no requirement to put in this extra thought (or processing in some cases), but if it could improve your profits by even 10%, wouldn’t it be worth it?
Metadata about Metadata
But how do you actually use this data to get that boost to your business? Not so fast – there’s another step that we typically don’t consider in the business intelligence world: derived data sets. There may be patterns in the data that cannot be detected by analyzing a few rows, and a more sophisticated way of looking at the data is needed.
For example, it may be interesting that a perpetrator was at Central Park at 2 pm on Monday, but it becomes very interesting once we know that she has been there every weekday, and only when one particular other person is at the same location. Is the perpetrator stalking a new victim? How are they related? Is she shadowing the victim or does she know the victim’s routine? By the way, how do we even know that we are analyzing a perpetrator and not a victim?
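The Central Park pattern can be sketched as a derived data set in a few lines of Python. This is only an illustration under stated assumptions: the sightings are made-up (person, place, time slot) tuples, the names are placeholders, and co_presence is a hypothetical helper, not anything the show or a real product describes.

```python
from collections import defaultdict


def co_presence(sightings, subject):
    """For each other person, return the fraction of the subject's
    sightings in which that person was at the same place and time slot."""
    present = defaultdict(set)  # (place, time_slot) -> people seen there
    for person, place, slot in sightings:
        present[(place, slot)].add(person)

    # Every place/time combination where the subject was seen.
    subject_slots = [key for key, people in present.items() if subject in people]

    counts = defaultdict(int)
    for key in subject_slots:
        for other in present[key] - {subject}:
            counts[other] += 1

    total = len(subject_slots)
    return {other: n / total for other, n in counts.items()} if total else {}


# Made-up sample data for illustration.
sightings = [
    ("alice", "Central Park", "Mon 14:00"),
    ("bob", "Central Park", "Mon 14:00"),
    ("alice", "Central Park", "Tue 14:00"),
    ("bob", "Central Park", "Tue 14:00"),
    ("carol", "Central Park", "Tue 14:00"),
    ("bob", "Library", "Wed 10:00"),
]
print(co_presence(sightings, "alice"))
```

No single row in the sample data is suspicious on its own. The signal only shows up in the derived ratio, where one person is present every time the subject is.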
The reason predictive analytics needs disaggregated (non-summarized) data is that while a single row may not be significant, a combination of events may mean something. The secret lies in creating additional fields in the data based on a higher level understanding of the data that is lost when looking only at summarized records. This is where data science becomes non-obvious and a data scientist earns their wage.
As a trivial example, a data scientist may add seven additional binary fields to every event/transaction to encode the day of the week for easy analysis. Time of day? There might be another 24 fields. This can be extended to other types of data as well – it is easier for some algorithms to use a sales order record if each product has its own binary field that is set to “true” if it is included in that order.
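The binary fields described above can be sketched in plain Python. The order layout, the field names, and the widen helper are assumptions made up for this example; in practice a library routine for one-hot encoding would usually do this work.

```python
from datetime import datetime

DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]


def widen(order, catalog):
    """Expand one order into binary fields: 7 for the day of the week,
    24 for the hour of the day, and one per product in the catalog."""
    ts = order["timestamp"]
    row = {}
    for i, day in enumerate(DAYS):
        row[f"day_{day}"] = ts.weekday() == i  # weekday(): Monday is 0
    for hour in range(24):
        row[f"hour_{hour:02d}"] = ts.hour == hour
    for product in catalog:
        row[f"has_{product}"] = product in order["items"]
    return row


# Hypothetical order and three-item catalog, for illustration only.
order = {"timestamp": datetime(2024, 1, 1, 14, 30), "items": {"shirt"}}
row = widen(order, ["shirt", "shoes", "hat"])
print(len(row), [name for name, value in row.items() if value])
```

Even this tiny example turns a two-field order into 34 fields, and the product columns grow with the size of the catalog.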
Big Data = Wide Data
This creates a massive explosion in the number of fields to analyze and is a very important concept in predictive analytics: When we say “Big Data”, we mean very wide data sets and not just long ones.
If you thought collecting extra metadata was “overkill”, stay tuned for part 3 of this series where we’ll uncover the true secret to why the “machine” can be as accurate as it is.
As it was with part one, feedback drives the posting frequency for this series – rate this article (below) or “like” it above to help get the next installment sooner 😉 .