What is Dark Data, Why Does it Matter, and Why Are Humans Still Needed?
Back in the 1960s, a pair of radio astronomers were busily collecting data on distant galaxies. They had been doing this for years. Elsewhere, other astronomers had been doing the same.
But what set these astronomers apart – and eventually earned them a Nobel Prize – was what they found in the data. Like other radio astronomers, they had long detected a consistent noise pattern. But unlike others, they persisted in trying to understand where the noise was coming from and realized that it wasn’t a defect in their equipment, as they had initially suspected. Instead, it was an echo of the Big Bang: the cosmic microwave background, still detectable billions of years later.
This discovery helped prove the Big Bang theory – which, at the time, was not yet fully accepted by the scientific community. Other astronomers had collected similar data but had failed to recognize the full value of what they had observed – and today’s organizations are grappling with a similar dilemma. Opportunities for key insights are often buried in a vast universe of dormant information known as “dark data.”
It’s easy to collect information, but it’s hard to turn it into insights.
Vast swathes of information are generated every day – everything from corporate financial figures to teenage social media videos. It’s stored in corporate data warehouses, data lakes, and a myriad of other locations – and while some of it is put to good use, it’s estimated that around 73% of this data remains unexplored.
Just like dark matter in astrophysics, this unexplored data can’t be observed directly by standard analytics tools, and so has been largely wasted.
So how can organizations find data in their own universes?
Every data point stored has potential value. But to extract it, the data typically needs to be translated into other forms, reanalyzed, and turned into action. This is where new technologies and new opportunities come into play.
Today’s data volumes have long since exceeded the capacities of straightforward human analysis, and so-called “unstructured” data, not stored in simple tables and columns, has required new tools and techniques. But the latest machine learning algorithms can help us detect and identify patterns in the data – once some common problems are addressed.
Improving data quality
Unexamined and unused data is often of poor quality. This can be because it’s intrinsically noisy, due to inaccurate signals from cheap sensors or the linguistic ambiguities of social media sentiment analysis (“it’s wicked!”). Or it can simply be because there’s been little incentive to improve it.
Today’s data quality solutions, augmented by machine learning capabilities, can help sift through the noise, identify the patterns of bad data quality, and help fix the problem.
New technologies make it easier than ever to bring together information from sources both inside and outside the organization. Sometimes this can provide the missing key to unlock new value from the data you already have.
Weather radar data, for example, must filter out various sources of background noise to make more accurate predictions. But as we’ve seen, one person’s noise is another’s data gold mine. It turns out that weather radar can be an invaluable source of information about bird migrations.
Ornithologists have been able to augment and unlock the value of the radar information by combining it with data stored in “citizen science repositories.” These repositories, containing observations from amateur birdwatchers, provide a detailed, three-dimensional view of migrations for different bird species at little cost. With this data, ornithologists can better analyze the loss of biodiversity and the effects of climate change.
Or take the city of Venice – which seeks to minimize the potentially damaging impact of millions of yearly visitors. With anonymized information from cell phone operators, the city has been able to analyze the flows of tourists throughout the city to better manage congestion and facilitate smarter municipal planning.
Another example is the city of Brussels, where authorities sought to improve the lives of citizens with disabilities. Using a municipal transport database that stored time and location data for when wheelchair ramps were used on buses, the city was able to optimize the allocation of funds to provide better access and a better experience for disabled citizens.
The problems of dark data are compounded by dark variables – the “black holes” of the dark data universe, invisible to the naked eye, but whose gravitational pull affects other objects.
For example: did you know that children with big feet have better handwriting? At first glance this may seem surprising – but correlation is not causation. In this case, the dark variable is “age.” Children with bigger feet have better handwriting because they’re older. Without understanding this dark variable, one can imagine executives immediately rushing off to create a feet-stretching taskforce. But, as always, it’s best to get the full picture before taking action – which is why humans are needed.
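The foot-size example can be made concrete with a small simulation. The sketch below is illustrative only: the ages, foot lengths, and handwriting scores are synthetic, and the coefficients and noise levels are assumptions chosen for the demonstration, not measurements. It generates children whose foot length and handwriting score each depend on age but not on each other, then compares the raw correlation with the partial correlation once age is controlled for.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def partial_correlation(xs, ys, zs):
    """Correlation of xs and ys after removing the linear effect of zs."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))


# Synthetic data: every number here is an assumption made for illustration.
rng = random.Random(42)
ages = [rng.uniform(5, 12) for _ in range(2000)]          # the "dark variable"
foot_cm = [14 + 1.0 * a + rng.gauss(0, 1.0) for a in ages]  # depends on age only
handwriting = [5.0 * a + rng.gauss(0, 8.0) for a in ages]   # depends on age only

raw_r = pearson(foot_cm, handwriting)
partial_r = partial_correlation(foot_cm, handwriting, ages)

print(f"foot size vs. handwriting (raw):            {raw_r:.2f}")
print(f"foot size vs. handwriting (controlling age): {partial_r:.2f}")
```

In this setup the raw correlation should come out strongly positive, while the partial correlation should sit close to zero, because age accounts for the whole relationship. With real data, of course, someone first has to suspect that age belongs in the analysis.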
The human factor: shining a light into dark data
Untapped dark data represents opportunities to get new insights into aspects of your business that have previously been invisible. Such insights can help you increase efficiencies, spot new customer opportunities, or improve your carbon footprint.
But doing this requires an approach based on both machines and humans.
On the machine side of the equation, SAP and Intel have been co-innovating to help organizations move forward. SAP Business Technology Platform, for example, provides a full, cloud-native suite of solutions to integrate, improve, analyze, and act on data. At the core of this platform is the SAP HANA database, which runs in memory.
“Intel helps make SAP’s in-memory approach viable for real-scenarios,” says Jeremy Rader, General Manager, Enterprise Strategy & Solutions at Intel. “With technologies that speed processing, drive performance, enable memory persistence, and support security, we’re helping organizations get the most out of all their data – including dark data.”
But as powerful as SAP and Intel technologies may be, ultimately making sense of dark data takes people. Only humans can understand the context of how the data is stored, what data might be inaccurate or missing, and how it can be used to deliver greater value to customers and the business.
The best way forward is to bring together experts on data with experts on the underlying business processes being studied. In this way, you can turn dark data into insights and help drive business improvements.
To learn more about dark data and how businesses can realize the true value of their unstructured data, have a look at this explainer video at Vox.