Machine Learning Thursdays: Data—The Crude Oil of the Digital World?
Today’s blog is brought to us by Dr. Paul Pallath, Chief Data Scientist & Senior Director with the Advanced Analytics Organisation at SAP.
The digital economy spurred a move from systems of records to systems of intelligence. Every interaction that a human or machine has with the digital world leaves behind a trace. This digital footprint carries significant information, and the data it contains provides fertile ground for building systems of intelligence.
Today, an increasing number of enterprises are inclined to make data-driven decisions, and a renewed focus on data strategy is gaining ground as a result. However, the quality and robustness of our insights depend directly on the quality of the data. The following simile provides a helpful way of viewing data in this context:
“Data is just like crude. It’s valuable, but if unrefined it cannot really be used.”
– Clive Humby (2006)
How to refine data is an important consideration. Data-driven insights are generated from data that is both structured (collected as part of our systems of records) and unstructured (non-transactional data such as social media, video, speech, and text content not captured as part of traditional systems of records). Combining data from such disparate sources is difficult to achieve and almost impossible to maintain. What’s more, data pollution, leakage, and spillage can have serious repercussions (much like oil spills).
It’s dangerous for machine learning implementations to rely on an external data source that lacks a clear authoritative source, or whose data collection process is poorly understood. This can create hidden debts that lead to non-scalable, unstable systems.
Finding a safe, data-driven approach is therefore important, and the quicker you find it, the better. The impact of machine learning on humankind is growing, and governments around the world are enacting new regulations to ensure transparency around the type of data enterprises use, how they use their insights, and how privacy should be protected. Hence, standardization and data governance are focused on fixing problems at the source, which affects the organizations owning the data.
On this point, this second comparison of data mining with oil exploration is illuminating:
“The difference between oil and data is that the product of oil does not generate more oil, whereas the product of data will generate more.”
– Piero Scaruffi (2016)
Data is the raw material for building systems of intelligence. It’s vitally important that enterprises are equipped with good-quality data so they can gain consistent value on their journey towards becoming data-driven organizations. That way, data can serve as a clean natural resource for your machine learning implementation.
July 13th Webinar: Debts in Machine Learning Implementations
Join me for a webinar as I discuss debts in machine learning implementations. When building machine learning at scale and at an accelerated pace to address business needs, we need to be aware of the debts we accumulate in such implementations and service them regularly. Hidden debts are even more dangerous because they compound silently. This presentation will discuss the notion of “massive machine learning” and the various types of hidden debts that we need to be cautious about.
- Date: July 13, 2017
- Time: 7 am – 8 am PT
- Presenter: Dr. Paul Pallath
- Register Now