How to Avoid Drowning in a Data Lake

timoelliott · 06-22-2017

This article is excerpted from “Data Lakes: Deep Insights” by Timo Elliott, John Schitka, Michael Eacrett, and Carolyn Marsan.

Big Data is morphing into Vast Data. The next generation of the technology will lead to insights and correlations that reveal new strategies—even new business models.

Companies of all types—from agriculture through transportation and financial services to retail—are tapping into massive repositories of data known as data lakes. They hope to discover correlations that they can exploit to expand product offerings, enhance efficiency, drive profitability, and discover new business models they never knew existed.

How to Avoid Drowning in the Lake

The benefits of data lakes can be squandered if you don’t manage the implementation and data ownership carefully. Deploying and managing a massive data store is a big challenge. Here’s how to address some of the most common issues that companies face:

Determine the ROI. Developing a data lake is not a trivial undertaking. You need a good business case, and you need a measurable ROI. Most importantly, you need initial questions that can be answered by the data, which will prove its value.

Find data owners. As devices with sensors proliferate across the organization, the issue of data ownership becomes more important.

Have a plan for data retention. Companies used to have to cull data because it was too expensive to store. Now companies can become data hoarders. How long do you store it? Do you keep it forever?

Manage descriptive data. Software that lets you tag all the data in one or more data lakes and keep it up to date is not mature yet. We still need tools that bring the metadata together to support self-service and that automate metadata handling to speed up the preparation, integration, and analysis of data.

Develop data curation skills. There is a huge skills gap for data repository development. But many people will jump at the chance to learn these new skills if companies are willing to pay for training and certification.

Be agile enough to take advantage of the findings. It used to be that you put in a request to the IT department for data and had to wait six months for an answer. Now, you get the answer immediately. Companies must be agile to take advantage of the insights.

Secure the data. Besides the perennial issues of hacking and breaches, a lot of data lake software is open source and less secure than typical enterprise-class software.

Measure the quality of data. Different users can work with varying levels of quality in their data. For example, data scientists working with a huge number of data points might not need completely accurate data, because they can use machine learning to cluster data or discard outlying data as needed. However, a financial analyst might need the data to be completely correct.

Avoid creating new silos. Data lakes should work with existing data architectures, such as data warehouses and data marts.
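
To make the "Measure the quality of data" tip above a little more concrete, here is a minimal sketch in Python. It is only an illustration, not a recommended tool: the function names, the `temp` field, and the sample readings are all hypothetical. It shows two simple checks, namely how complete a field is across a set of records, and how a data scientist might discard outlying values using the common interquartile-range rule.

```python
from statistics import quantiles


def completeness(records, field):
    """Return the fraction of records in which `field` is present and not None."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) is not None)
    return filled / len(records)


def drop_outliers(values, k=1.5):
    """Keep only values within k * IQR of the first and third quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # three cut points: Q1, median, Q3
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if low <= v <= high]


# Hypothetical sensor records; two of the four are missing a reading.
records = [{"temp": 10}, {"temp": None}, {"temp": 12}, {}]
share_filled = completeness(records, "temp")

# Hypothetical readings with one obviously suspicious value.
readings = [10, 11, 12, 11, 10, 95]
cleaned = drop_outliers(readings)
```

A data scientist might be comfortable working with `cleaned` even when `share_filled` is low, while a financial analyst would more likely want every record present and verified before using the data at all.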

Data for All

Given the tremendous amount of hype that has surrounded Big Data for years now, it’s tempting to dismiss data lakes as a small step forward in an already familiar technology realm. But it’s not the technology that matters as much as what it enables organizations to do. By making data available to anyone who needs it, for as long as they need it, data lakes are a powerful lever for innovation and disruption across industries.

Read the full article as it appears in the D!gitalist Magazine for innovative data-lake case studies.

Timo Elliott is Vice President, Global Innovation Evangelist, at SAP. John Schitka is Senior Director, Solution Marketing, Big Data Analytics, at SAP. Michael Eacrett is Vice President, Product Management, Big Data, Enterprise Information Management, and SAP Vora, at SAP. Carolyn Marsan is a freelance writer who focuses on business and technology topics.