We live in a world of Big Data. Today, the data that organizations collect, store, process, and analyze is growing in volume, velocity, and variety at nearly exponential rates. As we attempt to extract business value from data, the subject of Big Data Governance becomes increasingly important. It is therefore imperative that organizations get a handle on Big Data Governance before they’re engulfed in the ever-rising tide of data. Today’s Big Data solutions demand metadata management, scalable architectures, and privacy, security, and compliance controls. This is the first of a three-part series on Big Data Governance.
What is Big Data?
What differentiates Big Data from ordinary data, or “Small Data” as it is sometimes called? Big Data is most commonly defined in terms of the three “V’s” – Volume, Velocity, and Variety – first described by Gartner back in 2001.¹ The definition of Big Data volume will remain a moving target; as of 2012, Big Data sizes ranged from a few dozen terabytes to many petabytes in a single dataset.
- Volume – Refers to the sheer amount of data collected and stored
- Velocity – Refers to the speed of data collection
- Variety – Refers to the range of data types, sources, and languages
Some have added other “V’s” like Veracity (accuracy) or Value (new business value derived from new data sets) to define Big Data; however, the three original “V’s” represent some of the biggest challenges to Big Data Governance. Let’s take a closer look.
The “V’s” and Big Data Governance
Each of Big Data’s three V’s – Volume, Velocity, and Variety – presents its own governance challenges.
- Volume – Is the “V” most clearly associated with the term Big Data. Historically, the cost and technical complexity of storing large volumes of data limited Big Data pursuits. Today, advances in technology have made it economically feasible for many organizations to store large amounts of internal or external data to support business analytics. However, storing terabytes or petabytes of data on economical platforms does not guarantee that the resulting analytics will yield new business insights. Analytics that combine traditional structured enterprise data with less structured Big Data content can produce unpredictable results that fail to deliver the expected value. A key governance principle for Big Data volumes is, “Don’t try to boil the ocean.” What does this mean? Identify the Big Data source content that is most relevant to the business use case and establish contextual metadata for that specific content. This enhances business value without having to process large volumes of less relevant content.
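The “don’t boil the ocean” principle above can be sketched in a few lines: filter a content feed down to the records relevant to one business use case and attach contextual metadata only to those. This is a minimal illustration, not a prescribed implementation – the record shape, keyword matching, and metadata fields are all hypothetical.

```python
# Minimal sketch (hypothetical data model): tag only records relevant
# to a single business use case with contextual metadata, instead of
# trying to process the entire content volume.
from datetime import datetime, timezone

def tag_relevant(records, keywords, use_case):
    """Keep records mentioning any of the use case's keywords and
    attach contextual metadata (use case, tagging timestamp)."""
    tagged = []
    for rec in records:
        text = rec.get("text", "").lower()
        if any(kw in text for kw in keywords):
            tagged.append({
                **rec,
                "metadata": {
                    "use_case": use_case,
                    "tagged_at": datetime.now(timezone.utc).isoformat(),
                },
            })
    return tagged

records = [
    {"id": 1, "text": "Customer churn rose in Q3", "source": "crm"},
    {"id": 2, "text": "Office picnic photos", "source": "intranet"},
]
relevant = tag_relevant(records, ["churn", "retention"], "churn-analysis")
```

Only the churn-related record survives, so downstream metadata management effort is spent where the business value is.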
- Velocity – Is the “V” that represents perhaps the greatest challenge to Big Data Governance, data quality, and analytics. The question here is: Big Risk or Big Value? Today, businesses make decisions based on information from external Big Data sources – planned inventory levels and deployments, marketing investments, advertising campaigns, predictive analytics, and acquisitions, to name a few. Big Data solutions consume large amounts of ever-changing data in near real time, often from external sources with little or no governance applied. Businesses expect Big Data solutions to deliver business insight and competitive advantage, but supplying decision makers with online dashboards and reports built from unstructured, ungoverned data sources is inherently risky. Investing a few seconds of run-time processing or pre-processing to apply data cleansing, transformations, or standardizations based on established methodologies improves data quality and business confidence in the results, reducing the overall risk of decisions based on them.
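The cleansing-and-standardization pass described above might look like the following sketch: a cheap pre-processing step that standardizes fields and rejects malformed records before they reach a dashboard. The field names and validation rules are illustrative assumptions, not a reference implementation.

```python
# Hypothetical sketch: a lightweight pre-processing pass that cleanses
# and standardizes incoming records before they feed dashboards.
import re

def cleanse(record):
    """Return a standardized copy of the record, or None if it fails
    basic quality checks (the caller can route rejects for review)."""
    email = record.get("email", "").strip().lower()
    if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", email):
        return None  # reject malformed contact data rather than guess
    return {
        "email": email,
        "country": record.get("country", "").strip().upper()[:2],
        "amount": round(float(record.get("amount", 0)), 2),
    }

raw = [
    {"email": " Ana@Example.COM ", "country": "us", "amount": "19.999"},
    {"email": "not-an-email", "country": "de", "amount": "5"},
]
clean = [c for r in raw if (c := cleanse(r)) is not None]
```

A pass like this costs milliseconds per record but prevents malformed or non-standard values from silently skewing near-real-time reports.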
- Variety – Is another “V” that makes Big Data Governance challenging. With a myriad of sources available and the explosive growth of unstructured web and social media content, it will increasingly become necessary to bring together the three types of data structures: structured, semi-structured, and unstructured. Not all data sources and content will be of equal value or quality; given the variety of content available, some will be more relevant and valuable for business use than others. To address this variety, consider applying Pareto’s principle by asking, “From which information will I derive the greatest value with the least amount of effort?” Focus on that subset of sources and content first, and consider eliminating data sources that provide no value or are highly redundant or duplicative.
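The Pareto-style selection above can be sketched as a simple ranking exercise: score candidate sources by estimated value per unit of integration effort and keep the small subset that captures most of the value. The source names and scores below are invented for illustration.

```python
# Hypothetical sketch: a Pareto-style cut over candidate data sources,
# ranked by estimated value per unit of integration effort.
sources = [
    {"name": "crm",        "value": 90, "effort": 10},
    {"name": "web_logs",   "value": 60, "effort": 30},
    {"name": "social",     "value": 40, "effort": 50},
    {"name": "legacy_dup", "value": 5,  "effort": 40},  # redundant copy
]

# Best value-for-effort first.
ranked = sorted(sources, key=lambda s: s["value"] / s["effort"], reverse=True)

# Keep sources until roughly 80% of the total estimated value is captured.
total = sum(s["value"] for s in sources)
kept, captured = [], 0
for s in ranked:
    if captured >= 0.8 * total:
        break
    kept.append(s)
    captured += s["value"]

selected = [s["name"] for s in kept]
```

Here the low-value, redundant source falls below the cut and can be dropped, which is exactly the pruning the principle recommends.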
Have you considered how Big Data’s three “V’s” factor into your Big Data solutions and your organization’s governance practices? In part two, we will explore Big Data Governance techniques and technology.
¹ Doug Laney, Gartner, 2001