With all the discussion going on about big data, it’s important not to forget about data governance. Below are some of my thoughts on how organizations will need to evolve data governance to meet big data challenges. In the upcoming webinar “Information Governance in a World of Big Data,” John Radcliffe will share his insights on the topic.
Some policies will probably need to change, and new ones will need to be created. How do you connect the policies to your data, processes, systems, business rules, etc.? And how do you make them visible and accessible to people throughout the organization?
You need to start thinking about connecting metadata in Hadoop (HCatalog) with your enterprise metadata. Among other benefits, this enables lineage and impact analysis (trust levels) and improves search: you may have a huge pool of data in Hadoop, but if people can’t find what they are looking for, it does no good. Programmers who are building big data apps are going to need enterprise architecture information models that include the new data sources.
There may be changes required to your governance organization based on new data types or responsibilities. Data stewards may need training on social data, or you may decide a Chief Data Officer is required to better oversee your governance program.
There will be a need for additional capabilities beyond traditional ETL, like replication, streaming data, and process integration. As the logical data warehouse rises in prominence, companies will want to look at federation and virtualization capabilities that enable access across relational databases and Hadoop without having to physically move the Hadoop data into a SQL environment. This will enable more sophisticated data-tiering strategies based on utilization. Companies may even add data quality metrics into the virtualization layer to rationalize data from multiple sources with different field names or metrics in the same query.
There will be a need to process new types of information like text and spatial data, which may require advanced capabilities to perform semantic discovery, or specialized functions/algorithms. Companies may want to look at new technologies like in-memory engines to speed up data profiling, transformation, cleansing, matching, master data consolidation, best record calculations, etc.
Archiving, Retention and Deletion
Companies will need to reassess their policies and decide what data to keep in live systems, what to archive to near-line storage, and what to just get rid of, because it is impossible to keep it all. There is potential value in leveraging new data storage technologies based on Hadoop that offer lower price points for information lifecycle management (ILM) requirements, particularly compared with holding years of information online in relatively expensive relational database technologies.
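The keep, archive, or delete decision described above can be sketched as a simple rule based on how recently a data set was used. This is only an illustration: the `Dataset` class, the `storage_tier` function, and both thresholds are hypothetical, and a real policy would also have to reflect legal and regulatory retention requirements.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical thresholds; real values would come from your retention policy.
ARCHIVE_AFTER_DAYS = 365        # idle for a year: move to near-line storage
DELETE_AFTER_DAYS = 7 * 365     # idle for seven years: candidate for deletion


@dataclass
class Dataset:
    name: str
    last_accessed: date


def storage_tier(ds: Dataset, today: date) -> str:
    """Return 'live', 'archive', or 'delete' based on days since last access."""
    idle_days = (today - ds.last_accessed).days
    if idle_days > DELETE_AFTER_DAYS:
        return "delete"
    if idle_days > ARCHIVE_AFTER_DAYS:
        return "archive"
    return "live"
```

Calling `storage_tier(Dataset("clickstream", date(2022, 6, 1)), date(2024, 1, 1))` would return `"archive"` under these assumed thresholds, since the data set has been idle for more than a year but less than seven.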
If you can’t actively govern all internally generated data for practical reasons, and you don’t control most of the external data anyway, then you need to start thinking differently. It would be sensible to start assigning trust levels to data, depending on whether it’s actively governed, passively governed, or not governed at all (to your knowledge). This may include incorporating trust levels into data quality scorecards and metrics, with the ability to display the lineage/provenance of the data so that users can determine for themselves whether it is fit for use.
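One way to picture trust levels combined with lineage is the minimal sketch below. All names here (`Governance`, `DataSource`, `lineage`, `effective_trust`) are hypothetical, as is the mapping from governance status to a trust level. The rule that derived data is only as trustworthy as its least trusted input is an assumption made for illustration, not an established standard.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Governance(Enum):
    ACTIVE = "actively governed"
    PASSIVE = "passively governed"
    NONE = "not governed"


# Hypothetical mapping from governance status to a trust level.
TRUST_LEVEL = {
    Governance.ACTIVE: "high",
    Governance.PASSIVE: "medium",
    Governance.NONE: "low",
}
TRUST_ORDER = ["low", "medium", "high"]


@dataclass
class DataSource:
    name: str
    governance: Governance
    upstream: List["DataSource"] = field(default_factory=list)


def lineage(source: DataSource) -> List[str]:
    """Return the source's name followed by all upstream names, depth-first."""
    names = [source.name]
    for parent in source.upstream:
        names.extend(lineage(parent))
    return names


def effective_trust(source: DataSource) -> str:
    """Assumed rule: a source is only as trusted as its least trusted input."""
    levels = [TRUST_LEVEL[source.governance]]
    levels += [effective_trust(parent) for parent in source.upstream]
    return min(levels, key=TRUST_ORDER.index)
```

For example, a report built from an actively governed CRM extract and an ungoverned social feed would show both inputs in its lineage and would be scored "low" under this rule, which gives a user the context to decide whether the report is fit for their purpose.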
Security and Privacy
Companies are going to have to reevaluate their security policies and ensure their segregation of duties and access controls are revised. They will need record-by-record privacy, security, and audit capabilities so data stewards can manage access to information based on consent, data sharing agreements, and corporate standards. If you are sharing or selling data, then methods of anonymization that were previously thought sufficient may no longer be so. Once you can buy in data sources and combine them with internally generated data, some of it “dark data” that has never been leveraged before, there is a real possibility of creating new derivations that clearly identify individuals.
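The re-identification risk can be illustrated with a k-anonymity style check, a technique not named in this post but commonly used for this purpose: count how many records share each combination of indirectly identifying fields, and look at the smallest group. The `smallest_group` function and the sample records are hypothetical.

```python
from collections import Counter


def smallest_group(records, quasi_identifiers):
    """Size of the smallest set of records sharing the same values
    for the given fields (0 if there are no records)."""
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers)
        for record in records
    )
    return min(groups.values(), default=0)


# Made-up sample data: no names, but still potentially identifying.
records = [
    {"zip": "02139", "birth_year": 1980, "gender": "F"},
    {"zip": "02139", "birth_year": 1980, "gender": "M"},
    {"zip": "02139", "birth_year": 1975, "gender": "F"},
    {"zip": "02139", "birth_year": 1975, "gender": "F"},
]

before = smallest_group(records, ["zip", "birth_year"])
after = smallest_group(records, ["zip", "birth_year", "gender"])
```

Here `before` is 2 and `after` is 1: with only zip code and birth year, every record is shared by at least two people, but once a gender column from another source is combined in, one record becomes unique and could point to a single individual.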
Master Data Management
Companies will want to start associating social data with their internal customer and product data. Although customer (CDI) and product (PIM) data were the starting points for master data governance, companies are going to need to start looking at master data more holistically and incorporate new domains like asset, device, and location.
I would love to hear your thoughts on how EIM and Data Governance will evolve to support big data.