This post is a recap of my presentation and a semi-autobiographical account of helping data teams set up data governance frameworks.
The first step is to understand what data governance is. Data Governance is an overloaded term and means different things to different people. I have found it helpful to define Data Governance by the outcomes it is supposed to deliver. In my case, Data Governance is any task required for:
- Compliance: Data life cycle and usage are in accordance with laws and regulations.
- Privacy: Protect data as per regulations and user expectations.
- Security: Data & data infrastructure are adequately protected.
Compliance, Privacy, and Security are different approaches to ensuring that data collectors and processors do not gain unregulated insights. It is hard to ensure that the right data governance framework is in place to meet this goal. An interesting example of an unexpected insight is the sequence of events that led to the leak of celebrities' taxi cab tipping history.
Paparazzi took photos of celebrities in New York City using taxi cabs. The photos had geo-locations and timestamps along with identifying information about taxi cabs like registration and medallion numbers. Independently, the Taxi Commission released an anonymized dataset of taxi trips with time, medallion numbers, fares, and tips. It was possible to link the metadata from photographs and the taxi usage dataset to get the tips given by celebrities.
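The linking step can be illustrated with a minimal sketch. All records below are made up for illustration, and the field names and the 10-minute matching window are assumptions, not details of the real datasets. The point is that a dataset without rider names can still be joined to outside metadata through the medallion number and time.

```python
from datetime import datetime, timedelta

# Made-up photo metadata: who was seen entering which cab, and when.
photos = [
    {"person": "Celebrity A", "medallion": "9Y99", "seen_at": datetime(2013, 7, 8, 18, 5)},
    {"person": "Celebrity B", "medallion": "4K12", "seen_at": datetime(2013, 7, 9, 12, 30)},
]

# Made-up "anonymized" trips: no rider names, but medallion and time remain.
trips = [
    {"medallion": "9Y99", "pickup": datetime(2013, 7, 8, 18, 3), "fare": 10.5, "tip": 2.0},
    {"medallion": "9Y99", "pickup": datetime(2013, 7, 8, 22, 40), "fare": 8.0, "tip": 1.0},
    {"medallion": "4K12", "pickup": datetime(2013, 7, 9, 12, 32), "fare": 14.0, "tip": 0.0},
]


def link(photos, trips, window=timedelta(minutes=10)):
    """Join photos to trips on medallion and a pickup time within `window`."""
    matches = []
    for photo in photos:
        for trip in trips:
            same_cab = photo["medallion"] == trip["medallion"]
            close_in_time = abs(photo["seen_at"] - trip["pickup"]) <= window
            if same_cab and close_in_time:
                matches.append((photo["person"], trip["fare"], trip["tip"]))
    return matches


linked = link(photos, trips)
for person, fare, tip in linked:
    print(person, fare, tip)
```

Neither dataset reveals tips per person on its own. The insight only appears once the two are joined.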
Data Governance is hard because:
- There is too much data.
- There is too much complexity in data infrastructure.
- There is no context for data usage.
The trend is towards businesses collecting more data from users and sharing more data with each other. For example, the image below lists some of the companies with which PayPal has data sharing agreements.
As companies share and hoard more data, it is possible that they will link these datasets to garner insights that were unexpected by the user.
The Data & AI Landscape lists approximately 1500 open-source and commercial technologies. In a small survey, I found that a simple data infrastructure uses 8-10 components. Data and security teams have to ensure similar capabilities in compliance and security across all the parts of the data infrastructure. This is very hard to accomplish.
Analytics, Data Science and AI objectives compete with compliance, privacy, and security. Blanket “Yes” or “No” access policies do not work. More context is required to enforce access policies appropriately:
- Who is using the data?
- For what purpose?
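One way to picture a policy that takes this context into account is a check keyed on both the role and the stated purpose. This is a minimal sketch: the `AccessRequest` type, the `POLICY` table, the role and purpose names, and the choice to allow untagged columns by default are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRequest:
    role: str     # who is using the data
    purpose: str  # what the data will be used for
    column: str   # which column is requested


# Hypothetical policy: each sensitive column maps to allowed (role, purpose) pairs.
POLICY = {
    "users.email": {
        ("support", "customer_contact"),
        ("analyst", "fraud_investigation"),
    },
}


def is_allowed(request, policy=POLICY):
    """Allow untagged columns; otherwise require a matching (role, purpose) pair."""
    allowed = policy.get(request.column)
    if allowed is None:
        return True  # assumption: columns absent from the policy are not sensitive
    return (request.role, request.purpose) in allowed


print(is_allowed(AccessRequest("analyst", "fraud_investigation", "users.email")))
print(is_allowed(AccessRequest("analyst", "marketing", "users.email")))
```

The same analyst gets a different answer depending on the purpose, which a blanket yes/no rule cannot express.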
When working with teams on Data Governance, I have found it helpful to start by answering these basic questions:
- Where is my data?
- Who has access to data?
- How is the data used?
Typically teams care only about sensitive data. Every company and team will have a different definition of what is sensitive. Common ones are PII, PHI, or financial data.
It is also important to ensure the process of obtaining answers is automated. Automation will ensure that the data governance framework is relevant and useful when required.
A data catalog, scanner, and data lineage application are required to keep track of sensitive data.
An example of a data catalog and scanner is PIICatcher. PIICatcher can scan databases and detect PII data. It can be extended to detect other types of sensitive data. The image shows the metadata stored by PIICatcher after scanning data in AWS S3 in the AWS Glue Catalog.
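To give a feel for what a scanner does, here is a minimal sketch that samples column values and tags columns whose values match a pattern. It is not PIICatcher's API or its detection logic. The regular expressions, the 0.8 threshold, and the function names are assumptions made for illustration.

```python
import re

# Deliberately simple patterns; a real scanner would use many more signals.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}


def detect_pii(sample_values, threshold=0.8):
    """Return the PII type matched by at least `threshold` of non-empty values."""
    values = [v for v in sample_values if v]
    if not values:
        return None
    for pii_type, pattern in PATTERNS.items():
        hits = sum(1 for v in values if pattern.match(v))
        if hits / len(values) >= threshold:
            return pii_type
    return None


def scan_table(columns):
    """Map column name -> detected PII type, for columns that look sensitive."""
    tags = {}
    for name, samples in columns.items():
        pii_type = detect_pii(samples)
        if pii_type:
            tags[name] = pii_type
    return tags


# Made-up sample values standing in for rows read from a table.
sample = {
    "contact": ["a@example.com", "b@example.org", ""],
    "tax_id": ["123-45-6789", "987-65-4321"],
    "city": ["Austin", "Pune"],
}
tags = scan_table(sample)
print(tags)
```

The resulting tags are the kind of metadata a catalog would store next to each column.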
Typically it is not practical to scan all datasets. Instead, it is sufficient to scan base datasets and then build a data lineage graph to track sensitive data. A library like data-lineage can build a DAG from query history using a graphing library. The DAG can be used to visualize the graph or process it programmatically.
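The idea of scanning only base datasets and following lineage can be sketched with plain dictionaries. This does not use the data-lineage library or its API. The table names and edges are hypothetical, and the edges are assumed to have already been extracted from query history.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges parsed from query history: (source, target).
edges = [
    ("raw.users", "staging.users_clean"),
    ("staging.users_clean", "analytics.daily_signups"),
    ("raw.orders", "analytics.revenue"),
]


def build_dag(edges):
    """Adjacency map from each source table to the tables derived from it."""
    dag = defaultdict(set)
    for source, target in edges:
        dag[source].add(target)
    return dag


def downstream(dag, start_nodes):
    """Breadth-first walk returning every table reachable from start_nodes."""
    seen = set(start_nodes)
    queue = deque(start_nodes)
    while queue:
        node = queue.popleft()
        for child in dag.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


dag = build_dag(edges)
# Suppose the scanner tagged only the base table raw.users as sensitive.
tainted = downstream(dag, ["raw.users"])
print(sorted(tainted))
```

Only `raw.users` needed scanning. The two derived tables inherit the sensitive tag through the graph, while the `raw.orders` branch is left alone.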
Most databases have an information schema that stores the privileges of users and roles. This table can be joined with a data catalog where columns with sensitive data are tagged to get a list of users and roles that have access to sensitive data.
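The join can be sketched with an in-memory SQLite database. The `column_privileges` and `catalog_tags` tables here are hypothetical stand-ins with made-up rows. Real information schemas differ by database, and a real catalog would have its own layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Stand-in for the privileges view of an information schema.
CREATE TABLE column_privileges (
    grantee TEXT, table_name TEXT, column_name TEXT, privilege TEXT
);
-- Stand-in for a data catalog where sensitive columns are tagged.
CREATE TABLE catalog_tags (table_name TEXT, column_name TEXT, tag TEXT);

INSERT INTO column_privileges VALUES
    ('analyst', 'users', 'email', 'SELECT'),
    ('analyst', 'users', 'signup_date', 'SELECT'),
    ('etl_job', 'users', 'email', 'SELECT');
INSERT INTO catalog_tags VALUES ('users', 'email', 'PII');
""")

# Users and roles that can read at least one column tagged as sensitive.
rows = conn.execute("""
SELECT p.grantee, p.table_name, p.column_name, t.tag
FROM column_privileges AS p
JOIN catalog_tags AS t
  ON p.table_name = t.table_name AND p.column_name = t.column_name
ORDER BY p.grantee
""").fetchall()

for row in rows:
    print(row)
```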
To understand how data is used, the first step is to log usage across all databases. Most databases store query history in the information schema. Big Data technologies like Presto provide hooks to capture usage. It is not advisable to log usage from production databases. Instead, proxies should be used for human access. Proxies can log usage and send the information to a log aggregator where it can be analyzed.
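A proxy that logs usage can be pictured as a thin wrapper that records who ran which query before passing it on. This is a minimal in-process sketch, not a network proxy. The `LoggingProxy` class, the record fields, and the `sink` callable (standing in for a log aggregator client) are all hypothetical.

```python
import json
import sqlite3
from datetime import datetime, timezone


class LoggingProxy:
    """Wraps a DB connection and records who ran which query."""

    def __init__(self, conn, sink):
        self._conn = conn
        self._sink = sink  # any callable that accepts one JSON string

    def execute(self, user, sql, params=()):
        record = {
            "user": user,
            "sql": sql,
            "at": datetime.now(timezone.utc).isoformat(),
        }
        # Log before executing so that failed queries are recorded too.
        self._sink(json.dumps(record))
        return self._conn.execute(sql, params)


log_lines = []  # stand-in for a log aggregator
proxy = LoggingProxy(sqlite3.connect(":memory:"), log_lines.append)
proxy.execute("alice", "CREATE TABLE t (x INTEGER)")
proxy.execute("alice", "INSERT INTO t VALUES (?)", (1,))
result = proxy.execute("bob", "SELECT x FROM t").fetchall()

for line in log_lines:
    print(line)
```

Only the SQL text is logged, not the bound parameters, so values supplied by users do not end up in the usage log.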
Data Compliance, Privacy & Security is a journey. Data governance is hard but can be tackled by starting with simple questions and using automation extensively.