DataGovOps: Fixing the Broken Promise of Data Governance

Today’s world revolves around data, and staying competitive requires enterprises to be data-driven. According to Accenture, 70% of the world’s most valuable corporations are data-driven, up from only 30% in 2008. Becoming data-driven means collecting and storing more and more data, and enterprises need to ensure that this data is stored and used safely and appropriately. In other words, they need good data governance.

But for most enterprises, good data governance is really a broken promise. Or, as one financial services enterprise recently described to us:

"We have been up and down with data governance a few times in the past few years. There was a governance office. We “check-boxed” our way through and it was ultimately seen as more of a barrier than a benefit."

Enterprises cannot “check-box” their way through Data Governance. Check boxes imply manual processes. Check boxes also imply that their data really isn’t being governed.

Enterprises need to operationalize their Data Governance. They need DataGovOps.

Let’s peel this onion.

What is DataGovOps?

DataGovOps refers to the collaborative data management practice focused on improving the communication, integration and automation of context and policy among all Data Governance stakeholders in an organization, including Security, Compliance, Privacy, and Data Owners. DataGovOps automates the integration of security and compliance at every phase of the data lifecycle.

In order to fully appreciate what DataGovOps is and why it’s needed, it’s important to address:

What data governance is supposed to be
Why data governance in most corporations is a broken promise
How DataGovOps fixes that broken promise.

That starts with a common understanding of what Data Governance is, who is responsible for it, how Data Governance functions typically operate, and where most Data Governance programs fall short.

What is Data Governance?

According to Google Cloud, Data Governance is:

…everything you do to ensure data is secure, private, accurate, available, and usable. It includes the actions people must take, the processes they must follow, and the technology that supports them throughout the data life cycle… Data governance means setting internal standards—data policies—that apply to how data is gathered, stored, processed, and disposed of. It governs who can access what kinds of data and what kinds of data are under governance.

Who’s Responsible for Data Governance?

Every enterprise has a Data Governance function. Whether or not it’s formally called “Data Governance” or has employees with “Data Governance” in their titles is another question.

In many large organizations, Data Governance is a distributed function or “program” across multiple teams, including:

Security
Compliance
Privacy
Data
And maybe a few others

Because Data Governance is a distributed function, relatively few professionals actually have Data Governance in their job title. A few searches on LinkedIn reveal:

1,540,000 professionals with “security” in their job title;
635,000 professionals with “compliance” in their job title; and
16,000 professionals with “data” and “governance” in their job title -- roughly 40X fewer than “compliance” titles and almost 100X fewer than “security” titles.

So Data Governance is a cross-functional fabric that spans multiple teams and/or departments. Or, as we like to say, Data Governance takes a village -- it’s a shared responsibility that requires a significant amount of coordination and collaboration across multiple teams.

How Do Data Governance Programs Typically Operate?

Three words: divide and conquer.

To understand how a Data Governance program operates, we need to break down the above definition of data governance.

If Data Governance is everything you do to ensure data is available, secure, private, accurate, and usable, then Data Governance subsumes multiple functions. These functions include:

Data Protection

Data Preservation

Ensuring data is automatically backed up
Ensuring data is replicated/highly available
Archiving data
Retaining/not retaining data

Data Security

Ensuring data is encrypted
Threat monitoring
Managing access control
Authentication/Identity Management
Breach Assessment and Recovery
Data Loss Prevention

Compliance & Privacy

Tracking and understanding relevant regulations
Tracking third-party data processing agreements (DPAs) that the enterprise needs to honor
Creating policies to stay compliant with regulations and DPAs
Establishing best practices
Managing Data Subject Requests (DSRs)
Conducting regular audits to ensure compliance

Data Management

Creating and maintaining a data catalog
Classifying data
Tagging data/adding metadata to data
Ensuring data discoverability
Data quality/data cleaning
Managing enterprise data architecture
Modeling data processes

Visually, the Data Governance function might look like this. (Credit to the Storage Networking Industry Association for most of the section under Data Protection.)

Figure 1: The Data Governance Function

In practice, the function shown above ends up segmented into separate teams that operate in silos and interact only occasionally, through periodic sensitive data audits or access audits.

Figure 2: Data Governance Functional Silos

The Problem with Data Governance and Functional Silos

There are 4 key problems when Data Governance is siloed by function.

Functional silos produce technology solutions that are also siloed. These solutions lack integration and automation across the data ecosystem and lifecycle, and they reinforce functional silos rather than breaking them down.
Collaboration between silos occurs periodically rather than continuously.
When functions do collaborate, collaboration takes the form of manual, time-consuming audits.
Between the periodic audits, data governance is left to chance and best intentions.

Data suffers from a fundamental lack of governance between the periodic audits. This is what we call the broken promise of data governance.

In other words, state-of-the-art Data Governance currently looks like this:

Figure 3: Data Governance via Cross-Functional Periodic Audits

Ideally, Data Governance should operate like this:

Figure 4: How Data Governance Should Ideally Operate

The Problem with Non-Continuous Data Governance

To illustrate the problem with non-continuous data governance, let’s assume that a team of employees is working on a special project, and they need access to a specific set of data that they don’t typically have access to. They submit a User Access Request Form. Every organization has one. Your organization has one too.

The User Access Request Form typically kicks off a process that looks something like the diagram in Figure 5 (below). The Security team receives the request. The Security team may need to validate the contents of the target data set with the Data Team; the Security team may also need to escalate the request to the employee’s manager or executive sponsor of the team, and there may also be a secondary approval process. Once approved, the database and/or IAM permissions need to be updated to reflect the new permissions.

The biggest problem with this process diagram is the “Stop” oval. While the request is in flight, data is governed carefully and meticulously. But once IAM and database permissions have been provisioned, the process ends and so does the governance, and many bad things can still happen. For example:

A team member who has just been granted access to the data set promptly copies sensitive data into another table that all employees have access to.
A day after the team has been granted access to the data set, a data engineer adds more sensitive data to the data set. Now that team has access to much more sensitive data than originally intended.
A new employee is added to the team, but the new team member isn’t working on the special project. The new employee may inadvertently get access to the sensitive data.
An employee leaves the team and no longer needs access to the sensitive data set. After transferring departments, the employee still has ongoing access to the data set because someone forgot to file a request to reduce access.
An employee on the team leaves the company. Because access to this sensitive data set was not granted as part of typical onboarding, the off-boarding process does not deprovision it, leaving a “ghost” user and creating a potential security vulnerability.

Without continuous monitoring of context and policy -- i.e., without operationalization -- the world of data governance becomes a massive collection of unenforced contracts. Even periodic access control audits and sensitive data audits leave data essentially ungoverned between audits.
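To make the idea of continuous monitoring concrete, here is a minimal sketch of an access review that could run on a schedule rather than once at approval time. It is an illustration only: the `Grant` record, the `review_grants` function, the sample names, and the 60-day idle threshold are all hypothetical, and a real system would pull this information from IAM, HR, and audit-log sources.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Grant:
    user: str
    dataset: str
    last_used: date  # most recent access, e.g. taken from audit logs


def review_grants(grants, active_employees, project_members, today, max_idle_days=60):
    """Return (grant, reason) pairs for grants that should be revoked or reviewed."""
    findings = []
    for g in grants:
        if g.user not in active_employees:
            findings.append((g, "ghost user: no longer employed"))
        elif g.user not in project_members.get(g.dataset, set()):
            findings.append((g, "not on the project that justified access"))
        elif today - g.last_used > timedelta(days=max_idle_days):
            findings.append((g, "idle beyond threshold"))
    return findings


# Hypothetical sample data mirroring the scenarios above.
grants = [
    Grant("alice", "special_project", date(2024, 5, 30)),
    Grant("bob", "special_project", date(2024, 5, 1)),     # left the company
    Grant("carol", "special_project", date(2024, 5, 20)),  # moved to another team
    Grant("dave", "special_project", date(2024, 1, 15)),   # has not used the data in months
]
findings = review_grants(
    grants,
    active_employees={"alice", "carol", "dave"},
    project_members={"special_project": {"alice", "dave"}},
    today=date(2024, 6, 1),
)
for grant, reason in findings:
    print(f"{grant.user}: {reason}")
```

Run daily, a check like this would surface the departed employee, the transferred employee, and the unused grant without waiting for the next periodic audit.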

Figure 5: User Access Request Process

Data Governance Policies: More than Just Access Control

Many people -- and many commercial software solutions, for that matter -- tend to reduce the role of Data Governance to managing access control. For example:

Determine what kind of data resides where via data classification
Manage who has access to a specific database, schema, or table
Manage what kind of permission they have (read, read/write).

Next-gen access control solutions might also include self-service portals for employees to request access, and obfuscated access, where employees can access data, but specific fields are masked or tokenized on the fly.
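As a rough illustration of obfuscated access, the sketch below masks or tokenizes selected fields depending on the caller's role. Everything in it is an assumption made for the example: the `FIELD_POLICY` mapping, the role names, the `obfuscate` function, and the hard-coded key (a real deployment would use a managed secret and enforce the policy inside the data platform, not in application code).

```python
import hashlib
import hmac

# Hypothetical policy: which fields are masked or tokenized for each role.
FIELD_POLICY = {
    "analyst": {"email": "mask", "ssn": "tokenize"},
    "privacy_officer": {},  # sees every field in the clear
}

SECRET_KEY = b"demo-only-key"  # assumption: placeholder, never hard-code real keys


def mask(value: str) -> str:
    """Hide everything except the last 4 characters."""
    return "*" * max(len(value) - 4, 0) + value[-4:]


def tokenize(value: str) -> str:
    """Deterministic keyed token, so equal inputs still join and group consistently."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]


ACTIONS = {"mask": mask, "tokenize": tokenize}


def obfuscate(row: dict, role: str) -> dict:
    """Return a copy of the row with the role's field rules applied."""
    if role not in FIELD_POLICY:
        # Default deny: a role without a policy gets no data at all.
        raise PermissionError(f"no policy defined for role {role!r}")
    rules = FIELD_POLICY[role]
    return {k: ACTIONS[rules[k]](v) if k in rules else v for k, v in row.items()}


row = {"name": "Ada", "email": "ada@example.com", "ssn": "123-45-6789"}
analyst_view = obfuscate(row, "analyst")
officer_view = obfuscate(row, "privacy_officer")
print(analyst_view)
```

The keyed token is deterministic on purpose: the same input always yields the same token, so analysts can still count or join on the field without seeing the underlying value.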

Access control is a necessary bedrock of a good data governance program. At the same time, access control is insufficient for good data governance.

Let’s explore some other types of Data Governance policies that enterprises have, and how those policies can often end up as well-intended pieces of paper on someone’s desk that aren’t actually enforced -- i.e., they never get operationalized.

Examples of Data Governance Policies	Data Governance Broken Promises
Data Preservation Policies
“All databases should be backed up every day.”	A cloud database isn’t configured to be backed up.
“All databases in Region A should be replicated to Region B.”	Replication script fails and no one notices.
“Data should be retained for 2 years, then archived.”	Data is kept beyond the retention period.
“All data warehouse clusters should reside in this account and region.”	Data warehouse cluster created in the wrong account and/or wrong region.
Data Security Policies
“All data stores should be encrypted at rest.”	A cloud database isn’t configured to be encrypted.
“All data stores should be inaccessible from the public internet.”	An S3 bucket or MongoDB instance is accessible to the public.
“All data should be replaced with synthetic data when being copied over from production to staging.”	Script fails, production data ends up in staging environment.
“When an employee is terminated, all the employee’s database usernames should be deprovisioned.”	Database usernames need to be manually deprovisioned, and a username doesn’t get deprovisioned.
“If a user hasn’t accessed sensitive data stores in the last 2 months, their permission should be reduced.”	Once granted permission, users have permanent access to data stores.
“Only Marketing personnel should be able to generate customer lists with more than 100 rows.”	A customer success representative downloads a list of 50,000 customers.
“All database user names should map to a service account or an employee identity.”	An unrecognized database username has activity and doesn’t map to an employee identity or a service account.
“Summer interns and high-risk employees should not have access to highly sensitive data.”	Someone inadvertently copies sensitive data to a schema that all employees have access to, including summer interns and high-risk employees.
“Customers should only be able to access their own data.”	Customer A is able to access Customer B’s S3 bucket.
Compliance & Privacy Policies
“Schema A should never have PII in it.”	Someone inadvertently copies PII into Schema A.
“Type A data should never reside in the same table as Type B data.”	Someone inadvertently copies Type A data into a table with Type B data.
“Type A and Type B data should never appear together in query results.”	Someone inadvertently issues a query that accesses 2 different tables and joins Type A and Type B data with an anonymous but unique identifier.
“Data from this data set should never be stored in this set of countries.”	A table from the data set in question is copied into a data store that resides in a restricted country.
“Data from this data set should never be handled by employees from this set of countries.”	An employee from a restricted country accesses that data set.
“We should always know where sensitive data is stored in our environment.”	A day after a sensitive data audit is conducted, an employee copies sensitive data from one data store to another.
“No employees should be able to violate the privacy of any of our customers.”	An employee looks up the records of his/her ex.
“No employees should make material changes to records in this data set.”	An employee deletes records in that data set. An employee adds several records in that data set. An employee makes material edits to records in that data set.
Data Management Policies
“All field names should adhere to this naming convention:....”	Field names that don’t follow the naming convention are added.
“All data stores/sets should have a Data Owner assigned to them.”	Data stores/sets exist that have no data owner assigned to them.
“Data Owners should always be aware of any/all fields in their respective data sets.”	50 new fields were added to a data set last month, and the Data Owner is unaware of them.
“All data stores/sets should have the following metadata associated with them:...”	Multiple data stores/sets have no metadata associated with them.
“All fields classified as this sensitive data type should have the following tags/metadata associated with them:...”	Inconsistent tagging across fields that have the same sensitive data type.
“Fields that are entirely synthetic data should not be marked as sensitive.”	Fields containing synthetic data are marked as sensitive.
“Whenever a Data Steward changes the classification of a field, the Data Owner should review that new classification.”	Data Owner does not know when a field is reclassified.
“When data is copied to a different location, the metadata associated with the data should be copied with it.”	Data is copied, but metadata isn’t, resulting in multiple (conflicting) sets of metadata for the same field.
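
Several of the policies in the table can be written as automated checks instead of sentences on paper. The sketch below is one hedged way to do that, assuming an inventory of data stores is already available. The `DataStore` fields, the `POLICIES` mapping, and the sample inventory are hypothetical, and `backup_enabled` is a simplification of "backed up every day".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataStore:
    name: str
    encrypted_at_rest: bool
    publicly_accessible: bool
    backup_enabled: bool
    owner: Optional[str]


# Each policy pairs its wording with a predicate that returns True when the store complies.
POLICIES = {
    "All data stores should be encrypted at rest": lambda s: s.encrypted_at_rest,
    "All data stores should be inaccessible from the public internet": lambda s: not s.publicly_accessible,
    "All databases should be backed up every day": lambda s: s.backup_enabled,
    "All data stores/sets should have a Data Owner assigned": lambda s: s.owner is not None,
}


def evaluate(stores):
    """Return (store name, violated policy) pairs across the whole inventory."""
    return [
        (s.name, text)
        for s in stores
        for text, complies in POLICIES.items()
        if not complies(s)
    ]


# Hypothetical inventory; in practice this would come from cloud and database APIs.
inventory = [
    DataStore("orders_db", True, False, True, "finance"),
    DataStore("marketing_bucket", True, True, True, "marketing"),
    DataStore("scratch_db", False, False, False, None),
]
violations = evaluate(inventory)
for store, policy in violations:
    print(f"{store}: violates '{policy}'")
```

Because the policy text and the check live side by side, a new policy becomes one more entry in the mapping, and every run re-evaluates the full inventory rather than a sample chosen for an audit.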

Data Governance Needs to be Operationalized with DataGovOps

SalesOps measures and evaluates sales data to determine the effectiveness of a product, sales process, or campaign. Similarly, MarketingOps measures and evaluates marketing data to determine the effectiveness of marketing programs and campaigns.

DevOps is the combination of philosophies, practices, and tools that increases an organization's ability to deliver applications and services at high velocity.

DevSecOps automates the integration of security at every phase of the software development lifecycle, from initial design through integration, testing, deployment, and software delivery.

By analogy, Data Governance Operations -- or DataGovOps -- is the combination of practices and tools that:

Automatically make data more secure, private, accurate, available and usable;
Guide people to take appropriate action and follow established processes to better govern data; and
Continually measure and evaluate how internal data standards -- i.e., data policies -- are being adhered to.
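
The third point, continual measurement, can be illustrated with a small sketch that turns individual check results into an adherence score per policy. The function name `adherence_by_policy`, the policy identifiers, and the sample results are hypothetical. The example assumes some upstream process already records whether each resource passed each check.

```python
from collections import defaultdict


def adherence_by_policy(results):
    """results: iterable of (policy, resource, passed). Returns {policy: fraction passing}."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for policy, _resource, ok in results:
        total[policy] += 1
        passed[policy] += int(ok)
    return {policy: passed[policy] / total[policy] for policy in total}


# Hypothetical results from one evaluation run.
results = [
    ("encrypted_at_rest", "orders_db", True),
    ("encrypted_at_rest", "scratch_db", False),
    ("owner_assigned", "orders_db", True),
    ("owner_assigned", "scratch_db", True),
]
scores = adherence_by_policy(results)
below_target = sorted(policy for policy, score in scores.items() if score < 1.0)
print(scores)
print("Needs attention:", below_target)
```

Tracking these scores over time, rather than at audit time only, is what distinguishes measurement from a check-box exercise.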

DataGovOps is the collaborative data management practice focused on improving the communication, integration and automation of context and policy among all Data Governance stakeholders in an organization, including Security, Compliance, Privacy, and Data Owners. DataGovOps automates the integration of security and compliance at every phase of the data lifecycle. It’s the much-needed engineering counterpart to traditional Data Governance.

The cloud has transformed both the volume of data kept in organizations and the speed at which that data is growing. Given cloud scale and cloud velocity, Data Governance can no longer be a hodge-podge of manual steps, occasional audits, and a series of broken promises. It’s imperative for enterprises to automate and scale their Data Governance functions and invest in systems that continuously ensure that their data is being appropriately inventoried, stored, used, and deleted.

Now is the time to fix the broken promises of data governance. Now is the time for DataGovOps.
