The “Who Does What” Guide To Enterprise Data Quality


One answer and many best practices for how larger organizations can operationalize data quality programs for modern data platforms

An answer to “who does what” for enterprise data quality. Image courtesy of the author.

I’ve spoken with dozens of enterprise data professionals at the world’s largest corporations, and one of the most common data quality questions is, “who does what?” It is quickly followed by, “why and how?”

There’s a reason for this. Data quality is like a relay race. The success of each leg — detection, triage, resolution, and measurement — depends on the others. Each time the baton is passed, the chances of failure skyrocket.

Photo by Zach Lucero on Unsplash

Practical questions deserve practical answers.

However, every organization is organized around data slightly differently. I’ve seen organizations with 15,000 employees centralize ownership of all critical data, while organizations half their size decide to fully federate data ownership across business domains.

For the purposes of this article, I’ll be referencing the most common enterprise architecture, which is a hybrid of the two. This is the aspiration for many data teams, and it also features many cross-team responsibilities that make it particularly complex and worth discussing.

Just keep in mind that what follows is AN answer, not THE answer.


Whether pursuing a data mesh strategy or something else entirely, a common realization for modern data teams is the need to align around and invest in their most valuable data products.

This is a designation given to a dataset, application, or service whose output is particularly valuable to the business. This could be a revenue-generating machine learning application or a collection of insights derived from well-curated data.

As scale and sophistication grow, data teams further differentiate between foundational and derived data products. A foundational data product is typically owned by a central data platform team (or sometimes a source-aligned data engineering team). They’re designed to serve hundreds of use cases across many teams or business domains.

Derived data products are built on top of these foundational data products. They’re owned by domain-aligned data teams and designed for a specific use case.

For example, a “Single View of Customer” is a common foundational data product that might feed derived data products such as a product up-sell model, churn forecasting, and an enterprise dashboard.

The distinction between foundational and derived data products is critical for larger organizations. Image courtesy of the author.

There are different processes for detecting, triaging, resolving, and measuring data quality incidents across these two data product types. Bridging the chasm between them is essential. Here’s one popular way I’ve seen data teams do it.

Foundational Data Products

Before becoming discoverable, there should be a designated data platform engineering owner for every foundational data product. This is the team responsible for applying monitoring for freshness, volume, schema, and baseline quality end-to-end across the entire pipeline. A good rule of thumb most teams follow is, “you built it, you own it.”

By baseline quality, I’m referring very specifically to requirements that can be broadly generalized across many datasets and domains. They are often defined by a central governance team for critical data elements and generally conform to the six dimensions of data quality. Requirements like “id columns should always be unique,” or “this field is always formatted as a valid US state code.”
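As a minimal sketch, baseline rules like these can be expressed as small, reusable checks. The row format (a list of dicts) and column names here are hypothetical:

```python
# Minimal sketch of the two baseline quality rules mentioned above.
# The row format (list of dicts) and column names are illustrative.
US_STATE_CODES = {
    "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA", "HI", "ID",
    "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", "MA", "MI", "MN", "MS",
    "MO", "MT", "NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND", "OH", "OK",
    "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV",
    "WI", "WY",
}

def check_unique(rows, column):
    """Baseline rule: values in `column` must be unique across rows."""
    values = [r[column] for r in rows]
    return len(values) == len(set(values))

def check_valid_state(rows, column):
    """Baseline rule: `column` always holds a valid US state code."""
    return all(r[column] in US_STATE_CODES for r in rows)

rows = [{"id": 1, "state": "NY"}, {"id": 2, "state": "CA"}]
print(check_unique(rows, "id"), check_valid_state(rows, "state"))  # True True
```

In practice, generic rules like these are usually declared as tests in a tool (dbt, Great Expectations, or an observability platform) rather than hand-rolled, but the logic is the same.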

In other words, foundational data product owners cannot simply ensure the data arrives on time. They need to make sure the source data is complete and valid; data is consistent across sources and subsequent loads; and critical fields are free from error. Machine learning anomaly detection models can be particularly effective in this regard.
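To make the anomaly detection idea concrete, here is a deliberately simple volume monitor using a z-score over the trailing load history. Real observability platforms use far richer models; this is only a sketch with made-up numbers:

```python
from statistics import mean, stdev

def volume_anomaly(daily_row_counts, threshold=3.0):
    """Flag the latest load if its row count deviates more than
    `threshold` standard deviations from the trailing history.
    A toy stand-in for the ML models observability tools use."""
    *history, latest = daily_row_counts
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold

counts = [1000, 1020, 990, 1010, 1005, 120]  # last load dropped sharply
print(volume_anomaly(counts))  # True
```

The same pattern applies to freshness (hours since last load) and field-level distributions, just with different input series.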

More precise and customized data quality requirements are typically use-case dependent, and better applied by derived data product owners and analysts downstream.

Derived Data Products

Data quality monitoring also needs to occur at the derived data product level, as bad data can infiltrate at any point in the data lifecycle.

Even if data quality is good at the foundational data product level, that doesn’t mean it won’t go bad at the derived data product level. Image courtesy of the author.

However, at this level there’s more surface area to cover. “Monitoring all tables for every possibility” isn’t a practical option.

There are many factors for when a set of tables should become a derived data product, but they can all be boiled down to a judgment of sustained value. This is often best executed by domain-based data stewards who are close to the business and empowered to follow general guidelines around frequency and criticality of usage.

For example, one of my colleagues, in his previous role as the head of data platform at a national media company, had an analyst develop a Master Content dashboard that quickly became popular across the newsroom. Once it became ingrained in the workflow of enough users, they realized this ad-hoc dashboard needed to become productized.

When a derived data product is created or identified, it needs to have a domain-aligned owner responsible for end-to-end monitoring and baseline data quality. For many organizations that will be domain data stewards, as they are most familiar with global and local policies. Other ownership models include designating the embedded data engineer who built the derived data product pipeline, or the analyst who owns the last-mile table.

The other key difference in the detection workflow at the derived data product level is business rules.

There are some data quality rules that can’t be automated or generated from central standards. They can only come from the business. Rules like, “the discount_percentage field can never be greater than 10 when the account_type equals business and customer_region equals EMEA.”
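A rule like that is trivial to express once the business states it. Here is a sketch using the field names from the example, with a plain-dict row format assumed for illustration:

```python
def discount_rule(row):
    """Business rule from the text: discount_percentage may never exceed
    10 when account_type is business and customer_region is EMEA.
    The row format (a plain dict) is assumed for illustration."""
    if row["account_type"] == "business" and row["customer_region"] == "EMEA":
        return row["discount_percentage"] <= 10
    return True  # the rule does not constrain other rows

rows = [
    {"account_type": "business", "customer_region": "EMEA", "discount_percentage": 25},
    {"account_type": "consumer", "customer_region": "EMEA", "discount_percentage": 25},
]
violations = [r for r in rows if not discount_rule(r)]
print(len(violations))  # 1: only the business/EMEA row breaks the rule
```

The hard part isn’t the code; it’s knowing the rule exists at all, which is why these belong with the analysts closest to the business.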

These rules are best applied by analysts, specifically the table owner, based on their experience and feedback from the business. There is no need for every rule to trigger the creation of a data product; that is too heavy and burdensome. This process should be completely decentralized, self-serve, and lightweight.

Foundational Data Products

In some ways, ensuring data quality for foundational data products is less complex than for derived data products. There are fewer foundational products by definition, and they are typically owned by technical teams.

This means the data product owner, or an on-call data engineer within the platform team, can be responsible for common triage tasks such as responding to alerts, determining a probable point of origin, assessing severity, and communicating with consumers.

Every foundational data product should have at least one dedicated alert channel in Slack or Teams.

There are many ways you can organize your data quality notification strategy, but a best practice is to ensure every foundational data product has its own dedicated channel. Image courtesy of the author.

This avoids alert fatigue and can serve as a central communication channel for all derived data product owners with dependencies. To the extent they’d like, they can stay abreast of issues and be proactively informed of any upcoming schema or other changes that may impact their operations.
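The per-product channel convention is simple to enforce in code. Here is a sketch of the routing layer; the product names, channel names, and fallback channel are all hypothetical, and a real implementation would post via the Slack or Teams API:

```python
# Sketch of per-product alert routing. Product and channel names are
# hypothetical; a real setup would post via Slack incoming webhooks.
PRODUCT_CHANNELS = {
    "single_view_of_customer": "#dq-single-view-of-customer",
    "orders_mart": "#dq-orders-mart",
}

def route_alert(product, message, default_channel="#dq-unrouted"):
    """Return the (channel, formatted message) an alert should post to.
    Unmapped products land in a catch-all channel for later assignment."""
    channel = PRODUCT_CHANNELS.get(product, default_channel)
    return channel, f"[{product}] {message}"

print(route_alert("orders_mart", "freshness SLA missed by 2h"))
```

The catch-all channel is a useful forcing function: anything landing there is a data product without an owner, which is itself a finding.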

Derived Data Products

Typically, there are too many derived data products for data engineers to properly triage given their bandwidth.

Making each derived data product owner responsible for triaging alerts is a commonly deployed strategy (see image below), but it can also break down as the number of dependencies grows.

A data triage process for derived data product owners. Image courtesy of the author. Source.

A failed orchestration job, for example, can cascade downstream, creating dozens of alerts across multiple data product owners. The overlapping fire drills are a nightmare.
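One way to tame the cascade is to collapse downstream alerts into a single incident per upstream root, using lineage. The table names and single-parent lineage map below are hypothetical simplifications (real lineage graphs have multiple parents per table):

```python
from collections import defaultdict

# Hypothetical lineage: each table points to its direct upstream parent.
LINEAGE = {
    "churn_model_input": "single_view_of_customer",
    "upsell_dashboard": "single_view_of_customer",
    "single_view_of_customer": "raw_customers",
    "raw_customers": None,
}

def root_of(table):
    """Walk the lineage upward to the ultimate upstream source."""
    while LINEAGE.get(table):
        table = LINEAGE[table]
    return table

def group_alerts(alerted_tables):
    """Collapse a storm of downstream alerts into one incident per root,
    so one failed upstream job produces one fire drill, not dozens."""
    incidents = defaultdict(list)
    for t in alerted_tables:
        incidents[root_of(t)].append(t)
    return dict(incidents)

print(group_alerts(["churn_model_input", "upsell_dashboard"]))
# {'raw_customers': ['churn_model_input', 'upsell_dashboard']}
```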

One increasingly adopted best practice is for a dedicated triage team (often labeled DataOps) to support all products within a given domain.

This can be a Goldilocks zone that reaps the efficiencies of specialization without becoming so impossibly large that the team turns into a bottleneck devoid of context. These teams must be coached and empowered to work across domains, or you will simply reintroduce the silos and overlapping fire drills.

In this model the data product owner has accountability, but not responsibility.

Wakefield Research surveyed more than 200 data professionals: the average number of incidents per month was 60, and the median time to resolve each incident once detected was 15 hours. It’s easy to see how data engineers get buried in backlog.

There are many contributing factors, but the biggest is that we’ve separated the anomaly from the root cause, both technologically and procedurally. Data engineers look after their pipelines and analysts look after their metrics. Data engineers set their Airflow alerts and analysts write their SQL rules.

But pipelines–the data sources, the systems that move the data, and the code that transforms it–are the root cause of metric anomalies.

To reduce the average time to resolution, these technical troubleshooters need a data observability platform or some type of central control plane that connects the anomaly to the root cause. For example, a solution that surfaces how a distribution anomaly in the discount_amount field is related to an upstream query change that occurred at the same time.
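At its core, that correlation is a time-window join between anomalies and upstream change events. The event log below is entirely hypothetical; real platforms pull these events from query logs, git history, and schema registries:

```python
from datetime import datetime, timedelta

# Hypothetical log of upstream changes with timestamps.
changes = [
    {"type": "query_change", "table": "orders_raw",
     "at": datetime(2023, 5, 1, 9, 15)},
    {"type": "schema_change", "table": "customers_raw",
     "at": datetime(2023, 4, 20, 8, 0)},
]

def correlate(anomaly_at, events, window=timedelta(hours=6)):
    """Return upstream events within `window` of the anomaly:
    candidate root causes to surface next to the alert."""
    return [e for e in events if abs(e["at"] - anomaly_at) <= window]

suspects = correlate(datetime(2023, 5, 1, 10, 0), changes)
print([s["type"] for s in suspects])  # ['query_change']
```

Combined with the lineage walk from the triage section, this narrows “something is wrong somewhere” down to “this query change on this upstream table is the likely cause.”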

Foundational Data Products

Speaking of proactive communications, measuring and surfacing the health of foundational data products is essential to their adoption and success. If the consuming domains downstream don’t trust the quality of the data or the reliability of its delivery, they’ll go straight to the source. Every. Single. Time.

This of course defeats the entire purpose of foundational data products. Economies of scale, standard onboarding governance controls, and clear visibility into provenance and usage are now all out the window.

It can be difficult to provide a general standard of data quality that’s applicable to a diverse set of use cases. However, what data teams downstream really want to know is:

  • How often is the info refreshed?
  • How well maintained is it? How quickly are incidents resolved?
  • Will there be frequent schema changes that break my pipelines?

Data governance teams can help here by uncovering these common requirements and critical data elements to help set and surface smart SLAs in a marketplace or catalog (more specifics than you could ever want on implementation here).
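The three questions above translate directly into a small scorecard per foundational data product. A sketch, with hypothetical history data in place of real load logs and incident records:

```python
# Sketch of the three SLA signals downstream teams care about, computed
# from a hypothetical load/incident history for one foundational product.
loads_on_time = [True, True, False, True, True, True, True, True, True, True]
incident_resolution_hours = [2, 30, 5]
schema_changes_last_90d = 1

scorecard = {
    # "How often is the data refreshed (on time)?"
    "freshness_sla": sum(loads_on_time) / len(loads_on_time),
    # "How quickly are incidents resolved?"
    "median_resolution_hours": sorted(incident_resolution_hours)[
        len(incident_resolution_hours) // 2
    ],
    # "Will frequent schema changes break my pipelines?"
    "schema_changes_90d": schema_changes_last_90d,
}
print(scorecard)
```

Publishing numbers like these next to each product in the catalog is what lets downstream teams choose the platform over going straight to the source.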

Image courtesy of the author.

This is the approach of the Roche data team, which has created one of the most successful enterprise data meshes in the world; they estimate it has generated about 200 data products and roughly $50 million of value.

Derived Data Products

For derived data products, explicit SLAs should be set based on the defined use case. For instance, a financial report may need to be highly accurate with some margin for timeliness, whereas a machine learning model may be the exact opposite.

Table-level health scores can be helpful, but a common mistake is to assume that, on a shared table, the business rules placed by one analyst will be relevant to another. A table appears to be of low quality, but upon closer inspection a few outdated rules have repeatedly failed day after day without any action taken to either resolve the issue or adjust the rule’s threshold.
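One mitigation is to discount stale rules — rules that have failed continuously with no owner action — before computing a table’s health score. The rule records and thresholds below are hypothetical:

```python
# Sketch: a health score that excludes "stale" rules, i.e. rules that
# have failed for many days straight with no owner action. The rule
# records and the 14-day staleness threshold are illustrative.
rules = [
    {"pass_rate": 1.00, "days_failing_in_a_row": 0},
    {"pass_rate": 0.99, "days_failing_in_a_row": 1},
    {"pass_rate": 0.00, "days_failing_in_a_row": 45},  # abandoned rule
]

def health_score(rules, stale_after=14):
    """Average pass rate over rules someone is actually maintaining."""
    active = [r for r in rules if r["days_failing_in_a_row"] < stale_after]
    if not active:
        return None  # no signal: every rule is stale
    return sum(r["pass_rate"] for r in active) / len(active)

naive = sum(r["pass_rate"] for r in rules) / len(rules)
print(round(naive, 2), round(health_score(rules), 3))  # 0.66 0.995
```

The naive average makes a healthy table look broken because of one abandoned rule; excluding stale rules (and flagging them to their owners) gives a score consumers can actually trust.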

We covered a lot of ground. This article was more marathon than relay race.

The above workflows are one way to achieve success with data quality and data observability programs, but they aren’t the only way. If you prioritize clear processes for:

  • Data product creation and ownership;
  • Applying end-to-end coverage across those data products;
  • Self-serve business rules for downstream assets;
  • Responding to and investigating alerts;
  • Accelerating root cause analysis; and
  • Building trust by communicating data health and operational response

…you will see your team crossing the data quality finish line.

Follow me on Medium for more stories on data engineering, data quality, and related topics.
