Tools like dbt make constructing SQL data pipelines easy and systematic. But even with the added structure and clearly defined data models, pipelines can still grow complex, which makes debugging issues and validating changes to data models difficult.
The increasing complexity of data transformation logic gives rise to the following issues:
- Traditional code review processes only look at code changes and exclude the data impact of those changes.
- Data impact resulting from code changes is hard to trace. In sprawling DAGs with nested dependencies, discovering how and where data impact occurs is incredibly time-consuming, or near impossible.
GitLab’s dbt DAG (shown in the featured image above) is a perfect example of a data project that’s already a house of cards. Imagine attempting to follow a simple SQL logic change to a column through this entire lineage DAG. Reviewing a data model update can be a daunting task.
How would you approach this sort of review?
What’s data validation?
Data validation refers to the process used to determine that the data is correct in terms of real-world requirements. This means ensuring that the SQL logic in a data model behaves as intended by verifying that the data is correct. Validation is usually performed after modifying a data model, such as accommodating new requirements, or as part of a refactor.
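As a minimal illustration of what that verification can look like, one common approach is to diff the output of the modified model against the current production version. The warehouse schemas and the orders model below are hypothetical, not taken from any specific project:

-- Rows produced by the modified dev build that don't exist in prod
select * from analytics_dev.orders
except
select * from analytics.orders;

-- Rows in prod that the modified build no longer produces
select * from analytics.orders
except
select * from analytics_dev.orders;

If both queries return no rows, the change hasn’t altered existing data; any rows returned show exactly where the outputs diverge.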
A unique review challenge
Data has states and is directly affected by the transformations used to generate it. This is why reviewing data model changes is a unique challenge: both the code and the data need to be reviewed.
Because of this, data model updates should be reviewed not just for correctness, but also for context. In other words, to check that the data is correct and that existing data and metrics weren’t unintentionally altered.
Two extremes of data validation
In most data teams, the person making the change relies on institutional knowledge, intuition, or past experience to evaluate the impact and validate the change.
The validation method usually falls into one of two extremes, neither of which is ideal:
- Spot-checking with queries and a few high-level checks like row count and schema (for example, the quick checks sketched after this list). It’s fast but risks missing actual impact. Critical and silent errors can go unnoticed.
- Exhaustively checking every downstream model. It’s slow and resource intensive, and becomes costly as the pipeline grows.
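As a rough sketch of the first extreme, a spot-check often amounts to comparing row counts and eyeballing a headline metric. The schema and table names are again hypothetical:

-- Quick row-count comparison between the prod model and the modified dev build
select 'prod' as env, count(*) as row_count from analytics.orders
union all
select 'dev' as env, count(*) as row_count from analytics_dev.orders;

-- Eyeball a key metric for obvious shifts
select
    date_trunc('month', created_at) as month,
    sum(amount) as revenue
from analytics_dev.orders
group by 1
order by 1;

Checks like these are cheap to run, but they say nothing about row-level changes or about models further downstream.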
Either way, the result is a data review process that’s unstructured, hard to repeat, and often introduces silent errors. A new method is needed that helps the engineer perform precise and targeted data validation.
A better approach through understanding data model dependencies
To validate a change to a data project, it’s important to understand the relationships between models and how data flows through the project. These dependencies between models tell us how data is passed and transformed from one model to another.
Analyze the relationships between models
As we’ve seen, data project DAGs can be huge, but a data model change only affects a subset of models. By isolating this subset and then analyzing the relationships between those models, you can peel back the layers of complexity and focus only on the models that actually need validating, given a particular SQL logic change.
The types of dependencies in a data project are:
Model-to-model
A structural dependency in which columns are selected from an upstream model.
-- downstream_model
select
a,
b
from {{ ref("upstream_model") }}
Column-to-column
A projection dependency that selects, renames, or transforms an upstream column.
-- downstream_model
select
a,
b as b2
from {{ ref("upstream_model") }}
Model-to-column
A filter dependency in which a downstream model uses an upstream column in a where, join, or other conditional clause.
-- downstream_model
select
a
from {{ ref("upstream_model") }}
where b > 0
Understanding the dependencies between models helps us define the impact radius of a data model logic change.
Identify the impact radius
When making changes to a data model’s SQL, it’s important to know which other models might be affected (the models you need to check). At a high level, this is done by following model-to-model relationships. This subset of DAG nodes is known as the impact radius.
In the DAG below, the impact radius includes nodes B (the modified model) and D (the downstream model). In dbt, these models can be identified using the modified+ selector.
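If your project uses dbt’s state-based selection, a command along these lines will list the modified models plus everything downstream of them (the --state path pointing at artifacts from a previous production run is an assumption about how your project stores them):

dbt ls --select state:modified+ --state path/to/prod-artifacts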
Identifying modified nodes and their downstream models is an important start, and by isolating changes like this you can reduce the potential data validation area. However, this might still result in a large number of downstream models.
Classifying the types of SQL changes can further help you prioritize which models actually require validation by understanding the severity of the change, eliminating branches with changes that are known to be safe.
Classify the SQL change
Not all SQL changes carry the same level of risk to downstream data, and so they should be categorized accordingly. By classifying SQL changes in this way, you add a systematic approach to your data review process.
A SQL change to a data model can be classified as one of the following:
Non-breaking change
Changes that don’t impact the data in downstream models, such as adding new columns, adjusting SQL formatting, or adding comments.
-- Non-breaking change: New column added
select
id,
category,
created_at,
-- new column
now() as ingestion_time
from {{ ref('a') }}
Partial-breaking change
Changes that only impact downstream models that reference certain columns, such as removing or renaming a column, or modifying a column definition.
-- Partial breaking change: `category` column renamed
select
id,
created_at,
category as event_category
from {{ ref('a') }}
Breaking change
Changes that impact all downstream models, such as filtering, sorting, or otherwise changing the structure or meaning of the transformed data.
-- Breaking change: Filtered to exclude data
select
id,
category,
created_at
from {{ ref('a') }}
where category != 'internal'
Apply classification to reduce scope
After applying these classifications, the impact radius, and the number of models that need to be validated, can be significantly reduced.

In the above DAG, nodes B, C, and F have been modified, resulting in potentially 7 nodes that need to be validated (C to E). However, not every branch contains SQL changes that actually require validation. Let’s take a look at each branch:
Node C: Non-breaking change
C is classified as a non-breaking change. Therefore, both C and H don’t need to be checked and can be eliminated.
Node B: Partial-breaking change
B is classified as a partial-breaking change due to the change to the column B.C1. Therefore, D and E need to be checked if they reference column B.C1.
Node F: Breaking change
The modification to model F is classified as a breaking change. Therefore, all downstream nodes (G and E) need to be checked for impact. For instance, model G might aggregate data from the modified upstream column.
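As a hypothetical sketch of that situation (the upstream model raw_events and the amount column are assumptions made for illustration), a filter added in f silently shifts the aggregates computed in g:

-- model f (modified): the new filter is a breaking change for everything downstream
select
    id,
    category,
    amount
from {{ ref('raw_events') }}
where category != 'internal' -- newly added filter

-- model g (downstream): aggregates over f, so its results change when f is filtered
select
    category,
    count(*) as event_count,
    sum(amount) as total_amount
from {{ ref('f') }}
group by category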
The initial 7 nodes have already been reduced to 5 that need to be checked for data impact (B, D, E, F, G). Now, by inspecting the SQL changes at the column level, we can reduce that number even further.
Narrowing the scope further with column-level lineage
Breaking and non-breaking changes are easy to classify but, when it comes to inspecting partial-breaking changes, the models need to be analyzed at the column level.
Let’s take a closer look at the partial-breaking change in model B, in which the logic of column c1 has been modified. This change could potentially result in 4 impacted downstream nodes: D, E, K, and J. After tracking column usage downstream, this subset can be further reduced.

Following column B.c1 downstream, we can see that:
- B.c1 → D.c1 is a column-to-column (projection) dependency.
- D.c1 → E is a model-to-column dependency.
- D → K is a model-to-model dependency. However, as D.c1 is not used in K, this model can be eliminated.
Therefore, the models that need to be validated in this branch are B, D, and E. Together with the breaking change F and its downstream G, the full set of models to be validated in this diagram is F, G, B, D, and E, or just 5 out of a total of 9 potentially impacted models.
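To make the column-level elimination concrete, here is a hypothetical sketch of what these models might look like (the id column is an assumption added for illustration):

-- model d: projects c1 from b (column-to-column), so it sits inside the impact radius
select
    id,
    c1
from {{ ref('b') }}

-- model e: filters on d.c1 (model-to-column), so it must also be checked
select
    id
from {{ ref('d') }}
where c1 > 0

-- model k: selects from d but never references d.c1, so it can be eliminated
select
    id
from {{ ref('d') }}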
Conclusion
Data validation after a model change is difficult, especially in large and complex DAGs. It’s easy to miss silent errors, and performing validation becomes a daunting task, with data models often feeling like black boxes when it comes to downstream impact.
A structured and repeatable process
By using this change-aware data validation technique, you can bring structure and precision to the review process, making it systematic and repeatable. This reduces the number of models that need to be checked, simplifies the review process, and lowers costs by only validating models that actually require it.
Before you go…
Dave is a senior technical advocate at Recce, where we’re building a toolkit to enable advanced data validation workflows. He’s always happy to chat about SQL, data engineering, or helping teams navigate their data validation challenges. Connect with Dave on LinkedIn.
Research for this article was made possible by my colleague Chen En Lu (Popcorny).