A Complete Guide to Effectively Scale your Data Pipelines and Data Products with Contract Testing and dbt

admin

October 25, 2023

First, we need to add two dbt packages, dbt-expectations and dbt-utils, which allow us to make assertions on the schema of our sources and on their accepted values.

# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
  - package: calogica/dbt_expectations
    version: 0.8.5

Testing the data sources

Let’s start by defining a contract test for our first source. We pull data from raw_height, a table that contains height information from the users of the gym app.

We agree with our data producers that we will receive the height measurement, the units for the measurements, and the user ID. We also agree on the data types and that only ‘cm’ and ‘inches’ are supported as units. With all this, we can define our first contract in the dbt source YAML file.
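As a rough illustration, a source contract along these lines could look like the sketch below. Only the table name raw_height, the accepted units, and the contract-test-source tag come from the text; the file path, the source name gym_app, the column names, and the column types are assumptions made for the example, and the type strings in particular depend on your warehouse.

```yaml
# models/staging/gym_app/_gym_app__sources.yml  (hypothetical path)
version: 2

sources:
  - name: gym_app                # assumed source name
    tables:
      - name: raw_height
        columns:
          - name: user_id        # assumed column name
            tests:
              - not_null:
                  config:
                    tags: ["contract-test-source"]
              - dbt_expectations.expect_column_values_to_be_of_type:
                  column_type: integer        # assumed, warehouse-specific
                  config:
                    tags: ["contract-test-source"]
          - name: height         # assumed column name
            tests:
              - not_null:
                  config:
                    tags: ["contract-test-source"]
              - dbt_expectations.expect_column_values_to_be_of_type:
                  column_type: float          # assumed, warehouse-specific
                  config:
                    tags: ["contract-test-source"]
              # The measurement must not be lower than 0
              - dbt_utils.accepted_range:
                  min_value: 0
                  config:
                    tags: ["contract-test-source"]
          - name: height_unit    # assumed column name
            tests:
              # Only the two units agreed with the producers are allowed
              - accepted_values:
                  values: ["cm", "inches"]
                  config:
                    tags: ["contract-test-source"]
```

Tagging each test explicitly keeps the selection unambiguous when the suite is later run with a tag selector.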

The building blocks

Looking at the previous test, we can see several assertions from dbt-expectations, dbt-utils, and dbt itself in use:

dbt_expectations.expect_column_values_to_be_of_type: This assertion allows us to define the expected column data type.
accepted_values: This assertion allows us to define a list of the accepted values for a specific column.
dbt_utils.accepted_range: This assertion allows us to define a numerical range for a given column. In the example, we expect the column’s value to be no lower than 0.
not_null: Finally, built-in assertions like not_null allow us to define column constraints.

Using these building blocks, we added several tests to define the contract expectations described above. Notice also how we have tagged the tests as “contract-test-source”. This tag allows us to run all contract tests in isolation, both locally and, as we’ll see later, in the CI/CD pipeline:

dbt test --select tag:contract-test-source

We have now seen how quickly we can create contract tests for the sources of our dbt app, but what about the public interfaces of our data pipeline or data product?

As data producers, we want to make sure we are producing data according to the expectations of our data consumers, so that we satisfy the contract we have with them and make our data pipeline or data product trustworthy and reliable.

A simple way to make sure we are meeting our obligations to our data consumers is to add contract testing for our public interfaces.

dbt recently released a new feature for SQL models, model contracts, that lets us define the contract for a dbt model. While building your model, dbt verifies that the model’s transformation produces a dataset matching its contract; otherwise, the build fails.

Let’s see it in action. Our mart, body_mass_indexes, produces a BMI metric from the weight and height measurement data we get from our sources. The contract with our consumers establishes the following:

Data types for every column.
User IDs cannot be null.
User IDs are always greater than 0.

Let’s define the contract of the body_mass_indexes model using dbt model contracts:
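A minimal sketch of such a model specification is shown below. The model name, the enforced contract, and the two user_id constraints follow the text; the file name, the remaining column names, and all data types are assumptions chosen for illustration. Note that an enforced contract requires every column the model returns to be declared with a data_type.

```yaml
# models/marts/body_mass_indexes.yml  (hypothetical file name)
version: 2

models:
  - name: body_mass_indexes
    config:
      contract:
        enforced: true           # dbt checks the contract on every build
    columns:
      - name: user_id
        data_type: int           # assumed type
        constraints:
          - type: not_null
          # Custom expression: user IDs must always be greater than 0
          - type: check
            expression: "user_id > 0"
      - name: created_date       # assumed column name and type
        data_type: date
      - name: weight             # assumed column name and type
        data_type: float
      - name: height             # assumed column name and type
        data_type: float
      - name: bmi                # assumed column name and type
        data_type: float
```

Whether a given constraint type is actually enforced depends on the data platform, so it is worth checking what your warehouse supports before relying on it.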

The building blocks

Looking at the previous model specification file, we can see several pieces of metadata that allow us to define the contract.

contract.enforced: This configuration tells dbt that we want to enforce the contract every time the model is run.
data_type: This assertion allows us to define the column type we expect to produce once the model runs.
constraints: Finally, the constraints block lets us define useful constraints, such as requiring that a column is not null, setting primary keys, and adding custom expressions. In the example above, we defined a constraint to tell dbt that the user_id must always be greater than 0. The dbt documentation lists all the available constraints.

A difference between the contract tests we defined for our sources and those defined for our marts or output ports is when the contracts are verified and enforced.

Model contracts are enforced while the model is being built by dbt run, whereas contracts based on dbt tests are enforced when the dbt tests run.

If one of the model contracts is not satisfied, you will see an error when you execute ‘dbt run’, with specific details on the failure. You can see an example in the following dbt run console output.

1 of 4 START sql table model dbt_testing_example.stg_gym_app__height ........... [RUN]
2 of 4 START sql table model dbt_testing_example.stg_gym_app__weight ........... [RUN]
2 of 4 OK created sql table model dbt_testing_example.stg_gym_app__weight ...... [SELECT 4 in 0.88s]
1 of 4 OK created sql table model dbt_testing_example.stg_gym_app__height ...... [SELECT 4 in 0.92s]
3 of 4 START sql table model dbt_testing_example.int_weight_measurements_with_latest_height  [RUN]
3 of 4 OK created sql table model dbt_testing_example.int_weight_measurements_with_latest_height  [SELECT 4 in 0.96s]
4 of 4 START sql table model dbt_testing_example.body_mass_indexes ............. [RUN]
4 of 4 ERROR creating sql table model dbt_testing_example.body_mass_indexes .... [ERROR in 0.77s]
Finished running 4 table models in 0 hours 0 minutes and 6.28 seconds (6.28s).
Completed with 1 error and 0 warnings:
Database Error in model body_mass_indexes (models/marts/body_mass_indexes.sql)
new row for relation "body_mass_indexes__dbt_tmp" violates check constraint "body_mass_indexes__dbt_tmp_user_id_check1"
DETAIL:  Failing row contains (1, 2009-07-01, 82.5, null, null).
compiled Code at target/run/dbt_testing_example/models/marts/body_mass_indexes.sql

We now have a suite of powerful contract tests, but how and when do we run them?

We can run contract tests in two types of pipelines:

CI/CD pipelines
Data pipelines

For example, you can execute the source contract tests on a schedule in a CI/CD pipeline that targets the data sources available in lower environments such as test or staging. You can set the pipeline to fail every time the contract is not met.

These failures provide valuable information about contract-breaking changes introduced by other teams before those changes reach production.
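To make the scheduled run concrete, here is one possible sketch using a GitHub Actions workflow. The article does not name a CI system, so the choice of GitHub Actions, the workflow file name, the cron cadence, the dbt-postgres adapter, and the staging target name are all assumptions; the sketch also assumes a profiles.yml with a staging target and credentials is made available to the job.

```yaml
# .github/workflows/source-contract-tests.yml  (hypothetical file)
name: source-contract-tests

on:
  schedule:
    - cron: "0 6 * * *"          # assumed cadence: once a day at 06:00 UTC
  workflow_dispatch: {}          # also allow manual runs

jobs:
  contract-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      # Adapter is an assumption; install the one matching your warehouse
      - run: pip install dbt-postgres
      # Installs dbt_utils and dbt_expectations from packages.yml
      - run: dbt deps
      # dbt exits with a non-zero code when a test fails,
      # which fails the job and surfaces the broken contract
      - run: dbt test --select tag:contract-test-source --target staging
```

Because the job only selects tests tagged contract-test-source, it stays fast and reports only contract violations rather than every data quality issue in the project.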
