Many data teams deal with “alert fatigue”: an overload of data test failures that makes it difficult to separate signal from noise. Not only can this overwhelm a data team’s resources and mask potentially critical data quality issues, but it can also slow down your overall pipeline performance and run up compute costs in your warehouse.
This article provides an overview of a basic data quality program for a modern data stack, identifies some common issues that data teams run into when trying to employ data quality initiatives (especially with dbt), and recommends some solutions to overcome these challenges.
Overview of basic data quality testing
Before we get into common problems and some possible solutions, it is helpful to define some of the basic testing capabilities of dbt, packages, and other open-source solutions that pair well with dbt.
Assertion tests
dbt assertion tests run SQL to assert that a dataset meets specific conditions (e.g. that a column is not_null or unique, or that it satisfies some other condition expressed as custom SQL).
- Singular tests: a custom-written SQL query that tests for a specific condition. The query is written such that, if any rows are returned, the test fails.
- Generic tests: out-of-the-box tests that ship with dbt, such as not_null, unique, relationships, and accepted_values. You can also write your own generic tests using Jinja. A generic test works like a singular test, except that it accepts arguments rather than hard-coded values, making it reusable across models and columns. Both flavours are sketched below.
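For illustration, here is a minimal sketch of both flavours in a dbt project; the orders model, column names, and accepted values are hypothetical:

```yaml
# models/schema.yml: generic tests applied to a hypothetical orders model
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']
```

```sql
-- tests/assert_no_negative_order_totals.sql: a singular test
-- The query returns the violating rows, so the test fails if any rows come back.
select
    order_id,
    total_amount
from {{ ref('orders') }}
where total_amount < 0
```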
Unit Testing
As in software engineering, a unit test verifies a specific piece of logic in a dbt project, typically by confirming that a model’s output matches expected values for a given input. For example, for a monthly recurring revenue model, a given set of transactions as input should produce the expected MRR value. dbt does not currently support unit testing natively, but native support is being worked on and is expected in a future release (1.8). In the meantime, a few packages support unit testing, or you can build it yourself. Unit tests are a useful part of CI checks, ensuring that changes to existing logic do not break the expected output of a transformation.
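As a rough sketch of what this might look like, the proposed dbt 1.8 syntax declares unit tests in YAML; the model name, columns, and values below are purely illustrative:

```yaml
# models/schema.yml: a unit test for a hypothetical mrr model (proposed dbt 1.8 syntax)
unit_tests:
  - name: test_mrr_from_transactions
    model: mrr
    given:
      - input: ref('transactions')
        rows:
          - {customer_id: 1, amount: 300, billed_at: '2024-01-01', term_months: 3}
    expect:
      rows:
        - {customer_id: 1, month: '2024-01-01', mrr: 100}
```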
Data Diffing
Data diffing was developed by Datafold and aims to catch problems in your development work before they make it into production. It rests on the assumption that the output of your data transformations should remain relatively consistent (or at least that changes should be part of your QA process), and is akin to diffing code in Git. You can compare the output of changes you have made to a model in development with the output of that model in production, and get an understanding of the changed rows and columns. If you have ever used EXCEPT or MINUS to compare data output between production and development environments, then you should really check out Datafold (they have an open-source tool that lets developers data-diff locally).
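For context, the manual version of this is the kind of query shown below: a naive diff between production and development builds of the same model using EXCEPT (the schema and model names are illustrative):

```sql
-- Rows present in the production build but missing (or changed) in the development build
select * from analytics_prod.dim_customers
except
select * from analytics_dev.dim_customers
```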
Additional testing packages
To get more advanced than the basics mentioned above, additional packages and open-source software can help.
dbt_utils
dbt_utils includes additional generic tests that extend beyond the core four (not_null, unique, relationships, accepted_values). Some of my favourites are not_null_proportion and not_accepted_values.
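For example, those two tests might be applied like so; the model, columns, thresholds, and values are hypothetical:

```yaml
# models/schema.yml: generic tests from the dbt_utils package
version: 2

models:
  - name: customers
    columns:
      - name: email
        tests:
          - dbt_utils.not_null_proportion:
              at_least: 0.95   # tolerate a small share of nulls
      - name: plan_tier
        tests:
          - dbt_utils.not_accepted_values:
              values: ['legacy_plan']   # fail if a deprecated value reappears
```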
dbt_expectations
Based on Great Expectations, dbt_expectations is a package that contains a long list of assertion tests to help you verify the overall quality and state of your data sources: https://hub.getdbt.com/calogica/dbt_expectations/latest
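A couple of representative tests from the package, applied to hypothetical models and columns:

```yaml
# models/schema.yml: assertion tests from dbt_expectations
models:
  - name: fct_orders
    tests:
      - dbt_expectations.expect_table_row_count_to_be_between:
          min_value: 1          # the table should never be empty
    columns:
      - name: order_total
        tests:
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0      # order totals should never be negative
```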
Datafold
As mentioned above, Datafold enables developers to test the impact of their code changes on the output of models, while still in development. This can help reduce reactive data quality issues by catching them before they make it to production and improve overall trust in your data. Datafold Cloud allows you to data-diff with your team and add the output of data-diffing to your PRs (which can help significantly speed up the PR process).
Common problems with scaling data quality initiatives
Given all the information above, managing a data quality program can be a large task and one that changes as your data infrastructure matures. Next, we will go through some common challenges that data teams run into over time.
Balancing coverage and redundancy
When starting out with dbt tests, it is important to develop testing conventions for how and when you apply certain types of tests. As an example, here are the testing conventions used by a previous team I worked with:
Sources
- Add generic tests for every source column that is testable.
- The primary key source column must have not_null and unique generic tests.
- All boolean columns must have an accepted_values schema test. The accepted values are true and false.
- Columns that contain category values must have an accepted_values schema test.
- Columns that should never be null must have a not_null schema test.
- Columns that should be unique must have a unique schema test.
Models
- The primary key column must have not_null and unique schema tests.
- All boolean columns must have an accepted_values schema test. The accepted values are true and false.
- Columns that contain category values must have an accepted_values schema test.
- Columns that should never be null must have a not_null schema test.
- Columns that should be unique must have a unique schema test.
- Where possible, use schema tests from the dbt_utils or dbt_expectations packages to perform extra verification.
While these conventions may look reasonable at first glance, they start to become a problem once a dbt project begins to scale. Testing the same concepts in the source and again in models at the staging, intermediate, and core layers may result in as many as 4 test failures for an issue that can be caught with one test at the source. Developing more advanced testing conventions (e.g. only testing the same concept downstream again if there is a JOIN in the model) will help reduce overly redundant tests and reduce testing costs.
Over-reliance on assertion tests
When many data teams start out, they may overly rely on assertion tests to test all types of concepts and data quality issues, rather than seeking out other methods (such as unit testing or data-diffing) to appropriately test for potential issues and edge cases. Using a breadth of testing strategies that align to the appropriate purpose is critical.
Lack of ownership and prioritization
Similar to the broken windows theory, if you have many failing tests and no organized way of prioritizing or managing test failures, then test failures become unimportant and critical data quality issues can go undetected or unresolved. This can be overcome by:
- Assigning ownership of tests to specific individuals (e.g. by department area of focus, by model owner, or other methods) using dbt meta tags.
- Setting priority levels for different types of tests, also using dbt meta tags (a sketch follows this list). For example, a unique test failure on the primary key of a critical financial model could indicate fan-out and inaccurately reported revenue, so it probably deserves a P0 tag.
- Centralizing and triaging alerts: using tools like Elementary (or custom alerting built on a Slack webhook), you can send test failures to a single Slack channel.
- Leveraging a “fire team” rotation, where someone on your team is always on call to quickly triage and deal with issues as they arise.
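A minimal sketch of ownership and priority metadata attached to a test via its config block; the meta keys and values here are conventions you define yourself, not reserved dbt fields:

```yaml
# models/schema.yml: tagging a critical test with an owner and a priority
models:
  - name: fct_revenue
    columns:
      - name: transaction_id
        tests:
          - unique:
              config:
                severity: error
                meta:
                  owner: "finance-analytics"
                  priority: "P0"
```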
Flaky tests
Flaky tests are those that fail but are generally not a critical issue or resolve themselves on a subsequent run (e.g. not_null failures caused by latency or build order issues). These issues can sometimes be hard to solve, but monitoring test results over time can help you identify flaky tests that are frequently failing and either look for better ways to test that concept or alter model materializations to avoid these issues.
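One common mitigation, sketched below, is to soften a known-flaky test with dbt’s test config so that small, transient violations only warn while large volumes still fail the build; the model, column, and thresholds are illustrative:

```yaml
# models/schema.yml: softening a flaky not_null test
models:
  - name: stg_events
    columns:
      - name: user_id
        tests:
          - not_null:
              config:
                severity: error
                warn_if: ">0"       # a handful of late-arriving nulls only warns
                error_if: ">1000"   # escalate to a failure for large volumes
```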
Test bloat and cost
Assertion tests can be slow and expensive to run, especially on very large datasets, so overusing them can cause project build times and associated costs to balloon. This slows down team development and CI checks, and can cause other snowballing issues. To combat this, look for ways to reduce redundant tests (remove downstream tests that frequently fail together with their upstream counterparts, and test at the source rather than retesting downstream unless critically necessary). For very large datasets, tests can also be scoped with a WHERE config so that, for example, only more recent data is tested.
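For example, an expensive test on a large model might be limited to recent rows with the where config; the column, window, and Snowflake-flavoured dateadd below are illustrative:

```yaml
# models/schema.yml: limiting an expensive test to recent rows
models:
  - name: fct_pageviews
    columns:
      - name: pageview_id
        tests:
          - unique:
              config:
                where: "event_date >= dateadd('day', -3, current_date)"
```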
Building a comprehensive data quality program
Measure and optimize your efforts
Building a comprehensive data quality program can be a challenging task, but it is critically important for driving trust and accuracy in your data. Measuring and monitoring the overall health of your program, and its trends over time (including the number of tests, test failure rate, cost of testing, and more), is very useful for optimizing it. Avoid over-alerting by removing redundant tests, tuning overly noisy or flaky tests, and continuously working to separate signal from noise in test failures. This will also help contain testing costs and avoid slowing down your dbt project with test bloat.
Lean into proactive versus reactive data quality efforts
The majority of data testing focuses on dealing with issues after they have happened (reactive tests). However, catching data quality issues before they even make it to production generates many benefits, such as driving higher trust with your stakeholders, reducing the risk associated with inaccurate reporting, and reducing resources associated with resolving reactive data quality issues.
Organize, prioritize, and drive accountability
Pushing test failure alerts to a centralized Slack channel for team visibility (Elementary can help you set this up), leveraging automation to open test failure issues in your project management tool, and leveraging a “fire team” or on-call support rotation to take accountability for triaging incoming issues will all help ensure that data quality issues are managed in a timely manner. Assigning priority and ownership of specific tests will help ensure your triage workflow is easily actioned and high-priority issues are dealt with first.
Additional Resources
Get Started with Secoda
Secoda is the only all-in-one data management platform that helps data teams search, catalog, and monitor their data at scale. Kind of like a Google for data.
We work with companies like Vanta, Clover, Remitly, and Cardinal Health to give them visibility into the health of their entire stack, reduce costs, and help their data teams run more efficiently.
Interested in learning more? Create a free account here