The freshness of our data gives us an edge in making informed decisions. And since every second counts in the world of data, it's crucial to keep our data up-to-date. In this tutorial, we'll be learning how to use the dbt source freshness command to ensure our source tables are updated within a specified time frame.
What Is dbt Source Freshness?
dbt source freshness is a command in the data build tool (dbt) that checks if source tables have been updated within a specified time frame. It's a subcommand of the dbt source command, which provides subcommands for working with source data. The command compares the timestamp column in your source to a time cadence you set, such as a minute, hour, or day. If the data is older than the time the test is run minus the cadence, the test will either fail or warn you.
How To Use dbt Source Freshness
To use the dbt source freshness command effectively, follow these steps:
1. Define Your Sources
First, you need to define your sources in a YAML file. This file should include the source tables you want to check for freshness. For example:
sources:
- name: my_source
tables:
- name: my_table
loaded_at_field: _etl_loaded_at
In this example, we've defined a source named "my_source" with a table named "my_table" and specified the loaded_at_field as "_etl_loaded_at".
2. Set Freshness Criteria
Next, you need to set the freshness criteria for your source tables. This is done by defining a "freshness" block in your sources YAML file. For example:
sources:
- name: my_source
tables:
- name: my_table
loaded_at_field: _etl_loaded_at
freshness:
warn_after: {count: 3, period: day}
error_after: {count: 5, period: day}
In this example, we've set a warning threshold of 3 days and an error threshold of 5 days. If the data is older than 3 days, a warning will be issued, and if it's older than 5 days, an error will be raised.
3. Run dbt Source Freshness
Once you've defined your sources and set the freshness criteria, you can run the dbt source freshness command:
dbt source freshness
This command will compare the timestamp column in your source to the time cadence you set and either pass, warn, or fail based on the freshness criteria.
Common Challenges and Solutions
While using dbt source freshness, you may encounter some common challenges:
- Incorrectly defined sources or freshness criteria: Ensure your sources and freshness criteria are defined correctly in the YAML file.
- Timestamp issues: Make sure your timestamp column is in the correct format and timezone.
- Connectivity issues: Verify that your dbt project can connect to the source data system.
What are Best Practices for dbt Source Freshness?
To make the most of dbt source freshness, follow these best practices:
- Run freshness tests with at least double the frequency of your lowest SLA.
- Use the dbt_utils.recency test to verify that data in downstream tables is being updated.
- Monitor and address warnings and errors promptly to ensure data quality.