Learn cost-effective data transformation techniques for optimizing storage, query performance, and modeling strategies. Discover trade-offs like data freshness and quality while exploring tools like dbt and Secoda to streamline processes and reduce cloud data warehouse costs.
You've probably used an interactive dashboard in your organization before—one click to drill down on a product category, two clicks to change the time range, etc. What many don't realize is the amount of work that goes into getting the data out of the source system and serving it in such a way that it can be used in a dashboard. Often, billions of rows from a table are transformed daily, involving filtering, cleaning, aggregating, and joining them with other rows.
When transforming data, as is the case with many things in life, there are multiple ways to achieve the same result. However, while the result may be the same, the speed and storage requirements are not. In the world of cloud computing, every row and column you store or process has a cost. Additionally, cost-effective data transformation usually means faster query times.
To prevent unexpected costs, proper data transformation practices are essential. In this article, you'll learn about a few practices to help you cost-effectively transform data.
Why organizations transform data
For analytical purposes, organizations don't rely on transactional (OLTP) databases, which are optimized for inserting and updating records. Instead, they use online analytical processing (OLAP) systems that store data in columns, which is excellent for aggregating and grouping data. OLAP systems can also parallelize read operations across billions of records, and their processing power and storage are decoupled, allowing for near-infinite scaling.
Most organizations run their OLAP stack in the cloud, where pricing is usually very flexible. For example, BigQuery charges for the number of bytes scanned in a query, Snowflake uses a credit-based system, and Databricks bills in Databricks Units (DBUs). Reducing the processing power a data transformation requires is therefore the single most important lever for cost-effectively scaling a data warehouse.
Besides reducing the required processing power, there are various other techniques for transforming data cost-effectively, but they come with trade-offs. For example, you could trade off data freshness (i.e., how quickly the data becomes available). Processing the data less often reduces costs, but it could create unwanted delays in business-critical data applications. Trading off data quality is also an option: an analytics engineer could reduce the number of tests they run on a table with each transformation.
Another solution would be to prioritize business-critical data applications and run data models for less-critical applications on demand. However, this creates unwanted overhead for managing priorities, and you're giving up timeliness—when an end user wants to consult the application, they'll have to wait for the data pipeline to finish, which can take minutes, if not hours.
Techniques to cost-effectively transform data
Various techniques and architectural choices can impact how cost-effectively you transform your data. You can classify these techniques into two broad categories: those that come with trade-offs, and those that don't.
Optimization without trade-offs
To optimize data transformation without sacrificing data freshness or quality, you need to store the data efficiently and write efficient SQL queries. While most modern data warehouses preprocess your query to produce the most efficient execution plan, there are still measures that you can take to optimize your queries.
Optimize how you store data
The cost-effectiveness of data transformations is largely determined by how data is stored.
Consider other storage formats
As described earlier, most data warehouse technologies have separate computing and storage capabilities. This decoupling opens up opportunities. For example, Google BigQuery offers BigLake, which allows users to query CSV, JSON, Parquet, and Avro files from within BigQuery. While there are valid reasons to do this (such as preventing lock-in, having an easy data landing zone, and so on), it's not the most cost-effective option because these files are not optimized for big data workloads. If you want to reduce data transformation costs, using your data warehouse's proprietary storage technology or working with an open table format are options worth exploring.
Partition your tables
Another popular technique to optimize data storage is partitioning. For files or open table formats, this is usually achieved via Hive-style partitioning. However, most data warehouse vendors also offer some kind of partitioning on their internal storage technologies. Partitioning saves costs by eliminating the need to query entire tables during data transformation. By specifying a date range, only a fraction of the data is scanned and processed.
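As a rough sketch, here's what date partitioning could look like in a dbt model running on BigQuery. The `fct_orders` and `stg_orders` model names and columns are hypothetical, and the exact `partition_by` options depend on your adapter:

```
-- models/fct_orders.sql: a hypothetical daily-partitioned fact table (dbt on BigQuery)
{{ config(
    materialized='table',
    partition_by={'field': 'order_date', 'data_type': 'date'}
) }}

select
    order_id,
    order_date,
    amount
from {{ ref('stg_orders') }}

-- A downstream transformation that filters on the partition column
-- only scans the matching partitions instead of the whole table, for example:
-- select sum(amount) from {{ ref('fct_orders') }} where order_date >= '2024-01-01'
```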
Choose the right data types
There are also benefits to choosing appropriate data types. For example, there's no need to store Boolean values (true/false) as strings or integers. Doing so takes up more storage and requires more processing power every time the column is scanned.
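As an illustration, with a hypothetical staging model and column names (the exact cast syntax varies by warehouse), casting flags and dates to proper types once at the staging layer keeps every downstream scan smaller:

```
-- models/stg_orders.sql: store the flag as a boolean, not as the string 'true'/'false'
select
    order_id,
    cast(is_returned as boolean) as is_returned,
    cast(order_date as date) as order_date
from {{ ref('raw_orders') }}
```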
Denormalize your data
Although there are good reasons for using data modeling frameworks like data vault, they aren't very cost-effective in the era of cloud data warehouses. They require a lot of JOINs, which forces data processing engines to scan a lot of data. If you instead model your data with a simpler framework (like the star schema), you only need a few JOINs and will most likely end up with much cheaper data transformation queries.
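For instance, answering a common business question against a star schema typically needs only a single join between a fact and a dimension, whereas a data vault model would need a chain of hub, link, and satellite joins for the same answer. The model and column names below are hypothetical:

```
-- Revenue per country: one join on a star schema
select
    c.country,
    sum(f.amount) as revenue
from {{ ref('fct_orders') }} as f
join {{ ref('dim_customers') }} as c
    on f.customer_key = c.customer_key
group by c.country
```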
Don't use SELECT *
Most data warehouse technologies are column-oriented, which means they store the values of a single column together rather than complete rows. When you use `SELECT *` in your data transformation queries, the processing engine scans and processes a _lot_ of unnecessary columns. Instead, specify only the columns you actually need for the transformation.
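A minimal before/after, using a hypothetical staging model:

```
-- Scans every column of the (wide) source model:
-- select * from {{ ref('stg_orders') }}

-- Scans only the three columns the transformation actually needs:
select
    order_id,
    order_date,
    amount
from {{ ref('stg_orders') }}
```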
Optimization with trade-offs
The optimization techniques discussed above don't involve any trade-offs, but there are also various techniques for improving the cost-effectiveness of your data transformations that do involve a trade-off. Data could be less fresh, be of lower quality, or take a long time to load. In the following sections, you'll learn about techniques like materializations, model frequency, model priority, and testing, which all involve a trade-off.
Adjust data materialization
One of the most popular data modeling tools is [data build tool](https://www.getdbt.com/) (dbt). It quickly rose to prominence in 2020, when it was still just an open source command line tool, as data engineers discovered the intuitiveness and familiarity of its building blocks: SQL and Jinja. Fast-forward a couple of years, and dbt Labs has made several acquisitions, offers a SaaS plan, and is the centerpiece of the data stacks of some of the largest organizations in the world.
dbt offers five ways of materializing data: views, full table loads, ephemeral, incremental loads, and materialized views. Let's briefly discuss what they are and list their advantages and disadvantages when it comes to cost-effectiveness, data freshness, and access speed.
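For reference, a model's materialization is set with a `config` block at the top of its SQL file (the model and columns below are a hypothetical example):

```
-- Pick one of: 'view', 'table', 'ephemeral', 'incremental', 'materialized_view'
{{ config(materialized='view') }}

select
    order_id,
    lower(order_status) as order_status
from {{ ref('stg_orders') }}
```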
Views
Views do not store any new data; they simply hold the query that produces the resulting table. When they're accessed, they run their query and output the requested data.
Advantages:
+ Data freshness: They always produce the latest data.
Disadvantages:
- Access speed: They could take a long time to produce the requested data.
- Cost: They require a lot of computing power because the underlying query is rerun every time the view is queried.
You should use views for simple operations like renaming columns, adding constants, and simple aggregations. Don't use them for complex queries, such as intricate regular expression filters, many joins, fuzzy joins, and so on.
Full table loads
When a model is materialized as a table, dbt performs a full reload of the data on every run. Every time the table is accessed, it simply returns the data it holds; it does not recalculate anything to pick up the most recent source data.
Advantages:
+ Access speed: They are extremely fast when accessed.
+ Cost: They only require computing power for their initial calculation. Reading the resulting table afterward is about the simplest operation a warehouse can perform.
Disadvantages:
- Data freshness: They do not hold the latest data.
You should use tables instead of views for materializing complex queries on small to medium-sized data. Otherwise, downstream dependencies could take a very long time to load. A second use case is producing tables that need to be consumed by BI tools, which require instant loading of the data.
Ephemeral
Ephemeral materializations do not result in anything tangible; they don't even show up in the data warehouse that dbt is connected to. Instead, their query is injected as a common table expression (CTE) into any downstream model that uses them.
Advantages:
+ Data freshness: Like views, they always produce the latest data when accessed by downstream models.
Disadvantages:
- Access speed: When accessed by downstream models, they take just as much time as a view to calculate.
- Cost-effectiveness: Like views, they require a lot of computing power because the underlying query is recalculated every time a downstream model runs.
Use ephemeral materializations in upstream models and only for simple calculations.
Please note: You can't use them in every feature of dbt. For example, macros do not support ephemeral materializations.
Incremental loads
An incremental table materialization comes in two flavors: append and upsert (discussed below). Neither flavor overwrites the whole table; instead, they add to or update the existing table. This materialization has the same (dis)advantages as tables, but it's better suited for very large data sets.
Advantages:
+ Access speed: They are extremely fast when accessed.
+ Cost: They only require computing power to process the newly arrived data on each run. Reading the resulting table afterward is about the simplest operation a warehouse can perform.
Disadvantages:
- Data freshness: They do not hold the latest data.
Incremental loads should be used in the same situations as tables, and they're especially worth it for very large tables that are consumed frequently.
Append
An incremental table materialization that uses the append strategy simply adds new records to the bottom of a table. This often produces duplicate records, but that should not be an issue if you're aware of it. By interpreting your data as snapshots and partitioning it properly, appending is a valid strategy. However, it does come with an extra storage cost.
Upsert
An incremental table materialization that uses the upsert strategy not only adds new records but also updates existing records by comparing key columns. This does not produce duplicate records and does not require snapshot partitioning. However, because the old data and the incoming data need to be compared, it does come with an extra computing cost.
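Here's a sketch of an incremental model in dbt, with hypothetical model and column names. With a `unique_key`, most adapters upsert the new rows via a merge; leave it out, or set `incremental_strategy='append'`, to get the append flavor:

```
{{ config(
    materialized='incremental',
    unique_key='order_id'
) }}

select
    order_id,
    order_date,
    order_status,
    amount
from {{ ref('stg_orders') }}

{% if is_incremental() %}
-- On incremental runs, only process rows that arrived since the last run
where order_date > (select max(order_date) from {{ this }})
{% endif %}
```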
Materialized views
Materialized views are like incremental tables, but instead of having your incremental strategy defined within dbt, you hand over the strategy to the data warehouse. While you could replace all your incremental tables with materialized views, you don't have any control over the settings that define the incremental loading of data. What's more, all data warehouse technologies interpret the concept of materialized views slightly differently, which is why this section doesn't give an overview of (dis)advantages. For example, [this article](https://medium.com/@borislitvak/snowflake-vs-bigquery-in-depth-49b48742dc66) describes the differences between Snowflake and BigQuery.
Materialized views are an excellent choice for serving data that requires high freshness. A typical example is a real-time dashboard polling for changes: if you served that data as a view, every poll would trigger another execution of the underlying query, while a table would require a full reload every time you want fresher data. A materialized view sits in between, with the warehouse adding new data on a predefined refresh schedule.
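On adapters and dbt versions (1.6+) that support it, switching is often just a change of materialization, though the refresh settings themselves remain warehouse-specific. The model and columns below are hypothetical:

```
-- The warehouse, not dbt, decides how to keep this result incrementally up to date
{{ config(materialized='materialized_view') }}

select
    order_date,
    count(order_id) as n_orders,
    sum(amount) as revenue
from {{ ref('stg_orders') }}
group by order_date
```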
Run models less frequently
If you use dbt Cloud, you can deploy your models as scheduled jobs. If you use dbt Core, you're probably using some kind of orchestration tool (like [Airflow](https://airflow.apache.org/), [Prefect](https://www.prefect.io/), or [Dagster](https://dagster.io/)). Cost-effectively transforming data could be as simple as lowering the frequency with which you run your models. If your source data arrives with a twenty-four-hour frequency, there's no need to schedule your model to calculate every hour. The reverse is also true. If the end users consult their data products once in the afternoon, there's no need to update your model more often, even if the source data arrives at a high frequency.
* In dbt Cloud, you can simply lower the job frequency in the [job scheduler UI](https://docs.getdbt.com/docs/deploy/job-scheduler).
* In Airflow, you'd adjust the [`schedule_interval` argument](https://airflow.apache.org/docs/apache-airflow/1.10.1/scheduler.html) of your job.
* In Prefect, you would adjust the [`interval` argument](https://docs.prefect.io/3.0/automate/add-schedules#create-schedules-in-python) of the `serve` or `deploy` method.
* In Dagster, you'd adjust the [`cron_schedule` argument](https://docs.dagster.io/concepts/automation/schedules/examples) of your deployment.
Prioritize models
By tagging your dbt models adequately, you can set the priority of your models. For example, Secoda, an all-in-one data search, catalog, lineage, monitoring, and governance platform, can tag your most and least used models, and you can filter on these tags in your scheduling tools. Astronomer (a managed Airflow service) has an excellent article on selecting and excluding dbt models by filtering their tags.
In its simplest form, this would give you two tags, `low_priority` and `high_priority`, and two triggers in your job schedule: an hourly and a daily one. The hourly trigger would select and execute all dbt models tagged `high_priority`, while the daily trigger would run the models tagged `low_priority`, as sketched below.
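A sketch of what that could look like: tag each model in its `config` block and let each scheduled job select on the tag. The model name and tag values are hypothetical:

```
-- models/fct_orders.sql: a business-critical model
{{ config(tags=['high_priority']) }}

select
    order_id,
    order_date,
    amount
from {{ ref('stg_orders') }}

-- Hourly job:  dbt run --select tag:high_priority
-- Daily job:   dbt run --select tag:low_priority
```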
Optimize testing
dbt supports two types of testing: data tests and unit tests.
dbt Data Tests
Data tests run validation queries on the underlying data. For instance, you could count the number of rows, the number of distinct values, or the number of rows matching a specific filter. A singular data test fails when its query returns one or more rows. For example, the following test returns a row whenever any transactions have been completed on a Sunday; for a store that closes on Sundays, that would indicate something is wrong:
```
select
    count(order_id) as n_orders
from {{ ref('transaction_lines') }}
where day_of_week = 'sunday'
having count(order_id) > 0
```
As your data grows, data testing could quickly become an expensive endeavor, as a lot of data would need to be processed.
dbt Unit Tests
In contrast to data tests, unit tests do not execute queries on the underlying data. Instead, you specify a list of examples, each with an input and an expected result, and run those through your query. With unit tests, you only process a tiny amount of data, which is much cheaper in terms of processing power. However, you're likely to give up some data quality: there's always a chance that the examples in your unit tests are incomplete and you haven't caught the unknown unknowns. They're also quite verbose and take more work to define.
Unit tests are not confined to dbt or modeling tools. They can also be included in your data orchestration tools.
To learn more:
* dbt has excellent examples in its docs.
* Airflow has unit tests too.
* Prefect also has documentation on unit testing.
* Dagster's documentation covers unit testing as well.
Conclusion
In this article, you learned about several techniques for cost-effectively transforming data. These techniques were split into ones with trade-offs and ones without. If you don't want to give up data freshness or data quality, consider optimizing your queries or how you store source and intermediate data. If there's nothing to optimize, try other materializations, lower the frequency of your model executions, or change testing strategies.
If you want to get intuitive clues about how to optimize the cost-effectiveness of your data transformation, consider using a data catalog. Secoda has various automations for detecting orphaned data, identifying candidates for cleanup, and alerting on schema changes. Interested? Schedule a demo today.