BenBirt for Dataform

Cut data warehouse costs with run caching

As we've mentioned before, one of the core design goals of Dataform is to make project compilation hermetic. The idea is to ensure that your final ELT pipeline is as reproducible as possible given the same input (your project code), with a few tightly-controlled exceptions (like support for 'incremental' tables).

Being able to reason this way about the code in Dataform pipelines gives us the opportunity to build some cool features into the Dataform framework. An example is our "run caching" feature.

Don't waste time and money re-computing the same data

Most analytics pipelines are executed periodically as part of some schedule. Generally, these schedules are configured to run as often as necessary to keep the final data as up-to-date as the business requires.

Unfortunately, this can lead to a waste of resources. Consider a pipeline that is executed once an hour. If its input data doesn't change between one execution and the next, then the next execution will result in no changes to the output data, but it'll still cost time and money to run.

Instead, we believe that the pipeline should automatically detect when a run would not change the output data and, in that case, skip the affected stage(s), saving those resources.

We've built this feature into Dataform.

Run caching in Dataform

Try out an example project with run caching here!

You can turn run caching on in your project with a few small changes which are described here. Once enabled, run caching skips re-execution of code which cannot result in a change to output data.

For example, consider the following SQLX file, which configures Dataform to publish a table age_count containing the transformed results of a query reading a people relation:

config { name: "age_count", type: "table" }

select age, count(1) from ${ref("people")} group by age

Dataform only needs to (re-)publish this table if any of the following conditions are true:

  • The output table age_count doesn't exist
  • The output table age_count has changed since the last time this table was published (i.e. it was modified by something other than Dataform itself)
  • The query has changed since the last time the age_count table was published
  • The input table people has changed since the last time the age_count table was published (or, if people is a view, then if any of the input(s) to people have changed)

Dataform uses these rules to decide whether or not to publish the table. If none of the conditions is true, i.e. re-publishing the table would result in no change to the output table, then this action is skipped.
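To make the four conditions concrete, here is a minimal Python sketch of the decision. It is an illustration only, not Dataform's actual implementation: the CachedRun record, the hash_query and should_publish helpers, and the use of integer "last modified" timestamps to detect changes are all assumptions made for this example. For an input that is a view, the caller is assumed to pass the timestamps of the tables underlying that view.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class CachedRun:
    """Hypothetical record of the last successful publish of one table."""
    query_hash: str                       # hash of the query that was run
    output_last_modified: int             # output timestamp right after publishing
    input_last_modified: Dict[str, int]   # timestamp of each input at publish time


def hash_query(sql: str) -> str:
    """Fingerprint the query text so that any edit to it is detected."""
    return hashlib.sha256(sql.encode("utf-8")).hexdigest()


def should_publish(
    sql: str,
    output_last_modified: Optional[int],
    input_last_modified: Dict[str, int],
    cached: Optional[CachedRun],
) -> bool:
    """Return True if any of the four re-publish conditions holds."""
    # 1. The output table doesn't exist.
    if output_last_modified is None:
        return True
    # No record of a previous publish: nothing proves a skip is safe.
    if cached is None:
        return True
    # 2. The output table was modified by something other than the last publish.
    if output_last_modified != cached.output_last_modified:
        return True
    # 3. The query has changed since the last publish.
    if hash_query(sql) != cached.query_hash:
        return True
    # 4. At least one input has changed (or the set of inputs differs).
    if input_last_modified != cached.input_last_modified:
        return True
    # None of the conditions holds, so the action can be skipped.
    return False
```

With a record such as CachedRun(hash_query(sql), 100, {"people": 90}), calling should_publish(sql, 100, {"people": 90}, record) returns False, so the action is skipped; changing the people timestamp to 95 makes it return True. The sketch errs on the side of re-publishing whenever it has no cached record.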

Building in intelligence so you don't have to

At Dataform we believe that you shouldn't have to manage the infrastructure involved in running analytics workloads.

This philosophy is what drives us to build out features like run caching, which automatically help to manage and operationalize analytics workloads, so that you don't have to. All you need to do is define your business-logic transformations, and we'll handle the rest.

If you'd like to learn more, the Dataform framework documentation is here. Join us on Slack and let us know what you think!
