Stephen

Working with Gigantic Google BigQuery Partitioned Tables in DBT

The Problem

When working with large tables (think TBs), I ran into a very serious problem with my dbt models: we were scanning way too much data from the source tables!

Despite doing the following:

  1. Partitioning our tables by day
  2. Using incremental loading with the is_incremental macro and filtering on the partition column in our queries (a simplified sketch of this pattern follows the list)

we were still scanning a whole lot of data (and getting billed for it as well).
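
For reference, the standard pattern that item 2 refers to (an incremental materialization plus an is_incremental() filter on the partition column) looks roughly like this. This is a simplified sketch, not our exact model:

{{ config(
    materialized="incremental",
    partition_by={
        "field": "ingested_at",
        "data_type": "timestamp",
        "granularity": "day"
    }
) }}

select *
from {{ ref("source1") }}

{% if is_incremental() %}
-- only pick up rows newer than what is already in the model's table
where ingested_at > (select max(ingested_at) from {{ this }})
{% endif %}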

And that is how I embarked on my quest to optimise our dbt models and save some $$$ as well (honestly though, as per my boss, saving the $$$ was the main quest).

The Face Palm Moment

When building the initial models, I had expected our tables to grow in size pretty fast, so day partitioning based on ingestion time was already in place, along with incrementally building the models. The code for those looked something like this:-

{{ config(
    materialized="incremental",
    partition_by={
        "field": "ingested_at",
        "data_type": "timestamp",
        "granularity": "day"
    }
) }}

-- the most recent ingestion timestamp already present in this model's table
with max_ingested_this as (
    select max(ingested_at) as max_ingested_at
    from {{ this }}
),

-- only pull source rows newer than what we have already processed
source1 as (
    select *
    from {{ ref("source1") }} as s
    inner join max_ingested_this as mit
        on s.ingested_at > mit.max_ingested_at
)

......



On the surface, that code looked OK, at least to me, two years ago. The idea was that the ingested_at column exists in every table, so we just process the new data from the source and append it to our table.

Except, there is this clear and succinctly written documentation on why the query above will not help BigQuery prune partitions* from the source1 table (and no, I am not being sarcastic about how good the documentation is).

So, for partition pruning to happen, my query had to be something like this:-

source1 as (
    select *
    from {{ ref("source1") }}
    where ingested_at > "2012-10-10T13:00:00"
)

Not sure how I missed this the first time around (or maybe the documentation was updated to make it clearerø), but it meant that the nice smart query I had written previously was truly dumb#.

The Solution

Once we had discovered the true cause of our queries scanning so much data, and I had gotten over a mini panic attack from thinking I had chosen the wrong tool for the job, we set out to figure out a solution.

We needed the dates to be dynamic (derived at run time), yet static in the compiled query that would be sent to BigQuery.

And this is where one of my favourite dbt_utils macros comes in: get_single_value. It executes a query while dbt is compiling the model and returns the single value from the result, which can then be rendered as a literal into the SQL.

So instead of getting the max(ingested_at) in the query, we switch to something like this:-

{% set sql_statement %}
    select max(ingested_at) from {{ this }}
{% endset %}

{%- set max_ingested_this = dbt_utils.get_single_value(sql_statement, default="1970-01-01") -%}

with source1 as (
    select *
    from {{ ref("source1") }}
    where ingested_at > {{ '"' ~ max_ingested_this ~ '"' }}
)

...


Now on dbt compile we will get the following output:-

with source1 as (
    select *
    from source1_bq_table
    where ingested_at > "2022-10-10T13:00:00"
)
...

As you can see, when the final BigQuery job that processes your data is fired, it will have the static date in the SQL, which means BigQuery can now do partition pruning and scan only the relevant partitions for your jobµ.

Obviously, the best way to see the reduction in the amount of data scanned is to paste the generated SQL into BigQuery Studio and compare the estimated bytes scanned (shown in the top right of the editor) for the new query vs the old one.
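
If you want to compare actual runs rather than eyeballing the editor estimate, the BigQuery INFORMATION_SCHEMA jobs views record how many bytes each query processed. A rough sketch (the region-us qualifier is an assumption; use the region your data lives in):

-- Sketch: list recent query jobs along with how much data each one scanned.
select
    creation_time,
    user_email,
    total_bytes_processed,
    total_bytes_billed,
    left(query, 80) as query_preview
from `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
where job_type = 'QUERY'
  and creation_time > timestamp_sub(current_timestamp(), interval 7 day)
order by creation_time desc;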

Notes

* - Partition pruning refers to BigQuery being able to work out in advance, based on your query, which partitions could contain the data you are interested in, so it doesn't even bother scanning the data in the other partitions. The direct effect is less data scanned = faster query performance and/or a cheaper query.

# - This is where I guess the ancient wisdom of the Gods applies: RTFM, and read it regularly; you might just stumble upon important new features or re-discover some old ones.

ø - Just to make myself feel better, I am going to hope that the documentation was fixed after I read it, or that my queries worked perfectly at the time and the BigQuery team changed something :P

µ - Note the compromise here though: dbt now has to invoke two BigQuery jobs, the first to fetch max_ingested_this at compile time and the second to actually build the model table in BigQuery, so make sure your model's output table is big enough to justify this overhead.
If your table is big enough, you can probably take this a step further and use the BigQuery INFORMATION_SCHEMA views to find the latest partition and only query that partition for the maximum value of the ingested_at column.
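Sketching that last idea out, it could look something like this. This is an illustration, not code I have run in production: it assumes day granularity on ingested_at, and the INFORMATION_SCHEMA.PARTITIONS lookup itself only reads table metadata rather than scanning data:

{% set latest_partition_sql %}
    -- metadata-only lookup of the newest day partition of this model's table
    select max(partition_id)
    from `{{ this.database }}.{{ this.schema }}.INFORMATION_SCHEMA.PARTITIONS`
    where table_name = '{{ this.identifier }}'
      and partition_id not in ('__NULL__', '__UNPARTITIONED__')
{% endset %}

{%- set latest_partition = dbt_utils.get_single_value(latest_partition_sql) -%}

{% set max_ingested_sql %}
    -- scan only the newest partition for the max ingested_at value
    select max(ingested_at)
    from {{ this }}
    where ingested_at >= timestamp(parse_date('%Y%m%d', '{{ latest_partition }}'))
{% endset %}

{%- set max_ingested_this = dbt_utils.get_single_value(max_ingested_sql, default="1970-01-01") -%}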
