Stephen


Working with Gigantic Google BigQuery Partitioned Tables in DBT

The Problem

When working with large tables (think TBs), I ran into a very serious problem with my dbt models: we were scanning far too much data from the source tables!

Despite:

  1. Partitioning our tables by day
  2. Using incremental loading with the is_incremental macro and filtering on the partition column in our queries

we were still scanning a whole lot of data (and getting billed for it as well).

And that is how I embarked on my quest to optimise our dbt models and save some $$$ as well (honestly though, as per my boss, saving the $$$ was the main quest).

The Face Palm Moment

When building the initial models, I had expected our tables to grow pretty fast, so day partitioning based on ingestion time was already in there, along with building the models incrementally. The code for those looked something like this:-

{{ config(
    materialized="incremental",
    partition_by={
        "field": "ingested_at",
        "data_type": "timestamp",
        "granularity": "day"
    }
) }}
with max_ingested_this as (
    select max(ingested_at) as max_ingested_at
    from {{ this }}
),

source1 as (
    select *
    from {{ ref("source1") }} as s
    inner join max_ingested_this as mit
        on s.ingested_at > mit.max_ingested_at
)

......



On the surface that code looked OK, at least to me, two years ago. The idea was that the ingested_at column exists in every table, and we just process the new data from the source and append it to our table.

Except for this clear and succinctly written documentation on why my query above will not help BigQuery prune partitions* from the source1 table (and no, I am not being sarcastic about how good the documentation is).

So, for partition pruning to happen, my query had to be something like this:-

source1 as (
    select *
    from {{ ref("source1") }}
    where ingested_at > "2012-10-10T13:00:00"
)

Not sure how I missed this the first time around (or maybe the documentation was updated to make it clearerø), but it meant that the nice smart query I had written previously was truly dumb#.

The Solution

Once we had discovered the true cause of our queries scanning so much data, and I had gotten over a mini panic attack from thinking I had chosen the wrong tool for the job, we set out to figure out a solution.

We needed the dates to be dynamic (derived at run time), yet static in the compiled query (which would be sent to BigQuery).

And this is where one of my favourite dbt macros comes in: dbt_utils.get_single_value.

So instead of getting the max(ingested_at) in the query, we switch to something like this:-

{% set sql_statement %}
    select max(ingested_at) from {{ this }}
{% endset %}

{%- set max_ingested_this = dbt_utils.get_single_value(sql_statement, default="1970-01-01") -%}

with source1 as (
    select *
    from {{ ref("source1") }}
    where ingested_at > {{ '"' ~ max_ingested_this ~ '"' }}
)

...


Now on dbt compile we will get the following output:-

with source1 as (
    select *
    from source1_bq_table
    where ingested_at > "2022-10-10T13:00:00"
)
...

Now, as you can see, when the final BigQuery job to process your data is fired, it will have a static date in the SQL, which means BigQuery can do partition pruning and scan only the relevant partitions for your jobµ.
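To put the pieces together, here is a minimal sketch of what the full model could look like, with the lookup and the filter wrapped in dbt's is_incremental() guard (the macro mentioned earlier) so that the very first build, when {{ this }} does not exist yet, skips the lookup entirely. The model and column names are simply the ones used above; treat this as a sketch rather than a drop-in implementation:-

{{ config(
    materialized="incremental",
    partition_by={
        "field": "ingested_at",
        "data_type": "timestamp",
        "granularity": "day"
    }
) }}

{% if is_incremental() %}
    {# look up the high-water mark only when the target table already exists #}
    {% set sql_statement %}
        select max(ingested_at) from {{ this }}
    {% endset %}
    {%- set max_ingested_this = dbt_utils.get_single_value(sql_statement, default="1970-01-01") -%}
{% endif %}

with source1 as (
    select *
    from {{ ref("source1") }}
    {% if is_incremental() %}
    -- the compiled SQL contains a static literal here, so BigQuery can prune partitions
    where ingested_at > {{ '"' ~ max_ingested_this ~ '"' }}
    {% endif %}
)

select * from source1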

Obviously, the best way to see the reduction is to paste this generated SQL into BigQuery Studio and compare the amount of data scanned (shown in the top right) for the new query vs the old one.
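Another option, if you would rather check after the fact, is to compare the bytes processed by recent jobs via BigQuery's INFORMATION_SCHEMA. A rough sketch, where the region, lookback window and LIKE filter are placeholders for your own project:-

select
    creation_time,
    job_id,
    total_bytes_processed,
    total_bytes_billed
from `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
-- only look at jobs from the last day that touched the (hypothetical) source table
where creation_time > timestamp_sub(current_timestamp(), interval 1 day)
  and query like '%source1_bq_table%'
order by creation_time desc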

Notes

* - Pruning partitions refers to BigQuery being able to calculate in advance, based on your query, which partitions could contain the data you are interested in, so it doesn't even bother scanning the data in the other partitions. The direct effect of this is obviously less data scanned = faster query performance and/or your query costing less.

# - This is where I guess the ancient wisdom of the Gods applies. RTFM and read it regularly; you might just stumble upon important new features or re-discover some old ones.

ø - Just to make myself feel better, I am going to hope that documentation was fixed after I read it or my queries worked perfectly at the time and the BigQuery team changed something :P

µ - Note the compromise here though: dbt now has to run two BigQuery jobs, the first to get max_ingested_this at compile time and the second to actually build the model table in BigQuery, so make sure your tables are big enough to justify this overhead.
If your table is big enough, you can probably take this a step further and use the BigQuery INFORMATION_SCHEMA views to get the latest partition and only query that partition for the maximum value of the ingested_at column, as in the sketch below.
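For the curious, a rough sketch of that idea, assuming a day-partitioned source table called source1_bq_table living in my_dataset (both names hypothetical):-

-- find the most recent day partition of the source table
select max(partition_id) as latest_partition
from my_dataset.INFORMATION_SCHEMA.PARTITIONS
where table_name = 'source1_bq_table'
  and partition_id not in ('__NULL__', '__UNPARTITIONED__')

You could then query just that partition for max(ingested_at) instead of scanning the whole table.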
