BigQuery: query and table optimization to save some money

#googlecloud #bigquery #partitioning #clustering

What is BigQuery?

BigQuery belongs to the Google Cloud Platform suite and it is one of the most used platforms for data warehousing in the data world. You can host tables and query them to transform data and provide useful insights.

The way users are charged is based on the bytes processed by each query. The cost will depend on:

The number of columns scanned in each row
The total amount of data read, not the number of rows returned

To help understand this post, let's assume we have a table with a list of orders with the following columns:customer_id, product_id, store_id, product_value, product_name, purchase_date

For a query like the one below, we are retrieving all columns from all rows. This translates to scanning the entire table, which may be expensive depending on its size:

SELECT
  *
FROM orders

One way to reduce the cost of the above query is to select just the fields that are required. For example, if we want to know which products were bought by which customers and on which date, we are only interested in customer_id, product_id, and purchase_date. There's no need to load fields like store_id or product_value:

SELECT
  customer_id,
  product_id,
  purchase_date
FROM orders

That is how we can optimize queries during development. However, there are two other ways to optimize costs at the table design level, by preparing the table for more efficient querying.

Partitioning

Partitioning breaks the table into logical segments based on a date or timestamp column. For example, if we select daily partitioning, the table is split into partitions for each day.

Using the orders table example, imagine we want to retrieve product purchases made during 2024. The query below will only scan the partitions that fall within that date range, in this case 365 partitions (one per day). This helps reduce the amount of data processed by skipping older data.

SELECT
  customer_id,
  product_id,
  purchase_date
FROM orders
WHERE DATE(partition_field) BETWEEN '2024-01-01' AND '2024-12-31'

Note: To take advantage of partition pruning, make sure you filter directly on the partition field without wrapping it in functions like DATE() unless the field itself is a timestamp. If the field is already a DATE type, use it as-is.

Clustering

Clustering breaks the data further within each partition by organizing it based on the values of selected fields. These are usually the fields most frequently used in WHERE clauses.

When a table is partitioned, clustering happens within each partition. This helps BigQuery scan only the blocks relevant to your query.

In our example, if the table is clustered by store_id, and we want to retrieve products purchased by customers in a specific store, we can write:

SELECT 
  customer_id,
  product_id,
  purchase_date
FROM orders
WHERE DATE(partition_field) BETWEEN '2024-01-01' AND '2024-12-31'
  AND store_id = 'store_xyz'

Thanks to partitioning and clustering, this query will skip over irrelevant partitions and blocks, scanning much less data.

See below an image of how a table would look without any partitioning or clustering, and how it looks after adding these optimizations.

Are you aware of any other optimizations? Share them in the comments

Top comments (1)

Charlotte Towell • Nov 19

Nice! I found some interesting results when I was first learning BigQuery on some of the edge cases of the effectiveness of clustering when paired with external queries, I wrote an article about it on here. Have also recently been using materialised views more along with partitioning which has helped cut query costs drastically for me!