Abdelrahman Adnan

Part 3: Partitioning & Clustering for Performance 🚀

This is where BigQuery optimization gets really powerful! These two techniques can reduce your query costs by 90% or more.

The Problem We're Solving 🤔

Imagine you have a table with 5 years of taxi trip data - about 500 million rows. Every time you query:

SELECT * FROM taxi_trips WHERE pickup_date = '2024-01-15';

Without optimization, BigQuery scans ALL 500 million rows just to find trips from one day. That's:

  • Slow (lots of data to read)
  • Expensive (you pay for all data scanned)
  • Wasteful (you only needed 0.05% of the data!)

Partitioning and clustering solve this problem!
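
To put rough numbers on this, here is a back-of-the-envelope sketch you can run in BigQuery itself. The ~200 bytes per row and the $5/TB on-demand rate are assumptions for illustration, not measured values:

```sql
-- Back-of-the-envelope: full-scan cost vs. one day's slice of 5 years
-- (assumes ~200 bytes/row and a $5/TB on-demand rate; illustrative only)
SELECT
  500000000 * 200 / POW(1024, 4)                 AS tb_scanned,
  500000000 * 200 / POW(1024, 4) * 5             AS full_scan_cost_usd,
  500000000 * 200 / POW(1024, 4) * 5 / (5 * 365) AS one_day_cost_usd;
```

Even at these small absolute numbers, paying for the whole table when you need one day adds up fast across hundreds of queries per day.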

Partitioning: Dividing Your Table into Sections 📁

Think of partitioning like organizing a filing cabinet. Instead of one giant drawer with all documents, you have:

  • Drawer for January
  • Drawer for February
  • Drawer for March
  • ...and so on

When you need something from March, you ONLY open the March drawer!

How Partitioning Works in BigQuery

-- Create a table partitioned by date
CREATE OR REPLACE TABLE `project.dataset.taxi_partitioned`
PARTITION BY DATE(pickup_datetime) AS
SELECT * FROM `project.dataset.taxi_external`;

Now your table looks like this internally:

taxi_partitioned/
├── 2024-01-01/    (all trips from Jan 1)
├── 2024-01-02/    (all trips from Jan 2)
├── 2024-01-03/    (all trips from Jan 3)
│   ... 
├── 2024-06-30/    (all trips from Jun 30)
└── [metadata]

When you query with a date filter:

SELECT * FROM taxi_partitioned 
WHERE DATE(pickup_datetime) = '2024-03-15';

BigQuery ONLY reads the 2024-03-15 partition! The other 180+ partitions are never touched.

Types of Partitioning

1. Time-based partitioning (most common)

-- Partition by day (the default granularity)
PARTITION BY DATE(pickup_datetime)

-- Partition by month (for less granular data)
-- (use DATE_TRUNC instead if pickup_datetime is a DATE column)
PARTITION BY TIMESTAMP_TRUNC(pickup_datetime, MONTH)

-- Partition by year
PARTITION BY TIMESTAMP_TRUNC(pickup_datetime, YEAR)

2. Integer range partitioning

-- Partition by customer ID ranges
PARTITION BY RANGE_BUCKET(customer_id, GENERATE_ARRAY(0, 1000000, 10000))
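
To see which bucket a value lands in, note that RANGE_BUCKET returns the number of boundaries less than or equal to the point. A quick sketch using the same boundary array as above:

```sql
-- RANGE_BUCKET(point, boundaries) counts boundaries <= point
SELECT
  RANGE_BUCKET(-5,    GENERATE_ARRAY(0, 1000000, 10000)) AS below_first,   -- 0
  RANGE_BUCKET(4321,  GENERATE_ARRAY(0, 1000000, 10000)) AS first_bucket,  -- 1
  RANGE_BUCKET(25000, GENERATE_ARRAY(0, 1000000, 10000)) AS third_bucket;  -- 3
```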

3. Ingestion time partitioning

-- Partition by when data was loaded
PARTITION BY _PARTITIONDATE
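
With ingestion-time partitioning there is no partition column in your schema; instead you filter on the _PARTITIONDATE (or _PARTITIONTIME) pseudo-column. A sketch, with a hypothetical table name:

```sql
-- Read only yesterday's ingested partition (table name is illustrative)
SELECT *
FROM `project.dataset.events_ingested`
WHERE _PARTITIONDATE = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);
```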

Partitioning Rules to Remember ⚠️

| Rule | Details |
| --- | --- |
| Max partitions | 4,000 per table |
| Min partition size | Aim for at least 1 GB per partition |
| One column only | You can partition on only ONE column |
| Column types | DATE, TIMESTAMP, DATETIME, or INTEGER |

💡 When NOT to use partitioning:

  • If you'd have < 1GB per partition (use clustering instead)
  • If you'd exceed 4,000 partitions
  • If you rarely filter on the partition column
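
You can check whether your partitions meet the ~1 GB guideline using INFORMATION_SCHEMA. A sketch (the dataset and table names are placeholders):

```sql
-- Inspect partition sizes; lots of tiny partitions suggests
-- you partitioned too finely
SELECT
  partition_id,
  total_rows,
  ROUND(total_logical_bytes / POW(1024, 3), 2) AS size_gb
FROM `dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'taxi_partitioned'
ORDER BY partition_id;
```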

Clustering: Organizing Data Within Partitions 🗂️

If partitioning is like having separate drawers in a filing cabinet, clustering is like organizing the folders WITHIN each drawer alphabetically.

How Clustering Works

-- Create table with partitioning AND clustering
CREATE OR REPLACE TABLE `project.dataset.taxi_optimized`
PARTITION BY DATE(pickup_datetime)
CLUSTER BY vendor_id, payment_type AS
SELECT * FROM `project.dataset.taxi_external`;

Now within each date partition, data is sorted by vendor_id, then by payment_type:

taxi_optimized/
├── 2024-01-15/
│   ├── vendor_id=1, payment_type=1, ...
│   ├── vendor_id=1, payment_type=2, ...
│   ├── vendor_id=2, payment_type=1, ...
│   └── vendor_id=2, payment_type=2, ...
├── 2024-01-16/
│   └── (similarly organized)

When you query:

SELECT * FROM taxi_optimized 
WHERE DATE(pickup_datetime) = '2024-01-15'
  AND vendor_id = 1;

BigQuery:

  1. Goes directly to the 2024-01-15 partition (thanks to partitioning)
  2. Reads only the vendor_id=1 blocks (thanks to clustering)

Even more data skipped = even faster and cheaper!

Clustering Rules 📏

| Rule | Details |
| --- | --- |
| Max columns | Up to 4 clustering columns |
| Order matters | Put the most frequently filtered column first |
| No cost for re-clustering | BigQuery automatically re-clusters as data is added |
| Works with partitioning | Best used together! |
| Minimum table size | Most effective for tables > 1 GB |

Good clustering column candidates:

  • Columns you frequently filter on (WHERE clause)
  • Columns you frequently group by (GROUP BY clause)
  • High-cardinality columns (many distinct values)

Partitioning vs Clustering: When to Use What? 🤷

| Scenario | Recommendation |
| --- | --- |
| Always filter by date | Partition by date |
| Filter by date AND other columns | Partition by date, cluster by the other columns |
| Filter by multiple non-date columns | Cluster by those columns |
| Need to know query cost upfront | Use partitioning (with clustering, the estimate is only an upper bound) |
| Less than 1 GB per potential partition | Use clustering instead |
| Would have > 4,000 partitions | Use clustering instead |
| Data is rarely filtered | Maybe neither; analyze your query patterns first |

Real-World Performance Comparison 📊

I ran tests on the NYC taxi dataset (about 20 million rows). Here are the results:

Test 1: Filtering by Date Range

SELECT DISTINCT vendor_id 
FROM [table] 
WHERE DATE(pickup_datetime) BETWEEN '2024-03-01' AND '2024-03-15';
| Table Type | Data Scanned | Cost |
| --- | --- | --- |
| Non-partitioned | 310 MB | $0.00155 |
| Partitioned by date | 27 MB | $0.000135 |
| Savings | 91% less! | 91% cheaper! |

Test 2: Filtering by Date AND Vendor

SELECT COUNT(*) 
FROM [table] 
WHERE DATE(pickup_datetime) BETWEEN '2024-06-01' AND '2024-06-30'
  AND vendor_id = 1;
| Table Type | Data Scanned |
| --- | --- |
| Partitioned only | 1.1 GB |
| Partitioned + Clustered | 865 MB |
| Additional savings | 21% less! |

Combined savings: Over 90% reduction in costs! 💰
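
The percentages are easy to sanity-check from the scanned-bytes figures above (reading "1.1 GB" as 1,100 MB):

```sql
-- Sanity-check the savings percentages quoted above
SELECT
  ROUND((1 - 27 / 310) * 100)   AS test1_savings_pct,  -- 91
  ROUND((1 - 865 / 1100) * 100) AS test2_savings_pct;  -- 21
```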

Step-by-Step: Creating an Optimized Table

Here's the full workflow:

-- Step 1: Create external table pointing to your data in GCS
CREATE OR REPLACE EXTERNAL TABLE `my-project.dataset.taxi_external`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-bucket/yellow_taxi_2024/*.parquet']
);

-- Step 2: Check how many records we have
SELECT COUNT(*) FROM `my-project.dataset.taxi_external`;
-- Result: 20,332,093 records

-- Step 3: Create optimized table with partitioning and clustering
CREATE OR REPLACE TABLE `my-project.dataset.taxi_optimized`
PARTITION BY DATE(tpep_dropoff_datetime)
CLUSTER BY VendorID AS
SELECT * FROM `my-project.dataset.taxi_external`;

-- Step 4: Verify partitions were created
SELECT 
  table_name, 
  partition_id, 
  total_rows
FROM `dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'taxi_optimized'
ORDER BY partition_id;

Best Practices Cheat Sheet ✅

For Reducing Costs 💵

  • ❌ Never use SELECT *
  • ✅ Only query columns you need
  • ✅ Use partitioned tables
  • ✅ Add clustering for frequently filtered columns
  • ✅ Check estimated bytes before running
  • ✅ Use the free table preview in the console instead of a SELECT for quick looks

For Better Performance ⚡

  • ✅ Filter early - apply WHERE before JOINs
  • ✅ Put largest table first in JOINs
  • ✅ Use ORDER BY at the end of query
  • ✅ Consider approximate functions (APPROX_COUNT_DISTINCT) when exact precision isn't needed
  • ✅ Avoid JavaScript UDFs when possible
  • ✅ Don't over-partition (keep partitions > 1GB)
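
As a quick illustration of the approximate-function tip above, APPROX_COUNT_DISTINCT trades exact precision for a much cheaper aggregation. A sketch against the taxi table used throughout this post:

```sql
-- Exact (heavier) vs. approximate (cheaper) distinct counts
SELECT
  COUNT(DISTINCT vendor_id)        AS exact_vendors,
  APPROX_COUNT_DISTINCT(vendor_id) AS approx_vendors
FROM `project.dataset.taxi`;
```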

Quick Reference: Common SQL Patterns for Beginners 📝

Here are the most useful BigQuery SQL commands you'll need:

Basic Queries

-- Count all records in a table
SELECT COUNT(*) FROM `project.dataset.table`;

-- Count distinct values in a column
SELECT COUNT(DISTINCT vendor_id) FROM `project.dataset.taxi`;

-- Get first 10 rows (but remember - this still scans the whole table!)
SELECT * FROM `project.dataset.taxi` LIMIT 10;

-- Better way to preview - use table preview in BigQuery console instead!

Filtering Data

-- Filter by exact value
SELECT * FROM `project.dataset.taxi`
WHERE vendor_id = 1;

-- Filter by date range
SELECT * FROM `project.dataset.taxi`
WHERE DATE(pickup_datetime) BETWEEN '2024-01-01' AND '2024-01-31';

-- Filter with multiple conditions
SELECT * FROM `project.dataset.taxi`
WHERE vendor_id = 1 
  AND fare_amount > 10
  AND DATE(pickup_datetime) = '2024-03-15';

Aggregations

-- Sum, average, min, max
SELECT 
  SUM(fare_amount) as total_fares,
  AVG(fare_amount) as avg_fare,
  MIN(fare_amount) as min_fare,
  MAX(fare_amount) as max_fare,
  COUNT(*) as trip_count
FROM `project.dataset.taxi`;

-- Group by
SELECT 
  vendor_id,
  COUNT(*) as trips,
  AVG(fare_amount) as avg_fare
FROM `project.dataset.taxi`
GROUP BY vendor_id;

Creating Tables

-- Create external table from GCS
CREATE OR REPLACE EXTERNAL TABLE `project.dataset.taxi_external`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://bucket-name/folder/*.parquet']
);

-- Create native table from external
CREATE OR REPLACE TABLE `project.dataset.taxi_native` AS
SELECT * FROM `project.dataset.taxi_external`;

-- Create partitioned + clustered table
CREATE OR REPLACE TABLE `project.dataset.taxi_optimized`
PARTITION BY DATE(pickup_datetime)
CLUSTER BY vendor_id
AS SELECT * FROM `project.dataset.taxi_external`;

Checking Table Info

-- View partition information
SELECT 
  table_name, 
  partition_id, 
  total_rows
FROM `dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'your_table_name'
ORDER BY partition_id;

-- Check table schema
SELECT column_name, data_type 
FROM `dataset.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'your_table_name';

Glossary for Beginners 📚

| Term | Simple Explanation |
| --- | --- |
| Data Warehouse | A big database designed for analyzing historical data, not for running apps |
| OLTP | Databases for running applications (fast, small transactions) |
| OLAP | Databases for analysis (complex queries, lots of data) |
| BigQuery | Google's cloud data warehouse service |
| GCS | Google Cloud Storage, where you store files in the cloud |
| External Table | A table that reads data from GCS without copying it |
| Native Table | A table whose data is stored in BigQuery itself |
| Partitioning | Splitting a table into smaller pieces by date or number |
| Clustering | Sorting data within partitions by specific columns |
| Columnar Storage | Storing data by column instead of row (faster for analytics) |
| Slot | A unit of compute power in BigQuery |
| Data Scanned | How much data BigQuery reads to answer your query (you pay for this!) |

Common Mistakes to Avoid ⚠️

  1. Using SELECT * everywhere - Always specify columns you need!

  2. Thinking LIMIT reduces cost - It doesn't! BigQuery scans first, limits after.

  3. Not using partitions - Always partition time-series data by date.

  4. Wrong partition column - Partition by columns you ALWAYS filter on.

  5. Too many partitions - Keep it under 4,000, aim for >1GB per partition.

  6. Ignoring the query validator - Always check estimated bytes before running!

  7. Not using clustering with partitioning - They work best together!


Summary: Key Takeaways 🎯

  1. Data warehouses are for analysis, not running apps - that's why they exist!

  2. BigQuery is serverless - no servers to manage, just write SQL.

  3. Columnar storage = only reads columns you request = faster + cheaper.

  4. External tables = data in GCS, slower but flexible.

  5. Native tables = data in BigQuery, faster but costs more storage.

  6. Partitioning = split table by date, only scan relevant dates.

  7. Clustering = sort data within partitions, skip irrelevant blocks.

  8. Always check estimated bytes before running queries!

  9. Never use SELECT * - specify only the columns you need.

  10. Combine partitioning + clustering for maximum optimization!


#DataEngineeringZoomcamp #BigQuery #DataWarehouse #GCP #SQL #CloudComputing
