This is where BigQuery optimization gets really powerful! These two techniques can reduce your query costs by 90% or more.
The Problem We're Solving 🤔
Imagine you have a table with 5 years of taxi trip data - about 500 million rows. Every time you query:
SELECT * FROM taxi_trips WHERE pickup_date = '2024-01-15';
Without optimization, BigQuery scans ALL 500 million rows just to find trips from one day. That's:
- Slow (lots of data to read)
- Expensive (you pay for all data scanned)
- Wasteful (you only needed 0.05% of the data!)
Partitioning and clustering solve this problem!
Partitioning: Dividing Your Table into Sections 📁
Think of partitioning like organizing a filing cabinet. Instead of one giant drawer with all documents, you have:
- Drawer for January
- Drawer for February
- Drawer for March
- ...and so on
When you need something from March, you ONLY open the March drawer!
How Partitioning Works in BigQuery
-- Create a table partitioned by date
CREATE OR REPLACE TABLE `project.dataset.taxi_partitioned`
PARTITION BY DATE(pickup_datetime) AS
SELECT * FROM `project.dataset.taxi_external`;
Now your table looks like this internally:
taxi_partitioned/
├── 2024-01-01/ (all trips from Jan 1)
├── 2024-01-02/ (all trips from Jan 2)
├── 2024-01-03/ (all trips from Jan 3)
│ ...
├── 2024-06-30/ (all trips from Jun 30)
└── [metadata]
When you query with a date filter:
SELECT * FROM taxi_partitioned
WHERE DATE(pickup_datetime) = '2024-03-15';
BigQuery ONLY reads the 2024-03-15 partition! The other 180+ partitions are never touched.
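Partition pruning is easy to picture with a toy model in Python. This is not how BigQuery is implemented, just a sketch of the idea: rows are grouped by their partition key, and a pruned query only ever looks inside the matching group.

```python
from collections import defaultdict
from datetime import date

# Toy model of a date-partitioned table: rows grouped by partition key.
partitions = defaultdict(list)
trips = [
    {"pickup_date": date(2024, 3, 14), "fare": 12.5},
    {"pickup_date": date(2024, 3, 15), "fare": 9.0},
    {"pickup_date": date(2024, 3, 15), "fare": 22.0},
    {"pickup_date": date(2024, 3, 16), "fare": 7.5},
]
for trip in trips:
    partitions[trip["pickup_date"]].append(trip)

def query_without_pruning(target):
    # Full scan: every row in every partition is read.
    scanned = [t for rows in partitions.values() for t in rows]
    return [t for t in scanned if t["pickup_date"] == target], len(scanned)

def query_with_pruning(target):
    # Pruned scan: only the matching partition is read.
    scanned = partitions.get(target, [])
    return [t for t in scanned if t["pickup_date"] == target], len(scanned)

full_hits, full_scanned = query_without_pruning(date(2024, 3, 15))
pruned_hits, pruned_scanned = query_with_pruning(date(2024, 3, 15))
print(full_scanned, pruned_scanned)  # all 4 rows scanned vs only 2
```

Both queries return identical results; the only difference is how much data was touched to get them. That difference is exactly what you pay for.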
Types of Partitioning
1. Time-based partitioning (most common)
-- Partition by day (default)
PARTITION BY DATE(pickup_datetime)
-- Partition by month (for less granular data)
PARTITION BY DATE_TRUNC(pickup_datetime, MONTH)
-- Partition by year
PARTITION BY DATE_TRUNC(pickup_datetime, YEAR)
2. Integer range partitioning
-- Partition by customer ID ranges
PARTITION BY RANGE_BUCKET(customer_id, GENERATE_ARRAY(0, 1000000, 10000))
3. Ingestion time partitioning
-- Partition by when data was loaded
PARTITION BY _PARTITIONDATE
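If RANGE_BUCKET feels abstract, its bucket assignment can be sketched in a few lines of Python: each boundary is an inclusive lower bound, values below the first boundary go to bucket 0, and values at or above the last boundary fall into an overflow bucket.

```python
from bisect import bisect_right

def range_bucket(point, boundaries):
    """Sketch of BigQuery's RANGE_BUCKET semantics: returns the index of
    the bucket whose inclusive lower bound covers the point."""
    return bisect_right(boundaries, point)

# GENERATE_ARRAY(0, 1000000, 10000) -> boundaries 0, 10000, 20000, ..., 1000000
boundaries = list(range(0, 1000001, 10000))

print(range_bucket(12345, boundaries))  # bucket 2: the [10000, 20000) range
print(range_bucket(-5, boundaries))     # bucket 0: below the first boundary
```

So with the DDL above, all customers whose IDs fall in the same 10,000-wide range land in the same partition.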
Partitioning Rules to Remember ⚠️
| Rule | Details |
|---|---|
| Max partitions | 4,000 per table |
| Min partition size | Aim for at least 1GB per partition |
| One column only | Can only partition on ONE column |
| Column types | DATE, TIMESTAMP, DATETIME, or INTEGER |
💡 When NOT to use partitioning:
- If you'd have < 1GB per partition (use clustering instead)
- If you'd exceed 4,000 partitions
- If you rarely filter on the partition column
Clustering: Organizing Data Within Partitions 🗂️
If partitioning is like having separate drawers in a filing cabinet, clustering is like organizing the folders WITHIN each drawer alphabetically.
How Clustering Works
-- Create table with partitioning AND clustering
CREATE OR REPLACE TABLE `project.dataset.taxi_optimized`
PARTITION BY DATE(pickup_datetime)
CLUSTER BY vendor_id, payment_type AS
SELECT * FROM `project.dataset.taxi_external`;
Now within each date partition, data is sorted by vendor_id, then by payment_type:
taxi_optimized/
├── 2024-01-15/
│ ├── vendor_id=1, payment_type=1, ...
│ ├── vendor_id=1, payment_type=2, ...
│ ├── vendor_id=2, payment_type=1, ...
│ └── vendor_id=2, payment_type=2, ...
├── 2024-01-16/
│ └── (similarly organized)
When you query:
SELECT * FROM taxi_optimized
WHERE DATE(pickup_datetime) = '2024-01-15'
AND vendor_id = 1;
BigQuery:
- Goes directly to the 2024-01-15 partition (thanks to partitioning)
- Reads only the vendor_id=1 blocks (thanks to clustering)
Even more data skipped = even faster and cheaper!
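The block-skipping behavior that clustering enables can also be sketched in Python. This is a simplified model, not BigQuery's actual storage format: sorted data means each storage block covers a narrow range of cluster-key values, and min/max metadata lets whole blocks be skipped without reading them.

```python
# Toy model of clustered storage: within a partition, rows are sorted by
# vendor_id and stored in blocks that carry min/max metadata.
blocks = [
    {"min_vendor": 1, "max_vendor": 1, "rows": [(1, "card"), (1, "cash")]},
    {"min_vendor": 1, "max_vendor": 2, "rows": [(1, "card"), (2, "cash")]},
    {"min_vendor": 2, "max_vendor": 2, "rows": [(2, "card"), (2, "card")]},
]

def scan_for_vendor(vendor):
    read_blocks = 0
    matches = []
    for block in blocks:
        # Skip any block whose min/max range can't contain this vendor.
        if vendor < block["min_vendor"] or vendor > block["max_vendor"]:
            continue
        read_blocks += 1
        matches.extend(r for r in block["rows"] if r[0] == vendor)
    return matches, read_blocks

matches, read = scan_for_vendor(1)
print(read)  # only 2 of 3 blocks read; the vendor_id=2-only block is skipped
```

On unsorted data every block would span the full vendor_id range, so nothing could be skipped. Sorting is what makes the min/max metadata selective.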
Clustering Rules 📏
| Rule | Details |
|---|---|
| Max columns | Up to 4 clustering columns |
| Order matters | Put the most frequently filtered column first |
| No cost for re-clustering | BigQuery automatically re-clusters as data is added |
| Works with partitioning | Best used together! |
| Minimum table size | Most effective for tables > 1GB |
Good clustering column candidates:
- Columns you frequently filter on (WHERE clause)
- Columns you frequently group by (GROUP BY clause)
- High-cardinality columns (many distinct values)
Partitioning vs Clustering: When to Use What? 🤷
| Scenario | Recommendation |
|---|---|
| Always filter by date | Partition by date |
| Filter by date AND other columns | Partition by date, cluster by other columns |
| Filter by multiple non-date columns | Cluster by those columns |
| Need to know query cost upfront | Must use partitioning (clustering doesn't show estimates) |
| Less than 1GB per potential partition | Use clustering instead |
| Would have > 4,000 partitions | Use clustering instead |
| Data is rarely filtered | Maybe neither - analyze your query patterns first |
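You can turn the two rules of thumb above (stay under 4,000 partitions, keep each partition around 1GB or more) into a quick sanity check. This is a hypothetical helper I wrote for illustration, not a BigQuery API:

```python
def partition_plan(table_gb, days, max_partitions=4000, min_gb=1.0):
    """Hypothetical helper: pick the finest partition granularity that
    satisfies both rules of thumb, or fall back to clustering."""
    for name, count in [("day", days), ("month", days // 30), ("year", days // 365)]:
        count = max(count, 1)
        if count <= max_partitions and table_gb / count >= min_gb:
            return name, count
    return "consider clustering instead", 1

# 5 years of taxi data at ~2TB: daily partitions comfortably clear 1GB each.
print(partition_plan(2000, 1825))  # ('day', 1825)

# A 100GB table over the same span: daily partitions would be too small,
# but monthly ones work.
print(partition_plan(100, 1825))   # ('month', 60)
```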
Real-World Performance Comparison 📊
I ran tests on the NYC taxi dataset (about 20 million rows). Here are the results:
Test 1: Filtering by Date Range
SELECT DISTINCT vendor_id
FROM [table]
WHERE DATE(pickup_datetime) BETWEEN '2024-03-01' AND '2024-03-15';
| Table Type | Data Scanned | Cost |
|---|---|---|
| Non-partitioned | 310 MB | $0.00155 |
| Partitioned by date | 27 MB | $0.000135 |
| Savings | 91% less! | 91% cheaper! |
Test 2: Filtering by Date AND Vendor
SELECT COUNT(*)
FROM [table]
WHERE DATE(pickup_datetime) BETWEEN '2024-06-01' AND '2024-06-30'
AND vendor_id = 1;
| Table Type | Data Scanned |
|---|---|
| Partitioned only | 1.1 GB |
| Partitioned + Clustered | 865 MB |
| Additional savings | 21% less! |
Combined savings: Over 90% reduction in costs! 💰
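The dollar figures in these tables follow directly from on-demand pricing. The rate implied by the numbers above is $5 per TB scanned (rates vary by region and change over time, so treat that as an assumption and check the current price list):

```python
# On-demand rate implied by the tables above; verify your region's rate.
PRICE_PER_TB = 5.00

def query_cost(mb_scanned):
    # Convert MB to TB, then multiply by the per-TB rate.
    return mb_scanned / 1_000_000 * PRICE_PER_TB

# Test 1: partitioning cuts the scan from 310 MB to 27 MB.
print(round(query_cost(310), 6))      # 0.00155
print(round(query_cost(27), 6))       # 0.000135
print(f"{1 - 27 / 310:.0%} saved")    # 91% saved

# Test 2: clustering trims a further slice off the partitioned scan.
print(f"{1 - 865 / 1100:.0%} saved")  # 21% saved
```

The absolute dollar amounts look tiny here because the test dataset is small; at production scale (hundreds of queries a day over TBs of data) the same percentages translate into real money.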
Step-by-Step: Creating an Optimized Table
Here's the full workflow:
-- Step 1: Create external table pointing to your data in GCS
CREATE OR REPLACE EXTERNAL TABLE `my-project.dataset.taxi_external`
OPTIONS (
format = 'PARQUET',
uris = ['gs://my-bucket/yellow_taxi_2024/*.parquet']
);
-- Step 2: Check how many records we have
SELECT COUNT(*) FROM `my-project.dataset.taxi_external`;
-- Result: 20,332,093 records
-- Step 3: Create optimized table with partitioning and clustering
CREATE OR REPLACE TABLE `my-project.dataset.taxi_optimized`
PARTITION BY DATE(tpep_dropoff_datetime)
CLUSTER BY VendorID AS
SELECT * FROM `my-project.dataset.taxi_external`;
-- Step 4: Verify partitions were created
SELECT
table_name,
partition_id,
total_rows
FROM `dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'taxi_optimized'
ORDER BY partition_id;
Best Practices Cheat Sheet ✅
For Reducing Costs 💵
- ❌ Never use SELECT *
- ✅ Only query the columns you need
- ✅ Use partitioned tables
- ✅ Add clustering for frequently filtered columns
- ✅ Check estimated bytes before running
- ✅ Use table preview in the console instead of SELECT * for quick looks
For Better Performance ⚡
- ✅ Filter early - apply WHERE before JOINs
- ✅ Put largest table first in JOINs
- ✅ Use ORDER BY only in the outermost query, not on intermediate results
- ✅ Consider approximate functions (APPROX_COUNT_DISTINCT) when exact precision isn't needed
- ✅ Avoid JavaScript UDFs when possible
- ✅ Don't over-partition (keep partitions > 1GB)
Quick Reference: Common SQL Patterns for Beginners 📝
Here are the most useful BigQuery SQL commands you'll need:
Basic Queries
-- Count all records in a table
SELECT COUNT(*) FROM `project.dataset.table`;
-- Count distinct values in a column
SELECT COUNT(DISTINCT vendor_id) FROM `project.dataset.taxi`;
-- Get first 10 rows (but remember - this still scans the whole table!)
SELECT * FROM `project.dataset.taxi` LIMIT 10;
-- Better way to preview - use table preview in BigQuery console instead!
Filtering Data
-- Filter by exact value
SELECT * FROM `project.dataset.taxi`
WHERE vendor_id = 1;
-- Filter by date range
SELECT * FROM `project.dataset.taxi`
WHERE DATE(pickup_datetime) BETWEEN '2024-01-01' AND '2024-01-31';
-- Filter with multiple conditions
SELECT * FROM `project.dataset.taxi`
WHERE vendor_id = 1
AND fare_amount > 10
AND DATE(pickup_datetime) = '2024-03-15';
Aggregations
-- Sum, average, min, max
SELECT
SUM(fare_amount) as total_fares,
AVG(fare_amount) as avg_fare,
MIN(fare_amount) as min_fare,
MAX(fare_amount) as max_fare,
COUNT(*) as trip_count
FROM `project.dataset.taxi`;
-- Group by
SELECT
vendor_id,
COUNT(*) as trips,
AVG(fare_amount) as avg_fare
FROM `project.dataset.taxi`
GROUP BY vendor_id;
Creating Tables
-- Create external table from GCS
CREATE OR REPLACE EXTERNAL TABLE `project.dataset.taxi_external`
OPTIONS (
format = 'PARQUET',
uris = ['gs://bucket-name/folder/*.parquet']
);
-- Create native table from external
CREATE OR REPLACE TABLE `project.dataset.taxi_native` AS
SELECT * FROM `project.dataset.taxi_external`;
-- Create partitioned + clustered table
CREATE OR REPLACE TABLE `project.dataset.taxi_optimized`
PARTITION BY DATE(pickup_datetime)
CLUSTER BY vendor_id
AS SELECT * FROM `project.dataset.taxi_external`;
Checking Table Info
-- View partition information
SELECT
table_name,
partition_id,
total_rows
FROM `dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'your_table_name'
ORDER BY partition_id;
-- Check table schema
SELECT column_name, data_type
FROM `dataset.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'your_table_name';
Glossary for Beginners 📚
| Term | Simple Explanation |
|---|---|
| Data Warehouse | A big database designed for analyzing historical data, not for running apps |
| OLTP | Databases for running applications (fast, small transactions) |
| OLAP | Databases for analysis (complex queries, lots of data) |
| BigQuery | Google's cloud data warehouse service |
| GCS | Google Cloud Storage - where you store files in the cloud |
| External Table | A table that reads data from GCS without copying it |
| Native Table | A table with data stored in BigQuery itself |
| Partitioning | Splitting a table into smaller pieces by date or number |
| Clustering | Sorting data within partitions by specific columns |
| Columnar Storage | Storing data by column instead of row (faster for analytics) |
| Slot | A unit of compute power in BigQuery |
| Data Scanned | How much data BigQuery reads to answer your query (you pay for this!) |
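The "Columnar Storage" entry above is worth making concrete, since it's the reason SELECT * is so expensive. A toy sketch (again, a mental model, not BigQuery's real format):

```python
# The same three trips in a row layout and a columnar layout.
rows = [
    {"vendor_id": 1, "fare": 12.5, "tip": 2.0},
    {"vendor_id": 2, "fare": 9.0, "tip": 1.5},
    {"vendor_id": 1, "fare": 22.0, "tip": 4.0},
]
# Columnar layout: one contiguous array per column.
columns = {key: [r[key] for r in rows] for key in rows[0]}

# SELECT AVG(fare): a row store touches every field of every row...
row_values_touched = sum(len(r) for r in rows)
# ...while a column store reads only the 'fare' array.
col_values_touched = len(columns["fare"])
avg_fare = sum(columns["fare"]) / len(columns["fare"])

print(row_values_touched, col_values_touched)  # 9 values vs 3
```

With 3 columns the column store reads a third of the data; with a 20-column taxi table and a one-column query, it reads a twentieth. That's the same answer for a fraction of the bytes scanned.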
Common Mistakes to Avoid ⚠️
1. Using SELECT * everywhere - Always specify the columns you need!
2. Thinking LIMIT reduces cost - It doesn't! BigQuery scans first, limits after.
3. Not using partitions - Always partition time-series data by date.
4. Wrong partition column - Partition by columns you ALWAYS filter on.
5. Too many partitions - Keep it under 4,000 and aim for >1GB per partition.
6. Ignoring the query validator - Always check estimated bytes before running!
7. Not using clustering with partitioning - They work best together!
Resources for Learning More 📖
- 📊 BigQuery Official Documentation
- 🎥 DE Zoomcamp Video: Data Warehouse and BigQuery
- 🎥 DE Zoomcamp Video: Partitioning vs Clustering
- 🎥 DE Zoomcamp Video: Best Practices
- 🎥 DE Zoomcamp Video: Internals of BigQuery
- 📝 Course SQL Examples
- 📑 Course Slides
Summary: Key Takeaways 🎯
Data warehouses are for analysis, not running apps - that's why they exist!
BigQuery is serverless - no servers to manage, just write SQL.
Columnar storage = only reads columns you request = faster + cheaper.
External tables = data in GCS, slower but flexible.
Native tables = data in BigQuery, faster but costs more storage.
Partitioning = split table by date, only scan relevant dates.
Clustering = sort data within partitions, skip irrelevant blocks.
Always check estimated bytes before running queries!
Never use SELECT * - specify only the columns you need.
Combine partitioning + clustering for maximum optimization!