Abdelrahman Adnan

Part 3: Partitioning & Clustering for Performance 🚀

This is where BigQuery optimization gets really powerful! These two techniques can reduce your query costs by 90% or more.

The Problem We're Solving 🤔

Imagine you have a table with 5 years of taxi trip data - about 500 million rows. Every time you query:

SELECT * FROM taxi_trips WHERE pickup_date = '2024-01-15';

Without optimization, BigQuery scans ALL 500 million rows just to find trips from one day. That's:

  • Slow (lots of data to read)
  • Expensive (you pay for all data scanned)
  • Wasteful (you only needed 0.05% of the data!)

Partitioning and clustering solve this problem!
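
To put rough numbers on this, here is a back-of-the-envelope sketch you can run in BigQuery itself. The ~200 bytes per row and the $5/TB on-demand rate are assumptions for illustration, not measured values:

```sql
-- Back-of-the-envelope: full-scan cost vs. one day's slice of 5 years
-- (assumes ~200 bytes/row and a $5/TB on-demand rate; illustrative only)
SELECT
  500000000 * 200 / POW(1024, 4)                 AS tb_scanned,
  500000000 * 200 / POW(1024, 4) * 5             AS full_scan_cost_usd,
  500000000 * 200 / POW(1024, 4) * 5 / (5 * 365) AS one_day_cost_usd;
```

Even at these small absolute numbers, paying for the whole table when you need one day adds up fast across hundreds of queries per day.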

Partitioning: Dividing Your Table into Sections 📁

Think of partitioning like organizing a filing cabinet. Instead of one giant drawer with all documents, you have:

  • Drawer for January
  • Drawer for February
  • Drawer for March
  • ...and so on

When you need something from March, you ONLY open the March drawer!

How Partitioning Works in BigQuery

-- Create a table partitioned by date
CREATE OR REPLACE TABLE `project.dataset.taxi_partitioned`
PARTITION BY DATE(pickup_datetime) AS
SELECT * FROM `project.dataset.taxi_external`;

Now your table looks like this internally:

taxi_partitioned/
├── 2024-01-01/    (all trips from Jan 1)
├── 2024-01-02/    (all trips from Jan 2)
├── 2024-01-03/    (all trips from Jan 3)
│   ... 
├── 2024-06-30/    (all trips from Jun 30)
└── [metadata]

When you query with a date filter:

SELECT * FROM taxi_partitioned 
WHERE DATE(pickup_datetime) = '2024-03-15';

BigQuery ONLY reads the 2024-03-15 partition! The other 180+ partitions are never touched.

Types of Partitioning

1. Time-based partitioning (most common)

-- Partition by day (the default granularity)
PARTITION BY DATE(pickup_datetime)

-- Partition by month (for less granular data)
-- (use DATE_TRUNC instead if pickup_datetime is a DATE column)
PARTITION BY TIMESTAMP_TRUNC(pickup_datetime, MONTH)

-- Partition by year
PARTITION BY TIMESTAMP_TRUNC(pickup_datetime, YEAR)

2. Integer range partitioning

-- Partition by customer ID ranges
PARTITION BY RANGE_BUCKET(customer_id, GENERATE_ARRAY(0, 1000000, 10000))
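
To see which bucket a value lands in, note that RANGE_BUCKET returns the number of boundaries less than or equal to the point. A quick sketch using the same boundary array as above:

```sql
-- RANGE_BUCKET(point, boundaries) counts boundaries <= point
SELECT
  RANGE_BUCKET(-5,    GENERATE_ARRAY(0, 1000000, 10000)) AS below_first,   -- 0
  RANGE_BUCKET(4321,  GENERATE_ARRAY(0, 1000000, 10000)) AS first_bucket,  -- 1
  RANGE_BUCKET(25000, GENERATE_ARRAY(0, 1000000, 10000)) AS third_bucket;  -- 3
```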

3. Ingestion time partitioning

-- Partition by when data was loaded
PARTITION BY _PARTITIONDATE
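
With ingestion-time partitioning there is no partition column in your schema; instead you filter on the _PARTITIONDATE (or _PARTITIONTIME) pseudo-column. A sketch, with a hypothetical table name:

```sql
-- Read only yesterday's ingested partition (table name is illustrative)
SELECT *
FROM `project.dataset.events_ingested`
WHERE _PARTITIONDATE = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);
```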

Partitioning Rules to Remember ⚠️

| Rule | Details |
| --- | --- |
| Max partitions | 4,000 per table |
| Min partition size | Aim for at least 1 GB per partition |
| One column only | You can partition on only ONE column |
| Column types | DATE, TIMESTAMP, DATETIME, or INTEGER |

💡 When NOT to use partitioning:

  • If you'd have < 1GB per partition (use clustering instead)
  • If you'd exceed 4,000 partitions
  • If you rarely filter on the partition column
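
You can check whether your partitions meet the ~1 GB guideline using INFORMATION_SCHEMA. A sketch (the dataset and table names are placeholders):

```sql
-- Inspect partition sizes; lots of tiny partitions suggests
-- you partitioned too finely
SELECT
  partition_id,
  total_rows,
  ROUND(total_logical_bytes / POW(1024, 3), 2) AS size_gb
FROM `dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'taxi_partitioned'
ORDER BY partition_id;
```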

Clustering: Organizing Data Within Partitions 🗂️

If partitioning is like having separate drawers in a filing cabinet, clustering is like organizing the folders WITHIN each drawer alphabetically.

How Clustering Works

-- Create table with partitioning AND clustering
CREATE OR REPLACE TABLE `project.dataset.taxi_optimized`
PARTITION BY DATE(pickup_datetime)
CLUSTER BY vendor_id, payment_type AS
SELECT * FROM `project.dataset.taxi_external`;

Now within each date partition, data is sorted by vendor_id, then by payment_type:

taxi_optimized/
├── 2024-01-15/
│   ├── vendor_id=1, payment_type=1, ...
│   ├── vendor_id=1, payment_type=2, ...
│   ├── vendor_id=2, payment_type=1, ...
│   └── vendor_id=2, payment_type=2, ...
├── 2024-01-16/
│   └── (similarly organized)

When you query:

SELECT * FROM taxi_optimized 
WHERE DATE(pickup_datetime) = '2024-01-15'
  AND vendor_id = 1;

BigQuery:

  1. Goes directly to the 2024-01-15 partition (thanks to partitioning)
  2. Reads only the vendor_id=1 blocks (thanks to clustering)

Even more data skipped = even faster and cheaper!

Clustering Rules 📏

| Rule | Details |
| --- | --- |
| Max columns | Up to 4 clustering columns |
| Order matters | Put the most frequently filtered column first |
| No cost for re-clustering | BigQuery automatically re-clusters as data is added |
| Works with partitioning | Best used together! |
| Minimum table size | Most effective for tables > 1 GB |

Good clustering column candidates:

  • Columns you frequently filter on (WHERE clause)
  • Columns you frequently group by (GROUP BY clause)
  • High-cardinality columns (many distinct values)

Partitioning vs Clustering: When to Use What? 🤷

| Scenario | Recommendation |
| --- | --- |
| Always filter by date | Partition by date |
| Filter by date AND other columns | Partition by date, cluster by the other columns |
| Filter by multiple non-date columns | Cluster by those columns |
| Need to know query cost upfront | Use partitioning (with clustering, the estimate is only an upper bound) |
| Less than 1 GB per potential partition | Use clustering instead |
| Would have > 4,000 partitions | Use clustering instead |
| Data is rarely filtered | Maybe neither; analyze your query patterns first |

Real-World Performance Comparison 📊

I ran tests on the NYC taxi dataset (about 20 million rows). Here are the results:

Test 1: Filtering by Date Range

SELECT DISTINCT vendor_id 
FROM [table] 
WHERE DATE(pickup_datetime) BETWEEN '2024-03-01' AND '2024-03-15';
| Table Type | Data Scanned | Cost |
| --- | --- | --- |
| Non-partitioned | 310 MB | $0.00155 |
| Partitioned by date | 27 MB | $0.000135 |
| Savings | 91% less! | 91% cheaper! |

Test 2: Filtering by Date AND Vendor

SELECT COUNT(*) 
FROM [table] 
WHERE DATE(pickup_datetime) BETWEEN '2024-06-01' AND '2024-06-30'
  AND vendor_id = 1;
| Table Type | Data Scanned |
| --- | --- |
| Partitioned only | 1.1 GB |
| Partitioned + Clustered | 865 MB |
| Additional savings | 21% less! |

Combined savings: Over 90% reduction in costs! 💰
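
The percentages are easy to sanity-check from the scanned-bytes figures above (reading "1.1 GB" as 1,100 MB):

```sql
-- Sanity-check the savings percentages quoted above
SELECT
  ROUND((1 - 27 / 310) * 100)   AS test1_savings_pct,  -- 91
  ROUND((1 - 865 / 1100) * 100) AS test2_savings_pct;  -- 21
```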

Step-by-Step: Creating an Optimized Table

Here's the full workflow:

-- Step 1: Create external table pointing to your data in GCS
CREATE OR REPLACE EXTERNAL TABLE `my-project.dataset.taxi_external`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-bucket/yellow_taxi_2024/*.parquet']
);

-- Step 2: Check how many records we have
SELECT COUNT(*) FROM `my-project.dataset.taxi_external`;
-- Result: 20,332,093 records

-- Step 3: Create optimized table with partitioning and clustering
CREATE OR REPLACE TABLE `my-project.dataset.taxi_optimized`
PARTITION BY DATE(tpep_dropoff_datetime)
CLUSTER BY VendorID AS
SELECT * FROM `my-project.dataset.taxi_external`;

-- Step 4: Verify partitions were created
SELECT 
  table_name, 
  partition_id, 
  total_rows
FROM `dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'taxi_optimized'
ORDER BY partition_id;

Best Practices Cheat Sheet ✅

For Reducing Costs 💵

  • ❌ Never use SELECT *
  • ✅ Only query columns you need
  • ✅ Use partitioned tables
  • ✅ Add clustering for frequently filtered columns
  • ✅ Check estimated bytes before running
  • ✅ Use the free table preview in the console instead of a SELECT for quick looks

For Better Performance ⚡

  • ✅ Filter early - apply WHERE before JOINs
  • ✅ Put largest table first in JOINs
  • ✅ Use ORDER BY at the end of query
  • ✅ Consider approximate functions (APPROX_COUNT_DISTINCT) when exact precision isn't needed
  • ✅ Avoid JavaScript UDFs when possible
  • ✅ Don't over-partition (keep partitions > 1GB)
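
As a quick illustration of the approximate-function tip above, APPROX_COUNT_DISTINCT trades exact precision for a much cheaper aggregation. A sketch against the taxi table used throughout this post:

```sql
-- Exact (heavier) vs. approximate (cheaper) distinct counts
SELECT
  COUNT(DISTINCT vendor_id)        AS exact_vendors,
  APPROX_COUNT_DISTINCT(vendor_id) AS approx_vendors
FROM `project.dataset.taxi`;
```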

Quick Reference: Common SQL Patterns for Beginners 📝

Here are the most useful BigQuery SQL commands you'll need:

Basic Queries

-- Count all records in a table
SELECT COUNT(*) FROM `project.dataset.table`;

-- Count distinct values in a column
SELECT COUNT(DISTINCT vendor_id) FROM `project.dataset.taxi`;

-- Get first 10 rows (but remember - this still scans the whole table!)
SELECT * FROM `project.dataset.taxi` LIMIT 10;

-- Better way to preview - use table preview in BigQuery console instead!

Filtering Data

-- Filter by exact value
SELECT * FROM `project.dataset.taxi`
WHERE vendor_id = 1;

-- Filter by date range
SELECT * FROM `project.dataset.taxi`
WHERE DATE(pickup_datetime) BETWEEN '2024-01-01' AND '2024-01-31';

-- Filter with multiple conditions
SELECT * FROM `project.dataset.taxi`
WHERE vendor_id = 1 
  AND fare_amount > 10
  AND DATE(pickup_datetime) = '2024-03-15';

Aggregations

-- Sum, average, min, max
SELECT 
  SUM(fare_amount) as total_fares,
  AVG(fare_amount) as avg_fare,
  MIN(fare_amount) as min_fare,
  MAX(fare_amount) as max_fare,
  COUNT(*) as trip_count
FROM `project.dataset.taxi`;

-- Group by
SELECT 
  vendor_id,
  COUNT(*) as trips,
  AVG(fare_amount) as avg_fare
FROM `project.dataset.taxi`
GROUP BY vendor_id;

Creating Tables

-- Create external table from GCS
CREATE OR REPLACE EXTERNAL TABLE `project.dataset.taxi_external`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://bucket-name/folder/*.parquet']
);

-- Create native table from external
CREATE OR REPLACE TABLE `project.dataset.taxi_native` AS
SELECT * FROM `project.dataset.taxi_external`;

-- Create partitioned + clustered table
CREATE OR REPLACE TABLE `project.dataset.taxi_optimized`
PARTITION BY DATE(pickup_datetime)
CLUSTER BY vendor_id
AS SELECT * FROM `project.dataset.taxi_external`;

Checking Table Info

-- View partition information
SELECT 
  table_name, 
  partition_id, 
  total_rows
FROM `dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'your_table_name'
ORDER BY partition_id;

-- Check table schema
SELECT column_name, data_type 
FROM `dataset.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'your_table_name';

Glossary for Beginners 📚

| Term | Simple Explanation |
| --- | --- |
| Data Warehouse | A big database designed for analyzing historical data, not for running apps |
| OLTP | Databases for running applications (fast, small transactions) |
| OLAP | Databases for analysis (complex queries, lots of data) |
| BigQuery | Google's cloud data warehouse service |
| GCS | Google Cloud Storage, where you store files in the cloud |
| External Table | A table that reads data from GCS without copying it |
| Native Table | A table whose data is stored in BigQuery itself |
| Partitioning | Splitting a table into smaller pieces by date or number |
| Clustering | Sorting data within partitions by specific columns |
| Columnar Storage | Storing data by column instead of row (faster for analytics) |
| Slot | A unit of compute power in BigQuery |
| Data Scanned | How much data BigQuery reads to answer your query (you pay for this!) |

Common Mistakes to Avoid ⚠️

  1. Using SELECT * everywhere - Always specify columns you need!

  2. Thinking LIMIT reduces cost - It doesn't! BigQuery scans first, limits after.

  3. Not using partitions - Always partition time-series data by date.

  4. Wrong partition column - Partition by columns you ALWAYS filter on.

  5. Too many partitions - Keep it under 4,000, aim for >1GB per partition.

  6. Ignoring the query validator - Always check estimated bytes before running!

  7. Not using clustering with partitioning - They work best together!


Summary: Key Takeaways 🎯

  1. Data warehouses are for analysis, not running apps - that's why they exist!

  2. BigQuery is serverless - no servers to manage, just write SQL.

  3. Columnar storage = only reads columns you request = faster + cheaper.

  4. External tables = data in GCS, slower but flexible.

  5. Native tables = data in BigQuery, faster but costs more storage.

  6. Partitioning = split table by date, only scan relevant dates.

  7. Clustering = sort data within partitions, skip irrelevant blocks.

  8. Always check estimated bytes before running queries!

  9. Never use SELECT * - specify only the columns you need.

  10. Combine partitioning + clustering for maximum optimization!


#DataEngineeringZoomcamp #BigQuery #DataWarehouse #GCP #SQL #CloudComputing
