What is BigQuery?
BigQuery is Google's data warehouse in the cloud. It's one of the most popular choices for storing and analyzing large amounts of data because it's:
Serverless - You don't manage any servers. No installing software, no worrying about disk space, no maintenance. Google handles everything.
Fully managed - Google takes care of security, backups, scaling, and updates.
Petabyte-scale - Can handle absolutely massive datasets (1 petabyte = 1,000 terabytes = 1,000,000 gigabytes!)
SQL-based - You just write SQL queries. No need to learn a new programming language!
Why BigQuery is Great for Beginners 🌟
- ☁️ No setup headaches - Create a project, load data, start querying. That's it!
- 💰 Free tier - 1TB of queries and 10GB storage free per month
- 📊 Familiar SQL - If you know basic SQL, you can use BigQuery
- 🔗 Works with everything - Google Sheets, Looker Studio (formerly Data Studio), Python, R, etc.
- 🤖 Built-in ML - Train machine learning models using just SQL!
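To see how little setup is needed, here's a query you could paste straight into the BigQuery console against one of Google's free public datasets (no loading required):

```sql
-- Count Shakespeare's most-used words from a free public dataset
SELECT word, SUM(word_count) AS total
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY word
ORDER BY total DESC
LIMIT 10;
```

That's the whole workflow: write SQL, press Run. No cluster to spin up, no drivers to install.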
How BigQuery Works Under the Hood 🔧
Understanding the architecture helps you write better queries and save money. Don't worry, I'll keep it simple!
The Secret: Separation of Storage and Compute
Traditional databases store data and process queries on the same machine. BigQuery does something clever - it separates them:
┌─────────────────────────────────────────────────────────┐
│                     YOUR SQL QUERY                      │
└─────────────────────────┬───────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│                 DREMEL (Compute Engine)                 │
│                                                         │
│     Your query gets broken into tiny pieces and         │
│     thousands of workers process them in parallel       │
└─────────────────────────┬───────────────────────────────┘
                          │
                          │  Jupiter Network (super fast!)
                          │  ~1 petabit per second of bisection bandwidth
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│                   COLOSSUS (Storage)                    │
│                                                         │
│        Your data lives here in COLUMNAR format          │
│            (organized by columns, not rows)             │
└─────────────────────────────────────────────────────────┘
What Does "Columnar Storage" Mean? 📋
This is SUPER important for understanding BigQuery performance!
Traditional databases (row-oriented):
Stores data like this:
Row 1: [John, 25, New York, $50000]
Row 2: [Jane, 30, Chicago, $60000]
Row 3: [Bob, 35, Miami, $55000]
To find all salaries, it reads EVERY row, even though you only need one column.
BigQuery (column-oriented):
Stores data like this:
Names column: [John, Jane, Bob]
Ages column: [25, 30, 35]
Cities column: [New York, Chicago, Miami]
Salaries column: [$50000, $60000, $55000]
To find all salaries, it ONLY reads the salary column! Much faster and cheaper!
💡 This is why SELECT * is expensive in BigQuery - it has to read EVERY column. Always specify only the columns you need!
The Dremel Execution Engine 🚀
When you run a query, here's what happens:
- Root Server receives your query
- Query is broken into smaller pieces
- Mixers distribute work to thousands of Leaf Nodes
- Each Leaf Node processes a small chunk of data in parallel
- Results flow back up through Mixers to Root
- You get your final result!
                  ┌──────────┐
                  │   ROOT   │  ← Your query comes here
                  └────┬─────┘
                       │
        ┌──────────────┼──────────────┐
        ▼              ▼              ▼
    ┌────────┐     ┌────────┐     ┌────────┐
    │ MIXER  │     │ MIXER  │     │ MIXER  │
    └───┬────┘     └───┬────┘     └───┬────┘
        │              │              │
    ┌───┼───┐      ┌───┼───┐      ┌───┼───┐
    ▼   ▼   ▼      ▼   ▼   ▼      ▼   ▼   ▼
   [L] [L] [L]    [L] [L] [L]    [L] [L] [L]

L = Leaf nodes (thousands of them!)
Why this matters: A query that would take hours on your laptop can run in seconds because thousands of machines work on it simultaneously!
External Tables vs Native Tables 📦
You have two ways to work with data in BigQuery:
Option 1: External Tables (Data stays in GCS)
Your data remains in Google Cloud Storage, BigQuery just reads it when you query.
-- Create an external table pointing to files in a GCS bucket
CREATE OR REPLACE EXTERNAL TABLE `my-project.my_dataset.taxi_external`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-bucket/taxi_data/*.parquet']
);
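Once created, you query it like any other table (project, dataset, and bucket names here are the illustrative ones from the example above). BigQuery reads the Parquet files from GCS at query time:

```sql
-- BigQuery fetches the data from GCS on every run
SELECT COUNT(*) AS trip_count
FROM `my-project.my_dataset.taxi_external`;
```

Note that for external tables the console can't show an upfront "bytes processed" estimate, which matters for the cost discussion below.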
When to use External Tables:
- ✅ You want to save on storage costs (GCS is cheaper than BigQuery storage)
- ✅ One-time or occasional analysis
- ✅ Data is updated frequently in source system
- ✅ Quick exploration before committing to load
Downsides:
- ❌ Slower queries (data needs to be read from GCS each time)
- ❌ No cost estimation before running queries
- ❌ Can't partition or cluster (limited optimization)
Option 2: Native Tables (Data loaded into BigQuery)
Data is copied into BigQuery's own storage (Colossus).
-- Create native table from external table
CREATE OR REPLACE TABLE `my-project.my_dataset.taxi_native` AS
SELECT * FROM `my-project.my_dataset.taxi_external`;
When to use Native Tables:
- ✅ Frequently queried data
- ✅ Need best query performance
- ✅ Want to use partitioning and clustering
- ✅ Need accurate cost estimates before running queries
Downsides:
- ❌ Higher storage costs
- ❌ Data duplication (exists in both GCS and BigQuery)
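Since native tables support partitioning (covered in Part 3), here's a sketch of loading the same data partitioned by pickup date. The `pickup_datetime` column is an assumption about the taxi schema; substitute whatever timestamp column your data actually has:

```sql
-- Load into a native table partitioned by day of pickup
-- (pickup_datetime is a hypothetical column name)
CREATE OR REPLACE TABLE `my-project.my_dataset.taxi_partitioned`
PARTITION BY DATE(pickup_datetime) AS
SELECT * FROM `my-project.my_dataset.taxi_external`;
```

With this in place, queries that filter on pickup date only scan the matching partitions instead of the whole table.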
💡 Pro tip: Start with external tables for exploration, then load into native tables once you know what data you actually need!
Understanding BigQuery Costs 💰
BigQuery has two main pricing models:
On-Demand Pricing (Pay per query)
- $5 per TB of data scanned (the first 1 TB each month is free; rates change, so check current pricing)
- Good for: Occasional users, unpredictable workloads
- You pay for how much data your queries read
Flat-Rate Pricing (Monthly commitment)
- ~$2,000/month for 100 "slots" (dedicated compute units)
- Good for: Heavy users, predictable workloads
- Unlimited queries within your slot capacity
- Rule of thumb: at $5/TB on-demand, 100 slots break even around 400 TB scanned per month
How to Estimate Query Cost 🧮
Before running a query, BigQuery shows you how much data it will scan:
┌────────────────────────────────────────────────┐
│ Query Editor                                   │
│ ────────────────────────────────────────────── │
│ SELECT * FROM my_table WHERE date = '2024-01'  │
│                                                │
│ [This query will process 2.5 GB when run]      │  ← Check this!
└────────────────────────────────────────────────┘
Cost calculation:
- 2.5 GB = 0.0025 TB
- 0.0025 TB × $5 = $0.0125 (about 1 cent)
But if you run that query 100 times a day... costs add up!
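To see what your queries have actually been scanning, you can query BigQuery's own job metadata. The `JOBS_BY_PROJECT` view lives under a region qualifier; `region-us` below is an assumption, so adjust it to wherever your data lives:

```sql
-- Rough on-demand spend over the last 7 days, at $5 per TB billed
SELECT
  SUM(total_bytes_billed) / POW(2, 40) AS tb_billed,
  SUM(total_bytes_billed) / POW(2, 40) * 5 AS approx_usd
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY';
```

This is a handy sanity check before those 100-queries-a-day habits turn into a surprise bill.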
Cost Optimization Tips 💡
- NEVER use SELECT * unless you absolutely need every column
-- ❌ Bad - reads ALL columns
SELECT * FROM taxi_data;
-- ✅ Good - reads only what you need
SELECT pickup_time, dropoff_time, fare_amount FROM taxi_data;
- Use partitioned tables (covered in Part 3)
- Preview before running - Always check the estimated bytes
- Use LIMIT wisely - It doesn't reduce data scanned! LIMIT is applied AFTER the data is read.
-- ❌ Still scans the whole table!
SELECT * FROM huge_table LIMIT 10;
-- ✅ Better - filter on a partitioned or clustered column so less data is read
SELECT * FROM huge_table WHERE date = CURRENT_DATE() LIMIT 10;
- Cache results - BigQuery caches query results for 24 hours (free!)
BigQuery Caching 🗄️
When you run the same query twice:
- First run: Scans data, costs money
- Second run: Returns cached result, FREE!
Cache is invalidated when:
- Underlying table data changes
- 24 hours pass
- You disable caching in query settings
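You can verify cache behavior yourself through the same job metadata view used earlier (again, `region-us` is an assumption; use your dataset's region):

```sql
-- Which recent queries were served from cache? Cached runs bill 0 bytes.
SELECT query, cache_hit, total_bytes_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY creation_time DESC
LIMIT 20;
```

Rows where `cache_hit` is true are the free reruns.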