DEV Community

Abdelrahman Adnan

Part 2: BigQuery Deep Dive 🔍

What is BigQuery?

BigQuery is Google's data warehouse in the cloud. It's one of the most popular choices for storing and analyzing large amounts of data because it's:

  1. Serverless - You don't manage any servers. No installing software, no worrying about disk space, no maintenance. Google handles everything.

  2. Fully managed - Google takes care of security, backups, scaling, and updates.

  3. Petabyte-scale - Can handle absolutely massive datasets (1 petabyte = 1,000 terabytes = 1,000,000 gigabytes!)

  4. SQL-based - You just write SQL queries. No need to learn a new programming language!

Why BigQuery is Great for Beginners 🌟

  • ☁️ No setup headaches - Create a project, load data, start querying. That's it!
  • 💰 Free tier - 1 TB of query processing and 10 GB of storage free every month
  • 📊 Familiar SQL - If you know basic SQL, you can use BigQuery
  • 🔗 Works with everything - Google Sheets, Data Studio, Python, R, etc.
  • 🤖 Built-in ML - Train machine learning models using just SQL!
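
That last bullet deserves a quick illustration: with BigQuery ML you can train a model in a single SQL statement. Here's a minimal sketch (the project, dataset, table, and column names are placeholders, not from this post):

```sql
-- Train a linear regression model to predict taxi fares, entirely in SQL.
-- `my-project.my_dataset.taxi_native` and its columns are hypothetical names.
CREATE OR REPLACE MODEL `my-project.my_dataset.fare_model`
OPTIONS (
  model_type = 'linear_reg',
  input_label_cols = ['fare_amount']  -- the column the model learns to predict
) AS
SELECT trip_distance, passenger_count, fare_amount
FROM `my-project.my_dataset.taxi_native`;
```

Once trained, you can get predictions from the model with `ML.PREDICT` in an ordinary SELECT statement.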

How BigQuery Works Under the Hood 🔧

Understanding the architecture helps you write better queries and save money. Don't worry, I'll keep it simple!

The Secret: Separation of Storage and Compute

Traditional databases store data and process queries on the same machine. BigQuery does something clever - it separates them:

┌─────────────────────────────────────────────────────────┐
│                    YOUR SQL QUERY                        │
└─────────────────────────┬───────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│                DREMEL (Compute Engine)                   │
│                                                         │
│   Your query gets broken into tiny pieces and           │
│   thousands of workers process them in parallel         │
└─────────────────────────┬───────────────────────────────┘
                          │
                          │  Jupiter Network (super fast!)
                          │  ~1 Petabit/sec bisection bandwidth
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│                 COLOSSUS (Storage)                       │
│                                                         │
│   Your data lives here in COLUMNAR format               │
│   (organized by columns, not rows)                      │
└─────────────────────────────────────────────────────────┘

What Does "Columnar Storage" Mean? 📋

This is SUPER important for understanding BigQuery performance!

Traditional databases (row-oriented):
Stores data like this:

Row 1: [John, 25, New York, $50000]
Row 2: [Jane, 30, Chicago, $60000]
Row 3: [Bob, 35, Miami, $55000]

To find all salaries, it reads EVERY row, even though you only need one column.

BigQuery (column-oriented):
Stores data like this:

Names column:    [John, Jane, Bob]
Ages column:     [25, 30, 35]
Cities column:   [New York, Chicago, Miami]
Salaries column: [$50000, $60000, $55000]

To find all salaries, it ONLY reads the salary column! Much faster and cheaper!

💡 This is why SELECT * is expensive in BigQuery - it has to read EVERY column. Always specify only the columns you need!
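
To see the columnar effect in practice, compare the byte estimate BigQuery shows for a single-column query against `SELECT *` on the same table (the table name below is a placeholder):

```sql
-- Only the fare_amount column is read from Colossus, so bytes billed
-- is roughly the size of that one column, not of the whole table
SELECT AVG(fare_amount) AS avg_fare
FROM `my-project.my_dataset.taxi_native`;
```

On a wide table this single-column query can scan an order of magnitude fewer bytes than `SELECT *`, which is exactly the saving columnar storage is designed for.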

The Dremel Execution Engine 🚀

When you run a query, here's what happens:

  1. Root Server receives your query
  2. Query is broken into smaller pieces
  3. Mixers distribute work to thousands of Leaf Nodes
  4. Each Leaf Node processes a small chunk of data in parallel
  5. Results flow back up through Mixers to Root
  6. You get your final result!
                    ┌──────────┐
                    │   ROOT   │  ← Your query comes here
                    └────┬─────┘
                         │
           ┌─────────────┼─────────────┐
           ▼             ▼             ▼
      ┌────────┐    ┌────────┐    ┌────────┐
      │ MIXER  │    │ MIXER  │    │ MIXER  │
      └───┬────┘    └───┬────┘    └───┬────┘
          │             │             │
    ┌─────┼─────┐ ┌─────┼─────┐ ┌─────┼─────┐
    ▼     ▼     ▼ ▼     ▼     ▼ ▼     ▼     ▼
   [L]   [L]   [L][L]   [L]   [L][L]   [L]   [L]

   L = Leaf nodes (thousands of them!)

Why this matters: A query that would take hours on your laptop can run in seconds because thousands of machines work on it simultaneously!

External Tables vs Native Tables 📦

You have two ways to work with data in BigQuery:

Option 1: External Tables (Data stays in GCS)

Your data remains in Google Cloud Storage; BigQuery reads it directly from GCS each time you query.

-- Create external table pointing to files in GCS bucket
CREATE OR REPLACE EXTERNAL TABLE `my-project.my_dataset.taxi_external`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-bucket/taxi_data/*.parquet']
);

When to use External Tables:

  • ✅ You want to save on storage costs (GCS is cheaper than BigQuery storage)
  • ✅ One-time or occasional analysis
  • ✅ Data is updated frequently in source system
  • ✅ Quick exploration before committing to load

Downsides:

  • ❌ Slower queries (data needs to be read from GCS each time)
  • ❌ No cost estimation before running queries
  • ❌ Can't partition or cluster (limited optimization)

Option 2: Native Tables (Data loaded into BigQuery)

Data is copied into BigQuery's own storage (Colossus).

-- Create native table from external table
CREATE OR REPLACE TABLE `my-project.my_dataset.taxi_native` AS
SELECT * FROM `my-project.my_dataset.taxi_external`;

When to use Native Tables:

  • ✅ Frequently queried data
  • ✅ Need best query performance
  • ✅ Want to use partitioning and clustering
  • ✅ Need accurate cost estimates before running queries

Downsides:

  • ❌ Higher storage costs
  • ❌ Data duplication (exists in both GCS and BigQuery)

💡 Pro tip: Start with external tables for exploration, then load into native tables once you know what data you actually need!
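
Putting that pro tip into SQL, the two-step workflow might look like this sketch (it builds on the external table created earlier; the native table name and column list are placeholders):

```sql
-- Step 1: explore the external table cheaply, e.g. inspect its schema
SELECT column_name, data_type
FROM `my-project.my_dataset`.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'taxi_external';

-- Step 2: load only the columns you actually need into a native table,
-- which then supports partitioning, clustering, and cost estimates
CREATE OR REPLACE TABLE `my-project.my_dataset.taxi_core` AS
SELECT pickup_time, dropoff_time, fare_amount
FROM `my-project.my_dataset.taxi_external`;
```

Selecting only the needed columns at load time keeps the native table (and its storage bill) as small as possible.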

Understanding BigQuery Costs 💰

BigQuery has two main pricing models:

On-Demand Pricing (Pay per query)

  • $5 per TB of data scanned
  • Good for: Occasional users, unpredictable workloads
  • You pay for how much data your queries read

Flat-Rate Pricing (Monthly commitment)

  • ~$2,000/month for 100 "slots" (compute units)
  • Good for: Heavy users, predictable workloads
  • Unlimited queries within your slot capacity

How to Estimate Query Cost 🧮

Before running a query, BigQuery shows you how much data it will scan:

┌────────────────────────────────────────────────────┐
│  Query Editor                                      │
│  ────────────────────────────────────────────────  │
│  SELECT * FROM my_table WHERE date = '2024-01-01'  │
│                                                    │
│  [This query will process 2.5 GB when run]         │ ← Check this!
└────────────────────────────────────────────────────┘

Cost calculation:

  • 2.5 GB = 0.0025 TB
  • 0.0025 TB × $5 = $0.0125 (about 1 cent)

But if you run that query 100 times a day, that's about $1.25 a day, or roughly $37.50 a month, for a single query. Costs add up!

Cost Optimization Tips 💡

  1. NEVER use SELECT * unless you absolutely need every column
   -- ❌ Bad - reads ALL columns
   SELECT * FROM taxi_data;

   -- ✅ Good - reads only what you need
   SELECT pickup_time, dropoff_time, fare_amount FROM taxi_data;
  2. Use partitioned tables (covered in Part 3)

  3. Preview before running - Always check the estimated bytes

  4. Use LIMIT wisely - It doesn't reduce data scanned! The filtering happens AFTER reading.

   -- ❌ Still scans the whole table!
   SELECT * FROM huge_table LIMIT 10;

   -- ✅ Better - select only needed columns and filter on a partition
   -- column so BigQuery can skip data (a WHERE on an unpartitioned
   -- column still reads the full table)
   SELECT trip_id, fare_amount FROM huge_table WHERE date = CURRENT_DATE() LIMIT 10;
  5. Cache results - BigQuery caches query results for 24 hours (free!)

BigQuery Caching 🗄️

When you run the same query twice:

  • First run: Scans data, costs money
  • Second run: Returns cached result, FREE!

Cache is invalidated when:

  • Underlying table data changes
  • 24 hours pass
  • You disable caching in query settings
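
If you're curious whether your queries are actually hitting the cache, BigQuery exposes this in its INFORMATION_SCHEMA job views. A sketch (the `region-us` qualifier is an example; substitute your own region):

```sql
-- List your recent jobs and whether each one was served from cache.
-- cache_hit = true means the run scanned nothing and cost nothing.
SELECT job_id, cache_hit, total_bytes_processed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY creation_time DESC;
```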

#DataEngineeringZoomcamp #BigQuery #DataWarehouse #GCP #SQL #CloudComputing
