DEV Community

Abdelrahman Adnan

Part 2: BigQuery Deep Dive 🔍

What is BigQuery?

BigQuery is Google's data warehouse in the cloud. It's one of the most popular choices for storing and analyzing large amounts of data because it's:

  1. Serverless - You don't manage any servers. No installing software, no worrying about disk space, no maintenance. Google handles everything.

  2. Fully managed - Google takes care of security, backups, scaling, and updates.

  3. Petabyte-scale - Can handle absolutely massive datasets (1 petabyte = 1,000 terabytes = 1,000,000 gigabytes!)

  4. SQL-based - You just write SQL queries. No need to learn a new programming language!

Why BigQuery is Great for Beginners 🌟

  • ☁️ No setup headaches - Create a project, load data, start querying. That's it!
  • 💰 Free tier - 1 TB of query processing and 10 GB of storage free every month
  • 📊 Familiar SQL - If you know basic SQL, you can use BigQuery
  • 🔗 Works with everything - Google Sheets, Data Studio, Python, R, etc.
  • 🤖 Built-in ML - Train machine learning models using just SQL!
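
That last bullet deserves a quick illustration: with BigQuery ML you can train a model in a single SQL statement. Here's a minimal sketch (the project, dataset, table, and column names are placeholders, not from this post):

```sql
-- Train a linear regression model to predict taxi fares, entirely in SQL.
-- `my-project.my_dataset.taxi_native` and its columns are hypothetical names.
CREATE OR REPLACE MODEL `my-project.my_dataset.fare_model`
OPTIONS (
  model_type = 'linear_reg',
  input_label_cols = ['fare_amount']  -- the column the model learns to predict
) AS
SELECT trip_distance, passenger_count, fare_amount
FROM `my-project.my_dataset.taxi_native`;
```

Once trained, you can get predictions from the model with `ML.PREDICT` in an ordinary SELECT statement.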

How BigQuery Works Under the Hood 🔧

Understanding the architecture helps you write better queries and save money. Don't worry, I'll keep it simple!

The Secret: Separation of Storage and Compute

Traditional databases store data and process queries on the same machine. BigQuery does something clever - it separates them:

┌─────────────────────────────────────────────────────────┐
│                    YOUR SQL QUERY                        │
└─────────────────────────┬───────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│                DREMEL (Compute Engine)                   │
│                                                         │
│   Your query gets broken into tiny pieces and           │
│   thousands of workers process them in parallel         │
└─────────────────────────┬───────────────────────────────┘
                          │
                          │  Jupiter Network (super fast!)
                          │  ~1 Petabit/sec bisection bandwidth
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│                 COLOSSUS (Storage)                       │
│                                                         │
│   Your data lives here in COLUMNAR format               │
│   (organized by columns, not rows)                      │
└─────────────────────────────────────────────────────────┘

What Does "Columnar Storage" Mean? 📋

This is SUPER important for understanding BigQuery performance!

Traditional databases (row-oriented):
Stores data like this:

Row 1: [John, 25, New York, $50000]
Row 2: [Jane, 30, Chicago, $60000]
Row 3: [Bob, 35, Miami, $55000]

To find all salaries, it reads EVERY row, even though you only need one column.

BigQuery (column-oriented):
Stores data like this:

Names column:    [John, Jane, Bob]
Ages column:     [25, 30, 35]
Cities column:   [New York, Chicago, Miami]
Salaries column: [$50000, $60000, $55000]

To find all salaries, it ONLY reads the salary column! Much faster and cheaper!

💡 This is why SELECT * is expensive in BigQuery - it has to read EVERY column. Always specify only the columns you need!
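
To see the columnar effect in practice, compare the byte estimate BigQuery shows for a single-column query against `SELECT *` on the same table (the table name below is a placeholder):

```sql
-- Only the fare_amount column is read from Colossus, so bytes billed
-- is roughly the size of that one column, not of the whole table
SELECT AVG(fare_amount) AS avg_fare
FROM `my-project.my_dataset.taxi_native`;
```

On a wide table this single-column query can scan an order of magnitude fewer bytes than `SELECT *`, which is exactly the saving columnar storage is designed for.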

The Dremel Execution Engine 🚀

When you run a query, here's what happens:

  1. Root Server receives your query
  2. Query is broken into smaller pieces
  3. Mixers distribute work to thousands of Leaf Nodes
  4. Each Leaf Node processes a small chunk of data in parallel
  5. Results flow back up through Mixers to Root
  6. You get your final result!
                    ┌──────────┐
                    │   ROOT   │  ← Your query comes here
                    └────┬─────┘
                         │
           ┌─────────────┼─────────────┐
           ▼             ▼             ▼
      ┌────────┐    ┌────────┐    ┌────────┐
      │ MIXER  │    │ MIXER  │    │ MIXER  │
      └───┬────┘    └───┬────┘    └───┬────┘
          │             │             │
    ┌─────┼─────┐ ┌─────┼─────┐ ┌─────┼─────┐
    ▼     ▼     ▼ ▼     ▼     ▼ ▼     ▼     ▼
   [L]   [L]   [L][L]   [L]   [L][L]   [L]   [L]

   L = Leaf nodes (thousands of them!)

Why this matters: A query that would take hours on your laptop can run in seconds because thousands of machines work on it simultaneously!

External Tables vs Native Tables 📦

You have two ways to work with data in BigQuery:

Option 1: External Tables (Data stays in GCS)

Your data remains in Google Cloud Storage; BigQuery reads it directly from GCS each time you query.

-- Create external table pointing to files in GCS bucket
CREATE OR REPLACE EXTERNAL TABLE `my-project.my_dataset.taxi_external`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-bucket/taxi_data/*.parquet']
);

When to use External Tables:

  • ✅ You want to save on storage costs (GCS is cheaper than BigQuery storage)
  • ✅ One-time or occasional analysis
  • ✅ Data is updated frequently in source system
  • ✅ Quick exploration before committing to load

Downsides:

  • ❌ Slower queries (data needs to be read from GCS each time)
  • ❌ No cost estimation before running queries
  • ❌ Can't partition or cluster (limited optimization)

Option 2: Native Tables (Data loaded into BigQuery)

Data is copied into BigQuery's own storage (Colossus).

-- Create native table from external table
CREATE OR REPLACE TABLE `my-project.my_dataset.taxi_native` AS
SELECT * FROM `my-project.my_dataset.taxi_external`;

When to use Native Tables:

  • ✅ Frequently queried data
  • ✅ Need best query performance
  • ✅ Want to use partitioning and clustering
  • ✅ Need accurate cost estimates before running queries

Downsides:

  • ❌ Higher storage costs
  • ❌ Data duplication (exists in both GCS and BigQuery)

💡 Pro tip: Start with external tables for exploration, then load into native tables once you know what data you actually need!
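
Putting that pro tip into SQL, the two-step workflow might look like this sketch (it builds on the external table created earlier; the native table name and column list are placeholders):

```sql
-- Step 1: explore the external table cheaply, e.g. inspect its schema
SELECT column_name, data_type
FROM `my-project.my_dataset`.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'taxi_external';

-- Step 2: load only the columns you actually need into a native table,
-- which then supports partitioning, clustering, and cost estimates
CREATE OR REPLACE TABLE `my-project.my_dataset.taxi_core` AS
SELECT pickup_time, dropoff_time, fare_amount
FROM `my-project.my_dataset.taxi_external`;
```

Selecting only the needed columns at load time keeps the native table (and its storage bill) as small as possible.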

Understanding BigQuery Costs 💰

BigQuery has two main pricing models:

On-Demand Pricing (Pay per query)

  • $5 per TB of data scanned
  • Good for: Occasional users, unpredictable workloads
  • You pay for how much data your queries read

Flat-Rate Pricing (Monthly commitment)

  • ~$2,000/month for 100 "slots" (compute units)
  • Good for: Heavy users, predictable workloads
  • Unlimited queries within your slot capacity

How to Estimate Query Cost 🧮

Before running a query, BigQuery shows you how much data it will scan:

┌────────────────────────────────────────────────────┐
│  Query Editor                                      │
│  ────────────────────────────────────────────────  │
│  SELECT * FROM my_table WHERE date = '2024-01-01'  │
│                                                    │
│  [This query will process 2.5 GB when run]         │ ← Check this!
└────────────────────────────────────────────────────┘

Cost calculation:

  • 2.5 GB = 0.0025 TB
  • 0.0025 TB × $5 = $0.0125 (about 1 cent)

But if you run that query 100 times a day, that's about $1.25 a day, or roughly $37.50 a month, for a single query. Costs add up!

Cost Optimization Tips 💡

  1. NEVER use SELECT * unless you absolutely need every column
   -- ❌ Bad - reads ALL columns
   SELECT * FROM taxi_data;

   -- ✅ Good - reads only what you need
   SELECT pickup_time, dropoff_time, fare_amount FROM taxi_data;
  2. Use partitioned tables (covered in Part 3)

  3. Preview before running - Always check the estimated bytes

  4. Use LIMIT wisely - It doesn't reduce data scanned! The filtering happens AFTER reading.

   -- ❌ Still scans the whole table!
   SELECT * FROM huge_table LIMIT 10;

   -- ✅ Better - select only needed columns and filter on a partition
   -- column so BigQuery can skip data (a WHERE on an unpartitioned
   -- column still reads the full table)
   SELECT trip_id, fare_amount FROM huge_table WHERE date = CURRENT_DATE() LIMIT 10;
  5. Cache results - BigQuery caches query results for 24 hours (free!)

BigQuery Caching 🗄️

When you run the same query twice:

  • First run: Scans data, costs money
  • Second run: Returns cached result, FREE!

Cache is invalidated when:

  • Underlying table data changes
  • 24 hours pass
  • You disable caching in query settings
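
If you're curious whether your queries are actually hitting the cache, BigQuery exposes this in its INFORMATION_SCHEMA job views. A sketch (the `region-us` qualifier is an example; substitute your own region):

```sql
-- List your recent jobs and whether each one was served from cache.
-- cache_hit = true means the run scanned nothing and cost nothing.
SELECT job_id, cache_hit, total_bytes_processed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY creation_time DESC;
```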

#DataEngineeringZoomcamp #BigQuery #DataWarehouse #GCP #SQL #CloudComputing
