Pizofreude

Study Notes 3.2.2: BigQuery Internal Architecture

Core Components

1. Storage (Colossus)

  • Uses columnar storage format
  • Separated from compute resources
  • Highly cost-effective for data storage
  • Cost optimization: storage is billed separately from compute, based on the volume of data stored

2. Network Infrastructure (Jupiter)

  • High-speed internal network within Google data centers
  • Bandwidth: Google cites roughly 1 petabit per second of total bisection bandwidth
  • Enables efficient communication between separated compute and storage
  • Critical for maintaining low query latency

3. Query Engine (Dremel)

  • Handles query execution and processing
  • Uses a tree-based architecture for query distribution
  • Breaks down complex queries into smaller subqueries
  • Components:
    • Root server: Initial query reception and planning
    • Mixers: Query subdivision and result aggregation
    • Leaf nodes: Direct data access and basic operations

Figure: BigQuery internal architecture

Storage Architecture

Column-Oriented vs Record-Oriented Storage

  1. Record-Oriented (Traditional)
    • Similar to CSV structure
    • Data stored row by row
    • Better for full record retrieval
  2. Column-Oriented (BigQuery's Approach)
    • Data stored by columns
    • Advantages:
      • Faster column-based aggregations
      • Efficient for queries that access only a subset of columns
      • Better compression, since values within a column share a type

Figure: GCS data storage architecture

Query Processing Workflow

  1. Query Submission
    • Root server receives query
    • Initial query analysis and planning
  2. Query Distribution
    • Root server breaks the query down into subqueries
    • Mixers further divide into smaller operations
    • Leaf nodes receive specific tasks
  3. Data Processing
    • Leaf nodes communicate with Colossus
    • Execute assigned operations
    • Return partial results to mixers
  4. Result Aggregation
    • Mixers combine results from leaf nodes
    • Root server performs final aggregation
    • Returns complete result set
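The four steps above can be sketched as a small tree aggregation in Python. This is an illustration under simplifying assumptions, not Dremel's implementation. The shard data, the `fanout` parameter, and the function names `leaf`, `mixer`, and `root` are hypothetical, and the in-memory lists stand in for data read from Colossus. The query being simulated is an average over all values.

```python
# Hypothetical data shards, standing in for column chunks stored in Colossus.
SHARDS = [
    [3, 8, 1],
    [7, 2],
    [5, 5, 5],
    [9],
]


def leaf(shard):
    """Leaf node: scan one shard and return a partial (sum, count)."""
    return sum(shard), len(shard)


def mixer(shards):
    """Mixer: fan out to leaf nodes, then merge their partial results."""
    partials = [leaf(shard) for shard in shards]
    return sum(s for s, _ in partials), sum(c for _, c in partials)


def root(shards, fanout=2):
    """Root server: split the work among mixers and compute the final average."""
    groups = [shards[i:i + fanout] for i in range(0, len(shards), fanout)]
    partials = [mixer(group) for group in groups]
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


print(root(SHARDS))  # 5.0
```

Leaves return (sum, count) pairs rather than local averages because averages of unevenly sized shards cannot simply be averaged again. Carrying partial state that merges cleanly is what lets the mixers and the root combine results level by level.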

Figure: Query processing workflow

Key Benefits

  1. Performance
    • Distributed query processing
    • High-speed network infrastructure
    • Efficient columnar storage
  2. Cost Efficiency
    • Separated storage and compute
    • Query costs are driven mainly by the amount of data processed
    • Economical data storage
  3. Scalability
    • Distributed architecture
    • Efficient handling of large datasets
    • Automatic resource management
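The cost-efficiency point can be illustrated with a rough estimate of how much data a columnar scan reads. This is a sketch with made-up numbers: the column names and sizes in `COLUMN_BYTES` and the helper `bytes_scanned` are hypothetical, and it ignores partition or cluster pruning and any billing minimums.

```python
# Hypothetical per-column sizes in bytes for one table (made-up numbers).
COLUMN_BYTES = {
    "user_id": 8_000_000,
    "country": 2_000_000,
    "amount": 8_000_000,
    "payload": 182_000_000,
}


def bytes_scanned(selected_columns, column_bytes=COLUMN_BYTES):
    """Estimate the bytes a columnar scan reads: only the referenced columns."""
    return sum(column_bytes[name] for name in selected_columns)


full_scan = bytes_scanned(COLUMN_BYTES)             # every column, like SELECT *
narrow_scan = bytes_scanned(["country", "amount"])  # only two columns

print(full_scan, narrow_scan)   # 200000000 10000000
print(narrow_scan / full_scan)  # 0.05
```

In this toy table, selecting two small columns reads 5% of the data a full scan would. This is why naming only the columns you need is a common recommendation when queries are billed by data processed.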

Best Practices Note

While understanding internals isn't mandatory for basic usage, it can be valuable for:

  • Building optimized data products
  • Making informed architectural decisions
  • Understanding performance characteristics
  • Implementing cost-effective solutions

This architecture enables BigQuery to handle massive datasets efficiently while maintaining quick query response times through its distributed processing approach.
