Study Notes 3.2.2: BigQuery Internal Architecture

#dataengineering #dezoomcamp #bigquery #architecture

Core Components

1. Storage (Colossus)

Google's distributed file system, on which BigQuery stores table data in a columnar format
Separated from compute resources
Cost-effective for storing large volumes of data
Cost optimization: storage is billed separately from compute, so data at rest incurs storage charges only

2. Network Infrastructure (Jupiter)

High-speed internal network within Google's data centers
Bandwidth: roughly 1 petabit per second of total bisection bandwidth, according to Google's published figures
Enables efficient communication between separated compute and storage
Critical for maintaining low query latency

3. Query Engine (Dremel)

Handles query execution and processing
Uses tree-based architecture for query distribution
Breaks down complex queries into smaller subqueries
Components:
- Root server: Initial query reception and planning
- Mixers: Query subdivision and result aggregation
- Leaf nodes: Direct data access and basic operations

Storage Architecture

Column-Oriented vs Record-Oriented Storage

Record-Oriented (Traditional)
- Similar to CSV structure
- Data stored row by row
- Better for full record retrieval
Column-Oriented (BigQuery's Approach)
- Data stored by columns
- Advantages:
  - Faster aggregations over individual columns
  - Efficient for queries that access only a subset of columns
  - Better compression, since values within a column share a type and often repeat
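
The difference between the two layouts can be illustrated with a minimal Python sketch. The table, its column names, and the run-length encoder below are made up for illustration and are not BigQuery's actual storage format, which is considerably more sophisticated. The sketch only shows why summing one column touches fewer values in a columnar layout, and why a column with repeated values compresses well.

```python
from itertools import groupby

# Hypothetical table, stored row by row (record-oriented).
rows = [
    {"city": "Berlin", "year": 2023, "sales": 10},
    {"city": "Berlin", "year": 2023, "sales": 20},
    {"city": "Paris", "year": 2023, "sales": 30},
    {"city": "Paris", "year": 2024, "sales": 40},
]

# The same table, stored column by column (column-oriented).
columns = {name: [row[name] for row in rows] for name in rows[0]}


def sum_sales_row_store(rows):
    """Sum one field; every full record is visited, so all fields are read."""
    values_read = sum(len(row) for row in rows)
    return sum(row["sales"] for row in rows), values_read


def sum_sales_column_store(columns):
    """Sum one field; only the 'sales' column is read."""
    sales = columns["sales"]
    return sum(sales), len(sales)


def run_length_encode(values):
    """Collapse runs of equal neighbouring values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


row_total, row_values_read = sum_sales_row_store(rows)
col_total, col_values_read = sum_sales_column_store(columns)
encoded_city = run_length_encode(columns["city"])

print(row_total, row_values_read)  # same total, all 12 values read
print(col_total, col_values_read)  # same total, only 4 values read
print(encoded_city)
```

Both layouts return the same total, but the row layout reads every field of every record while the column layout reads only the four `sales` values.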

Query Processing Workflow

Query Submission
- Root server receives query
- Initial query analysis and planning
Query Distribution
- Root server breaks the query down into subqueries
- Mixers further divide into smaller operations
- Leaf nodes receive specific tasks
Data Processing
- Leaf nodes communicate with Colossus
- Execute assigned operations
- Return partial results to mixers
Result Aggregation
- Mixers combine results from leaf nodes
- Root server performs final aggregation
- Returns complete result set
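
The workflow above can be sketched as a small, single-process Python simulation. Everything here is a simplification under stated assumptions: the shards stand in for column data read from Colossus, the function names (`leaf_scan`, `mixer`, `root_average`) are hypothetical, and real Dremel execution is parallel and far more elaborate. The point is only the shape of the computation: leaves produce partial results, mixers combine them, and the root finishes the aggregation.

```python
# Hypothetical data shards, standing in for column data read from Colossus.
SHARDS = [
    [3, 5, 8],
    [1, 9],
    [4, 4, 6],
    [7, 2],
]


def leaf_scan(shard):
    """Leaf node: read its shard and compute a partial (sum, count)."""
    return sum(shard), len(shard)


def combine(partials):
    """Merge partial (sum, count) pairs into a single pair."""
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return total, count


def mixer(shards):
    """Mixer: hand one shard to each leaf, then combine the partial results."""
    return combine([leaf_scan(shard) for shard in shards])


def root_average(shards, shards_per_mixer=2):
    """Root server: split work across mixers and perform the final aggregation."""
    groups = [
        shards[i:i + shards_per_mixer]
        for i in range(0, len(shards), shards_per_mixer)
    ]
    total, count = combine([mixer(group) for group in groups])
    return total / count


average = root_average(SHARDS)
print(average)
```

The partial results are carried as `(sum, count)` pairs rather than as averages, because an average of averages is not correct when shards differ in size. Sums and counts can be merged at every level of the tree and divided once at the root.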

Key Benefits

Performance
- Distributed query processing
- High-speed network infrastructure
- Efficient columnar storage
Cost Efficiency
- Separated storage and compute
- Compute is paid for only when queries run (on-demand pricing bills by bytes processed)
- Economical data storage
Scalability
- Distributed architecture
- Efficient handling of large datasets
- Automatic resource management
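
To make the cost point concrete, here is a rough Python sketch of how the amount of data a query processes depends on which columns it references. The schema, row count, and per-value sizes are assumptions chosen for illustration, and the estimate ignores partitioning, clustering, caching, and variable-width types such as strings. It is not a pricing calculator.

```python
# Assumed per-value sizes in bytes for a few fixed-width types (illustrative).
TYPE_SIZE_BYTES = {"INT64": 8, "FLOAT64": 8, "BOOL": 1}

# Hypothetical table schema and row count.
SCHEMA = {
    "trip_id": "INT64",
    "distance": "FLOAT64",
    "fare": "FLOAT64",
    "is_shared": "BOOL",
}
ROW_COUNT = 1_000_000


def estimate_bytes_scanned(selected_columns, schema=SCHEMA, row_count=ROW_COUNT):
    """Estimate bytes read when a query references only the selected columns."""
    return sum(TYPE_SIZE_BYTES[schema[col]] * row_count for col in selected_columns)


all_columns = estimate_bytes_scanned(SCHEMA)   # comparable to SELECT *
fare_only = estimate_bytes_scanned(["fare"])   # comparable to SELECT AVG(fare)

print(all_columns, fare_only)
```

Under these assumptions, reading only `fare` processes 8 MB instead of 25 MB for the whole table, which is why selecting only the columns you need tends to lower query cost on columnar storage.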

Best Practices Note

While understanding internals isn't mandatory for basic usage, it can be valuable for:

Building optimized data products
Making informed architectural decisions
Understanding performance characteristics
Implementing cost-effective solutions

This architecture enables BigQuery to handle massive datasets efficiently while maintaining quick query response times through its distributed processing approach.

DEV Community