Study Note 3.2.1: BigQuery Best Practices

Cost Reduction Strategies

  1. Column Selection
    • Avoid using SELECT *
    • Always specify required column names explicitly
    • Rationale: BigQuery uses columnar storage, so reading only the columns you need minimizes the data scanned and billed
  2. Query Planning
    • Always check the estimated query cost before execution
    • The query validator shows the estimated bytes to be processed in the top right corner of the editor
    • Use clustering and partitioning effectively
  3. Data Operations
    • Use streaming inserts cautiously as they can significantly increase costs
    • Materialize intermediate results into stages, especially when a CTE is referenced in multiple places
    • Leverage BigQuery's automatic query result caching (see the sketch below)
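
BigQuery caches the results of deterministic queries for roughly 24 hours and serves repeat runs from cache at no charge. A minimal sketch for spotting cache hits, assuming jobs run in the US multi-region (cache_hit is a column of the INFORMATION_SCHEMA.JOBS_BY_PROJECT view):

-- Which recent jobs were served from cache (zero bytes billed)?
SELECT job_id, cache_hit, total_bytes_processed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY creation_time DESC;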

Query Performance Optimization

  1. Data Structure and Filtering
    • Filter on partitioned or clustered columns
    • Denormalize data when dealing with complex structures
    • Utilize nested or repeated columns for denormalization
  2. Data Source Management
    • Use external data sources judiciously (see the external-table sketch after this list)
    • Queries that read from Google Cloud Storage are generally slower and may incur additional costs
    • Reduce data volume before performing joins
  3. Query Pattern Optimization
    • Don't treat the WITH clause as a prepared statement: CTEs are not materialized, so each reference re-executes the subquery
    • Prevent over-sharding of tables; prefer a single partitioned table over many date-suffixed ones
    • Avoid JavaScript user-defined functions; prefer native SQL functions
    • Use approximate aggregation functions (e.g., the HyperLogLog++-based APPROX_COUNT_DISTINCT) instead of exact ones when a small error is acceptable
  4. Join Optimization
    • Optimize join patterns by following table size order:
      1. Place largest table first
      2. Place the table with the fewest rows second
      3. Arrange remaining tables in decreasing size order
    • Rationale: the largest table gets distributed evenly across slots, while the smaller tables are broadcast in full to every node
  5. Query Structure
    • Use ORDER BY only in the outermost query, ideally combined with LIMIT (a short example follows this list)
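
Sorting inside subqueries is wasted work, since only the final ordering matters. A minimal sketch (table and column names are illustrative):

-- Sort once, in the outermost query, and cap the output
SELECT user_id, amount
FROM transactions
WHERE amount > 0
ORDER BY amount DESC
LIMIT 100;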
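
On external data sources: a minimal sketch of defining an external table over Cloud Storage (bucket path and format are hypothetical). Such tables avoid loading data into BigQuery, but give up native-storage performance:

-- Query Parquet files in place; BigQuery storage is not used
CREATE EXTERNAL TABLE dataset.ext_sales
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://example-bucket/sales/*.parquet']
);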

Comprehensive BigQuery Best Practices Guide

1. Cost Reduction Strategies (Detailed)

A. Column Selection Best Practices

-- Bad Practice
SELECT * FROM transactions;

-- Good Practice
SELECT transaction_id, user_id, amount, timestamp
FROM transactions;


Detailed Explanation:

  • BigQuery charges based on the amount of data scanned
  • In columnar storage, each column is stored separately
  • When using SELECT *, BigQuery must:
    • Scan all columns in the table
    • Process and transfer all data
    • Use more compute resources
  • Cost Impact Example:
    • Table with 100 columns, only need 5
    • SELECT * scans 100% of data
    • Specific columns scan only 5% of data
    • Potential 95% cost reduction

B. Query Planning and Optimization

Price Checking:

  • Use the INFORMATION_SCHEMA.JOBS_BY_PROJECT view to analyze query costs
  • Enable cost estimates in the UI
  • Set custom cost controls: there is no SET custom_quota statement in BigQuery SQL; cap per-query spend with the "Maximum bytes billed" query setting, or the equivalent bq CLI flag:

# Fail the query rather than bill beyond ~1 GB processed (placeholder query)
bq query --maximum_bytes_billed=1000000000 --use_legacy_sql=false 'SELECT ...'


Partitioning Strategies:

  1. Time-based Partitioning:
CREATE TABLE dataset.table
PARTITION BY DATE(timestamp_column)
AS SELECT * FROM source_table;

  2. Integer Range Partitioning:
CREATE TABLE dataset.table
PARTITION BY RANGE_BUCKET(user_id, GENERATE_ARRAY(0, 100, 10))
AS SELECT * FROM source_table;


C. Data Operations Optimization

Streaming Inserts:

  • Cost: $0.01 per 200 MB with the legacy streaming API (the newer Storage Write API is cheaper)
  • Better alternatives:
    1. Batch load jobs, which are free of ingestion charges (see the sketch below)
    2. The Storage Write API for lower-cost streaming
    3. Scheduled uploads for periodic data
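
A minimal sketch of a batch load using the LOAD DATA statement (table name, bucket path, and CSV options are hypothetical):

-- Batch-load CSV files from Cloud Storage; no streaming charges apply
LOAD DATA INTO dataset.transactions
FROM FILES (
  format = 'CSV',
  skip_leading_rows = 1,
  uris = ['gs://example-bucket/transactions/*.csv']
);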

Query Materialization:

-- Instead of referencing an expensive CTE several times
-- (each reference re-executes the subquery):
WITH base_cte AS (
  SELECT complex_calculation FROM huge_table
)
SELECT * FROM base_cte
UNION ALL
SELECT * FROM base_cte;

-- Materialize the expensive result once...
CREATE TEMP TABLE materialized_results AS
SELECT complex_calculation FROM huge_table;

-- ...then reuse it as many times as needed
SELECT * FROM materialized_results;


2. Query Performance Optimization (Detailed)

A. Data Structure and Filtering

Partitioning and Clustering Example:

-- Create partitioned and clustered table
CREATE TABLE sales_data
PARTITION BY DATE(transaction_date)
CLUSTER BY customer_id, product_id
AS SELECT * FROM source_sales;

-- Efficient query filtering on both the partition and cluster columns
SELECT *
FROM sales_data
WHERE DATE(transaction_date) = '2024-02-10'
  AND customer_id = 12345;


Nested and Repeated Columns:

-- Using nested and repeated structures
CREATE TABLE users
(
  user_id INT64,
  name STRING,
  addresses ARRAY<STRUCT<
    street STRING,
    city STRING,
    country STRING
  >>
);
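
Reading the nested data back requires flattening the repeated field. A minimal sketch (the country filter value is illustrative):

-- Flatten the repeated addresses field with UNNEST
SELECT u.user_id, a.city
FROM users AS u, UNNEST(u.addresses) AS a
WHERE a.country = 'US';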


B. Join Optimization Techniques

Optimal Join Order Example:

-- Good Practice (largest first, then fewest rows, then decreasing size)
SELECT *
FROM large_table a -- 1M rows: distributed evenly across slots
JOIN small_table c -- 10K rows: broadcast to every node
  ON a.id = c.id
JOIN medium_table b -- 100K rows
  ON a.id = b.id;

-- Poor Practice (Small to large)
SELECT *
FROM small_table c -- 10K rows
JOIN medium_table b -- 100K rows
  ON c.id = b.id
JOIN large_table a -- 1M rows
  ON b.id = a.id;


C. Advanced Performance Optimization

Using Approximation Functions:

-- Instead of COUNT(DISTINCT user_id): HyperLogLog++-based, small bounded error
SELECT APPROX_COUNT_DISTINCT(user_id)
FROM events;

-- Instead of exact percentiles: approximate 90th percentile
SELECT APPROX_QUANTILES(value, 100)[OFFSET(90)]
FROM measurements;


Query Pattern Optimization:

-- Bad Practice (Multiple subqueries)
SELECT *
FROM (
  SELECT *
  FROM (
    SELECT * FROM large_table
  )
);

-- Good Practice (Flattened query)
SELECT *
FROM large_table;


D. Monitoring and Maintenance

  1. Regular Performance Monitoring:
SELECT
  creation_time,
  user_email,
  total_bytes_processed,
  total_slot_ms
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
ORDER BY total_bytes_processed DESC;

  2. Table Statistics Monitoring:
SELECT
  table_name,
  row_count,
  size_bytes,
  last_modified_time
FROM `project.dataset.__TABLES__`
ORDER BY size_bytes DESC;


3. Additional Best Practices

  1. Resource Management:
    • Set appropriate job priorities
    • Use reservation slots for predictable workloads
    • Implement cost controls and quotas
  2. Data Quality:
    • Implement column-level constraints
    • Use standardized data types
    • Regular data validation checks
  3. Security:
    • Implement column-level security
    • Use authorized views to share query results without exposing the underlying tables (see the sketch after this list)
    • Regular audit of access patterns
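
A minimal sketch of the view half of an authorized-view setup (dataset and column names are hypothetical); authorizing the view to read the private dataset is an IAM step performed outside SQL:

-- Expose only non-sensitive columns through a view in a separate dataset
CREATE VIEW shared_views.user_activity AS
SELECT user_id, event_type, event_timestamp
FROM private_data.events;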
