Study Note 3.2.1: BigQuery Best Practices

Cost Reduction Strategies

  1. Column Selection
    • Avoid using SELECT *
    • Always specify required column names explicitly
    • Rationale: BigQuery uses columnar storage, so reading only the columns you need minimizes the data scanned and billed
  2. Query Planning
    • Always check the estimated query cost before execution
    • The query validator shows the estimated bytes to be processed in the top right corner of the editor
    • Use clustering and partitioning effectively
  3. Data Operations
    • Use streaming inserts cautiously as they can significantly increase costs
    • Materialize intermediate results into stages, especially when a CTE is referenced in multiple places
    • Leverage BigQuery's automatic query result caching (see the sketch below)
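
BigQuery caches the results of deterministic queries for roughly 24 hours and serves repeat runs from cache at no charge. A minimal sketch for spotting cache hits, assuming jobs run in the US multi-region (cache_hit is a column of the INFORMATION_SCHEMA.JOBS_BY_PROJECT view):

-- Which recent jobs were served from cache (zero bytes billed)?
SELECT job_id, cache_hit, total_bytes_processed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY creation_time DESC;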

Query Performance Optimization

  1. Data Structure and Filtering
    • Filter on partitioned or clustered columns
    • Denormalize data when dealing with complex structures
    • Utilize nested or repeated columns for denormalization
  2. Data Source Management
    • Use external data sources judiciously (see the external-table sketch after this list)
    • Queries that read from Google Cloud Storage are generally slower and may incur additional costs
    • Reduce data volume before performing joins
  3. Query Pattern Optimization
    • Don't treat the WITH clause as a prepared statement: CTEs are not materialized, so each reference re-executes the subquery
    • Prevent over-sharding of tables; prefer a single partitioned table over many date-suffixed ones
    • Avoid JavaScript user-defined functions; prefer native SQL functions
    • Use approximate aggregation functions (e.g., the HyperLogLog++-based APPROX_COUNT_DISTINCT) instead of exact ones when a small error is acceptable
  4. Join Optimization
    • Optimize join patterns by following table size order:
      1. Place largest table first
      2. Place the table with the fewest rows second
      3. Arrange remaining tables in decreasing size order
    • Rationale: the largest table gets distributed evenly across slots, while the smaller tables are broadcast in full to every node
  5. Query Structure
    • Use ORDER BY only in the outermost query, ideally combined with LIMIT (a short example follows this list)
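
Sorting inside subqueries is wasted work, since only the final ordering matters. A minimal sketch (table and column names are illustrative):

-- Sort once, in the outermost query, and cap the output
SELECT user_id, amount
FROM transactions
WHERE amount > 0
ORDER BY amount DESC
LIMIT 100;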
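
On external data sources: a minimal sketch of defining an external table over Cloud Storage (bucket path and format are hypothetical). Such tables avoid loading data into BigQuery, but give up native-storage performance:

-- Query Parquet files in place; BigQuery storage is not used
CREATE EXTERNAL TABLE dataset.ext_sales
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://example-bucket/sales/*.parquet']
);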

Comprehensive BigQuery Best Practices Guide

1. Cost Reduction Strategies (Detailed)

A. Column Selection Best Practices

-- Bad Practice
SELECT * FROM transactions;

-- Good Practice
SELECT transaction_id, user_id, amount, timestamp
FROM transactions;


Detailed Explanation:

  • BigQuery charges based on the amount of data scanned
  • In columnar storage, each column is stored separately
  • When using SELECT *, BigQuery must:
    • Scan all columns in the table
    • Process and transfer all data
    • Use more compute resources
  • Cost Impact Example:
    • Table with 100 columns, only need 5
    • SELECT * scans 100% of data
    • Specific columns scan only 5% of data
    • Potential 95% cost reduction

B. Query Planning and Optimization

Price Checking:

  • Use the INFORMATION_SCHEMA.JOBS_BY_PROJECT view to analyze query costs
  • Enable cost estimates in the UI
  • Set custom cost controls: there is no SET custom_quota statement in BigQuery SQL; cap per-query spend with the "Maximum bytes billed" query setting, or the equivalent bq CLI flag:

# Fail the query rather than bill beyond ~1 GB processed (placeholder query)
bq query --maximum_bytes_billed=1000000000 --use_legacy_sql=false 'SELECT ...'


Partitioning Strategies:

  1. Time-based Partitioning:
CREATE TABLE dataset.table
PARTITION BY DATE(timestamp_column)
AS SELECT * FROM source_table;

  2. Integer Range Partitioning:
CREATE TABLE dataset.table
PARTITION BY RANGE_BUCKET(user_id, GENERATE_ARRAY(0, 100, 10))
AS SELECT * FROM source_table;


C. Data Operations Optimization

Streaming Inserts:

  • Cost: $0.01 per 200 MB with the legacy streaming API (the newer Storage Write API is cheaper)
  • Better alternatives:
    1. Batch load jobs, which are free of ingestion charges (see the sketch below)
    2. The Storage Write API for lower-cost streaming
    3. Scheduled uploads for periodic data
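
A minimal sketch of a batch load using the LOAD DATA statement (table name, bucket path, and CSV options are hypothetical):

-- Batch-load CSV files from Cloud Storage; no streaming charges apply
LOAD DATA INTO dataset.transactions
FROM FILES (
  format = 'CSV',
  skip_leading_rows = 1,
  uris = ['gs://example-bucket/transactions/*.csv']
);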

Query Materialization:

-- Instead of referencing an expensive CTE several times
-- (each reference re-executes the subquery):
WITH base_cte AS (
  SELECT complex_calculation FROM huge_table
)
SELECT * FROM base_cte
UNION ALL
SELECT * FROM base_cte;

-- Materialize the expensive result once...
CREATE TEMP TABLE materialized_results AS
SELECT complex_calculation FROM huge_table;

-- ...then reuse it as many times as needed
SELECT * FROM materialized_results;


2. Query Performance Optimization (Detailed)

A. Data Structure and Filtering

Partitioning and Clustering Example:

-- Create partitioned and clustered table
CREATE TABLE sales_data
PARTITION BY DATE(transaction_date)
CLUSTER BY customer_id, product_id
AS SELECT * FROM source_sales;

-- Efficient query filtering on both the partition and cluster columns
SELECT *
FROM sales_data
WHERE DATE(transaction_date) = '2024-02-10'
  AND customer_id = 12345;


Nested and Repeated Columns:

-- Using nested and repeated structures
CREATE TABLE users
(
  user_id INT64,
  name STRING,
  addresses ARRAY<STRUCT<
    street STRING,
    city STRING,
    country STRING
  >>
);
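
Reading the nested data back requires flattening the repeated field. A minimal sketch (the country filter value is illustrative):

-- Flatten the repeated addresses field with UNNEST
SELECT u.user_id, a.city
FROM users AS u, UNNEST(u.addresses) AS a
WHERE a.country = 'US';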


B. Join Optimization Techniques

Optimal Join Order Example:

-- Good Practice (largest first, then fewest rows, then decreasing size)
SELECT *
FROM large_table a -- 1M rows: distributed evenly across slots
JOIN small_table c -- 10K rows: broadcast to every node
  ON a.id = c.id
JOIN medium_table b -- 100K rows
  ON a.id = b.id;

-- Poor Practice (Small to large)
SELECT *
FROM small_table c -- 10K rows
JOIN medium_table b -- 100K rows
  ON c.id = b.id
JOIN large_table a -- 1M rows
  ON b.id = a.id;


C. Advanced Performance Optimization

Using Approximation Functions:

-- Instead of COUNT(DISTINCT user_id): HyperLogLog++-based, small bounded error
SELECT APPROX_COUNT_DISTINCT(user_id)
FROM events;

-- Instead of exact percentiles: approximate 90th percentile
SELECT APPROX_QUANTILES(value, 100)[OFFSET(90)]
FROM measurements;


Query Pattern Optimization:

-- Bad Practice (Multiple subqueries)
SELECT *
FROM (
  SELECT *
  FROM (
    SELECT * FROM large_table
  )
);

-- Good Practice (Flattened query)
SELECT *
FROM large_table;


D. Monitoring and Maintenance

  1. Regular Performance Monitoring:
SELECT
  creation_time,
  user_email,
  total_bytes_processed,
  total_slot_ms
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
ORDER BY total_bytes_processed DESC;

  2. Table Statistics Monitoring:
SELECT
  table_name,
  row_count,
  size_bytes,
  last_modified_time
FROM `project.dataset.__TABLES__`
ORDER BY size_bytes DESC;


3. Additional Best Practices

  1. Resource Management:
    • Set appropriate job priorities
    • Use reservation slots for predictable workloads
    • Implement cost controls and quotas
  2. Data Quality:
    • Implement column-level constraints
    • Use standardized data types
    • Regular data validation checks
  3. Security:
    • Implement column-level security
    • Use authorized views to share query results without exposing the underlying tables (see the sketch after this list)
    • Regular audit of access patterns
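
A minimal sketch of the view half of an authorized-view setup (dataset and column names are hypothetical); authorizing the view to read the private dataset is an IAM step performed outside SQL:

-- Expose only non-sensitive columns through a view in a separate dataset
CREATE VIEW shared_views.user_activity AS
SELECT user_id, event_type, event_timestamp
FROM private_data.events;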
