Hiroshi Toyama

Posted on Mar 22

BigQuery Global Queries: Join Data Across Regions Without ETL

#bigquery #gcp #dataengineering #sql

As of February 2026, BigQuery Global Queries is available in Preview. It lets you join tables from completely different geographic regions, say asia-northeast1 (Tokyo) and us-central1 (Iowa), in a single SQL statement. No ETL, no data movement pipelines, no manual copying.

This post covers how it actually works under the hood, what it costs, and the gotchas you need to know before using it in production.

The Old Problem

BigQuery historically required all datasets referenced in a single query to live in the same location. If your sales data was in Tokyo and your user master was in the US, you had two options:

Copy one dataset to the other region (ETL pipeline, operational overhead).
Run two separate queries and join the results in application code.

Global Queries eliminates this constraint.

How It Works: 4-Stage Execution

When you run a global query, BigQuery orchestrates the execution across regions transparently:

1. Distributed Execution

The Query Optimizer analyzes the query, identifies which tables live in which regions, and designates the region where the query was submitted as the Primary Region (the "leader"). Workers in each remote region receive their execution assignments in parallel.

2. Data Pushdown

This is the most critical stage — and the one that makes global queries economically viable.

Before any data crosses the network, BigQuery applies three types of pushdown to minimize transfer size:

Predicate Pushdown: WHERE clause filters run in the remote region, before the data moves. A 100M-row table filtered to 100 rows transfers 100 rows — not 100M.
Projection Pushdown: Only the columns the query actually references are read from remote storage. BigQuery's columnar storage format (Capacitor) makes this efficient.
Aggregation Pushdown: GROUP BY/SUM/COUNT operations run as partial aggregations in the remote region. A billion-row transaction table can be summarized to 365 rows (daily totals) before transfer.
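
To make the effect of the three pushdowns concrete, here is a minimal in-memory sketch in plain Python. It is not how BigQuery implements anything; the table layout, the column names (`day`, `amount`, `region`, `note`) and the `remote_stage` helper are all made up for illustration. It only shows why the number of rows that leave the remote side can be far smaller than the number scanned there.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical remote table: one row per transaction, including a column
# ("note") that the query never references.
def make_remote_rows(days=3, per_day=1000):
    start = date(2026, 3, 1)
    return [
        {"day": start + timedelta(days=d), "amount": 1 + (i % 5),
         "region": "US" if i % 2 == 0 else "CA", "note": "x" * 20}
        for d in range(days) for i in range(per_day)
    ]

def remote_stage(rows, predicate, group_key, value_col):
    """Filter, project and partially aggregate before anything is 'transferred'."""
    totals = defaultdict(int)
    for row in rows:
        if predicate(row):                            # predicate pushdown
            # Only group_key and value_col are carried forward (projection),
            # and they are summed per group right away (aggregation).
            totals[row[group_key]] += row[value_col]
    return [{group_key: k, "total": v} for k, v in sorted(totals.items())]

rows = make_remote_rows()
transferred = remote_stage(rows, lambda r: r["region"] == "US", "day", "amount")
print(len(rows), "rows scanned remotely;", len(transferred), "rows transferred")
```

In this toy setup the remote side scans every row, but only one summary row per day would cross the network.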

3. Data Transfer

Filtered, minimized results travel over Google's internal network to the Primary Region, where they're stored in temporary internal tables for up to 8 hours. This is where cross-region egress charges are incurred.

4. Final Join

The Primary Region merges local data with the temporary remote data, as if everything were in one place. The query result returned to the user looks like any normal BigQuery result.

-- Executed from asia-northeast1 (Tokyo)
SELECT
  t1.product_id,
  t1.sales + t2.sales AS total_global_sales
FROM `project.japan_dataset.sales` AS t1   -- local
JOIN `project.us_dataset.sales` AS t2      -- remote (auto-transferred)
  ON t1.product_id = t2.product_id
WHERE t1.date = '2026-03-01'               -- filters the local table
  AND t2.date = '2026-03-01';              -- filters the remote table before transfer
                                           -- (assumes it also has a date column)
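
The final join stage can be pictured as an ordinary hash join between local rows and the already-reduced rows that arrived from the remote region. The sketch below is a simplified illustration in Python, assuming both sides are small lists of dicts with `product_id` and `sales` keys; `final_join` is a hypothetical helper, not a BigQuery API.

```python
def final_join(local_rows, transferred_rows, key="product_id"):
    """Hash join: index the (small) transferred rows, then probe with local rows."""
    remote_by_key = {}
    for remote in transferred_rows:
        remote_by_key.setdefault(remote[key], []).append(remote)

    joined = []
    for local in local_rows:
        for remote in remote_by_key.get(local[key], []):
            joined.append({
                key: local[key],
                "total_global_sales": local["sales"] + remote["sales"],
            })
    return joined

# Toy data standing in for the Tokyo table and the temporary table
# holding rows transferred from the US region.
japan = [{"product_id": "A", "sales": 10}, {"product_id": "B", "sales": 5}]
us_temp = [{"product_id": "A", "sales": 7}, {"product_id": "C", "sales": 3}]
print(final_join(japan, us_temp))
```

Only product A appears on both sides here, so the inner join yields a single row.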

IAM Permissions

Global Queries require two layers of setup.

Project-level opt-in (admin task)

-- Enable execution from the primary region
ALTER PROJECT `your-project-id`
SET OPTIONS (
  `region-asia-northeast1.enable_global_queries_execution` = true
);

-- Enable data access from the remote region
ALTER PROJECT `your-project-id`
SET OPTIONS (
  `region-us-central1.enable_global_queries_data_access` = true
);
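
If you have one primary region and several remote regions, writing these statements by hand gets repetitive. Here is a small sketch that only builds the SQL text, reusing the two option names shown above; the `opt_in_statements` helper is hypothetical, and you would still run the output yourself as an admin.

```python
def opt_in_statements(project_id, primary_region, remote_regions):
    """Build one ALTER PROJECT statement for the primary region and one per remote region."""
    def statement(region, option):
        return (
            f"ALTER PROJECT `{project_id}`\n"
            "SET OPTIONS (\n"
            f"  `region-{region}.{option}` = true\n"
            ");"
        )

    statements = [statement(primary_region, "enable_global_queries_execution")]
    statements += [
        statement(region, "enable_global_queries_data_access")
        for region in remote_regions
    ]
    return statements

for sql in opt_in_statements("your-project-id", "asia-northeast1", ["us-central1"]):
    print(sql, end="\n\n")
```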

User-level permissions

| Permission or role | Description |
| --- | --- |
| `bigquery.jobs.createGlobalQuery` | Permission required to initiate a global query. Currently only included in `roles/bigquery.admin`, so create a custom role for regular users. |
| `roles/bigquery.dataViewer` | Role required on every dataset being referenced, in every region. |

Cost Structure

Global queries have three billing components instead of the usual one:

| Component | Details | Approximate Price (2026) |
| --- | --- | --- |
| Compute | Bytes scanned across all regions | $6.25 / TB (on-demand) |
| Egress | Data transferred from remote to primary region | ~$0.08–$0.12 / GB (intercontinental) |
| Temporary Storage | Intermediate data stored for up to 8 hours | ~$0.02 / GB-month (prorated) |

Cost simulation

Scenario: Query from Tokyo, scanning a 1 TB table in us-central1, with a WHERE clause that reduces the data transferred to 1 GB.

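Compute: 1 TB × $6.25 = $6.25
Egress: 1 GB × $0.12 = $0.12
Temporary storage: 1 GB held for up to 8 hours at ~$0.02/GB-month, which is a fraction of a cent
Total: ~$6.37

If you skip the WHERE clause and transfer the full 1 TB (about 1,000 GB), egress alone comes to roughly $80–$120 at the rates above. Pushdown is not optional: it is the entire cost model.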
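
The same arithmetic as a small Python sketch, so you can plug in your own numbers. The default prices are the approximate figures from the table above, and the 730-hour month used to prorate storage is my own assumption; treat the result as a rough estimate, not a quote.

```python
HOURS_PER_MONTH = 730  # assumption used to prorate the monthly storage price

def estimate_cost(scanned_tb, transferred_gb,
                  compute_per_tb=6.25, egress_per_gb=0.12,
                  storage_per_gb_month=0.02, retention_hours=8):
    """Rough cost of one global query, split into the three billing components."""
    compute = scanned_tb * compute_per_tb
    egress = transferred_gb * egress_per_gb
    storage = transferred_gb * storage_per_gb_month * retention_hours / HOURS_PER_MONTH
    return {"compute": compute, "egress": egress, "storage": storage,
            "total": compute + egress + storage}

filtered = estimate_cost(scanned_tb=1, transferred_gb=1)       # WHERE clause pushed down
unfiltered = estimate_cost(scanned_tb=1, transferred_gb=1000)  # full 1 TB transferred

print("filtered total:   $%.2f" % filtered["total"])
print("unfiltered total: $%.2f" % unfiltered["total"])
```

Compute is identical in both cases; the difference comes almost entirely from egress.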

Dry run before executing

Use the BigQuery Console (it shows estimated bytes scanned before you click Run) or the CLI:

bq query --dry_run --use_legacy_sql=false 'SELECT ...'

Note: In the current preview, a dry run may report only the bytes scanned for compute and not the egress volume. Budget conservatively.

Key Considerations

Latency

Cross-region queries are always slower than single-region queries. Physical distance adds hundreds of milliseconds of network latency, plus multi-region orchestration overhead. Expect a minimum of 5–10 seconds even for modest cross-region joins. Real-time dashboards are not a good fit.

Data Residency

The Primary Region is where remote data lands temporarily. If GDPR or local privacy laws prohibit data from Region A leaving Region A, you must run the query from Region A as the primary — not from a region outside it. VPC Service Controls perimeters are also respected.
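
One way to reason about this rule is as a pre-flight check: a region can act as primary only if every table whose data must stay put already lives there. The sketch below encodes just that logic in Python. The `(region, may_leave_region)` tuples and the `allowed_primary_regions` helper are hypothetical; the real residency decision belongs to your legal and compliance team, not to a script.

```python
def allowed_primary_regions(tables, candidates):
    """Return the candidate regions that would not pull restricted data out of its region.

    tables: list of (region, may_leave_region) tuples, one per referenced table.
    candidates: regions you are considering as the primary region.
    """
    restricted = {region for region, may_leave in tables if not may_leave}
    # Every restricted region must be the primary itself, otherwise its
    # data would land in temporary tables somewhere else.
    return [c for c in candidates if all(r == c for r in restricted)]

tables = [
    ("europe-west3", False),    # e.g. data that must stay in its region
    ("us-central1", True),      # free to be transferred
]
print(allowed_primary_regions(tables, ["europe-west3", "us-central1", "asia-northeast1"]))
```

If two tables are restricted to two different regions, the list comes back empty: no single primary region satisfies both constraints.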

Current Limitations (Preview, March 2026)

No Query Cache

Global queries never use the query cache. Since data can change in any remote region at any time, BigQuery always reads fresh data. Every execution incurs full compute and egress costs.

Workaround: For frequently-used cross-region joins, materialize results into a local table using CREATE TABLE AS SELECT and query that instead.

No INFORMATION_SCHEMA from Remote Regions

You cannot query INFORMATION_SCHEMA views from a remote region within a global query. Joining metadata across regions requires first exporting that metadata into regular tables.

Unsupported Table Types

BigLake Apache Iceberg tables are not supported as remote sources.
Partition pseudo-columns (_PARTITIONTIME, _PARTITIONDATE) may not be pushed down correctly (more on this below).

No Sandbox Support

A billing account is required. The Sandbox (free tier) does not support Global Queries because egress charges can exceed the free quota.

The Partition Pseudo-Column Trap

This is the most dangerous limitation in production, and deserves its own section.

Background: Pseudo-columns vs. Physical columns

BigQuery offers two partitioning strategies:

| Type | Partition Key | Access |
| --- | --- | --- |
| Ingestion-time partitioned | Arrival timestamp, managed by BigQuery | Via `_PARTITIONTIME` / `_PARTITIONDATE` (pseudo-columns) |
| Column-based partitioned | An actual column in your table schema (e.g., `event_date`) | Via the column name directly |

Pseudo-columns are not part of the formal table schema. They're metadata-level constructs.

Why pushdown fails for pseudo-columns

When the Query Optimizer sends execution instructions to a remote region, it works from the table's schema definition. Pseudo-columns aren't in that definition, so the optimizer can't reliably communicate partition pruning constraints to the remote worker.

Worst case: A filter like WHERE _PARTITIONDATE = '2026-03-01' is silently ignored in the remote region. The remote worker scans the entire table across all partitions and begins transferring everything to the primary region. Your query either times out or generates a very large bill.
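
Because the failure is silent, it can help to flag risky queries before they run. Below is a deliberately naive lint sketch in Python that looks for the two pseudo-column names in the SQL text. It is plain text matching, not a SQL parser, so it will also fire on names inside comments or string literals, and the `pseudo_columns_used` helper is my own invention.

```python
import re

# Matches _PARTITIONTIME or _PARTITIONDATE as whole identifiers, any case.
PSEUDO_COLUMN = re.compile(r"\b_PARTITION(?:TIME|DATE)\b", re.IGNORECASE)

def pseudo_columns_used(sql):
    """Return the sorted, upper-cased pseudo-column names found in the SQL text."""
    return sorted({match.group(0).upper() for match in PSEUDO_COLUMN.finditer(sql)})

risky = "SELECT * FROM `project.us_dataset.sales` WHERE _PARTITIONDATE = '2026-03-01'"
safe = "SELECT * FROM `project.us_dataset.sales` WHERE event_date = '2026-03-01'"

print(pseudo_columns_used(risky))
print(pseudo_columns_used(safe))
```

A non-empty result is a prompt to review the query by hand before running it across regions.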

The fix: Migrate to column-based partitioning

-- Create a new table with an explicit physical partition column
CREATE TABLE `project.dataset.new_table`
PARTITION BY event_date
AS
SELECT
  *,
  _PARTITIONDATE AS event_date  -- materialize the pseudo-column (already a DATE)
FROM `project.dataset.old_table`;

With a physical column, the optimizer sees it in the schema, understands the partition structure, and can apply pushdown in the remote region.

Alternative workaround: Aliasing via Views (use with caution)

If migrating the table isn't possible, you can create a view in the remote region that aliases the pseudo-column:

-- View in us-central1
CREATE VIEW `project.us_dataset.v_sales` AS
SELECT
  *,
  _PARTITIONDATE AS partition_date_col
FROM `project.us_dataset.ingestion_time_partitioned_table`

Then query the view from the primary region:

SELECT * FROM `project.us_dataset.v_sales`
WHERE partition_date_col = '2026-03-01'

This sometimes works for simple queries, but pushdown is not guaranteed. In complex queries with JOINs or aggregations, the optimizer often loses the connection between the aliased column and the underlying partition structure, falls back to a full scan, and transfers everything.

Always verify that pushdown is working by checking the Query Execution Plan and confirming the remote READ stage shows filtered row counts — not the full table row count.
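
If you pull execution plan statistics into a script, the check can be automated along these lines. This is only a sketch: the stage dictionaries with `name`, `table` and `records_read` keys are a hypothetical shape that I made up, not the actual plan format, and the 90% threshold is arbitrary. You would need to map real plan output into this shape yourself.

```python
def suspicious_reads(stages, table_rows, threshold=0.9):
    """Flag read stages that touched (almost) every row of their table.

    stages: list of dicts with hypothetical keys "name", "table", "records_read".
    table_rows: dict mapping table name to its total row count.
    """
    flagged = []
    for stage in stages:
        total = table_rows.get(stage["table"])
        if total and stage["records_read"] >= threshold * total:
            flagged.append(stage["name"])
    return flagged

plan = [
    {"name": "remote read: v_sales", "table": "us_sales", "records_read": 100_000_000},
    {"name": "local read: sales", "table": "jp_sales", "records_read": 1_200},
]
row_counts = {"us_sales": 100_000_000, "jp_sales": 50_000_000}
print(suspicious_reads(plan, row_counts))
```

Here the remote stage read the whole table, which is the signature of a filter that was not pushed down.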

Operational Best Practices

| Problem | Recommendation |
| --- | --- |
| No query cache | Materialize frequent cross-region joins into local intermediate tables |
| Need metadata across regions | Export metadata to regular tables on a schedule |
| Ingestion-time partitioned tables | Migrate to column-based partitioning before using as remote sources |
| Unclear cost pre-execution | Use dry run + estimate egress separately; add a buffer |

Summary

BigQuery Global Queries is a genuinely useful feature that eliminates an entire category of ETL pipelines. The execution model is well-designed — pushdown at the predicate, projection, and aggregation levels means you're typically only transferring the data you actually need.

The key things to internalize:

Pushdown is the cost model. Filter early, select only the columns you need, push aggregations to the remote side.
Ingestion-time partitioned tables are a liability in global queries. Migrate to column-based partitioning.
It's Preview — no query cache, no INFORMATION_SCHEMA cross-region, no BigLake Iceberg remotes. Design your architecture around these constraints.

Check the official documentation for the latest changes as this feature moves toward GA.

DEV Community