Mirlan

Posted on Jun 11

How to Query Parquet Files in Object Storage with BigQuery External Tables

#dataengineering #bigquery #opensource #tutorial

If you're storing analytics events, logs, or any append-only data as Parquet files in object storage, there's a way to query them with BigQuery without copying them into BigQuery-managed storage. It's called external tables, and it's one of the most underrated features in modern data engineering.

This tutorial walks through how to set up BigQuery external tables on top of Parquet files in GCS, plus Amazon S3 via BigQuery Omni/BigLake. For other S3-compatible storage, the same Parquet layout still works with engines like DuckDB, Trino, Spark, ClickHouse, or Polars.

By the end, you'll have:

Parquet files in object storage that BigQuery can query directly
Partitioning that BigQuery understands (huge cost difference)
A workflow that lets you switch query engines later without moving data

Let's go.

Why external tables matter

The typical pattern with BigQuery is: ingest data, BigQuery stores it in its native columnar format, you pay for storage and queries separately.

External tables flip this. The data stays in your object storage. BigQuery only reads it when you query. You pay BigQuery only for compute (the query), not for storage.

Three reasons this matters:

Avoiding duplicate storage: GCS standard storage is ~$0.02/GB/month. BigQuery native storage is ~$0.02/GB/month for active, but only $0.01/GB/month for long-term. For active data the difference is small, but you still get the next two benefits.
Vendor portability: the Parquet files in your bucket are queryable by BigQuery, DuckDB, Trino, Spark, ClickHouse, Polars — anything that reads Parquet. If you ever need to switch off BigQuery, you point a new engine at the same files. No migration.
Reprocessing: if a transformation has a bug, you reprocess the raw Parquet into new derived tables. The source data is untouched and re-readable. You can't do this with most hosted analytics platforms — they aggregate on ingest.

The tradeoff: external tables are slightly slower than native BigQuery tables for the same query (data has to be read from GCS, not from BigQuery's internal storage). For many ad-hoc analytics workloads, this is acceptable. For sub-second query SLAs, use native tables.

The setup we'll build

Your events  →  Parquet files in GCS  →  BigQuery external table  →  SQL queries
                (partitioned by date)        (no data movement)

Specifically:

Storage: GCS bucket gs://your-bucket/events/year=2026/month=01/day=15/*.parquet
Format: Parquet (columnar, compressed, schema-aware)
External table: BigQuery sees events.raw_events as a queryable table
Partitioning: BigQuery uses the path structure to prune partitions

Step 1: Set up the GCS bucket and Parquet layout

First, create a bucket and decide on partitioning. Don't skip this — wrong partitioning costs 10-100x more in query bills.

# Create bucket (one-time)
gcloud storage buckets create gs://my-events-bucket \
  --location=europe-west1 \
  --uniform-bucket-level-access

Now decide partition keys. The rule: partition by what you filter on most. For event analytics, that's almost always date.

Recommended layout:

gs://my-events-bucket/
  events/
    year=2026/
      month=01/
        day=15/
          batch-001.parquet
          batch-002.parquet
        day=16/
          batch-001.parquet
      month=02/
        day=01/
          batch-001.parquet

The year=2026/month=01/day=15/ style is called Hive partitioning. BigQuery recognizes it automatically. We'll use that.

Step 2: Write Parquet files with the right structure

Here's a minimal Python example using pyarrow. Save this as writer.py — we'll import from it in the complete example later.

# writer.py
from collections import defaultdict
from datetime import datetime, timezone
from uuid import uuid4
import os
import tempfile

import pyarrow as pa
import pyarrow.parquet as pq
from google.cloud import storage


def write_events_batch(events: list[dict], bucket_name: str) -> None:
    """Write events as Parquet files to Hive-style daily partitions."""
    if not events:
        return

    # Group events by UTC date so each Parquet file lands in the right partition.
    partitions: dict[tuple[int, str, str], list[dict]] = defaultdict(list)

    for event in events:
        event_time = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)
        partition_key = (
            event_time.year,
            f"{event_time.month:02d}",
            f"{event_time.day:02d}",
        )
        partitions[partition_key].append(event)

    client = storage.Client()
    bucket = client.bucket(bucket_name)

    for (year, month, day), partition_events in partitions.items():
        table = pa.Table.from_pylist(partition_events)

        file_id = uuid4().hex
        gcs_path = f"events/year={year}/month={month}/day={day}/batch-{file_id}.parquet"

        with tempfile.NamedTemporaryFile(suffix=".parquet", delete=False) as tmp:
            local_path = tmp.name

        try:
            pq.write_table(
                table,
                local_path,
                compression="snappy",
                use_dictionary=True,
            )

            blob = bucket.blob(gcs_path)
            blob.upload_from_filename(local_path, if_generation_match=0)
        finally:
            os.remove(local_path)

A few production notes:

Compression: use snappy for speed (most analytics), gzip for storage savings (if you query rarely), or zstd (newer, often best balance — well-supported by BigQuery).
Row group size: default is fine for most cases. Tune row_group_size=100000 if you have very wide rows.
Don't write tiny files: aim for 64 MB+ per Parquet file. BigQuery (and every query engine) hates many small files. Batch your writes.

Step 3: Create the BigQuery external table

Now the magic. Create a BigQuery dataset, then create an external table that points at the GCS bucket.

-- Create the dataset
CREATE SCHEMA IF NOT EXISTS `my-project.events`
OPTIONS (location = "EU");

-- Create the external table
CREATE OR REPLACE EXTERNAL TABLE `my-project.events.raw_events`
WITH PARTITION COLUMNS (
  year INT64,
  month INT64,
  day INT64
)
OPTIONS (
  format = "PARQUET",
  uris = ["gs://my-events-bucket/events/*"],
  hive_partition_uri_prefix = "gs://my-events-bucket/events/",
  require_hive_partition_filter = true
);

Key parts:

format = "PARQUET" — BigQuery reads the schema from the Parquet files themselves
uris = ["gs://my-events-bucket/events/*"] — glob pattern matching all Parquet files
WITH PARTITION COLUMNS (...) — tells BigQuery about the Hive-style partitions
hive_partition_uri_prefix — where partitions start in the path
require_hive_partition_filter = true — critical for cost control (see below)

Step 4: Query the external table

-- Basic query
SELECT 
  event_name,
  COUNT(*) as event_count
FROM `my-project.events.raw_events`
WHERE year = EXTRACT(YEAR FROM CURRENT_DATE())
  AND month = EXTRACT(MONTH FROM CURRENT_DATE())
  AND day = EXTRACT(DAY FROM CURRENT_DATE())
GROUP BY event_name
ORDER BY event_count DESC
LIMIT 10;

That's it. BigQuery reads only the Parquet files in year=2026/month=01/day=15/, parses them on the fly, and returns results.

Step 5: The cost gotcha — partition filters

Here's the part that bites people. External tables charge by data scanned. If you forget a partition filter, BigQuery scans every file in the bucket.

That's why I set require_hive_partition_filter = true. With this option, queries without a partition filter (year, month, day) fail outright:

-- This will fail with require_hive_partition_filter = true:
SELECT COUNT(*) FROM `my-project.events.raw_events`;
-- Error: Cannot query over table without a filter that can be used for partition elimination.

This is what you want. It prevents accidentally scanning the entire lake. Always set this option on production external tables.

If you need a cross-partition aggregate, write the filter explicitly:

SELECT COUNT(*) 
FROM `my-project.events.raw_events`
WHERE year = 2026;  -- BigQuery prunes to just year=2026 partitions

Schema evolution: adding fields without breaking things

Eventually you'll want to add fields to your event schema. Parquet handles this gracefully if you write files with the new schema going forward — old files keep their old schema, new files have the new one.

Parquet stores schema in each file, and BigQuery can infer schema from self-describing formats like Parquet. In production, don’t rely on automatic schema merging across many files. Keep schemas backward-compatible, add nullable fields carefully, and consider providing an explicit schema or recreating the external table when the schema changes.

Two practical rules:

Add fields freely, but never remove or rename fields. If you need to remove, set it to NULL going forward; if you need to rename, add the new name and stop writing the old.
For breaking schema changes, write to a new path (e.g., events_v2/) and create a new external table. Migrate gradually.

Performance tip: Parquet column projection

BigQuery only reads the columns you reference in your query. With wide event tables (50+ columns), this is a huge cost saver.

-- Reads only event_name and user_id columns from Parquet, not full rows
SELECT event_name, COUNT(DISTINCT user_id) 
FROM `my-project.events.raw_events`
WHERE year = 2026 AND month = 1 AND day = 15
GROUP BY event_name;

Vs:

-- Reads ALL columns (avoid)
SELECT * FROM `my-project.events.raw_events`
WHERE year = 2026 AND month = 1 AND day = 15
LIMIT 100;

SELECT * is fine for debugging but expensive at scale. Be explicit about columns in production.

Working with S3 instead of GCS

The same pattern works with AWS S3 via BigQuery Omni or by using a different engine entirely (Trino, Athena, DuckDB) that reads from S3 directly.

For S3 specifically with BigQuery Omni:

CREATE OR REPLACE EXTERNAL TABLE `my-project.events.raw_events`
WITH CONNECTION `my-project.us.s3-connection`
WITH PARTITION COLUMNS
OPTIONS (
  format = "PARQUET",
  uris = ["s3://my-events-bucket/events/*"],
  hive_partition_uri_prefix = "s3://my-events-bucket/events/"
);

If you want vendor-portability without BigQuery Omni, use DuckDB or Trino — both query Parquet in S3 natively. The Parquet layout is identical; only the query engine changes.

A complete example: putting it together

Step 2 defined write_events_batch(). Here's how you'd put it to work, then query the result:

# main.py — using the writer from Step 2
from datetime import datetime, timezone
from writer import write_events_batch  # the function we defined earlier

# Build a batch of events (in production, these come from your collector)
events = [
    {
        "timestamp": int(datetime.now(timezone.utc).timestamp()),
        "event_name": "page_view",
        "user_id": "user_123",
        "session_id": "sess_abc",
        "properties": {"page": "/pricing"},
    },
    {
        "timestamp": int(datetime.now(timezone.utc).timestamp()),
        "event_name": "signup_complete",
        "user_id": "user_123",
        "session_id": "sess_abc",
        "properties": {"plan": "free"},
    },
    # ... more events
]

# Write the batch to GCS as a partitioned Parquet file
write_events_batch(events, bucket_name="my-events-bucket")

Then in BigQuery (after creating the external table from Step 3):

-- query.sql
SELECT event_name, COUNT(*) as cnt
FROM `my-project.events.raw_events`
WHERE year = 2026 AND month = 1 AND day = 15
GROUP BY event_name;

That's the entire pipeline. Write Parquet → query with SQL → never load into BigQuery.

In production you'd run write_events_batch() from inside your event collector (or a scheduled job that drains a buffer like NATS or Kafka). The function is intentionally small — keep your batching/buffering logic separate from the Parquet writer concern.

When this pattern makes sense (and when it doesn't)

This is great for:

Analytics events (high volume, append-only)
Application logs you want to query later
Long-term archive that needs occasional ad-hoc queries
Any "data lake" pattern where you want schema-on-read flexibility

This is NOT great for:

Sub-second query SLAs (use native BigQuery tables)
High-cardinality joins across huge tables (native is faster)
OLTP workloads (use a proper database)

Most analytics teams I've worked with run a hybrid: external tables for the raw event archive (cheap, portable), native tables for materialized aggregates (fast, common queries). dbt or SQLMesh runs incremental transforms from external to native, hourly or daily.

Wrapping up

External tables on Parquet are one of the most cost-effective and portable patterns in modern data engineering. The setup takes about an hour. The savings (in vendor lock-in and storage costs) compound for years.

If you want to see this pattern in production form, I'm building Rawbbit — an open-source self-hosted analytics pipeline that uses exactly this architecture (HTTP collector → NATS JetStream → Parquet in GCS → BigQuery external tables → SQLMesh). It's Apache 2.0 if you want to read the code or copy patterns.

Questions or feedback — drop them in the comments below. Always interested in how other teams have built their event pipelines.

Top comments (1)

Hayrullah Kar • Jun 18

Setting require_hive_partition_filter = true is a life-saver for keeping cloud bills from exploding. My only warning: schema evolution can get tricky if upstream systems alter data types mid-batch. Still, an incredibly clean architecture blueprint!