Aniket Abhishek Soni

Posted on Jun 14

Querying Petabytes of Iceberg Tables via BigLake without Breaking Production

#gcp #bigquery #iceberg #dataengineering

Why I chose this topic: I spent three weeks debugging a BigLake performance regression caused by misconfigured partition evolution and I’m tired of seeing "just point it at your bucket" advice. You need to understand the underlying manifest file overhead before you put your production dashboards on top of an Iceberg lakehouse.

If you’ve been working in financial services or healthcare, you know the drill: the data warehouse is a black box, it’s expensive, and moving data out of it is a compliance nightmare. BigLake, combined with Apache Iceberg, finally promises to decouple the storage from the compute without forcing you into the proprietary BigQuery storage format.

But "open standards" doesn't mean "free lunch." If you treat BigLake like a simple external table, your latency will spike, your costs will balloon, and you’ll be dealing with "file not found" errors because your metadata snapshots are out of sync. This guide is the result of shipping these tables into production environments where a 500ms query delay triggers an SRE alert.

1. Stop relying on Hive-style partitioning

The industry is moving toward Iceberg for a reason: hidden partitioning. Stop creating directory structures like /year=2023/month=10/day=05/. It’s legacy overhead that makes your schema evolution brittle.

BigLake handles Iceberg’s hidden partitioning natively. Let the table metadata manage the transformation. If your ingestion pipeline is still doing dt=YYYY-MM-DD folder partitioning, you are paying for the compute power to scan the file path strings unnecessarily.

-- Use Iceberg's metadata to handle partitioning, not the file system
CREATE OR REPLACE EXTERNAL TABLE `project.dataset.table`
WITH CONNECTION `project.region.connection`
OPTIONS (
  format = 'ICEBERG',
  uris = ['gs://your-bucket/path/to/metadata/']
);

2. The metadata cache is not optional

By default, BigLake might try to reach out to GCS to list files if the metadata is stale. In a production environment with millions of small files, this will kill your latency. You need to enable the metadata cache in your connection configuration.

If you don't set a max_staleness, BigQuery will perform a metadata refresh on every query. For high-frequency dashboards, set this to at least 30 minutes. Your users won't notice the drift, but your wallet will notice the lack of excessive API calls.

# Set up the BigLake connection with caching
gcloud bigquery connections create your-connection \
  --location=US \
  --project=your-project \
  --cloud-resource \
  --metadata-cache-mode=AUTOMATIC \
  --metadata-cache-max-staleness=INTERVAL '30' MINUTE

3. Beware the "Small File" death spiral

Iceberg allows you to append data constantly. That’s great for streaming. But if you have a Spark job running every 5 minutes writing 10MB of data, you are creating a manifest file explosion. BigQuery has to read all those manifest files to determine which data files to scan.

If your query times are high but your data volume is low, run rewrite_data_files using the Spark Iceberg actions. You need to aim for 128MB to 512MB file sizes. If you don't compact, BigQuery’s query planner will spend more time reading manifest metadata than actual data rows.

# Use Spark to compact files if BigQuery performance dips
df = spark.table("your_table")
spark.actions.rewrite_data_files(table="your_table") \
    .option("target-file-size-bytes", 536870912) \
    .execute()

4. Column projection is your best friend

When you define a BigLake table, don't use SELECT *. I know it’s convenient, but in an Iceberg/Parquet lakehouse, SELECT * is a footgun. Parquet is columnar, but if you select every column, you are forcing the engine to deserialize everything.

Because BigLake tables on Iceberg support column-level metadata, BigQuery can prune chunks effectively. If you only need three columns, select only three. I’ve seen 40% performance gains on 10TB datasets simply by restricting the schema in the view layer.

5. IAM is not enough: Use Connection Objects

Don't use the standard service account that runs the BigQuery engine for your external access. You’ll be tempted to give your main service account storage.objectViewer on the bucket. Don't.

Use a dedicated Service Account for the BigLake connection object. It allows you to audit exactly what the connection has access to in GCS. If you ever have a security audit (and in healthcare, you will), being able to point to a specific, restricted Service Account attached to a Connection Object is the difference between a "pass" and a "pending" status.

6. Monitor the `INFORMATION_SCHEMA.JOBS`

When a BigLake query hangs, don't just look at the dashboard. Go to INFORMATION_SCHEMA.JOBS. Check the total_bytes_processed and total_bytes_billed columns.

If you see a query processing way more data than it should, you are failing to prune partitions. This usually happens because you are applying filters on columns that aren't partitioned. Always verify that your WHERE clause maps to an Iceberg-partitioned column. If it doesn't, you are full-scanning the bucket.

SELECT
  job_id,
  total_bytes_processed,
  query
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
ORDER BY total_bytes_processed DESC;

7. Handle Schema Evolution like a pro

One of the biggest perks of Iceberg is schema evolution. You can add columns without rewriting the whole dataset. However, BigLake can sometimes get confused if you change column types (e.g., INT to BIGINT).

If you are using an automated schema crawler, be prepared for the table to lock up when a schema change propagates. Always test schema changes in a staging dataset first. If you push a schema change to production and the BigLake metadata isn't refreshed, your queries will return nulls or fail entirely.

Conclusion

BigLake and Iceberg are the best way to escape the BigQuery vendor lock-in cycle, but they require you to act like a storage engineer, not just a SQL user. You have to manage your file sizes, respect your metadata cache, and be disciplined about your IAM boundaries. If you ignore these, you’re just building a slower, more expensive version of the data warehouse you were trying to leave.

The technology is mature enough for enterprise, but are your ingestion patterns mature enough to support it?

DEV Community

Querying Petabytes of Iceberg Tables via BigLake without Breaking Production

1. Stop relying on Hive-style partitioning

2. The metadata cache is not optional

3. Beware the "Small File" death spiral

4. Column projection is your best friend

5. IAM is not enough: Use Connection Objects

6. Monitor the `INFORMATION_SCHEMA.JOBS`

7. Handle Schema Evolution like a pro

Conclusion

Top comments (0)

1. Stop relying on Hive-style partitioning

2. The metadata cache is not optional

3. Beware the "Small File" death spiral

4. Column projection is your best friend

5. IAM is not enough: Use Connection Objects

6. Monitor the INFORMATION_SCHEMA.JOBS

7. Handle Schema Evolution like a pro

Conclusion

6. Monitor the `INFORMATION_SCHEMA.JOBS`