When working with Google BigQuery, one limitation often catches engineers off guard:
A table function can return only a single table with a predefined schema.
However, in real-world data pipelines, the output is rarely just one dataset. A typical pipeline may generate several types of results, such as:
- Cleaned or transformed data
- Validation errors
- Pipeline metrics
- Execution logs
This raises an important question:
What should you do if your function logically produces multiple outputs?
Many engineers assume the only option is to use multiple queries or stored procedures to handle each output separately.
But there’s actually a clever workaround that allows you to structure multiple outputs within a single function.
Let’s take a closer look at how this can be achieved. 🚀
The Problem
Suppose you are building a data pipeline that processes raw data and produces:
- Final transformed data
- Validation errors
- Pipeline execution metrics
Ideally, a function would return multiple result sets.
However, BigQuery table functions can only return one result table.
So how do we design around this limitation?
The Multi-Output Function Pattern
The trick is to return a single STRUCT that contains multiple arrays.
Each array represents a logical dataset (or “table”).
```sql
CREATE OR REPLACE TABLE FUNCTION dataset.pipeline_function()
AS (
  SELECT
    STRUCT(
      -- Main data output
      ARRAY(
        SELECT STRUCT(id, name)
        FROM dataset.source_table
      ) AS data_rows,

      -- Validation errors
      ARRAY(
        SELECT STRUCT(id, error_message)
        FROM dataset.validation_errors
      ) AS validation_errors,

      -- Pipeline metrics
      STRUCT(
        CURRENT_TIMESTAMP() AS run_time,
        (SELECT COUNT(*) FROM dataset.source_table) AS total_rows
      ) AS metrics
    ) AS result
);
```
Now the function returns a single column (result), but that column contains multiple datasets.
Running the Function
```sql
SELECT *
FROM dataset.pipeline_function();
```
Output structure:
BigQuery returns a single result, but we now have multiple logical outputs inside it.
result
├── data_rows ARRAY<STRUCT<id INT64, name STRING>>
├── validation_errors ARRAY<STRUCT<id INT64, error_message STRING>>
└── metrics STRUCT<run_time TIMESTAMP, total_rows INT64>
Splitting the Outputs
Now, we extract each dataset using UNNEST.
Insert main data
```sql
INSERT INTO dataset.final_table
SELECT
  d.id,
  d.name
FROM dataset.pipeline_function(),
     UNNEST(result.data_rows) AS d;
```
Insert validation errors
```sql
INSERT INTO dataset.error_table
SELECT
  e.id,
  e.error_message
FROM dataset.pipeline_function(),
     UNNEST(result.validation_errors) AS e;
```
Store pipeline metrics
```sql
INSERT INTO dataset.pipeline_metrics
SELECT
  result.metrics.run_time,
  result.metrics.total_rows
FROM dataset.pipeline_function();
```
Important Performance Tip
If your function contains expensive logic, avoid calling it multiple times. Instead, store the result once:
```sql
CREATE TEMP TABLE temp_result AS
SELECT *
FROM dataset.pipeline_function();
```
Then extract outputs from temp_result.
This prevents the function from executing repeatedly.
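For example, the main-data insert shown earlier can be rewritten to read from the temp table, reusing the same table and column names:

```sql
-- Same extraction as before, but against the materialized temp table,
-- so the table function's logic runs only once
INSERT INTO dataset.final_table
SELECT
  d.id,
  d.name
FROM temp_result,
     UNNEST(result.data_rows) AS d;
```

The error and metrics inserts can be rewritten against temp_result in the same way.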
Why This Pattern Is Useful
This design enables self-monitoring pipelines.
Your function can return:
- Data
- Validation errors
- Metrics
- Logs
- Metadata
Then the pipeline splits them into separate tables.
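As a sketch, an execution-log output could be added as one more array inside the STRUCT (the dataset.log_entries table and its columns here are hypothetical):

```sql
-- Hypothetical extra output, placed alongside data_rows,
-- validation_errors, and metrics inside the STRUCT
ARRAY(
  SELECT STRUCT(log_time, severity, message)
  FROM dataset.log_entries
) AS logs
```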
JSON vs STRUCT (Important)
Some engineers try to return JSON instead.
However, for large datasets:
STRUCT + ARRAY is the better choice. Why?
- BigQuery natively optimizes nested STRUCT types
- JSON requires parsing
- STRUCT improves query performance
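For contrast, here is roughly what reading the main data would look like if the function packed everything into a JSON string instead (the pipeline_function_json name, the result_json column, and the JSON layout are illustrative): every field has to be parsed out at query time.

```sql
-- JSON variant: each field access goes through string parsing
SELECT
  JSON_VALUE(row_json, '$.id')   AS id,
  JSON_VALUE(row_json, '$.name') AS name
FROM dataset.pipeline_function_json(),
     UNNEST(JSON_QUERY_ARRAY(result_json, '$.data_rows')) AS row_json;
```

With the STRUCT + ARRAY version, the same fields come back as typed columns with no parsing step.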
Final Thoughts
BigQuery table functions may seem limited at first.
But with the STRUCT + ARRAY pattern, you can design functions that behave like they return multiple tables.
This small trick unlocks powerful pipeline patterns.
Sometimes the best engineering solutions come from working around platform constraints creatively.
💬 Question for data engineers:
How do you handle multi-output logic in BigQuery pipelines?