Marcelo Costa

How to use BigQuery Query Caching with Dynamic Wildcard Tables

The Problem: Caching does not work with wildcard tables

From BigQuery official docs:

Wildcard Limitation

Let's say you have tables named my_data_2023_*, where the asterisk represents the various months, and you want to analyze data across all of them. Since BigQuery can't automatically know when new matching tables are created, it invalidates any available cache and runs a fresh query every time, so the cache is never used.
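You can observe this with the client library: running the exact same wildcard query twice never reports a cache hit. A minimal sketch (the project, dataset, and table names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# The exact same wildcard query, run twice in a row
sql = "SELECT COUNT(*) FROM `your-project.your_dataset.my_data_2023_*`"

for _ in range(2):
    job = client.query(sql)
    job.result()
    print(job.cache_hit)  # False both times: wildcard queries bypass the cache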

Just for reference, using date-sharded tables is not a good practice in the first place. Recently I faced a scenario where tables were dynamically created based on a business-domain field; the date example here is only for illustration. If you are using date-sharded tables, the better solution is to migrate them to BigQuery partitioned tables instead, as sketched below.
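For completeness, one way to do that migration is a single CREATE TABLE ... AS SELECT over the wildcard. This is just a sketch, assuming monthly shards named my_data_2023_<MM>; the destination table and the month column are illustrative names:

from google.cloud import bigquery

client = bigquery.Client()

# Consolidate the monthly shards into one date-partitioned table
migration_sql = """
CREATE TABLE `your-project.your_dataset.my_data`
PARTITION BY month AS
SELECT
  PARSE_DATE('%Y_%m', CONCAT('2023_', _TABLE_SUFFIX)) AS month,
  *
FROM `your-project.your_dataset.my_data_2023_*`
"""
client.query(migration_sql).result()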

The Solution: Union THEM ALL!

Enter the BigQuery Information Schema:

The BigQuery INFORMATION_SCHEMA views are read-only, system-defined views that provide metadata about your BigQuery objects.


We can use the TABLES view to dynamically generate a list of all tables matching our pattern (e.g., my_data_2023_*), and then combine a per-table query for each of them with UNION ALL.

Here's a sample using Python:

from google.cloud import bigquery

client = bigquery.Client()

# Specify the dataset and wildcard pattern
dataset_id = "your-project.your_dataset"
wildcard_pattern = "my_data_2023_"

# Query the INFORMATION_SCHEMA to get matching table names
query = f"""
    SELECT table_name
    FROM `{dataset_id}.INFORMATION_SCHEMA.TABLES`
    WHERE table_name LIKE '{wildcard_pattern}%'
"""

rows = list(client.query(query))

if not rows:
    raise SystemExit("No tables matched the wildcard pattern")

# __create_sql builds the SELECT statement for a single table (see below)
view_query = __create_sql(rows[0]["table_name"])
for row in rows[1:]:
    view_query = f"""
    {view_query}
    UNION ALL
    {__create_sql(row['table_name'])}
"""

I omitted the __create_sql function; it is just logic that builds a (potentially complex) SELECT statement from each table name.
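For illustration, here is a minimal sketch of what such a helper could look like; the SELECT body is a placeholder, my real one was more involved:

def __create_sql(table_name: str) -> str:
    # Hypothetical per-table SELECT; replace with your own transformation logic
    return f"SELECT * FROM `your-project.your_dataset.{table_name}`"

With the generated SQL you can then create a BigQuery view: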

# Fully qualified destination for the view; the name is just an example
table_ref = f"{dataset_id}.my_data_2023_unioned"
view = bigquery.Table(table_ref)
view.view_query = view_query
client.create_table(view, exists_ok=True)
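Because the view unions fully qualified table names instead of using a wildcard, queries against it are eligible for BigQuery's query cache again. A quick way to verify, reusing the table_ref from above:

sql = f"SELECT COUNT(*) FROM `{table_ref}`"

for _ in range(2):
    job = client.query(sql)
    job.result()
    print(job.cache_hit)  # typically False on the first run, True on the second

Just remember to recreate the view whenever new tables matching the pattern show up.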

Hope that helps, cheers!
