
Building a Production-Ready Serverless App on Google Cloud (Part 2: The Data Contract)

This is Part 2 of a 3-part series on building production-ready, data-intensive applications on Google Cloud. If you haven't read it yet, check out Part 1: Architecture to understand the foundational serverless components we are connecting today.


The Danger of Decoupling

In Part 1 of this series, we praised the decoupled architecture. By splitting our compute (Cloud Run) from our analytics (BigQuery) using a buffer (Pub/Sub), we created a system that scales elastically and costs nothing when idle.

But decoupling introduces a massive architectural danger: The Data Swamp.

If your web application can throw any random JSON payload into a Pub/Sub topic, and that topic blindly dumps it into a data warehouse, your analytics team will spend 80% of their time cleaning malformed strings and fixing broken dashboards.

To prevent this, we must establish a strict Data Contract at the very edge of our ingestion layer.

The Bouncer: Enforcing the Pub/Sub Schema

A professional data pipeline does not rely on the application code to "hopefully" send the right data types. It enforces rules at the infrastructure level.

For the Dog Finder app, we attached a strict Apache Avro schema to our Pub/Sub topic. This acts as the "bouncer" for our data warehouse. If Cloud Run attempts to publish a sighting with a missing field or the wrong data type, Pub/Sub rejects it immediately.

By inspecting pubsub_schema.json, you can see standard Data Engineering practices enforced natively:

  • Precision Typing: We explicitly defined latitude and longitude as double precision. This prevents the backend from accidentally sending coordinates as strings, which would break spatial queries later.

  • Consistent Naming: We enforced snake_case for all fields, such as sighting_date and image_url.
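To make the contract concrete, here is a sketch of how such a schema could be registered and bound to the topic. The schema name and the exact field list below are illustrative; the authoritative definition lives in the repo's pubsub_schema.json.

```shell
# Register an Avro schema (field names are illustrative, matching the
# contract described above) and bind it to the topic at creation time.
gcloud pubsub schemas create dog-sighting-schema \
    --type=AVRO \
    --definition='{
      "type": "record",
      "name": "DogSighting",
      "fields": [
        {"name": "sighting_date", "type": "string"},
        {"name": "latitude", "type": "double"},
        {"name": "longitude", "type": "double"},
        {"name": "image_url", "type": "string"}
      ]
    }' \
    --project="$GOOGLE_CLOUD_PROJECT"

# Any publish to this topic must now validate against the schema.
gcloud pubsub topics create "$TOPIC_ID" \
    --schema=dog-sighting-schema \
    --message-encoding=JSON \
    --project="$GOOGLE_CLOUD_PROJECT"
```

With `--message-encoding=JSON`, a publisher sending `latitude` as a string gets an InvalidArgument error back immediately, instead of silently poisoning the warehouse.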

The Vault: Designing the BigQuery Schema

BigQuery is where our data lives permanently. The schema here needs to mirror our Pub/Sub contract, but also provide the metadata necessary for reliable analytics.

If you look at bigquery_schema.json, we didn't just copy the business fields. We intentionally included metadata fields like message_id and publish_time. Because Pub/Sub guarantees "at-least-once" delivery, duplicate messages can occasionally occur. Capturing the message_id is essential for the analytics team to efficiently deduplicate records.
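As an illustration, the analytics team could deduplicate on read with a window function keyed on message_id. This is a sketch; the table and column names assume the schema described above:

```shell
# Keep exactly one row per Pub/Sub message_id (earliest publish wins).
bq query --use_legacy_sql=false "
SELECT * EXCEPT(row_num)
FROM (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY message_id
           ORDER BY publish_time
         ) AS row_num
  FROM \`${GOOGLE_CLOUD_PROJECT}.${BIGQUERY_DATASET}.${BIGQUERY_TABLE}\`
)
WHERE row_num = 1"
```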

More importantly, we didn't just create a basic table. In our setup_resources.sh script, we enforced a partitioning strategy directly at creation:

bq mk --table \
    --time_partitioning_field sighting_date \
    --time_partitioning_type DAY \
    "${GOOGLE_CLOUD_PROJECT}:${BIGQUERY_DATASET}.${BIGQUERY_TABLE}" \
    "$PROJECT_ROOT/schemas/bigquery_schema.json"

By partitioning the table by sighting_date, we ensure that when a Looker Studio dashboard queries for "lost dogs this week" or an analyst performs research, BigQuery scans only the relevant daily partitions. This single command is the difference between a query that costs $1 and a query that costs $1,000 as your dataset grows.
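Here is what such a partition-pruned query could look like, assuming sighting_date maps to a DATE column in the BigQuery schema:

```shell
# The filter on the partitioning column lets BigQuery scan only the
# last 7 daily partitions instead of the whole table.
bq query --use_legacy_sql=false "
SELECT COUNT(*) AS sightings_this_week
FROM \`${GOOGLE_CLOUD_PROJECT}.${BIGQUERY_DATASET}.${BIGQUERY_TABLE}\`
WHERE sighting_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)"
```

You can verify the savings before running anything: add `--dry_run` to the command and BigQuery reports the bytes it would scan without executing the query.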

The Serverless Bridge: Zero-Code Ingestion

Now for the architectural magic trick. We have a secure Pub/Sub topic and a partitioned BigQuery table. How do we move data between them?

Traditionally, developers write a Cloud Function or spin up a Dataflow job to consume from Pub/Sub, transform the payload, and insert it into BigQuery. That means writing code, managing deployments, and paying for intermediate compute.

Instead, we used a native BigQuery Subscription. This is a powerful serverless pattern that requires zero code. Here is the exact command from our setup script:

gcloud pubsub subscriptions create "$SUBSCRIPTION_ID" \
    --topic="$TOPIC_ID" \
    --bigquery-table="${GOOGLE_CLOUD_PROJECT}:${BIGQUERY_DATASET}.${BIGQUERY_TABLE}" \
    --use-topic-schema \
    --write-metadata \
    --project="$GOOGLE_CLOUD_PROJECT"


Notice the two critical flags:

  • --use-topic-schema: This tells the subscription to natively map the fields from our Avro schema directly to the BigQuery columns.

  • --write-metadata: This automatically populates those message_id and publish_time fields we added to our BigQuery schema for auditing.


Designing for Failure: The Dead Letter Topic (DLT)

But an architect must always design for failure. What happens if a schema evolution causes a mismatch, or BigQuery temporarily rejects an insert? By default, Pub/Sub retries delivery until the message retention period expires (seven days by default); after that, the message is dropped forever. Data loss in a production pipeline is unacceptable.

To prevent this, we must configure a Dead Letter Topic (DLT) alongside our subscription. This is a core defensive engineering practice.

By adding the --dead-letter-topic and --max-delivery-attempts flags to your subscription configuration, you create a safety net. If a message fails to write to BigQuery after, say, 5 attempts (perhaps due to an unforeseen schema mismatch), Pub/Sub automatically routes that specific message to the DLT and continues processing the rest of the queue.
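A sketch of that wiring, applied to the subscription we created earlier (the dead-letter topic name is illustrative):

```shell
# Create a quarantine topic for messages that repeatedly fail delivery.
gcloud pubsub topics create "${TOPIC_ID}-dead-letter" \
    --project="$GOOGLE_CLOUD_PROJECT"

# Route messages to it after 5 failed delivery attempts.
gcloud pubsub subscriptions update "$SUBSCRIPTION_ID" \
    --dead-letter-topic="${TOPIC_ID}-dead-letter" \
    --max-delivery-attempts=5 \
    --project="$GOOGLE_CLOUD_PROJECT"
```

Two operational details to remember: the Pub/Sub service account needs publisher permission on the dead-letter topic (and subscriber permission on the source subscription) for forwarding to work, and the dead-letter topic needs at least one subscription of its own, or the quarantined messages themselves will be discarded.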

Instead of losing the sighting, the malformed data is safely quarantined. You can set up an alert on the DLT, inspect the failing payload, patch your schema or application code, and then easily replay the dead-lettered message back into the main pipeline. Zero dropped records, zero panic.

With this configuration, GCP handles all the plumbing. As soon as the Cloud Run backend publishes a validated event to Pub/Sub, the infrastructure automatically streams it into BigQuery - securely and resiliently - with absolutely zero intermediate compute costs.
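As a quick smoke test of the whole pipeline, you could publish one event by hand and watch it appear in BigQuery moments later. The payload below is a made-up example matching the field names from our contract:

```shell
# Publish a single schema-valid sighting directly to the topic.
# If the payload violated the Avro schema (e.g. latitude as a string),
# this command would fail instead of polluting the warehouse.
gcloud pubsub topics publish "$TOPIC_ID" \
    --message='{"sighting_date": "2024-05-01", "latitude": 40.7128, "longitude": -74.006, "image_url": "https://example.com/dog.jpg"}' \
    --project="$GOOGLE_CLOUD_PROJECT"
```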

Conclusion

By enforcing a Data Contract via an Avro schema and utilizing native BigQuery subscriptions, we eliminated the "glue code" that normally plagues data pipelines. Our analytics team gets perfectly structured, partitioned data, and our application developers don't have to manage a single ingestion worker.
