Twio_AI

Posted on Jun 2

# From pg-boss to Cloud Tasks: How Twio Solved Queue Bursts and Database Connection Failures in a Serverless Architecture

#architecture #infrastructure #postgres #serverless

At Twio, the job system looked like a solved problem early on.

We started with pg-boss. It was simple, reliable, and worked directly inside PostgreSQL, which was already our core database.
Later, as we moved more of our workload into a serverless architecture, that same decision started creating unexpected problems:
database connections became fragile, Neon could not suspend cleanly, and an empty queue still kept the database awake.

We then considered Pub/Sub. It solved the polling problem, but introduced a different kind of risk: messages moved too fast,
retries amplified fan-out workloads, and downstream services could be overwhelmed.

Eventually, we moved key parts of Twio’s async pipeline to Google Cloud Tasks.

This article is not about proving that one queue is universally better than another. It is about how the right queue changes
when your runtime, database, and workload change.

## Twio’s Workload: Why We Needed a Job System

Twio is an AI SaaS platform for loan brokers. It helps brokers automate daily operational work, especially around documents,
emails, and structured client data.

One important workflow is email processing. Twio needs to download emails, parse the body and attachments, persist the data,
prepare RAG-ready content, classify the email context, and convert unstructured information into structured user data.

That workflow is naturally asynchronous.

A single email can contain multiple attachments. Each attachment may require document parsing, OCR, LLM classification, database
writes, and indexing. A batch of uploaded files can quickly become dozens or hundreds of background tasks.

So from the beginning, Twio needed a reliable job system to coordinate these workflows.

## The pg-boss Phase: A Very Reasonable First Choice

Twio’s core database was PostgreSQL, running on Neon. Because of that, pg-boss was the natural first choice.

It had one major advantage: no extra infrastructure.

The queue lived inside the Postgres database we already had. We did not need Redis, SQS, Pub/Sub, or Cloud Tasks. For an early
SaaS product, this mattered. Every additional system adds deployment, monitoring, permissions, cost, and failure modes.

pg-boss also had a much deeper advantage: transactional enqueue.

Because jobs are stored in the same database as the business data, we could create jobs inside the same transaction as our
application writes. Either both the business row and the job committed, or neither did.

That avoided the classic dual-write problem you get with external queues: the database write succeeds, then the queue API call
fails, leaving your system in an inconsistent state.
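To make the atomicity concrete, here is a minimal sketch that uses Python's built-in `sqlite3` as a stand-in for Postgres. The `emails` and `jobs` tables and the `import_email` helper are hypothetical names chosen for illustration; pg-boss manages its own schema and API, which are not shown here.

```python
import sqlite3


def setup_enqueue_demo(conn: sqlite3.Connection) -> None:
    # Hypothetical tables: one for business data, one acting as the job queue.
    conn.execute("CREATE TABLE emails (id TEXT PRIMARY KEY, subject TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE jobs (id INTEGER PRIMARY KEY, name TEXT NOT NULL, payload TEXT NOT NULL)"
    )


def import_email(conn: sqlite3.Connection, email_id: str, subject: str,
                 fail_after_insert: bool = False) -> None:
    # `with conn` commits if the block succeeds and rolls back if it raises,
    # so the business row and the job row share one transaction.
    with conn:
        conn.execute("INSERT INTO emails (id, subject) VALUES (?, ?)", (email_id, subject))
        if fail_after_insert:
            raise RuntimeError("simulated crash between the two writes")
        conn.execute("INSERT INTO jobs (name, payload) VALUES (?, ?)", ("parse-email", email_id))


def row_counts(conn: sqlite3.Connection) -> tuple:
    emails = conn.execute("SELECT COUNT(*) FROM emails").fetchone()[0]
    jobs = conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
    return emails, jobs
```

If the simulated crash fires, the email row is rolled back together with the missing job, so the two tables never disagree. With an external queue, the same crash would leave an email with no job.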

pg-boss also came with many useful job semantics:

- Delayed jobs and cron-style scheduling
- Retries with backoff
- Dead-letter queues
- Singleton keys and deduplication
- Rate limiting, throttling, and debouncing
- Job chaining
- Retention of completed jobs
- Full visibility through SQL

The SQL visibility was especially useful. Jobs were just rows. We could inspect queued, failed, retried, or stuck jobs using
plain SQL, and build quick dashboards or debugging queries without learning a separate operational surface.

For a Postgres-first system running on always-on infrastructure, pg-boss is an excellent tool.

But Twio was moving toward serverless.

## The Serverless Problem: pg-boss and Neon Wanted Opposite Things

As Twio grew, we moved more document parsing and heavy processing modules to serverless services such as Cloud Run. At the same
time, we continued using Neon as our serverless Postgres database.

This is where pg-boss started to hurt.

pg-boss is a polling-based queue. It regularly runs queries to find the next available job. The default polling interval is
around two seconds, and many teams tune it closer to one second. It also runs maintenance and monitoring queries.

That means even when there are no jobs, pg-boss still keeps sending queries to Postgres.

Neon, on the other hand, is designed to autosuspend compute after a period of inactivity. If nothing touches the database, Neon
can scale down. But if pg-boss polls every second, Neon’s idle timer keeps resetting.
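The arithmetic behind this is simple enough to sketch. The function names below are hypothetical, and the 300-second autosuspend timeout in the usage example is an assumed value for illustration, not a figure from our configuration.

```python
def empty_queue_polls_per_day(poll_interval_s: float) -> int:
    # Queries sent in 24 hours by one poller, even when no jobs exist.
    return int(86_400 // poll_interval_s)


def can_autosuspend(poll_interval_s: float, autosuspend_after_s: float) -> bool:
    # Every query restarts the idle timer, so the timer only expires
    # when the gap between two polls is longer than the timeout.
    return poll_interval_s > autosuspend_after_s
```

With a 2-second interval, `empty_queue_polls_per_day(2)` gives 43200 queries per day against an empty queue, and `can_autosuspend(2, 300)` is `False`: the database never sees a quiet period long enough to suspend.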

This created two problems.

The first was cost.

If the queue keeps polling, the database never really becomes idle. Neon compute stays awake, and the main cost advantage of
serverless Postgres disappears.

The second was connection stability.

If Neon had already suspended and pg-boss fired a polling query, Neon had to wake the compute. That wake-up can take hundreds of
milliseconds or a few seconds. During that window, the query that triggered the wake-up could time out or get dropped, causing
errors such as `Connection terminated`, `ECONNRESET`, or connection timeouts.

Connection pools made this worse. A pool could hold sockets that were already invalid because the server-side connection had
been closed during suspend. The next polling cycle would pick up a stale connection and fail.

This was not really a pg-boss bug. It was an architectural mismatch.

pg-boss wants a database that is always online and ready to answer polling queries. Neon wants to scale to zero when there is no
real work. Those two models fight each other.

So we needed a queue that did not keep touching the database when there was no work.

## Considering Pub/Sub: Event-Driven, But Too Fast

The obvious next candidate was GCP Pub/Sub.

Pub/Sub is event-driven. There is no polling loop constantly hitting Postgres. When there is no work, the database can stay idle
and Neon can suspend freely. That seemed like the right fix for the pg-boss problem.

But Pub/Sub introduced a different issue.

Pub/Sub is a high-throughput messaging system. It is great at moving messages quickly. But Twio needed a controlled job
pipeline, not just a fast message bus.

Our workload often involves fan-out. For example, one email import job may create 100 child parse jobs.

With Pub/Sub’s at-least-once delivery model, duplicates are normal. If the parent import job publishes child messages and then
fails before acking, Pub/Sub redelivers the parent message. The parent runs again and publishes another 100 child messages.

After a few retries, you do not just have a retried parent job. You have hundreds of duplicate child jobs.

This is retry amplification.
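A small simulation shows how quickly the numbers grow. It is a sketch under simple assumptions: the parent publishes every child and then fails before acking on each failed attempt. The function names and the `email-1` identifier are hypothetical.

```python
def run_parent(published: list, email_id: str, attachments: int) -> None:
    # The parent publishes one child message per attachment on every run.
    for index in range(attachments):
        published.append(f"{email_id}:{index}")


def simulate_redelivery(attachments: int, failed_attempts: int) -> tuple:
    published = []
    # Each failed attempt publishes all children and then crashes before
    # the ack, so the parent is redelivered and runs again from the start.
    for _ in range(failed_attempts + 1):
        run_parent(published, "email-1", attachments)
    return len(published), len(set(published))
```

For example, `simulate_redelivery(100, 3)` returns `(400, 100)`: 400 child messages published for only 100 distinct attachments, which means 300 of them are duplicates.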

Pub/Sub also does not provide native job-level rate limiting in the way we needed. Subscribers consume as fast as they can. If
300 messages appear at once, they can quickly hit the parser, database, LLM provider, and third-party APIs at the same time.

For Twio, that was dangerous. Our downstream systems were not designed to absorb unlimited bursts.

There is also the ack-deadline problem. If a long-running parse job exceeds the ack deadline and the lease is not extended
correctly, Pub/Sub assumes the job failed and redelivers it, possibly while the original job is still running.

These problems can be managed, but they require additional design:

- Idempotency keys for every job
- Fan-out separated from retryable work
- An outbox or staging table for child jobs
- Bounded retries and dead-letter topics
- Subscriber-side flow control
- Exponential backoff retry policy

The lesson was clear: Pub/Sub solved the polling problem, but it did not give us the dispatch control we needed.

It was too good at delivering messages quickly.

We needed a queue with a built-in throttle.

## Why Cloud Tasks Fit Better

Cloud Tasks was a better match for this part of Twio’s architecture.

It is push-based. Google manages the queue, and when a task is due, Cloud Tasks sends an HTTP request to our handler. If there
are no tasks, it does not touch our database.

That solved the pg-boss and Neon conflict:

- No polling loop against Postgres
- Neon can suspend when there is no work
- No constant database wake-ups
- Fewer connection errors around suspend and resume
- No always-on database cost caused by an empty queue

But the bigger reason Cloud Tasks worked for us was dispatch control.

Each queue can be configured with:

- `maxDispatchesPerSecond`
- `maxConcurrentDispatches`
- `maxAttempts`
- `minBackoff`
- `maxBackoff`
- `maxDoublings`

This means that even if we enqueue 300 tasks in one second, Cloud Tasks does not have to deliver all of them immediately. It can
pace dispatch according to the limits we define.

That protects our parsers, Neon, LLM providers, and downstream APIs.
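To get a feel for these settings, the sketch below models them in plain Python. It is an approximation, not the Cloud Tasks implementation. The drain-time estimate assumes steady dispatch and a known average handler duration. The backoff schedule follows our reading of the documented behaviour: the interval doubles `maxDoublings` times, then grows linearly, then stays at `maxBackoff`. The class name, function names, and the values in the usage note are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueLimits:
    # Field names mirror the queue settings listed above; values are up to you.
    max_dispatches_per_second: float
    max_concurrent_dispatches: int
    max_attempts: int
    min_backoff_s: float
    max_backoff_s: float
    max_doublings: int


def min_drain_seconds(limits: QueueLimits, task_count: int, avg_handler_s: float) -> float:
    # Throughput is capped by the dispatch rate and by how many handlers
    # can be in flight at once (concurrency / average duration).
    concurrency_cap = limits.max_concurrent_dispatches / avg_handler_s
    effective_rate = min(limits.max_dispatches_per_second, concurrency_cap)
    return task_count / effective_rate


def retry_intervals(limits: QueueLimits) -> list:
    # The first attempt is not a retry, so there are max_attempts - 1 waits.
    intervals = []
    interval = limits.min_backoff_s
    linear_step = limits.min_backoff_s * 2 ** limits.max_doublings
    for retry in range(limits.max_attempts - 1):
        intervals.append(min(interval, limits.max_backoff_s))
        if retry < limits.max_doublings:
            interval *= 2  # doubling phase
        else:
            interval += linear_step  # linear phase, capped by max_backoff_s
    return intervals
```

With `QueueLimits(10.0, 20, 9, 10.0, 300.0, 3)` and handlers that take about 4 seconds, a burst of 300 tasks needs at least 60 seconds to drain, because concurrency (20 / 4 = 5 tasks per second) is the tighter limit. The retry waits for the same queue come out as 10, 20, 40, 80, 160, 240, 300, and 300 seconds.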

Cloud Tasks also gives better operational control than Pub/Sub for this workload. We can list tasks, inspect queue depth, pause
a queue, and purge a bad batch. When a bad fan-out happens, we have a way to stop and recover.

That operational control matters in a production SaaS system.

## What Cloud Tasks Does Not Solve

Cloud Tasks fixed our infrastructure mismatch, but it did not remove the need for correctness design.

It is still an at-least-once system. A handler can finish the work, but if the HTTP response is lost or times out, Cloud Tasks
may dispatch the task again.

So handlers still need to be idempotent.
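One common way to get there is to record a task key in the same transaction as the handler's side effect. The sketch below uses `sqlite3` as a stand-in for Postgres; the table names, `handle_parse_task`, and the key format are hypothetical.

```python
import sqlite3


def setup_idempotency_demo(conn: sqlite3.Connection) -> None:
    # The primary key on task_key is what rejects a second delivery.
    conn.execute("CREATE TABLE processed_tasks (task_key TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE parsed_attachments (attachment_id TEXT, body TEXT)")


def handle_parse_task(conn: sqlite3.Connection, task_key: str,
                      attachment_id: str, body: str) -> bool:
    """Return True if the work ran, False if this delivery was a duplicate."""
    try:
        # Claiming the key and writing the result commit together, so a
        # crash in between leaves no half-finished state behind.
        with conn:
            conn.execute("INSERT INTO processed_tasks (task_key) VALUES (?)", (task_key,))
            conn.execute(
                "INSERT INTO parsed_attachments (attachment_id, body) VALUES (?, ?)",
                (attachment_id, body),
            )
    except sqlite3.IntegrityError:
        # Key already claimed: an earlier delivery did the work.
        return False
    return True
```

In an HTTP handler, both outcomes would normally be reported as success, so the queue stops retrying a task whose work is already done.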

Fan-out amplification can also still happen if you design it badly. Suppose an import task creates 100 parse tasks and then
fails before returning a 2xx response. Cloud Tasks retries the import task. If the import task creates the same 100 child
tasks again, you still get duplicates.

Cloud Tasks gives us a cleaner way to solve this: deterministic task names.

For example, child tasks can be named using business identifiers such as:

`parse-{emailId}-{attachmentId}`

If the parent task retries and tries to create the same child task again, Cloud Tasks rejects the create call with an
`ALREADY_EXISTS` error for as long as that task name is still inside its deduplication window.

But this is not automatic. You have to design for it.
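The sketch below shows the shape of that design with an in-memory stand-in for the queue. `FakeQueue`, `DuplicateTaskError`, and `fan_out` are hypothetical; a real client would call the Cloud Tasks API and treat its `ALREADY_EXISTS` error the same way. The character check reflects our understanding that task IDs allow only letters, digits, hyphens, and underscores, up to 500 characters.

```python
import re


class DuplicateTaskError(Exception):
    """Stand-in for the error a queue returns when a task name is reused."""


class FakeQueue:
    # In-memory substitute for a real queue client. Unlike a real queue,
    # it never forgets a name.
    def __init__(self) -> None:
        self._names = set()

    def create_task(self, name: str, payload: dict) -> None:
        if name in self._names:
            raise DuplicateTaskError(name)
        self._names.add(name)

    def __len__(self) -> int:
        return len(self._names)


def child_task_id(email_id: str, attachment_id: str) -> str:
    task_id = f"parse-{email_id}-{attachment_id}"
    if not re.fullmatch(r"[A-Za-z0-9_-]{1,500}", task_id):
        raise ValueError(f"task id contains unsupported characters: {task_id!r}")
    return task_id


def fan_out(queue: FakeQueue, email_id: str, attachment_ids: list) -> int:
    created = 0
    for attachment_id in attachment_ids:
        try:
            queue.create_task(
                child_task_id(email_id, attachment_id),
                {"email_id": email_id, "attachment_id": attachment_id},
            )
            created += 1
        except DuplicateTaskError:
            # A previous run of the parent already created this child.
            continue
    return created
```

Running `fan_out` twice with the same email and attachments creates the children once and skips them on the retry, so a retried parent no longer multiplies the work.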

Cloud Tasks also does not recover pg-boss’s strongest feature: transactional enqueue.

Because Cloud Tasks is outside the database, creating a task after writing business data is still a dual-write operation. The
database write can succeed, and the Cloud Tasks API call can fail.

If strict atomicity is required, the right pattern is still a transactional outbox:

1. Write the business data and outbox row in the same database transaction.
2. A separate relay reads the outbox.
3. The relay publishes tasks to Cloud Tasks.
4. The relay marks outbox rows as published.
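The steps above can be sketched as follows, again with `sqlite3` standing in for Postgres. The `publish` callable is a placeholder for the real Cloud Tasks call, and the table and function names are hypothetical.

```python
import json
import sqlite3
from typing import Callable


def setup_outbox_demo(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE emails (id TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE outbox (id INTEGER PRIMARY KEY, task_name TEXT UNIQUE, "
        "payload TEXT NOT NULL, published INTEGER NOT NULL DEFAULT 0)"
    )


def save_email_with_outbox(conn: sqlite3.Connection, email_id: str,
                           attachment_ids: list) -> None:
    # Step 1: business row and outbox rows commit or roll back together.
    with conn:
        conn.execute("INSERT INTO emails (id) VALUES (?)", (email_id,))
        for attachment_id in attachment_ids:
            conn.execute(
                "INSERT INTO outbox (task_name, payload) VALUES (?, ?)",
                (f"parse-{email_id}-{attachment_id}",
                 json.dumps({"email_id": email_id, "attachment_id": attachment_id})),
            )


def relay_once(conn: sqlite3.Connection, publish: Callable[[str, dict], None]) -> int:
    # Step 2: read rows that have not been published yet.
    rows = conn.execute(
        "SELECT id, task_name, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, task_name, payload in rows:
        # Step 3: publish. If this raises, the row stays unpublished
        # and is picked up again on the next pass.
        publish(task_name, json.loads(payload))
        # Step 4: mark the row as published.
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

If the relay crashes between publishing and marking, the same task is published again on the next pass. That is why the outbox works best together with deterministic task names and idempotent handlers.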

No external queue can magically solve this. It has to be handled at the architecture level.

## The Final Selection Methodology

The biggest lesson for us was that queue selection is not about finding the “best” tool. It is about matching the tool to the
workload and runtime model.

pg-boss is still a strong option for internal jobs that run in an always-on service and need tight transactional consistency
with Postgres. Its transactional enqueue and SQL visibility are real advantages.

Pub/Sub is excellent for event broadcasting and high-throughput system integration. But for long-running, side-effect-heavy,
fan-out job pipelines, it requires careful idempotency, flow control, and retry design.

Cloud Tasks is the best fit for Twio’s serverless-heavy business workflows where we need controlled concurrency, bounded
retries, and protection for downstream systems.

Our current approach is:

Use pg-boss for small internal jobs that benefit from Postgres transactionality and run in stable, always-on environments.

Use Cloud Tasks for cross-system, heavy, serverless workflows, especially when we need to protect third-party APIs, LLM
providers, parsers, or Neon from bursts.

And regardless of the queue, we keep three rules:

1. Every handler must be idempotent.
2. Fan-out child jobs must have deterministic keys or another deduplication mechanism.
3. If enqueueing must be atomic with a business write, use the outbox pattern.

Cloud Tasks solved the operational problems that pg-boss and Pub/Sub created in our serverless setup. But the deeper improvement
was not just changing the queue. It was clarifying what the queue should and should not be responsible for.

Infrastructure can help with scheduling, retries, and rate limits.

Correctness still belongs to the application design.