Twio_AI

Posted on Jun 2

From pg-boss to Cloud Tasks: Fixing Queue Bursts and DB Connection Failures on Serverless

#architecture #cloud #postgres #serverless

At Twio we picked pg-boss for our job queue, ran into trouble when we went serverless, looked at Pub/Sub, and ended up on Google Cloud Tasks. This is what each queue got right, what it got wrong for our workload, and the rule we landed on for choosing between them.

The workload

Twio is an AI SaaS for loan brokers. The piece that needs a job queue is email processing: download an email, parse the body and attachments, OCR, classify with an LLM, write structured data, and index for RAG. One email with five attachments easily becomes 30+ background jobs. A batch upload becomes hundreds.

Why pg-boss worked — until it didn't

Our database was Postgres on Neon, so pg-boss was the obvious starting point. No extra infrastructure, and one feature we genuinely loved: transactional enqueue. Because jobs live in the same database as business data, you can create a job in the same transaction as the row that triggered it. No dual-write problem, no "DB succeeded but the queue API failed" inconsistency.

It also gave us retries, delayed jobs, dead-letter queues, dedup keys, and full SQL visibility into stuck or failed jobs. For a Postgres-first app on always-on infra, it's an excellent tool.

Then we moved heavy processing to Cloud Run, and the cracks showed up.

pg-boss polls. Neon suspends. They want opposite things.

pg-boss runs a query roughly every 1–2 seconds to look for the next job, plus maintenance queries. Neon autosuspends compute only after a period with no database activity. If the queue polls every second or two, that idle timer never expires, so you pay for always-on compute even when the queue is empty.

Worse, when Neon did manage to suspend, the next poll had to wake it. That wake-up takes anywhere from hundreds of milliseconds to a few seconds, and the queries that triggered it would fail with Connection terminated, ECONNRESET, or timeouts. Pooled connections made it worse: the pool kept sockets that the server had already closed during suspend, and the next polling cycle picked one up and broke.

This isn't a pg-boss bug. It's an architectural mismatch.

Why Pub/Sub wasn't the answer

Pub/Sub is event-driven — no polling against Postgres, Neon can suspend freely. That fixed the obvious problem, but introduced a worse one for our shape of work.

Pub/Sub is built to move messages fast. We needed a queue that moves messages carefully.

Two specific failure modes hit us:

Retry amplification. A parent import job publishes 100 child parse messages, then crashes before acking. Pub/Sub redelivers the parent. The parent re-publishes 100 children. After a few retries, you have hundreds of duplicate child jobs.
No native job-level pacing. If 300 messages land at once, subscribers consume them as fast as they can, slamming our parser, Neon, the LLM provider, and third-party APIs simultaneously. Pub/Sub does have subscriber-side flow control, but that caps how many messages each subscriber client holds at once. It is not the per-queue dispatch rate limit we needed.

On top of those two, there was the ack-deadline problem on long parse jobs: if a lease extension is missed, Pub/Sub redelivers the message while the original is still running.

All of these are solvable with idempotency keys, outboxes, and bounded retries — but at that point you're rebuilding what a job queue should give you out of the box.
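To make the retry-amplification arithmetic concrete, here is a small in-memory simulation. It is only a sketch: `runParent`, `simulate`, and the `parse-{emailId}-{index}` key format are hypothetical names for illustration, and the `Set` stands in for whatever dedup store a real system would use.

```typescript
// Simulates a parent job that publishes `fanOut` child messages and then
// crashes before acking, so the broker delivers it again.
type Publish = (key: string) => void;

function runParent(emailId: string, fanOut: number, publish: Publish): void {
  for (let i = 0; i < fanOut; i++) {
    publish(`parse-${emailId}-${i}`);
  }
  // Imagine a crash here: the parent message is never acked.
}

// Returns how many child jobs exist after `deliveries` runs of the parent.
function simulate(deliveries: number, fanOut: number, dedupe: boolean): number {
  const seen = new Set<string>();
  let childJobs = 0;
  const publish: Publish = (key) => {
    if (dedupe) {
      if (seen.has(key)) return; // same idempotency key: drop the duplicate
      seen.add(key);
    }
    childJobs++;
  };
  for (let d = 0; d < deliveries; d++) {
    runParent("email-1", fanOut, publish);
  }
  return childJobs;
}

// One original delivery plus three redeliveries of a 100-child parent.
console.log(simulate(4, 100, false)); // 400 child jobs without keys
console.log(simulate(4, 100, true)); // 100 child jobs with keys
```

The keyed version stays flat no matter how many times the parent is redelivered, which is the property a job queue would ideally provide without the extra bookkeeping.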

Why Cloud Tasks fit

Cloud Tasks is push-based: when a task is due, Google sends an HTTP request to our handler. When there are no tasks, nothing touches our database. That alone resolved the pg-boss/Neon conflict — Neon suspends, costs drop, no more wake-up connection errors.

But the real reason it fit was per-queue dispatch control:

# queue.yaml
- name: email-parse
  rateLimits:
    maxDispatchesPerSecond: 10
    maxConcurrentDispatches: 20
  retryConfig:
    maxAttempts: 5
    minBackoff: 10s
    maxBackoff: 600s
    maxDoublings: 4

Enqueue 300 tasks in a second and Cloud Tasks won't deliver them all at once — it paces dispatch to the limits we set. Our parsers, Neon, and the LLM provider stay protected from bursts.
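As a rough illustration of what that config implies, the sketch below works out the numbers. It assumes the backoff behaviour described in the Cloud Tasks documentation (the interval starts at minBackoff, doubles maxDoublings times, then grows linearly, capped at maxBackoff) and that maxAttempts counts the first attempt. The drain time ignores burst size and the concurrency cap, so treat it as a lower bound. The helper names are made up for this example.

```typescript
interface QueueConfig {
  maxDispatchesPerSecond: number;
  maxAttempts: number;
  minBackoffSec: number;
  maxBackoffSec: number;
  maxDoublings: number;
}

// Mirrors the email-parse queue settings shown above.
const emailParse: QueueConfig = {
  maxDispatchesPerSecond: 10,
  maxAttempts: 5,
  minBackoffSec: 10,
  maxBackoffSec: 600,
  maxDoublings: 4,
};

// Lower bound on how long a burst takes to dispatch at the configured rate.
function minDrainSeconds(taskCount: number, cfg: QueueConfig): number {
  return taskCount / cfg.maxDispatchesPerSecond;
}

// Wait before each retry, assuming maxAttempts includes the first attempt.
function retryIntervalsSec(cfg: QueueConfig): number[] {
  const intervals: number[] = [];
  let next = cfg.minBackoffSec;
  for (let retry = 0; retry < cfg.maxAttempts - 1; retry++) {
    intervals.push(Math.min(next, cfg.maxBackoffSec));
    if (retry < cfg.maxDoublings) {
      next *= 2; // doubling phase
    } else {
      next += cfg.minBackoffSec * 2 ** cfg.maxDoublings; // linear phase
    }
  }
  return intervals;
}

console.log(minDrainSeconds(300, emailParse)); // 30
console.log(retryIntervalsSec(emailParse)); // [ 10, 20, 40, 80 ]
```

Under these assumptions a 300-task burst is spread over at least 30 seconds, and a task that keeps failing is given up on about 150 seconds after its first attempt, before maxBackoff ever comes into play.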

It also gives us operational levers Pub/Sub doesn't: list tasks, inspect depth, pause a queue, purge a bad batch. When a fan-out goes wrong, we can stop it.

What Cloud Tasks doesn't solve

Three things, all important.

It's still at-least-once. A handler can finish the work and Cloud Tasks can still redeliver if the HTTP response is lost. Handlers must be idempotent.
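A minimal sketch of what an idempotent handler can look like. The `Set` is a stand-in for a durable store (in practice this would more likely be a table with a unique constraint), `handleTask` is a hypothetical name, and the key is assumed to be the task name, which Cloud Tasks sends to HTTP targets in the X-CloudTasks-TaskName header.

```typescript
// Keys of tasks whose work has already completed.
// Assumption: a real handler would persist this, not keep it in memory.
const completed = new Set<string>();

// Returns the HTTP status the handler would send back to Cloud Tasks.
function handleTask(done: Set<string>, taskName: string, work: () => void): number {
  if (done.has(taskName)) {
    // Redelivery of finished work: acknowledge with a 2xx and do nothing.
    return 200;
  }
  try {
    work();
    done.add(taskName); // record completion only after the work succeeded
    return 200;
  } catch {
    // A non-2xx response asks Cloud Tasks to retry under the queue's retryConfig.
    return 500;
  }
}

let parses = 0;
const first = handleTask(completed, "parse-email1-att1", () => { parses++; });
const redelivered = handleTask(completed, "parse-email1-att1", () => { parses++; });
console.log(first, redelivered, parses); // 200 200 1
```

This covers redelivery after the work finished. Two deliveries running at the same time would both pass the check, so a real implementation also needs an atomic claim, which is left out here.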

Fan-out duplication is still possible. If the parent creates 100 child tasks and then fails before returning 200, the retried parent creates them again. The fix here is deterministic task names:

parse-{emailId}-{attachmentId}

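Cloud Tasks rejects a create call that reuses a task name within its retention window, so when the retried parent tries to create the same children again, those calls fail with an ALREADY_EXISTS error instead of producing duplicates. The parent has to treat that error as success, and the names have to be deterministic in the first place. You have to design for it; none of this is automatic.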
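Here is a sketch of how the naming and the duplicate handling might fit together. The project, location, and queue values are placeholders, `createTask` is an injected stand-in for the real client call (which is asynchronous; it is synchronous here to keep the example short), and the check on `err.code === 6` assumes the client surfaces the gRPC ALREADY_EXISTS status code that way.

```typescript
const ALREADY_EXISTS = 6; // gRPC status code for a name that is already taken

// Task IDs may only contain letters, digits, hyphens, and underscores.
function childTaskId(emailId: string, attachmentId: string): string {
  const id = `parse-${emailId}-${attachmentId}`;
  if (!/^[A-Za-z0-9_-]{1,500}$/.test(id)) {
    throw new Error(`invalid task ID: ${id}`);
  }
  return id;
}

// Fully qualified name in the form Cloud Tasks expects.
function childTaskName(project: string, location: string, queue: string, taskId: string): string {
  return `projects/${project}/locations/${location}/queues/${queue}/tasks/${taskId}`;
}

function isDuplicateTask(err: unknown): boolean {
  return typeof err === "object" && err !== null &&
    (err as { code?: unknown }).code === ALREADY_EXISTS;
}

// Creates the task, treating "name already taken" as success.
function enqueueChild(createTask: (name: string) => void, name: string): "created" | "duplicate" {
  try {
    createTask(name);
    return "created";
  } catch (err) {
    if (isDuplicateTask(err)) return "duplicate";
    throw err; // anything else is a real failure
  }
}

// Fake client that rejects reused names, for demonstration only.
const existing = new Set<string>();
const fakeCreateTask = (name: string): void => {
  if (existing.has(name)) {
    throw Object.assign(new Error("ALREADY_EXISTS"), { code: ALREADY_EXISTS });
  }
  existing.add(name);
};

const name = childTaskName("my-project", "us-central1", "email-parse", childTaskId("e42", "a1"));
console.log(enqueueChild(fakeCreateTask, name)); // created
console.log(enqueueChild(fakeCreateTask, name)); // duplicate
```

One design note, offered with some caution: the Cloud Tasks documentation advises against task names that share sequential prefixes, so hashing the IDs into the name may be worth considering if enqueue latency becomes a problem.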

And it doesn't recover transactional enqueue. Cloud Tasks lives outside the database, so creating a task after a DB write is a dual-write. If you need strict atomicity, the answer is still an outbox: write the business row and an outbox row in one transaction, have a relay publish to Cloud Tasks and mark the row published. No external queue makes this go away.
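The outbox flow can be sketched in memory as follows. Everything here is illustrative: the `Db` class fakes a SQL transaction with a snapshot and rollback, `saveEmail` and `relay` are hypothetical names, and `publish` stands in for the call that creates a Cloud Tasks task.

```typescript
interface OutboxRow {
  id: number;
  taskName: string;
  published: boolean;
}

// Stand-in for Postgres: two "tables" and an all-or-nothing transaction.
class Db {
  emails: string[] = [];
  outbox: OutboxRow[] = [];

  transaction(fn: (tx: Db) => void): void {
    const emails = [...this.emails];
    const outbox = this.outbox.map((row) => ({ ...row }));
    try {
      fn(this);
    } catch (err) {
      this.emails = emails; // roll back both tables together
      this.outbox = outbox;
      throw err;
    }
  }
}

// Business row and outbox row are written in the same transaction.
function saveEmail(db: Db, emailId: string): void {
  db.transaction((tx) => {
    tx.emails.push(emailId);
    tx.outbox.push({ id: tx.outbox.length + 1, taskName: `parse-${emailId}`, published: false });
  });
}

// Relay: publish every unpublished row, then mark it published.
function relay(db: Db, publish: (taskName: string) => void): number {
  let sent = 0;
  for (const row of db.outbox) {
    if (row.published) continue;
    publish(row.taskName); // if this throws, the row stays unpublished for the next run
    row.published = true;
    sent++;
  }
  return sent;
}

const db = new Db();
saveEmail(db, "e1");
saveEmail(db, "e2");
const sentNames: string[] = [];
console.log(relay(db, (n) => sentNames.push(n))); // 2
console.log(relay(db, (n) => sentNames.push(n))); // 0
```

If the relay crashes after publishing but before marking the row, the next run publishes the same row again. That is why the outbox pairs naturally with deterministic task names and idempotent handlers.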

The rule we landed on

Queue selection isn't about finding the best queue. It's about matching the queue to the runtime model.

pg-boss for small internal jobs in always-on services where Postgres transactionality matters.
Cloud Tasks for cross-system, serverless workflows where we need to protect Neon, LLM providers, and third-party APIs from bursts.

And three rules that apply regardless:

Every handler is idempotent.
Fan-out children have deterministic keys.
If enqueue must be atomic with a business write, use an outbox.

Cloud Tasks fixed our infrastructure mismatch, but the real win was clarifying what the queue is responsible for. Infrastructure handles scheduling, retries, and rate limits. Correctness belongs to the application.

Top comments (2)

Andres Victoria • Jun 15

The architectural mismatch framing is exactly right — pg-boss polling against Neon is two incompatible lifecycle models fighting each other. I hit a related version of this on Aurora Serverless v2 with Vercel Fluid Compute: the function can stay warm between invocations, so naive connection pool settings meant each warm instance was holding open its own connections, and the DB kept seeing connection count climb even under light load. The fix was max: 1 on the postgres-js client so each function instance only ever holds one connection, and the pooler (RDS Proxy) does the actual multiplexing. Different runtime, same root cause — the DB connection lifecycle assumptions from always-on servers just don't map to serverless. Wrote up the full tradeoff here if useful: dev.to/member_8be1f66f/why-i-repla...

Twio_AI • Jun 21

Love this — your Aurora + Fluid Compute case is the same story from the other side. That's exactly the generalization I was reaching for: the fix isn't "tune the pool," it's move connection multiplexing out of the ephemeral app instances and into a dedicated pooler. max: 1 + RDS Proxy is the clean version of that. Going to read your write-up — thanks for dropping it :)