Twio_AI

Posted on Jun 30 • Edited on Jul 1

Inside Twio: Optimizing Queue System for AI Workflows Instead of Scaling Compute

#architecture #googlecloud #programming #backend

Whenever a few users ran Twio's email import at the same time, it slowed to a crawl — and adding servers wouldn't have helped. Here's where the real bottleneck was hiding, and the simple fix that sorted it out.

What this system actually does

At Twio, we build software for mortgage brokers. One of our features scans a broker's mailbox and assembles a "loan book" — every client, every loan, every upcoming rate-expiry date.

Behind that one button sits a real pipeline. We pull the broker's mail, make sense of it, and fold it into a clean, deduplicated book of loans:

Broker's Gmail
    |  download emails + attachments
    v
Extract & clean ........ parse PDFs, strip quoted replies
    |
    v
Understand ............. Gemini reads each email & doc
    |
    v
Index for retrieval .... RAG embeddings
    |
    v
Consolidate ............ dedupe, group by client & household
    |
    v
Loan book

The Understand step is where the model earns its keep: Gemini reads each email and each PDF and pulls out the facts (who, which bank, how much, and when the rate expires). Most of this runs on our own machines. But two steps lean on services we don't control: the bulk download from Gmail and the reading by Gemini. Hold onto that, because the whole story grows out of those two.

The symptom: a job that looked stuck for fifteen minutes

One day a broker opens a ticket: "Your scan is broken — it's been stuck for fifteen minutes."

We pull the logs. The job isn't broken. It hasn't even started downloading its first batch. It's sitting in line, behind three other brokers who happened to click "scan" in the same five-minute window.

Our background jobs ran on Google Cloud Tasks, with a deliberately simple setup: one job type, one queue, one task at a time. To avoid hogging that single slot, a scan splits itself into small batches: download a batch, re-enqueue itself at the back of the line, download the next, and repeat until done.

That "one queue, one at a time" is where it breaks. When N brokers scan at once, their batches interleave in the same line:

One shared queue (FIFO):   A1  B1  C1  A2  B2  C2  A3  B3  C3  ...
                           A's 3rd batch waits for the 7th slot -> 3x slower

Every scan slows to 1/N speed. And the frontend has a safety net: if progress doesn't move for ~12 minutes, it declares the scan "stalled" and tells the user to retry. So the brokers stuck at the back of the line — whose jobs were perfectly healthy — hit a "stalled" screen for a problem that didn't exist.
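
To see the interleaving concretely, here is a minimal TypeScript sketch of the "run one batch, then re-enqueue yourself at the back" pattern on a single FIFO queue. It is a toy in-memory model, not our Cloud Tasks code; `sharedQueueOrder` and `slotOf` are names made up for this illustration, and every batch is assumed to take one slot.

```typescript
// Toy model: one shared FIFO queue that runs one batch at a time.
type Batch = { user: string; index: number };

// Each scan runs one batch, then re-enqueues its next batch at the back of the line.
function sharedQueueOrder(users: string[], batchesPerUser: number): Batch[] {
  const queue: Batch[] = users.map((user) => ({ user, index: 1 }));
  const executed: Batch[] = [];
  while (queue.length > 0) {
    const batch = queue.shift()!;
    executed.push(batch);
    if (batch.index < batchesPerUser) {
      queue.push({ user: batch.user, index: batch.index + 1 });
    }
  }
  return executed;
}

// 1-based slot in which a given batch runs (0 if it never runs).
function slotOf(order: Batch[], user: string, index: number): number {
  for (let i = 0; i < order.length; i++) {
    if (order[i].user === user && order[i].index === index) return i + 1;
  }
  return 0;
}

const order = sharedQueueOrder(["A", "B", "C"], 3);
console.log(order.map((b) => `${b.user}${b.index}`).join(" "));
// A1 B1 C1 A2 B2 C2 A3 B3 C3
console.log(slotOf(order, "A", 3)); // 7
```

With three scans in flight, A's third batch runs in slot 7 instead of slot 3, which is the slowdown the brokers were seeing.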

The reflex that doesn't work

When something is slow, the first instinct is always the same: throw machines at it. We run on Cloud Run, which autoscales to — for all practical purposes — infinite compute. Still not fast enough? Just let more tasks run at once.

And normally, that instinct is right. Picture a supermarket checkout. If cashier A rings people up five times faster than cashier B, the line behind A clears five times faster. Faster cashier, shorter line — that's almost the definition of the job. So more lanes and quicker workers should mean shorter queues. Obviously.

Except it would have done nothing for us. Because in our store, the bottleneck was never the cashier.

It was the customer.

Picture the person at the front of the line moving at the speed of Flash — the sloth from Zootopia. You can sit a Formula 1 driver at the register; he'll just sit there, tapping the conveyor belt, while Flash s-l-o-w-l-y reaches for his wallet and s-l-o-w-l-y counts out the change. The moment the customer is the slow part, the cashier's speed stops mattering at all.

That sloth is Gmail.

The real bottleneck lives somewhere else

So why is Gmail our sloth? A per-user rate limit.

Gmail caps how fast you can call its API for any single user. For one broker's mailbox you get so many calls a second and not one more; push past it and Gmail hands back a flat 429. It doesn't matter how fast our worker runs — Gmail feeds us that one mailbox at sloth speed regardless. More machines don't make the broker move faster; they just park more Formula 1 drivers behind the same slow customer.

(The far end of the pipeline pulls the same trick, by the way. Gemini — the model reading the emails — meters us too, and we tame that one with a different valve that deserves its own post. I only bring it up because the shape keeps repeating: both ends of this pipeline run on a budget we don't get to set.)

And here's what stung: the parallelism we needed was sitting right there, unused. Every broker's Gmail budget is independent — broker A draining A's quota does nothing to B's. They could have run fully in parallel from the start. Our one shared queue was quietly throwing that away.

The diagnosis was almost embarrassing in hindsight: the limit is per user, but our queue was global. We had built a single chokepoint in front of a crowd of independent, parallelizable customers.

The fix: one queue per user

Replace "one global queue" with "one queue per user, processed in parallel," and you get two things at once:

Speed: different users run in their own queues at the same time, so throughput climbs with the number of active users instead of collapsing to 1/N.
Safety: a single user's batches still run in one queue, one at a time, which naturally keeps them inside that user's Gmail budget, so no self-inflicted 429.

One property — a private lane per user, many lanes in parallel — cures both the slowness and the rate-limiting.
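
As a rough illustration of the difference, the sketch below computes when each scan finishes under a given lane assignment. It is a simplified model under stated assumptions: every batch takes one time unit, all scans start together, lanes run in parallel, and scans sharing a lane take turns one batch at a time. The function `finishTimes` and the lane names are hypothetical.

```typescript
// Returns, per user, the time unit at which their last batch completes.
function finishTimes(
  users: string[],
  batchesPerUser: number,
  laneOf: (user: string) => string,
): Record<string, number> {
  // Group users by lane, preserving arrival order.
  const lanes: Record<string, string[]> = {};
  for (const user of users) {
    const lane = laneOf(user);
    if (!lanes[lane]) lanes[lane] = [];
    lanes[lane].push(user);
  }

  const finish: Record<string, number> = {};
  for (const lane of Object.keys(lanes)) {
    const members = lanes[lane];
    members.forEach((user, position) => {
      // Round-robin within a lane: the last batch runs after (batches - 1) full rounds.
      finish[user] = (batchesPerUser - 1) * members.length + position + 1;
    });
  }
  return finish;
}

const users = ["A", "B", "C"];
const shared = finishTimes(users, 3, () => "global");  // one global queue
const perUser = finishTimes(users, 3, (user) => user); // one lane per user

console.log(shared["A"], shared["B"], shared["C"]);    // 7 8 9
console.log(perUser["A"], perUser["B"], perUser["C"]); // 3 3 3
```

In this model the shared queue makes every scan wait on everyone else's batches, while private lanes let each scan finish as if it were alone.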

It sounds simple. But the most obvious way to build it is the one thing you must not do.

The trap: the 7-day rule

"One queue per user" leads straight to: create a queue named after the user on demand, delete it when it goes idle. Users grow without bound, so cleaning up feels mandatory.

And that walks right into a quiet rule in Cloud Tasks:

Once you delete a queue, its name can't be reused for about 7 days.

Which means: a broker scans today, the queue gets deleted — and if they come back to scan within 7 days, they hit a name that's still "on hold," and the job fails to enqueue. The users who return within a week are exactly the active ones you most want to serve.

Put bluntly: "one queue per user" is the right idea; "a queue you create and delete per user" is the wrong mechanism.

The solution: a fixed pool, assigned by hashing

So we flipped it around. Instead of building a queue per user, we provision a fixed pool of queues once, at startup — we run 128 — and from then on we only use them, never delete them. A simple hash then assigns each user to one, evenly and permanently:

const POOL_SIZE = 128

// Every user maps to the same queue, every time.
function queueForUser(userId: number): string {
  const slot = userId % POOL_SIZE   // userId is an auto-increment int -> modulo spreads evenly
  return `email-import--q${slot}`
}

hash(userId) % 128  ->  one fixed queue, always the same one
(for our integer ids, hash is just the id itself, as in the code above)

   Broker A  ->  q5     (A and C happen to collide
   Broker C  ->  q5      on the same lane)
   Broker B  ->  q42

   Different lanes run in parallel; the same lane runs one at a time.

Don't let "hashing" scare you off — here it does just two plain things:

Even: it spreads users across all 128 queues instead of piling them onto one.
Stable: a given user always lands on the same queue — so their scan stays serialized (Gmail-safe), while different users land on different queues (parallel, fast).
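
Plain modulo works for us because our user IDs are auto-increment integers. If the IDs were strings instead (UUIDs, say), you would need to turn them into a number first. The sketch below is one way to do that, assuming a 32-bit FNV-1a hash computed over the string's UTF-16 code units; `fnv1a32` and `queueForStringId` are hypothetical helpers, not part of our codebase.

```typescript
const POOL_SIZE = 128;

// 32-bit FNV-1a over the string's UTF-16 code units (identical to the
// byte-wise hash for ASCII input).
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return hash >>> 0; // force unsigned
}

// Same naming scheme as queueForUser, but for string ids (an assumption).
function queueForStringId(userId: string): string {
  const slot = fnv1a32(userId) % POOL_SIZE;
  return `email-import--q${slot}`;
}

console.log(queueForStringId("a")); // email-import--q44
```

The two properties carry over: the same ID always maps to the same queue, and a reasonable hash spreads IDs across the pool.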

The best part: after the one-time provisioning, this design never creates or deletes a queue, so the 7-day trap simply doesn't exist. The queue count is also pinned at 128 (independent of how many users we have, and well under the default Cloud Tasks quota of roughly 1,000 queues per region).
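
The provisioning step can be kept idempotent, so running it on every startup is harmless. A minimal sketch of that idea follows: it only computes which queue IDs are missing and leaves the actual API call out. The helper names are hypothetical, and the `maxConcurrentDispatches: 1` setting is our assumption about how "one at a time" would be expressed; it mirrors the `rateLimits.maxConcurrentDispatches` field on a Cloud Tasks queue.

```typescript
const POOL_SIZE = 128;

// Same naming scheme as queueForUser: email-import--q0 ... email-import--q127.
function poolQueueIds(poolSize: number): string[] {
  const ids: string[] = [];
  for (let slot = 0; slot < poolSize; slot++) {
    ids.push(`email-import--q${slot}`);
  }
  return ids;
}

// Queue IDs that still need creating. Nothing is ever deleted.
function queuesToCreate(existing: string[], poolSize: number): string[] {
  const have: Record<string, boolean> = {};
  for (const id of existing) have[id] = true;
  return poolQueueIds(poolSize).filter((id) => !have[id]);
}

// Assumed per-queue setting so each lane dispatches one task at a time.
const laneSettings = { maxConcurrentDispatches: 1 };

const firstBoot = queuesToCreate([], POOL_SIZE);         // nothing exists yet
const secondBoot = queuesToCreate(firstBoot, POOL_SIZE); // everything exists
console.log(firstBoot.length, secondBoot.length); // 128 0
```

Each ID in the returned list would then be passed to whatever client creates the queue, together with `laneSettings`.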

The cost? Tiny. Occasionally two users hash to the same queue and wait on each other (above, A and C both land on q5). But with only a handful of scans running at the same moment, collisions among 128 lanes are uncommon, and even when they happen it's just a little slower: never wrong, and never a rate-limit problem.
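
For a back-of-the-envelope feel, this is the classic birthday problem. The sketch below estimates the chance that at least two of k simultaneously active users share a lane, assuming active users are spread uniformly at random across the lanes (an idealization; real ID distributions may differ). `collisionProbability` is a name invented for this example.

```typescript
// P(at least two of `activeUsers` share a lane), assuming uniform random lanes.
function collisionProbability(activeUsers: number, lanes: number): number {
  if (activeUsers > lanes) return 1; // pigeonhole: someone must share
  let allDistinct = 1;
  for (let i = 0; i < activeUsers; i++) {
    // The (i + 1)-th user must avoid the i lanes already taken.
    allDistinct *= (lanes - i) / lanes;
  }
  return 1 - allDistinct;
}

console.log(collisionProbability(4, 128).toFixed(3)); // 0.046
```

So with four scans in flight there is roughly a 5% chance that any two share a lane. The figure grows as more scans overlap, which is worth keeping in mind when picking the pool size.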

The takeaway

The real lesson here has little to do with queues or hashing:

When something gets slow and your first instinct is "add machines" — stop, and find the resource that's actually the bottleneck. It's often not on your machines at all, but inside some external service that meters you per user, per call.

Find it, then shape your concurrency around it — not around your compute. Our compute can be infinite. The bottleneck never was ours.

DEV Community