DEV Community: Twio_AI

Can You Vibe-Code Without Understanding the Code? I Tried

Twio_AI — Sat, 11 Jul 2026 09:10:52 +0000

Lessons from 100,000+ lines of LLM-written code at a two-person startup.

twio is an AI workspace for mortgage advisers, built by two engineers. Since our first commit in early 2026, the repo has grown to nearly 2,000 commits and over 100,000 lines of code — almost all of it written by LLMs. Along the way I reversed my position completely: I started out believing "if there's a bug, let the LLM fix it — why bother understanding the code?" I now believe you have to own every architectural decision the LLM makes, from day one.

I believed the promise

Vibe coding sounds irresistible: agents running in loops, code you never read, bugs pasted back for the model to fix. And on small web projects it genuinely works — the app runs, iteration is fast. I also let LLMs build things fully autonomously, with the same result: fine while the project was small, painful once it got complex or crossed a few tens of thousands of lines.

What I had missed is that the term was scoped from birth. When Karpathy coined vibe coding, he added a qualifier: "not too bad for throwaway weekend projects." I had absorbed the swagger and lost the qualifier.

The moment it flipped

What changed my mind wasn't a bug. It was an ordinary design discussion. The LLM and I talked through a feature, and it produced a plan that looked perfectly reasonable. Then it hit me: I had no way to tell whether the plan was sound — and no way to tell whether the code it produced was right.

I wasn't making an engineering decision. I was rolling dice.

You can shrug and trust the model; in a small project the cost of that is invisible. But as the codebase grows, the plans go wrong more and more often, until pretending stops being an option.

Why it breaks down

Three mechanisms, none of them bad luck.

The bottleneck moved; the cost didn't vanish. Vibe coding removes the cost of writing code. The cost of verifying it is merely deferred — with interest. Every piece of code nobody understands is a liability; small projects feel fast because the debt isn't due yet.

Errors compound. Every unreviewed diff adds a little entropy, and an LLM working in a messier codebase produces messier output. That's why the decline isn't linear — past some point it snowballs.

LLMs reviewing LLMs doesn't save you. I tried the review-agent recipes. They miss the problems that matter, because reviewer and author share the same blind spots. And human review has a prerequisite: you can't review what you don't understand.

Peter Naur said all of this forty years ago in Programming as Theory Building: a program's real substance isn't the code but the theory in its builders' heads — the architecture, the trade-offs, the why it's shaped this way. If every line is AI-generated and no human ever builds that theory, the code exists but the theory never did. My "can't judge the plan" moment was me discovering I held no theory.

Different symptoms, one disease

Use these tools long enough and you learn their temperaments. GPT loves sprawling master plans that ship code nobody asked for. Cursor's plans are sometimes pointed the wrong way, sometimes just under-thought. Claude tends to propose the safest plan, not the best one. And nearly all of them share one instinct: no bold refactoring — they hate deleting code, tiptoe around existing structure, and only ever add.

These aren't vendor bug lists. They're symptoms of one disease: there is no referee in the system. When nobody is present to ask "is this plan right? should this code die?", every model slides toward entropy.

Where vibe coding still wins

Prototypes, one-off scripts, spikes, anything you'd happily throw away and rewrite — vibe coding is the optimal strategy there, and I still use it that way. Karpathy's qualifier was right all along. The mistake is extending it to software you intend to maintain.

What we do instead

The principle is simple: the engineer owns understanding and decisions; the LLM writes the code. At twio that has settled into four habits.

Plans are signed off before code is written. Every feature starts as a design discussion; I judge the direction, surface the trade-offs, and make the call. Our working agreement — a CLAUDE.md loaded into every LLM session — requires the model to stop and ask the moment a plan starts getting complicated, instead of silently picking the heavier option.

Rules written against the model's instincts. The agreement's core clauses invert default LLM behavior: simplicity matters more than functionality; refactor, don't just accumulate; delete, don't preserve. Models add by nature, so deletion has to be written policy.

Review to understand, not to nitpick. Reviewing isn't hunting bugs line by line. It's staying fluent in the architecture — knowing what the code actually does — so that when the next plan lands, you can tell which approach genuinely fits the project.

One commit per stage. Big tasks are cut into small, self-contained increments, each type-checked and tested before its own commit. I never review one giant diff — only a chain of small ones I can actually read.

None of this stops the LLM from being wrong. It guarantees something better: when a mistake happens, someone who holds the theory is in the room. The architecture still lives in my head, and I can call every plan right or wrong.

The part that isn't outsourced

What AI makes obsolete isn't the programmer — it's the typing. Turning vague requirements into a precise spec, judging whether a plan is right, owning the outcome: none of that has been outsourced by an inch. With the marginal cost of producing code approaching zero, it matters more than ever.

Building twio taught me this the concrete way: you can outsource the typing, but not the understanding. And understanding has no shortcut — it accumulates from the very first line.

Inside twio's CI/CD: How a Two-Engineer AI Startup Ships

Twio_AI — Sat, 11 Jul 2026 06:59:21 +0000

No per-push CI, exactly one hard gate, and LLM evals that never block a merge — and why.

twio is an AI workspace for mortgage advisers. Its core features — email parsing, loan refixing, application-pack preparation — are all LLM-driven. The engineering team is two people. There is no dedicated QA and no dedicated ops. The stack is a Node/TypeScript monolith with a React frontend, running on GCP Cloud Run, backed by Neon Postgres.

That combination of constraints — a tiny team, an LLM-centric product, fully managed infrastructure — produced a CI/CD pipeline that looks noticeably different from the textbook version: there is no "run CI on every push," the entire pipeline has exactly one mandatory gate, and LLM evals never block anything. This post walks through the whole flow along two tracks — CI (how code gets verified) and CD (how code reaches an environment) — and explains the reasoning behind each design decision.

The big picture

The pipeline consists of seven GitHub Actions workflows, one Cloud Build trigger on the GCP side, and one git hook that has since been deleted:

Part 1: CI — how code gets verified

CI has to answer two questions: what verification exists, and when each kind runs.

1.1 Four layers of verification

The first two layers are unremarkable. The third and fourth are specific to LLM products and worth unpacking.

LLM evals. Each eval is a Vitest test: it assembles the real system prompt for the unit under test (a tool, a sub-agent, or a workflow step), calls a real model, but mocks every tool return — no database, no network beyond the LLM. Cheap deterministic assertions run first, then a Gemini-based LLM-as-judge checks the semantic claims. "Real model" implies two properties: it costs money, and any single run is noisy — the same code can pass today and fail tomorrow. Those two properties dictate where evals can live in CI: on-demand or as monitoring, never as a gate. Our working agreements (the CLAUDE.md at the repo root) encode this as a hard rule: "Gate only on stable, discriminating signals."

End-to-end. The opposite extreme: real database, real model, real /api/chat, driving an entire business flow as a black box — for example, a loan refix from creating the contact all the way to sending the application to the lender. A cold-start case takes minutes. It runs locally, by hand, and never appears in any workflow.
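
To make the eval shape concrete, here is a minimal sketch of "cheap deterministic assertions first, judge second." It is illustrative only, not twio's code: the names runEval, Judge, and EvalResult are hypothetical, and the judge is an injected synchronous stub so the example runs offline, whereas a real judge would be an asynchronous model call.

```typescript
// Hypothetical shapes for illustration; none of these names come from twio's codebase.
type EvalResult = { passed: boolean; stage: "deterministic" | "judge"; reason: string };

// A judge grades an output against a rubric. Injected here so the sketch needs no network.
type Judge = (output: string, rubric: string) => boolean;

type Check = { name: string; check: (output: string) => boolean };

function runEval(output: string, checks: Check[], rubric: string, judge: Judge): EvalResult {
  // Stage 1: cheap, deterministic assertions. Fail fast before paying for a judge call.
  for (const { name, check } of checks) {
    if (!check(output)) {
      return { passed: false, stage: "deterministic", reason: `failed: ${name}` };
    }
  }
  // Stage 2: semantic grading, reached only when every deterministic check passed.
  const accepted = judge(output, rubric);
  return { passed: accepted, stage: "judge", reason: accepted ? "judge accepted" : "judge rejected" };
}

// Usage with a stub judge that counts how often it is called.
let judgeCalls = 0;
const stubJudge: Judge = (output) => {
  judgeCalls += 1;
  return output.includes("fixed");
};
const checks: Check[] = [{ name: "mentions a rate", check: (o) => /\d+(\.\d+)?%/.test(o) }];

const good = runEval("Refix at 5.49% fixed for 2 years", checks, "states the new rate", stubJudge);
const bad = runEval("No numbers here", checks, "states the new rate", stubJudge);
```

The ordering is the point of the sketch: the output that fails a deterministic check never reaches the judge, so the noisy and costly part only runs on candidates that are already structurally plausible.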

1.2 The rise and retirement of the pre-push hook

The original trigger mechanism was a Husky pre-push hook. The entire thing was one line:

# .husky/pre-push
npm run build && npm test

Build plus the full unit suite, enforced locally before every push. As the test suite grew, this model hit its limit: roughly five minutes of waiting per push, on a high-frequency action. Waiting cost multiplied by push frequency stopped being worth it. In late April the hook was deleted and verification moved into CI, on demand. The commit message is refreshingly plain:

Move build-and-test validation from local Husky pre-push into a comment-triggered GitHub Action so contributors can run checks consistently in CI.

Beyond the waiting, consistency was the second motive: a hook only constrains machines that have it installed, local environments drift, and git push --no-verify walks straight around it. A check that runs in CI is identical for everyone — and its result is recorded on the PR.

1.3 On-demand verification, driven by PR comments

The pre-push hook was replaced by three PR comment commands:

The architecture is "one router plus N reusable workflows": a single workflow, pr-comment.yml, listens for issue_comment events, parses the command with a regex, and invokes the matching workflow via workflow_call. Adding a new command means one new regex branch in the router and one new workflow file — the trigger logic never changes, and you never end up with several workflows racing for the same event.

A few deliberate choices here:

On demand, not on every PR. The full unit suite costs five minutes; evals cost real API money; and plenty of PRs never touch LLM-adjacent code. The author decides what to run. Single-file mode (say, /eval:tool build_refix_proposal_from_text) drives the cost of one verification about as low as it can go.
A closed interaction loop. A recognized command immediately gets a 🚀 reaction ("heard you"), and the run posts its result — ✅/❌, commit, command, log link — back onto the PR. A new command on the same PR cancels the previous run (the concurrency group is keyed by PR number).
Fork safety. Evals need LLM API keys, which live in a GitHub Environment called LLM Test. Workflows triggered from fork PRs can't see those secrets, so the router detects forks and replies with a polite refusal instead. /ut needs no keys and runs on forks just fine.
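
The routing decision described above can be sketched in a few lines. This is an illustration under assumptions, not the contents of pr-comment.yml: the command grammar is inferred from the examples in this post (/ut, /eval:tool, /eval:subagent), and the workflow file names are hypothetical.

```typescript
// Assumed command set, inferred from the post: "/ut", "/eval:tool [name]", "/eval:subagent [name]".
type Command =
  | { kind: "ut" }
  | { kind: "eval"; suite: "tool" | "subagent"; target?: string };

type Route =
  | { action: "ignore" }
  | { action: "refuse"; reason: string }
  | { action: "run"; workflow: string; target?: string };

function parseCommand(comment: string): Command | null {
  const line = comment.trim();
  if (/^\/ut$/.test(line)) return { kind: "ut" };
  const m = /^\/eval:(tool|subagent)(?:\s+([\w-]+))?$/.exec(line);
  if (m === null) return null;
  const suite = m[1] as "tool" | "subagent";
  const target = m[2];
  return target ? { kind: "eval", suite, target } : { kind: "eval", suite };
}

function route(comment: string, isFork: boolean): Route {
  const cmd = parseCommand(comment);
  if (cmd === null) return { action: "ignore" }; // ordinary comment, not a command
  // Unit tests need no secrets, so they run on forks too. File name is hypothetical.
  if (cmd.kind === "ut") return { action: "run", workflow: "unit-tests.yml" };
  // Evals need LLM API keys, so fork PRs get a refusal instead of a run.
  if (isFork) return { action: "refuse", reason: "evals need secrets that fork PRs cannot use" };
  const workflow = `eval-${cmd.suite}.yml`; // hypothetical naming scheme
  return cmd.target ? { action: "run", workflow, target: cmd.target } : { action: "run", workflow };
}
```

Adding a command in this shape means one more regex branch in parseCommand and one more case in route, which mirrors the "one router plus N reusable workflows" layout.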

1.4 After the merge: nightly-eval as the safety net

On-demand verification has a structural hole: it relies on discipline, and sooner or later a PR merges without enough eval coverage. The plug for that hole is nightly-eval — the full eval:tool and eval:subagent suites against main, every night:

on:
  schedule:
    - cron: "17 15 * * *"   # ~3 a.m. NZ; deliberately off the top
                            # of the hour, the slot GitHub delays most
  workflow_dispatch: {}      # can also be triggered manually
concurrency:
  group: nightly-eval
  cancel-in-progress: false  # never cancel an in-flight eval run

Its most important property is that it is report-only: it observes, it never blocks. The workflow's header comment gives two reasons:

These evals call real models and grade with an LLM judge, so a single pass/fail is noisy — in the comment's words, "unfit to gate merges on."
It runs after the merge. The code is already on main; failing can't stop anything. Its entire value is trend monitoring.

The alerting channel is self-healing: a failure opens an issue labeled eval-regression; consecutive failures append comments to that same issue rather than opening a new one every day; a green run posts "Recovered" and closes it. At any moment there is at most one open regression issue, which makes the Issues tab a de facto health light for LLM behavior on staging.
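
The issue lifecycle is a two-input state machine, which a short sketch makes explicit. The function and action names below are hypothetical; the real workflow would perform these actions through the GitHub API rather than return strings.

```typescript
type IssueAction = "open_issue" | "append_comment" | "close_with_recovered" | "none";

// Decide what the nightly run should do, given its result and whether
// an issue labeled eval-regression is currently open.
function nextIssueAction(runPassed: boolean, hasOpenIssue: boolean): IssueAction {
  if (!runPassed) return hasOpenIssue ? "append_comment" : "open_issue";
  return hasOpenIssue ? "close_with_recovered" : "none";
}

// Replay a sequence of nightly results (true = green) and list the actions taken.
function replay(results: boolean[]): IssueAction[] {
  let open = false;
  return results.map((passed) => {
    const action = nextIssueAction(passed, open);
    if (action === "open_issue") open = true;
    if (action === "close_with_recovered") open = false;
    return action;
  });
}

const actions = replay([true, false, false, true, true]);
```

Because "open_issue" is only reachable when no issue is open, the at-most-one-open-issue property falls out of the transition table rather than needing a separate check.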

1.5 The only hard gate: before production

The whole pipeline has exactly one mandatory checkpoint, and it lives inside the production deploy workflow (deploy-prod.yml):

jobs:
  test:                        # the only hard gate
    steps:
      # ...checkout main, npm ci...
      - run: npx tsc --noEmit
      - run: npm test
  deploy:
    needs: test                # no green, no deploy

Why only one gate, and why only deterministic checks? A gate's job is to stop things that shouldn't ship, which means it has to stand on stable signals. A gate that goes red at random gets ignored or re-run until it passes — and takes the pipeline's credibility down with it. So type checks and unit tests gate the release; LLM quality is the nightly monitor's job.

Part 2: CD — how code reaches an environment

2.1 The environment model

Two environments, fully symmetrical, in the same GCP project and region:

The branch model: all day-to-day work lands on main through PRs. The prod branch takes no direct commits — it is rebased onto main at each release. Every position of prod therefore corresponds to an actual production release: it is both the deploy source and the release history.

2.2 Staging: every push is a deploy

Code merged into main shows up on dev.twio.ai a few minutes later. GitHub Actions does nothing on this path — a Cloud Build trigger on the GCP side watches the repository, builds from the Dockerfile at the repo root, and deploys to twio-main.

Why we're comfortable with a zero-gate staging: mistakes are cheap (it's an internal trial environment), safety nets exist (nightly evals surface an LLM regression within a day at worst, and the release gate stops deterministic breakage from reaching production), and the fix path is short (the next push is the next deploy). Flip it around: a gate here would tax every single push with a wait, and buy protection the release gate already provides.

Worth stating honestly: after the pre-push hook was deleted, this became the most aggressive trade-off in the whole design. A PR that never ran /ut can, in principle, put code that fails unit tests onto staging — until someone runs the tests, the nightly alarm fires, or the release gate catches it. For a two-person team that risk is acceptable. For a bigger team it wouldn't be.

2.3 Production: a manually triggered pipeline

A release is one click of "Run workflow" in the GitHub Actions tab. Five steps follow:

The reasoning, step by step:

Manual, not automatic. Release cadence is a human decision: batch a few PRs, avoid customers' active hours. A small team doesn't need merge-to-production continuous deployment; it needs "one click when we choose to ship, verification guaranteed every time."
Rebase, not merge. prod stays commit-for-commit identical to main (linear history), and each release leaves an exact snapshot. The push uses --force-with-lease rather than bare --force: if the remote branch moved unexpectedly, the push fails instead of silently overwriting. If prod doesn't exist, it's created from main — a defense added later covering the deleted-or-never-created case.
Keyless auth via WIF. GitHub authenticates to GCP through Workload Identity Federation: no service-account JSON key stored anywhere, GitHub's OIDC token is exchanged for short-lived credentials, and the trust relationship is scoped to this one repository. A key that doesn't exist can't leak.
Pinning APP_BASE_URL. This variable is boot-critical — the container uses it to provision its task queues at startup, and without it the service won't come up (the workflow comment: "its absence crashes boot"). So every deploy re-asserts it via --update-env-vars, protecting against someone editing service config in the console, losing the variable, and getting a mysterious explosion several deploys later.
A health check as the last word. A "successful" Cloud Run deploy only means the new revision started; it doesn't mean the app is healthy. After deploying, the workflow curls the service URL and fails the run on anything outside 200–399.
Uncancelable. The concurrency group sets cancel-in-progress: false — in the comment's words, "never cancel a prod deploy mid-flight." New triggers queue up; a half-finished production deploy never gets killed. (The old staging workflow was configured the opposite way, cancel-in-progress: true: a newer push simply supersedes the deploy in progress, because staging only ever cares about the latest version.)
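
The health-check rule is small enough to state as code. This is a sketch of the rule only: the actual workflow uses curl, and the probe function here is a hypothetical stand-in injected so the example runs without a network.

```typescript
// Only status codes from 200 to 399 count as healthy, as described above.
// A status of 0 (for example, a connection that never completed) is unhealthy.
function isHealthyStatus(status: number): boolean {
  return status >= 200 && status <= 399;
}

// `probe` stands in for "request the service URL and return the HTTP status".
function verifyDeploy(probe: () => number): void {
  const status = probe();
  if (!isHealthyStatus(status)) {
    // Throwing is the sketch's equivalent of failing the workflow run.
    throw new Error(`health check failed with HTTP ${status}`);
  }
}
```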

2.4 Self-contained at runtime: no post-deploy checklist

The container's start command:

CMD ["sh", "-c", "node dist/db/migrate.js && node dist/index.js"]

Database migrations run automatically on every container start. The app then bootstraps all of its infrastructure dependencies before binding the port — creating or reconciling Cloud Tasks queues, Cloud Scheduler jobs, and Pub/Sub topics. If critical configuration is missing, startup fails loudly (the container won't come up and the deploy goes red) instead of limping along.

For a small team, the point of this choice is that deploys have no runbook. There is no "remember to run migrations after deploying," no "remember to create the queue" folklore — environmental dependencies are either handled by the code itself or the deploy fails and tells you immediately. Health check + loud failure + self-provisioning together guarantee that a green deploy means the environment is correct.
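
"Fail loudly" can be sketched as a boot-time assertion. This is an assumption-laden illustration, not twio's startup code: APP_BASE_URL is the only variable this post names as boot-critical, and assertBootConfig is a hypothetical helper.

```typescript
// Hypothetical list. The post names only APP_BASE_URL as boot-critical.
const REQUIRED_VARS = ["APP_BASE_URL"] as const;

// Throws before the server binds its port if any required variable is missing or blank.
// In a Node process this would be called as assertBootConfig(process.env).
function assertBootConfig(env: Record<string, string | undefined>): void {
  const missing = REQUIRED_VARS.filter((name) => (env[name] ?? "").trim() === "");
  if (missing.length > 0) {
    throw new Error(`Missing required configuration: ${missing.join(", ")}`);
  }
}
```

An uncaught error at this point means the container never becomes ready, which is what turns a silent misconfiguration into a red deploy.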

2.5 Cleanup: cost is part of the pipeline

Every gcloud run deploy --source produces a new container image (stored in Artifact Registry, billed by storage) and a new Cloud Run revision. Two environments deploying at high frequency means the bill only goes one way unless something pushes back. The hardening commit recorded the motivation:

Each deploy via gcloud run deploy --source . accumulates Docker images in Artifact Registry, driving up storage costs.

cleanup-images.yml does the housekeeping, hourly (plus on every push, plus manual):

Cloud Run revisions: each service keeps only the newest one plus whatever is serving traffic;
Images: each service keeps the two most recent; anything referenced by a live revision is skipped; the rest are deleted.
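
The image retention rule above can be written as a pure function. This is a sketch of the policy, not the contents of cleanup-images.yml; the Image shape and the function name are hypothetical, and the real job would list and delete images through gcloud.

```typescript
type Image = { digest: string; createdAt: number };

// Returns the digests that are safe to delete for one service:
// everything except the newest `keepNewest` images and anything a live revision references.
function imagesToDelete(images: Image[], liveDigests: Set<string>, keepNewest = 2): string[] {
  const newestFirst = [...images].sort((a, b) => b.createdAt - a.createdAt);
  const keep = new Set(newestFirst.slice(0, keepNewest).map((image) => image.digest));
  return newestFirst
    .filter((image) => !keep.has(image.digest) && !liveDigests.has(image.digest))
    .map((image) => image.digest);
}

// Four images, oldest "a" to newest "d"; a live revision still references "a".
const doomed = imagesToDelete(
  [
    { digest: "a", createdAt: 1 },
    { digest: "b", createdAt: 2 },
    { digest: "c", createdAt: 3 },
    { digest: "d", createdAt: 4 },
  ],
  new Set(["a"]),
);
```

Here "c" and "d" survive as the two most recent, "a" survives because it is live, and only "b" is selected for deletion.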

Its own history tracks usage growth: it began as an inline cleanup step in the deploy workflow, was split into a standalone daily cron the same day, and moved to hourly a month later once deploy frequency picked up.

Summary

Every trigger point in the pipeline:

Each decision is small on its own. Together they reduce to three consistent principles:

Gate only on deterministic signals. The pipeline's single mandatory checkpoint (before production) contains nothing but type checks and unit tests. Making a randomly-red check a gate just teaches everyone to ignore gates.
Noisy signals monitor; they don't block. The value of LLM evals is in the trend: run nightly, open an issue on failure, auto-close on recovery — not in vetoing any particular merge.
Expensive operations run on demand. The time-expensive (five minutes of unit tests) and the money-expensive (real-model evals) are both behind explicit commands, with the author deciding what to run; whatever slips through is caught by the nightly net and the release gate.

Finally, the fine print: this design assumes a team small enough for mutual trust, a staging environment where mistakes are cheap, and a release cadence of two or three ships a week. As the team grows, as compliance requirements appear, or as evals get stable enough to trust, the number and placement of gates should be revisited. This pipeline is itself the product of several rewrites within a single year — there's no reason to believe today's shape is final.

Originally published on Medium.

Inside Twio: Optimizing Queue System for AI Workflows Instead of Scaling Compute

Twio_AI — Tue, 30 Jun 2026 10:01:34 +0000

Whenever a few users ran Twio's email import at the same time, it slowed to a crawl — and adding servers wouldn't have helped. Here's where the real bottleneck was hiding, and the simple fix that sorted it out.

What this system actually does

At Twio, we build software for mortgage brokers. One of our features scans a broker's mailbox and assembles a "loan book" — every client, every loan, every upcoming rate-expiry date.

Behind that one button sits a real pipeline. We pull the broker's mail, make sense of it, and fold it into a clean, deduplicated book of loans:

Broker's Gmail
    |  download emails + attachments
    v
Extract & clean ........ parse PDFs, strip quoted replies
    |
    v
Understand ............. Gemini reads each email & doc
    |
    v
Index for retrieval .... RAG embeddings
    |
    v
Consolidate ............ dedupe, group by client & household
    |
    v
Loan book

The parsing step is where the model earns its keep: Gemini reads each email and each PDF and pulls out the facts — who, which bank, how much, and when the rate expires. Most of this runs on our own machines. But two steps lean on services we don't control: the bulk download from Gmail and the parsing by Gemini. Hold onto that — the whole story grows out of those two.

The symptom: a job that looked stuck for fifteen minutes

One day a broker opens a ticket: "Your scan is broken — it's been stuck for fifteen minutes."

We pull the logs. The job isn't broken. It hasn't even started downloading its first batch. It's sitting in line, behind three other brokers who happened to click "scan" in the same five-minute window.

Our background jobs ran on Google Cloud Tasks, with a deliberately simple setup: one job type, one queue, one task at a time. To avoid hogging that single slot, a scan splits itself into small batches: download a batch, re-enqueue itself at the back of the line, download the next, and repeat until done.

That "one queue, one at a time" is where it breaks. When N brokers scan at once, their batches interleave in the same line:

One shared queue (FIFO):   A1  B1  C1  A2  B2  C2  A3  B3  C3  ...
                           A's 3rd batch waits for the 7th slot -> 3x slower

Every scan slows to 1/N speed. And the frontend has a safety net: if progress doesn't move for ~12 minutes, it declares the scan "stalled" and tells the user to retry. So the brokers stuck at the back of the line — whose jobs were perfectly healthy — hit a "stalled" screen for a problem that didn't exist.
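
A tiny simulation reproduces the interleaving. It models only the queue discipline described above (one shared FIFO, one task at a time, each scan re-enqueueing itself at the back); the function name and the in-memory array are illustrative stand-ins, not our Cloud Tasks setup.

```typescript
type Task = { user: string; batch: number };

// Simulate one shared FIFO queue processed one task at a time.
// After each batch, the scan re-enqueues its next batch at the back of the line.
function simulateSharedQueue(users: string[], batchesPerUser: number): string[] {
  const queue: Task[] = users.map((user) => ({ user, batch: 1 }));
  const order: string[] = [];
  while (queue.length > 0) {
    const task = queue.shift() as Task;
    order.push(`${task.user}${task.batch}`);
    if (task.batch < batchesPerUser) {
      queue.push({ user: task.user, batch: task.batch + 1 });
    }
  }
  return order;
}

const shared = simulateSharedQueue(["A", "B", "C"], 3);
const alone = simulateSharedQueue(["A"], 3);
// 1-based slot in which A's third batch runs, with and without neighbours.
const slotShared = shared.indexOf("A3") + 1;
const slotAlone = alone.indexOf("A3") + 1;
```

With three brokers, A's third batch lands in slot 7; alone, it lands in slot 3.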

The reflex that doesn't work

When something is slow, the first instinct is always the same: throw machines at it. We run on Cloud Run, which autoscales to — for all practical purposes — infinite compute. Still not fast enough? Just let more tasks run at once.

And normally, that instinct is right. Picture a supermarket checkout. If cashier A rings people up five times faster than cashier B, the line behind A clears five times faster. Faster cashier, shorter line — that's almost the definition of the job. So more lanes and quicker workers should mean shorter queues. Obviously.

Except it would have done nothing for us. Because in our store, the bottleneck was never the cashier.

It was the customer.

Picture the person at the front of the line moving at the speed of Flash — the sloth from Zootopia. You can sit a Formula 1 driver at the register; he'll just sit there, tapping the conveyor belt, while Flash s-l-o-w-l-y reaches for his wallet and s-l-o-w-l-y counts out the change. The moment the customer is the slow part, the cashier's speed stops mattering at all.

That sloth is Gmail.

The real bottleneck lives somewhere else

So why is Gmail our sloth? A per-user rate limit.

Gmail caps how fast you can call its API for any single user. For one broker's mailbox you get so many calls a second and not one more; push past it and Gmail hands back a flat 429. It doesn't matter how fast our worker runs — Gmail feeds us that one mailbox at sloth speed regardless. More machines don't make the broker move faster; they just park more Formula 1 drivers behind the same slow customer.

(The far end of the pipeline pulls the same trick, by the way. Gemini — the model reading the emails — meters us too, and we tame that one with a different valve that deserves its own post. I only bring it up because the shape keeps repeating: both ends of this pipeline run on a budget we don't get to set.)

And here's what stung: the parallelism we needed was sitting right there, unused. Every broker's Gmail budget is independent — broker A draining A's quota does nothing to B's. They could have run fully in parallel from the start. Our one shared queue was quietly throwing that away.

The diagnosis was almost embarrassing in hindsight: the limit is per user, but our queue was global. We had built a single chokepoint in front of a crowd of independent, parallelizable customers.

The fix: one queue per user

Replace "one global queue" with "one queue per user, processed in parallel," and you get two things at once:

Speed: different users run in their own queues at the same time, so throughput climbs with the number of active users instead of collapsing to 1/N.
Safety: a single user's batches still run in one queue, one at a time, which naturally keeps them inside that user's Gmail budget, with no self-inflicted 429.

One property — a private lane per user, many lanes in parallel — cures both the slowness and the rate-limiting.
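
To see both effects at once, here is a sketch that measures when each scan finishes under different lane assignments. It is a simplified model, not our production code: every batch takes one tick, lanes run fully in parallel, and finishTicks and laneOf are hypothetical names.

```typescript
type Scan = { userId: number; batches: number };

// For each user, returns the tick at which their last batch completes.
// Each lane handles one batch per tick, round-robin over the scans assigned to it
// (equivalent to re-enqueueing at the back); lanes do not wait on each other.
function finishTicks(scans: Scan[], laneOf: (userId: number) => number): Map<number, number> {
  const lanes = new Map<number, Scan[]>();
  for (const scan of scans) {
    const lane = laneOf(scan.userId);
    lanes.set(lane, [...(lanes.get(lane) ?? []), { ...scan }]);
  }
  const finished = new Map<number, number>();
  lanes.forEach((laneScans) => {
    let tick = 0;
    let pending = laneScans.filter((scan) => scan.batches > 0);
    while (pending.length > 0) {
      for (const scan of pending) {
        tick += 1;
        scan.batches -= 1;
        if (scan.batches === 0) finished.set(scan.userId, tick);
      }
      pending = pending.filter((scan) => scan.batches > 0);
    }
  });
  return finished;
}

const scans: Scan[] = [
  { userId: 1, batches: 3 },
  { userId: 2, batches: 3 },
  { userId: 3, batches: 3 },
];
const globalQueue = finishTicks(scans, () => 0);            // everyone in one lane
const perUserQueue = finishTicks(scans, (userId) => userId); // one lane per user
```

In the single-lane case the three scans finish at ticks 7, 8, and 9; with a lane per user, all three finish at tick 3, while each user's own batches still run strictly one after another.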

It sounds simple. But the most obvious way to build it is the one thing you must not do.

The trap: the 7-day rule

"One queue per user" leads straight to: create a queue named after the user on demand, delete it when it goes idle. Users grow without bound, so cleaning up feels mandatory.

And that walks right into a quiet rule in Cloud Tasks:

Once you delete a queue, its name can't be reused for about 7 days.

Which means: a broker scans today, the queue gets deleted — and if they come back to scan within 7 days, they hit a name that's still "on hold," and the job fails to enqueue. The users who return within a week are exactly the active ones you most want to serve.

Put bluntly: "one queue per user" is the right idea; "a queue you create and delete per user" is the wrong mechanism.

The solution: a fixed pool, assigned by hashing

So we flipped it around. Instead of building a queue per user, we provision a fixed pool of queues once, at startup — we run 128 — and from then on we only use them, never delete them. A simple hash then assigns each user to one, evenly and permanently:

const POOL_SIZE = 128

// Every user maps to the same queue, every time.
function queueForUser(userId: number): string {
  const slot = userId % POOL_SIZE   // userId is an auto-increment int -> modulo spreads evenly
  return `email-import--q${slot}`
}

hash(userId) % 128  ->  one fixed queue, always the same one

   Broker A  ->  q5     (A and C happen to collide
   Broker C  ->  q5      on the same lane)
   Broker B  ->  q42

   Different lanes run in parallel; the same lane runs one at a time.

Don't let "hashing" scare you off — here it does just two plain things:

Even: it spreads users across all 128 queues instead of piling them onto one.
Stable: a given user always lands on the same queue — so their scan stays serialized (Gmail-safe), while different users land on different queues (parallel, fast).

The best part: this design never creates or deletes a queue, so the 7-day trap simply doesn't exist. The queue count is also pinned at 128 (independent of how many users we have, and well under Cloud Tasks' cap of roughly 1,000 queues per region).

The cost? Tiny. Occasionally two users hash to the same queue and wait on each other (above, A and C both land on q5). But collisions among 128 lanes are rare, and even when they happen it's just a little slower — never wrong, and never a rate-limit problem.
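
How rare is "rare"? A back-of-the-envelope estimate is the birthday-problem formula, sketched below. It assumes the users scanning at the same moment land on lanes uniformly at random, which is only an approximation: with sequential ids and a plain modulo, real collisions depend on which ids happen to be active together.

```typescript
// Probability that at least two of `activeUsers` concurrent scans share a lane,
// assuming each lands on one of `lanes` queues uniformly and independently.
function collisionProbability(activeUsers: number, lanes: number): number {
  if (activeUsers > lanes) return 1; // pigeonhole: more users than lanes must collide
  let pNoCollision = 1;
  for (let i = 0; i < activeUsers; i++) {
    pNoCollision *= (lanes - i) / lanes;
  }
  return 1 - pNoCollision;
}

const threeAtOnce = collisionProbability(3, 128);
```

Under that assumption, three simultaneous scans collide about 2.3% of the time. The probability climbs as more brokers scan at once, but a collision only costs the two colliding users some waiting; it never pushes anyone past their Gmail budget.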

The takeaway

The real lesson here has little to do with queues or hashing:

When something gets slow and your first instinct is "add machines" — stop, and find the resource that's actually the bottleneck. It's often not on your machines at all, but inside some external service that meters you per user, per call.

Find it, then shape your concurrency around it — not around your compute. Our compute can be infinite. The bottleneck never was ours.

From Monolith Prompt to Event-Driven Agent — twio's Architecture Story

Twio_AI — Mon, 22 Jun 2026 09:43:42 +0000

TL;DR — Our goal was a free-form agent—like Cursor or Claude Code—where users start anywhere, ask anything, and never march through a fixed pipeline. Getting there meant progressively moving responsibility off the prompt and onto the harness: first sequencing and state, then lifecycle and triage. The conversational freedom at the top was only safe because we kept adding structural guardrails underneath.

We wanted twio to feel like Cursor, but for mortgage brokers: a free-form agent you simply talk to. Jump into the middle of a refix, ask "when does their rate expire?", paste a customer's email, and let the agent carry on. No fixed pipelines, no "step 1 of 4"—just work the way it actually arrives.

Our first version was the exact opposite of this. It could complete a refix end-to-end, but only if the broker followed the script's exact sequence. Start in the middle, ask a side question, or reply three days later, and the system would break.

This is the story of how we chased that conversational freedom across three architectures. On the surface, each rebuild looked like a structural change. Underneath, it was a steady offloading of responsibilities—stripping away what the LLM was bad at holding, and handing it to the harness.

(To understand the evolution, just know that a refix, renewing a mortgage rate, sounds like a 4-step linear process, but in reality it's fragmented, non-linear, and spans days. Keep that in mind.)

Architecture 1: The Monolith Script (and the Illusion of Control)

When we started, we had no prior experience with LLM harnesses. Freedom was the goal, but first we had to answer a simpler question: Can an LLM handle a real mortgage workflow end-to-end at all?

Our first architecture was one giant prompt driving the entire refix. It looked roughly like this:

You are a mortgage assistant. To do a refix:
1. Look up the customer by name; if several match, ask which one...
2. Ask for the new rate and term — unless the customer emailed...
3. Build the proposal. Repayment = P&I, unless interest-only...
4. Fill the lender form. ANZ uses fields X/Y; ASB uses...
5. Draft the sign-off email. Warm but professional...
...and ~300 more lines of rules, edge cases, and tone notes.

It was simple, predictable, and legible. Then it met production reality, and fractured under two fatal flaws:

Cognitive Overload: The longer the workflow, the more the prompt ballooned with edge cases, scattering the model's attention. It assumed a fixed sequence, but brokers don't work that way.
Tangled Coupling: One prompt carrying fetch, validate, draft, and edge-case logic became an entangled wall of text. You couldn't test "build the proposal" without running the whole machine.

The recurring disruption that exposed these flaws: "The customer replies three days later." In the monolith, there was simply no place for this. The run was over, or it was blocked waiting for a turn that wasn't coming.

The Diagnosis: A monolithic prompt fuses three things that should never mix.

Fused together, changing one risks breaking the others.

The mandate for our first refactor was clear: Pull these three apart, and stop hardcoding the sequence.

Architecture 2: Planner + Steps (Decoupling Flow and State)

We shattered the monolith into Steps. A step is a small, single-purpose agent with a typed contract: a focused prompt, a whitelist of tools, and strictly defined I/O schemas for state.

A Planner produced an ordered sequence, and an orchestrator executed them. This cleanly separated the monolith's responsibilities:

The planner orders typed steps; each runs its own small, scoped LLM prompt.

The Breakthrough: Context as Memory

The most critical insight of this era was how we handled state. Steps didn't get the whole conversation history; they got a scoped view.

refix_proposal reads parties, writes proposal, and never sees the rest. This isn't just access control; it's memory management. A model reasons far more accurately over a small, relevant context than a massive one. Scoping the view didn't just tidy the code—it made the steps sharper.
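A minimal sketch of what a scoped view could look like, assuming case state is a dictionary of named slices. The `Step` class, its field names, `run_step`, and the `rate_lookup` tool are illustrative assumptions, not twio's actual code; only `refix_proposal`, `parties`, and `proposal` come from the description above.

```python
from dataclasses import dataclass
from typing import Any, Callable, Mapping

State = dict[str, Any]


@dataclass(frozen=True)
class Step:
    """Hypothetical step contract: focused prompt, tool whitelist, declared I/O."""
    name: str
    prompt: str
    tools: tuple[str, ...]
    reads: tuple[str, ...]   # state slices the step may see
    writes: tuple[str, ...]  # state slices the step may produce


def scoped_view(step: Step, state: Mapping[str, Any]) -> State:
    """Return only the slices the step declared; everything else stays hidden."""
    missing = [key for key in step.reads if key not in state]
    if missing:
        raise KeyError(f"{step.name} is missing required slices: {missing}")
    return {key: state[key] for key in step.reads}


def run_step(step: Step, state: State, agent: Callable[[str, State], State]) -> State:
    """Run the agent on the scoped view and merge back only declared writes."""
    output = agent(step.prompt, scoped_view(step, state))
    undeclared = set(output) - set(step.writes)
    if undeclared:
        raise ValueError(f"{step.name} wrote undeclared slices: {sorted(undeclared)}")
    return {**state, **output}


refix_proposal = Step(
    name="refix_proposal",
    prompt="Build the refix proposal for these parties.",
    tools=("rate_lookup",),  # hypothetical tool name
    reads=("parties",),
    writes=("proposal",),
)


def fake_agent(prompt: str, view: State) -> State:
    """Stand-in for the LLM call so the sketch runs without a model."""
    return {"proposal": {"for": view["parties"], "seen": sorted(view)}}


state = {"parties": ["A. Borrower"], "emails": ["..."], "documents": ["payslip.pdf"]}
new_state = run_step(refix_proposal, state, fake_agent)
```

In this sketch the agent receives `parties` and nothing else, even though the case also holds emails and documents. The write check is the mirror image: a step cannot quietly change a slice it never declared.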

Architecture 2 was a massive leap forward. We shipped a lot of features on it. But it carried a hidden, fatal assumption: It assumed work is synchronous and continuous.

To start work, the broker had to land on a homepage and select the workflow type from a dropdown (e.g., "Refix"). Choosing "Refix" launched the fixed plan. That requires the broker to know the shape of the work before anything runs. (You don't pick "refactor" from a menu before Cursor listens; you just start typing.)

Furthermore, the plan assumed run-to-completion. So, our recurring disruption returned: The customer replies three days later. By then, the pipeline had finished or stalled. We tried bolting on a separate "inbox" to triage incoming items, but it felt like stapling an async patch onto a synchronous system. We deleted it.

The Diagnosis: A refix is not a workflow you execute. It's a long-lived case, fed by disconnected events over days.

The mandate for the next refactor: Stop making the broker declare the work up front. Stop pretending work runs to completion.

Architecture 3: The Open Case (Event-Driven Paradigm)

We collapsed the dropdown menu of workflow types into one universal Case. Every inbound event—email, chat message, late reply—flows into the same entry point. The first thing that runs is the planner, but it now has a completely different job: Triage.

Instead of sequencing a known workflow, the planner looks at a single event and picks a path:

Triage routes each event — answer directly, wake a case, or compose steps. A late reply just re-enters triage.
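The three routes can be sketched as a small tagged union behind a single entry point. This is an illustration under assumptions: the class names, `handle_event`, and the keyword-based `fake_triage` are hypothetical stand-ins, and in the real system the routing decision is the planner's judgment rather than string matching.

```python
from dataclasses import dataclass, field
from typing import Callable, Union


@dataclass
class Answer:
    text: str  # respond directly and stop


@dataclass
class WakeCase:
    case_id: str  # attach the event to an existing, quiet case


@dataclass
class ComposeSteps:
    steps: list[str]  # run a freshly composed step sequence


Decision = Union[Answer, WakeCase, ComposeSteps]


@dataclass
class Case:
    case_id: str
    events: list[str] = field(default_factory=list)
    planned_steps: list[str] = field(default_factory=list)


def handle_event(event: str, cases: dict[str, Case],
                 triage: Callable[[str], Decision]) -> str:
    """Single entry point: every inbound event goes through triage first."""
    decision = triage(event)
    if isinstance(decision, Answer):
        return decision.text
    if isinstance(decision, WakeCase):
        cases[decision.case_id].events.append(event)
        return f"woke {decision.case_id}"
    case = Case(case_id=f"case-{len(cases) + 1}", events=[event],
                planned_steps=list(decision.steps))
    cases[case.case_id] = case
    return f"opened {case.case_id}"


def fake_triage(event: str) -> Decision:
    """Keyword rules standing in for the LLM planner (assumption)."""
    if event.endswith("?"):
        return Answer("(direct answer)")
    if event.startswith("reply:"):
        return WakeCase("case-1")
    return ComposeSteps(["parties", "refix_proposal"])


cases: dict[str, Case] = {}
r1 = handle_event("Please refix the Smith loan", cases, fake_triage)
r2 = handle_event("when does their rate expire?", cases, fake_triage)
r3 = handle_event("reply: yes, go with 2 years", cases, fake_triage)
```

Nothing runs between `r1` and `r3`. The case is just a stored record, so a reply three days later is one more call to `handle_event`.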

Three mechanisms make this architecture work:

1. Long-Lived Cases with Implicit Waiting
After acting, a case goes quiet. In our system, the absence of new messages is the waiting state—there is no dangling pipeline. What broke Architecture 1 and stalled Architecture 2 is now trivial.

2. First-Class "Do Nothing" Path
When a broker asks "when does their rate expire?", the triage planner answers it directly and stops. No four-step plan is generated. Often, no plan is the right plan.

3. Declarative Playbooks
Domain knowledge moved out of code and into Markdown playbooks. This allows mortgage experts to edit the "how" without touching the engine code.

Guarding the Freedom: Data-Flow Validation

Handing a language model the freedom to compose its own step sequences is dangerous. We put a hard floor under it: a data-flow validator that rejects an incoherent plan before a single token of work is spent.

If the planner emits refix_proposal before parties, the validator hands back a precise error, and the planner self-corrects.

The validator rests on a distinction we call Hard Reads vs. Soft Reads:

Hard Read (Correctness): An ordering constraint. form_filler must wait for parties. If it's missing, the validator blocks it.
Soft Read (Composability): Visibility-only. form_filler can see refix_proposal if it exists, but won't force it into the plan. This lets the step be dropped into a one-off request without dragging the whole refix sequence along.

A single reads declaration both scopes the step's memory and acts as an edge in the dependency graph. One declaration, two jobs.
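To make the hard/soft distinction concrete, here is a rough sketch of such a validator. It assumes a plan is an ordered list of step specs and that slices are named as in the earlier example (`refix_proposal` writes `proposal`). `StepSpec`, `validate_plan`, `visible_slices`, and the `form` slice are hypothetical names, and the real validator may work differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepSpec:
    name: str
    reads: frozenset[str] = frozenset()       # hard reads: must exist before the step runs
    soft_reads: frozenset[str] = frozenset()  # soft reads: visible if present, never required
    writes: frozenset[str] = frozenset()


def validate_plan(plan: list[StepSpec], available: set[str]) -> list[str]:
    """Walk the plan in order and report each hard read nothing has produced yet."""
    errors: list[str] = []
    have = set(available)
    for position, step in enumerate(plan, start=1):
        for slice_name in sorted(step.reads - have):
            errors.append(
                f"step {position} ({step.name}) reads '{slice_name}', "
                "but no earlier step writes it"
            )
        have |= step.writes
    return errors


def visible_slices(step: StepSpec, have: set[str]) -> set[str]:
    """The step's memory scope: hard reads plus whichever soft reads exist."""
    return set(step.reads) | (set(step.soft_reads) & have)


parties = StepSpec("parties", writes=frozenset({"parties"}))
refix_proposal = StepSpec("refix_proposal", reads=frozenset({"parties"}),
                          writes=frozenset({"proposal"}))
form_filler = StepSpec("form_filler", reads=frozenset({"parties"}),
                       soft_reads=frozenset({"proposal"}),
                       writes=frozenset({"form"}))

bad_plan = validate_plan([refix_proposal, parties], set())
full_plan = validate_plan([parties, refix_proposal, form_filler], set())
one_off = validate_plan([parties, form_filler], set())  # no proposal step needed
```

`bad_plan` holds one error naming the step and the missing slice, which is the kind of message a planner can act on. `one_off` validates cleanly because the soft read on `proposal` only affects what `form_filler` can see, not what must run before it.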

This is where twio finally felt like the free-form agent we set out to build. It’s the synthesis of our journey: Freedom at the top, structure at the bottom, knowledge in prose.

Why Not Just Use LangGraph?

We evaluated LangGraph and similar agent frameworks before building our own. They are excellent tools, but designed for a fundamentally different paradigm: the static graph. Our workflow is a dynamic, event-driven beast.

Forcing our architecture into a graph exposed three core mismatches:

1. The engine isn't the graph; it's the contracts.
Frameworks give you nodes, edges, and shared state. But twio's real value isn't topology—it's the typed step contracts (reads/softReads), the triage planner's judgment, and the prose playbooks. We'd still have to write all of that, just wrapped in someone else's StateGraph syntax. We'd take on a heavy dependency while our actual core logic (like the 40-line data-flow validator) becomes harder to tweak.

2. Pre-compiled graphs vs. runtime composition.
A graph requires defining paths ahead of time. Our planner composes a fresh step sequence at runtime for every single event, checked by a validator before execution. You can force a static framework to do dynamic routing, but then the graph stops being the source of truth. You end up fighting the framework's core abstraction.

3. Shared state vs. strict isolation.
Our central thesis is that a step should only see its declared context slices. Graph frameworks default to passing a massive shared state object through every node. We’d be constantly fighting the framework's default behavior to enforce the one rule we care about most.

If our workflows were mostly static and predictable, LangGraph would save us real work. Ours aren't.

But the deeper point is this: our event-driven case design entirely sidesteps the need for durable execution. Frameworks earn their keep by handling suspend/resume, timers, and exactly-once side effects. Because "no new messages" is our waiting state, the system naturally suspends without needing underlying infrastructure to maintain it. If we ever actually need durable execution, we’ll reach for a dedicated engine like Temporal—not an agent framework.

What's Still Hard

Top-level freedom introduces new failure modes. The planner can match the wrong playbook. It can compose a valid but suboptimal plan. Prose playbooks can drift from what the steps actually execute. Our guardrails catch most of these failures, but this is a live frontier, not a solved problem.

The through-line of this whole journey is that you don't arrive at the architecture—you get pushed into it by reality. Ultimately, the customer who replies three days late wrote more of our architecture than we did.

Architectural Takeaways

If you are building long-lived, agentic systems:

Don't make the model hold what the harness holds better. Offload sequencing, durable state, and lifecycle management to traditional code.
Treat context as memory. Strictly scope each step's view. Models reason better over small, relevant contexts.
Model the arriving unit as an event, not a task. Work is fragmented; your architecture should expect interruption.
Build a first-class "just respond" path. Forcing a workflow when a direct answer suffices destroys the user experience.
Earn freedom with guardrails. Use static validation (like our data-flow checker) to make self-composing agents safe.

We're building twio, an AI assistant for mortgage brokers. If you're wrestling with the same questions about long-lived, event-driven agent architectures, we'd love to compare notes.

Why Twio Chose Vertex AI Search over pgvector for Production RAG

Twio_AI — Mon, 22 Jun 2026 00:00:58 +0000

When we first built RAG at Twio, pgvector was the obvious pick. Our business data was already in PostgreSQL, and dropping embeddings into the same database was the fastest path to a working product.

For the first version, that was right. As we scaled, the problem stopped being "how do we store vectors?" and became "how do we reliably understand thousands of broker documents, emails, and attachments in production?" That changed the answer. Today, Vertex AI Search is our main retrieval layer.

RAG is Twio's memory layer, not a search feature

Twio is an AI SaaS for loan brokers. A single client case is a mess of fragmented information:

email threads
payslips, bank statements, identity documents
loan forms, lender requirements
handwritten notes, follow-up emails, missing-document requests

The AI needs to answer questions like:

What documents has this client already sent?
Which email mentioned the missing requirement?
Does this bank statement support the income claim?
Summarize all documents related to this borrower.

If retrieval is weak, the answer is weak. If indexing lags, context is missing. If parsing is wrong, the model sees the wrong evidence. RAG isn't a feature on the side — it's the memory layer of the product.

Why pgvector was the right first choice

Twio is a multi-tenant SaaS, so retrieval can't just return "similar content" — it has to return similar content scoped to the right user, client, application, or file. pgvector made that trivial: embeddings sat next to the business records, joined cleanly, and filtered with plain SQL.
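As an illustration of "filtered with plain SQL", the sketch below builds a tenant-scoped similarity query. The table and column names (`document_chunks`, `tenant_id`, `client_id`, `embedding`, `content`) are assumptions, not Twio's schema. `<=>` is pgvector's cosine-distance operator, and the placeholders use the named style accepted by psycopg.

```python
def build_scoped_search(tenant_id: str, client_id: str,
                        query_embedding: list[float], limit: int = 5):
    """Return (sql, params) for a similarity search limited to one tenant and client."""
    sql = (
        "SELECT id, content, embedding <=> %(query)s::vector AS distance\n"
        "FROM document_chunks\n"                      # hypothetical table name
        "WHERE tenant_id = %(tenant_id)s\n"           # scoping happens in SQL,
        "  AND client_id = %(client_id)s\n"           # not in application code
        "ORDER BY embedding <=> %(query)s::vector\n"
        "LIMIT %(limit)s"
    )
    params = {
        # pgvector accepts a vector as a '[x,y,z]' text literal
        "query": "[" + ",".join(str(float(x)) for x in query_embedding) + "]",
        "tenant_id": tenant_id,
        "client_id": client_id,
        "limit": limit,
    }
    return sql, params


sql, params = build_scoped_search("tenant-1", "client-9", [0.1, 0.2])
```

The tenant and client filters sit in the same statement as the distance ordering, so there is no separate post-filtering step that could leak rows across tenants.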

The early wins were real:

no new infrastructure
low cost, easy local dev
SQL inspection for debugging
straightforward metadata filtering
fast to ship

It let us build the first version quickly and learn from actual usage. That matters more than people give it credit for.

Where pgvector stopped paying off

pgvector didn't fail. It did exactly what it's designed for. The issue was that vector storage is only one slice of the RAG pipeline, and pgvector left every other slice to us:

download attachments
extract text from PDFs, run OCR on scans
chunk documents, generate embeddings
design metadata, build retrieval queries
tune indexes, improve ranking
monitor Postgres load, debug retrieval quality

A clean PDF is easy. A scanned bank statement isn't. An email body is easy. An email with five attachments, lender forms, tables, and partial OCR isn't. A demo dataset is easy. A real broker workspace with years of historical emails isn't.

With pgvector, every weakness in that pipeline was ours to fix. When retrieval quality dropped, the suspect list ran all the way from OCR through chunking and embedding to vector distance, SQL filtering, ranking, and DB performance. The extension is simple. The production RAG system around it isn't.

The cost shifted from cloud bill to engineering time — and engineering time was the constrained resource.

pgvector vs Vertex AI Search, in Twio's terms

Scenario	pgvector	Vertex AI Search
Clean text PDF	We own extraction, chunking, embedding, storage, search	Vertex handles most of the indexing and retrieval workflow
Scanned document	We build or integrate OCR ourselves	Vertex absorbs much of the document-processing logic
Broker asks a document question	We own query design, ranking, filtering	Managed search with stronger out-of-the-box quality
Attachment bursts	Postgres carries more search and indexing load	Search workload lives outside the main database
Debugging	Excellent SQL visibility, but many custom layers to inspect	Less low-level control, but far less custom infra to debug
Cost	Lower direct service cost	Higher service cost, lower engineering and maintenance cost
Production readiness	Significant custom work required	Easier to operate as a managed layer

pgvector was cheaper as a database extension. Vertex is cheaper as a product decision. The cloud bill is one input; engineering time, reliability, and iteration speed are the bigger ones at our stage.

Why Vertex fits Twio's shape of problem

Twio's RAG problem is document-heavy. We aren't searching short snippets — we're dealing with messy broker PDFs, scans, forms, tables, and forwarded attachments. Vertex helps in four concrete ways:

Less infrastructure to own. Indexing and retrieval are handled by the managed layer, so we don't rebuild that surface.
Less document-processing logic to maintain. OCR and parsing for messy broker files is one of the harder parts of the pipeline to keep healthy. Vertex covers much of it.
Postgres stays focused on what it's good at — business data, transactions, workflow state — instead of competing with OLTP work for the same resources.
It scales more naturally as document volume grows.

Vertex isn't free, but the alternative isn't either. Building OCR, indexing, ranking, monitoring, and tuning ourselves has its own bill — paid in engineer-weeks.

What pgvector still does well

pgvector is still a strong choice when:

data volume is moderate
you're already on Postgres and want retrieval close to your data
your documents are already clean text
you need tight SQL filtering and full control
you want a fast, low-cost first version

For us, it was the right first implementation — and it taught us what retrieval the product actually needed. It may stay in the stack for internal or fallback use cases.

Takeaway

The lesson from Twio's RAG evolution is simple:

Start with the tool that helps you learn fastest. Move to the tool that helps you operate best.

pgvector got us to a working RAG system quickly. As the product matured, the real challenge shifted to document processing, indexing quality, and operational reliability — and at that point, Vertex AI Search became the better fit. It costs more as a service and less as a system to maintain. For a SaaS at Twio's stage, that's the trade that matters.

From pg-boss to Cloud Tasks: Fixing Queue Bursts and DB Connection Failures on Serverless

Twio_AI — Tue, 02 Jun 2026 03:23:40 +0000

At Twio we picked pg-boss for our job queue, ran into trouble when we went serverless, looked at Pub/Sub, and ended up on Google Cloud Tasks. This is what each queue got right, what it got wrong for our workload, and the rule we landed on for choosing between them.

The workload

Twio is an AI SaaS for loan brokers. The piece that needs a job queue is email processing: download an email, parse the body and attachments, run OCR, classify with an LLM, write structured data, and index for RAG. One email with five attachments easily becomes 30+ background jobs. A batch upload becomes hundreds.

Why pg-boss worked — until it didn't

Our database was Postgres on Neon, so pg-boss was the obvious starting point. No extra infrastructure, and one feature we genuinely loved: transactional enqueue. Because jobs live in the same database as business data, you can create a job in the same transaction as the row that triggered it. No dual-write problem, no "DB succeeded but the queue API failed" inconsistency.

It also gave us retries, delayed jobs, dead-letter queues, dedup keys, and full SQL visibility into stuck or failed jobs. For a Postgres-first app on always-on infra, it's an excellent tool.

Then we moved heavy processing to Cloud Run, and the cracks showed up.

pg-boss polls. Neon suspends. They want opposite things.

pg-boss runs a query roughly every 1–2 seconds to look for the next job, plus maintenance queries. Neon autosuspends compute when nothing touches the database. If the queue is polling every second, Neon's idle timer never expires — you pay for always-on compute even when the queue is empty.

Worse, when Neon did manage to suspend, the next poll had to wake it. That wake-up takes hundreds of ms to a few seconds, and queries that triggered it would fail with Connection terminated, ECONNRESET, or timeouts. Pooled connections made it worse: the pool kept sockets that the server had already closed during suspend, and the next polling cycle picked one up and broke.
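The idle-timer conflict is simple arithmetic. The sketch below ignores pg-boss's maintenance queries and assumes a five-minute idle timeout. That figure is an assumption for illustration, so check the value configured on your own Neon project.

```python
SECONDS_PER_DAY = 86_400
ASSUMED_IDLE_TIMEOUT_S = 300  # assumption: five-minute autosuspend timeout


def polls_per_day(poll_interval_s: float) -> int:
    """Queries issued per day by one idle worker, counting polls only."""
    return int(SECONDS_PER_DAY / poll_interval_s)


def can_autosuspend(poll_interval_s: float, idle_timeout_s: float) -> bool:
    """Compute can only suspend if the gap between queries exceeds the idle timeout."""
    return poll_interval_s > idle_timeout_s


empty_queue_polls = polls_per_day(2)                           # 43200
ever_suspends = can_autosuspend(2, ASSUMED_IDLE_TIMEOUT_S)     # False
```

An empty queue polled every two seconds still issues 43,200 queries a day. While any worker is polling, the gap between queries never approaches the idle timeout.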

This isn't a pg-boss bug. It's an architectural mismatch.

Why Pub/Sub wasn't the answer

Pub/Sub is event-driven — no polling against Postgres, Neon can suspend freely. That fixed the obvious problem, but introduced a worse one for our shape of work.

Pub/Sub is built to move messages fast. We needed a queue that moves messages carefully.

Two specific failure modes hit us:

Retry amplification. A parent import job publishes 100 child parse messages, then crashes before acking. Pub/Sub redelivers the parent. The parent re-publishes 100 children. After a few retries, you have hundreds of duplicate child jobs.
No native job-level pacing. If 300 messages land at once, subscribers consume them as fast as they can — slamming our parser, Neon, the LLM provider, and third-party APIs simultaneously. Pub/Sub has flow control on the subscriber side, but it's not the kind of per-queue dispatch throttle we needed.

There was also the ack-deadline problem on long parse jobs: a missed lease extension causes redelivery while the original is still running.

All of these are solvable with idempotency keys, outboxes, and bounded retries — but at that point you're rebuilding what a job queue should give you out of the box.

Why Cloud Tasks fit

Cloud Tasks is push-based: when a task is due, Google sends an HTTP request to our handler. When there are no tasks, nothing touches our database. That alone resolved the pg-boss/Neon conflict — Neon suspends, costs drop, no more wake-up connection errors.

But the real reason it fit was per-queue dispatch control:

# email-parse queue settings (field names as in the Cloud Tasks Queue resource)
- name: email-parse
  rateLimits:
    maxDispatchesPerSecond: 10
    maxConcurrentDispatches: 20
  retryConfig:
    maxAttempts: 5
    minBackoff: 10s
    maxBackoff: 600s
    maxDoublings: 4

Enqueue 300 tasks in a second and Cloud Tasks won't deliver them all at once — it paces dispatch to the limits we set. Our parsers, Neon, and the LLM provider stay protected from bursts.
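A back-of-the-envelope sketch of what those settings imply. The helper names are hypothetical. The drain estimate ignores the small initial burst the rate limiter allows, and the backoff schedule follows the documented RetryConfig behaviour (double `maxDoublings` times, then grow linearly, capped at `maxBackoff`), so treat the numbers as nominal rather than exact.

```python
def effective_rate(max_dispatches_per_second: float, max_concurrent: int,
                   handler_seconds: float) -> float:
    """Sustained dispatch rate: the rate limit, or less if concurrency is the bottleneck."""
    return min(max_dispatches_per_second, max_concurrent / handler_seconds)


def min_drain_seconds(tasks: int, rate_per_second: float) -> float:
    """Approximate time to dispatch a burst at a steady rate."""
    return tasks / rate_per_second


def retry_delays(max_attempts: int, min_backoff_s: float,
                 max_backoff_s: float, max_doublings: int) -> list[float]:
    """Nominal wait before each retry; the first attempt is not a retry."""
    delays = []
    for retry in range(max_attempts - 1):
        if retry <= max_doublings:
            delay = min_backoff_s * 2 ** retry
        else:
            delay = min_backoff_s * 2 ** max_doublings * (retry - max_doublings + 1)
        delays.append(min(delay, max_backoff_s))
    return delays


burst_drain = min_drain_seconds(300, 10)                  # 30.0 seconds at 10/s
slow_handler_rate = effective_rate(10, 20, 4.0)           # 5.0/s if each parse takes 4 s
slow_drain = min_drain_seconds(300, slow_handler_rate)    # 60.0 seconds
schedule = retry_delays(5, 10, 600, 4)                    # [10, 20, 40, 80]
```

With five attempts, `maxBackoff` and `maxDoublings` never come into play. They only matter if `maxAttempts` is raised.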

It also gives us operational levers Pub/Sub doesn't: list tasks, inspect depth, pause a queue, purge a bad batch. When a fan-out goes wrong, we can stop it.

What Cloud Tasks doesn't solve

Three things, all important.

It's still at-least-once. A handler can finish the work and Cloud Tasks can still redeliver if the HTTP response is lost. Handlers must be idempotent.
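One common way to get idempotency is to claim the task ID and do the work in the same transaction. The sketch below uses in-memory SQLite so it runs anywhere. The table names and `handle_parse` are hypothetical, and a Postgres version would use `INSERT ... ON CONFLICT DO NOTHING` instead of `INSERT OR IGNORE`.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed_tasks (task_id TEXT PRIMARY KEY)")
db.execute("CREATE TABLE parsed_attachments (task_id TEXT, body TEXT)")


def handle_parse(task_id: str, body: str) -> int:
    """Return an HTTP-style status; a repeat delivery is acknowledged without redoing work."""
    with db:  # one transaction: the claim and the write commit or roll back together
        claimed = db.execute(
            "INSERT OR IGNORE INTO processed_tasks (task_id) VALUES (?)",
            (task_id,),
        ).rowcount
        if claimed == 0:
            return 200  # duplicate delivery: acknowledge and do nothing
        db.execute(
            "INSERT INTO parsed_attachments (task_id, body) VALUES (?, ?)",
            (task_id, body),
        )
    return 200


first = handle_parse("parse-e1-a1", "statement text")
second = handle_parse("parse-e1-a1", "statement text")  # simulated redelivery
rows = db.execute("SELECT COUNT(*) FROM parsed_attachments").fetchone()[0]
```

Both deliveries return 200, but only one row is written. If the work fails partway, the transaction rolls back the claim too, so a later retry can still do the work.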

Fan-out duplication is still possible. If the parent creates 100 child tasks and then fails before returning 200, the retried parent creates them again. The fix here is deterministic task names:

parse-{emailId}-{attachmentId}

Cloud Tasks rejects duplicate names within its retention window: the retried create call fails with an ALREADY_EXISTS error, which the parent should catch and treat as success. But you have to design for it; it's not automatic.
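The effect of deterministic names is easy to see with an in-memory stand-in for the queue. `FakeQueue` and `fan_out` are hypothetical; the real service signals the duplicate with an ALREADY_EXISTS error on the create call rather than a boolean.

```python
class FakeQueue:
    """In-memory stand-in for a queue that rejects task names it has already seen."""

    def __init__(self) -> None:
        self.tasks: dict[str, dict] = {}

    def create_task(self, name: str, payload: dict) -> bool:
        if name in self.tasks:
            return False  # the real API would report ALREADY_EXISTS here
        self.tasks[name] = payload
        return True


def task_name(email_id: str, attachment_id: int) -> str:
    """Same inputs always give the same name, so a retry cannot mint new tasks."""
    return f"parse-{email_id}-{attachment_id}"


def fan_out(queue: FakeQueue, email_id: str, attachment_ids: range) -> int:
    """Create one child task per attachment; return how many were actually new."""
    created = 0
    for attachment_id in attachment_ids:
        payload = {"emailId": email_id, "attachmentId": attachment_id}
        if queue.create_task(task_name(email_id, attachment_id), payload):
            created += 1
    return created


queue = FakeQueue()
first_attempt = fan_out(queue, "e1", range(100))  # parent then fails before returning 200
retry_attempt = fan_out(queue, "e1", range(100))  # the retried parent runs again
```

The first attempt creates 100 tasks and the retry creates none, so the queue still holds exactly 100 children instead of 200.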

And it doesn't recover transactional enqueue. Cloud Tasks lives outside the database, so creating a task after a DB write is a dual-write. If you need strict atomicity, the answer is still an outbox: write the business row and an outbox row in one transaction, have a relay publish to Cloud Tasks and mark the row published. No external queue makes this go away.
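A compact sketch of the outbox described above, again on in-memory SQLite so it is runnable. The schema, `save_email_and_enqueue`, and `relay_once` are hypothetical, and `publish` stands in for the call that creates the Cloud Tasks task.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE emails (id TEXT PRIMARY KEY, subject TEXT)")
db.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "task_name TEXT UNIQUE, payload TEXT, published INTEGER DEFAULT 0)"
)


def save_email_and_enqueue(email_id: str, subject: str) -> None:
    """Business row and outbox row commit together, or neither does."""
    with db:
        db.execute("INSERT INTO emails (id, subject) VALUES (?, ?)", (email_id, subject))
        db.execute(
            "INSERT INTO outbox (task_name, payload) VALUES (?, ?)",
            (f"import-{email_id}", json.dumps({"emailId": email_id})),
        )


def relay_once(publish) -> int:
    """Publish unpublished rows, then mark them; returns how many were relayed."""
    rows = db.execute(
        "SELECT id, task_name, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, name, payload in rows:
        # A crash after publish but before the UPDATE republishes this row later;
        # the deterministic task name is what makes that harmless.
        publish(name, json.loads(payload))
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


published: list[tuple[str, dict]] = []
save_email_and_enqueue("e1", "Refix documents")
relayed_first = relay_once(lambda name, payload: published.append((name, payload)))
relayed_again = relay_once(lambda name, payload: published.append((name, payload)))
```

The relay is still at-least-once, which is why the outbox pairs with deterministic task names and idempotent handlers rather than replacing them.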

The rule we landed on

Queue selection isn't about finding the best queue. It's about matching the queue to the runtime model.

pg-boss for small internal jobs in always-on services where Postgres transactionality matters.
Cloud Tasks for cross-system, serverless workflows where we need to protect Neon, LLM providers, and third-party APIs from bursts.

And three rules that apply regardless:

Every handler is idempotent.
Fan-out children have deterministic keys.
If enqueue must be atomic with a business write, use an outbox.

Cloud Tasks fixed our infrastructure mismatch, but the real win was clarifying what the queue is responsible for. Infrastructure handles scheduling, retries, and rate limits. Correctness belongs to the application.