Takayuki Kawazoe
How we ended up running one product with 2-3 people, after building our own dev harness

The thing nobody told me about agentic coding making you 5x faster is that your bottleneck just moves. Code lands faster, sure. But now the PM is the slow part of the loop. The QA pass is the slow part of the loop. The "wait, is this what we actually wanted" conversation that used to take a day now takes four days, because you can ship three iterations in the time it takes to confirm what one of them was supposed to do.

I run a small development shop in Tokyo. We do contract work for a handful of companies, and every one of them wants more shipping velocity than they have headcount for. About eighteen months ago I started taking AI coding seriously as a way to scale our own capacity. The Claude API plus a decent agent runner could plausibly make us much faster. So I started using it on real work, and within a few weeks the experience had a shape I didn't expect: the agent was great. The chain leading up to and following the agent was the problem.

We ended up building our own harness around the agents, and the thing it bought us wasn't faster code. It was tighter loops. The punchline is "one product, one PM, one engineer, end-to-end," but the surprises along the way are the actually useful part.

I'm building Codens, the harness in this story — happy to talk about it but the goal here is the build journey, not a pitch.

Coding got fast, then everything around coding got slow

The first month using Claude Code seriously on a client project, our velocity on the implementation step went up roughly 3x. It was real, measured against a previous quarter of similar tickets, and it stayed.

What also went up was the rate at which we shipped the wrong thing. Not "broken" wrong. The agent didn't write code that crashed. It wrote code that worked, looked clean, and implemented something subtly different from what the client meant when they wrote the ticket. Sometimes we'd merge, deploy, and only notice in the next sync. Sometimes the client noticed immediately and we'd have a rework conversation that ate two days of the four days we'd just saved.

The pattern was always the same: a one-paragraph ticket like "users should be able to undo a deletion within 30 seconds" gets handed to the agent, the agent writes a perfectly reasonable interpretation, and the perfectly reasonable interpretation isn't what the client had in their head. Humans had this problem too. The difference was that humans took a day to write the wrong thing, so we had a day to ask "wait, undo from where? the trash, or in-place? does it survive a page refresh? what about cascading deletes?" and catch the ambiguity before we'd implemented it.

The agent took twenty minutes. The clarifying conversation now had to happen after the implementation, against a concrete artifact, which feels efficient until you realize you're throwing away an hour of agent work for every ambiguous ticket.

So the first thing I built wasn't anything that touched the agent. It was a thing that took the rough one-paragraph request and turned it into a structured spec the human PM could review before anyone wrote code. Inputs, outputs, edge cases, what's explicitly out of scope. Half a page, every time. The lift dropped from "stare at a blank page for an hour" to "react to a draft for ten minutes."

This was the first real win. Not faster code, clearer asks.
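
For concreteness, here's a minimal sketch of what that spec-drafting step can look like. The prompt wording, section names, and model id are illustrative, not our production setup; all it assumes is the Anthropic Python SDK.

```python
# Minimal sketch of the spec-drafting step. Prompt and section names are
# illustrative; the real version carries per-project context.
import anthropic

SPEC_PROMPT = """Turn the rough request below into a structured spec for PM
review. Use exactly these sections: Inputs, Outputs, Edge cases, Out of scope.
Keep it to half a page. Where the request is ambiguous, record the ambiguity
under Edge cases as a question instead of guessing an answer.

Request:
{request}"""

def draft_spec(rough_request: str) -> str:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder; use whatever tier fits
        max_tokens=1024,
        messages=[{"role": "user",
                   "content": SPEC_PROMPT.format(request=rough_request)}],
    )
    return response.content[0].text  # a draft for the PM to react to, not to merge

print(draft_spec("users should be able to undo a deletion within 30 seconds"))
```

The load-bearing design choice is in the prompt: ambiguities become questions in the draft, so the PM's ten minutes go to answering them rather than discovering them.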

Once specs were structured, the chain extended downstream on its own

Here's the thing I didn't see coming. Once the spec output was structured (actual fields, actual edge-case lists, actual out-of-scope statements), handing that spec to the coding agent was also better. Way better. The agent stopped guessing what the ticket meant because the ticket no longer contained guesses.

So the chain extended naturally. Rough request, structured spec the PM reviews, agent implements, PR. The handoff between spec and implementation became an artifact instead of a conversation, which meant nobody had to be on Slack at the same time for it to flow.

The first version of the implementation runner was embarrassingly thin. A Python script that took the spec, shelled out to Claude Code in a worker, and posted the resulting PR URL to Slack. I deployed it on a small box in our office on a Saturday afternoon. Within a week we'd run something like 40 tickets through it. Within two weeks we hit our first real problem: every client had different rules, and the agent didn't know any of them.
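
For the curious, a sketch of that first runner's shape, assuming Claude Code's headless `-p` mode and a Slack incoming webhook. The prompt, webhook URL, and PR-URL convention are illustrative; permission flags and git setup are omitted.

```python
# Roughly the first runner's shape: read a spec, hand it to Claude Code
# headless, post whatever PR URL comes back to Slack.
import json
import re
import subprocess
import sys
import urllib.request

SLACK_WEBHOOK = "https://hooks.slack.com/services/..."  # per-workspace secret

def run_ticket(spec_path: str) -> None:
    spec = open(spec_path).read()
    # `claude -p` runs Claude Code non-interactively and prints its output.
    result = subprocess.run(
        ["claude", "-p",
         "Implement this spec, open a PR with `gh pr create`, "
         f"and print the PR URL:\n\n{spec}"],
        capture_output=True, text=True, timeout=3600,
    )
    match = re.search(r"https://github\.com/\S+/pull/\d+", result.stdout)
    text = match.group(0) if match else f"run ended without a PR URL:\n{result.stdout[-500:]}"
    req = urllib.request.Request(
        SLACK_WEBHOOK,
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

if __name__ == "__main__":
    run_ticket(sys.argv[1])
```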

Each product had different rules, and the agent didn't know any of them

This is where the harness started to look less like a script and more like a system.

One of our clients ran a fintech with strict patterns around money handling. Every price had to flow through a specific pricing module, with no direct multiplication of cent amounts allowed. Another had a Next.js codebase with a strong convention that all data fetching happens in server components and 'use client' is a code-review red flag. A third had a Django monolith with a deeply opinionated repository pattern.

The agent, by default, would happily write code that violated all of these. It wasn't being malicious. It was writing code that worked. Code that worked just often happened to violate house rules.

We started keeping per-project context: a CLAUDE.md per repo with the rules, the patterns, the "don't ever do X" list, the "always check Y before Z" notes. The agent reads it on every run. Not a novel idea; it's what most teams using Claude Code do. What was novel for us was treating it as a deliverable of every project setup. New client onboards, first week is "let's write your rules file together." Half discovery, half negotiation, all of it useful.
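
A condensed, hypothetical example of the shape such a file takes, loosely modeled on the fintech client's rules:

```markdown
# Project rules (read before every change)

## Never
- Never multiply cent amounts directly. All prices flow through `bcp_price()`.
- Never write raw SQL against the billing tables; go through the billing repository.

## Always
- Always check for an existing pricing helper before adding a new calculation.
- Always run `make verify` before declaring a task done.
```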

That helped. It did not solve the problem.

The rules-don't-stick problem, and the gates we had to add on top

The agent would read the rules file. The agent would acknowledge the rules file in its planning output. The agent would then, several tool calls later, write code that violated rule #4 because it was working through the immediate sub-problem and rule #4 was no longer in the foreground of its context.

The first time this hurt us in production was on the fintech client. A migration ran, a price calculation got refactored as a "tidying" side effect, and the calculation now bypassed the pricing module and went straight to multiplying two integers, because the agent's local fix made the code "cleaner." The PR passed review (the reviewer was tired, the diff looked clean, the math looked right). It hit prod. Nobody noticed for two days because the answer was right within rounding. We caught it during a billing reconciliation.

If the rule is "all prices flow through bcp_price()," that rule cannot be advisory. The agent's judgment is the wrong layer for it. The reviewer's judgment is the wrong layer for it. There has to be a deterministic check that the agent's diff respects the rule, and that check has to run before the PR is mergeable, and it cannot be something the agent can talk its way past.

So we added gates. Real ones. Not "AI checks AI"; that has the same failure mode as the agent forgetting rule #4. Deterministic checks: shell commands, regex over the diff, AST walks for the more sensitive ones. The agent runs them, gets the failure output back, fixes the diff, runs them again. If they don't pass, the run errors out. The agent cannot decide the gate "doesn't apply here."
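
A minimal sketch of what one of those gates can look like, assuming the fintech rule that all prices flow through bcp_price(). The naming heuristic and file-level scoping are illustrative; a real gate is tuned per project and runs over the agent's diff.

```python
# Minimal sketch of a deterministic gate for the rule "all prices flow
# through bcp_price()". Flags direct multiplication where an operand name
# looks money-like. The heuristic and scoping are illustrative.
import ast
import sys

SUSPECT = ("cents", "price", "amount")

def names_in(expr: ast.expr) -> list[str]:
    return [n.id.lower() for n in ast.walk(expr) if isinstance(n, ast.Name)]

def check_file(path: str) -> list[str]:
    tree = ast.parse(open(path).read(), filename=path)
    violations = []
    for node in ast.walk(tree):
        if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Mult):
            names = names_in(node.left) + names_in(node.right)
            if any(s in n for n in names for s in SUSPECT):
                violations.append(
                    f"{path}:{node.lineno}: direct multiplication of a "
                    f"money-like value; route it through bcp_price()")
    return violations

if __name__ == "__main__":
    problems = [v for path in sys.argv[1:] for v in check_file(path)]
    print("\n".join(problems))
    sys.exit(1 if problems else 0)  # non-zero fails the gate and blocks the merge
```

The point is the exit code: non-zero blocks the merge, and there is no prompt in the world that talks an AST walk out of its verdict.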

The split that emerged, and the one I now think is the right shape for any agentic workflow handling production code, is this: the agent does the open-ended judgment work (what to write, how to structure it), and a deterministic step machine sequences and checks that work. If verify fails, the agent's next turn gets the failure output as input and tries again. If verify still fails after a few retries, a human gets paged. Open-ended cognition where it earns its cost; deterministic plumbing everywhere else.

This was the second non-obvious lever. The first was "AI writes specs." The second was "AI judgment cannot be trusted to enforce rules; rules are a separate layer."

Tests went into the same chain, almost as an afterthought

We were already running our test suites in CI. What changed once the gates existed was that the agent itself started running tests as part of its loop, before opening a PR.

The shape was: agent finishes implementing, harness runs the project's verify command (lint, typecheck, relevant test slice). If it fails, the agent gets the last ~1500 bytes of test output piped into its prompt, edits, retries. Up to three iterations, then gives up and asks for help.
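
A sketch of that loop, reusing the headless `claude -p` call from the runner above. The 1500-byte tail and the three-retry cap match the numbers here; the verify command is whatever the project defines.

```python
# Sketch of the verify-retry loop, reusing the headless `claude -p` call.
# The 1500-byte tail and three-retry cap match the numbers in the post;
# the verify command is project-defined (lint, typecheck, test slice).
import subprocess

MAX_RETRIES = 3
TAIL_BYTES = 1500

def verify_and_fix(verify_cmd: list[str]) -> bool:
    for attempt in range(MAX_RETRIES + 1):
        result = subprocess.run(verify_cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return True  # gates and tests pass; safe to open the PR
        if attempt == MAX_RETRIES:
            break
        tail = (result.stdout + result.stderr)[-TAIL_BYTES:]
        # The failure output becomes the next turn's input.
        subprocess.run(
            ["claude", "-p",
             f"The verify step failed. Edit the code so it passes:\n\n{tail}"],
            timeout=3600,
        )
    return False  # out of retries: page a human instead of merging
```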

The retry loop sounds fancy and was actually trivial to implement once the gate infrastructure was there. The output of "tests failed" is structured enough that the agent can read it and produce a corrective edit roughly 70% of the time on the first retry, climbing past 90% by the third. I didn't believe the number when I first measured it. The third retry is doing real work; it's not just throwing more tokens at a stuck problem, it's correcting the second retry's overcompensation.

The thing that surprised me was the team consequence. Once tests were running inside the agent's loop, our human review time on PRs dropped sharply. Not because the AI was doing review (it wasn't, and I'd argue it shouldn't) but because the agent was no longer opening PRs that were going to fail CI. The class of review where you spent twenty minutes reading a diff only to comment "tests are failing, please fix" disappeared. PRs that arrived in front of a human were PRs where the agent had already passed the gates the human cared about most. The conversation moved up the stack, to architecture and product fit, where humans should have been spending the time anyway.

Production bugs needed to flow back into the same pipe

Around the time the test loop stabilized, we hit the next obvious thing: production bugs are also tickets. They have a different shape (stacktrace and reproduction context, not a one-paragraph product request) but they're tickets. Why aren't they entering the same pipeline?

So we built a bug ingestion path. Sentry events, error reports, customer feedback that mentions a specific broken thing come in through a webhook, get analyzed for "is this actionable enough to attempt a fix automatically," and if yes, the harness opens a fix PR. Same retry loop as the implementation path.
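
As a sketch, the ingestion path can be as small as a webhook handler with an actionability check in front of the queue. Flask, the payload fields, and the enqueue_ticket() handoff are all illustrative stand-ins here, not the real interface.

```python
# Sketch of the bug ingestion path. Flask, the payload fields, and
# enqueue_ticket() are illustrative stand-ins, not the real interface.
from flask import Flask, request

app = Flask(__name__)

def actionable(event: dict) -> bool:
    # Stub. The real check asks: is there a stacktrace or a reproducible
    # step, enough for the agent to attempt a fix without guessing?
    return bool(event.get("stacktrace") or event.get("steps_to_reproduce"))

def enqueue_ticket(spec: str) -> None:
    # Hypothetical handoff into the same implement -> gates -> PR pipeline
    # that feature tickets run through.
    print("queued:", spec[:80])

@app.post("/hooks/bugs")
def ingest():
    event = request.get_json(force=True)
    if not actionable(event):
        return {"queued": False}, 202  # park for human triage instead
    spec = (f"Fix this production bug. Reproduce it first.\n\n"
            f"Title: {event.get('title', 'untitled')}\n"
            f"Context: {event.get('stacktrace', '')[:2000]}")
    enqueue_ticket(spec)
    return {"queued": True}, 202
```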

There's a public-facing piece for one of our clients: a feedback page where users describe broken things in natural language. Those descriptions get turned into structured bug reports and fed into the analyzer. End to end, a user reporting "the export button doesn't work on Safari" can result in a fix PR sitting in front of an engineer about ninety seconds later. Most of the time the engineer doesn't merge it as-is. They read it, adjust, rerun, merge. But they're starting from a reasonable diff against a reproduced bug, not from "let me see if I can repro this on Safari first."

Bug reports, feature specs, and refactor tickets all hit the same harness, all run through the same gate sequence, all produce the same kind of artifact. The PM and engineer don't have to context-switch between "we're doing features now" and "we're triaging bugs now." The queue is a queue.

Rolling it out: the moment the team started using it without me

For the first six months, the harness was something I personally ran. Tickets flowed in, I ran them, PRs came out. The other engineers used it occasionally but mostly worked the old way.

The flip happened over about three weeks. One engineer started using it for the boring tickets, the "implement this CRUD endpoint exactly like the other three" type of work. Their output in those weeks went up noticeably. Another engineer noticed and asked how to set it up for their project. By week three the harness was the default tool for incoming tickets across all four of our active clients.

The thing that flipped it wasn't a feature. It was the moment the gates and the tests were reliable enough that an engineer could send a ticket through and trust the result enough to open the PR for review without re-doing the work themselves. Before that point, the harness was a curiosity. After that point, it was infrastructure.

A small but important detail: we share an org-level credit pool for the LLM API calls across all the agents in the harness. No per-engineer budgeting. If you need to run a hard ticket through the implementation path eight times because the spec keeps shifting, you run it eight times. The shared pool means the cost conversation is a monthly business conversation, not a daily individual one. I think this matters more than people realize. Per-seat or per-run budgets create friction at exactly the wrong moment, the moment an engineer is deciding whether to use the tool or not.

We also added a side-channel that observes the harness (every PR, every ticket, every gate failure, every retry) and computes activity signals per engineer per week. This wasn't supposed to be a productivity panopticon. It was "is the harness actually working" telemetry, the same way you'd watch a deploy pipeline. What it became, naturally, was a thing the engineers themselves used as a sanity check on their own work.

Where we landed: 2-3 people running a product end-to-end

Today, three of our four active client products run with one PM and one engineer. The fourth has two engineers because it's a larger codebase mid-migration. The PM writes rough requests, the harness turns them into specs, the PM edits, the engineer reviews and approves the spec, the harness implements, gates run, tests run, PR lands, engineer reviews, merges. Bugs come in through the public feedback path, get analyzed, fix PRs land in front of the engineer.

The engineer's day is mostly architecture decisions and prioritization conversations with the PM. The bulk of the implementation queue runs without their direct attention. They jump in when the gates fail in a way that needs human judgment, when a spec has a subtle ambiguity the PM didn't catch, when the bug analyzer surfaces a fix that's actually "this whole subsystem needs a rethink." Higher-leverage work, fewer bytes typed per day.

The number that matters isn't "we 5x'd coding speed." At this point I don't even know how to measure that, because the question is malformed. The number that matters is iteration speed. From "client mentions a thing" to "thing is in production" used to be measured in days for us. It's now measured in hours for most of what we do, and the bottleneck is the conversation, not the work.

That's the whole story. We didn't build a thing that codes faster. We built a thing that lets the conversation stay in front, where it should be, while the mechanical parts run. The mechanical parts include things you wouldn't have called mechanical five years ago: implementation, test fixing, bug triage. Most of those have a clear-enough shape that a deterministic step machine plus an agent in the open-ended slots gets you 80%+ of the way through, and the remaining 20% is exactly the work the humans were good at to begin with.

If I had to compress the lesson, it's this. The first AI win is faster typing. Real teams quickly find that's not the bottleneck. The actual win is moving the slow conversation, the "what are we even building" conversation, back to the front of the loop, and giving the rest of the loop enough deterministic structure that humans only show up for the parts that need them. You don't get there by buying an AI coding tool. You get there by treating the whole loop as the design surface.

Codens is the harness we built for ourselves and now offer to other teams (https://www.codens.ai/en/). Other shops will solve this differently. But if you're a small team feeling like the AI made you faster at typing without making you faster at shipping, the gap is probably somewhere I described above. Worth looking at the parts of the loop that aren't typing.

Closing thought. The iteration speed thing keeps surprising me even now. We've been operating at the new tempo for about six months. Every time a client says "can you have this by Friday" and we ship Wednesday, I notice a small flicker of "is that actually going to be okay long-term." Not because of the code but because of what the new tempo does to the conversation rhythm. Our PMs have had to learn to think faster about what they want, because the implementation lag is no longer a built-in pause for reflection. That's a real cost. I don't think it outweighs the benefit, but it's the part of this transition I most underestimated. The harness made the mechanics fast. It also pushed all the unfastness onto the humans who decide what to build, which is, on net, where I want it. But it doesn't feel free.
