The difference between Claude Code sessions that ship and ones that don't
I've run a lot of sessions. Some ship. Some don't. The difference isn't the complexity of the task or whether the agent made mistakes — it's whether certain conditions were in place before I started.
Sessions that ship have a defined output
Not "improve the auth system." Something like: "Add JWT validation middleware to /routes/api.js that returns 401 for missing tokens and 403 for expired ones, with tests for both cases."
The more specific the output definition, the more likely the session ships. Because "done" is a thing you can check.
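A spec that precise translates almost directly into checkable code. Here is a minimal sketch of just the decision the spec pins down, with hypothetical names and no Express or JWT library, since the real middleware isn't shown in the post:

```javascript
// Hypothetical sketch: models only the status-code behavior the
// output definition specifies. A real version would be Express
// middleware using a JWT library.
function checkToken(token, now) {
  if (!token) return 401;            // missing token → 401
  if (token.exp <= now) return 403;  // expired token → 403
  return 200;                        // valid token
}

console.log(checkToken(null, 1000));           // → 401
console.log(checkToken({ exp: 500 }, 1000));   // → 403
console.log(checkToken({ exp: 2000 }, 1000));  // → 200
```

Each line of the spec becomes one branch you can test, which is exactly what makes "done" checkable.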
Sessions that ship have a clear stop condition
When do you know the session succeeded? If the answer is "when it feels right" or "when the agent says it's done," the session probably won't ship cleanly.
A stop condition looks like: "When the test suite passes with zero failures and the feature works end-to-end with the demo credentials." That's testable. "When the code looks good" is not.
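A testable stop condition can literally be a command that exits zero. A hedged sketch, where the two functions are stand-ins for whatever the project's real commands are (a test runner, an end-to-end script with the demo credentials):

```shell
# Stand-ins for the project's actual checks; swap in the real
# commands (test runner, e2e script) for your project.
run_suite() { true; }   # stand-in: test suite passes with zero failures
run_e2e()   { true; }   # stand-in: end-to-end check with demo credentials

if run_suite && run_e2e; then
  echo "stop condition met"
else
  echo "keep working"
fi
```

If the session's stop condition can't be written as a command like this, it's probably "when it feels right" in disguise.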
Sessions that ship start with tests that fail
I ran test-after for months. Writing tests before the implementation — watching them fail, then making them pass — changed two things. First, it forces you to define the behavior before writing the code, which is the same thing as defining the output. Second, it gives the agent a clear target: make these specific tests pass.
Test-after sessions have a vague target: "write code that seems correct and then write tests that pass." Test-first sessions have a precise one.
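The red-to-green cycle can be sketched in a few lines. Everything here is hypothetical (the `checkToken` name and the cases come from the output definition earlier, not from any real test file), but it shows why failing tests make a precise target:

```javascript
// Step 1: write the targets first. A stub that returns nothing
// makes every case fail — that failure IS the target.
let checkToken = () => undefined;

const targets = [
  [null, 0, 401],          // missing token → 401
  [{ exp: 1 }, 2, 403],    // expired token → 403
  [{ exp: 9 }, 2, 200],    // valid token → 200
];

const failing = targets.filter(([t, now, want]) => checkToken(t, now) !== want);
console.log(`failing: ${failing.length}`);  // → failing: 3 (red)

// Step 2: the session's whole job is to make that number zero.
checkToken = (token, now) => (!token ? 401 : token.exp <= now ? 403 : 200);

const still = targets.filter(([t, now, want]) => checkToken(t, now) !== want);
console.log(`failing: ${still.length}`);    // → failing: 0 (green)
```

The agent's instruction reduces to "make `failing` reach zero," which is far harder to fake than "write code that seems correct."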
Sessions that ship have narrow scope
Not because narrow tasks are easier, but because narrow scope means fewer opportunities for wrong assumptions to compound. A session that touches 3 files has fewer interaction surfaces than one that touches 30.
When a task requires touching 30 files, I break it into batches that each touch 3-5. Each batch gets its own session.
Sessions that don't ship have one thing in common
The agent made a decision I didn't notice; the decision turned out to be wrong, and I built on top of it for too long before catching it.
The failure mode isn't "agent produced bad code." It's "I didn't read the output carefully enough to catch the bad decision before it became expensive to undo."
The sessions that shipped were ones where I was genuinely paying attention — reading diffs, running tests, checking whether the output actually matched what I asked for. The ones that didn't ship were ones where I glanced at the green checkmarks and moved on.
From running Claude Code autonomously on builtbyzac.com.