
Takayuki Kawazoe


"When the AI gets stuck, the engineer fetches the same PRD via MCP and keeps going"

Last Tuesday I watched our auto-fix agent burn through three retries on a session-handling bug and surrender. The failure mode was honest. It tried, and the diff broke a test we did not know existed. It tried again, and the second diff fought with an old idempotency check. The third diff was basically the first one with renamed variables. Then it stopped. The bug report sat in our system marked analysis_failed, the proposed plan was there, the partial diff was there, and the engineer who had to take over was scrolling through Slack.

That gap, the moment between "AI gave up" and "engineer is coding," is where most AI dev tools quietly cost more than they save. The engineer cannot just resume. They have to reconstruct what the AI was looking at: which PRD section, which kickoff decision, which root cause analysis, which files the bug report pointed at. The data exists. It just lives in five places and none of them are inside the IDE.

We shipped codens-mcp v0.7.5 partly to close that gap. The AI workflow inside Codens reads and writes the same PRDs, bug reports, kickoffs, and run logs that an engineer can now pull into Claude Code over MCP with one call. Same source of truth. Two surfaces. The handoff loses nothing.

The 80/20 reality nobody markets

The honest number for a well-tuned AI dev harness on real production code is somewhere between 80% and 90% of tasks completed end-to-end. The remaining 10% to 20% is novel business logic, conflicts with code the AI never saw, spec ambiguity that no amount of retry will resolve, and the long tail of edge cases that someone has to think through. I do not believe the "100% AI development" pitch, and I do not think anyone shipping into real codebases does either.

The 20% is not the problem. The problem is the seam between the 80% and the 20%.

When the AI hands a task back, the human arrives without context. The PRD is in Notion. The bug analysis is in Sentry plus some chat thread. The kickoff decision that explains "we chose JWT not session cookies" is buried in a meeting recap. The engineer has to play archaeologist before they write a single line. And because the AI workflow has already burned through three retries, the next attempt starts from a worse position than if the engineer had been the first responder.

Most AI dev tools optimize the 80%. They get better at the part the AI was already good at. The 20% gets a "human-in-the-loop" label and a button that says "request review." That button does not solve anything. The engineer still has to find everything.

Codens treats the seam as the actual product. The 80% has to keep getting better, obviously. But the 20% is where the trust gets built or destroyed, and the only way to make it good is to make the takeover instantaneous.

One source of truth, two read paths

Every artifact the AI produces or consumes during a task is a first-class entity in Codens, stored in Postgres, owned by a project, scoped to an org. Green Codens owns the planning side: Consultation (the requirement-gathering conversation), PRD (the structured spec), Kickoff (the implementation plan with vision, scope, tech selection, milestones), Plan (the task breakdown). Red Codens owns the repair side: Bug Report (with the AI's root cause analysis attached), Bug Fix Plan (proposed impact scope and test requirements). Purple Codens owns execution: Run (the live event stream from a workflow), Logs.

The AI workflow writes to these entities through internal service calls. When the Green PRD AI generator finishes a section, it patches the PRD row. When Red's analyzer finishes, it attaches an analysis blob to the bug report. When Purple's runner emits an event, it goes to the run's event log. Nothing escapes into chat. Nothing depends on a human copying text from one tab to another.

The second read path is codens-mcp. It is a Python package that registers as an MCP server inside Claude Code (or any other MCP client). It authenticates with the same JWT the web app uses, talks to the same backend APIs that the AI workflow talks to, and exposes 38 tools that cover 137+ actions. When an engineer calls green_prd(action="get", prd_id=...), they get the same PRD bytes the AI agent read three retries ago.

The point is not "we have an API." Every product has an API. The point is that the AI workflow and the engineer use the same access shape against the same row. There is no "engineer-facing version" of the PRD that drifts from the "AI-facing version." There is one row. Both sides read it. Both sides can write it.
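The "one row, two readers" idea can be sketched with an in-memory dict standing in for Postgres. Everything here is an illustrative assumption: `PrdStore`, the `prd_demo` id, and the section key are made up for the example and are not the Codens schema.

```python
from dataclasses import dataclass, field


@dataclass
class PrdStore:
    """Stand-in for the PRD table: one dict entry per PRD row (hypothetical)."""

    rows: dict = field(default_factory=dict)

    def get(self, prd_id: str) -> dict:
        # Both the agent and the engineer read through this one method.
        return self.rows[prd_id]

    def update_section(self, prd_id: str, section: str, text: str) -> None:
        # Both sides also write through the same method, so no second copy can drift.
        self.rows.setdefault(prd_id, {})[section] = text


store = PrdStore()

# The AI workflow patches a section when it finishes generating it.
store.update_section("prd_demo", "auth", "Use JWT, not session cookies.")

# Later, the engineer reads through the same path and gets the same row.
agent_view = store.get("prd_demo")
engineer_view = store.get("prd_demo")
print(agent_view is engineer_view)
```

The design choice the sketch illustrates is that neither reader gets a private rendering of the spec. There is nothing to reconcile because there is only one object.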

What codens-mcp actually exposes

The retrieval surface that matters for a takeover is small. An engineer who arrives at a failed task needs to know: what was being built, what decisions were already made, what the AI tried, and where it broke.

Install and authenticate once:

pip install codens-mcp
codens-mcp login

login runs Device Code Flow against the Codens auth service and stores a JWT at ~/.purple-codens/credentials.json. From that point every tool call carries the token automatically.
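As a rough illustration of "every tool call carries the token automatically," the sketch below reads a credentials file and builds a request header from it. The JSON layout (an `access_token` field) and the `Bearer` scheme are assumptions made for the example; the real file format is not documented here. The demo writes to a temporary file rather than touching the real path.

```python
import json
import tempfile
from pathlib import Path


def auth_headers(credentials_path: Path) -> dict:
    """Build an Authorization header from a stored credentials file.

    Assumption: the file is JSON with an "access_token" field.
    """
    data = json.loads(credentials_path.read_text(encoding="utf-8"))
    return {"Authorization": f"Bearer {data['access_token']}"}


# Demo against a throwaway file, not the real ~/.purple-codens location.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "credentials.json"
    path.write_text(json.dumps({"access_token": "example.jwt.value"}), encoding="utf-8")
    headers = auth_headers(path)

print(headers)
```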

Register the server in Claude Code:

{
  "mcpServers": {
    "codens": { "command": "codens-mcp", "args": ["serve"] }
  }
}

Then the engineer, in their IDE, asks Claude to pull the bug report the AI was working on:

red_bug_report(
    action="get",
    organization_id="org_abc",
    bug_id="bug_2f8a"
)
# -> { id, title, description, severity, steps_to_reproduce,
#      expected_behavior, actual_behavior, affected_files,
#      analysis: { root_cause, evidence, suspected_files }, ... }

The action parameter pattern is the whole reason 38 tools cover 137+ actions. One green_prd tool handles create, list, get, update, delete, update_section, approve, submit_for_review, request_changes, archive, unarchive, link_notion, unlink_notion, and consistency-check. The tool descriptor that the model loads at startup is one short signature, not fourteen. (We have written separately about why that matters for context budget. The short version is that a five-server stack burns 55K tokens advertising itself before any work; codens-mcp burns under 5K for everything.)
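The action-parameter pattern itself is easy to sketch: one entry point, one lookup table, many handlers. This is a toy stand-in, not the codens-mcp implementation. `prd_tool`, the three handlers, and the in-memory `_PRDS` store are hypothetical names chosen for the example.

```python
from typing import Any, Callable

# Hypothetical in-memory store standing in for the backend.
_PRDS: dict[str, dict[str, Any]] = {}


def _create(prd_id: str, title: str) -> dict[str, Any]:
    _PRDS[prd_id] = {"id": prd_id, "title": title, "archived": False}
    return _PRDS[prd_id]


def _get(prd_id: str) -> dict[str, Any]:
    return _PRDS[prd_id]


def _archive(prd_id: str) -> dict[str, Any]:
    _PRDS[prd_id]["archived"] = True
    return _PRDS[prd_id]


# One table maps action names to handlers. Adding an action adds a row here,
# not a new tool descriptor for the model to load.
_ACTIONS: dict[str, Callable[..., dict[str, Any]]] = {
    "create": _create,
    "get": _get,
    "archive": _archive,
}


def prd_tool(action: str, **params: Any) -> dict[str, Any]:
    """Single entry point: the model sees one signature, the action picks the handler."""
    handler = _ACTIONS.get(action)
    if handler is None:
        raise ValueError(f"unknown action {action!r}; expected one of {sorted(_ACTIONS)}")
    return handler(**params)


prd_tool(action="create", prd_id="prd_demo", title="Session handling")
prd_tool(action="archive", prd_id="prd_demo")
result = prd_tool(action="get", prd_id="prd_demo")
print(result)
```

The tradeoff of this shape is that parameter validation moves from the tool signature into each handler, so an unknown action or a missing parameter has to fail loudly at call time.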

For a takeover the engineer typically chains two or three calls:

green_kickoff(action="get", kickoff_id="kck_7a1c")
# -> vision, scope, non-goals, tech selection, milestones

green_plan(action="get_tasks", plan_id="pln_91de")
# -> ordered task list with status and dependencies

purple_run(action="get_status", run_id="run_be40")
# -> last events, failure reason, partial outputs

Three calls. Maybe forty seconds. The engineer now has the same view of the work that the AI had when it gave up, without leaving the IDE and without reading a single Slack thread.
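If the same three reads happen on every takeover, they can be bundled into one helper. The sketch below assumes a generic `call_tool(name, **params)` callable, which is a hypothetical stand-in for whatever MCP client is in use; only the tool names, actions, and ids come from the calls shown above.

```python
from typing import Any, Callable


def gather_takeover_context(
    call_tool: Callable[..., dict[str, Any]],
    kickoff_id: str,
    plan_id: str,
    run_id: str,
) -> dict[str, Any]:
    """Run the three reads and return them as one bundle."""
    return {
        "kickoff": call_tool("green_kickoff", action="get", kickoff_id=kickoff_id),
        "tasks": call_tool("green_plan", action="get_tasks", plan_id=plan_id),
        "run": call_tool("purple_run", action="get_status", run_id=run_id),
    }


def fake_call_tool(name: str, **params: Any) -> dict[str, Any]:
    # Stand-in for a real MCP client; it just echoes what was requested.
    return {"tool": name, **params}


context = gather_takeover_context(fake_call_tool, "kck_7a1c", "pln_91de", "run_be40")
print(context["run"])
```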

Walking through a real takeover

The Tuesday session-handling bug. Here is what actually happened after the third retry failed.

The on-call engineer opened their IDE. Claude Code was already running with codens-mcp registered. They typed:

"Pull bug report bug_2f8a and the latest fix plan."

Claude called red_bug_report(action="get", bug_id="bug_2f8a") and red_bug_fix_plan(action="get_by_bug", bug_id="bug_2f8a") in parallel. Both returned in under a second. The analysis pointed at the auth middleware. The fix plan listed the three files the AI thought needed to change and the test it expected to pass. The engineer read it in maybe two minutes.
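Issuing two independent reads in parallel looks roughly like this with `asyncio.gather`. The `fake_call_tool` coroutine is a hypothetical stand-in for an async MCP client call, and the short sleep only simulates latency; how Claude Code actually schedules tool calls is not something this sketch claims to reproduce.

```python
import asyncio
from typing import Any


async def fake_call_tool(name: str, **params: Any) -> dict[str, Any]:
    # Stand-in for an async MCP client call; the sleep simulates network latency.
    await asyncio.sleep(0.01)
    return {"tool": name, **params}


async def fetch_report_and_plan(bug_id: str) -> tuple[dict[str, Any], dict[str, Any]]:
    # gather() schedules both coroutines before awaiting either, so the waits overlap.
    report, plan = await asyncio.gather(
        fake_call_tool("red_bug_report", action="get", bug_id=bug_id),
        fake_call_tool("red_bug_fix_plan", action="get_by_bug", bug_id=bug_id),
    )
    return report, plan


report, plan = asyncio.run(fetch_report_and_plan("bug_2f8a"))
print(report["tool"], plan["tool"])
```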

Then they asked:

"What did the last Purple run actually do?"

Claude called purple_run(action="get_status", run_id=...) and purple_run(action="subscribe_events", run_id=...) for replay. The event log showed exactly which test had failed on each retry and why the third retry had effectively reverted to the first. The AI had been bouncing between two incompatible local minima.
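The "third retry reverted to the first" pattern is the kind of thing a small check can surface from a list of attempted diffs. The sketch below is a crude heuristic under stated assumptions: it works on plain diff strings (the real event log shape is not shown here), the sample diffs are invented, and it treats two diffs as equivalent when they differ only by consistent identifier renaming.

```python
import re
from typing import Optional

_IDENT = re.compile(r"[A-Za-z_]\w*")


def fingerprint(diff_text: str) -> str:
    """Replace each distinct identifier with a positional placeholder.

    Two diffs that differ only by consistent renaming get the same fingerprint.
    """
    names: dict[str, str] = {}

    def placeholder(match: re.Match) -> str:
        # First appearance of a name gets the next placeholder; repeats reuse it.
        return names.setdefault(match.group(0), f"v{len(names)}")

    return _IDENT.sub(placeholder, diff_text)


def first_repeat(diffs: list[str]) -> Optional[tuple[int, int]]:
    """Return (earlier, later) 1-based attempt numbers for the first repeated fingerprint."""
    seen: dict[str, int] = {}
    for attempt, diff in enumerate(diffs, start=1):
        fp = fingerprint(diff)
        if fp in seen:
            return seen[fp], attempt
        seen[fp] = attempt
    return None


# Invented sample diffs: attempt 3 is attempt 1 with renamed variables.
attempts = [
    "+ token = refresh(session)",
    "+ if not seen(request_id): token = refresh(session)",
    "+ tok = renew(sess)",
]
print(first_repeat(attempts))
```

A check like this cannot say why the agent looped, only that it did. Here that signal was enough to send the engineer looking at the test rather than at the fix.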

That was the engineer's "aha." The fix plan was conceptually right, but the test the AI was retrying against was wrong, written by an earlier feature, asserting a behavior the new spec explicitly changed. The engineer fixed the test, applied the AI's second-attempt diff with a four-line manual adjustment, and shipped it. From bug report open to PR merged: 23 minutes, including reading.

Without codens-mcp that same takeover would have been: open Sentry, search by ticket, copy stack trace, open Notion, find the PRD by title, scroll to the right section, open the chat thread where the kickoff lived, find the test naming pattern, grep the repo, then start coding. I have timed that path on myself. It is between 25 and 45 minutes before the first edit.

The tradeoff

The price of "one source of truth, two read paths" is schema discipline. Every artifact has to be modeled well enough that the AI workflow and the engineer both find what they need in it. You cannot let the PRD turn into a Markdown blob with five conflicting section conventions, because the AI's update_section action and the engineer's get_section reader both depend on the structure being honest. You cannot let the bug report become a free-text field with the root cause analysis stuffed at the bottom in a different format every time, because the takeover tooling that highlights analysis.suspected_files will silently miss them.
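One way to picture that discipline is a parser that refuses analysis blobs whose structure has collapsed into free text. The field names `root_cause`, `evidence`, and `suspected_files` come from the bug report shape shown earlier; the `Analysis` class, `parse_analysis`, the validation rules, and the sample file path are assumptions for illustration only.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Analysis:
    root_cause: str
    evidence: list = field(default_factory=list)
    suspected_files: list = field(default_factory=list)


def parse_analysis(raw: dict) -> Analysis:
    """Reject analysis blobs that hide the structure inside free text."""
    missing = [key for key in ("root_cause", "suspected_files") if key not in raw]
    if missing:
        raise ValueError(f"analysis is missing required fields: {missing}")
    if not isinstance(raw["suspected_files"], list):
        raise ValueError("suspected_files must be a list of paths, not free text")
    return Analysis(
        root_cause=raw["root_cause"],
        evidence=list(raw.get("evidence", [])),
        suspected_files=list(raw["suspected_files"]),
    )


# A well-formed blob parses; the path is a made-up example.
good = parse_analysis(
    {"root_cause": "stale session token", "suspected_files": ["auth/middleware.py"]}
)
print(good.suspected_files)

# A blob with the file list stuffed into prose is rejected instead of silently missed.
try:
    parse_analysis({"root_cause": "see notes", "suspected_files": "auth/middleware.py, maybe others"})
except ValueError as err:
    print("rejected:", err)
```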

This is heavier upfront than the alternative, which is to let each side render its own view. The alternative loses every time. The drift between "what the PM thinks the spec says" and "what the engineer thinks the spec says" is, in my experience, the single biggest source of bugs in features that get partially built by an AI. The schema discipline pays for itself the first time a takeover succeeds in under thirty minutes.

The other cost is a literal one: we call the Anthropic API directly, with per-token billing, and run our own multi-model routing across Claude and Qwen. That keeps the escalation path (AI workflow to manual engineer takeover via MCP) under our control, independent of what any single platform decides about subscription-tier agent access. When the platform shifts, the takeover path does not move.

Wrap

Graceful degradation is the unappreciated half of AI dev tool design. Anyone can build an agent that succeeds on the easy 80%. The teams that ship into real production code earn their trust on the 20% where the agent gives up and a human takes over. The only way to keep that takeover from feeling like a downgrade is to ensure the data the human needs is exactly the data the agent had, in the same shape, one tool call away.

That is what codens-mcp is. The AI does most of the work. When it cannot, the engineer reads the same row.

Codens English landing: https://www.codens.ai/en/
codens-mcp on PyPI: https://pypi.org/project/codens-mcp/
