DEV Community: Takayuki Kawazoe

We Made Our AI-vs-Human PR Stats a Public Live Dashboard

Takayuki Kawazoe — Tue, 07 Jul 2026 00:14:09 +0000

A while back I wrote about 64% of our merged PRs being written by AI. A few people (reasonably) asked: "nice story, but can I verify any of that?"

So we put it on a public, auto-updating dashboard:

👉 www.codens.ai/stats/en

It shows, for our GitHub organization:

What % of merged PRs are authored by AI agents (currently 65%)
Median time from PR-open to merge (2 minutes)
Who merged what — task-execution agents vs maintenance bots vs auto-fix vs humans
Weekly AI-merge counts and a per-repository breakdown

Every number is tallied straight from the GitHub API by a small collector script, and the page regenerates itself weekly. We can't inflate it — it's measured, and the down weeks show up too (that's kind of the point).

Why bother making it public

Two reasons.

1. "Trust me bro" doesn't scale. When you claim most of your code is AI-written, the only honest move is to expose the raw counts so anyone can sanity-check them. The dashboard is that receipt.

2. It's a live dogfooding test. The whole thing — the PRD, the implementation, the review, the auto-fix on production errors — runs on Codens, our own AI dev-automation suite. If our own numbers ever tanked, the dashboard would be the first place it'd show. Public accountability is a good forcing function.

Funny footnote: the dashboard itself was code-reviewed by our own AI reviewer before it shipped, and it caught two real bugs — an OG-image percentage that had drifted out of sync with the live number, and a broken error-fallback that would have blanked the page on malformed data. The tool built to sell "AI reviews your PRs" reviewed the PR that announces it. We'll take it.

If you want to see the same machinery on your repos, there's a 14-day free trial (no card): codens.ai. Japanese version of the dashboard is here.

64% of Our Merged PRs Were Written by AI — 2,424 PRs in 3 Months

Takayuki Kawazoe — Mon, 06 Jul 2026 10:49:28 +0000

Three months ago, a new "coworker" joined our GitHub organization. It never complains, works through the night, and occasionally goes home mid-task when its Spot instance gets reclaimed.

Last week I ran the numbers, and they were wilder than I expected.

64% of our merged PRs were written by AI

Between April 8 and July 6, 2026, our org's ten main repositories merged 3,781 pull requests. Of those, 2,424 (64%) came from branches created by AI agents.

The breakdown:

Type	PRs	What it does
Task-execution agents	859	Reads a Notion ticket, implements, tests, opens a PR, gets it merged — end to end
Maintenance bots	1,413	Doc sync and periodic housekeeping, quietly
Auto-fix agents	108	Watches Sentry, sends a fix PR when production breaks
Other agents	44	Odd jobs

Humans (effectively 2–3 of us) do reviews and direction. The hands on the keyboard are mostly AI.

Median time from "PR opened" to "merged": 2 minutes

For the 859 task-agent PRs:

Median: 2 minutes
p75: 9 minutes
93% merged within an hour

"Who reviews a PR in 2 minutes?" — another AI does. When a PR opens, an AI code reviewer files findings with severity levels, CI runs, and on approval the PR auto-merges. Humans watch the results scroll by in Slack.

To be honest about the metric: this clock starts after the agent has already implemented and passed local verification. Still, from "ticket marked Ready" to "merged," you can brew a coffee and it's done before you're back.
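For anyone who wants to reproduce figures like these on their own org, here is a minimal sketch of the arithmetic. It assumes each PR is a dict carrying the `created_at` and `merged_at` timestamps the GitHub REST API returns for pull requests; the sample data and the helper names `minutes_to_merge` and `merge_time_summary` are made up for illustration and are not our collector script.

```python
from datetime import datetime
from statistics import median, quantiles


def minutes_to_merge(pr: dict) -> float:
    """Minutes between PR open and merge, from ISO 8601 timestamps ending in Z."""
    opened = datetime.fromisoformat(pr["created_at"].replace("Z", "+00:00"))
    merged = datetime.fromisoformat(pr["merged_at"].replace("Z", "+00:00"))
    return (merged - opened).total_seconds() / 60


def merge_time_summary(prs: list[dict]) -> dict:
    """Median, 75th percentile, and share merged within an hour (merged PRs only)."""
    durations = sorted(minutes_to_merge(p) for p in prs if p.get("merged_at"))
    return {
        "median_min": median(durations),
        "p75_min": quantiles(durations, n=4, method="inclusive")[2],
        "within_hour": sum(d <= 60 for d in durations) / len(durations),
    }


# Made-up sample: five merged PRs and one that was closed without merging.
sample = [
    {"created_at": "2026-07-01T10:00:00Z", "merged_at": "2026-07-01T10:01:00Z"},
    {"created_at": "2026-07-01T10:00:00Z", "merged_at": "2026-07-01T10:02:00Z"},
    {"created_at": "2026-07-01T10:00:00Z", "merged_at": "2026-07-01T10:03:00Z"},
    {"created_at": "2026-07-01T10:00:00Z", "merged_at": "2026-07-01T10:09:00Z"},
    {"created_at": "2026-07-01T10:00:00Z", "merged_at": "2026-07-01T12:00:00Z"},
    {"created_at": "2026-07-01T10:00:00Z", "merged_at": None},
]
summary = merge_time_summary(sample)
```

Percentile definitions differ between tools, so a p75 computed this way may be off by a little from one computed with another interpolation method.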

Of course, it breaks in hilarious ways

So this doesn't read like a suspicious success story, here are real incidents:

Sudden resignation: agents run on Spot instances. When one is reclaimed mid-task, all that's left is a note reading failed at step ''. Reason: (empty)
Fake Done: the agent reports "Completed!" — there is no PR. There is also no one to interrogate
AI traffic jam: an AI-authored PR waits for an AI review to be merged by an AI, and another AI merges into the same file first
Time-bomb tests: an agent hardcoded a date into a test; 90 days later it detonated and blocked every PR in the repo

Each of these became its own postmortem article. The incident pipeline doubles as a content pipeline.

The machinery

We run this on Codens, our own family of AI dev-automation products:

Green turns a conversation into a structured PRD
Purple picks up tickets and runs agents: implement → verify → PR
Orange auto-reviews every PR (with security audit)
CI + approval → auto-merge
Production error? Red detects it and sends a fix PR
Blue handles QA and E2E

Yes, this whole article is a dogfooding report for our own product. But every number above is measurable via the GitHub API.

What's left for humans

After three months, the human jobs are:

Deciding what to build — ticket quality is everything; vague tickets get vaguely implemented
Designing verification — "tests pass" is not enough; agents will happily cut corners unless you assert file existence and forbidden patterns
First-response when things break — and the incidents become blog posts
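As a rough illustration of "assert file existence and forbidden patterns", a verification step could look something like the sketch below. Everything here is hypothetical: the function name `verify_workspace`, the rule names, and the regexes are examples, not what Codens actually runs.

```python
import re
import tempfile
from pathlib import Path


def verify_workspace(root: Path, required_files: list[str], forbidden: dict[str, str]) -> list[str]:
    """Return human-readable violations; an empty list means the check passed."""
    problems = []
    # 1) Files the ticket promised must actually exist.
    for rel in required_files:
        if not (root / rel).is_file():
            problems.append(f"missing file: {rel}")
    # 2) Patterns that signal a cut corner must not appear in any Python file.
    for path in sorted(root.rglob("*.py")):
        text = path.read_text(encoding="utf-8")
        for label, pattern in forbidden.items():
            if re.search(pattern, text):
                problems.append(f"{label} in {path.relative_to(root).as_posix()}")
    return problems


# Hypothetical rules: skipped tests and hardcoded calendar dates.
FORBIDDEN = {
    "skipped test": r"@pytest\.mark\.skip",
    "hardcoded date": r"date\(\d{4}, *\d{1,2}, *\d{1,2}\)",
}

# Demo workspace: the test file exists but hardcodes a date, and the source file is missing.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "tests").mkdir()
    (root / "tests" / "test_notes.py").write_text(
        "from datetime import date\nEXPIRES = date(2026, 7, 6)\n", encoding="utf-8"
    )
    problems = verify_workspace(root, ["src/notes.py", "tests/test_notes.py"], FORBIDDEN)
```

A check like this would have flagged the time-bomb test from the incident list before it merged, which is the kind of thing "tests pass" alone never catches.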

It feels less like "AI taking our jobs" and more like "suddenly managing a direct report who files 2,400 PRs a quarter." Management is hard.

Codens has a 14-day free trial (no credit card): codens.ai

Even adopting just Red turns "3 AM Sentry alert" into "a fix PR waiting for you at breakfast."

Our AI autopilot merged nothing for 3 days — the culprit was a review 'freshness' check

Takayuki Kawazoe — Wed, 01 Jul 2026 23:15:59 +0000

At Codens, we run our development on AI agents around the clock: a PRD becomes tasks, agents implement them and open PRs, an AI reviewer approves, and approved PRs merge automatically. Humans mostly read the reviewed output and the notifications. That was the idea — until one morning the dashboard showed zero completed tasks for three days straight.

"Autopilot is down."

Except it wasn't. Every component checked out healthy. Plans were being generated. Tasks were being submitted. Agents were opening PRs. The AI reviewer had already approved them. CI was green. The only thing that never happened was the merge.

The culprit was a freshness check in the code that polls PR reviews — a defensive check we had added ourselves, to prevent a different incident from recurring. This post is the story of how that played out, plus a lesson I think generalizes well beyond AI agents: do you treat external state as events, or as a snapshot?

Background: the wait_review step

Our workflow engine drives each task through fixed steps like implement → create_pr → wait_ci → wait_review → merge_pr. We learned early on that letting the AI decide "what to do next" makes operations impossible to reason about, so transitions are hard-coded in the engine.

wait_review is a rendezvous point: it polls GitHub PR reviews and advances once approvals are in. Inside it, there was this check:

# simplified
fresh_reviews = [
    r for r in pr_reviews
    if r.submitted_at > workflow.review_initiated_at
]
if any(r.state == "APPROVED" for r in fresh_reviews):
    advance_to_merge()

Only look at reviews newer than review_initiated_at. Seems reasonable.

Why the freshness check existed

It was added to prevent a real incident:

1. A reviewer APPROVES
2. The agent pushes follow-up commits
3. The reviewer submits CHANGES_REQUESTED
4. The poll picks up the stale APPROVE and merges anyway

Merging on an outdated approval after changes were requested is clearly bad, so we added "only trust reviews newer than when we started waiting." At the time, this was the correct fix.

Re-arming kills standing approvals

The problem was when review_initiated_at gets updated.

The engine has a recovery path for Spot instance reclamation and process restarts: it picks up interrupted workflows and re-enters the current step. Re-entering wait_review re-armed review_initiated_at to the current time — regardless of whether any new commits existed.

Which makes this timeline possible:

10:00  Reviewer APPROVES (submitted_at = 10:00)
10:30  Spot reclamation → workflow recovery re-enters the step
       → review_initiated_at re-armed to 10:30
10:31  Poll: submitted_at(10:00) > review_initiated_at(10:30) is false
       → "no new reviews"
Since the reviewer has already approved, no new review
will ever arrive → the workflow waits forever

The approval still exists on GitHub, plain as day — a standing approval. But the filter's reference time moved past it, so the engine forever concludes "no review yet." A perfect deadlock.
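The timeline above can be replayed in a few lines. This is a sketch with a stand-in `Review` type and an arbitrary date, not the engine's real classes; only the predicate mirrors the simplified check shown earlier.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Review:
    reviewer: str
    state: str
    submitted_at: datetime


def has_fresh_approval(reviews, review_initiated_at):
    """The original predicate: only reviews newer than the reference time count."""
    return any(
        r.state == "APPROVED" and r.submitted_at > review_initiated_at
        for r in reviews
    )


def at(hhmm: str) -> datetime:
    # Arbitrary date; only the time of day matters for the replay.
    return datetime.fromisoformat(f"2026-06-29T{hhmm}:00")


reviews = [Review("orange", "APPROVED", at("10:00"))]

# The step was entered at 09:55, before the approval: the approval counts.
before_recovery = has_fresh_approval(reviews, at("09:55"))
# Recovery re-armed the reference time to 10:30: the same approval is now invisible.
after_recovery = has_fresh_approval(reviews, at("10:30"))
```

Nothing on GitHub changed between the two calls. Only the reference time moved, and that alone flips the answer.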

The nasty part: it's probabilistic. If recovery re-entry happens before the approval, nothing goes wrong. Only workflows where re-entry lands after the approval silently freeze. On Spot-heavy infrastructure, re-entries are frequent enough that these accumulated over days, and only surfaced as "zero dones."

How we actually diagnosed it

Getting from "Autopilot is down" to the root cause came down to one habit: don't look at the endpoint, look at the distribution.

Watching only the done count (the endpoint), everything looks uniformly stuck
Aggregating in-flight workflows by step showed 28 abnormally parked at wait_review
Cross-referencing their PRs: all approved, CI green, unmerged

At that point "the engine can't see approvals" is established, and the rest is just reading the poll code. Counting where the bodies pile up beats hunting for a dead component — because there was no dead component. A live process faithfully executing a wrong rule will never show up on a health check.
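The step-level aggregation is nothing fancy. A minimal sketch, assuming each workflow record exposes a `step` and a `status` field (both names are hypothetical here):

```python
from collections import Counter


def backlog_by_step(workflows):
    """Count in-flight workflows per step; finished ones are ignored."""
    return Counter(w["step"] for w in workflows if w["status"] == "in_flight")


# Made-up records standing in for rows from the workflow table.
workflows = [
    {"id": 1, "step": "implement", "status": "in_flight"},
    {"id": 2, "step": "wait_ci", "status": "in_flight"},
    {"id": 3, "step": "wait_review", "status": "in_flight"},
    {"id": 4, "step": "wait_review", "status": "in_flight"},
    {"id": 5, "step": "wait_review", "status": "in_flight"},
    {"id": 6, "step": "merge_pr", "status": "done"},
]

backlog = backlog_by_step(workflows)
worst_step, parked = backlog.most_common(1)[0]
```

One step holding far more items than its neighbors is the signal to go read that step's code.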

The fix: evaluate a snapshot, not events

The essence of the fix fits in one sentence: instead of asking "did a new review arrive?", compute "what is the effective review state right now?" on every poll.

# keep only each reviewer's latest review
latest_by_reviewer = {}
for r in sorted(pr_reviews, key=lambda r: r.submitted_at):
    latest_by_reviewer[r.user.login] = r

effective_states = [r.state for r in latest_by_reviewer.values()]

if "CHANGES_REQUESTED" in effective_states:
    hold()
elif "APPROVED" in effective_states:
    advance_to_merge()

This is literally the semantics GitHub's own UI uses: the review panel shows each reviewer's latest state, and when the approval happened plays no role in the effective verdict. Which lands on an unglamorous but solid conclusion: if you gate on an external system's state, adopt that system's own semantics.

What about the original incident (merging on a stale approval after changes were requested)? Per-reviewer-latest handles it for free: the reviewer who requested changes is in CHANGES_REQUESTED state, so an older APPROVE can never win. The freshness check survives only in its legitimate, narrow role — debouncing re-review after the same reviewer requested changes and new commits landed.
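To check that per-reviewer-latest really covers both cases, here is a self-contained sketch of the same predicate run against the standing-approval timeline and the original stale-approval incident. The `Review` type, the integer timestamps, and the `"wait"` fallback are simplifications added for the example.

```python
from dataclasses import dataclass


@dataclass
class Review:
    reviewer: str
    state: str
    submitted_at: int  # minutes since midnight, to keep the example short


def effective_verdict(reviews):
    """Reduce the full review list to each reviewer's latest state, then decide."""
    latest = {}
    for r in sorted(reviews, key=lambda r: r.submitted_at):
        latest[r.reviewer] = r  # later reviews overwrite earlier ones
    states = {r.state for r in latest.values()}
    if "CHANGES_REQUESTED" in states:
        return "hold"
    if "APPROVED" in states:
        return "merge"
    return "wait"


# Standing approval: approved at 10:00, no later review. Recovery timing is irrelevant.
standing = [Review("orange", "APPROVED", 600)]

# Original incident: approved at 10:00, then changes requested at 10:20.
stale = [
    Review("orange", "APPROVED", 600),
    Review("orange", "CHANGES_REQUESTED", 620),
]
```

The function takes no reference time at all, so there is nothing for a recovery path to re-arm.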

After deploying, the wait_review backlog dropped from 28 to 8, and 15 tasks completed in 25 minutes — three days of approved PRs draining the moment the predicate was fixed.

Wave two: failures must propagate

We thought that was the end. The next day, the same symptom came back — different cause: dependency chains.

Some tasks had been marked failed at wait_review during the bug window, and their downstream tasks sat blocked on those dependencies. The auto-recovery job only did "move blocked back to pending," while the scheduler's start condition remained "all dependencies done" — so tasks with a failed ancestor bounced between pending and blocked forever. A no-op retry loop.

The fix was to go fail-fast: when a dependency fails, don't leave descendants ambiguously blocked — explicitly mark them FAILED and propagate it downstream. A visible failure gets fixed with one action ("retry the failed ancestor"). An invisible blocked task never gets retried, because nobody knows it's waiting.
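A sketch of what fail-fast propagation can look like over a dependency graph. The status strings and the shape of `deps` (task mapped to the tasks it depends on) are assumptions for the example, not our scheduler's schema.

```python
from collections import deque


def propagate_failures(status: dict[str, str], deps: dict[str, list[str]]) -> dict[str, str]:
    """Return a copy of status where every descendant of a failed task is failed."""
    # Invert "task -> its dependencies" into "task -> tasks that depend on it".
    children: dict[str, list[str]] = {}
    for task, parents in deps.items():
        for parent in parents:
            children.setdefault(parent, []).append(task)

    result = dict(status)
    queue = deque(t for t, s in status.items() if s == "failed")
    while queue:
        task = queue.popleft()
        for child in children.get(task, []):
            if result.get(child) not in ("done", "failed"):
                result[child] = "failed"
                queue.append(child)
    return result


# a failed; b waits on a; c waits on b; d is unrelated.
status = {"a": "failed", "b": "blocked", "c": "blocked", "d": "pending"}
deps = {"b": ["a"], "c": ["b"], "d": []}
after = propagate_failures(status, deps)
```

After this pass the whole chain shows up as failed, so retrying `a` is the one visible action that unblocks everything behind it.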

Takeaways

These generalize to any system that polls external state at a rendezvous point, AI agents or not:

Evaluate external state as a snapshot, not as events. Gating on "anything newer than time T?" means one mistake in managing T permanently drops facts that are already true (standing approvals). Recomputing "the effective state right now" is idempotent and recovery-proof.
Re-arming a timestamp means erasing every fact before it. If retry/recovery/re-entry paths casually set *_initiated_at = now(), everything established before that instant vanishes behind the filter. Re-arm only when the premise actually changed — e.g., new commits landed.
Yesterday's safeguard is today's prime suspect. The freshness check was a correct response to a real incident — its scope was just too broad. When you add a defense, also write down what stops working if this defense misfires; you'll check it first next time something stalls.
Diagnose "it's stuck" by step-level backlog distribution, not endpoint metrics. All-green health checks can't detect a live process that's wrong on purpose. Counting how many items are parked at each rendezvous point finds it in minutes.
Propagate failures. A dependency failure left as "blocked" is an unbounded silent wait. Fail-fast propagation turns recovery into a single retry.
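For the re-arming rule in particular, a guard can be as small as comparing the PR head commit. This is a hedged sketch: `WaitReviewState`, `head_sha`, and `reenter_wait_review` are names invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class WaitReviewState:
    review_initiated_at: datetime
    head_sha: str  # PR head commit recorded when the wait started


def reenter_wait_review(state: WaitReviewState, current_head_sha: str, now: datetime) -> WaitReviewState:
    """Re-arm the reference time only if new commits landed since the wait started."""
    if current_head_sha == state.head_sha:
        return state  # recovery or restart: the premise is unchanged, keep the timestamp
    return WaitReviewState(review_initiated_at=now, head_sha=current_head_sha)


started = WaitReviewState(datetime(2026, 6, 29, 9, 55), "abc123")
recovered = reenter_wait_review(started, "abc123", datetime(2026, 6, 29, 10, 30))
new_commits = reenter_wait_review(started, "def456", datetime(2026, 6, 29, 10, 30))
```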

If your autopilot includes auto-merge, "three days of zero dones" isn't an inconvenience — it's a production outage for your development line. Don't trust the green health checks; put rendezvous-point backlogs on your dashboard. That's the monitoring this incident permanently added to ours.

"When AI reviews AI's code, you've built an infinite loop. Here's how we stopped it."

Takayuki Kawazoe — Thu, 25 Jun 2026 07:35:28 +0000

While building an AI code review product (Orange Codens), we ran into one design problem that dominated everything else: when the reviewer is AI and the fixer is AI, you can very easily build an infinite loop.

Walk it through. Orange reviews a PR and finds a problem. It doesn't stop at the comment — it hands the fix off to another Codens (Purple / Red) to produce a fix PR. But that fix PR is also a PR. Orange reviews it. New code, new findings. Hand off again. Fix. Review again.

A human reviewer stops at "good enough." AI doesn't, and every loop costs LLM tokens. So "it terminates" and "cost is bounded" have to be guaranteed by the architecture, not by how smart the model is. Here's what we wired into Orange, with the real code and the actual non-convergence incidents we hit.

The structure that loops

Built naively, here's how it spins:

PR open
  → Orange review → finding
    → handoff → Purple/Red opens a fix PR
      → that fix PR is open
        → Orange review → new finding
          → handoff → … (forever)

There isn't one place to cut it. "When do we hand off", "whose PRs are eligible", "do we dispatch the same finding twice", "when does the review itself converge" — each needs its own mechanism. In order:

Cut 1: handoff fires at merge, not when a finding is created

This is the big one. We don't dispatch a fix task the moment we find something. We dispatch only when the PR is merged.

The webhook handler:

_REVIEWED_ACTIONS = {"opened", "synchronize", "reopened"}

async def execute(self, payload):
    action = payload.get("action", "")
    pr = payload.get("pull_request") or {}

    if action == "closed":
        # the merge-gated handoff trigger point
        merged = bool(pr.get("merged"))
        return WebhookResult(
            outcome="handoff",
            handoff_repository_id=repository.repository_id,
            handoff_pr_number=int(pr.get("number", 0)),
            handoff_merged=merged,
            handoff_pr_author=str((pr.get("user") or {}).get("login", "")),
        )
    ...

Before merge, it only posts GitHub suggestions. It wakes Purple/Red only on the closed event with merged=true. And a PR closed without merging discards its queued handoffs:

if not merged:
    # the code never lands on main, so drop queued handoffs
    discarded = 0
    for f in open_findings:
        if f.handoff_queued_at is not None:
            f.handoff_queued_at = None
            await finding_repo.update(f)
            discarded += 1
    return {"merged": False, "discarded": discarded}

What changes: an unmerged PR has zero downstream cost. Experimental PRs, drafts, rejected PRs — however many findings they collect, not a single fix task fires unless the code reaches main. Reviews can run all they want, but handoff (the operation that spends money and creates new PRs) always passes through merge — a human (or an explicit gate) decision.

Cut 2: PRs authored by a bot are excluded from auto-handoff

This is the direct break in the loop. Fix PRs are opened by purple-codens[bot] or red-codens[bot]. A PR authored by one of those bots is excluded from auto-handoff.

_CODENS_BOT_LOGINS = {"purple-codens[bot]", "red-codens[bot]", "orange-codens[bot]"}

def _should_handoff(self, finding, policy, is_bot_pr):
    # a human-queued finding dispatches regardless of mode/bot
    if finding.handoff_queued_at is not None:
        return True
    if is_bot_pr:
        return False  # auto-dispatching on a bot PR is the loop
    if policy.mode == HandoffMode.FULL_AUTO:
        return True
    if policy.mode == HandoffMode.THRESHOLD_AUTO:
        threshold = policy.auto_threshold_severity or "high"
        meets_severity = finding.severity.rank >= Severity(threshold).rank
        in_category = (
            not policy.auto_categories or finding.category.value in policy.auto_categories
        )
        return meets_severity and in_category
    return False

Orange still reviews a bot's fix PR (we want to confirm quality), but it does not auto-spawn the next fix task from it. That's the fundamental cut.

Note the first branch. A handoff_queued_at finding (one a human manually queued with "fix this") dispatches even on a bot PR, even when mode is off. Explicit human intent overrides the loop guard. Automatic chaining is stopped, but a human saying "this particular finding on this bot PR is worth fixing" gets through. Automatic vs. manual cleanly splits the safe default from the escape hatch.
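To make the branch order concrete, here is a stripped-down, self-contained version of the same decision. The enum values, the four-level severity scale, and the plain-string fields are stand-ins chosen for the example, and the category filter is left out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class HandoffMode(Enum):
    OFF = "off"
    THRESHOLD_AUTO = "threshold_auto"
    FULL_AUTO = "full_auto"


# Assumed severity scale, for illustration only.
SEVERITY_RANK = {"low": 1, "medium": 2, "high": 3, "critical": 4}


@dataclass
class Finding:
    severity: str
    handoff_queued_at: Optional[str] = None  # set when a human queued the fix


def should_handoff(finding: Finding, mode: HandoffMode, is_bot_pr: bool, threshold: str = "high") -> bool:
    if finding.handoff_queued_at is not None:
        return True   # explicit human intent wins over every guard
    if is_bot_pr:
        return False  # never auto-chain from a bot's fix PR
    if mode is HandoffMode.FULL_AUTO:
        return True
    if mode is HandoffMode.THRESHOLD_AUTO:
        return SEVERITY_RANK[finding.severity] >= SEVERITY_RANK[threshold]
    return False


critical_on_bot_pr = should_handoff(Finding("critical"), HandoffMode.FULL_AUTO, is_bot_pr=True)
queued_on_bot_pr = should_handoff(
    Finding("low", handoff_queued_at="2026-06-25T07:00:00Z"), HandoffMode.OFF, is_bot_pr=True
)
```

Even a critical finding in full-auto mode stays put on a bot PR, while a low-severity finding that a human queued goes through with the mode switched off.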

Cut 3: findings from verify runs aren't handed off

Closing another loop path. Purple runs its own verify cycle (implement → test → fix). Findings from Orange reviewing those runs are excluded from handoff.

# exclude findings from purple_verify runs (loop guard)
purple_verify_run_ids = {
    r.review_run_id for r in runs if r.triggered_by == TriggeredBy.PURPLE_VERIFY
}

eligible = [
    f for f in open_findings
    if f.review_run_id not in purple_verify_run_ids
    and f.handed_off_task_id is None          # no double-dispatch (idempotency)
    and self._should_handoff(f, policy, is_bot_pr)
]

handed_off_task_id is None matters too: a finding handed off once is never dispatched again. It prevents the quiet divergence of two fix tasks spawning for the same problem.

Incident 1: findings in the same file deadlocked each other's fix PRs

Now the part where it was built to spec and still failed to converge.

In an E2E run, app/api/notes/search.py had 3 findings (SQL injection, a hardcoded AWS key, and one more). Dispatched naively as "one task per finding," each opened a separate fix PR.

Here's the trap. When the SQLi fix PR tries to merge, Orange reviews it and raises a carry-over REQUEST_CHANGES: "the AWS-key finding in the same file is still unresolved." The AWS-key fix PR is blocked for the same reason as long as the SQLi remains. All three fix PRs became unmergeable, each blocked by the others' unfixed findings.

The fix: at handoff time, coalesce findings of "same file × same target" into one task.

# coalesce by (target_codens, file_path). Multiple findings in the same file
# going to the same service become ONE task so the fix PR resolves them
# together. Otherwise each finding's fix PR is blocked by orange's carry-over
# review of the other unfixed findings in that file.
groups: dict[tuple, list[Finding]] = {}
singles: list[Finding] = []
for f in eligible:
    target = self._target_for(f, policy)
    if target == TargetCodens.PURPLE and f.file_path:
        groups.setdefault((target, f.file_path), []).append(f)
    else:
        singles.append(f)

Fix one file's problems together, in one PR. Obvious in hindsight, but "parallelize naively per finding" puts the reviewer (Orange itself) in a position to block its own fixes with its own other findings — a self-deadlock. Get the unit of parallelism wrong and review and fix jam against each other.
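Here is the coalescing step as a runnable sketch, continued through to the task list. The `Finding` fields, the string targets, and the finding ids are placeholders (the third finding in the incident is just called "third" here); the real code works on richer objects.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    id: str
    target: str     # which service fixes it: "purple" or "red"
    file_path: str  # empty when the finding is not tied to a file


def build_tasks(findings):
    """Group purple findings by file into one task each; everything else stays single."""
    groups = defaultdict(list)
    singles = []
    for f in findings:
        if f.target == "purple" and f.file_path:
            groups[(f.target, f.file_path)].append(f)
        else:
            singles.append(f)
    tasks = [[f.id for f in group] for group in groups.values()]
    tasks += [[f.id] for f in singles]
    return tasks


findings = [
    Finding("sqli", "purple", "app/api/notes/search.py"),
    Finding("aws-key", "purple", "app/api/notes/search.py"),
    Finding("third", "purple", "app/api/notes/search.py"),
    Finding("prod-error", "red", ""),
]
tasks = build_tasks(findings)
```

Three findings in one file become one task and therefore one fix PR, so no carry-over from a sibling PR can block it.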

Incident 2: new findings on new code keep the review from converging

Another one. When Orange reviewed a bot's fix PR, it kept raising new high-severity findings on the newly written code each round (missing tests, style, etc.) — 4 consecutive REQUEST_CHANGES rounds, then escalation at the cap (max_review_iterations=5). Non-convergence.

The fix changed the decisive-review logic for bot fix PRs:

carry-over (unresolved findings from the previous review) at severity ≥ high → block (confirm they were actually fixed)
new findings block only if they're blockers (the fix introduced a new critical problem)
new findings below critical are recorded inline / in the body but the PR is APPROVED

Behavior on human PRs (COMMENT only, never auto-block) is unchanged. Only for bot fix PRs did we make the judgment favor convergence.
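The three rules reduce to a small decision function. A sketch under assumptions: findings are represented by their severity strings only, the severity scale is the hypothetical four-level one below, and "blocker" is read as "critical" per the rules above.

```python
# Assumed severity scale, for illustration only.
SEVERITY_RANK = {"low": 1, "medium": 2, "high": 3, "critical": 4}


def decide_bot_fix_review(carry_over: list[str], new_findings: list[str]) -> str:
    """Decide the review verdict for a bot fix PR from finding severities."""
    # Rule 1: an unresolved earlier finding at high or above still blocks.
    if any(SEVERITY_RANK[s] >= SEVERITY_RANK["high"] for s in carry_over):
        return "REQUEST_CHANGES"
    # Rule 2: a new finding blocks only if the fix introduced a critical problem.
    if any(s == "critical" for s in new_findings):
        return "REQUEST_CHANGES"
    # Rule 3: anything else is recorded elsewhere and the PR is approved.
    return "APPROVE"


nitpicks_only = decide_bot_fix_review(carry_over=[], new_findings=["high", "medium"])
still_unfixed = decide_bot_fix_review(carry_over=["high"], new_findings=[])
```

The first case is the one that used to spin: new high-severity remarks on fresh code no longer hold the PR open once the original problem is gone.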

The point: a perfectionist reviewer never converges. If every round raises a new minor finding and treats it as blocking, the PR never closes. Anchor on "was the previous serious finding resolved", record new minor findings, and let them through. Design the review bar so that the review can stop.

Takeaway: termination is guaranteed by the architecture, not the model

When AI is both author and reviewer, the architecture's job is to guarantee termination and bound cost. No amount of model intelligence guarantees that. The wiring does.

The five cuts in Orange:

Handoff fires at merge, not at finding-time (unmerged PRs have zero downstream cost)
Bot-authored PRs are excluded from auto-handoff (the direct loop cut)
Findings from verify runs aren't handed off
A handed-off finding is never dispatched twice (idempotency)
A human-queued finding crosses the guard (escape hatch)

And two we added after hitting them:

Findings in the same file coalesce into one task (avoid self-deadlock)
Bot fix PR review blocks only on carry-over high + new blockers (favor convergence)

Run "AI writes code" in production and you always hit "who reviews it." If you let AI review it too, you have to design the stopping mechanism as part of the deal — otherwise it fixes forever, the cost diverges, or it deadlocks itself.

Orange Codens builds all of this into the product.

https://www.codens.ai/en/

"How a headless CLI logs in: implementing OAuth Device Code Flow for an MCP client"

Takayuki Kawazoe — Tue, 09 Jun 2026 07:42:51 +0000

When you connect an MCP server to your own service, one unglamorous problem shows up fast: how does the CLI log in?

A web app with a browser can use the OAuth authorization code flow — redirect the user to a login page, exchange the returned code for a token. But MCP clients often run where there's no GUI browser: over SSH, in a CI container, on a headless box. The loopback trick (http://localhost:random_port as the redirect target) doesn't help either, because there's no browser to open.

OAuth has a proper answer for "authenticate a user where there's no browser": RFC 8628, the Device Authorization Grant, a.k.a. Device Code Flow. I implemented it in Codens' Auth service, so here's the design and the real code.

The idea: separate where you authenticate from where you approve

Device Code Flow splits the "device that shows a code" (the CLI) from the "device that approves" (your everyday browser). It's the same thing as logging into Netflix on a TV: a code appears on screen, you type it on your phone.

The flow:

1. The CLI calls /oauth/device/authorize and gets back a device_code (the machine's secret) and a user_code (a short code a human types).
2. The CLI shows the user "open this URL and enter ABCD-EFGH", then starts polling /oauth/device/token in the background.
3. The user opens the verification page in their normal browser, already logged in, enters the user_code, and approves.
4. The moment it's approved, the CLI's poll receives the token.

The CLI never opens a browser. The user approves from whatever browser they already have — phone, another laptop, anything.
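From the CLI's side, the polling half of that flow might look like the sketch below. It is not the Codens MCP client: the `post` callable is an injected stand-in for whatever HTTP library you use (it takes a URL and form fields and returns a status code plus parsed JSON), and the `slow_down` branch follows RFC 8628 even though the server shown in this post never sends it.

```python
import time
from typing import Callable

DEVICE_GRANT = "urn:ietf:params:oauth:grant-type:device_code"


def poll_for_token(
    post: Callable[[str, dict], tuple[int, dict]],
    token_url: str,
    client_id: str,
    device_code: str,
    interval: int,
    expires_in: int,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll the token endpoint until the user approves, denies, or the code expires."""
    waited = 0
    while waited < expires_in:
        sleep(interval)  # wait first: the user needs time to open the page anyway
        waited += interval
        status, body = post(token_url, {
            "grant_type": DEVICE_GRANT,
            "device_code": device_code,
            "client_id": client_id,
        })
        if status == 200:
            return body  # token response
        error = body.get("error")
        if error == "authorization_pending":
            continue  # not approved yet, keep the same interval
        if error == "slow_down":
            interval += 5  # RFC 8628: add 5 seconds for all later requests
            continue
        # access_denied, expired_token, and anything unexpected end the flow.
        raise RuntimeError(f"device flow failed: {error}")
    raise TimeoutError("device code expired before approval")
```

Passing `interval` and `expires_in` straight from the authorize response keeps the server in charge of pacing, and injecting `sleep` makes the loop testable without real waiting.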

Endpoint 1: device/authorize

The CLI calls this first. It takes a client_id and scope and issues the two codes.

@router.post("/device/authorize", response_model=DeviceAuthorizationResponse)
async def device_authorize(
    client_id: str = Form(...),
    scope: str = Form("openid profile email"),
    session: AsyncSession = Depends(get_session),
):
    # Is this client allowed to use the device_code grant?
    client = await OAuthClientRepository(session).get_by_client_id(client_id)
    if not client or not client.is_active:
        return JSONResponse(status_code=400,
            content={"error": "invalid_client", "error_description": "Unknown client"})

    allowed_grants = (client.grant_types or "").split()
    if _DEVICE_GRANT_TYPE not in allowed_grants:
        return JSONResponse(status_code=400,
            content={"error": "unauthorized_client",
                     "error_description": "Client not authorized for device_code grant"})

    store = get_device_code_store()
    result = await store.create(client_id=client_id, scope=scope)
    frontend_url = settings.FRONTEND_URL.rstrip("/")

    return DeviceAuthorizationResponse(
        device_code=result["device_code"],
        user_code=result["user_code"],
        verification_uri=f"{frontend_url}/device",
        expires_in=result["expires_in"],  # 900s
        interval=result["interval"],       # 5s poll interval
    )

_DEVICE_GRANT_TYPE is the RFC's canonical string urn:ietf:params:oauth:grant-type:device_code. If the client's grant_types doesn't include it, reject. Not everyone gets device flow — only clients that explicitly opt in.

Returning interval (5s) and expires_in (900s) matters: per RFC, the server dictates the poll interval and expiry and tells the client. Don't let the client hardcode them.

Making the two codes

device_code and user_code play different roles, so build them differently.

# device_code: a secret the machine holds. Just needs to be unguessable.
device_code = secrets.token_urlsafe(32)

# user_code: typed by a human. Readability comes first.
_USER_CODE_CHARS = "ABCDEFGHJKMNPQRSTUVWXYZ23456789"  # drop confusable 0/O/1/I/L

def _generate_user_code() -> str:
    left = "".join(secrets.choice(_USER_CODE_CHARS) for _ in range(4))
    right = "".join(secrets.choice(_USER_CODE_CHARS) for _ in range(4))
    return f"{left}-{right}"  # ABCD-EFGH

device_code gets token_urlsafe(32) — if it leaks, someone can grab the token, so entropy wins here.

user_code is typed by hand, so drop the confusable characters (0/O, 1/I/L) from the alphabet. The ABCD-EFGH hyphenated shape makes typos easier to spot. It's a small security-for-UX trade, and it's fine: the user_code is only used by an already-logged-in user to approve — it's not the token.
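To put a number on that trade, this is back-of-the-envelope arithmetic and not part of the service: the 31-character alphabet over 8 positions gives roughly 8.5 × 10^11 possible user codes, a little under 40 bits, against 256 bits of randomness behind the device_code.

```python
import math

alphabet = "ABCDEFGHJKMNPQRSTUVWXYZ23456789"

# Number of distinct user codes: 8 independent picks from the alphabet.
user_code_space = len(alphabet) ** 8

# The same quantity expressed in bits of entropy.
user_code_bits = 8 * math.log2(len(alphabet))

# secrets.token_urlsafe(32) encodes 32 random bytes, i.e. 256 bits.
device_code_bits = 32 * 8
```

The short code is only acceptable because it lives for 15 minutes and can only be redeemed by a logged-in user.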

Storage: two Redis keys

State lives in Redis: a primary key from device_code to the state, and an index from user_code to device_code.

_CODE_PREFIX = "device:code:"            # device:code:{device_code} -> state JSON (primary)
_USER_CODE_PREFIX = "device:user_code:"  # device:user_code:{user_code} -> device_code (index)
_DEVICE_CODE_TTL = 900                    # 15 min, matches the MCP client timeout

async def create(self, client_id: str, scope: str) -> dict:
    device_code = secrets.token_urlsafe(32)
    user_code = _generate_user_code()
    now = time.time()  # epoch seconds; authorize() compares expires_at against time.time()

    data = {
        "device_code": device_code, "user_code": user_code,
        "client_id": client_id, "scope": scope,
        "status": "pending", "user_id": None,
        "created_at": now, "expires_at": now + _DEVICE_CODE_TTL,
    }

    # Both keys, same TTL, one round trip via pipeline.
    pipe = client.pipeline()
    pipe.set(f"{_CODE_PREFIX}{device_code}", json.dumps(data), ex=_DEVICE_CODE_TTL)
    pipe.set(f"{_USER_CODE_PREFIX}{user_code}", device_code, ex=_DEVICE_CODE_TTL)
    await pipe.execute()

    # device_authorize reads these four fields from the result.
    return {
        "device_code": device_code, "user_code": user_code,
        "expires_in": _DEVICE_CODE_TTL, "interval": 5,
    }

Why two keys? Polling arrives by device_code (that's what the CLI holds). Approval arrives by user_code (that's what the user types). You need to look up from both directions, so you keep a separate index. Put the same TTL on both and they expire together after 15 minutes — no cleanup job to write. That's the Redis TTL paying off.

Normalize when looking up by user_code, because it's human input — it'll arrive lowercase or without the hyphen.

async def get_by_user_code(self, user_code: str):
    user_code = user_code.upper().strip()
    if len(user_code) == 8 and "-" not in user_code:
        user_code = f"{user_code[:4]}-{user_code[4:]}"  # ABCDEFGH -> ABCD-EFGH
    device_code = await client.get(f"{_USER_CODE_PREFIX}{user_code}")
    ...

abcdefgh and ABCD-EFGH both work. Being strict here causes "it's correct but rejected" UX bugs, so be lenient on input.
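The normalization on its own, pulled out as a pure function so it is easy to unit-test. The name `normalize_user_code` is made up for this sketch; the logic is the same as in get_by_user_code above.

```python
def normalize_user_code(raw: str) -> str:
    """Uppercase, trim, and re-insert the hyphen if the user left it out."""
    code = raw.upper().strip()
    if len(code) == 8 and "-" not in code:
        code = f"{code[:4]}-{code[4:]}"  # ABCDEFGH -> ABCD-EFGH
    return code


typed_lowercase = normalize_user_code("abcdefgh")
typed_with_padding = normalize_user_code("  abcd-efgh ")
typed_with_space = normalize_user_code("abcd efgh")
```

Note the last case: a space in the middle is not rewritten by this logic, so "abcd efgh" stays "ABCD EFGH" and would miss the index. Whether to accept that form too is a product decision.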

Endpoint 2: device/token (the poll target)

The CLI hits this every few seconds. It returns different answers for "not yet", "denied", "expired", and "here you go".

@router.post("/device/token")
async def device_token(
    grant_type: str = Form(...),
    device_code: str = Form(...),
    client_id: str = Form(...),
    session: AsyncSession = Depends(get_session),
):
    if grant_type != _DEVICE_GRANT_TYPE:
        return JSONResponse(status_code=400, content={"error": "unsupported_grant_type"})

    store = get_device_code_store()  # bound once; the delete() calls below reuse it
    data = await store.get_by_device_code(device_code)
    if data is None:
        return JSONResponse(status_code=400, content={"error": "expired_token"})
    if data["client_id"] != client_id:
        return JSONResponse(status_code=400, content={"error": "invalid_client"})

    if data["status"] == "pending":
        return JSONResponse(status_code=400, content={"error": "authorization_pending"})
    if data["status"] == "denied":
        await store.delete(device_code)
        return JSONResponse(status_code=400, content={"error": "access_denied"})

    if data["status"] == "authorized":
        # Issue tokens through the same path as the authorization_code flow
        ...
        await store.delete(device_code)  # one-time use
        return JSONResponse(status_code=200, content=token_response.to_dict(),
            headers={"Cache-Control": "no-store", "Pragma": "no-cache"})

These error strings are defined by RFC 8628 — don't invent your own. In particular, authorization_pending means "the user just hasn't approved yet, this isn't an error, keep polling at the same interval", and any decent client library will quietly wait on it. On access_denied, delete the device_code immediately — no reason to keep a rejected code alive.

When authorized, issue the token through the same TokenGenerator as the authorization_code flow. Device flow doesn't change what's in the token: hash the refresh token into the DB, add an id_token if the openid scope is present — the normal path. Then delete the device_code to guarantee one-time use. You can't redeem the same device_code twice.

Don't forget Cache-Control: no-store on the token response. A token cached by a proxy or browser is an incident waiting to happen.

Endpoint 3: device/verify (the human approval side)

Called from the verification page (/device). This is the one endpoint that assumes a logged-in user, so current_user is required.

@router.post("/device/verify")
async def device_verify(body: DeviceVerifyRequest, current_user: CurrentUser):
    store = get_device_code_store()
    data = await store.get_by_user_code(body.user_code)
    if data is None:
        raise HTTPException(404, "Invalid or expired code")
    if data["status"] != "pending":
        raise HTTPException(400, "This code has already been used")

    if body.action == "approve":
        await store.authorize(body.user_code, str(current_user.id))
        return {"status": "authorized", "client_id": data["client_id"]}
    else:
        await store.deny(body.user_code)
        return {"status": "denied"}

This is the crucial split. The CLI (holding the device_code) receives the token, but who the token is issued for is decided by the user logged into this browser. store.authorize binds current_user.id to the user_code.

async def authorize(self, user_code: str, user_id: str) -> bool:
    data = await self.get_by_user_code(user_code)
    if data is None or data["status"] != "pending":
        return False  # block double-approval / expiry
    # Check the original deadline before mutating anything:
    # approving must never extend the code's lifetime.
    remaining = int(data["expires_at"] - time.time())
    if remaining <= 0:
        return False
    data["status"] = "authorized"
    data["user_id"] = user_id
    # client is the store's Redis client, defined outside this excerpt
    await client.set(f"{_CODE_PREFIX}{data['device_code']}", json.dumps(data), ex=remaining)
    return True

The status != "pending" check stops an already-approved or denied code from being approved again. The state machine is one-directional only: pending → authorized / pending → denied. Recomputing the remaining TTL and re-setting with ex=remaining means approving doesn't extend the lifetime — the code still dies at the original 15-minute mark.
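To make the one-directional state machine and the fixed lifetime easier to poke at, here is a toy in-memory stand-in for the store. It is a sketch under assumptions: the class name, the create method and the injectable clock are all illustrative, and a real Redis-backed version would also need the read-then-write to be atomic.

```python
import time


class InMemoryDeviceStore:
    """Toy stand-in for the Redis-backed store; every name here is illustrative."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._by_user_code = {}

    def create(self, user_code: str, ttl: int) -> None:
        self._by_user_code[user_code] = {
            "status": "pending",
            "user_id": None,
            "expires_at": self._clock() + ttl,
        }

    def _transition(self, user_code: str, new_status: str, user_id=None) -> bool:
        data = self._by_user_code.get(user_code)
        if data is None or data["status"] != "pending":
            return False  # unknown code, or already authorized / denied
        if data["expires_at"] - self._clock() <= 0:
            return False  # expired: approval must not revive the code
        data["status"] = new_status
        data["user_id"] = user_id
        # expires_at is left untouched, so the lifetime never extends
        return True

    def authorize(self, user_code: str, user_id: str) -> bool:
        return self._transition(user_code, "authorized", user_id)

    def deny(self, user_code: str) -> bool:
        return self._transition(user_code, "denied")
```

Both transitions go through the same gate, which is what keeps pending as the only state anything can leave.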

Register it in OIDC discovery

Finally, add device_authorization_endpoint to .well-known/openid-configuration so an RFC 8628-aware client library can discover the endpoint automatically.

# well_known.py
"device_authorization_endpoint": f"{base_url}/oauth/device/authorize",

And add device_code to the client's (Codens MCP's) grant_types (RFC 8628 registers the full grant type name as urn:ietf:params:oauth:grant-type:device_code). It only works once both server and client support it.

Takeaways

Device Code Flow looks niche — "authenticate a user without a browser" — but it shows up a lot: MCP, CLI tools, IoT, TV apps. The implementation points that matter:

Build the two codes by role: device_code is a machine secret (high entropy), user_code is human input (readable, confusable chars removed).
Two Redis keys (primary + index) plus a TTL makes expiry cleanup structurally unnecessary.
The state machine starts at pending and is one-directional; approval happens on a separate endpoint by a logged-in user; tokens are one-time use.
Follow the RFC for error strings and let the server drive interval / expires_in.
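As a rough illustration of the first takeaway, the two codes can be generated like this. The alphabet, the 8-character length and the dash are my assumptions for the sketch, not necessarily what the implementation described here uses.

```python
import secrets

# Assumed alphabet: uppercase letters and digits minus look-alikes (0/O, 1/I/L).
USER_CODE_ALPHABET = "ABCDEFGHJKMNPQRSTUVWXYZ23456789"


def new_device_code() -> str:
    # Machine secret: 32 random bytes, URL-safe encoded, never shown to a human.
    return secrets.token_urlsafe(32)


def new_user_code(length: int = 8) -> str:
    # Human input: short, typed by hand, split with a dash for readability.
    chars = [secrets.choice(USER_CODE_ALPHABET) for _ in range(length)]
    half = length // 2
    return "".join(chars[:half]) + "-" + "".join(chars[half:])
```

The secrets module is used for both because even the short user_code has to be unguessable within its lifetime.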

Anyone building a tool that connects MCP to their own service will hit "how do I log in headless" eventually. Hope this is a useful starting point.

Codens builds all of this auth machinery into the product.

https://www.codens.ai/en/

"Autonomous coding agents don't break in the middle, they break at the seams"

Takayuki Kawazoe — Mon, 08 Jun 2026 00:19:45 +0000

After running AI coding agents in production for a while, one thing became clear: the failures aren't in the code the model writes. They're at the seams — git, CI, auth, the network. The boundaries with the outside world.

The model itself is genuinely capable. It writes functions, writes tests, refactors. What breaks is everything around the work: pushing the result, waiting on CI, merging the PR, refreshing a token, calling another service. And the failures are often the kind a human would avoid without thinking.

Here are five incidents we hit and fixed in Codens' Purple (the orchestration core) over the last few weeks. All real, with production task IDs and dates. Every fix is merged. There's a shared design lesson at the end that ties them together.

Incident 1: a half-resolved merge nearly flooded a PR with 12,000 lines

This was the scary one.

A Purple task on opsguide-back opened a PR. I looked inside: +12,162 lines / 149 files changed, with literal <<<<<<< markers in 2 of them. The commit graph:

e567ce67 (merge commit, "chore: Fix HYBRID_SEARCH...")
 ├ parent[0] = 0b069e5d  (develop tip, +1468 commits over main)
 └ parent[1] = 2940de35  (the actual feature commit)

What happened: in the fix step, the AI decided to git merge develop to backport some test fixes. The merge conflicted. The AI resolved it partially and drove git commit through anyway with markers still in the tree. What got pushed: develop's entire divergence plus unresolved conflict markers. If anyone had clicked merge, main would have been polluted by 1468 commits of develop drift in one shot.

A human wouldn't do this. They wouldn't merge develop into a main-targeted PR in the first place, and if it conflicted they wouldn't commit until it was fully resolved. But the AI, optimizing locally to get one test passing, does it without hesitation.

Fix: stop it at push time, in two layers

A single git pre-push hook. This is where the AI's git push actually goes, so this is where the guard belongs.

#!/bin/bash
set -u

# Layer 1: conflict-marker scan (always on, no config)
scan_conflict_markers() {
    local hits
    # Match markers at column 0 followed by a space, so we don't
    # false-positive on "=======" markdown rules or ASCII art.
    hits=$(git grep -lE '^(<<<<<<< |======= |>>>>>>> )' HEAD 2>/dev/null || true)
    if [ -n "$hits" ]; then
        echo "purple-pre-push: ABORT — committed files contain conflict markers:" >&2
        echo "$hits" | sed 's/^/  /' >&2
        return 1
    fi
    return 0
}

scan_conflict_markers || exit 1

The key is scanning the committed tree (HEAD). The working directory may have been cleaned up, but markers that made it into a commit stay. HEAD is what's about to be pushed, so that's what you git grep.

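The regex ^(<<<<<<< |======= |>>>>>>> ) matters for precision: a bare ======= line shows up in Markdown all the time (setext heading underlines, for one), so the pattern only matches a marker at the start of a line followed by a space. One consequence worth knowing: git writes the middle separator of a conflict as a bare ======= with nothing after it, so the ======= alternative does not match that line. Detection rests on the <<<<<<< and >>>>>>> lines, which carry a label after the space and normally accompany the separator.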
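The same pattern is easy to try out in isolation. A small Python sketch, using the hook's regex unchanged (the function name is mine):

```python
import re

# Same shape as the hook's pattern: marker at column 0, then a space.
CONFLICT_MARKER_RE = re.compile(r"^(<<<<<<< |======= |>>>>>>> )", re.MULTILINE)


def has_conflict_markers(text: str) -> bool:
    """True if any line starts with a conflict marker followed by a space."""
    return CONFLICT_MARKER_RE.search(text) is not None
```

A file with a Markdown heading underlined by a bare ======= does not match, while a file containing a line such as "<<<<<<< HEAD" does.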

Layer 2 is a merge-source allowlist, configurable per workflow. It only runs when a policy file is present:

# Layer 2: merge-source allowlist (only when a policy file exists)
# {
#   "feature": "feature/<task-id>",
#   "base":    "main",
#   "allowed": ["develop", "release/x"]
# }

For each new merge commit on the pushed ref, we check that every parent is reachable from feature / base / one of allowed, using git merge-base --is-ancestor. A merge from a disallowed source is rejected. Blank policy means no check — it's opt-in.

for p in $parents; do
    ok=0
    for rname in "${refs[@]}"; do
        if git merge-base --is-ancestor "$p" "$rname" 2>/dev/null; then
            ok=1; break
        fi
    done
    if [ "$ok" = "0" ]; then
        echo "ABORT — merge commit has parent not from allowed sources" >&2
        bad=1
    fi
done
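
For readers who think better in Python than in shell, the same allowlist logic can be sketched with the ancestry test injected. The function names are hypothetical; git_is_ancestor relies on git merge-base --is-ancestor exiting with status 0 when the first commit is an ancestor of the second.

```python
import subprocess
from typing import Callable, Iterable


def git_is_ancestor(commit: str, ref: str) -> bool:
    # Exit status 0 means commit is reachable from ref.
    result = subprocess.run(
        ["git", "merge-base", "--is-ancestor", commit, ref],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def disallowed_parents(
    parents: Iterable[str],
    allowed_refs: Iterable[str],
    is_ancestor: Callable[[str, str], bool] = git_is_ancestor,
) -> list[str]:
    """Return the merge parents that are not reachable from any allowed ref."""
    refs = list(allowed_refs)
    return [p for p in parents if not any(is_ancestor(p, r) for r in refs)]
```

A non-empty result means the merge pulled in history from a source the policy does not list.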

And the unglamorous-but-important part: fail-safe, in the fail-open sense. The guard blocks a push only when it positively detects a problem; if the hook itself has a bug and errors out, the push still proceeds. A guard bug stopping every workflow is worse than the occasional incident slipping through. Layer 1 is just git grep and git log (tiny surface area); layer 2 falls back to permissive if jq isn't available.
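The shape of that decision, sketched in Python rather than shell. The run_guard name and the logging callback are illustrative; the point is only that a deliberate "block" and an unexpected crash take different paths.

```python
from typing import Callable


def run_guard(check: Callable[[], bool], log: Callable[[str], None] = print) -> bool:
    """Return True if the push may proceed.

    check() returns True to allow and False to block. If check() itself
    raises, the guard fails open: the error is logged and the push proceeds.
    """
    try:
        return bool(check())
    except Exception as exc:  # a guard bug must not stop every workflow
        log(f"guard error, allowing push: {exc!r}")
        return True
```

Logging the swallowed error matters here, since otherwise a broken guard would silently stop guarding.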

Incident 2: a transient network blip misclassified as a permanent failure

A task routed through a self-hosted model gateway (vLLM behind Cloudflare) died with exit 1 after ~27 minutes of work:

API Error: The socket connection was closed unexpectedly.

The gateway was healthy at session start and recovered immediately after: GET /health returned 200 in 0.56s by the time I looked. So it was a momentary mid-session disconnect, the same Cloudflare-fronted overload pattern that already drives the 524 retry path, just surfacing as a closed socket from Node's fetch.

The problem: the existing retry regex covered 524 / origin_response_timeout / connection reset / Too Many Requests but had no entry for the closed-socket case. So the task was classified "non-transient error (exit=1), not retrying," and the whole step got escalated to Slack to wait for a human re-dispatch.

Fix: add the patterns, and trust that false positives are cheap

# Patterns we treat as transient (safe to cleanly retry)
_GW_TRANSIENT_RE = re.compile(
    r"524|origin_response_timeout|Too Many Requests|"
    r"connection reset|"
    # added:
    r"socket connection was closed|socket hang up|ECONNRESET|fetch failed",
    re.IGNORECASE,
)

Added to both classifier sites: the shell-side retry loop in the per-job container, and the Python-side clean-retry detector in the workflow engine.

Here's the core idea of this whole post:

Adding to the transient list is always safe. A false positive (treating a real permanent failure as transient) only wastes a 30–90s backoff. The AI is idempotent over the same prompt, so state isn't corrupted. A false negative (treating a real transient as permanent) escalates to Slack and stops a human.

False positives are cheap, false negatives are expensive. So bias the classifier toward transient. This asymmetry holds for job systems in general, but it's sharper for agents: each run is tens of minutes plus inference cost, so the unit cost of a human escalation is unusually high.
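Put together, the bias looks something like the sketch below. The regex is the one shown above; run_with_retry, its return values and the fixed 30-second backoff are assumptions made for illustration, not the workflow engine's real interface.

```python
import re
import time

TRANSIENT_RE = re.compile(
    r"524|origin_response_timeout|Too Many Requests|connection reset|"
    r"socket connection was closed|socket hang up|ECONNRESET|fetch failed",
    re.IGNORECASE,
)


def is_transient(error_text: str) -> bool:
    return TRANSIENT_RE.search(error_text) is not None


def run_with_retry(job, max_attempts=3, backoff_seconds=30.0, sleep=time.sleep) -> str:
    """job() returns (exit_code, output). Retry only while the failure looks transient."""
    for attempt in range(1, max_attempts + 1):
        exit_code, output = job()
        if exit_code == 0:
            return "ok"
        if not is_transient(output):
            return "escalate"  # looks permanent: hand it to a human
        if attempt < max_attempts:
            sleep(backoff_seconds)  # a wrong "transient" guess costs only this wait
    return "escalate"  # retries exhausted
```

A false positive costs a few sleeps before the same escalation happens anyway; a false negative skips straight to the human.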

Incident 3: merging before a late-registering CI check even appears

wait_ci built its "required checks" list from the checks observable when the PR opened.

But opsguide-back's test job builds a Docker image first, so it registers ~3 minutes after the PR opens. It wasn't in the PR-open snapshot. So wait_ci passed early without waiting for it, and the downstream merge_pr hit:

Repository rules blocked merge: 405
Required status check "test" is failing.

Actual timeline (2026-05-20, opsguide-back #11284):

04:25:25  wait_ci starts  required_checks=[check-develop-only-files,
          export-and-check, format-check, lint, check-single-head]  ← no test
04:28     test job starts
04:33     test FAILED  ← but wait_ci had already moved on to merge

Fix: treat branch protection, not the observed snapshot, as the source of truth

GitHub branch protection has required_status_checks — the canonical list GitHub actually gates the merge on. Read that instead of a snapshot.

def get_required_status_check_contexts(repo, branch):
    # Read branch protection's required_status_checks.
    # Return [] on 404/403 so unprotected branches / missing perms
    # fall back to current behaviour.
    ...

Union those contexts into wait_ci's required checks with strict=True. Strict mode already waits for a required check that hasn't appeared yet (returns waiting() when the run is None/incomplete), so the late test now gets waited for, evaluated, and a failure routes to fix instead of slipping through to a 405 at merge.
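The union-plus-strict behaviour can be sketched as a pure function. The name, the three return strings and the shape of observed are assumptions for illustration; the real wait_ci is not shown in this post.

```python
from typing import Optional


def evaluate_required_checks(
    observed: dict[str, Optional[str]],
    snapshot_required: list[str],
    protection_required: list[str],
) -> str:
    """Return "waiting", "failed", or "passed".

    observed maps check name -> conclusion ("success" / "failure"), or None
    while the run is still in progress. A required check that is absent from
    observed has not registered yet and counts as waiting (strict mode).
    """
    required = sorted(set(snapshot_required) | set(protection_required))
    conclusions = [observed.get(name) for name in required]
    if any(c == "failure" for c in conclusions):
        return "failed"
    if any(c is None for c in conclusions):
        return "waiting"
    return "passed"
```

With the incident's timeline, the snapshot alone reports passed at 04:25, while the union with branch protection keeps waiting until test has registered and finished.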

The lesson: don't let "what I can see right now" be the system's truth. In a world where CI checks register asynchronously, an observation snapshot is always going to be stale. Read the gate definition itself.

Incident 4: the singular form of a fast check re-queued at merge time

This one is a single regex, but the kind that eats an afternoon.

Two tasks failed at merge-pr with:

Repository rules blocked merge: 405
Required status check "check-branch-name" is expected.

_is_ci_pending_error only matched the plural wording "N of M required status checks are expected" — i.e. "are expected". When exactly one required check is incomplete, GitHub uses the singular Required status check "X" is expected. That fell straight through the pending detector into a hard failure.

Why does a check wait_ci saw green re-queue at merge time? check-branch-name is a fast check, and merge_pr recomputes the merge base right before merging. GitHub re-evaluates branch protection against the new head and briefly reports the fast check as "expected" again until it re-reports success. The bounded retry loop was built for exactly this window — it just wasn't being entered for the single-check case.

Fix

# Route both singular "is expected" and plural "are expected" into retry
if "expected" in error_message.lower():
    return True  # treat as CI-pending; back off and retry

Match the bare token "expected". The only GitHub merge-block messages containing "expected" are these pending-check wordings, so widening the match can't misclassify a genuine policy rejection (required signed commits, etc.) as transient — covered by the existing regression test.
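A before-and-after comparison makes the gap concrete. The old matcher below is my approximation of "only matched the plural wording", and the signed-commits message is illustrative wording for a genuine policy rejection, not copied from a real response.

```python
def is_ci_pending_old(error_message: str) -> bool:
    # Approximation of the previous behaviour: plural wording only.
    return "are expected" in error_message.lower()


def is_ci_pending(error_message: str) -> bool:
    # Widened match: covers both "is expected" and "are expected".
    return "expected" in error_message.lower()


SINGULAR = 'Required status check "check-branch-name" is expected.'
PLURAL = "2 of 5 required status checks are expected."
POLICY = "Commits must have verified signatures."  # illustrative non-pending message
```

Only the singular message changes classification between the two functions; the policy rejection stays a hard failure under both.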

Unglamorous, but singular-vs-plural wording and 3-minute delays are the actual things that stop autonomous agents.

Incident 5: a borrowed token already expired the moment it arrived

Codens' per-task workers run on borrowed shared OAuth credentials. The refreshToken is intentionally stripped: if it weren't, each worker's CLI would refresh independently and rotate the shared OAuth identity, cascading 401s across siblings.

src.token_refresh: token is expired — attempting refresh before job start
src.token_refresh: Token refresh failed: no refreshToken in credentials file
POST /jobs HTTP/1.1 500 Internal Server Error

So the borrower can't refresh on its own. If the accessToken it receives is already past expiresAt at the moment of receipt, the worker's pre-job check dies on "no refreshToken" and POST /jobs returns 500.

Fix: the source that holds the refreshToken refreshes before returning

The root cause: the source credential service's GET /claude-auth returned the stored credentials as-is, expiry included. The only place that holds the canonical refreshToken is the source, so refresh there before returning.

async def get_claude_auth():
    # Refresh before returning. No-op when >5 min of life remains,
    # so the common case adds zero round-trips.
    await run_in_thread(ensure_valid_token)
    return load_credentials()

ensure_valid_token does nothing if the token has more than 5 minutes left, so it's free in the common case. Only when it's under threshold does the one place with the refreshToken (the source) refresh, write the new token, and then return it.
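A minimal sketch of that threshold logic, with the refresh call and the clock injected. The signature differs from the real ensure_valid_token, which is not shown here; treating expiresAt as epoch seconds and the refresh callable are both assumptions.

```python
import time
from typing import Callable

REFRESH_THRESHOLD_SECONDS = 5 * 60


def ensure_valid_token(
    credentials: dict,
    refresh: Callable[[str], dict],
    now: Callable[[], float] = time.time,
) -> dict:
    """Refresh only when five minutes or less of life remain.

    credentials holds accessToken, refreshToken and expiresAt (epoch seconds
    in this sketch). refresh(refresh_token) is a hypothetical callable that
    returns a new credentials dict.
    """
    remaining = credentials["expiresAt"] - now()
    if remaining > REFRESH_THRESHOLD_SECONDS:
        return credentials  # common case: no round-trip
    return refresh(credentials["refreshToken"])


def credentials_for_borrower(credentials: dict) -> dict:
    # Borrowers never receive the refresh token.
    return {k: v for k, v in credentials.items() if k != "refreshToken"}
```

The order matters: refresh at the source first, strip the refreshToken second, so the borrower always starts with a token that has at least the threshold left.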

The naive design — "the borrower refreshes" — didn't match the architectural constraint of a shared identity. Only one party can refresh. So that party does it before returning.

Three principles that recur

Line up all five and the fixes share a shape.

1. Autonomous agents break at the seams. Not in output quality, but at the boundaries: git, CI, auth, network. So the fixes point at hardening boundaries, not at smarter models. A pre-push hook, a source of truth for CI gates, the right party to refresh a token — all classic systems design, all unrelated to the model.

2. False positives and false negatives have asymmetric cost. A misclassified transient costs a tens-of-seconds backoff; a missed one costs a human escalation. Agent runs are long and expensive, so the cost of stopping a human is unusually high. Bias classifiers toward retry. Idempotency makes that safe.

3. Guards must be fail-safe. A safety mechanism's own bug must not stop the main flow. The pre-push hook lets the push proceed if the hook itself errors unexpectedly. We weigh "every workflow stops" heavier than "an incident occasionally slips through."

The more you let the AI do, the more these not-in-the-middle details pay off. Smarter models don't make git conflict markers disappear, don't make CI checks register synchronously, don't stop tokens from expiring. Keeping an agent running in production turned out to be the work of closing these seams, one at a time.

Codens builds all of this into the product. Take a look if you're interested.

https://www.codens.ai/en/

"Why we told our AI plan generator to never split tests into a separate sub-task"

Takayuki Kawazoe — Tue, 26 May 2026 08:24:33 +0000

The run was marked failed. Two of the three sub-tasks merged cleanly. The third one, titled "Add tests for is_sent=True treated as read in test_inbox_service_unread_propagation.py", never finished. CI retried up to the cap, all failures, then gave up. The whole plan was thrown out even though two thirds of the actual code had already landed on green branches.

The fix turned out to be one paragraph in one prompt. Not a code change in the dispatcher. Not a new CI flag. Just a rule that says: if a sub-task introduces or modifies code, the unit tests for that code go in the same sub-task. The "tests as their own task" pattern is forbidden.

Here is what I observed, why the AI reached for the wrong decomposition, and the exact prompt rule that closed the gap.

What actually happened

Codens Purple has what I call a plan generator. That is the part of the system that takes one PRD or bug report and breaks it into sub-tasks. Each sub-task then gets dispatched on its own Git branch, runs in parallel with the others, and merges back to the base when its CI goes green. The piece of the plan generator that actually does the splitting is driven by what we internally call the analyze prompt, which is just the system prompt the model sees when it decides "how should this work be carved up."

On a project called opsguide-back, for one bug, the plan generator produced this triple:

1. Add tests for is_sent=True treated as read in
   test_inbox_service_unread_propagation.py
2. Fix _store_messages_batch in inbox_service.py to mark
   self-sent messages as read
3. Add sender_email exclusion to _build_activity_unread_count
   in resolver.py

If you read that as a human reviewer, it looks great. Three clean concerns, easy to review independently, no overlap in files touched. Textbook parallelization.

It died anyway. Sub-tasks 2 and 3 both finished and merged. Sub-task 1, the test-only one, kept failing CI. Its branch contained only changes to the test file. The implementation functions it was asserting against did not exist on that branch yet, because the implementation lived on a sibling branch that this branch could not see. pytest collected the test, tried to import the helpers, and the asserted behaviour was simply not present. Retry, retry, retry, give up. Run failed.

The cruel part is that if the merge order had happened to put the test branch last, after both impl branches had landed, the test would have passed. But we cannot guarantee that order. Each sub-task races on its own.

Why the AI did this

This was not a model failure. The model did exactly what every general-purpose decomposition heuristic would tell you to do. Split tests from implementation so they can move in parallel. That is correct advice for a human team, where the reviewer and the merge queue keep the order honest, and where a developer can rebase a test PR onto the impl PR before merging.

The thing the model did not know is that our dispatch system runs each sub-task on its own isolated branch. Each sub-task sees the base branch plus its own changes, and nothing else. Sibling sub-tasks' work is invisible to it until merge time. That is not a universal fact about software development. It is a property of how we, specifically, run parallel agents. Nothing in the model's training corpus tells it that this constraint applies, because most of the corpus is about human teams.

So the model reached for the most-cited decomposition pattern it knew, which happens to be wrong for our dispatcher. The mistake lived in the prompt. We had been asking the model to plan parallel work without telling it the actual rules of "parallel" in our system.

This is the general shape of a lot of AI agent failures I have hit. The agent is not bad at reasoning. It is reasoning correctly in the wrong universe, because the prompt forgot to describe the universe.

The fix

We added this block to the analyze prompt. It is the only change.

## CRITICAL: Tests live with their implementation

NEVER split tests for new behaviour into a separate sub-task. Every sub-task
that introduces or modifies code MUST also add the unit tests for that code
in the SAME sub-task. The pattern "Sub-task A: implement X / Sub-task B:
add tests for X" is FORBIDDEN.

Title heuristic: if you are about to write a sub-task title that starts
with "Add tests for ..." or "Write tests for ...", STOP and merge it
into the impl sub-task whose code it tests.

Two things are doing the work here. The first is the explicit "FORBIDDEN" framing. The second, which I think matters more in practice, is the title heuristic. The model writes the title before it writes the body. If we can get it to catch itself at the title stage, the bad plan never gets generated in the first place, so we do not have to rely on a later pass to repair it.

We also rewrote the few-shot examples in the same prompt. Before, the example impl sub-task's ## Steps section only listed source-code file edits. After, every example impl sub-task lists the implementation file edit and the test file edit side by side. Roughly:

 ## Steps
 1. Edit src/inbox_service.py: in _store_messages_batch,
    set is_read=True when message.sender_email == account_owner_email.
+2. Edit tests/test_inbox_service_unread_propagation.py:
+   add unit test asserting is_sent=True self-messages count
+   as read.

That tiny diff is the part that changes behaviour. Models pattern-match very strongly on few-shot examples. If every example shows tests bundled with impl, the model produces the same shape.

Since the rule went in, the plan generator has stopped emitting "Add tests for ..." sub-tasks on new behaviour. The test-only failure mode is gone.

The exception

There is one shape of test-only sub-task that is still fine. If we are backfilling a regression test for code that is already on the base branch, the test-only sub-task is allowed. The reason is symmetrical to the original failure: when the implementation already exists on main, a test-only branch has everything it needs to compile, import, and assert. pytest finds the function, the test runs, CI passes.

The prompt calls that out explicitly so the model does not over-apply the new rule and start refusing legitimate backfill work. The line in the prompt is roughly "the rule is about new behaviour introduced in this plan, not about all test-only sub-tasks ever."
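
The rule itself lives in the prompt, and that is the whole fix described here. If you also wanted a mechanical check on generated plans, the title heuristic translates directly into a few lines. This is a hypothetical add-on, not something the post's system is stated to run, and it cannot tell a legitimate backfill from a test split off new behaviour, so it only flags titles for a second look.

```python
import re

# Mirrors the prompt's title heuristic: "Add tests for ..." / "Write tests for ..."
TEST_ONLY_TITLE_RE = re.compile(r"^\s*(add|write) tests for\b", re.IGNORECASE)


def flag_test_only_subtasks(titles: list[str]) -> list[str]:
    """Return the sub-task titles that look test-only."""
    return [t for t in titles if TEST_ONLY_TITLE_RE.search(t)]
```

Run against the three titles from the failed plan, it flags only the first one.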

Generalizing

The bigger lesson is that AI agents reach for human-team decompositions by default, and that is fine when your dispatch system also behaves like a human team. Most agent dispatch systems do not. Ours runs sub-tasks on isolated branches with no cross-visibility. Some teams run agents in long-lived shared worktrees. Some serialize. Each of these creates its own invisible constraint on what can and cannot be split.

The agent does not know which one you have. It cannot infer it from the codebase, because none of those constraints are encoded in the code. They live in the dispatcher.

So the work, when you start letting an agent plan parallel sub-tasks, is to spend prompt tokens drawing the line between what can be split and what cannot. For us that line was: tests for new code live with the new code. For someone else it might be: never split a migration from the code that depends on it. Or: never split a config change from the deployment that consumes it. The shape of the rule depends entirely on your dispatcher, not on the model.

The pattern I would suggest is to add a single "CRITICAL" section to the planning prompt that enumerates the constraints your dispatcher imposes. Use a title-stage heuristic so the model self-rejects bad plans before generating the body. Rewrite the few-shot examples to demonstrate the right shape, because that is what the model actually copies.

We rebuild Codens with Codens. Every prompt rule like this one came from watching a real run fail and adding the one sentence that would have prevented it. If you want to see how the parallel planner works end to end, the English landing page is at https://www.codens.ai/en/.

"Why your Playwright screenshots show □□□ for Japanese / Chinese / Korean text, and the 3-line Dockerfile fix"

Takayuki Kawazoe — Mon, 25 May 2026 06:44:17 +0000

I opened the screenshot artifact for our codens.ai landing page smoke test and the page was full of square boxes. Where the Japanese hero copy should have been, there was a row of □□□□□. Where the feature names were, more boxes. The nav looked like an ancient artifact from a half-decoded file.

The page itself was fine. I had the dev server open in another tab and the Japanese rendered perfectly. The problem was inside the Playwright container.

Three lines in the Dockerfile fixed it:

    fonts-noto-cjk \
    fonts-noto-cjk-extra \
    fonts-noto-color-emoji \

That is the entire fix. If you only came for the answer, you can close the tab now. If you want to know why this happens and where else it will bite you, keep reading.

What is actually happening

Many Playwright containers, especially ones built on slim base images, only have Latin fonts installed; it is worth checking what yours actually ships. In our case it was fonts-liberation plus fonts-dejavu-core. That is enough to render English, most European languages, basic punctuation, and not much else.

When Chromium tries to paint a character it has no glyph for, it does the only thing it can do. It draws the missing-glyph placeholder, which on most systems is that hollow rectangle people call a tofu box. The character code is correct. The DOM is correct. The page is correct. The screenshot rendering side just has no shape to draw.

This is the part that confuses people the first time. The browser is not broken. The test is not broken. The page is not broken. The container does not have the font installed, so when the screenshot is composited there is nothing to fill the box with.

You can verify this in two seconds. Open a shell in the container (docker exec works; there is usually no SSH daemon to connect to), run fc-list | grep -i cjk, and you will see an empty result. That is the whole story.
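
If you would rather have the check run as part of a build or a test setup step, the same question can be asked per language. A small sketch, assuming fontconfig's fc-list is on the PATH (subprocess raises FileNotFoundError otherwise); the function names and the default language list are mine, and the :lang= pattern asks fontconfig for fonts that declare coverage of that language.

```python
import subprocess


def fonts_for_language(lang: str, run=subprocess.run) -> list[str]:
    """List installed fonts that fontconfig reports as covering a language tag."""
    result = run(["fc-list", f":lang={lang}"], capture_output=True, text=True, check=False)
    return [line for line in result.stdout.splitlines() if line.strip()]


def missing_languages(langs=("ja", "zh-cn", "ko"), run=subprocess.run) -> list[str]:
    """Return the language tags for which no installed font was found."""
    return [lang for lang in langs if not fonts_for_language(lang, run=run)]
```

An empty list from missing_languages() means every listed language has at least one font; anything else names the scripts that will render as boxes.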

The fix

Three apt packages, added to whatever RUN apt-get install block already exists in your Dockerfile.

Before:

RUN apt-get update && apt-get install -y \
    fonts-liberation \
    fonts-dejavu-core \
    && rm -rf /var/lib/apt/lists/*

After:

RUN apt-get update && apt-get install -y \
    fonts-liberation \
    fonts-dejavu-core \
    fonts-noto-cjk \
    fonts-noto-cjk-extra \
    fonts-noto-color-emoji \
    && rm -rf /var/lib/apt/lists/*

What each one buys you:

fonts-noto-cjk is the main package. It covers Japanese kana, the Han characters used in Japanese and in both Simplified and Traditional Chinese, and Korean Hangul. This is the one that fixes most of the boxes.
fonts-noto-cjk-extra covers the long tail of weights. On Debian and Ubuntu the base package carries only the regular and bold weights, and the extra package adds the remaining ones, so text styled light, medium or black renders in the intended face instead of a substitute. Worth including because the cost is small and you do not want to debug one odd-looking heading later.
fonts-noto-color-emoji is the one people forget. If your page has any emoji, you will get tofu for those too. Most modern marketing pages have at least a checkmark or a sparkle somewhere.

Image size impact is about 70 MB on a Debian or Ubuntu base. CJK font files are large because there are tens of thousands of glyphs. If you are squeezing every megabyte you can use the smaller variable-weight subset, but for a CI image used by a test runner the 70 MB is irrelevant.

I shipped this in commit 40422650 for Codens Blue, our QA agent. Rebuilt the image, reran the same smoke test, and the screenshot came out with actual readable Japanese.

Why you only notice after the fact

This is the annoying part. Nothing in your test suite tells you the screenshot is broken.

Unit tests pass. The page renders correctly when a human visits it. The Playwright test reports green because the test only checks that the page loaded and the screenshot was saved. CI is happy. The artifact thumbnail in the GitHub Actions UI is tiny and you cannot tell tofu from text at that size.

You notice when someone opens the screenshot to share it. A designer asks for the latest LP screenshot to compare against a Figma mock. A stakeholder pulls a screenshot for a Slack thread. A regression alert fires and you open the diff. That is when the boxes show up and someone asks why the page is full of squares.

You can technically assert against tofu rendering inside the test. Sample a region that should contain CJK text, check that not every pixel in that region is identical white, fail if it looks suspiciously uniform. I have seen people do this. The implementation cost almost never beats the cost of just installing the fonts once. Three lines of Dockerfile beats a hundred lines of pixel sampling logic.

The same trap is everywhere

Playwright is just the messenger. Anything that wraps a headless Chromium in a Docker container has this problem if the base image lacks CJK fonts.

Puppeteer, pyppeteer, playwright-python, Selenium with headless Chrome, any custom screenshot service built on chrome-launcher, server-side rendering pipelines that use headless Chrome to generate Open Graph images. Same root cause every time. Same fix every time.

If your product touches any audience outside Latin script, default to installing the CJK and emoji fonts in your base image. Treat it as part of the container setup, not as a thing you wait to hit. The cost is 70 MB and three lines. The cost of not doing it is some future Slack message that says "why is the page full of boxes" and then an afternoon of confused debugging.

Wrap

That is the whole thing. Three apt packages, one rebuild, done. If you are running Codens Blue or any other screenshot-based QA flow against a multilingual page, this is the first place to look when boxes appear.

If you want to see the actual landing page these screenshots are taken from, it lives at https://www.codens.ai/en/.

"Adding Cursor Composer 2.5 as a third executor lane: 10x cheaper than Opus at comparable scores, but smoke tells a different story"

Takayuki Kawazoe — Mon, 25 May 2026 00:00:19 +0000

A roughly tenfold per-task cost drop at comparable accuracy is one of those numbers you do not get to ignore for very long. Composer 2.5 published SWE-Bench Multilingual figures in the same neighborhood as Opus, and the per-attempt API cost is about an order of magnitude lower. For an agent harness that runs hundreds of attempts per project per week, a 10x cost compression on a viable lane reshapes the unit economics enough to justify a real integration, not just a spike.

So I shipped Composer 2.5 as a third executor lane in Codens Purple, the orchestration service that decides which model runs each task. Codens was already running two lanes side by side: Claude via the raw Anthropic API and a self-hosted Qwen deployment. The third lane went in over two days, May 23-24, across a Phase 1 skeleton commit, a Phase 2 SDK wire, an ECS Fargate task definition change, an IAM credential isolation fix, and a one-project canary toggle.

Then I ran a smoke pass. 16 failed out of 25 attempts across v4 through v17. The integration works. The benchmark numbers are not the production numbers. This is the writeup of both halves: what shipped, and what the smoke phase actually told me.

Why a third lane at all

The case for a third lane is the same case I made earlier this year for the per-model retry cap pattern. Each model has its own failure shape and its own cost curve. Pinning the whole harness to one provider means inheriting one bill, one rate-limit policy, and one definition of "the model got it wrong."

Composer 2.5 changes the cost arithmetic in a way that matters at our retry caps. Codens retries each task per model up to a cap: claude=3, qwen=6, composer-2.5=5 for now. At cap=3 with Opus, the worst-case attempt cost dominates the per-task budget. At cap=3 with Composer 2.5 at roughly 1/10 the per-attempt rate and comparable accuracy, the worst-case attempt cost drops by roughly an order of magnitude even before factoring in higher-than-Opus first-pass success. That math is what made integration time worth spending.
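
The arithmetic is simple enough to write down. In this sketch costs are normalised so one Opus attempt is 1.0 and one Composer 2.5 attempt is 0.1, taken from the "roughly 1/10" figure above; these are relative units, not real prices.

```python
def worst_case_cost(cap: int, cost_per_attempt: float) -> float:
    """Upper bound on spend for one task: every attempt up to the cap is used."""
    return cap * cost_per_attempt


OPUS_COST = 1.0      # normalised unit, not a real price
COMPOSER_COST = 0.1  # "roughly 1/10" of the Opus per-attempt rate

# Like-for-like comparison at cap=3.
same_cap = worst_case_cost(3, COMPOSER_COST) / worst_case_cost(3, OPUS_COST)

# Comparison at the caps actually configured: composer-2.5=5 vs claude=3.
configured = worst_case_cost(5, COMPOSER_COST) / worst_case_cost(3, OPUS_COST)
```

At equal caps the worst case is about a tenth of the Opus figure. At the configured caps it is about a sixth, because the larger Composer cap gives back part of the saving.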

The optionality argument also got stronger recently. Anthropic clarified that the Agent SDK and claude -p CLI workflows are not covered by subscription plans for agent use cases, which validates the API-direct path Codens already runs on. Adding a Cursor lane on top of that is the same bet, extended: do not get pinned to any one vendor's pricing or policy, and keep the harness free to route tasks to whichever lane wins on cost and reliability for the workload at hand.

Executor lane design

The pleasant part of the design was that PurpleTask.execute_model already supported per-task model switching, and PurpleProject.default_model already let an entire project pin a model. Adding the third lane was not an architecture change. It was an enum value plus a new runner module.

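from sqlalchemy import Column, Enum

class PurpleTask(Base):
    # existing fields elided
    # Per-task model override; "composer-2.5" is the newly added value.
    execute_model = Column(
        Enum("opus", "sonnet", "qwen", "composer-2.5", name="execute_model"),
        nullable=True,
    )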

The runner dispatcher already had two branches: runner_claude.py for the Anthropic API path that wraps the claude -p CLI, and runner_qwen.py for the self-hosted endpoint. The third runner, runner_cursor.py, slots in next to those two with the same input contract (task spec, workspace dir, env) and the same output contract (workspace diff, structured result, failure_reason on non-zero).

I split the change into two commits on purpose. Phase 1 was a validation-only runner that exited non-zero on every invocation, plus the enum addition. Shippable in isolation, zero behavior change for existing tasks because nothing pointed at composer-2.5 yet. Phase 2 was the actual SDK call. Splitting like this means each commit can be reverted on its own, and the enum migration is not coupled to any SDK behavior question.

I have learned the hard way that bundling an enum addition with the runtime that depends on it produces commits you cannot cleanly revert when the runtime turns out to be the problem. Phase 1 / Phase 2 splits are cheap insurance.

Phase 1: the skeleton

Phase 1, commit 5a575031, did three things and nothing else. It added composer-2.5 to the model enum, registered runner_cursor.py in the dispatch table, and made the runner validate its inputs and exit non-zero with a clear "not yet implemented" failure_reason. The migration ran on staging. The dispatch table picked up the new entry. No production task pointed at the new lane, so the runner was never invoked in the live path.
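A validation-only runner of this kind might look roughly like the sketch below. The result type, the dispatch table name, and the validation rules are hypothetical stand-ins. Only the behavior (validate the inputs, then exit non-zero with a clear failure_reason) comes from the description above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class RunnerResult:
    exit_code: int
    failure_reason: Optional[str] = None


def run_cursor_skeleton(task_spec: dict, workspace_dir: str, env: dict) -> RunnerResult:
    """Phase 1 behavior: validate the inputs, then refuse to run."""
    if not task_spec.get("prompt"):
        return RunnerResult(2, "invalid task spec: missing prompt")
    if not workspace_dir:
        return RunnerResult(2, "invalid input: workspace_dir is empty")
    return RunnerResult(1, "composer-2.5 runner not yet implemented")


# Hypothetical dispatch table; the two existing lanes are elided.
RUNNERS: Dict[str, Callable[[dict, str, dict], RunnerResult]] = {
    "composer-2.5": run_cursor_skeleton,
}

result = RUNNERS["composer-2.5"]({"prompt": "fix the bug"}, "/workspace", {})
print(result.exit_code, result.failure_reason)
```

Because the skeleton shares the input and output contract of the real runners, Phase 2 only has to replace the function body.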

This is the kind of commit that looks like it does nothing and is actually doing the most important thing: proving the surrounding plumbing is correct before the new code can hide bugs in the plumbing. If Phase 2 had landed in one shot and the SDK call had failed, I would have spent the next hour trying to figure out whether the failure was in the dispatcher, the env wiring, the IAM role, or the SDK. With Phase 1 already in production for an hour, the only thing Phase 2 could break was the SDK call itself.

Phase 2: wiring the Cursor SDK

Phase 2, commit b1e7ebcd, is where the real work happened. The Cursor Python SDK exposes a session that walks Bridge → Client → Agent → events. The shape in the runner is:

bridge = await Bridge.launch(...)
client = Client(bridge=bridge)
agent = await client.agent.create(
    model=ModelSelection(id=model_id),
    local=LocalAgentOptions(cwd=workspace_dir),
)
run = await agent.send(prompt, SendOptions(...))
async for event in run.events():
    handle_event(event)

The local=LocalAgentOptions(cwd=workspace_dir) part matters: Cursor agents can run remotely or locally, and for Codens the workspace is already mounted into the Fargate task at a known path, so local-mode keeps the file IO inside the task and avoids round-tripping the diff over the wire. agent.send returns a run handle whose events() async iterator yields the structured event stream we already know how to consume from the Claude path. The translation layer in runner_cursor.py normalizes Cursor's event shapes to the internal event schema that the rest of Purple already speaks.
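A translation layer like this can be very small. The sketch below is illustrative only: the vendor event fields ("type", "message") and the internal shape ("source", "kind", "payload") are hypothetical and are not taken from the Cursor SDK or from Purple's actual schema.

```python
from typing import Any, Dict

# Hypothetical mapping from vendor event types to internal kinds.
KIND_BY_VENDOR_TYPE = {"tool_call": "tool", "text": "assistant_text", "error": "error"}


def normalize_event(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Translate one vendor event into the internal shape.

    Vendor fields are nested under "payload" instead of being merged into
    the top level, so a vendor field can never overwrite an internal one.
    """
    vendor_type = raw.get("type", "unknown")
    payload = {k: v for k, v in raw.items() if k != "type"}
    return {
        "source": "cursor",
        "kind": KIND_BY_VENDOR_TYPE.get(vendor_type, "unknown"),
        "payload": payload,
    }


event = normalize_event({"type": "text", "message": "editing auth.py"})
print(event["kind"], event["payload"])
```

Nesting the vendor fields is one way to keep same-named fields in the two schemas from colliding during translation.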

CURSOR_API_KEY is the obvious blocker. We store it in AWS Secrets Manager at purple-codens-prod/cursor-api-key and inject it into the per-task environment so the SDK picks it up automatically. The ECS Fargate task definition change in PR #1156 (commits d1ef5db4 and 656f42e4) exposes the secret ARN as an environment variable:

{
  "name": "CURSOR_API_KEY_SECRET_ARN",
  "value": "arn:aws:secretsmanager:ap-northeast-1:...:secret:purple-codens-prod/cursor-api-key"
}

The entrypoint script resolves it before launching the runner:

CURSOR_API_KEY=$(aws secretsmanager get-secret-value \
    --secret-id "$CURSOR_API_KEY_SECRET_ARN" \
    --query SecretString --output text)
export CURSOR_API_KEY
exec python -m purple.runner_cursor "$@"

This part is where I introduced a bug I want to flag specifically, because it is the kind of bug a multi-tenant SaaS should never ship. The initial commit pulled the secret using whatever AWS_PROFILE was active in the task environment, which in some code paths was inherited from the customer's connected AWS credentials. That is wrong in a multi-tenant harness. The fix in commit 6210a052 makes the entrypoint use the ECS task IAM role for the Secrets Manager call, never the customer's profile. Customer credentials are scoped to customer resources only. Platform credentials, including our Cursor API key, must resolve through the task role. Easy mistake, important fix.
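One way to get that isolation is to strip customer-supplied credential variables from the environment before making any platform-level AWS call, so the default credential chain falls through to the container (task role) credentials. The sketch below assumes that approach; the actual change in commit 6210a052 may be structured differently. The variable names are the standard AWS CLI/SDK ones.

```python
from typing import Dict, Mapping

# Standard AWS CLI/SDK variables that are consulted before the container
# (task role) credentials in the default credential chain.
CUSTOMER_CREDENTIAL_VARS = (
    "AWS_PROFILE",
    "AWS_DEFAULT_PROFILE",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_SESSION_TOKEN",
    "AWS_SHARED_CREDENTIALS_FILE",
    "AWS_CONFIG_FILE",
)


def platform_env(task_env: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of the environment with customer credentials removed."""
    return {k: v for k, v in task_env.items() if k not in CUSTOMER_CREDENTIAL_VARS}


task_env = {
    "AWS_PROFILE": "customer-connected",
    "AWS_CONTAINER_CREDENTIALS_RELATIVE_URI": "/v2/credentials/example",
    "PATH": "/usr/bin",
}
clean = platform_env(task_env)
print(sorted(clean))
```

The cleaned mapping would then be passed as the env argument of the subprocess that performs the Secrets Manager call, while the customer's variables stay available to the code that legitimately works on customer resources.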

The canary procedure

I do not trust new lanes in production until a real project has run on them for at least a day. The canary procedure (commit d6fe3cb3) is intentionally small: flip purple_projects.default_model = 'composer-2.5' on exactly one internal Corevice-org project, dogfood it, and watch the metrics. Every other project stays on whatever model they were already on, which means the canary is fully isolated.

The SQL is one row:

UPDATE purple_projects
SET default_model = 'composer-2.5'
WHERE id = '<internal-project-id>';

Rollback is the same statement with the prior value. No code deploy involved. This is one of the upsides of keeping model selection as runtime data rather than baking it into deploy artifacts: rollback is a transaction, not a release.
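The toggle-and-rollback pair can be wrapped so the prior value is captured at the moment of the flip. This sketch uses an in-memory SQLite table as a stand-in for the real database; the helper name and the sample project ids are hypothetical.

```python
import sqlite3


def set_default_model(conn: sqlite3.Connection, project_id: str, model: str) -> str:
    """Switch one project's default model and return the prior value for rollback."""
    with conn:  # commits on success, rolls back if an exception is raised
        row = conn.execute(
            "SELECT default_model FROM purple_projects WHERE id = ?", (project_id,)
        ).fetchone()
        if row is None:
            raise KeyError(f"unknown project: {project_id}")
        conn.execute(
            "UPDATE purple_projects SET default_model = ? WHERE id = ?",
            (model, project_id),
        )
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purple_projects (id TEXT PRIMARY KEY, default_model TEXT)")
conn.executemany(
    "INSERT INTO purple_projects VALUES (?, ?)",
    [("canary", "opus"), ("customer-a", "opus")],
)

previous = set_default_model(conn, "canary", "composer-2.5")  # canary on
set_default_model(conn, "canary", previous)                    # rollback
```

Returning the previous value means the rollback never depends on someone remembering what the project was pinned to before the canary started.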

The comparison axes we track on the canary versus the same project's last 30 days on Opus:

Completion rate (task finishes without exhausting retries)
Verify pass rate (Codens verify steps succeed against the final diff)
Wall time per task
Cost per completed task

The point of the canary is not to certify the lane is good. The point is to surface the failure modes that benchmarks do not surface, before any real customer touches the new lane.
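The four comparison axes reduce to a small aggregation. In the sketch below the record fields are hypothetical, the sample values are made up to mirror a 9-of-25 split, and two definitions are assumptions: verify pass rate is taken over completed tasks, and cost per completed task divides all spend, failed attempts included, by the number of completions.

```python
from dataclasses import dataclass
from statistics import median
from typing import List


@dataclass
class TaskOutcome:
    completed: bool        # finished without exhausting retries
    verify_passed: bool    # verify steps succeeded against the final diff
    wall_seconds: float
    cost_usd: float        # summed over every attempt, failed ones included


def summarize(outcomes: List[TaskOutcome]) -> dict:
    """Aggregate the four canary axes over a non-empty list of outcomes."""
    done = [o for o in outcomes if o.completed]
    total_cost = sum(o.cost_usd for o in outcomes)
    return {
        "completion_rate": len(done) / len(outcomes),
        "verify_pass_rate": sum(o.verify_passed for o in done) / len(done) if done else 0.0,
        "median_wall_seconds": median(o.wall_seconds for o in outcomes),
        "cost_per_completed_task": total_cost / len(done) if done else None,
    }


# Illustrative data only: 9 completions and 16 failures.
smoke = [TaskOutcome(True, True, 60.0, 0.1)] * 9 + [TaskOutcome(False, False, 120.0, 0.2)] * 16
stats = summarize(smoke)
print(stats)
```

Charging failed attempts to the completed tasks is what keeps a cheap-per-attempt lane from looking cheap when most of its attempts fail.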

What the smoke runs actually showed

Across v4 through v17, the smoke pass ran 25 attempts on the canary project. Nine finished. Sixteen failed. That is a 36% completion rate on a workload where the equivalent Opus runs were sitting around 80%+. The benchmark numbers and the production numbers were not the same numbers.

Two failure modes accounted for almost all of the misses.

The Cursor SDK bridge dropped mid-session on a handful of long-running tasks. When the bridge dropped, the workspace diff in progress was lost, the run handle errored, and the runner reported a generic SDK exception. Salvaging the partial diff at the moment the bridge dropped was the obvious fix. Commit 0f95f020 catches the bridge-drop exception, snapshots whatever is currently on disk in the workspace, and feeds that diff into the retry attempt's context so the next attempt does not start from zero.
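The salvage step can be sketched as follows. BridgeDropped is a hypothetical stand-in for whatever exception the SDK raises, and the snapshot here captures whole file contents for simplicity, where a real runner would more likely capture a diff against the starting state.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict, Optional


class BridgeDropped(Exception):
    """Hypothetical stand-in for the SDK's bridge-disconnect error."""


def snapshot_workspace(workspace: Path) -> Dict[str, str]:
    """Capture the current text content of every file under the workspace."""
    return {
        str(p.relative_to(workspace)): p.read_text(errors="replace")
        for p in sorted(workspace.rglob("*"))
        if p.is_file()
    }


def run_attempt(
    workspace: Path,
    attempt: Callable[[Optional[dict]], None],
    prior_context: Optional[dict] = None,
) -> Optional[dict]:
    """Run one attempt; on a bridge drop, return context for the next attempt."""
    try:
        attempt(prior_context)
        return None  # finished cleanly, nothing to carry over
    except BridgeDropped as exc:
        return {"salvaged_files": snapshot_workspace(workspace), "error": str(exc)}


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)

    def flaky_attempt(context: Optional[dict]) -> None:
        (workspace / "fix.py").write_text("x = 1\n")  # partial work lands on disk
        raise BridgeDropped("bridge closed mid-session")

    carried = run_attempt(workspace, flaky_attempt)

print(carried)
```

The returned dictionary is what the next attempt would receive as prior_context, so it starts from the partial work instead of from zero.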

The other failure mode was uglier. When a task exhausted its retry cap, the runner reported failure_reason = "exceeded max executions (5)" and that was it. The operator on the other side had no visibility into why each of those five attempts had failed. The fix in the same commit (0f95f020) enriches failure_reason with the last attempt's actual error string. Now when the cap is exhausted, the operator sees "exceeded max executions (5): last attempt failed with: <real error>" and can route the task to a different lane or escalate.
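The enrichment amounts to remembering the last error inside the retry loop. A minimal sketch follows; the function name and the convention that an attempt returns None on success or an error string on failure are assumptions made for illustration.

```python
from typing import Callable, Optional, Tuple


def execute_with_cap(
    attempt: Callable[[int], Optional[str]], cap: int
) -> Tuple[bool, Optional[str]]:
    """Run `attempt` up to `cap` times; it returns None on success or an error string."""
    last_error: Optional[str] = None
    for n in range(1, cap + 1):
        last_error = attempt(n)
        if last_error is None:
            return True, None
    reason = f"exceeded max executions ({cap})"
    if last_error:
        reason += f": last attempt failed with: {last_error}"
    return False, reason


ok, failure_reason = execute_with_cap(
    lambda n: f"verify step failed on attempt {n}", cap=5
)
print(failure_reason)
# exceeded max executions (5): last attempt failed with: verify step failed on attempt 5
```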

Two smaller fixes shipped alongside. Commit 1be0614f surfaces the AWS CLI failure when the Secrets Manager call fails. Previously the entrypoint swallowed it silently and the runner started with an empty CURSOR_API_KEY, producing an opaque 401 from the SDK three seconds later. Now the entrypoint exits non-zero with the AWS CLI error before the runner even starts. Commit 64af2b50 cleans up the per-task env injection and drops a message field collision between the Cursor event schema and our internal one that was causing some events to lose their payload during translation.
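The fail-fast behavior for the secret lookup can be expressed in Python as below. This is a sketch of the idea rather than the shell entrypoint itself: the stand-in commands use the local Python interpreter so the example runs without AWS access, and in the real entrypoint the command would be the aws secretsmanager get-secret-value call shown earlier.

```python
import subprocess
import sys
from typing import List


class SecretFetchError(RuntimeError):
    pass


def fetch_secret(command: List[str]) -> str:
    """Run a secret-fetching command; raise instead of returning an empty key."""
    proc = subprocess.run(command, capture_output=True, text=True)
    if proc.returncode != 0:
        raise SecretFetchError(
            f"secret fetch failed (exit {proc.returncode}): {proc.stderr.strip()}"
        )
    secret = proc.stdout.strip()
    if not secret:
        raise SecretFetchError("secret fetch returned an empty value")
    return secret


# Stand-in commands so the sketch runs without AWS access.
ok_cmd = [sys.executable, "-c", "print('sk-test')"]
bad_cmd = [sys.executable, "-c", "import sys; sys.stderr.write('AccessDenied'); sys.exit(254)"]

print(fetch_secret(ok_cmd))
try:
    fetch_secret(bad_cmd)
except SecretFetchError as exc:
    print(exc)
```

Raising before the runner starts replaces the opaque 401 with the underlying CLI error at the point where it actually happened.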

None of these fixes turn Composer 2.5 into a production-grade lane for our workload. They turn it into a lane I can operate, observe, and reason about while we keep iterating on it. The canary stays canary. Customer-facing projects stay on the lanes they were on.

Closing

Multi-lane executor architecture is a hedge, and like all hedges, the value shows up only when you actually need it. Composer 2.5 may or may not become a default-routing lane for Codens in the coming weeks. The 10x cost compression is real, the benchmark numbers are real, and the smoke phase is also real. The point of the canary procedure is that we get to find out which of those three numbers matters for our workload before any customer feels it.

The integration cost was a Phase 1 skeleton, a Phase 2 SDK wire, an ECS task definition change, an IAM fix, and a one-row SQL toggle. The integration value, regardless of whether Composer 2.5 sticks, is one more lane the harness can route through next time a pricing announcement or a model release reshapes the cost curve. That optionality is what an AI dev harness is supposed to give you.

Codens is at https://www.codens.ai/en/ if you want to see what a multi-lane harness for autonomous code repair and QA looks like in production.

"Centralizing billing across 5 products triggered a 403 nobody saw coming"

Takayuki Kawazoe — Sat, 23 May 2026 10:44:56 +0000

We flipped USE_BCP=true on Red at 14:02. The first 403 hit Sentry at 14:06. By 14:11 the pattern was clear: any user who tried to do something that touched org-level credit (granting a teammate access, viewing the org credit balance, kicking off a fix run under an org-scoped project) got a 403 back from the Red API, which had received a 403 from BCP, which had received a "not a member" from Auth.

Staging didn't catch it. I want to be honest about that part before anything else. Staging had two users in one org, both of whom had been provisioned by me through the Auth admin path months ago, so their org memberships existed in Auth's org_members table by accident of history. Every code path I exercised in staging happened to read from a row that was already there. The bug only fires when a user accepts an org invitation on the product side after the cutover, and we had no synthetic flow for that in staging. Lesson noted, expensive way to learn it.

This post is about what actually broke, why the design wasn't wrong (the implementation was missing), and the three branches I considered for where org-membership authority should live before settling on the one that produced the bug.

Phase H: why centralize billing now

Codens is five products plus two platform services. Red does auto-fix, Blue does QA, Green does PRDs, Yellow is the engineering activity ledger, Purple is the orchestration layer. Auth is the identity service. BCP, the Billing Control Plane, is the newest piece and the subject of this story.

Until last quarter, each product calculated its own credit consumption. That was fine when Red was the only product taking money. It became untenable around the time Green went into beta, because we had three different rounding rules, two slightly different definitions of "what counts as a billable run," and a support ticket pattern that boiled down to "my org's credit balance on Red doesn't match my org's credit balance on Blue and you charged me twice." Phase H of the architecture roadmap pulls all of that into BCP. Every product reads its credit policy from BCP, posts consumption events to BCP, and asks BCP "can this user/org afford this operation?" before starting work.

The cutover is gated behind two env vars per product:

USE_BCP=true
BCP_API_URL=https://api.billing.codens.ai

I cut one product at a time, starting with Red because it has the highest traffic and the most mature billing surface. Red PR #266 was the actual flip. Blue PR #233 and Green PR #411 followed once Red had been stable for a week. Yellow and Purple are scheduled for next quarter, both still on local credit math.

The cutover order matters for this story because the 403 only manifests on org-scoped operations. Red individual-account billing kept working perfectly. So did Blue and Green individual accounts. It was specifically the org-shared credit pool path that exploded, and only for users who had joined their org through the product-side invitation flow rather than through Auth's admin console.

Tracing the 403

The first instinct was "BCP is misconfigured." It wasn't. BCP logs showed clean inbound requests with the right org_id, the right user_id, the right requested operation. BCP then made an internal call to Auth: "is user X a member of org Y?" Auth returned false. BCP returned 403. Red returned 403. User saw 403.

The Auth log line was the clarifying one:

GET /internal/orgs/{org_id}/members/{user_id} -> 404

So Auth wasn't broken either. Auth was correctly reporting that user X was not a member of org Y, as far as Auth knew. I pulled the user out of the database. The user existed in Auth's users table. The org existed in Auth's organizations table. The link row in Auth's org_members was missing.

I went over to Red's database. The link row was there. Red had a row that said user X belonged to org Y, with the role and joined-at timestamp from the day the user accepted the invitation. Red had been authoritative for this relationship the entire time.

CDTSK-1392 captured the root cause. Auth Codens is supposed to be master of organizations and memberships, but each product had grown its own organizations and org_members tables back when each product was a standalone service. Invitation acceptance was handled locally by each product. The row landed in the product's database, and nobody told Auth. Pre-BCP, this didn't matter, because the product was the one authorizing org-scoped operations against its own tables. Post-BCP, BCP asks Auth, Auth doesn't know, 403.

The bug is not in the centralization. The bug is that we shipped centralization assuming a sync that didn't exist.
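The failure chain is easy to reproduce in miniature. The sketch below models the two membership tables as in-memory sets; the function names are hypothetical, and only the flow (the product holds the row, Auth does not, BCP asks Auth) is taken from the trace above.

```python
from typing import Set, Tuple

Membership = Tuple[str, str]  # (org_id, user_id)

red_org_members: Set[Membership] = {("org_y", "user_x")}  # written on invitation acceptance
auth_org_members: Set[Membership] = set()                 # nobody told Auth


def red_authorize_pre_bcp(org_id: str, user_id: str) -> int:
    """Before the cutover, Red checked its own table."""
    return 200 if (org_id, user_id) in red_org_members else 403


def auth_is_member(org_id: str, user_id: str) -> bool:
    return (org_id, user_id) in auth_org_members


def bcp_authorize(org_id: str, user_id: str) -> int:
    """After the cutover, BCP asks Auth, not the product."""
    return 200 if auth_is_member(org_id, user_id) else 403


print(red_authorize_pre_bcp("org_y", "user_x"))  # 200
print(bcp_authorize("org_y", "user_x"))          # 403

auth_org_members.add(("org_y", "user_x"))        # what a sync would have done
print(bcp_authorize("org_y", "user_x"))          # 200
```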

Three branches for where authority lives

Before writing the sync, I had to decide whether the sync was even the right answer. There are three reasonable places to put authority over org membership in a multi-product setup like ours.

Authority in the auth service. Auth is the master record. Every product holds a local cache (or a foreign-key shadow) and reflects changes back to Auth as they happen. This is what we have. It's the most conventional choice. The downside is the one we just discovered: every product-side write path that affects membership has to remember to call Auth, and forgetting is silent until something else (like BCP) starts depending on Auth being correct.

Authority in billing itself. BCP owns the org and member tables. Every product reads from BCP. This has the appeal of "the system that needs to know the truth owns the truth." It also means every product becomes hard-dependent on BCP being up to render a user's basic org context, which is a much bigger blast radius than billing being temporarily degraded. I didn't want every Red dashboard render to fail because BCP was deploying.

Authority distributed across products. Each product remains the source of truth for memberships that originate in that product. BCP, when asked to authorize an org-scoped operation, routes the membership question to whichever product owns the org. This sounds clever for two products. With five products, the routing table is a permanent piece of infrastructure that has to be updated every time a new product launches, and the question "who owns this org" is itself a piece of state that has to live somewhere central. You've reinvented the auth service, badly.

I chose branch one. The 403 wasn't evidence of a wrong choice. It was evidence that I'd shipped half of a choice. The half I shipped (BCP queries Auth) was correct. The half I hadn't shipped (products tell Auth about new memberships) was the gap.

The sync endpoint

The fix has two halves. Auth needs an endpoint that products can call. Products need to call it at the right moments.

On the Auth side, I added POST /api/v1/internal/organizations/{org_id}/members:upsert. The verb is upsert deliberately. The endpoint is idempotent and the products call it both on invitation acceptance and on role changes, so the handler has to be willing to create or update without the caller knowing which case applies. The response status differentiates: 201 if a new membership row was created, 200 if an existing row was updated.
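The create-or-update contract can be shown without any web framework. This is an in-memory sketch with hypothetical class and function names; the only parts taken from the text are the idempotent upsert and the 201-versus-200 distinction.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class UpsertResult:
    org_id: str
    user_id: str
    role: str
    created: bool


class InMemoryOrgMembers:
    def __init__(self) -> None:
        self._rows: Dict[Tuple[str, str], str] = {}

    def upsert(self, org_id: str, user_id: str, role: str) -> UpsertResult:
        """Create the membership if absent, otherwise update its role."""
        key = (org_id, user_id)
        created = key not in self._rows
        self._rows[key] = role
        return UpsertResult(org_id, user_id, role, created)


def status_for(result: UpsertResult) -> int:
    return 201 if result.created else 200


members = InMemoryOrgMembers()
first = members.upsert("org_y", "user_x", "member")   # invitation accepted
second = members.upsert("org_y", "user_x", "admin")   # role change
print(status_for(first), status_for(second))  # 201 200
```

Callers never need to know which case applies. Repeating the same call is safe, and the status code tells them after the fact whether a row was created.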

Getting FastAPI to actually return 201 vs 200 from the same handler was the part that almost shipped broken. PR #124 was the fix. The original handler looked like this:

@router.post(
    "/organizations/{org_id}/members:upsert",
    response_model=UpsertOrgMemberResponse,
)
async def upsert_org_member(
    org_id: UUID,
    payload: UpsertOrgMemberRequest,
    use_case: UpsertOrgMemberUseCase = Depends(get_upsert_use_case),
) -> UpsertOrgMemberResponse:
    result = await use_case.execute(org_id, payload)
    return UpsertOrgMemberResponse.from_domain(result)

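When you annotate the return as a Pydantic model, FastAPI serializes it and applies the route's single default status code (200 for POST in our config, or 201 if you set status_code= on the decorator). Either way, a handler that simply returns the model can't branch. You get one status for both the create and the update case, which silently broke the idempotency contract for any caller that wanted to distinguish. (FastAPI does also let a handler declare a Response parameter and set response.status_code on it while still returning the model; we went with the explicit response instead.)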

The fix is to return JSONResponse directly so the handler controls the status:

from fastapi.responses import JSONResponse

@router.post("/organizations/{org_id}/members:upsert")
async def upsert_org_member(
    org_id: UUID,
    payload: UpsertOrgMemberRequest,
    use_case: UpsertOrgMemberUseCase = Depends(get_upsert_use_case),
) -> JSONResponse:
    result = await use_case.execute(org_id, payload)
    # 201 when a new membership row was created, 200 when an existing one was updated
    status_code = 201 if result.created else 200
    return JSONResponse(
        status_code=status_code,
        content=UpsertOrgMemberResponse.from_domain(result).model_dump(mode="json"),
    )

You lose automatic OpenAPI response model inference, which is a real cost. You get correct semantics, which is a bigger gain. I document the response shape with responses={200: ..., 201: ...} on the decorator to keep the OpenAPI spec honest.

On the product side, Red PR #264 added the client call at the two moments membership state changes: invitation acceptance and role update.

async def accept_invitation(self, invitation_id: UUID, user_id: UUID) -> None:
    invitation = await self.invitations.get(invitation_id)
    await self.org_members.create(
        org_id=invitation.org_id,
        user_id=user_id,
        role=invitation.role,
    )
    await self.auth_client.upsert_org_member(
        org_id=invitation.org_id,
        user_id=user_id,
        role=invitation.role,
    )
    await self.invitations.mark_accepted(invitation_id)

The Auth call is not in a transaction with the local write, which is a deliberate choice and a place where I might be wrong. If the local write succeeds and the Auth call fails, we have drift. The current mitigation is a nightly reconciliation job that compares product org_members to Auth org_members and re-upserts anything missing. I'd rather drift and reconcile than block invitation acceptance on Auth being reachable.
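The reconciliation pass is essentially a comparison of two maps. The sketch below models both tables as dictionaries keyed by (org_id, user_id). The function name is hypothetical, and treating a role mismatch as drift, in addition to a missing row, is an assumption beyond what is described above.

```python
from typing import Callable, Dict, Tuple

Key = Tuple[str, str]     # (org_id, user_id)
Members = Dict[Key, str]  # key -> role


def reconcile(
    product: Members, auth: Members, upsert: Callable[[str, str, str], None]
) -> int:
    """Re-upsert every product membership that Auth is missing or holds with a stale role."""
    repaired = 0
    for (org_id, user_id), role in product.items():
        if auth.get((org_id, user_id)) != role:
            upsert(org_id, user_id, role)
            repaired += 1
    return repaired


red_rows: Members = {("org_y", "user_x"): "member", ("org_y", "user_z"): "admin"}
auth_rows: Members = {("org_y", "user_z"): "admin"}


def upsert_into_auth(org_id: str, user_id: str, role: str) -> None:
    auth_rows[(org_id, user_id)] = role  # stands in for the Auth upsert call


print(reconcile(red_rows, auth_rows, upsert_into_auth))  # 1
print(reconcile(red_rows, auth_rows, upsert_into_auth))  # 0
```

Because the Auth endpoint is an idempotent upsert, running the job twice is harmless, which is what makes drift-and-reconcile a workable alternative to a distributed transaction.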

Blue and Green shipped matching calls in their respective PRs.

Side cleanup: while I was in BCP I noticed that the bonus-credit endpoint silently dropped its grant when the grant_type field name on the wire didn't match what the receiver expected (the sender was using bonus_type, the receiver was reading grant_type, Pydantic accepted the payload with extra="ignore" and quietly inserted a row with the default grant type). PR #265 fixed the Red caller and PR #231 fixed Blue. Lesson there is to not use extra="ignore" on internal wire models, but that's another post.
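The difference between ignoring and rejecting unknown fields can be demonstrated with standard-library dataclasses. This is an analogue of Pydantic's extra="ignore" and extra="forbid" settings, not Pydantic itself, and the model, its default, and the sample payload are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict


@dataclass
class BonusGrant:
    org_id: str
    amount: int
    grant_type: str = "standard"  # hypothetical default


def parse_lenient(payload: Dict[str, Any]) -> BonusGrant:
    """Drops unknown keys, like extra="ignore": a misnamed field silently vanishes."""
    known = {f.name for f in fields(BonusGrant)}
    return BonusGrant(**{k: v for k, v in payload.items() if k in known})


def parse_strict(payload: Dict[str, Any]) -> BonusGrant:
    """Rejects unknown keys, like extra="forbid": a misnamed field fails loudly."""
    known = {f.name for f in fields(BonusGrant)}
    unknown = sorted(set(payload) - known)
    if unknown:
        raise ValueError(f"unexpected fields: {unknown}")
    return BonusGrant(**payload)


wire = {"org_id": "org_abc", "amount": 500, "bonus_type": "referral"}
print(parse_lenient(wire).grant_type)  # standard
try:
    parse_strict(wire)
except ValueError as exc:
    print(exc)  # unexpected fields: ['bonus_type']
```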

Lessons

The biggest one is that staging only catches the bugs you have data for. The org-membership row was present in staging by historical accident, so the path that read it worked. I now provision a fresh, end-to-end test user (sign up, accept invitation, perform org-scoped action) as part of pre-cutover validation, scripted, not "remember to do it."

Cutting one product at a time was the only thing that kept the blast radius survivable. If I had flipped all three on the same morning the triage would have taken twice as long, because every signal would have been duplicated three ways. The order Red, then Blue, then Green wasn't load-balanced for anything clever — it was just the order I trusted the metrics on.

Naming the endpoint :upsert instead of overloading POST .../members mattered more than I expected. When the FastAPI status code issue came up, the conversation was "the upsert endpoint should return different codes for create vs update," which is a one-sentence problem statement. If the endpoint had been POST /members I'd have spent another hour arguing about whether 200 or 201 was correct in the abstract.

Wrap

The hardest part of centralizing anything across a product family is not the new service. The new service is straightforward: you write it, you deploy it, you wire up clients. The hard part is figuring out who is allowed to be the source of truth for the relationships the new service depends on, and then making every existing write path honor that choice. We chose Auth as the master for org membership, which I still think is right. We just hadn't enforced it everywhere it mattered, and BCP was the first dependent that actually cared.

If you want to see how the rest of the harness fits together, the English landing page is at https://www.codens.ai/en/. Yellow and Purple come onto BCP next quarter. I'll write that one up too, hopefully without the same shape of bug.

"When the AI gets stuck, the engineer fetches the same PRD via MCP and keeps going"

Takayuki Kawazoe — Wed, 20 May 2026 07:33:54 +0000

Last Tuesday I watched our auto-fix agent burn through three retries on a session-handling bug and surrender. The failure mode was honest. It tried, the diff broke a test we did not know existed, it tried again, the second diff fought with an old idempotency check, the third diff was basically the first one with renamed variables. Then it stopped. The bug report sat in our system marked analysis_failed, the proposed plan was there, the partial diff was there, and the engineer who had to take over was sitting in Slack scrolling.

That gap, the moment between "AI gave up" and "engineer is coding," is where most AI dev tools quietly cost more than they save. The engineer cannot just resume. They have to reconstruct what the AI was looking at: which PRD section, which kickoff decision, which root cause analysis, which files the bug report pointed at. The data exists. It just lives in five places and none of them are inside the IDE.

We shipped codens-mcp v0.7.5 partly to close that gap. The AI workflow inside Codens reads and writes the same PRDs, bug reports, kickoffs, and run logs that an engineer can now pull into Claude Code over MCP with one call. Same source of truth. Two surfaces. The handoff loses nothing.

The 80/20 reality nobody markets

The honest number for a well-tuned AI dev harness on real production code is somewhere between 80% and 90% of tasks completed end-to-end. The rest is novel business logic, conflicts with code the AI never saw, spec ambiguity that no amount of retry will resolve, and the long tail of edge cases that someone has to think through. I do not believe the "100% AI development" pitch and I do not think anyone shipping into real codebases does either.

The 20% is not the problem. The problem is the seam between the 80% and the 20%.

When the AI hands a task back, the human arrives without context. The PRD is in Notion. The bug analysis is in Sentry plus some chat thread. The kickoff decision that explains "we chose JWT not session cookies" is buried in a meeting recap. The engineer has to play archaeologist before they write a single line. And because the AI workflow has already burned through three retries, the next attempt starts from a worse position than if the engineer had been the first responder.

Most AI dev tools optimize the 80%. They get better at the part the AI was already good at. The 20% gets a "human-in-the-loop" label and a button that says "request review." That button does not solve anything. The engineer still has to find everything.

Codens treats the seam as the actual product. The 80% has to keep getting better, obviously. But the 20% is where the trust gets built or destroyed, and the only way to make it good is to make the takeover instantaneous.

One source of truth, two read paths

Every artifact the AI produces or consumes during a task is a first-class entity in Codens, stored in Postgres, owned by a project, scoped to an org. Green Codens owns the planning side: Consultation (the requirement-gathering conversation), PRD (the structured spec), Kickoff (the implementation plan with vision, scope, tech selection, milestones), Plan (the task breakdown). Red Codens owns the repair side: Bug Report (with the AI's root cause analysis attached), Bug Fix Plan (proposed impact scope and test requirements). Purple Codens owns execution: Run (the live event stream from a workflow), Logs.

The AI workflow writes to these entities through internal service calls. When the Green PRD AI generator finishes a section, it patches the PRD row. When Red's analyzer finishes, it attaches an analysis blob to the bug report. When Purple's runner emits an event, it goes to the run's event log. Nothing escapes into chat. Nothing depends on a human copying text from one tab to another.

The second read path is codens-mcp. It is a Python package that registers as an MCP server inside Claude Code (or any other MCP client). It authenticates with the same JWT the web app uses, talks to the same backend APIs that the AI workflow talks to, and exposes 38 tools that cover 137+ actions. When an engineer calls green_prd(action="get", prd_id=...), they get the same PRD bytes the AI agent read three retries ago.

The point is not "we have an API." Every product has an API. The point is that the AI workflow and the engineer use the same access shape against the same row. There is no "engineer-facing version" of the PRD that drifts from the "AI-facing version." There is one row. Both sides read it. Both sides can write it.

What codens-mcp actually exposes

The retrieval surface that matters for a takeover is small. An engineer who arrives at a failed task needs to know: what was being built, what decisions were already made, what the AI tried, and where it broke.

Install and authenticate once:

pip install codens-mcp
codens-mcp login

login runs Device Code Flow against the Codens auth service and stores a JWT at ~/.purple-codens/credentials.json. From that point every tool call carries the token automatically. Then register the server in the MCP client's configuration:

{
  "mcpServers": {
    "codens": { "command": "codens-mcp", "args": ["serve"] }
  }
}

Then the engineer, in their IDE, asks Claude to pull the bug report the AI was working on:

red_bug_report(
    action="get",
    organization_id="org_abc",
    bug_id="bug_2f8a"
)
# -> { id, title, description, severity, steps_to_reproduce,
#      expected_behavior, actual_behavior, affected_files,
#      analysis: { root_cause, evidence, suspected_files }, ... }

The action parameter pattern is the whole reason 38 tools cover 137+ operations. One green_prd tool handles create, list, get, update, delete, update_section, approve, submit_for_review, request_changes, archive, unarchive, link_notion, unlink_notion, and consistency-check. The tool descriptor that the model loads at startup is one short signature, not fourteen. (We have written separately about why that matters for context budget. The short version is that a five-server stack burns 55K tokens advertising itself before any work; codens-mcp burns under 5K for everything.)
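The pattern itself is a single entry point that dispatches on the action argument. The sketch below is an in-memory toy, not the codens-mcp implementation: it borrows the green_prd name and three of the action names from the list above, and everything else (storage, handler names, return shape) is hypothetical.

```python
from typing import Any, Callable, Dict

_PRDS: Dict[str, Dict[str, Any]] = {}


def _create(prd_id: str, title: str) -> Dict[str, Any]:
    _PRDS[prd_id] = {"id": prd_id, "title": title, "status": "draft"}
    return _PRDS[prd_id]


def _get(prd_id: str) -> Dict[str, Any]:
    return _PRDS[prd_id]


def _approve(prd_id: str) -> Dict[str, Any]:
    _PRDS[prd_id]["status"] = "approved"
    return _PRDS[prd_id]


_ACTIONS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "create": _create,
    "get": _get,
    "approve": _approve,
}


def green_prd(action: str, **params: Any) -> Dict[str, Any]:
    """One advertised tool; the action argument selects the operation."""
    handler = _ACTIONS.get(action)
    if handler is None:
        raise ValueError(f"unknown action {action!r}; expected one of {sorted(_ACTIONS)}")
    return handler(**params)


green_prd(action="create", prd_id="prd_demo", title="Session handling")
print(green_prd(action="approve", prd_id="prd_demo")["status"])  # approved
```

Only green_prd would be advertised to the client, so adding an action grows the dispatch table without adding another tool descriptor.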

For a takeover the engineer typically chains two or three calls:

green_kickoff(action="get", kickoff_id="kck_7a1c")
# -> vision, scope, non-goals, tech selection, milestones

green_plan(action="get_tasks", plan_id="pln_91de")
# -> ordered task list with status and dependencies

purple_run(action="get_status", run_id="run_be40")
# -> last events, failure reason, partial outputs

Three calls. Maybe forty seconds. The engineer now has the same view of the work that the AI had when it gave up, without leaving the IDE and without reading a single Slack thread.

Walking through a real takeover

The Tuesday session-handling bug. Here is what actually happened after the third retry failed.

The on-call engineer opened their IDE. Claude Code was already running with codens-mcp registered. They typed:

"Pull bug report bug_2f8a and the latest fix plan."

Claude called red_bug_report(action="get", bug_id="bug_2f8a") and red_bug_fix_plan(action="get_by_bug", bug_id="bug_2f8a") in parallel. Both returned in under a second. The analysis pointed at the auth middleware. The fix plan listed the three files the AI thought needed to change and the test it expected to pass. The engineer read it in maybe two minutes.

Then they asked:

"What did the last Purple run actually do?"

Claude called purple_run(action="get_status", run_id=...) and purple_run(action="subscribe_events", run_id=...) for replay. The event log showed exactly which test had failed on each retry and why the third retry had effectively reverted to the first. The AI had been bouncing between two incompatible local minima.

That was the engineer's "aha." The fix plan was conceptually right, but the test the AI was retrying against was wrong, written by an earlier feature, asserting a behavior the new spec explicitly changed. The engineer fixed the test, applied the AI's second-attempt diff with a four-line manual adjustment, and shipped it. From bug report open to PR merged: 23 minutes, including reading.

Without codens-mcp that same takeover would have been: open Sentry, search by ticket, copy stack trace, open Notion, find the PRD by title, scroll to the right section, open the chat thread where the kickoff lived, find the test naming pattern, grep the repo, then start coding. I have timed that path on myself. It is between 25 and 45 minutes before the first edit.

The tradeoff

The price of "one source of truth, two read paths" is schema discipline. Every artifact has to be modeled well enough that the AI workflow and the engineer both find what they need in it. You cannot let the PRD turn into a Markdown blob with five conflicting section conventions, because the AI's update_section action and the engineer's get_section reader both depend on the structure being honest. You cannot let the bug report become a free-text field with the root cause analysis stuffed at the bottom in a different format every time, because the takeover tooling that highlights analysis.suspected_files will silently miss them.

This is heavier upfront than the alternative, which is to let each side render its own view. The alternative loses every time. The drift between "what the PM thinks the spec says" and "what the engineer thinks the spec says" is, in my experience, the single biggest source of bugs in features that get partially built by an AI. The schema discipline pays for itself the first time a takeover succeeds in under thirty minutes.

The other cost is one I'll state plainly: we run on the Anthropic API direct path, with per-token billing and our own multi-model routing across Claude and Qwen. In exchange, we control the escalation path (AI workflow to engineer manual takeover via MCP) independent of what any single platform decides about subscription-tier agent access. When the platform shifts, the takeover path does not move.

Wrap

Graceful degradation is the unappreciated half of AI dev tool design. Anyone can build an agent that succeeds on the easy 80%. The teams that ship into real production code earn their trust on the 20% where the agent gives up and a human takes over. The only way to make that takeover not feel like a downgrade is to make the data the human needs be exactly the data the agent had, in the same shape, one tool call away.

That is what codens-mcp is. The AI does most of the work. When it cannot, the engineer reads the same row.

Codens English landing: https://www.codens.ai/en/
codens-mcp on PyPI: https://pypi.org/project/codens-mcp/

"One JWT, five services, and the python-jose audience list trap"

Takayuki Kawazoe — Sat, 16 May 2026 04:34:53 +0000

audience must be a string or None.

That was the exception python-jose threw the moment our unified MCP server tried to talk to the second backend behind it. The token was valid. The signature checked out. The claims were correct. The library just refused to accept a list as the expected audience, and the JWT spec disagrees with the library on whether that should be a problem.

We run a single MCP server, codens-mcp on PyPI, that fronts five backends: Red (auto-fix), Blue (QA), Green (PRD), Purple (orchestration), and Auth. One MCP token, five destinations. When Claude calls a Red tool, the MCP server proxies an HTTP request to the Red backend carrying that same token. Same for Blue, Green, Purple, Auth. Each backend has its own primary audience for its own user-facing tokens, and we wanted all of them to also accept the MCP server's token without minting five service-specific JWTs per session.

This is the story of how that ran into a python-jose quirk, and the 12-line workaround we ended up shipping.

The architecture, briefly

Codens exposes 31 tools across the five product surfaces through one MCP server. From Claude's side it is a single connection. From the backends' side, each one sees a normal authenticated HTTP request with a bearer token in the header. The token is issued by the Auth service. Its aud claim is purple-codens-mcp, because the MCP server is the thing the user logged into when they connected their client.

Each backend already had its own audience for its first-party tokens. Green expects green-codens. Red expects red-codens. And so on. Those audiences were baked into the OAuth verifier and matched the audience claim on tokens minted by that service's own login flow.

We had two ways forward.

The first option: mint five tokens per MCP session. The MCP server logs into Red, Green, Blue, Purple, and Auth as the user, gets five JWTs, and selects the right one based on which tool the user invoked. This is conceptually clean. It also means five times the token issuance, five rotation surfaces, five sets of refresh flows to coordinate, and a routing layer in the MCP server that has to know which token belongs to which tool. None of that adds value.

The second option: mint one token, declare its audience as purple-codens-mcp, and teach every backend to accept that audience in addition to its own primary one. The MCP server holds one credential. Each backend keeps its primary audience for its own native flows and additionally trusts MCP-issued tokens. Rotation surface stays small. The routing logic in the MCP server disappears.
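To make the difference concrete, here is a minimal sketch of what the MCP server's proxy step reduces to under option two. The base URLs, the paths, and the function name are assumptions for illustration; the real proxy code is not shown in this post.

```python
import urllib.request

# Hypothetical internal base URLs, one per backend.
BACKENDS = {
    "red": "https://red.internal.example",
    "blue": "https://blue.internal.example",
    "green": "https://green.internal.example",
    "purple": "https://purple.internal.example",
    "auth": "https://auth.internal.example",
}


def build_proxy_request(backend: str, path: str, mcp_token: str) -> urllib.request.Request:
    """Build (but do not send) the request the MCP server would forward."""
    url = BACKENDS[backend] + path
    # Option two: the same bearer token goes to every backend, so there is
    # no per-backend token lookup here, only a URL lookup.
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {mcp_token}"})


token = "example.jwt.value"  # placeholder, not a real JWT
for name in BACKENDS:
    req = build_proxy_request(name, "/health", token)
    print(name, req.full_url, req.get_header("Authorization"))
```

Under option one, the function would also need a mapping from backend to token, plus whatever keeps those five tokens refreshed. Under option two, that part of the code never has to exist.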

We picked option two. The plan was to add a per-service config that lists additional accepted audiences, expand the verifier to check against the union, and ship it.

Fix v1: pass a list to python-jose

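The setting looked like this in every backend service (Green's values shown here):

from pydantic_settings import BaseSettings  # assumes pydantic v2; v1 exposes it as pydantic.BaseSettings

class Settings(BaseSettings):
    OAUTH_AUDIENCE: str = "green-codens"  # this service's own primary audience
    OAUTH_ADDITIONAL_AUDIENCES: list[str] = ["purple-codens-mcp"]  # also trusted: the MCP token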

The verifier change looked equally innocuous. python-jose's jwt.decode accepts an audience keyword. The naive reading of every JWT tutorial on the internet says you give it the expected audience and it checks the token's aud against that. So we built a list of accepted audiences and handed it over:

audiences = [self.audience] if verify_audience and self.audience else []
if audiences and settings.OAUTH_ADDITIONAL_AUDIENCES:
    audiences.extend(settings.OAUTH_ADDITIONAL_AUDIENCES)

payload = jwt.decode(
    token,
    self.secret_key,
    algorithms=[self.algorithm],
    audience=audiences if audiences else None,
)

This is the version we wrote, ran a quick local smoke test against, and pushed to the dev environment thinking the work was done. The shape of the change matched the shape of the problem. A list of allowed audiences in, an aud claim checked against that list, request accepted. Done.

The dev environment, of course, immediately disagreed.

The trap

The MCP server made its first call into Green and the request came back as a 401. The Green logs had the actual exception underneath the generic auth failure:

jose.exceptions.JWTError: audience must be a string or None

python-jose's jwt.decode does not accept a list for its audience parameter. If you pass one, it raises from a type check on the argument itself, before it ever compares anything against the token's aud claim. The library has only ever supported single-string audience verification. There is no flag, no overload, no helper that takes a list.

RFC 7519 points the other way. Section 4.1.3 defines aud as, in the general case, an array of case-sensitive strings, with a single case-sensitive string allowed as the special case for one audience, and it requires each recipient to identify itself with a value in that claim. On the token side that is set membership. The spec does not say how many values a recipient may identify itself with, so nothing in it stops a verifier from accepting more than one. Whether either side is a list is a transport detail.

python-jose is one of the most-used Python JWT libraries. Most FastAPI tutorials reach for it without thinking. It is also old, and the maintainer activity is thin. There is a multi-year-old GitHub issue tracking exactly this limitation, with patches floating around in forks and pull requests that never merged. The library's behavior is what it is, and if you need list audience verification, you are on your own.

The honest read here is that the JWT spec describes capability and most libraries describe a comfortable subset of it. The subset is usually fine. The moment you do anything cross-service it stops being fine.

Fix v2: decode without audience verification, then verify manually

The fix that worked is to use python-jose for what it is good at, which is signature verification and claim decoding, and do the audience check ourselves. python-jose lets you disable individual claim checks through its options dict. verify_aud: False turns off the built-in audience verification entirely. The signature and expiry still get checked, and so does the issuer if you pass one to jwt.decode. We just take responsibility for aud.

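# Capture the caller's intent up front: verify only if asked and an audience is configured.
should_verify_aud = verify_audience and bool(self.audience)

# Signature and expiry are still verified here; only the built-in aud check is off.
payload = jwt.decode(
    token,
    self.secret_key,
    algorithms=[self.algorithm],
    options={"verify_aud": False},
)

if should_verify_aud:
    # Everything this service accepts: its own audience plus the configured extras.
    allowed_audiences = {self.audience, *settings.OAUTH_ADDITIONAL_AUDIENCES}
    token_aud = payload.get("aud")
    # Normalize aud (string, list, or missing) into a set.
    token_aud_set = (
        set(token_aud) if isinstance(token_aud, list)
        else {token_aud} if token_aud is not None
        else set()
    )
    # No overlap means the token was minted for someone else.
    if not (token_aud_set & allowed_audiences):
        raise InvalidTokenError(
            f"Invalid audience: token aud={token_aud!r}, expected one of {sorted(allowed_audiences)}"
        )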

The set intersection does the entire job. token_aud_set & allowed_audiences returns a set of values present in both, and if that set is empty the token is for someone else and we reject it. If the token's aud is a single string we wrap it in a one-element set. If it is a list we convert directly. If it is missing we get an empty set and the intersection is empty, which fails closed.
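The same check can be pulled into a standalone function, which makes it easy to unit-test. This is a sketch rather than the shipped module: the function name and the local InvalidTokenError class are stand-ins, and it adds one assumption the original does not state, namely that an aud value which is neither a string nor a list of strings should be rejected cleanly instead of raising a TypeError from set().

```python
from typing import Any, Iterable


class InvalidTokenError(Exception):
    """Stand-in for the service's own auth exception."""


def check_audience(payload: dict[str, Any], allowed: Iterable[str]) -> None:
    """Raise InvalidTokenError unless the token's aud overlaps `allowed`."""
    allowed_set = set(allowed)
    token_aud = payload.get("aud")

    if isinstance(token_aud, str):
        token_aud_set = {token_aud}
    elif isinstance(token_aud, list) and all(isinstance(a, str) for a in token_aud):
        token_aud_set = set(token_aud)
    else:
        # Missing or malformed aud: empty set, so the check fails closed.
        token_aud_set = set()

    if not (token_aud_set & allowed_set):
        raise InvalidTokenError(
            f"Invalid audience: token aud={token_aud!r}, "
            f"expected one of {sorted(allowed_set)}"
        )


allowed = {"green-codens", "purple-codens-mcp"}
check_audience({"aud": "purple-codens-mcp"}, allowed)        # accepted
check_audience({"aud": ["other", "green-codens"]}, allowed)  # accepted

# Missing aud, wrong aud, and malformed aud are all rejected.
for bad in ({}, {"aud": "red-codens"}, {"aud": [{"x": 1}]}):
    try:
        check_audience(bad, allowed)
    except InvalidTokenError as exc:
        print(exc)
```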

One subtle thing about the order. We compute should_verify_aud before calling jwt.decode, not after, because we want the variable to capture the caller's intent independent of what python-jose returns. If someone passes verify_audience=False, we skip the manual check entirely. If they pass verify_audience=True but the service has no configured audience, there is nothing to verify against, so we also skip. The manual block only runs when there is something real to check.
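The three cases are easier to see as a small truth table. The function below is a hypothetical standalone version of the `verify_audience and bool(self.audience)` expression, written only to enumerate the combinations.

```python
from typing import Optional


def should_verify_aud(verify_audience: bool, configured_audience: Optional[str]) -> bool:
    # Mirrors `verify_audience and bool(self.audience)` from the verifier.
    return verify_audience and bool(configured_audience)


for verify_audience in (True, False):
    for configured in ("green-codens", "", None):
        print(verify_audience, repr(configured), should_verify_aud(verify_audience, configured))
```

Only one of the six combinations runs the manual check. One consequence worth keeping in mind: a service deployed with an empty audience setting skips the check and therefore accepts tokens minted for any audience.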

The error message includes both the token's actual aud value and the sorted list of audiences we accept. When you debug an inter-service auth failure at 2am, the last thing you want is a 401 that tells you nothing about the mismatch. The cost of formatting that message into the exception is zero and the time it saves is real.

The bonus pattern: decode and verify as separate steps

Once you have done this once, decoupling decoding from verification starts to feel like the right default for any JWT code that has to do anything non-trivial. The library is good at parsing the structure and confirming the signature. Your service is the one that knows which claims matter and what acceptance looks like.
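Here is a minimal sketch of that split using only the standard library. The hand-rolled HS256 code stands in for python-jose so the example runs without extra dependencies; it is an illustration of the two-step shape, not a replacement for a real JWT library. All function names and the secret are made up for the example.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url_decode(segment: str) -> bytes:
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def sign_hs256(claims: dict, secret: bytes) -> str:
    """Toy HS256 encoder, only here so the example has a token to decode."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = _b64url_encode(json.dumps(claims).encode())
    sig = hmac.new(secret, f"{header}.{body}".encode(), hashlib.sha256).digest()
    return f"{header}.{body}.{_b64url_encode(sig)}"


def decode_verified(token: str, secret: bytes) -> dict:
    """Step 1: structure and signature only. No claim policy in here."""
    header_b64, body_b64, sig_b64 = token.split(".")
    header = json.loads(_b64url_decode(header_b64))
    if header.get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    expected = hmac.new(secret, f"{header_b64}.{body_b64}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    return json.loads(_b64url_decode(body_b64))


def accept(claims: dict, allowed_audiences: set, now: float) -> bool:
    """Step 2: the service's own policy, applied to already-trusted claims."""
    aud = claims.get("aud")
    if isinstance(aud, str):
        token_auds = {aud}
    elif isinstance(aud, list):
        token_auds = {a for a in aud if isinstance(a, str)}
    else:
        token_auds = set()
    return claims.get("exp", 0) > now and bool(token_auds & allowed_audiences)


secret = b"dev-only-secret"
token = sign_hs256({"aud": "purple-codens-mcp", "exp": time.time() + 60}, secret)
claims = decode_verified(token, secret)
print(accept(claims, {"green-codens", "purple-codens-mcp"}, time.time()))
print(accept(claims, {"red-codens"}, time.time()))
```

In a real service step 1 would stay with the library, for example jwt.decode with the checks you want to own switched off through options, and only step 2 would be your code.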

The same pattern handles a bunch of adjacent problems. Token introspection for audit logs without re-running all the checks. Soft expiry where you log a warning at 90 percent of the lifetime instead of rejecting. Migration windows where you accept tokens signed with either the old or new key for a week. Custom claim validation that the library has never heard of. Whenever a future library bug lands in the issuer check or the expiry math, you have an escape hatch already in place because the verification logic is yours.
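As one example from that list, soft expiry is a few lines once the claims are in hand. This sketch assumes the token carries both iat and exp as numeric timestamps, and that hard expiry is still enforced elsewhere; the function names, the logger name, and the sample claims are hypothetical.

```python
import logging

logger = logging.getLogger("auth")


def lifetime_used(claims: dict, now: float) -> float:
    """Fraction of the token lifetime that has elapsed (needs iat and exp)."""
    iat, exp = claims["iat"], claims["exp"]
    if exp <= iat:
        raise ValueError("exp must be later than iat")
    return (now - iat) / (exp - iat)


def check_soft_expiry(claims: dict, now: float, warn_at: float = 0.9) -> bool:
    """Return True if the token is past the soft threshold; never rejects."""
    used = lifetime_used(claims, now)
    if used >= warn_at:
        logger.warning(
            "token for sub=%r has used %.0f%% of its lifetime",
            claims.get("sub"),
            used * 100,
        )
        return True
    return False


claims = {"sub": "user-123", "iat": 1_000, "exp": 2_000}
print(check_soft_expiry(claims, now=1_500))  # halfway through the lifetime
print(check_soft_expiry(claims, now=1_950))  # 95% through the lifetime
```

The function only warns. Rejection on real expiry stays where it already is, which keeps the soft check safe to add or remove without touching the acceptance logic.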

This is also the answer even if python-jose ships list audience support tomorrow. You do not lose anything by owning the audience check. You gain a place to put the next requirement that does not fit cleanly into a kwarg.

Wrap

Multi-service authentication keeps running into the gap between what JWT can do and what the convenient libraries actually do. The spec is generous. The libraries are opinionated. When you stitch services together, the opinions usually have to give.

The unified-token path was worth the workaround. One JWT, one rotation, one issuer, five backends that each know how to accept it. The cost was a dozen lines of manual verification in a shared OAuth module. We would make the same trade again.

If you want to see how Codens uses this on the agent side, the English landing page is at https://www.codens.ai/en/. The MCP server is codens-mcp on PyPI and it is what the agent connects to when it needs to talk to any of the five product surfaces.