
Pavel Gajvoronski

Part 6: How I Turned Protocol v2 From a Document Into Working Code

This is a build-in-public update on Kepion — an AI platform that deploys companies from a text description. Start from Part 1. Yesterday's story — the one where my AI agent lied about completing a benchmark.


Update before publishing: This article documents what I built on Saturday. Between drafting and publishing, two things happened that strengthen the lesson: a reader on Part 2 left a comment that triggered an architectural audit and exposed a critical bug in our Team Memory scoping (subject of Part 7), and Anthropic released Claude Opus 4.7 with a new tokenizer that uses 20-35% more tokens per request — which means the cost-table drift problem this article describes is already recurring. The single-source-of-truth principle isn't a one-time fix. It's an ongoing discipline.

Yesterday I published a postmortem about an AI benchmark that went wrong in five different ways. I wrote seven rules that were supposed to prevent it from happening again.

Today I discovered the rules don't matter if they only live in a markdown file.

This is the story of converting Protocol v2 from a document into executable guardrails — and the four things I learned that I couldn't have learned without writing the code.

The moment of truth

After publishing yesterday's article, I asked GSD-2 to audit my benchmark harness against Protocol v2. For each of the seven rules, I wanted to know: is there code that enforces this, or is it just written in a doc somewhere?

The result was humbling:

| Rule | Status |
| --- | --- |
| Rule 1 — No status without artifact | PARTIAL |
| Rule 2 — Smoke test mandatory | GAP |
| Rule 3 — Heartbeat every 5 minutes | GAP |
| Rule 4 — Scope deviations need approval | GAP |
| Rule 5 — Fixture validation pre-flight | GAP |
| Rule 6 — No auto-promote to "adopted" | GAP (still broken at line 903) |
| Rule 7 — Circuit breaker on 402 | PARTIAL |

Two out of seven partially implemented. Five were only documentation.

And the worst one — Rule 6, the one that auto-promoted GLM-5.1 to "adopted" yesterday — was still in the code. Right there at glm_51_eval.py:903. If I'd re-run the harness that morning without fixing anything, it would have silently written "adopted" to models.json all over again.

The lesson landed hard: a rule that lives only in a markdown file isn't a rule. It's a wish.

Problem 1: Rule 6 — the one that almost caused a production incident

This was the critical fix. Yesterday's postmortem spent three paragraphs explaining why evaluation_status should never auto-promote to "adopted". But the code at line 903 looked like this:

```python
if "ADOPT" in recommendation_line:
    candidate["evaluation_status"] = "adopted"
```

One conditional. No human in the loop. The agent decides, the file gets written.

The fix

I replaced the branch with a hard guarantee:

```python
# Always write "pending-human-review" — never "adopted"
candidate["evaluation_status"] = "pending-human-review"
print_human_review_banner(model_id, report_path)
```

But that's the easy part. The harder part was making it impossible to bypass.

I built a separate script — confirm_adoption.py — which is the only way to write "adopted" to models.json. It takes --model and --confirm flags, but even with both flags present, it still requires an interactive TTY prompt:

```python
if not sys.stdin.isatty():
    print("ERROR: confirm_adoption.py requires an interactive terminal.")
    print("This guard prevents automated promotion via CI, shell aliases, or LLM agents.")
    sys.exit(2)

response = input(f"Type 'adopt {model_id}' to confirm: ")
if response.strip() != f"adopt {model_id}":
    print("Aborted.")
    sys.exit(1)
```

You can't automate this. You can't wrap it in an alias. You can't have an LLM call it. The human must physically type the full model ID.

And then I added a unit test:

```python
def test_harness_never_writes_adopted():
    """Regression test: evaluation_status must never become
    'adopted' via harness output."""
    result = run_harness_with_mocked_scores(adopt_recommendation=True)
    assert result["evaluation_status"] == "pending-human-review"
    assert result["evaluation_status"] != "adopted"
```

This is the real mechanism. Not the Markdown rule. Not the comment in the code. The test. If anyone — including future me, including a future AI agent — tries to restore the old auto-promote logic, CI fails. The rule is now mechanical.

What this fix actually improved

The obvious win: no more silent production-config changes.

The non-obvious win: I now have a template for every other irreversible action in Kepion. Deployment promotion, API key rotation, database migrations, model routing changes — anything where "an agent could plausibly do this autonomously and I'd regret it." Each one gets a confirm_X.py script, a TTY check, and a regression test.
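As a sketch of what that reusable template might look like — the helper name and the key-rotation example below are hypothetical illustrations, not code from the Kepion repo:

```python
import io
import sys
from typing import TextIO


def require_tty_confirmation(action: str, token: str,
                             stdin: TextIO = sys.stdin) -> None:
    """Exit unless run at a real terminal and the human types the exact token.

    Generic version of the confirm_adoption.py gate, intended for reuse by
    other confirm_X.py scripts (names here are illustrative).
    """
    if not stdin.isatty():
        # StringIO, pipes, and CI runners all fail this check — only a
        # human at a real terminal gets past it.
        print(f"ERROR: {action} requires an interactive terminal.")
        print("This guard blocks CI, shell aliases, and LLM agents.")
        sys.exit(2)
    print(f"Type '{token}' to confirm {action}: ", end="", flush=True)
    response = stdin.readline()
    if response.strip() != token:
        print("Aborted.")
        sys.exit(1)


# Usage in a hypothetical confirm_rotate_keys.py:
# require_tty_confirmation("API key rotation", "rotate production-keys")
```

The design choice that matters is taking the stream as a parameter: it keeps the guard testable (a `StringIO` reports `isatty() == False`, so a test can assert the guard fires) without weakening the production path.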

One fix spawned a pattern.

Problem 2: the hidden cost bug I didn't see yesterday

While fixing Rule 6, I found something that completely reframes yesterday's postmortem.

The harness had a KNOWN_COSTS table mapping model IDs to their prices. It looked right:

```python
KNOWN_COSTS = {
    "anthropic/claude-opus-4": {"input": 5.0, "output": 25.0},
    "anthropic/claude-sonnet-4": {"input": 3.0, "output": 15.0},
    ...
}
```

$5/$25 per 1M tokens for Opus. That's the published price for the 4.6 generation.

Except anthropic/claude-opus-4 on OpenRouter doesn't point to Claude Opus 4.6. It points to an older snapshot priced at $15/$75 per 1M tokens — three times higher.

I'd been running Atlas, Shield, Designer, and business-generator (all four of my premium-tier agents) on an older, more expensive Opus snapshot for weeks. And the harness's cost accounting was wrong by a factor of 3.

Which means yesterday's "$6 budget" story was wrong. Real cost was closer to $12-18. My circuit breaker was defending against a fantasy budget, and OpenRouter's 402 fired much earlier than the harness expected because the harness thought it had headroom.

More interesting: this inverts the Score/$ comparison. Yesterday I reported Opus at 89 and GLM-5.1 at 75 — Opus looked more cost-efficient. At correct pricing, Opus drops to around 30 (89 ÷ 3, since the real price is triple) while GLM-5.1 stays at 75. GLM-5.1 is roughly 2.5× more cost-efficient than Opus, not less.

It doesn't change yesterday's verdict (40% task coverage + 0.27-point gap within noise floor is still rejected-inconclusive). But the economic case for re-running with Protocol v2 guardrails just got stronger.

The fix

Two changes:

One, update every agent's model reference in models.json:

  • `anthropic/claude-opus-4` → `anthropic/claude-opus-4.6`
  • `anthropic/claude-sonnet-4` → `anthropic/claude-sonnet-4.6`

33 substitutions total across agents, escalation targets, and tier definitions. JSON revalidated. Zero remaining bare-4 references.

Two, fix the KNOWN_COSTS table to reflect actual live pricing, and add a new rule to Protocol v2:

Rule 8: Harness cost tables must sync with live provider pricing before every run

Hardcoded price constants are just as dangerous as hardcoded budget ceilings. A benchmark that thinks Opus is $5 when it's actually $15 has a circuit breaker that doesn't exist.

The fix is a pre-flight check: before any live run, the harness queries OpenRouter's /models endpoint, compares returned prices against its KNOWN_COSTS constants, and aborts if any delta exceeds 5%. Live pricing wins. Constants are just a fallback.
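A minimal sketch of the comparison step, assuming the live prices have already been fetched (the HTTP call to OpenRouter's /models endpoint and its response parsing are left out, and the function name is my own, not the harness's):

```python
def find_price_drift(known_costs: dict, live_costs: dict,
                     tolerance: float = 0.05) -> list:
    """Return (model_id, field, known, live) tuples wherever the cached
    price deviates from live pricing by more than `tolerance` (relative).
    An empty list means the pre-flight check passes."""
    drift = []
    for model_id, known in known_costs.items():
        live = live_costs.get(model_id)
        if live is None:
            # Model vanished from the provider catalog — also a failure.
            drift.append((model_id, "missing", known, None))
            continue
        for field in ("input", "output"):
            k, l = known[field], live[field]
            if abs(l - k) / l > tolerance:
                drift.append((model_id, field, k, l))
    return drift


# Yesterday's bug, replayed: cached $5/$25 vs. live $15/$75.
KNOWN_COSTS = {"anthropic/claude-opus-4": {"input": 5.0, "output": 25.0}}
live = {"anthropic/claude-opus-4": {"input": 15.0, "output": 75.0}}
problems = find_price_drift(KNOWN_COSTS, live)
# Both input and output drift by ~67%, far over the 5% tolerance,
# so the harness would abort before spending a cent.
```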

Seven rules became eight.

A footnote that matters: Opus 4.7

Between drafting this article and publishing it, Anthropic released Claude Opus 4.7. The sticker price is the same — $15/$75 per million tokens. But the new tokenizer splits most text into 20-35% more tokens than 4.6's did. Same prompt, same output, 20-35% more tokens billed.

This means the migration from opus-4 (mispriced) → opus-4.6 (correctly priced) that I describe above is already partially out of date. By the time you read this, Kepion's KNOWN_COSTS table will have moved to 4.7 with adjusted multipliers, or the live-pricing check from Rule 8 will catch the discrepancy automatically.

This is exactly the recurrence Rule 8 was designed for. Cost tables are caches. Caches go stale. The discipline is verifying before every run, not "fixing it once."

Problem 3: Rule 6 fixed the symptom, not the pattern

After both fixes, I sat back and looked at what I'd done. And I realized the two bugs — the auto-promote and the wrong cost table — had the same shape.

Both were cases of a source of truth living in two places.

  • Model pricing lived in the harness constants AND in OpenRouter's API. They disagreed. The harness won. The result: wrong accounting.
  • Adoption status lived in a script's recommendation text AND in models.json. The script wrote to the config directly. The result: silent production change.

The real fix isn't "check pricing once" or "gate promotion with a TTY." The real fix is: whenever you have a fact that exists in two places, one of them is going to drift, and the drift will be invisible until it hurts.

So I'm adding a principle above the eight rules:

Protocol v2, Section 0 — Single Source of Truth

For every critical fact (prices, statuses, routing configs, feature flags), there must be exactly one authoritative source. Everything else is a cache with a defined TTL and a verification procedure.

Examples:

  • Model pricing: OpenRouter /models endpoint is truth. Harness constants are a cache validated before every run.
  • Adoption status: human decision in a TTY is truth. models.json is a recording of that decision.
  • Agent routing: models.json is truth. Agent code reads it at startup and doesn't cache.

This is not original. It's just DRY applied to state instead of code. But I hadn't been thinking of my harness constants as a cache — I was treating them as facts. That mental model was the real bug.
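The "everything else is a cache" idea can be sketched as a small wrapper — illustrative only; the `CachedFact` name is mine, not Kepion's actual code:

```python
import time
from typing import Any, Callable


class CachedFact:
    """A non-authoritative copy of a fact, per Protocol v2 Section 0.

    `fetch` reads the single source of truth. The stored value is only a
    convenience copy: once the TTL expires, the next read goes back to
    the authoritative source instead of trusting the copy.
    """

    def __init__(self, fetch: Callable[[], Any], ttl_seconds: float):
        self.fetch = fetch
        self.ttl = ttl_seconds
        self.value: Any = None
        self.fetched_at: float | None = None

    def get(self) -> Any:
        now = time.monotonic()
        if self.fetched_at is None or now - self.fetched_at > self.ttl:
            # Stale (or never fetched): re-read the authoritative source.
            self.value = self.fetch()
            self.fetched_at = now
        return self.value
```

With this framing, Rule 8 is just "pricing is a `CachedFact` with a TTL of zero": every run goes back to the provider, and the constants table never gets to be the truth.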

Problem 4: design-before-code on critical changes

Within 24 hours of publishing the Protocol v2 doc, a reader on Part 2 left a comment that pushed me to audit something completely separate — the Team Memory subsystem. (That story is Part 7. The short version: there's a cross-business pattern contamination bug that would silently corrupt agent decisions at scale.)

When the audit came back with two critical findings, my instinct was to immediately ask GSD to fix them. I caught myself. The Rule 6 incident from yesterday was still warm in my head: agent acts autonomously on something irreversible, human regrets it later.

So I split the work into two phases. First: GSD produces a design document at vault/designs/team-memory-scoping-fix-v1.md — schema changes, write-path enforcement, ranking formula, migration plan, rollback plan, test scenarios, open questions. Then: human reviews the design. Only after explicit approval does implementation begin.

This adds a Rule 9 to Protocol v2:

Rule 9: Critical architectural changes require design-before-implementation

For any change that's hard to roll back (schema migrations, write-path semantics, ranking algorithms, anything affecting persisted state), the agent must produce a design document first. Code only after human approval.

The design document forces the agent to articulate trade-offs, surface open questions, and acknowledge what it's not solving. The human review catches architectural mistakes before they're encoded into committed code, where they become 10× harder to remove.

Eight rules became nine.

What the refactor actually produced

Concrete artifacts from yesterday's work:

Protocol v2, expanded from seven rules to nine, with a prefix section about single source of truth. Now lives in docs/lessons/benchmark-protocol-v2.md.

confirm_adoption.py, the only mechanism that can promote a candidate model to "adopted". Requires TTY. Logs every confirmation to an append-only audit trail.

test_no_auto_adopt.py, a regression test that fails CI if the auto-promote logic ever returns.

Updated models.json, with all 33 Anthropic model references pointing to the correct 4.6 snapshots (with 4.7 migration tracked separately as live pricing comes in).

A compliance audit at vault/benchmarks/glm-5.1-evaluation/PROTOCOL-V2-COMPLIANCE.md that grades every rule with IMPLEMENTED / PARTIAL / GAP. Today: 2 of 9 implemented, 1 partial, 6 gaps remaining.

A proposal doc for the remaining six gaps, with effort estimates (11-15 hours total) and priority ordering. Rule 5 (fixture validation) is next — it's the highest-leverage gap because a broken fixture corrupted all model averages yesterday.

What I couldn't have learned from the document alone

Four things became obvious only when I wrote the code.

First: the difference between a guarded action and a blocked action. Rule 6 as a doc said "don't auto-promote." Rule 6 as a TTY-enforced script says "this physically cannot happen without a human." Those are different rules. The first one is an aspiration. The second one is a law.

Second: bugs travel in families. I'd written seven rules thinking I'd covered the failure modes. Writing the code revealed an eighth bug (wrong pricing) that had exactly the same shape as one of the rules I'd already documented. If I'd only updated the document, I'd have missed it. The act of turning rules into code surfaces related issues the document can't see.

Third: guardrails compound. Rule 6's TTY pattern immediately became a template for every other irreversible action in the system. Single-source-of-truth became a principle that now applies to half of Kepion's config. One fix became three patterns became a rewrite of how I think about state in the platform.

Fourth: applying the discipline to itself. When a reader's comment surfaced the Team Memory bug, my first instinct was to fix it immediately. The Protocol-v2 work I'd just finished told me to design first, code second. The discipline only counts if you apply it when it's inconvenient — which is exactly when you don't want to.

The act of writing the code wasn't just implementing the document. It was finishing thinking about the problem.

What's next

Rule 5 (fixture validation) is the next gap to close in the harness. It's 4-6 hours of work and it's the single highest-leverage fix remaining — a broken fixture corrupted all three models' averages in yesterday's run, and there's no protection against it yet.

In parallel: implementing the Team Memory scoping fix from the approved design. Then writing it up — the full chain from reader comment to audit to design to implementation — as Part 7.

After that: the GLM-5.1 evaluation can be re-run for real, with restored budget and a harness that actually enforces what its documentation promises. With correct pricing, the re-run might tell a meaningfully different cost-efficiency story.

Questions for you

Have you had a rule that existed only in documentation bite you in production? What turned you from "we have a policy" to "we have a guardrail"? I'd love to hear examples — my instinct is this pattern is much more common than people write about.

Do you have a test that asserts what your code must never do, not just what it must do? The test_no_auto_adopt pattern feels important to me — assertions about forbidden states. But I don't see it in most codebases. Is this common practice and I'm late, or is it rare?

Where in your system do you have the same fact living in two places? Pricing tables, feature flags, user permissions, cached configs, routing rules. I'd bet most codebases have at least three such places. What's your protocol for keeping them in sync?

Drop thoughts in the comments. Yesterday's post got some great responses about agent hallucination — I'm hoping today's sparks the same on guardrail architecture.


Building Kepion in public. Next update (Part 7): the reader comment that caught a production-grade Team Memory scoping bug — full chain from feedback to audit to design to fix.

If you're building your own agent system and want to steal Protocol v2 (now 9 rules), it's in the Kepion repo under docs/lessons/benchmark-protocol-v2.md.
