A week with ctxbudgeter: how I cut Claude code-review costs 60%


TL;DR — Over six months, a Claude-backed code-review bot grew from system + diff to system + diff + README + CONTRIBUTING + style guide + tool defs + history. Cost climbed to ~$0.066 per call, latency hit 8s, and last week it shipped an API key to Anthropic from a developer's docstring. In 7 days, with a Python package I built called ctxbudgeter, we cut input tokens from 22k to 3.4k + 7.4k cached, dropped cost per call by 60%, eliminated the secret-leak class of bug at compile time, and made prompt regressions fail CI like any other test failure.

This is how.


The bot, the bill, the leak

We run an internal code-review bot. Every pull request opens, fires a webhook, and asks Claude to leave inline review comments. It started simple — the system prompt plus the PR diff. Then someone added the README "so the bot has context." Then CONTRIBUTING. Then the style guide. Then "the last 10 reviews so it knows our tone." Then a few helpful source files. Then tool definitions for our internal Jira mention syntax.

Six months in, nobody on the team could tell you what was actually in the prompt anymore.

The bill climbed steadily — about $0.066 per PR review. At 1,500 PRs a month that's not a lot of money, but every PR was a 22,000-token shot regardless of whether the actual diff touched 3 lines or 30 files. Latency hovered around 8 seconds. Review quality was inconsistent — sometimes the bot would comment on stale // TODO notes from years ago because the diff context had been pushed out by accreted "helpful" docs.
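(Sanity check on that number: 22,000 input tokens at Sonnet-class list pricing of roughly $3 per million works out to about $0.066 per call. I'm quoting the rate from memory; check Anthropic's current price list.)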

Then last Tuesday, a developer pasted a snippet containing a real sk-... API key into a docstring to debug something locally. They forgot to delete it before pushing. Our bot dutifully shipped that snippet to Anthropic as part of the review prompt. We caught it at the next rotation, but it was the second-worst kind of incident: silent, easy, repeatable.

I'd been wanting to fix this for a while. The pattern is universal — every AI team I've ever worked with rebuilt the same context-budget logic ad-hoc, badly, and lived with the consequences. So I built ctxbudgeter, a small framework-agnostic Python package that compiles clean, cheap, auditable context for AI agents before the model sees it.

Then I used it on the code-review bot. Here's the week.


Day 1 — Audit what's actually in the prompt

pip install "ctxbudgeter[tiktoken,yaml]"

I refactored the bot's build_prompt() function into a build_pack() that constructs a typed ContextPack:

from ctxbudgeter import ContextPack

def build_pack(pr_diff: str, touched_files: list[str]) -> ContextPack:
    pack = ContextPack(
        model="claude-sonnet-4.6",
        token_budget=24_000,
        reserved_output_tokens=4_000,
    )
    pack.add(name="system",
             content=open("prompts/system.md").read(),
             kind="system", priority=100,
             cache_policy="stable", required=True)
    pack.add_file("docs/STYLE_GUIDE.md", priority=85, cache_policy="stable")
    pack.add_file("README.md",           priority=70, cache_policy="stable")
    pack.add_file("CONTRIBUTING.md",     priority=65, cache_policy="stable")
    pack.add_file("CHANGELOG.md",        priority=40, compressible=True)
    for f in touched_files:
        pack.add_file(f, priority=60, kind="code", compressible=True)
    pack.add(name="diff", content=pr_diff, kind="task",
             priority=95, required=True)
    return pack

Then ran a compile audit on one real PR:

$ python -c "from agent.context import build_pack; \
  print(build_pack(open('fixtures/pr_4231.diff').read(), \
                   ['src/auth/jwt.py','src/auth/middleware.py']).compile().report())"

The output stopped me cold:

Included:
  - system          412 tokens   stable cache prefix
  - STYLE_GUIDE.md  1,840 tokens stable
  - README.md       2,086 tokens stable
  - CONTRIBUTING.md   920 tokens stable
  - jwt.py          1,250 tokens stable
  - middleware.py     880 tokens stable
  - diff            3,420 tokens required
Excluded:
  - CHANGELOG.md    8,200 tokens — token-heavy and low priority (score 31)

Cacheable prefix   7,388 tokens
Health score       92 / 100

Three findings in one report:

  1. CHANGELOG.md was adding 8,200 tokens to every prompt that I'd never explicitly chosen to include. ctxbudgeter dropped it on score because its priority (40) couldn't justify the token cost relative to the other items.
  2. We had 7,388 cacheable tokens we weren't using. Every PR was paying full price for the same static system prompt + docs.
  3. Utilization was only 54% — plenty of headroom for big diffs.

That last point is what made me realize the old "just include everything that might be useful" approach wasn't just expensive; it was crowding out the content that actually mattered.
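How did CHANGELOG.md lose its slot even though 10,808 + 8,200 tokens would still fit the 20,000-token effective budget (24,000 minus the reserved output)? The report's "score 31" implies items are scored, not just greedily packed. Here's my guess at the shape of that rule; the weight is fitted to reproduce the one score I observed and is purely illustrative, not the library's real formula:

from dataclasses import dataclass

@dataclass
class Item:
    name: str
    priority: int   # 0-100, author-assigned
    tokens: int

# My guess at the *shape* of the exclusion rule, not ctxbudgeter's real
# formula: priority is discounted by the share of the budget an item
# would eat, and items falling under a cutoff are dropped with a reason.
def score(item: Item, budget: int, weight: float = 22.0) -> float:
    return item.priority - weight * (item.tokens / budget)

changelog = Item("CHANGELOG.md", priority=40, tokens=8_200)
print(round(score(changelog, budget=20_000)))  # 31, matching the report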


Day 2 — Wire in cache-aware adapters

The biggest free win sitting on the table: Anthropic's prompt caching. We weren't using it at all. ctxbudgeter ships a to_anthropic_request adapter that places cache_control on the last stable system block automatically — no manual tracking required.

from ctxbudgeter.adapters import to_anthropic_request
import anthropic

client = anthropic.Anthropic()

def review(pr_diff: str, touched_files: list[str]) -> str:
    compiled = build_pack(pr_diff, touched_files).compile()
    resp = client.messages.create(
        **to_anthropic_request(compiled, user_message="Begin review.")
    )
    return resp.content[0].text

That's the entire diff between "no caching" and "fully cache-aware." Internally, the adapter walks the compiled prompt, finds the last item with cache_policy="stable", and marks it with {"cache_control": {"type": "ephemeral"}} — covering the longest possible prefix from a single breakpoint (Anthropic allows up to 4, but one well-placed beats four sloppy ones).
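For readers new to prompt caching, here's roughly the request shape the adapter produces on Anthropic's Messages API. This is a hand-written illustration with placeholder strings, not the adapter's literal output:

# Placeholder contents standing in for the compiled items:
system_md, style_guide_md, readme_md, diff_text = "...", "...", "...", "..."

request = {
    "model": "claude-sonnet-4.6",
    "max_tokens": 4_000,
    "system": [
        {"type": "text", "text": system_md},       # stable
        {"type": "text", "text": style_guide_md},  # stable
        {"type": "text", "text": readme_md,        # last stable block:
         "cache_control": {"type": "ephemeral"}},  # prefix is cached here
    ],
    "messages": [
        {"role": "user", "content": diff_text},    # volatile, never cached
    ],
}

Everything up to and including the marked block becomes the cached prefix; the user message stays full price.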

The next two requests in production:

First call:   input_tokens=3,420   cache_creation=7,388   cache_read=0
Second call:  input_tokens=3,420   cache_creation=0       cache_read=7,388

Cost per call dropped from $0.032 (the post-audit 10,808 tokens at full price; Day 1 had already cut it from $0.066) to $0.013, a 60% cut on input. Latency on warm-cache calls dropped from 8s to 3s.
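(The warm-call arithmetic, assuming Sonnet-class rates of roughly $3/MTok for fresh input and the usual 10x discount on cache reads: 3,420 × $3/M ≈ $0.0103 plus 7,388 × $0.30/M ≈ $0.0022, which totals about $0.013. Cache writes carry a small one-time premium on the first call; it amortizes to nothing at our volume. Rates quoted from memory, so check the current price list.)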

The same adapter exists for OpenAI (to_openai_request) and computes a deterministic prompt_cache_key from a SHA256 of the stable prefix — useful when you have multiple PR queues hitting the same warm cache. LangChain and PydanticAI adapters are there too. The whole thing is framework-agnostic.
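The key derivation itself is small enough to show inline. Assuming you have the stable prefix as one string (hypothetical variable below), it's a plain SHA256:

import hashlib

def prompt_cache_key(stable_prefix: str) -> str:
    # Same stable prefix -> same key, so every worker and PR queue
    # routes to the same warm server-side cache.
    return hashlib.sha256(stable_prefix.encode("utf-8")).hexdigest()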


Day 3 — Stop blanket-loading source files. Use References.

Eager-loading every "touched file" in the PR was a worse problem than I'd realized. On a small 3-line fix it was fine — maybe 800 tokens. On a 30-file refactor it was 30,000+ tokens of source code, most of which had nothing to do with the actual change. ctxbudgeter supports lazy References — pointers that only resolve if they could plausibly fit the budget. This is Anthropic's "just-in-time" pattern, built into the library.

I wired up a vector-search loader that pulls only the symbols the diff actually touches:

from ctxbudgeter import ContextPack
from ctxbudgeter.loaders import register_loader

@register_loader("vector_search")
def vector_search(ref):
    # ref.location is the search query (a function name from the diff);
    # embeddings_store is our existing vector index, defined elsewhere.
    hits = embeddings_store.search(ref.location, k=1)
    return hits[0].content

def build_pack(pr_diff, diff_symbols):
    pack = ContextPack(
        model="claude-sonnet-4.6",
        token_budget=24_000,
        reserved_output_tokens=4_000,
    )
    pack.add(name="system",
             content=open("prompts/system.md").read(),
             kind="system", priority=100,
             cache_policy="stable", required=True)
    pack.add(name="diff", content=pr_diff,
             kind="task", priority=95, required=True)

    for sym in diff_symbols:
        pack.add_reference(
            name=f"def:{sym}",
            location=sym,                # query passed to the loader
            loader=vector_search,
            estimated_tokens=400,        # rough avg per symbol
            kind="code", priority=70,
            cache_policy="stable",
        )
    return pack

The compiler skips references whose estimated_tokens clearly won't fit before the loader is even called — so we don't pay for vector lookups on context we'd just throw away.
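In other words (my paraphrase of the observed behavior, not the library's source):

def maybe_resolve(ref, remaining_budget: int):
    # References are screened on their estimate before the loader runs,
    # so an over-budget reference costs zero vector queries and zero tokens.
    if ref.estimated_tokens > remaining_budget:
        return None
    return ref.loader(ref)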

Results:

| PR size | Eager-loaded tokens | JIT-loaded tokens |
| --- | --- | --- |
| Small (3 symbols) | ~3,000 | ~1,200 |
| 30-file refactor | 30,000+ | ~1,200 |

Token cost is now a function of the change, not the codebase size.


Day 4 — Stop leaking API keys into prompts

The incident I mentioned at the top — a developer's sk-... snippet shipping to Anthropic — was the kind of bug that's invisible until it isn't. I added a regex-based detector and used ctxbudgeter's sensitivity policy to enforce it at compile time:

import re

SECRET_RE = re.compile(
    r"sk-[A-Za-z0-9]{20,}|ghp_[A-Za-z0-9]{36}|AKIA[0-9A-Z]{16}"
)

# Inside build_pack(): anything that trips the detector goes through
# the sensitivity gate instead of the plain add_file path.
for f in touched_files:
    content = open(f).read()
    if SECRET_RE.search(content):
        pack.add(name=f, content=content, kind="code",
                 sensitivity="secret", priority=60)
    else:
        pack.add_file(f, priority=60, kind="code")

pack.set_secret_policy("refuse")  # raises SecretContentError

set_secret_policy("refuse") makes the compiler raise SecretContentError at compile time if any sensitivity="secret" item would land in the compiled output; that is what we run in production. You can also redact (replaces content with [REDACTED — sensitivity=secret] and flags it in the report), warn (allows but flags), or allow (silent, the escape hatch).

In redact mode the compile report shows:

Included:
  ...
  - jwt_config.py: 10 tokens, REDACTED, code [!secret]

In production we use refuse. If a secret is detected, the build halts and a human looks at why. The "we accidentally shipped a credential to a third-party model" class of bug is now structurally impossible.

In production you'd run a real secret scanner like gitleaks or trufflehog instead of my toy regex. The point isn't the detector. The point is that ctxbudgeter gives you the enforcement layer. Your policy lives in code, gated by tests. Not in someone's head.
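For example, a minimal swap-in that shells out to gitleaks might look like this (flags from memory; verify against the version you pin, and handle "gitleaks crashed" separately from "leaks found" in real code):

import subprocess

def has_secrets(path: str) -> bool:
    # gitleaks exits non-zero when it finds leaks in no-git mode.
    result = subprocess.run(
        ["gitleaks", "detect", "--no-git", "--source", path],
        capture_output=True,
    )
    return result.returncode != 0

Swap SECRET_RE.search(content) for has_secrets(f) and the enforcement gate stays identical.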


Day 5 — The bot stops repeating itself

One reviewer on our team kept dismissing the bot's docstring comments: "We follow Google style, but loosely. Skip the nits." Every PR, the bot would say it. Every PR, she'd dismiss it. The bot had no memory.

ctxbudgeter ships a MemoryStore abstraction for the LangChain "Write" strategy:

from ctxbudgeter import JSONMemoryStore, MemoryNote

mem = JSONMemoryStore(".ctxbudgeter/review_memory.json")

# When a reviewer dismisses a comment as a nit:
mem.write(MemoryNote(
    key=f"pref:{reviewer_id}:docstrings",
    content="This reviewer treats docstring nits as noise. Skip unless egregious.",
    tags=[reviewer_id, "style", "preferences"],
))

On the next PR by the same reviewer:

pack.add_memory(mem, tags=[reviewer_id], limit=5,
                priority=75, cache_policy="stable")

The compile report now shows:

Included:
  - memory_pref:alice:docstrings   18 tokens, stable cache prefix

18 tokens. Folded into the cacheable prefix. After the first call per reviewer it costs effectively nothing — about $0.0000054 to include.

The store has three implementations: InMemoryStore (process-local, for tests), JSONMemoryStore (file-backed, for simple cases), and the abstract base class so you can plug in Pinecone, Chroma, pgvector — whatever your stack already has.
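As a flavor of that plugging-in, here's a sketch of a Chroma-backed store. I'm extrapolating the interface from the write() call and tag-filtered reads shown above; the base-class name and read() signature are my assumptions, so check the actual ABC first:

from ctxbudgeter import MemoryNote, MemoryStore  # MemoryStore name assumed

class ChromaMemoryStore(MemoryStore):
    def __init__(self, collection):
        self.collection = collection  # an existing chromadb collection

    def write(self, note: MemoryNote) -> None:
        # Upsert by key so rewriting a preference overwrites, not duplicates.
        self.collection.upsert(
            ids=[note.key],
            documents=[note.content],
            metadatas=[{"tags": ",".join(note.tags)}],
        )

    def read(self, tags: list[str], limit: int) -> list[str]:
        # Signature assumed; semantic search over the joined tags.
        hits = self.collection.query(query_texts=[" ".join(tags)],
                                     n_results=limit)
        return hits["documents"][0]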

pack.fork() provides the LangChain "Isolate" strategy for subagent context separation, but I didn't need it for the code-review bot. Four strategies, one library.


Day 6 — Lock the quality with pytest

The bot worked. Now the trick: stop it from drifting.

ctxbudgeter ships a pytest plugin and an assertion library — what I think of as "pytest for prompts." A test file dropped into our existing test suite:

from ctxbudgeter.testing import (
    assert_includes, assert_excludes,
    assert_health_at_least, assert_no_secret_items,
    assert_cacheable_prefix_at_least, assert_used_tokens_at_most,
)
from agent.context import build_pack

PR_DIFF = open("fixtures/pr_4231.diff").read()
FILES   = ["src/auth/jwt.py", "src/auth/middleware.py"]

def test_pack_health():
    c = build_pack(PR_DIFF, FILES).compile()
    assert_includes(c, "system", "diff")
    assert_excludes(c, "CHANGELOG.md")
    assert_health_at_least(c, 85)
    assert_cacheable_prefix_at_least(c, 4096)
    assert_used_tokens_at_most(c, 20_000)
    assert_no_secret_items(c)

def test_pack_golden(ctxbudgeter_golden):
    # Stores tests/golden/test_pack_golden.json on first run.
    # Future runs diff against it — fails on unexpected context drift.
    ctxbudgeter_golden().check(build_pack(PR_DIFF, FILES).compile())

The first run creates the golden snapshot. Subsequent runs diff against it.

A week later, an engineer added pack.add_file("HUGE_LEGACY_DOC.md", priority=99) "just to give the bot more context." Pre-merge CI:

$ pytest tests/test_context_quality.py
FAILED  test_pack_golden — Golden mismatch.
  included_order:           -["middleware.py"]  +["HUGE_LEGACY_DOC.md"]
  cacheable_prefix_tokens:  expected 7388, got 14920

The regression got caught before it shipped. Two options for the engineer: justify the change intentionally with pytest --ctxbudgeter-update-golden (which refreshes the snapshot), or revert. The right outcome, every time.

Quality drift in prompts is now structurally the same as quality drift in code: pytest fails, PR doesn't merge.


Day 7 — Ship to CI as a quality gate

The final move: extract the pack definition into version-controlled config so the team can review it in PRs.

# pack.yaml
model: claude-sonnet-4.6
token_budget: 24000
reserved_output_tokens: 4000
secret_policy: refuse

items:
  - name: system
    from_file: prompts/system.md
    kind: system
    priority: 100
    required: true
    cache_policy: stable

  - name: style_guide
    from_file: docs/STYLE_GUIDE.md
    kind: project_doc
    priority: 85
    cache_policy: stable

references:
  - name: arch_overview
    location: "system architecture"
    loader: vector_search
    estimated_tokens: 600
    kind: retrieval
    priority: 70
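At runtime the bot loads the static pack and attaches the per-PR pieces before compiling. The constructor name below is my shorthand for whatever loader the README documents, so treat it as assumed:

from ctxbudgeter import ContextPack

pack = ContextPack.from_yaml("pack.yaml")   # assumed loader name
pack.add(name="diff", content=pr_diff,      # pr_diff from the webhook payload
         kind="task", priority=95, required=True)
compiled = pack.compile()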

Then a GitHub Action that comments on every PR with the compiled report and fails the build if health drops below 85:

# .github/workflows/context-check.yml
name: context-check
on: [pull_request]
jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - uses: actions/setup-python@v5
        with: { python-version: "3.12" }
      - run: pip install "ctxbudgeter[all]"
      - run: ctxbudgeter validate pack.yaml
      - run: |
          ctxbudgeter pack pack.yaml \
            --format markdown \
            --fail-below 85 \
            -o report.md
      - uses: marocchino/sticky-pull-request-comment@v2
        with: { path: report.md }

Now every PR gets a sticky comment showing exactly what context the bot will use — included items with token counts, excluded items with reasons, the cacheable prefix size, the health score breakdown. Reviewers can see at a glance: "did this PR change what the bot can see, and if so, why?"

Context became a reviewable concern. Same way config-as-code became reviewable a decade ago.


The before / after

A week of work, mostly afternoons. Here are the numbers from the actual production bot:

| Metric | Before | After |
| --- | --- | --- |
| Input tokens per call | ~22,000 (uncontrolled) | 3,420 + 7,388 cached |
| Cache hit rate | 0% | ~95% (warm prefix) |
| Cost per PR review | $0.066 | $0.013 |
| Latency on cache hit | 8s | 3s |
| Secret leak risk | Real (1 incident / 6 months) | Compile-time refusal |
| Context regression risk | "Hope nobody changes the prompt" | Pytest gate + golden snapshot |
| Reviewability | Buried in build_prompt.py | pack.yaml in every PR |

At our volume (~1,500 PRs/month):

  • $80/mo saved in direct LLM costs. Modest.
  • ~5 minutes/day cut from p95 review latency. Adds up.
  • One whole class of production incident eliminated. Priceless.

The cost number isn't the headline. The transformative bit is the operational confidence — we now treat prompt engineering with the same discipline as we treat code, with CI gates and reviewable artifacts. The team stopped having "I think we should add X to the prompt" debates that went nowhere; it's a PR now.


What I learned building this

Three things stuck with me:

Deterministic beats clever. The temptation when building a prompt assembler is to add an LLM somewhere — "let it summarize," "let it pick the best examples." I deliberately avoided every place that would have introduced an LLM call into the core compiler. The result: same inputs always produce the same compiled prompt, the same health score, the same audit trail. Debuggable in CI. Reproducible across machines. Tests pass deterministically. The compression hook is user-provided, so if you want an LLM in the loop, you bring it in explicitly. Zero LLM calls in the core, by default, was the single most important design decision.

The audit trail is the product. Most of ctxbudgeter's lines aren't the scoring formula or the budget math — they're generating clear, human-readable reasons for every decision. "Excluded — token-heavy and low priority (8,200 tokens, score 31)" is worth a thousand lines of internal metrics. Developers love tools that tell them why something happened.

Pytest is the right surface. I considered building a custom assertion DSL. I'm glad I didn't. By piggybacking on pytest — fixtures, plugins, parametrize, conftest — the eval/assert layer slots into existing test suites with zero new mental overhead. Engineers already know pytest. Now they can write assert_health_at_least(c, 80) the same way they write assert response.status_code == 200.
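One thing that falls out for free: parametrize. The same health gate can run across several PR shapes with stock pytest machinery (fixture paths below are illustrative):

import pytest
from ctxbudgeter.testing import assert_health_at_least, assert_no_secret_items
from agent.context import build_pack

@pytest.mark.parametrize("diff_path,files", [
    ("fixtures/pr_small.diff",    ["src/auth/jwt.py"]),
    ("fixtures/pr_refactor.diff", ["src/auth/jwt.py", "src/auth/middleware.py"]),
])
def test_health_across_pr_shapes(diff_path, files):
    c = build_pack(open(diff_path).read(), files).compile()
    assert_health_at_least(c, 85)
    assert_no_secret_items(c)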


What's next

ctxbudgeter is live at v0.2.0 — pip install ctxbudgeter. The roadmap:

  • v0.3 — Compiler: cost projection (estimated $ per call before send), multi-model packs, side-by-side diff tool
  • v0.3 — Memory & RAG: vector store adapter (Pinecone, Chroma, pgvector), auto-summarization compressor
  • v0.4 — Ecosystem: LangGraph node decorator, PydanticAI agent integration, OpenTelemetry tracing
  • v0.5 — UX: web dashboard for compile reports, hot-reload pack.yaml, VS Code extension

MIT licensed. Open source. Contributions welcome.


Try it

pip install ctxbudgeter

Source: github.com/Kayariyan28/ctxbudgeter
PyPI: pypi.org/project/ctxbudgeter

If you build agents and bleed tokens, give it 15 minutes. The README walks through the full case study with copy-pasteable code; you can have a compile report running on your own context inside an afternoon.

And if it saves you tokens, time, or a 3am incident — drop a ⭐ on the repo. That's the signal I need to keep shipping v0.3.


Karan Chandra Dey (@K28) is the founder of K28 Design Lab, helping SMEs ship their first AI MVP. ctxbudgeter is his first open-source release under the K28 brand. Find him on LinkedIn or GitHub.
