<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jonathan Barazany</title>
    <description>The latest articles on DEV Community by Jonathan Barazany (@johnib).</description>
    <link>https://dev.to/johnib</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3835698%2Fd6ee4a57-fac3-4472-b34e-52241d49669a.png</url>
      <title>DEV Community: Jonathan Barazany</title>
      <link>https://dev.to/johnib</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/johnib"/>
    <language>en</language>
    <item>
      <title>The 5-Hour Quota, Boris's Tweet, and What the Source Code Actually Reveals</title>
      <dc:creator>Jonathan Barazany</dc:creator>
      <pubDate>Wed, 01 Apr 2026 11:26:14 +0000</pubDate>
      <link>https://dev.to/johnib/the-5-hour-quota-boriss-tweet-and-what-the-source-code-actually-reveals-39ic</link>
      <guid>https://dev.to/johnib/the-5-hour-quota-boriss-tweet-and-what-the-source-code-actually-reveals-39ic</guid>
      <description>&lt;p&gt;Yesterday I published a deep dive into Claude Code's compaction engine. At the end, I made a promise: go deeper on the caching optimizations that happen &lt;em&gt;outside&lt;/em&gt; of compaction.&lt;/p&gt;

&lt;p&gt;But actually, the caching rabbit hole started before that post - because of a tweet from about ten days ago.&lt;/p&gt;

&lt;h2&gt;The Tweet That Confused Me&lt;/h2&gt;

&lt;p&gt;If you're a heavy Claude Code user, you felt the 5-hour usage cap snap shut after Anthropic's two-week promotional window closed. The complaints flooded in. Someone tagged Boris - the engineer who built Claude Code - asking what he planned to do about it.&lt;/p&gt;

&lt;p&gt;His answer: improvements are coming to squeeze more out of the current quota.&lt;/p&gt;

&lt;p&gt;My first reaction: &lt;em&gt;what can he possibly do?&lt;/em&gt; The quota is server-side. It's rate limits and token budgets. There's no client trick that changes how many tokens you're allowed per hour.&lt;/p&gt;

&lt;p&gt;That question sat with me. Then yesterday's compaction post led me to look harder at the source - and the answer became obvious.&lt;/p&gt;

&lt;h2&gt;Cache Hit Ratio &lt;em&gt;Is&lt;/em&gt; the Quota&lt;/h2&gt;

&lt;p&gt;Every message you send to Claude Code costs tokens. But tokens aren't priced flat. Cached reads are heavily discounted - roughly 90% off the base input price - while cache writes cost 1.25x. Miss the cache and you're not just paying full price, you're paying a penalty.&lt;/p&gt;

&lt;p&gt;If your cache hit ratio is high, you stretch the same quota dramatically further than someone whose cache keeps busting. The quota doesn't change. What you extract from it does.&lt;/p&gt;
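
&lt;p&gt;To make that concrete, here is a back-of-envelope cost model. It assumes only the two figures cited across these posts - cached reads at roughly a 90% discount, cache writes at 1.25x the base input price - and is an illustration, not Anthropic's billing math.&lt;/p&gt;

```python
# Hypothetical cost model: cached reads at 10% of base price, cache
# writes at 125%. These multipliers are assumptions taken from the
# discussion above, not an official pricing table.

def effective_cost(tokens, hit_ratio, read_discount=0.10, write_premium=1.25):
    """Cost of `tokens` input tokens, in base-price units."""
    hits = tokens * hit_ratio
    misses = tokens * (1.0 - hit_ratio)
    return hits * read_discount + misses * write_premium

# How many raw tokens does the same spend buy at different hit ratios?
quota = 1_000_000  # base-price units of spend
for ratio in (0.0, 0.5, 0.9):
    print(f"hit ratio {ratio:.0%}: {quota / effective_cost(1.0, ratio):,.0f} tokens")
```

&lt;p&gt;Under these assumptions, a 90% hit ratio stretches the same spend almost six times further than a cold cache - which is exactly the lever a client-side fix can pull.&lt;/p&gt;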

&lt;p&gt;This is the reframe. When Boris says improvements are coming, he's not talking about changing server limits. He's talking about recovering cache hit ratio - which is the same thing as handing quota back to users.&lt;/p&gt;

&lt;h2&gt;What Claude Code Already Does About This&lt;/h2&gt;

&lt;p&gt;When I asked Claude to analyze its own source code, what came back wasn't a simple "we cache the system prompt." It was twelve distinct mechanisms working together, each one plugging a specific leak.&lt;/p&gt;

&lt;p&gt;Two stood out - and they reveal how deeply Anthropic thinks about cache economics.&lt;/p&gt;

&lt;p&gt;The first solves a combinatorial explosion: five runtime booleans in the system prompt mean 2^5 = 32 possible cache entries, most of which will never get a second hit. Claude Code's fix involves a literal boundary string in the source that splits stable content from dynamic content, with the stable prefix shared globally across every user on Earth.&lt;/p&gt;
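
&lt;p&gt;The shape of that fix is easy to sketch. What follows is an illustration of the idea, not Claude Code's actual implementation - only the marker's name comes from the source; everything else is hypothetical:&lt;/p&gt;

```python
# Illustration only - not Claude Code's implementation. The marker
# name is the one cited in this post; the rest is hypothetical.
BOUNDARY = "__SYSTEM_PROMPT_DYNAMIC_BOUNDARY__"

def split_system_prompt(prompt):
    """Split into (stable_prefix, dynamic_suffix) at the boundary."""
    stable, _, dynamic = prompt.partition(BOUNDARY)
    return stable, dynamic

prompt = (
    "You are a coding agent. ...long text identical for every user...\n"
    + BOUNDARY + "\n"
    + "verbose=true\nsandbox=false\n"   # runtime booleans live here
)
stable, dynamic = split_system_prompt(prompt)
# Cache only `stable`: one globally shared entry, instead of 2**5 = 32
# variants when five booleans are baked into the prompt body.
```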

&lt;p&gt;The second is even more interesting: a side-channel called &lt;code&gt;cache_edits&lt;/code&gt; that surgically removes old tool results from the cached KV store &lt;em&gt;without changing a single byte in the actual message&lt;/em&gt;. No cache invalidation. No reprocessing penalty.&lt;/p&gt;
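
&lt;p&gt;To be careful here: the request shape below is &lt;em&gt;my&lt;/em&gt; conceptual reconstruction, not the Anthropic API. It exists only to show why the trick works - the messages array stays byte-identical, so the cache key still matches, while a separate list tells the server what to drop:&lt;/p&gt;

```python
# Conceptual sketch - NOT the real API. Field names in the
# cache_edits payload are invented for illustration.
messages = [
    {"role": "user", "content": "run the tests"},
    {"role": "assistant", "content": [{"type": "tool_use", "id": "t1"}]},
    {"role": "user", "content": [{
        "type": "tool_result", "tool_use_id": "t1",
        "content": "...500 lines of stale pytest output...",
    }]},
]

request = {
    "messages": messages,  # byte-identical to the previously cached turn
    "cache_edits": [       # hypothetical side-channel payload
        {"op": "clear_tool_result", "tool_use_id": "t1"},
    ],
}
# Nothing in `messages` changed, so the cached prefix is still valid;
# the edit list alone tells the server which KV entries to evict.
```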

&lt;p&gt;But those are just two of twelve mechanisms. The full picture includes a 728-line diagnostic system that treats cache misses as bugs, a function literally named &lt;code&gt;DANGEROUS_uncachedSystemPromptSection()&lt;/code&gt;, and a one-sentence prompt rewrite that saved 20K tokens per budget flip.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://barazany.dev/blog/claude-code-token-caching" rel="noopener noreferrer"&gt;Read the full source code analysis on my blog →&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's what you'll find in the full post:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How &lt;code&gt;__SYSTEM_PROMPT_DYNAMIC_BOUNDARY__&lt;/code&gt; solves the 2^N cache key explosion with Blake2b prefix hashing&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;cache_edits&lt;/code&gt; side-channel: surgery without invalidation&lt;/li&gt;
&lt;li&gt;Why there's a function called &lt;code&gt;DANGEROUS_uncachedSystemPromptSection()&lt;/code&gt; and what it forces engineers to do&lt;/li&gt;
&lt;li&gt;The real mechanism behind the &lt;code&gt;/clear&lt;/code&gt; warning (it's called "willow" internally)&lt;/li&gt;
&lt;li&gt;What Boris can actually ship to stretch your quota further&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Previously: &lt;a href="https://barazany.dev/blog/claude-codes-compaction-engine" rel="noopener noreferrer"&gt;Claude Code's Compaction Engine: What the Source Code Actually Reveals&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devtools</category>
      <category>productivity</category>
      <category>programming</category>
    </item>
    <item>
      <title>We Were About to Buy an Automation Platform. We Already Had One.</title>
      <dc:creator>Jonathan Barazany</dc:creator>
      <pubDate>Wed, 01 Apr 2026 06:18:17 +0000</pubDate>
      <link>https://dev.to/johnib/we-were-about-to-buy-an-automation-platform-we-already-had-one-4k4a</link>
      <guid>https://dev.to/johnib/we-were-about-to-buy-an-automation-platform-we-already-had-one-4k4a</guid>
      <description>&lt;p&gt;When Cowork launched in January, I didn't understand the hype.&lt;/p&gt;

&lt;p&gt;We'd deployed Claude Enterprise at Nayax. Hundreds of employees now had access to a tool most of them had only dreamed about — connected to Jira, Outlook, Teams, Confluence, Snowflake. People were doing things that were simply out of reach before. That felt like the win.&lt;/p&gt;

&lt;p&gt;Cowork existed. I knew it. But the initial version was a local file-management tool — point it at a folder, let it sort your downloads. Useful, but not what I needed. And Chat, combined with the connectors we'd spent months carefully plugging in, was already doing incredible things. So Cowork stayed on the shelf.&lt;/p&gt;

&lt;p&gt;Then the automation requests started coming in.&lt;/p&gt;

&lt;h2&gt;The Death Spiral&lt;/h2&gt;

&lt;p&gt;Every department had something they wanted to automate. IT wanted to auto-respond to incomplete helpdesk tickets. Finance needed meeting summaries routed somewhere useful. Managers wanted Jira tickets updated based on what happened in their standups. Everyone had a list.&lt;/p&gt;

&lt;p&gt;We started evaluating n8n and Workato. That's when the death spiral began.&lt;/p&gt;

&lt;p&gt;How does Andy from HR run a workflow under his own credentials? How many environments do we need? Who educates the entire company on workflow variables, testing, monitoring? The n8n enterprise license alone wasn't cheap. And even I — with an engineering background and hands-on software experience — had struggled with n8n. I eventually pulled off what I needed by connecting it to Claude, but when it broke, I still had to debug it in the n8n portal. That's a learning curve that shuts most people out entirely.&lt;/p&gt;

&lt;p&gt;I didn't see a way out. No clean answer, just more governance discussions.&lt;/p&gt;

&lt;h2&gt;The Moment That Changed It&lt;/h2&gt;

&lt;p&gt;On February 24, Anthropic shipped the full enterprise push for Cowork — plugins, deep integrations, and &lt;code&gt;/schedule&lt;/code&gt;. My first reaction: "What could I actually do with that?"&lt;/p&gt;

&lt;p&gt;I sat with that question. And then it clicked — not because of the scheduling feature itself, but because of what was already underneath it.&lt;/p&gt;

&lt;p&gt;Cowork runs in each user's context. It logs in as them. It inherits their permissions. And it connects to every SaaS integration we'd spent the previous months approving and configuring — already scoped, already governed, already ours. No new environments. No credential sharing. No governance meeting required.&lt;/p&gt;

&lt;p&gt;The automation platform we'd been debating how to buy was already deployed.&lt;/p&gt;

&lt;h2&gt;So How Does a Non-Technical Person Actually Build the Workflow?&lt;/h2&gt;

&lt;p&gt;That's the part that surprised me most. The governance problem was solved — but I still had a nagging question: how does Andy from HR actually build his automation?&lt;/p&gt;

&lt;p&gt;The answer involves a mechanism where Claude watches your successful conversation, composes a reusable skill from it, then &lt;em&gt;tests that skill against a clean session&lt;/em&gt; and produces an eval report. No nodes. No variables. No portal debugging.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://barazany.dev/blog/we-already-had-the-automation-platform" rel="noopener noreferrer"&gt;Read the full breakdown on my blog →&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The full post covers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How the "skill that writes skills" actually works in practice&lt;/li&gt;
&lt;li&gt;The specific automations I've been running for weeks (enriched work items, auto-responses, usage reports, meeting summaries → Jira)&lt;/li&gt;
&lt;li&gt;Why the connectors and governance we built over 3 months turned out to be the foundation for something none of us expected&lt;/li&gt;
&lt;li&gt;What rollout looks like next — and why it needs its own workshop&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>devtools</category>
      <category>productivity</category>
      <category>programming</category>
    </item>
    <item>
      <title>What Karpathy's Autoresearch Unlocked for Me</title>
      <dc:creator>Jonathan Barazany</dc:creator>
      <pubDate>Tue, 31 Mar 2026 22:58:56 +0000</pubDate>
      <link>https://dev.to/johnib/what-karpathys-autoresearch-unlocked-for-me-24k7</link>
      <guid>https://dev.to/johnib/what-karpathys-autoresearch-unlocked-for-me-24k7</guid>
      <description>&lt;p&gt;I'm not a data scientist. I've trained a few models before — simple classification problems, with AI writing the Python and me running the iterations. It worked. I got confident.&lt;/p&gt;

&lt;p&gt;Then a friend asked for help with something harder.&lt;/p&gt;

&lt;h2&gt;Three Weeks at 0.58&lt;/h2&gt;

&lt;p&gt;The problem involved predicting an outcome from a mix of CRM data and call recordings. Not trivial, but not exotic either.&lt;/p&gt;

&lt;p&gt;Quick primer on AUC — the metric I'll use throughout. Imagine your model looks at two random people: one where the answer is yes, one where it's no. AUC measures how often the model correctly ranks the yes above the no. Score of 0.5 means random guessing. Score of 1.0 means always right.&lt;/p&gt;
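
&lt;p&gt;That pairwise definition translates directly into code - this is the textbook formulation of AUC, written out naively for clarity (ties count half):&lt;/p&gt;

```python
# AUC via its pairwise definition: for every (yes, no) pair, how
# often does the model score the yes case higher? Ties count half.
def pairwise_auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.35, 0.3, 0.1]   # model outputs
labels = [1,   0,   1,    0,   0]     # ground truth
print(pairwise_auc(scores, labels))   # 5/6: five of six pairs ranked right
```

&lt;p&gt;A real pipeline would use &lt;code&gt;sklearn.metrics.roc_auc_score&lt;/code&gt;, which computes the same quantity in O(n log n); the nested loop above is just the definition made literal.&lt;/p&gt;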

&lt;p&gt;I tried everything I knew: XGBoost, feature engineering, extracting features from transcripts using AI models, trying different combinations. I assumed more data meant better results — that's how it's supposed to work. Instead, every time I added more features, the AUC dropped, sometimes below 0.5 — meaning the model was actively misleading; it would've been better to ignore it entirely and flip a coin.&lt;/p&gt;

&lt;p&gt;My ceiling was 0.581. I couldn't break it no matter what I did.&lt;/p&gt;

&lt;p&gt;I stepped away from the problem for a week. Talked it through with a friend who actually knows this domain. Nothing clicked. I was out of moves.&lt;/p&gt;

&lt;p&gt;Then Karpathy &lt;a href="https://x.com/karpathy/status/2030371219518931079" rel="noopener noreferrer"&gt;posted about autoresearch&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;The Gold Wasn't the Model&lt;/h2&gt;

&lt;p&gt;The post generated a ton of hype. For two days I kept asking myself: how do I use what he built on my problem? His project was about training a small language model — not a classification problem like mine. On the surface, nothing transferred.&lt;/p&gt;

&lt;p&gt;But that was the wrong question. The interesting part wasn't what Karpathy was training. It was &lt;em&gt;how&lt;/em&gt;. A fixed, uncheat-able validation metric. An agent that modifies only the training script. A loop that runs while you sleep. The scientific method, automated.&lt;/p&gt;
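
&lt;p&gt;Stripped to its skeleton, the loop is small enough to sketch. This is my minimal reading of the method, not Karpathy's code - the frozen evaluation script is the whole trick, because the agent may rewrite the training script but can never touch the number it's judged by:&lt;/p&gt;

```python
# Minimal sketch of the autoresearch loop - an interpretation of the
# method, not Karpathy's implementation. All names are illustrative.
import subprocess

def frozen_eval():
    """The agent may never edit evaluate.py; it prints one number."""
    out = subprocess.run(["python", "evaluate.py"],
                         capture_output=True, text=True, check=True)
    return float(out.stdout.strip())

def autoresearch(propose_edit, run_experiment, n_rounds=30):
    """propose_edit(history) lets the agent rewrite train.py only."""
    best, history = 0.0, []
    for _ in range(n_rounds):
        propose_edit(history)       # agent mutates the training script
        score = run_experiment()    # frozen metric - uncheatable
        history.append(score)
        best = max(best, score)
    return best
```

&lt;p&gt;In practice &lt;code&gt;run_experiment&lt;/code&gt; would train and then call the frozen eval; the human's role is the nudge - stepping in with a question when &lt;code&gt;history&lt;/code&gt; plateaus.&lt;/p&gt;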

&lt;p&gt;I had a problem sitting unfinished. I had the method. I started a tmux session from my iPhone on Friday night and let it run.&lt;/p&gt;

&lt;h2&gt;What Happened Next&lt;/h2&gt;

&lt;p&gt;By experiment 30, the agent declared it had exhausted all options. I didn't accept that — I just asked it questions. That conversation unlocked something unexpected: a technique I'd never heard of that jumped AUC from 0.581 to 0.628 in one step.&lt;/p&gt;

&lt;p&gt;From there, across 165 experiments, the agent built a system that pushed AUC to &lt;strong&gt;0.6747&lt;/strong&gt; — a 15.6% gain from a dataset I was ready to abandon. And the dynamic that made it work — agent explores, hits a ceiling, human nudges, agent continues — repeated itself throughout the entire run.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://barazany.dev/blog/what-karpathys-autoresearch-unlocked-for-me" rel="noopener noreferrer"&gt;Read the full story with the complete results table and methodology →&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The full post covers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The exact stacking architecture that broke through the 0.58 ceiling&lt;/li&gt;
&lt;li&gt;How "rubber duck debugging" with an AI agent led to techniques I didn't know existed&lt;/li&gt;
&lt;li&gt;The complete experiment-by-experiment results table (0.581 → 0.6747)&lt;/li&gt;
&lt;li&gt;What happens when the agent spawns its own research sub-agent mid-run&lt;/li&gt;
&lt;li&gt;Why staying close to what's emerging isn't optional anymore&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>llm</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Claude Code's Compaction Engine: What the Source Code Actually Reveals</title>
      <dc:creator>Jonathan Barazany</dc:creator>
      <pubDate>Tue, 31 Mar 2026 22:43:23 +0000</pubDate>
      <link>https://dev.to/johnib/claude-codes-compaction-engine-what-the-source-code-actually-reveals-2o4</link>
      <guid>https://dev.to/johnib/claude-codes-compaction-engine-what-the-source-code-actually-reveals-2o4</guid>
      <description>&lt;p&gt;A few months ago I wrote about &lt;a href="https://barazany.dev/blog/context-engineering-what-keeps-ai-agents-from-losing-their-minds" rel="noopener noreferrer"&gt;context engineering&lt;/a&gt; - the invisible logic that keeps AI agents from losing their minds during long sessions. I described the patterns from the outside: keep the latest file versions, trim terminal output, summarize old tool results, guard the system prompt.&lt;/p&gt;

&lt;p&gt;I also made a prediction: naive LLM summarization was a band-aid. The real work had to be deterministic curation. Summary should be the last resort.&lt;/p&gt;

&lt;p&gt;Then Claude Code's repository surfaced publicly. I asked Claude to analyze its own compaction source code.&lt;/p&gt;

&lt;p&gt;The prediction held. And the implementation is more thoughtful than I expected.&lt;/p&gt;

&lt;h2&gt;Three Tiers, Not One&lt;/h2&gt;

&lt;p&gt;Claude Code's compaction system isn't a single mechanism - it's three tiers applied in sequence, each heavier than the last.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 1&lt;/strong&gt; runs before every API call. It does lightweight cleanup: clearing old tool results, keeping only the most recent five, replacing the rest with &lt;code&gt;[Old tool result content cleared]&lt;/code&gt;. Fast, cheap, no model involved.&lt;/p&gt;
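
&lt;p&gt;In pseudocode form (mine, not the source's - names and structure are illustrative), Tier 1 is roughly:&lt;/p&gt;

```python
# Sketch of the Tier 1 behavior as described above; names and message
# shape are illustrative, not lifted from the actual source.
CLEARED = "[Old tool result content cleared]"
KEEP_RECENT = 5

def clear_old_tool_results(messages):
    """Blank every tool result except the most recent five."""
    tool_idx = [i for i, m in enumerate(messages)
                if m["kind"] == "tool_result"]
    stale = set(tool_idx[:-KEEP_RECENT])  # all but the newest five
    return [{**m, "content": CLEARED} if i in stale else m
            for i, m in enumerate(messages)]
```

&lt;p&gt;No model call, no summarization - just a list pass before each request, which is why it can afford to run every time.&lt;/p&gt;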

&lt;p&gt;&lt;strong&gt;Tier 2&lt;/strong&gt; operates at the API level - server-side strategies that handle thinking blocks and tool result clearing based on token thresholds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 3&lt;/strong&gt; is the full LLM summarization. A structured 9-section summary: intent, technical concepts, files touched, errors and fixes, all user messages, pending tasks, current work. The model reasons through the conversation before committing to the summary - a chain-of-thought scratchpad that gets stripped afterward. It's sophisticated. It's also the last resort.&lt;/p&gt;

&lt;p&gt;This architecture confirms exactly what the first article argued: summarization is expensive and lossy. You reach for it only when everything else has already run.&lt;/p&gt;

&lt;h2&gt;But Here's the Problem&lt;/h2&gt;

&lt;p&gt;My first instinct when reading about Tier 1 was: if the conversation is cached, deleting old messages invalidates the cache. And cache invalidation is brutally expensive - instead of a 90% discount on tokens, you're paying 1.25x for cache writes. You've just made compaction cost more than the tokens you saved.&lt;/p&gt;
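
&lt;p&gt;Here's the arithmetic behind that worry, using the figures above (10% of base price for cached reads, 1.25x for cache writes) - assumptions for illustration, not billing math:&lt;/p&gt;

```python
# Why naive deletion backfires, under the assumed multipliers above.
READ_DISCOUNT, WRITE_PREMIUM = 0.10, 1.25

def next_call_after_compaction(kept_tokens):
    # Deleting messages broke the cached prefix, so everything left
    # must be re-written to the cache at the premium rate.
    return kept_tokens * WRITE_PREMIUM

def next_call_without_compaction(kept_tokens, saved_tokens):
    # The whole context, including the bloat, is still a cache read.
    return (kept_tokens + saved_tokens) * READ_DISCOUNT

kept, saved = 100_000, 10_000
print(next_call_after_compaction(kept))          # about 125,000 units
print(next_call_without_compaction(kept, saved)) # about 11,000 units
# Invalidating a warm cache to shed 10K tokens costs ~11x more on the
# very next call - hence the need for edits that never touch the bytes.
```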

&lt;p&gt;So how does Claude Code solve this? The answer involves a mechanism called &lt;code&gt;cache_edits&lt;/code&gt; that surgically removes tool results without touching the cached prefix, a summarization call that piggybacks on the main conversation's cache key (the alternative showed a 98% miss rate), and a reconstruction process that rebuilds the entire session state after compaction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://barazany.dev/blog/claude-codes-compaction-engine" rel="noopener noreferrer"&gt;Read the full analysis on my blog →&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The full post covers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How &lt;code&gt;cache_edits&lt;/code&gt; preserves the prompt cache during cleanup&lt;/li&gt;
&lt;li&gt;Why the summarization call reuses your own cache key (and what happens when it doesn't)&lt;/li&gt;
&lt;li&gt;The complete post-compaction reconstruction process&lt;/li&gt;
&lt;li&gt;How cache economics shaped every architectural decision in the system&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>devtools</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
