DEV Community: Ferhat Atagün

How I shipped a blog Google couldn't see

Ferhat Atagün — Mon, 15 Jun 2026 06:57:08 +0000

Every blog post on my site looked fine. Open /blog/something, the
article was there — title, paragraphs, code blocks, the works.

Then I ran curl https://ferhatatagun.com/blog/four-tools-in-two-weekends
on a hunch, and the HTML had zero of the body text. Title in <head>,
layout chrome, a perfectly empty <div class="markdown-container" />, and
nothing else. The article only rendered after JavaScript loaded — meaning
the version Google indexed had no article in it.

This had been the case for every blog post on the site, for months.

TL;DR

The site used adapter-static with prerender = true, which suggests "all routes are rendered to HTML at build time." That's true for the page chrome — but not the body.
The Markdown component parsed content inside onMount, so the HTML on disk had a skeleton and nothing else. The article materialized only after hydration.
Two ways this hides from you: every browser you test in runs the JS, so the page looks fine; and the page reports a healthy 200 response, so monitoring stays green.
The fix is mechanical (parse markdown at module scope, inject via {@html}), but it cascaded: prerender OOMed on the worker heap, the prerender crawler followed embedded .md links and 404'd, and the static adapter's fallback was clobbering the prerendered home. Each problem appeared only because the previous one was fixed.

This is a write-up of the regression: what I missed, how it stayed hidden,
the actual code-level fix, and the three secondary failures that the fix
unblocked.

What the page was actually serving

The Svelte component looked harmless:

<script lang="ts">
    import { marked } from 'marked';
    import { gfmHeadingId } from 'marked-gfm-heading-id';
    import { mangle } from 'marked-mangle';
    import createSanitizer from 'dompurify';
    import Prism from 'prismjs';
    import { onMount } from 'svelte';

    let container: HTMLDivElement;
    export let content: string;

    onMount(() => {
        marked.use(gfmHeadingId());
        marked.use(mangle());
        const sanitizer = createSanitizer(window);
        const parsed = marked.parse(content);
        container.innerHTML = sanitizer.sanitize(parsed);
        Prism.highlightAllUnder(container);
    });
</script>

<div bind:this={container} class="markdown-container" />

The whole thing — parsing markdown, sanitising, highlighting — lives
inside onMount. That callback fires only in the browser, after
hydration. During SvelteKit's prerender pass, onMount never runs.
So the HTML on disk contains exactly what's in the template: an empty
<div class="markdown-container" />.

The article body was being added imperatively to the DOM at runtime.
That's invisible to Google. It's invisible to OG/Twitter card scrapers.
It's invisible to anyone who fetches the URL with curl.

Two checks that would have caught this and didn't:

My browser tabs were always rendering the right thing, because they ran the JS. Testing the live page by looking at it is testing the hydrated version, not the indexed version.
The page returned 200. Uptime monitors stayed green. Status pages stayed green. Lighthouse scored fine, because Lighthouse runs JS too.

The only way to see the regression is to bypass JS. curl does. So does
Googlebot's render preview. So does the View-source feature your browser
hides three menus deep. I'd been opening DevTools to inspect post-hydration
DOM for months, and never View Source on the raw response.

The numbers, once I looked:

$ curl -s https://ferhatatagun.com/blog/four-tools-in-two-weekends \
    | wc -c
32280
$ curl -s https://ferhatatagun.com/blog/four-tools-in-two-weekends \
    | grep -c "claudoscope"
0
$ curl -s https://ferhatatagun.com/blog/four-tools-in-two-weekends \
    | grep -c "TL;DR"
0

The page is 32 KB and contains no part of the article body. "claudoscope"
appears half a dozen times in the post; in the HTML, zero. Same for
"TL;DR". The HTML was 100% layout chrome.

Why `prerender = true` wasn't enough

The static adapter's prerender pass walks every route, calls the page's
load, and renders the resulting component tree to HTML. It runs all the
top-level Svelte component code. What it does not do is run lifecycle
hooks like onMount, because those are explicitly contracted to be
browser-only.

So prerender = true was doing exactly what it advertises. The bug was
that the data dependency lived behind a lifecycle that prerender skips.

The fix is to make markdown parsing module-level, not lifecycle-level:

<script lang="ts">
    import { marked } from 'marked';
    import { gfmHeadingId } from 'marked-gfm-heading-id';
    import { mangle } from 'marked-mangle';
    import { onMount } from 'svelte';
    import 'prismjs/themes/prism-tomorrow.css';

    marked.use(gfmHeadingId());
    marked.use(mangle());

    export let content: string;
    $: parsed = marked.parse(content) as string;

    let container: HTMLDivElement;

    onMount(async () => {
        const Prism = (await import('prismjs')).default;
        await import('prismjs/components/prism-typescript');
        if (container) Prism.highlightAllUnder(container);
    });
</script>

<div bind:this={container} class="markdown-container">{@html parsed}</div>

Three things changed:

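marked.use(...) moved out of onMount to the top level of the component script, which Svelte runs once per component instance on the server and in the browser. It now runs both during prerender and during hydration, configuring the same extensions in both environments.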
parsed = marked.parse(content) is a reactive top-level statement. It runs synchronously inside the component's render pass, so its output is in the HTML that goes to disk.
Prism syntax highlighting stays inside onMount, dynamic-imported. Prism touches self at import time, which is fine in the browser but not on the prerender worker. Highlighting is cosmetic: the prerendered code blocks simply stay unhighlighted until JS loads, which is acceptable.

I also dropped DOMPurify. The original code piped marked's output
through it before injecting. That was paying a sanitiser's bundle cost
plus a render-time cost, but the input was our own ?raw-imported
markdown files, not user content. Defending against ourselves was
defensive theatre. If a hostile actor can write to my markdown source,
sanitising the rendered output is the wrong layer to defend at.

The result of the fix:

$ wc -c www/build/blog/four-tools-in-two-weekends.html
45292
$ grep -c "claudoscope" www/build/blog/four-tools-in-two-weekends.html
3
$ grep -c "TL;DR" www/build/blog/four-tools-in-two-weekends.html
1
$ grep -c "<h1 id=" www/build/blog/four-tools-in-two-weekends.html
1

45 KB instead of 32 KB. The 13 KB difference is the article body —
the part Google had been seeing as empty.

The home page was even worse

The home page had a different version of the same problem. SvelteKit's
adapter-static accepts a fallback option for SPA-style hosting; if
a path doesn't have a prerendered HTML file, the server can serve the
fallback and let the client-side router resolve it.

The config was:

fallback: 'index.html'

Which writes the fallback shell to index.html. The home route at /
also prerenders to index.html. So you have two operations writing to
the same path. The fallback wins, because the adapter writes it after
the prerender. The 40 KB prerendered home gets overwritten with the
13 KB SPA shell.

$ wc -c www/build/index.html
13096

That's the bare HTML that the bundler emits as the SPA's entry point —
just the imports for the JS bundles, no body content. Anyone hitting /
with a non-JS user agent was getting that.

The fix is a one-line change:

fallback: '200.html'

200.html is a convention some static hosts (Surge, Netlify with
configuration) use to mean "the SPA fallback." The static adapter
doesn't care about the name; it just writes the fallback to whatever
path you give it. Renaming to 200.html keeps the fallback for unknown
paths without colliding with the prerendered home.
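
The collision is easy to check for mechanically after a build. A minimal sketch, assuming you already have the list of prerendered output files as paths relative to the build directory; fallbackCollides and normalise are hypothetical helpers written for this illustration, not part of adapter-static.

```typescript
// Normalise a build-relative path so "./index.html", "/index.html"
// and "index.html" all compare equal.
function normalise(path: string): string {
    return path.replace(/^\.?\/+/, "");
}

// True when the SPA fallback file name is also the output path of a
// prerendered page, i.e. one write would clobber the other.
function fallbackCollides(fallback: string, prerenderedFiles: string[]): boolean {
    const target = normalise(fallback);
    return prerenderedFiles.some((file) => normalise(file) === target);
}

// Illustrative file list, not a real build manifest.
const files = ["index.html", "blog/four-tools-in-two-weekends.html"];

console.log(fallbackCollides("index.html", files)); // true
console.log(fallbackCollides("200.html", files));   // false
```

Run as a post-build step, a true result would have flagged the overwritten home page before it shipped.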

$ wc -c www/build/index.html
40871

3.1× growth, all of it actual rendered home content.

The three problems the fix uncovered

Each of these only became visible because the previous one was fixed.

1. Prerender worker OOM

Once the markdown was actually being parsed during prerender, the
GitHub Actions build started failing with:

Error [ERR_WORKER_OUT_OF_MEMORY]: Worker terminated due to reaching
memory limit: JS heap out of memory

The default Node heap on ubuntu-latest is about 1.4 GB. The
prerender pass was now doing real work — marked.parse on every blog
post markdown source, each producing 10–15 KB of HTML. Across 14 blog
posts plus the markdown rendering inside other routes, that pushed the
worker over the line.

Two fixes:

env:
  NODE_OPTIONS: --max-old-space-size=4096

That alone unblocked it. Belt-and-suspenders, I also guarded the
marked.use(...) calls so they only configure once per Node process,
in case Vite's SSR ever re-imports the module across routes:

const __markedKey = '__omni_marked_configured__';
const __markedScope = globalThis as unknown as Record<string, boolean>;
if (!__markedScope[__markedKey]) {
    marked.use(gfmHeadingId());
    marked.use(mangle());
    __markedScope[__markedKey] = true;
}

2. The prerender crawler followed every link in the rendered HTML

Several blog posts embed relative links to translations or contributing.md
files that live in the source repos they reference — [/i18n/README.tr.md],
[/contributing.md], that kind of thing. When markdown rendering was
client-side, those got hydrated into <a> tags but the prerender crawler
never saw them.

Now the crawler sees them, follows them, and treats the 404s as build
errors:

Error: 404 /contributing.md (linked from /skills/nextjs)

These are not site routes I own — they're content links inside post bodies.
The fix is to demote prerender 404s from errors to warnings:

prerender: {
    handleHttpError: 'warn',
    handleMissingId: 'warn'
}
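
Setting both handlers to 'warn' also silences broken links to real site routes. SvelteKit accepts a function for handleHttpError as well, so the build can stay strict everywhere except the known content links. A sketch, assuming that only 404s on paths ending in .md should be tolerated; the HttpErrorDetails interface is declared locally and mirrors only the fields this example uses.

```typescript
// Local stand-in for the fields of the details object used below.
interface HttpErrorDetails {
    status: number;
    path: string;
    referrer: string | null;
    message: string;
}

// Tolerate 404s on links to Markdown files (content links inside post
// bodies); fail the build on every other prerender HTTP error.
function handleHttpError({ status, path, message }: HttpErrorDetails): void {
    if (status === 404 && path.endsWith(".md")) {
        console.warn(`ignoring: ${message}`);
        return;
    }
    throw new Error(message);
}

// In svelte.config.js (with the type annotations removed) this would be
// wired in as:  kit: { prerender: { handleHttpError } }
```

The trade-off is one more function to maintain in exchange for a build that still fails when a link to an actual route breaks.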

3. The shared host wasn't pulling from gh-pages

The repo's CI deploys to gh-pages. The ferhatatagun.com domain
points at a Spaceship shared host that I FTP-upload to. The two were
unrelated, which meant every CI deploy updated gh-pages and the live
site stayed exactly as it had been.

This wasn't a CI bug; it was a deployment-pipeline shape I'd let drift.
The fix isn't code — it's "manually FTP the gh-pages contents to the
shared host's public_html," or rebuild the deploy pipeline to push
directly. For one-shot remediation, I went with the FTP path.

How to catch the next one before it ships

The reason this regression survived for months is that none of my
verification ran with JS disabled. To prevent the next one:

Always view source on a critical route after a deploy. Not DevTools — that shows the hydrated DOM. The browser's "View Source" shows what arrived on the wire. The two should differ in trivial ways (hydration markers, attribute order); they should not differ in content.
curl | grep your most important sentinel. For a blog: a phrase you know is in the body. For a product page: the price. For a marketing page: the value prop. Make it a 10-second post-deploy check.
Test once with JavaScript disabled. It's a one-time check per major template change. The first time a critical body of text is missing from the JS-disabled page, you have the answer.
For static sites, diff the page sizes you ship over time. A 40 KB → 13 KB drop on a single route would have lit up. I had no alert because I'd never measured a baseline.
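
The size check in the last item could be as small as the following. A sketch, assuming you record a byte-size baseline per route somewhere (a JSON file in the repo, for example); sizeRegressions and the default 0.5 ratio are illustrative choices, not measured recommendations.

```typescript
type SizeMap = Record<string, number>; // route -> bytes of served HTML

interface Regression {
    route: string;
    before: number;
    after: number;
}

// Flag routes whose HTML shrank below `minRatio` of the recorded baseline.
function sizeRegressions(baseline: SizeMap, current: SizeMap, minRatio = 0.5): Regression[] {
    const flagged: Regression[] = [];
    for (const [route, before] of Object.entries(baseline)) {
        const after = current[route] ?? 0; // a missing page counts as 0 bytes
        if (after < before * minRatio) flagged.push({ route, before, after });
    }
    return flagged;
}

// Sizes taken from this post: the home page fell from 40,871 to 13,096 bytes.
const baseline: SizeMap = { "/": 40871, "/blog/four-tools-in-two-weekends": 45292 };
const current: SizeMap = { "/": 13096, "/blog/four-tools-in-two-weekends": 45292 };

console.log(sizeRegressions(baseline, current));
```

With the default ratio only the home page is flagged. Catching the blog post's 45 KB to 32 KB gap would need a tighter ratio such as 0.8, at the cost of more false alarms after legitimate edits.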

For SvelteKit specifically, the pattern is one rule: anything that
materializes data into the rendered DOM should be reactive or
top-level, not in onMount. onMount is for browser-only side
effects — DOM measurement, third-party widget init, anything that
needs window. As soon as you put content production in there, the
prerender stops seeing it. The same shape exists in React (useEffect),
Vue (onMounted), and every framework that distinguishes hydration
from render.

What this cost

Two evenings to find it. Forty-five minutes to fix it. Three
follow-up commits to deal with the secondary failures the fix exposed.

The harder cost is the months of indexing where every post's body was
empty. Google's view of those pages now has the title and the OG image
and an empty <div>. The dev.to mirrors of seven of the posts, which I
published with canonical_url pointing back to my site, were more
indexable than the originals.

Search engines will recrawl, and the originals will eventually catch up
with the mirrors. But this is the kind of bug that doesn't reverse itself
instantly; the right move after the fix is to submit a fresh sitemap,
request reindexing of the most important URLs, and wait.

The win-condition I'm watching for is the same curl | grep that
revealed the bug, run against the production URL:

$ curl -s https://ferhatatagun.com/blog/four-tools-in-two-weekends \
    | grep -c "TL;DR"
1

Zero is the bug. One is the fix.
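
The same check can run as a script after every deploy. A minimal sketch: countOccurrences and missingSentinels are hypothetical helper names, and the two HTML strings are made up to stand in for the broken and fixed responses.

```typescript
// Count non-overlapping occurrences of a sentinel phrase in raw HTML.
// Unlike `grep -c`, which counts matching lines, this counts every match.
function countOccurrences(html: string, sentinel: string): number {
    if (sentinel.length === 0) return 0;
    let count = 0;
    let index = html.indexOf(sentinel);
    while (index !== -1) {
        count += 1;
        index = html.indexOf(sentinel, index + sentinel.length);
    }
    return count;
}

// Return the sentinels that never appear in the un-hydrated HTML.
function missingSentinels(html: string, sentinels: string[]): string[] {
    return sentinels.filter((s) => countOccurrences(html, s) === 0);
}

// Illustrative inputs: an empty shell like the one curl returned, and a
// prerendered page that contains the article body.
const shell = '<div class="markdown-container"></div>';
const prerendered =
    '<div class="markdown-container"><h2>TL;DR</h2><p>claudoscope x-rays a response.</p></div>';

console.log(missingSentinels(shell, ["claudoscope", "TL;DR"]));       // both missing
console.log(missingSentinels(prerendered, ["claudoscope", "TL;DR"])); // none missing
```

In practice the HTML would come from the raw response body of the production URL, and a non-empty result is the failing case. Sentinels containing characters that HTML escapes (such as & or <) would need to be matched in their escaped form.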

This post is mirrored from ferhatatagun.com/blog/how-i-shipped-a-blog-google-couldnt-see — that's the canonical URL.

More from the same place:

Long-form blog — AI, LLM tooling, frontend at the model boundary
The five tools — browser-only, BYOK, open-source Claude API dev tools
GitHub · X · LinkedIn

Happy to discuss here or on the canonical post — both threads stay open.

See the prompt before you ship it

Ferhat Atagün — Mon, 08 Jun 2026 11:29:39 +0000

The way most teams find out their prompt is too long is in the bill. The way most teams find out their prompt is approaching the context window is when the model starts dropping the system instructions. The way most teams find out their prompt-caching boundary is in the wrong place is by graphing a hit ratio that won't climb above 30%.

All three of these are diagnosable in advance, in about four seconds, for free. The reason they keep happening is that the tools every Claude developer reaches for — chat playgrounds, IDE plugins, the official SDK — are post-hoc. They show you what just happened. None of them shows you what your prompt looks like before you press send.

The other four tools I've shipped in this suite are all post-hoc too. claudoscope x-rays a finished response. agent-replay scrubs a finished trace. prompt-lab compares two finished runs. tool-lab sandboxes the agent loop. They're all "look at what just happened" microscopes. None of them is a "look at what you're about to do" lens.

context-lens is. Paste a system prompt and a user message; see exactly how the API will count them, where in the 200K window you sit, where caching boundaries fall, and what each call will cost. The pre-flight check that turns a guess into a measurement.

TL;DR

Token cost, context-window position, and prompt-caching layout are all knowable from the prompt alone — you don't need to send the request.
Anthropic's count_tokens endpoint gives you the exact number; a ~3.7 chars/token heuristic gives you a good-enough number while you type.
The most useful single number is "tokens × calls/day × dollars/token" — once you can compute it before deploying, "ship this prompt" stops being an aesthetic call and becomes a budget call.
A 4× difference in input length between two equivalent prompts is normal. Catching it before it goes to production saves more than the tool costs to build.

What you can actually pre-flight

Three things, all derivable from the prompt text alone:

1. Exact token count. Not an estimate. Anthropic ships a /v1/messages/count_tokens endpoint that takes the exact same shape as /v1/messages (system, messages, tools) and returns just the input_tokens number. Same tokenization as the actual API call would use. No model invocation, no output, no cost beyond a single tiny request.

2. Position in the context window. Sonnet 4.5 has a 200K-token window. Recent Claude models reject a request whose input plus max_tokens exceeds the window with a validation error rather than truncating it silently, and if your own code trims history to stay under the limit, the oldest content goes first, which usually means dropping the instructions the model was relying on. The math is (input + max_output) / 200_000. You should never see "78% of window" in production without knowing about it.

3. Cost per call. Multiply input tokens by input price ($3/M on Sonnet), output tokens by output price ($15/M), and you have one number for the cost of one call. Multiply by your traffic and you have the bill. The interesting move: do this before you commit to a prompt design, not after.

The fourth thing — where prompt-caching boundaries should sit — is harder to derive purely from text, but it's still pre-flight: you choose where to put cache_control based on which prefix is stable across your real traffic. context-lens won't choose for you, but it will show you the boundaries you've chosen so you can sanity-check them.
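
The window and cost arithmetic fits in two small functions. A sketch using the Sonnet prices and the 200K window quoted above; the function names and the token counts in the example calls are made up for illustration.

```typescript
// Prices in USD per million tokens.
interface Pricing {
    inputPerMillion: number;
    outputPerMillion: number;
}

// The Sonnet figures quoted in the text above.
const SONNET: Pricing = { inputPerMillion: 3, outputPerMillion: 15 };
const WINDOW_TOKENS = 200_000;

// Fraction of the context window taken by input plus reserved output.
function windowFraction(inputTokens: number, maxOutputTokens: number): number {
    return (inputTokens + maxOutputTokens) / WINDOW_TOKENS;
}

// Cost of a single call in USD.
function costPerCall(inputTokens: number, outputTokens: number, p: Pricing = SONNET): number {
    return (inputTokens * p.inputPerMillion + outputTokens * p.outputPerMillion) / 1_000_000;
}

console.log(windowFraction(150_000, 6_000));     // 0.78
console.log(costPerCall(4_000, 800).toFixed(4)); // 0.0240
```

Multiplying costPerCall by expected calls per day gives the budget figure to look at before committing to a prompt design.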

The 6× input difference no one was looking for

A real example, the worked-out kind. Two versions of the same agent system prompt:

Version	Approach	Input tokens (counted)
A	Markdown headings, examples, long taxonomy, JSON schema embedded	3,847
B	Single paragraph, schema implied by one example, no preamble	612

Same model (Sonnet 4.5). Same user inputs (a code review task). The output was substantively equivalent on five real traffic samples — both caught the same critical bugs, both produced valid JSON, both came in under 800 output tokens.

The cost differential is mechanical:

A: (3847 × 3 + 800 × 15) / 1_000_000 = $0.0235 per call
B: (612 × 3 + 800 × 15) / 1_000_000 = $0.0138 per call

At 10,000 calls per day, that's about $97/day saved, or roughly $2,900/month. For a single prompt rewrite that took two hours to test in context-lens.

The salient detail: I didn't intend version B to be cheaper. I intended it to be more readable. The cost reduction was a side-effect that I would not have noticed without the pre-flight number, because both prompts felt "about the same length" to me in an editor. context-lens told me one was 6.3× the length of the other, in the only metric that matters: the metric the API uses.

The lesson is that "feels about the same" is a uniformly bad estimator for token count, and you stop making the mistake the day you start measuring before you ship.

Why the heuristic mode exists

context-lens does two things:

Live as you type: a fast heuristic, roughly 3.7 chars/token for English-ish text, that updates with every keystroke. No API call, no key required, instant.
On demand: a real API call to count_tokens that gives you the exact number Anthropic will use.

The heuristic isn't quite right — Turkish, code, and JSON all tokenize differently than English prose, sometimes by 30%. But it's a real-time signal while you iterate, which is more useful than an accurate-but-asynchronous one while you write. When you're ready to commit, you click the button and get the exact number. The two modes are intentional: one for the iteration phase, one for the verification phase.
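
The two modes could be sketched roughly as follows. This is not the context-lens source: estimateTokens and buildCountTokensRequest are hypothetical names, the model id is left as a parameter, and the request is only built, not sent.

```typescript
// Fast mode: roughly 3.7 characters per token, the heuristic described above.
// Only an approximation; code, JSON and non-English text can be well off.
function estimateTokens(text: string, charsPerToken = 3.7): number {
    return Math.ceil(text.length / charsPerToken);
}

interface CountTokensRequest {
    url: string;
    init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Exact mode: build (but do not send) a request to the count_tokens endpoint.
function buildCountTokensRequest(
    apiKey: string,
    model: string,
    system: string,
    userMessage: string,
): CountTokensRequest {
    return {
        url: "https://api.anthropic.com/v1/messages/count_tokens",
        init: {
            method: "POST",
            headers: {
                "content-type": "application/json",
                "x-api-key": apiKey,
                "anthropic-version": "2023-06-01",
                // Needed when calling from a browser tab with your own key.
                "anthropic-dangerous-direct-browser-access": "true",
            },
            body: JSON.stringify({
                model,
                system,
                messages: [{ role: "user", content: userMessage }],
            }),
        },
    };
}

console.log(estimateTokens("You are a careful code reviewer.")); // 9
```

Sending it is then a single fetch(req.url, req.init), with the exact count read from the input_tokens field of the JSON response.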

The pattern generalizes. Every place you have a fast-approximate metric and a slow-exact one, ship both, label them clearly, default to the fast one. The fast metric should never be wrong by more than ~30%; otherwise it's not a useful approximation. ~3.7 chars/token meets that bar for the languages context-lens has to handle.

What about prompt caching

Caching is the lever most teams underuse, and the one context-lens helps with most by surfacing where the boundaries are. Anthropic lets you mark any segment of your prompt as cacheable with cache_control: { type: "ephemeral" }. For the next 5 minutes, requests that share that exact prefix get the cached portion at 10% of the input price. The math flips: a 4,000-token system prompt that costs $0.012 per cold call costs $0.0012 per warm call. That's 10×.

The catch: every byte before the cache_control boundary must be identical. If you interpolate the user's name into the system prompt — gone. If your tool list reorders between requests — gone. If you append a timestamp — gone.
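
One way to find out where a prefix stops being stable is to compare the serialized prompts of two real requests. A minimal sketch; firstDivergence is a hypothetical helper, the prompts are invented, and it compares JavaScript string characters rather than raw bytes or tokens, so treat the index as a pointer to the problem rather than an exact cache boundary.

```typescript
// Index of the first character at which two serialized prompt prefixes differ,
// or -1 when they are identical. Nothing from that index onward can be shared.
function firstDivergence(a: string, b: string): number {
    const shared = Math.min(a.length, b.length);
    for (let i = 0; i < shared; i++) {
        if (a[i] !== b[i]) return i;
    }
    return a.length === b.length ? -1 : shared;
}

const stable = "You are a code reviewer. Reply in JSON.";
const withName = (name: string): string =>
    `You are a code reviewer for ${name}. Reply in JSON.`;

console.log(firstDivergence(stable, stable));                   // -1
console.log(firstDivergence(withName("Ada"), withName("Bob"))); // 28
```

In the second call the prompts diverge at the interpolated name, so only the first 28 characters could ever be shared between those two requests.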

context-lens shows you the structure you're sending. It doesn't auto-detect cacheability for you, but it does let you toggle "assume input is cache-read" and see what the cost would be if your caching worked. If $0.012 → $0.0012 is interesting at your traffic level, the path to verify it works is in claudoscope, which shows you the actual cache-read and cache-write breakdown on a live call. The two tools are complementary: context-lens predicts, claudoscope measures.
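
The "assume input is cache-read" toggle amounts to swapping the rate applied to the cached prefix. A sketch, not the context-lens implementation: inputCost is a hypothetical function, and the constants are the $3/M input price and 10% read rate quoted above.

```typescript
const INPUT_PER_MILLION = 3;       // USD, Sonnet input price quoted in the post
const CACHE_READ_MULTIPLIER = 0.1; // cached reads at 10% of the input price

// Input-side cost in USD for a prompt with a cacheable prefix.
// `cacheHit` plays the role of the "assume input is cache-read" toggle.
function inputCost(prefixTokens: number, suffixTokens: number, cacheHit: boolean): number {
    const prefixRate = cacheHit
        ? INPUT_PER_MILLION * CACHE_READ_MULTIPLIER
        : INPUT_PER_MILLION;
    return (prefixTokens * prefixRate + suffixTokens * INPUT_PER_MILLION) / 1_000_000;
}

console.log(inputCost(4000, 0, false).toFixed(4)); // 0.0120
console.log(inputCost(4000, 0, true).toFixed(4));  // 0.0012
```

This deliberately ignores the premium Anthropic charges when a cache entry is first written, so the first call in each window costs somewhat more than the cold figure shown here.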

I wrote a longer piece on the caching observability case in Prompt caching is the cheapest Claude optimization. Nobody measures it. if you want the full argument.

What I'd recommend you do this week

Three escalating moves:

Today (5 minutes): Take whatever prompt your team is shipping right now. Paste it into context-lens with a representative user message. Note the token count. Now write a 1-paragraph version of the same prompt and paste that. If the count drops by 50% with no quality regression on three real inputs, you have a free production cost cut.
This sprint (an afternoon): Add a pre-merge step to your prompt-change workflow: every PR that touches a prompt must include the context-lens token counts (before / after) in the description. Same energy as showing test results. If a PR triples your input tokens, that should be a conversation, not a stealth deploy.
This quarter (a habit): Track the prompt-cost-per-feature number across your product as a real metric. If feature X costs $0.02/call and feature Y costs $0.20/call, that's information you should know about before the bill teaches you. context-lens is the cheapest place to start collecting it — count_tokens is free to call.

The economics of LLM apps in 2026 are not about model selection, mostly. They're about prompt design. Teams that can see their prompts before they ship them will out-compete teams that can't, on cost first and on quality second. The "see them" part is what's missing in most setups, and what context-lens is for.

I shipped this in context-lens — paste a Claude prompt, see what it costs before you ship. BYOK, no backend, runs in the browser. Source: github.com/ferhatatagun/context-lens.

The same protocol-level approach also powers four sibling tools — claudoscope, agent-replay, prompt-lab, tool-lab. All open source, all BYOK: ferhatatagun.com/tools.

This post is mirrored from ferhatatagun.com/blog/see-the-prompt-before-you-ship-it — that's the canonical URL.


What I learned shipping four open-source Claude dev-tools in two weekends

Ferhat Atagün — Mon, 08 Jun 2026 11:29:08 +0000

About a month ago I tried to import the Anthropic SDK into a Next.js project and the bundler crashed. The fix was straightforward — talk to the Messages API directly, ~150 lines of TypeScript replacing the SDK — but the side-effect was that I now had a hand-rolled SSE client lying around, with all of Claude's streaming behaviour visible to me at the protocol level for the first time.

That client became the seed of four small open-source tools, shipped over two weekends. Each one points a different microscope at the same protocol:

claudoscope — live x-ray of token economics: input, cache write, cache read, output, all visible as the response streams.
agent-replay — paste a Claude agent trace, replay it step-by-step on a cinematic timeline.
prompt-lab — run two prompts (or models) on the same input, side by side, with output/cost/latency compared.
tool-lab — define Claude tools in a JSON editor, type the mock responses by hand, watch the agent loop play out.

All four run only in your browser, BYOK, no backend, MIT-licensed. Together they're around 400 KB gzipped; the shared SSE client is the same file in all four repos. Five long-form posts on ferhatatagun.com/blog and Medium document the engineering decisions behind each one.

The work is done — the more interesting question for me now is what shipping them in this shape, on this timeline, taught me about building developer tools in the AI-tooling era.

TL;DR

Resistance from the official SDK ended up being the most generative constraint. Without the crash, I would never have written the parser, and without the parser, I would never have noticed how much the SDK hides.
"One tool per insight" beats "one tool for everything." Each of the four tools makes exactly one thing visible. They compose because they don't try to.
BYOK + browser-only is a credibility multiplier. The threshold for "I'll try this" drops dramatically when there's no account to make and no server to trust.
A <150-line shared protocol client across four projects is a more interesting reuse pattern than "extract into a library." It travels by copy-paste, but with intent.
The articles are not promotion; they're scaffolding. Every tool needs a long-form artifact that explains why it exists, not what it does.

The constraint that made the work possible

If the Anthropic SDK had imported cleanly into my Next.js bundle, none of this exists. I would have used the SDK, never seen the SSE event stream, never realized that the four usage fields are sitting there in every response, and shipped some boring product feature instead.

What broke first was the bundler — node:fs/promises from inside an agent-toolset module, deep in the SDK's transitive imports. The fix wasn't subtle: don't use the SDK. Talk to api.anthropic.com directly with fetch. Add the anthropic-dangerous-direct-browser-access header. Parse the SSE stream by hand. About 150 lines.

The interesting part wasn't the parser — it was what I saw because of the parser. I'd been calling Claude for months without ever noticing that cache_creation_input_tokens and cache_read_input_tokens were distinct fields. I'd never looked at the granular order of content_block_delta events. I'd never noticed that tool_use inputs arrive as partial-JSON deltas you have to accumulate. The SDK had been doing me a favor by hiding this stuff, and I'd been doing my apps a disservice by letting it.
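
To make those three observations concrete, here is a minimal sketch of that kind of parsing. It is not the author's client: summariseStream is a hypothetical name, the transcript is made up, and it assumes one data: line per event and a complete transcript rather than a live stream.

```typescript
interface StreamSummary {
    text: string;                        // concatenated text_delta chunks
    toolInputs: Record<number, unknown>; // parsed tool_use input, keyed by block index
    usage: Record<string, unknown>;      // usage fields merged across events
}

// Parse a complete SSE transcript (events separated by blank lines).
function summariseStream(raw: string): StreamSummary {
    const summary: StreamSummary = { text: "", toolInputs: {}, usage: {} };
    const partialJson: Record<number, string> = {};

    for (const frame of raw.split(/\r?\n\r?\n/)) {
        const dataLine = frame.split(/\r?\n/).find((line) => line.startsWith("data:"));
        if (!dataLine) continue;
        const event = JSON.parse(dataLine.slice("data:".length).trim());

        switch (event.type) {
            case "message_start":
                Object.assign(summary.usage, event.message.usage);
                break;
            case "content_block_delta":
                if (event.delta.type === "text_delta") {
                    summary.text += event.delta.text;
                } else if (event.delta.type === "input_json_delta") {
                    // Tool inputs arrive as JSON fragments: accumulate now, parse at block end.
                    partialJson[event.index] = (partialJson[event.index] ?? "") + event.delta.partial_json;
                }
                break;
            case "content_block_stop": {
                const json = partialJson[event.index];
                if (json !== undefined) summary.toolInputs[event.index] = JSON.parse(json || "{}");
                break;
            }
            case "message_delta":
                Object.assign(summary.usage, event.usage);
                break;
        }
    }
    return summary;
}

// Made-up transcript: two text deltas, then a tool input split across two deltas.
const sample = [
    'event: message_start',
    'data: {"type":"message_start","message":{"usage":{"input_tokens":12,"cache_creation_input_tokens":0,"cache_read_input_tokens":0,"output_tokens":1}}}',
    '',
    'event: content_block_delta',
    'data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hel"}}',
    '',
    'event: content_block_delta',
    'data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"lo"}}',
    '',
    'event: content_block_stop',
    'data: {"type":"content_block_stop","index":0}',
    '',
    'event: content_block_delta',
    'data: {"type":"content_block_delta","index":1,"delta":{"type":"input_json_delta","partial_json":"{\\"city\\":"}}',
    '',
    'event: content_block_delta',
    'data: {"type":"content_block_delta","index":1,"delta":{"type":"input_json_delta","partial_json":"\\"Paris\\"}"}}',
    '',
    'event: content_block_stop',
    'data: {"type":"content_block_stop","index":1}',
    '',
    'event: message_delta',
    'data: {"type":"message_delta","delta":{"stop_reason":"tool_use"},"usage":{"output_tokens":9}}',
].join("\n");

console.log(summariseStream(sample));
```

A real client would read the response body incrementally and handle ping and error events, which this sketch skips.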

The lesson, restated: when an SDK fights you, the fight is the gift. The work to bypass it gives you ground-truth visibility you'd never have bought yourself.

One tool, one thing it makes visible

The temptation, once I had the SSE parser, was to build "a Claude developer dashboard" — one tool that did everything. I almost did. The reason I didn't is that the most useful diagnostic tools I've ever used (Wireshark, Chrome DevTools' specific panels, the React Profiler) all share a property: each panel makes exactly one thing visible in a way no other tool does.

So I broke the work into four:

Tool	Makes visible
claudoscope	The four `usage` fields, live, as cost in dollars
agent-replay	The decision sequence inside a `messages` array
prompt-lab	The latency/cost/output diff between two variants
tool-lab	What the model actually does with your tool schemas

Each is a small surface area. None of them does the other three's job. They're all the same shape — paste-some-JSON, watch-some-output, see-the-thing — but the "thing" is intentionally different in each.

This decomposition cost me something: I have four landing pages to maintain, four READMEs, four sets of cross-links. But it bought me an asymmetric thing: a clear pitch per tool. "X-ray a Claude API call" is easier to share than "an all-in-one Claude developer console." On a Show HN front page or a Twitter timeline, the small specific claim wins.

BYOK + browser-only as a trust multiplier

The first version of each tool, in my head, had a backend. A small Node service, an API key kept server-side, maybe a rate limiter. I started building the first one this way, then stopped at the deploy step and asked: why am I making the user trust me with their key?

There is no good answer. For a developer tool that the user is going to use for ten minutes to debug their own work, no backend is necessary. Their key, their requests, their data. The browser is the right runtime; localStorage is the right persistence layer; "nothing leaves your tab" is the right privacy guarantee.

What this changed: the "try it" threshold collapsed. No account creation. No OAuth dance. No "should I trust this site with my key?" hesitation. Open the URL, paste a key, hit send. The tool is yours in under thirty seconds. The Anthropic header named anthropic-dangerous-direct-browser-access was clearly built for exactly this kind of usage — a developer wants to look at the protocol directly, on their own machine, with their own credentials.

The flip side: this design only works for developer tools where each user brings their own key. A production app cannot ship its own key to users, so it would still need a backend. But for the diagnostic case, BYOK + browser-only is the right architecture.

A 150-line client, copied four times

The shared SSE streaming client is src/lib/anthropic.ts in all four repos. Same file. Same 150ish lines. I considered extracting it to an npm package — @ferhatatagun/claude-fetch or similar — and decided against it three times.

The case against extraction is intuitive once you've worked at scale: a shared library across four tools creates a fan-out problem. A breaking change in the library breaks all four; a non-breaking change requires version-pinning logic; a hotfix requires four PRs to deploy. Meanwhile the four tools are small enough that the file is reviewable in five minutes. There's nothing to abstract over.

What I do instead: a comment at the top of src/lib/anthropic.ts in each repo says where the file was last synced from. When I improve the parser in one tool, I diff the file across the four repos and reconcile. It takes minutes, not hours, and the four tools stay in sync without the ceremony of a published package.

This isn't a universal pattern — for ten projects it would break down, for a hundred it's clearly wrong. But for four tools shipped by one person on weekends, it's strictly better than the npm-and-versioning alternative.

The articles aren't marketing — they're scaffolding

Each of the four tools has a long-form post that explains why it exists. claudoscope has two (one on the streaming client itself, one on cache observability). prompt-lab, tool-lab, and agent-replay each have one. There are also five matching Turkish translations on ferhatatagun.com.

These posts are not promotion in the marketing sense. I'm not optimizing them for SEO and I'm not pumping them on LinkedIn for impressions. (OK, I'm pumping them on LinkedIn a little. But that's not the point.)

The point is: a tool that does one specific thing benefits massively from an artifact that explains why that specific thing is worth doing. "Here's a tool to A/B test Claude prompts" is a less convincing pitch than "you're choosing prompts by vibes; here's what side-by-side reveals that sequential doesn't, with a worked example, and a tool for it." The article does the persuasion; the tool catches the convinced reader.

Without the writing, the tools look like toys. With the writing, they look like the natural conclusion of an argument. The two work as a pair.

What I'd do differently

A handful of small things I'd front-load if I were starting over:

Demo mode from day one. I added ?demo=1 to three of the four tools as an afterthought. It's the single highest-conversion feature — users who land on a tool and don't have a key still need something to look at, or they bounce. Should have been there at first commit.
Per-tool OG cards. I shipped each tool with a generic OG image and went back two days later to make per-tool 1200×630 cards in the right brand color. The first two days of traffic that came in via shared links looked generic. Should have been there at launch.
Cross-linking inside the tools. Each tool's footer points to the other three. I added this in the second weekend. The first weekend, every tool was a silo, and visitors discovered them one at a time. Should have been baked into the template.
A "what's in this for me" line on the landing page. I had four hero descriptions like "see what Claude is doing." Better: "see prompt caching save you 90% of your bill, live, as you debug." Specific outcome > vague capability. I corrected this in the second pass.

None of these are large fixes. They're all things that, if you've ever shipped a small developer tool, you already know. Knowing and remembering at the moment of shipping are different things.

What's next

Whatever the API surface adds next, the same pattern applies: ship a small visualizer for it the day it lands. Anthropic shipped MCP, batch, files, computer-use, and citations over the last year, and most of them still don't have great developer-side observability tools. Each one is a 200-300 line tool waiting to be built.

For now, the four-tool suite is at a natural stopping point. The work I'm interested in now is around adoption — making it visible enough that the people who need these tools can find them. If you've read this far and one of the four sounds like it would have saved you time last week, take it for a spin and let me know what's missing.

All four tools: ferhatatagun.com/tools.

Source on github.com/ferhatatagun. MIT, BYOK, no backend.

Articles on each one: ferhatatagun.com/blog.

This post is mirrored from ferhatatagun.com/blog/four-tools-in-two-weekends — that's the canonical URL.

More from the same place:

Long-form blog — AI, LLM tooling, frontend at the model boundary
The five tools the post discusses — all browser-only, BYOK, open-source
GitHub · X · LinkedIn

Happy to discuss here or on the canonical post — both threads stay open.

How I debug Claude agents by replaying their trace

Ferhat Atagün — Mon, 08 Jun 2026 11:28:37 +0000

Your agent did something weird in production. A user reported it, you found the failed run in your logs, and you're now staring at a JSON file that's 400 messages long, half of them are tool_result blocks the size of small databases, and somewhere in there is the moment the agent decided to do the wrong thing.

You can't re-run the agent: the API state has moved on, the tool would behave differently now, the prompt has been updated three times since. You have only the trace.

The way most of us read agent traces is: open the JSON in an editor, ctrl+F for the tool name we suspect, scroll through walls of escaped strings, try to mentally reconstruct the sequence. It takes thirty minutes, by the end of which you have one of three answers — "yeah I see what went wrong," "I'm pretty sure I see what went wrong," or "I have no idea what went wrong." About a third of the time it's the third one, and you go ship a band-aid that may or may not fix the actual problem.

The thing nobody talks about is that this isn't a hard problem. The JSON contains all the information. The issue is purely presentational — it's nearly impossible to read.

TL;DR

Agent traces are a sequence of decisions but stored as a wall of nested JSON. The signal is there; the format is the problem.
The right primitive isn't a JSON viewer — it's a timeline. Each thought, tool call, tool result, and final answer becomes its own discrete, color-coded step.
Once you can scrub through the trace step by step, the failure point becomes visually obvious in seconds instead of minutes.
This is post-hoc, not interactive. You don't need to re-run the agent or hit the API — replay works on the raw trace alone.
A browser-only tool can do this in 4 seconds. No backend, no key, just paste the JSON.

What an agent trace actually contains

When you save a Claude agent run, you usually persist the messages array — the full conversation including the model's responses and the tool results you fed back. A six-step agent run looks roughly like:

[
  { "role": "user", "content": "Find me the cheapest flight from IST to LAX next Tuesday" },
  { "role": "assistant", "content": [
    { "type": "text", "text": "I'll search for flights and check prices." },
    { "type": "tool_use", "id": "tu_01", "name": "search_flights", "input": {...} }
  ]},
  { "role": "user", "content": [
    { "type": "tool_result", "tool_use_id": "tu_01", "content": "[<2KB of JSON>]" }
  ]},
  { "role": "assistant", "content": [
    { "type": "text", "text": "Looking at three of those..." },
    { "type": "tool_use", "id": "tu_02", "name": "get_price", "input": {...} }
  ]},
  // ...four more steps...
]

Every interesting moment of the agent's behaviour is in there: which tool it picked, what arguments it constructed, what it said about its own reasoning, how it interpreted the result. The structure is fundamentally a sequence of discrete events, not a "document."

But you read it as a document, because that's what an editor shows you. The brain has to do the work of converting "alternating role: assistant / role: user with tool_result content blocks" into "step 3 was a tool call to get_price with argument X, which returned Y, which the agent then interpreted as Z."

That conversion is what kills your debugging time. Doing it manually for a 12-step trace takes minutes. Doing it for a 60-step agent on a complex task takes hours.

The right primitive: a timeline of decisions

The reframe is: stop reading the trace as JSON, start watching it as a sequence of decisions. Each step is one of:

💭 Thought — the model wrote text (the part of its response that isn't a tool call)
🔧 Tool call — the model invoked a tool with specific arguments
📥 Tool result — what came back, fed into the next turn
✅ Final answer — the model's end_turn, no more tools

Color-code those four event types. Lay them out in order, one card per event. You now have a timeline you can scrub, step through, and play back. The information density per card is high enough that you can read the entire trace at a glance, and zoom in only on the cards that look suspicious.
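A minimal sketch of that flattening step, assuming the trace is a messages array shaped like the one shown earlier. The `Block`, `Message`, and `TimelineEvent` types and the `toTimeline` name are illustrative, not agent-replay's actual code. It also assumes that assistant text in a turn with no tool call is the final answer.

```typescript
// Hypothetical, trimmed-down view of the content blocks in a saved trace.
type Block =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: unknown }
  | { type: "tool_result"; tool_use_id: string; content: unknown };

interface Message {
  role: "user" | "assistant";
  content: string | Block[];
}

type TimelineEvent =
  | { kind: "thought"; text: string }
  | { kind: "tool_call"; id: string; name: string; input: unknown }
  | { kind: "tool_result"; toolUseId: string; content: unknown }
  | { kind: "final_answer"; text: string };

// Flatten alternating user/assistant messages into one event per card.
function toTimeline(messages: Message[]): TimelineEvent[] {
  const events: TimelineEvent[] = [];
  for (const msg of messages) {
    // A plain-string message is the user's prompt, not an agent step.
    if (typeof msg.content === "string") continue;
    const callsTool = msg.content.some((b) => b.type === "tool_use");
    for (const block of msg.content) {
      if (block.type === "tool_use") {
        events.push({ kind: "tool_call", id: block.id, name: block.name, input: block.input });
      } else if (block.type === "tool_result") {
        events.push({ kind: "tool_result", toolUseId: block.tool_use_id, content: block.content });
      } else if (msg.role === "assistant") {
        // Assistant text next to a tool call is a thought; text alone ends the run.
        if (callsTool) events.push({ kind: "thought", text: block.text });
        else events.push({ kind: "final_answer", text: block.text });
      }
    }
  }
  return events;
}
```

Each element of the returned array maps to one card, so rendering, filtering by `kind`, and stepping through the trace all operate on the same flat list.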

The structural insight: agent debugging is closer to debugging a script with breakpoints than to reading source code. You want to step through, not skim. JSON gives you no steps; the timeline gives you nothing else.

The bugs that become obvious in this view

Three failure modes I see repeatedly when I drop a trace into the timeline:

1. The wrong tool, picked silently. The model called search_archive when it should have called search_recent. In JSON this is one line out of 200 that flies past your eye. In the timeline it's a card with a tool name you didn't expect, and you click on it.

2. Hallucinated arguments. The model called the right tool but with an argument shape that doesn't match the schema — usually because the schema is ambiguous. In JSON you see {"q": "foo", "limit": "10"} and don't notice that limit should have been an integer. In the timeline the tool result card right after shows a 400 error and you trace it back one step.

3. The infinite loop precursor. Some agents get stuck in a pattern where they keep calling the same tool with slightly different inputs, never reaching a conclusion. In JSON it's a wall of near-identical blocks. In the timeline it's a visual rhythm — five purple cards in a row with the same tool name — that you can see in your peripheral vision the moment you scroll.

In all three cases, the bug isn't subtle. It just looks subtle when it's hidden in JSON.
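The third failure mode is also easy to flag mechanically once the trace is a flat list of tool calls. A rough sketch, assuming you have already extracted the tool names in call order; `findRepeatedToolRuns` and the default threshold of three are my own choices for illustration.

```typescript
interface ToolRun {
  name: string;
  start: number;  // index of the first call in the run
  length: number; // how many consecutive calls used this tool
}

// Report runs of the same tool called back to back at least `minRun` times.
function findRepeatedToolRuns(toolNames: string[], minRun: number = 3): ToolRun[] {
  const runs: ToolRun[] = [];
  let start = 0;
  for (let i = 1; i <= toolNames.length; i++) {
    // A run ends at the end of the list or when the tool name changes.
    if (i === toolNames.length || toolNames[i] !== toolNames[start]) {
      const length = i - start;
      if (length >= minRun) runs.push({ name: toolNames[start], start, length });
      start = i;
    }
  }
  return runs;
}
```

This only catches strictly consecutive repeats. A loop that alternates between two tools would need a different check.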

What replay gives you that re-running doesn't

The temptation when an agent fails is to re-run it with print statements, see what happens, iterate. Don't. Three reasons:

It costs API calls. A failed agent that called 15 tools costs you 15× input tokens to re-run. With caching maybe less; either way, the bill is real. Replay is free.

The API state has moved. The tool you call today might return different data than the tool returned during the original run. You're not debugging the original failure anymore; you're debugging whatever happens now, which might be a totally different bug.

The model is stochastic. Even at temperature 0, retries can produce different outputs. Re-running an agent and getting a different failure mode means you've now got two bugs to investigate. The trace is the canonical artifact of what actually happened.

Replay sidesteps all three. You're inspecting a frozen artifact, deterministically, at whatever speed you want. The bug doesn't move while you're looking at it.

What this looks like in agent-replay

agent-replay is the tool I built for this. Paste your trace into a JSON pane on the left. The right pane renders it as a cinematic timeline:

Each event is a card with an icon and color
You can press space to play through the trace at 1× speed (one event per second), or scrub manually
Click any card to see the full content — the thought text, the tool call's input JSON, the raw tool result, expanded
Filter by event type — "show me only the tool calls" or "show me only the assistant thoughts" — when you want to focus
The whole thing is in your browser; no key needed, no backend, your trace never leaves the tab

There's a sample trace seeded on ?demo=1 if you want to see what a 12-step agent looks like without copying your own data anywhere.

The thing I keep finding: the moment I'm debugging is no longer "where in the JSON did the agent screw up." It's "which card looks wrong, and what does the next card show as a consequence." A 30-minute investigation becomes a 30-second one. Not because the tool is doing anything clever — it's just showing the same data in the right shape.

What I'd recommend you do this week

Three escalating moves:

Today (5 minutes): Find the last weird agent run you have a trace for. Paste it into agent-replay. See how long it takes to find the failure point. If it's faster than your usual JSON-scrolling approach, you just changed your debugging workflow.
This week (an afternoon): Add a trace-export endpoint to your agent. Every agent run, finished or failed, dumps the messages array to S3 or a database row. You need the trace before you need to debug it, not after.
This quarter (a habit): When a user reports "the agent did something weird," your first move is to pull the trace and open it in a timeline view, before you read the user's report carefully. Most of the time you'll know what happened before you finish reading the bug report.

Agent debugging is presented as an emerging engineering discipline. It isn't — it's a tooling problem we've solved many times before for non-AI systems. We just haven't built the tools yet for this one. Once the trace is in the right shape, the bugs are obvious. The work is laying out the data, not interpreting it.

I shipped this in agent-replay — paste a trace, scrub the timeline. No key, no backend, runs in the browser. Source: github.com/ferhatatagun/agent-replay.

The same SSE client (for traces that include streaming events) also powers three sibling tools — claudoscope, prompt-lab, tool-lab. All open-source, all BYOK: ferhatatagun.com/tools.

This post is mirrored from ferhatatagun.com/blog/debug-claude-agents-by-replaying-traces — that's the canonical URL.

Build the sandbox before you write a single tool

Ferhat Atagün — Mon, 08 Jun 2026 11:28:05 +0000

The first time you ship a Claude agent that uses tools, you'll do it the obvious way: design the schema, write the actual tool function, hit the API, parse the tool_use block, run the function, feed the result back, loop. It works. It also has a fundamental ordering bug:

You wrote the tools before you knew if they were the right tools.

By the time you've stood up a database query function, two API calls, and a thing that hits the file system, you've sunk maybe a day. You run the agent. It calls a non-existent tool. It hallucinates an argument shape that doesn't match your schema. It picks the wrong tool when both would have worked. Now you're going to redesign the schema, and the four real tool implementations you wrote are going in the bin or being rewritten.

The thing that makes this worse is that the failure mode looks like an "agent quality" problem when it's actually a "premature implementation" problem. The model knew what it wanted; you'd just built the wrong scaffolding around it.

TL;DR

Tool implementations are the slowest part of agent development; tool design is the fastest part to get wrong.
Decouple them: write the tool schemas, run the agent loop with mocked responses, see how the model picks and uses the tools — then write the real implementations only for the tools that survived.
The right mental model is "you play the role of every tool, by hand" — slow for the agent, fast for you, brutal for bad designs.
This is a fifteen-minute exercise for a five-tool agent that would otherwise take a day, and it catches design mistakes before they touch your codebase.
The whole thing fits in a browser tool with no backend.

What "premature implementation" actually looks like

A worked example. I was building a code review agent. My first instinct was four tools:

const tools = [
  { name: "read_file", description: "read a file from the repo", ... },
  { name: "search_code", description: "grep across the repo", ... },
  { name: "get_diff", description: "show the diff for this PR", ... },
  { name: "post_comment", description: "leave a review comment", ... },
];

I implemented all four. Real filesystem access. Real git invocation. Real GitHub API call. Probably four hours total. Then I ran the agent on a real PR.

What happened: the agent called get_diff first (good), then called search_code for every single identifier in the diff (catastrophic — the diff had 200 lines, 50 unique identifiers, my rate limit ran out). It never called read_file because the diff already contained the context. It called post_comment once at the end with a 4,000-word essay instead of inline comments.

Three of my four "real" tools were either misused or unused. The agent design was wrong, not the implementations. If I'd run the loop with mocked responses first, I would have:

Noticed it called search_code 50 times → split the tool into search_code(query, limit=5) with an explicit budget
Noticed it never used read_file → deleted it, saved myself an hour
Noticed post_comment was being used as post_essay → split into post_inline_comment(line, body) and post_summary(body)

That intervention takes fifteen minutes when the tools are mocked. It takes a day when they're real.

The role-play pattern

The trick is shockingly simple: write your tool schemas, send a real user message to Claude, and when the model produces a tool_use block, you hand-type the result and feed it back. The loop runs end-to-end, but you're playing every tool.

In code, this is the same agent loop everyone writes:

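while (true) {
  const res = await callClaude({ messages, tools });
  // Record the assistant turn first so the final answer stays in the trace.
  messages.push({ role: "assistant", content: res.content });
  // Stop on anything that isn't a tool request (end_turn, max_tokens, ...).
  if (res.stop_reason !== "tool_use") break;

  const toolUses = res.content.filter(b => b.type === "tool_use");
  const toolResults = [];
  for (const t of toolUses) {
    toolResults.push({
      type: "tool_result",
      tool_use_id: t.id,
      // Waits until you type what the tool "returned".
      content: await PROMPT_USER_FOR_RESULT(t.name, t.input),  // <-- you fill this in
    });
  }

  messages.push({ role: "user", content: toolResults });
}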

The only difference between this and a "real" agent loop is the PROMPT_USER_FOR_RESULT call — instead of executing a function, it shows you what the model called and what arguments it used, and waits for you to type the answer.

What that produces is surprisingly information-dense:

Did the model pick the tool I expected? If it took a different path you didn't anticipate, your schema is signaling something other than what you meant.
Did the input shape match my JSON schema? If the model is straining to fit the schema, the schema is too rigid or too loose.
How many tools did it chain? A 12-step tool chain to answer one question is a sign you decomposed the toolset wrong.
Did it ask follow-up questions before tool use? That's good — it means the model is trying to disambiguate. If it doesn't, your prompt isn't asking it to.

You see all of this in a five-minute conversation, before you've written a single line of real implementation.
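Those questions can be turned into a small tally at the end of a sandbox session. A sketch under the assumption that you keep the declared tool names and the ordered list of tools the model actually called; `summarizeSession` and `SandboxReport` are hypothetical names.

```typescript
interface SandboxReport {
  callsPerTool: Record<string, number>; // declared tool name -> number of calls
  unusedTools: string[];                // declared but never called
  unknownTools: string[];               // called but never declared
  chainLength: number;                  // total tool calls in the session
}

function summarizeSession(declared: string[], called: string[]): SandboxReport {
  const callsPerTool: Record<string, number> = {};
  for (const name of declared) callsPerTool[name] = 0;

  const unknownTools: string[] = [];
  for (const name of called) {
    if (declared.indexOf(name) === -1) {
      // The model asked for a tool that does not exist in the schema list.
      if (unknownTools.indexOf(name) === -1) unknownTools.push(name);
      continue;
    }
    callsPerTool[name] += 1;
  }

  const unusedTools = declared.filter((name) => callsPerTool[name] === 0);
  return { callsPerTool, unusedTools, unknownTools, chainLength: called.length };
}
```

Run against the code-review example above, this would show `search_code` at 50 calls and `read_file` in `unusedTools`, which are exactly the two signals that prompted the redesign.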

When you can stop role-playing

The sandbox isn't a permanent state. It's a phase. You run it until you've answered three questions:

Are these the right tools? — Some get deleted, some get split, some get merged. Usually 30-50% of your initial toolset doesn't survive contact with a real prompt.
Are the schemas tight enough? — You see the model picking awkward argument values; you constrain the schema (enum instead of string, required instead of optional).
Does the agent loop terminate? — Some agents will keep calling tools forever if their stopping criteria are vague. The mock-response loop surfaces this immediately because you're the one getting stuck typing responses.

When those three are stable on a handful of real prompts, you write the real implementations. The implementation work is now de-risked: you know which tools to actually build, and the schemas are settled.
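To make "tight enough" concrete, here is a sketch of a loose and a tightened input schema for `search_code`, plus a toy checker. The checker handles only a tiny subset of JSON Schema (`type`, `enum`, `maximum`, `required`) and is not a real validator. The `scope` field and its values are invented for illustration; the `limit` budget of 5 comes from the redesign described earlier.

```typescript
// Deliberately tiny subset of JSON Schema, just enough to show the tightening.
interface PropSchema {
  type: "string" | "integer";
  enum?: string[];
  maximum?: number;
}
interface InputSchema {
  properties: Record<string, PropSchema>;
  required: string[];
}

const looseSearch: InputSchema = {
  properties: { query: { type: "string" }, scope: { type: "string" } },
  required: [],
};

const tightSearch: InputSchema = {
  properties: {
    query: { type: "string" },
    scope: { type: "string", enum: ["diff", "repo"] }, // enum instead of free-form string
    limit: { type: "integer", maximum: 5 },            // explicit budget
  },
  required: ["query", "limit"],                        // required instead of optional
};

// Return a list of human-readable problems; an empty list means the input passes.
function checkInput(schema: InputSchema, input: Record<string, unknown>): string[] {
  const problems: string[] = [];
  for (const key of schema.required) {
    if (!(key in input)) problems.push(`missing required field: ${key}`);
  }
  for (const key of Object.keys(input)) {
    const prop = schema.properties[key];
    const value = input[key];
    if (!prop) {
      problems.push(`unknown field: ${key}`);
      continue;
    }
    if (prop.type === "integer") {
      if (typeof value !== "number" || !Number.isInteger(value)) {
        problems.push(`${key} should be an integer`);
      } else if (prop.maximum !== undefined && value > prop.maximum) {
        problems.push(`${key} exceeds ${prop.maximum}`);
      }
    } else if (typeof value !== "string") {
      problems.push(`${key} should be a string`);
    } else if (prop.enum && prop.enum.indexOf(value) === -1) {
      problems.push(`${key} must be one of: ${prop.enum.join(", ")}`);
    }
  }
  return problems;
}
```

Replaying the tool inputs you saw during role-play through `checkInput` with the tightened schema shows which of the model's argument choices the new constraints would have rejected.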

The thing you save isn't the implementation time itself — it's the rework. Writing a tool from scratch is fast. Rewriting a tool because its schema was wrong, then updating the prompt because the new schema needs different framing, then re-running every regression input, is what eats days.

What this looks like in tool-lab

tool-lab is what I built to do this without setting up a project each time. Two panes:

┌─ Tools (JSON editor) ──────────┬─ Conversation ────────────────────┐
│ [                              │  user: review this PR             │
│   { "name": "read_file", ... },│  assistant: I'll get the diff.    │
│   { "name": "search_code"...}, │    → tool_use: get_diff()         │
│   { "name": "get_diff", ... }, │    ← tool_result: <YOU TYPE>      │
│   { "name": "post_comment"...} │  assistant: ...                   │
│ ]                              │                                   │
└────────────────────────────────┴───────────────────────────────────┘

You paste your tool schemas on the left. Type the user message. The model streams its response into the right pane. When it lands a tool_use block, the conversation pauses with a text field for the result. You type whatever the tool would have returned — JSON, a string, an error, whatever. Hit continue. The loop runs again with your fake result included.

It's about 12KB of relevant logic on top of the shared SSE client I wrote about here. BYOK, no backend, your tool schemas and conversations live in localStorage only. There's a demo conversation seeded on ?demo=1 so you can see the loop run without writing tools yourself.

The thing I keep noticing: the tool-lab session for any new agent takes ten to twenty minutes. The agent design that comes out of it is consistently 30-50% smaller than what I would have written from intuition. Smaller agents with fewer, more focused tools are also dramatically easier to reason about when they go wrong in production — which is the other dividend of doing the sandbox phase.

What I'd recommend you do this week

Three escalating moves:

Today (10 minutes): Pick an agent you're already building. Paste its tool schemas into tool-lab, send a real user message, see what happens. If the agent picks the wrong tools or uses the right ones in surprising ways, you've just learned something.
This sprint (an afternoon): Make "sandbox before implementation" the default for new agents on your team. Stand up the tool schemas first, role-play five representative prompts, then write the implementations only for tools that survived. Track the count: how many initial tools made it through.
This quarter (a habit): When something goes wrong with an agent in production — wrong tool picked, weird argument shape, infinite loop — drop the trace into the sandbox before debugging the implementation. The bug is often in the design, not the code.

Tool implementations are not the hard part of agent development. Tool design is. The thing that separates teams that ship reliable agents from teams that ship agents that "mostly work" isn't the quality of their tool functions; it's how many bad tool designs they killed before writing the function.

You don't need a framework for this. You don't need a vendor. You need fifteen minutes and a willingness to play the role of every tool, by hand, until you know which ones deserve to be real.

I shipped this in tool-lab — define tools, mock responses, watch the agent loop. BYOK, no backend, runs in the browser. Source: github.com/ferhatatagun/tool-lab.

The same SSE client also powers three sibling tools — claudoscope, agent-replay, prompt-lab. All open-source, all BYOK: ferhatatagun.com/tools.

This post is mirrored from ferhatatagun.com/blog/build-the-sandbox-first — that's the canonical URL.

Your prompt isn't better. You just remember it being better.

Ferhat Atagün — Mon, 08 Jun 2026 11:27:10 +0000

Every developer who has shipped a Claude-powered feature has had this conversation with themselves:

"OK, the old prompt was too long, this one's tighter — feels like it's giving better answers… and faster too, I think? Let's ship it."

You ship it. A week later something feels off — maybe outputs are flakier on the edge cases, maybe the bill went up, maybe a coworker tells you "the AI doesn't get it anymore." You don't remember the exact previous prompt. You don't have a baseline. You change it back. Or you don't, and live with a quiet regression for a month.

I have done this maybe forty times. Most of us have. The reason isn't that prompt iteration is hard. The reason is that evaluating prompt iteration is hard, and we don't have the tooling for it, so we substitute taste — which works fine until it doesn't.

TL;DR

"It feels better" is not data. Your sample size is one query, your memory is recent, your prior is sunk cost.
The minimum useful comparison is the same input through two prompts in parallel, surfacing three numbers: output (do they say the same thing?), latency (how long did each take?), cost (how much did each spend?).
Models change too — comparing GPT-style verbose system prompts on Sonnet 4.5 vs Haiku 4.5 surfaces ~10× cost differences for outputs you'd score the same.
Running them in parallel makes it fair: same time of day, same API state, same input. Running them sequentially in a chat window does not.
A browser-only tool can do this in 4 seconds. You don't need a benchmarking framework. You need to see them side by side.

What "vibes" actually costs

The trap with prompt tuning is that the only dimension a chat-style UI shows you is the output text. You read it, decide if it sounds right, and move on. Three things get hidden:

1. Latency. Did this take 3 seconds or 11? You squinted, kind of remembered, but you weren't watching a stopwatch. Across a thousand production requests this difference is the gap between "snappy" and "slow."

2. Cost. The verbose system prompt that produces beautiful structured output uses 4,000 input tokens. The terse one uses 600. They both produce ~800 output tokens. At Sonnet pricing ($3 per million input tokens, $15 per million output) that's the difference between roughly $24 and $14 per thousand calls. You don't see this difference looking at one response.

3. Output drift. "Cleaner" outputs sometimes mean the model lost a useful constraint. The polite preamble you stripped out was actually doing something. The structured format you added looks neat but truncates on long inputs. Side-by-side reveals this; sequential doesn't, because you remember the gist of the previous answer, not the specifics.

The whole point of A/B testing is to lift all three of these into the same field of view, on the same input, at the same time. That's it. That's the entire idea. The reason most of us don't do it is that we don't have the tool — and the friction of switching between two tabs, hitting send twice, copying output into a diff viewer, and looking up cost in the dashboard is enough to make us shrug and ship.

Same input, two prompts, parallel

The mechanism is unspectacular:

const [outA, outB] = await Promise.all([
  runClaude({ system: promptA, messages, model }),
  runClaude({ system: promptB, messages, model }),
]);

That's the core. Two requests fired in parallel against the same messages. The trick is that both streams are happening simultaneously — same network conditions, same API load, same time-of-day cache warmth. Sequential A→B isn't a fair comparison; if the API was congested for the first call and cached the second, you're measuring weather, not signal.

What you do with the two outputs is where it gets interesting. The boring version: log both, eyeball, pick one. The version that actually works: side-by-side render, each with its own latency clock, each with its own token count and cost dollars, each with a diff highlight if you want to see exactly where they disagree.
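A sketch of the "each with its own latency clock" part, assuming a `Runner` function that wraps whatever client you use and resolves with the text and token counts. `Runner`, `timed`, and `compare` are hypothetical names, not prompt-lab's API.

```typescript
interface RunResult {
  text: string;
  inputTokens: number;
  outputTokens: number;
}
// Hypothetical wrapper around your API client: (system prompt, user message) -> result.
type Runner = (system: string, userMessage: string) => Promise<RunResult>;

interface TimedResult extends RunResult {
  latencyMs: number;
}

// Wrap one call with its own stopwatch so each column reports its own latency.
async function timed(run: Runner, system: string, userMessage: string): Promise<TimedResult> {
  const started = Date.now();
  const result = await run(system, userMessage);
  return { ...result, latencyMs: Date.now() - started };
}

// Fire both prompts at the same moment against the same input.
async function compare(run: Runner, promptA: string, promptB: string, userMessage: string) {
  const [a, b] = await Promise.all([
    timed(run, promptA, userMessage),
    timed(run, promptB, userMessage),
  ]);
  // Exact-match check only; semantic equivalence still needs a human or a diff view.
  return { a, b, sameText: a.text.trim() === b.text.trim() };
}
```

Timing inside `timed` rather than around `Promise.all` matters: the outer promise only resolves when the slower call finishes, so a single stopwatch would report the same latency for both columns.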

The thing I've found is that 80% of the time both prompts produce substantively equivalent outputs. The reason to pick one is purely on cost or latency — there's no semantic improvement, you just got a 4× cheaper version of the same answer. The remaining 20% is where the outputs actually diverge meaningfully, and that's where eyeballs are needed, but at least now you know to look.

What "better" looks like in numbers

A concrete example from last week. I had two versions of a system prompt for a code-review tool:

Version A — 1,800 tokens, full taxonomy of issue types, examples for each, explicit JSON schema:

You are a senior staff engineer reviewing a pull request. For each
issue you find, classify it under one of:
- correctness (the code is wrong)
- security (the code is exploitable)
- performance (the code is slow)
- maintainability (the code is hard to read)
...

Version B — 280 tokens, no taxonomy, schema implied by an example:

Review this code. For each problem, return JSON like:
[{"severity": "high"|"medium"|"low", "line": 42, "issue": "..."}]
Don't comment on style; focus on bugs and security.

Same input (a 600-line Python file). Both went to Sonnet 4.5. Side-by-side run:

	Version A	Version B
Input tokens	2,640	1,120
Output tokens	820	740
Latency	5.3s	3.1s
Cost	$0.0202	$0.0145
Issues found	7	6

Looking at the diff: both flagged the same 5 critical issues. Version A also flagged a # TODO as a maintainability issue and split a complex function into two suggested refactors. Version B was tighter — it caught fewer minor things but every single thing it caught was actionable.

I shipped B. Not because it was "better" in a soft sense; because it was 28% cheaper and 41% faster for outputs that a human would consider equivalent on the work that mattered. That is what an A/B framework gives you that a chat UI doesn't: a basis for the decision that isn't "feels right."
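The arithmetic behind those two percentages is small enough to sketch. The prices below ($3 per million input tokens, $15 per million output) are an assumption on my part, chosen because they reproduce the Cost row in the table. Percentages are rounded down, which matches the figures quoted.

```typescript
// Assumed prices in USD per million tokens; they reproduce the Cost row above.
const PRICE_PER_MTOK = { input: 3, output: 15 };

interface RunStats {
  inputTokens: number;
  outputTokens: number;
  latencySec: number;
}

function costUsd(s: RunStats): number {
  return (s.inputTokens * PRICE_PER_MTOK.input + s.outputTokens * PRICE_PER_MTOK.output) / 1e6;
}

// Percentage by which b is lower than a, rounded down to a whole number.
function percentLower(a: number, b: number): number {
  return Math.floor(((a - b) / a) * 100);
}

// One-line summary comparing run B against run A.
function verdict(a: RunStats, b: RunStats): string {
  const cheaper = percentLower(costUsd(a), costUsd(b));
  const faster = percentLower(a.latencySec, b.latencySec);
  return (
    `A: $${costUsd(a).toFixed(4)} / ${a.latencySec}s · ` +
    `B: $${costUsd(b).toFixed(4)} / ${b.latencySec}s · ` +
    `B ${cheaper}% cheaper, ${faster}% faster`
  );
}

const versionA: RunStats = { inputTokens: 2640, outputTokens: 820, latencySec: 5.3 };
const versionB: RunStats = { inputTokens: 1120, outputTokens: 740, latencySec: 3.1 };
const summary = verdict(versionA, versionB);
```

The sketch assumes B is the cheaper and faster run; if B were slower, the percentages would come out negative and the wording would need to flip.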

If I had only run version B sequentially after deleting version A, I would have lost the comparison and convinced myself version B was either much better or much worse than it actually was.

The cross-model angle

The same setup also surfaces something subtle that I think most teams underuse: the right model is also a prompt choice.

Same prompt, two models — Sonnet 4.5 vs Haiku 4.5 — on the same input:

	Sonnet 4.5	Haiku 4.5
Latency	4.1s	0.9s
Cost (input+output)	$0.011	$0.0008
Output quality	9/10	8/10

For the right kind of task, that's a ~13× cost reduction with a quality drop most users would never notice in a UI. The wrong kind of task — anything requiring complex multi-step reasoning — and Haiku will whiff in ways Sonnet wouldn't, and the comparison protects you from that too. You don't have to guess which kind of task you have; you can measure it on five real inputs in five minutes.

How prompt-lab does this

I built prompt-lab because the friction of A/B testing prompts in my own work was high enough that I was skipping the step and shipping by vibes. The tool's whole job is to remove that friction:

Two prompt panes. Paste prompt A on the left, prompt B on the right.
One input pane. Type the user message once.
Hit run. Both responses stream into their respective panes simultaneously.
Below each pane: a small scoreboard with input tokens, output tokens, latency, cost.
At the bottom: a verdict line — "A: $0.0202 / 5.3s · B: $0.0145 / 3.1s · B 28% cheaper, 41% faster."

That's the entire UI. It's a browser tool, BYOK, no backend. It's about 8KB of relevant logic plus the streaming client from the previous post.

You can also do same-prompt-different-model, or different-prompt-different-model. The arena doesn't care which one you're testing — you set the two columns and hit run.

What I'd recommend you do this week

Three steps, increasing in effort:

Today (5 minutes): Open prompt-lab. Take whatever prompt your team is currently shipping. Make a shorter version of it. Run them both on three real inputs. If the shorter one wins on cost+latency with no semantic loss on the inputs you care about, you just paid for your week.
This sprint (an afternoon): Build a small eval harness. Pick 10 representative inputs that span your real traffic. Run every prompt change through them before merging. Doesn't need to be fancy — a JSON file of inputs and a script that diffs outputs is enough to catch the worst regressions.
This quarter (a project): Make A/B comparison part of your prompt review process. Every PR that changes a prompt should include the run output for the same 10 inputs, with the cost and latency numbers in the description. Same energy as showing test results in a code review.
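For the eval-harness step, a sketch of the "script that diffs outputs" under simple assumptions: each prompt version is wrapped in a function that takes an input and resolves with the output text, and whitespace-normalized exact comparison is good enough as a first filter. `EvalCase`, `PromptFn`, and `diffPrompts` are hypothetical names.

```typescript
interface EvalCase {
  id: string;
  input: string;
}
// Hypothetical wrapper: runs one prompt version on one input, resolves with the output text.
type PromptFn = (input: string) => Promise<string>;

interface EvalDiff {
  id: string;
  before: string;
  after: string;
}

// Collapse whitespace so formatting noise does not count as a change.
function normalize(text: string): string {
  return text.replace(/\s+/g, " ").trim();
}

// Run every case through both prompt versions and keep the cases whose outputs differ.
async function diffPrompts(cases: EvalCase[], oldPrompt: PromptFn, newPrompt: PromptFn): Promise<EvalDiff[]> {
  const diffs: EvalDiff[] = [];
  for (const c of cases) {
    // Both versions run in parallel per case, for the same fairness reason as above.
    const [before, after] = await Promise.all([oldPrompt(c.input), newPrompt(c.input)]);
    if (normalize(before) !== normalize(after)) diffs.push({ id: c.id, before, after });
  }
  return diffs;
}
```

Exact comparison will over-report on free-form text, so in practice the returned list is a reading queue for a human rather than a pass/fail gate. For structured outputs such as JSON, comparing parsed values would be a better fit.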

The economics of LLM apps are increasingly about prompt design and model choice. The teams that compete will be the ones that measure both. The teams that don't will keep shipping vibes-based prompt changes and wondering why the bill keeps creeping up while users complain it "feels worse."

You don't need to outsmart your future self. You just need to make it possible for them to look back and know what was actually changing.

I shipped this in prompt-lab — two prompts side by side, BYOK, no backend, runs in the browser. Source: github.com/ferhatatagun/prompt-lab.

The same SSE client also powers three sibling tools — claudoscope, agent-replay, tool-lab. All open-source, all BYOK: ferhatatagun.com/tools.

This post is mirrored from ferhatatagun.com/blog/stop-choosing-prompts-by-vibes — that's the canonical URL.

Prompt caching is the cheapest Claude optimization. Nobody measures it.

Ferhat Atagün — Mon, 08 Jun 2026 11:27:03 +0000

Pull up the last week of Anthropic API bills from any team shipping a Claude-powered product. Two out of three of them are paying for context they could be reading from cache for one-tenth the price. Most of them don't know it, because the dashboard doesn't tell them and the SDKs don't either — by the time the response lands, the only number anyone looks at is output_tokens, and even then mostly when something seems expensive.

The information is in every response. Anthropic puts it in usage:

"usage": {
  "input_tokens": 312,
  "cache_creation_input_tokens": 4180,
  "cache_read_input_tokens": 0,
  "output_tokens": 187
}

Four numbers. The first time a cached prompt runs you pay 1.25× the input price to write the cache. Every subsequent call within the TTL pays 0.1× to read it. The ratio between those two lines is the difference between a $3,000/month bill and a $300/month one. And almost no one is graphing it.

TL;DR

Every Claude response carries cache-hit data in usage. Most apps log it nowhere.
A cache write costs 1.25× the base input price; every hit after it costs 0.1×. A single read within the TTL already puts you ahead of not caching.
The cache TTL is 5 minutes by default. A request pattern that fires once every six minutes is paying the write penalty forever and getting zero benefit.
The fix is observability, not code: graph cache hit ratio over time, alert when it dips, and you'll find the bug before the invoice does.
A 150-line browser tool is enough to do this for any project that streams from the Messages API.

What the four numbers actually mean

When you send a request with cache_control: { type: "ephemeral" } somewhere in your messages, the API checks if it's seen an identical prefix in the last 5 minutes. There are three outcomes:

Cache miss, new content. The full prompt is processed normally. input_tokens reflects the uncached portion; cache_creation_input_tokens reflects what got written into cache for next time.
Cache hit. The cached prefix is read at 10% the price. cache_read_input_tokens shows what was read; input_tokens is just the new suffix.
TTL expired. Same shape as a miss — you pay the creation surcharge again.

So a single response tells you exactly which of these three happened. Not "approximately." Exactly. Per request. For free.
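As a rough sketch, the three outcomes can be read straight off the usage object. The function name and the outcome labels below are made up for illustration; only the four field names come from the API response shown earlier. A cold miss and an expired TTL produce the same shape, so both land on "write", and a response that both reads an existing prefix and writes a new segment gets its own label.

```typescript
// Shape of the usage object shown above; cache fields may be absent or null.
interface Usage {
  input_tokens: number;
  output_tokens: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
}

// Hypothetical labels for this sketch, not values returned by the API.
type CacheOutcome = "hit" | "write" | "hit+write" | "uncached";

function classifyCache(u: Usage): CacheOutcome {
  const read = u.cache_read_input_tokens ?? 0;
  const written = u.cache_creation_input_tokens ?? 0;
  if (read > 0 && written > 0) return "hit+write"; // prefix reused, new segment cached
  if (read > 0) return "hit";
  if (written > 0) return "write"; // cold miss or expired TTL: same shape
  return "uncached"; // nothing was read from or written to the cache
}

const sample: Usage = {
  input_tokens: 312,
  cache_creation_input_tokens: 4180,
  cache_read_input_tokens: 0,
  output_tokens: 187,
};
console.log(classifyCache(sample)); // "write"
```

Logging this label next to each request is often enough to spot an endpoint that never leaves "write".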

The pricing math (Sonnet 4.5, June 2026) shapes up like this for a 5,000-token system prompt that gets queried once and then again four minutes later:

Scenario	First call	Second call	Total
No caching	5,000 × $3/MTok = $0.015	5,000 × $3/MTok = $0.015	$0.030
Cache, hit	5,000 × $3.75/MTok = $0.019	5,000 × $0.30/MTok = $0.0015	$0.020
Cache, miss (TTL out)	5,000 × $3.75/MTok = $0.019	5,000 × $3.75/MTok = $0.019	$0.038

The third row is the failure mode. You enabled caching, you're paying the write penalty, and nobody's actually hitting the cache. Without measurement, this row looks identical to the second in your code (same headers, same prompt structure, same response), but it's 25% more expensive than not caching at all, and nearly double the cost of a working cache.
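The table's arithmetic can be reproduced in a few lines, which also makes it easy to plug in your own prompt size and call count. This is only a sketch: the per-million-token prices are the ones quoted in the table, and cachedTotal and uncachedTotal are names invented for this example.

```typescript
// Prices in USD per million tokens, copied from the table above.
const PRICE = { input: 3, cacheWrite: 3.75, cacheRead: 0.3 } as const;

const perCall = (tokens: number, pricePerMTok: number): number =>
  (tokens * pricePerMTok) / 1_000_000;

// Cost of `calls` requests sharing one cached prefix of `tokens` tokens.
// `hits` is how many of those calls land inside the TTL and read the cache;
// every other call (the first one, and any after expiry) pays the write price.
function cachedTotal(tokens: number, calls: number, hits: number): number {
  const writes = calls - hits;
  return (
    writes * perCall(tokens, PRICE.cacheWrite) +
    hits * perCall(tokens, PRICE.cacheRead)
  );
}

// Cost of the same calls with caching switched off.
function uncachedTotal(tokens: number, calls: number): number {
  return calls * perCall(tokens, PRICE.input);
}

const noCache = uncachedTotal(5000, 2); // row 1
const cacheHit = cachedTotal(5000, 2, 1); // row 2
const cacheMiss = cachedTotal(5000, 2, 0); // row 3
console.log({ noCache, cacheHit, cacheMiss });
```

Rounded to three decimals, the three values match the Total column. Dividing cacheMiss by noCache gives the overhead of caching that never hits.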

How a bad cache hit ratio sneaks in

Three patterns I've watched teams ship and then quietly bleed money over:

1. Per-user system prompts. Someone interpolated the user's name or org ID into the system prompt to feel "personalized." Every cache write is now per-user, and unless that user fires a second request within five minutes, every call pays the creation surcharge. The fix is moving the personalization into the user message and keeping the system prompt static — but you only see this fix is needed when the hit ratio graph is flat at zero.

2. Subtly drifting prompts. Maybe you append the current timestamp, maybe a "today is" line, maybe you regenerate a list of available tools that arrives in a non-deterministic order. The cache key is the exact byte sequence; one character of drift and you've invalidated the whole prefix. Tools that serialize tool definitions before sending are an especially fun source of this — JSON.stringify on an object with shuffled keys produces different bytes, no hit.

3. Wrong TTL for your traffic pattern. A chatbot that gets ~one message every ten minutes has a structural mismatch with a 5-minute ephemeral cache. You're paying the write penalty on every conversation turn. Either bump to the 1-hour cache (more expensive write, way longer life) or accept that caching isn't economical for your traffic shape — but you need the data to make either decision.

All three of these are invisible from a code review. They're only visible in the usage telemetry.
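For the drifting-prompt case, one possible guard is to serialize tool definitions deterministically before they go into the request. The sketch below is an assumption about how you might do that, not something the API requires: stableStringify and serializeTools are hypothetical helpers, and the ToolDef shape is simplified.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serialize with object keys sorted recursively, so equal values give equal bytes.
function stableStringify(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(",")}]`;
  const obj = value;
  const entries = Object.keys(obj)
    .sort()
    .map((k) => `${JSON.stringify(k)}:${stableStringify(obj[k])}`);
  return `{${entries.join(",")}}`;
}

// Simplified, hypothetical tool definition; only the ordering matters here.
interface ToolDef {
  name: string;
  description: string;
  input_schema: Json;
}

// Sort tools by name, then serialize with sorted keys.
function serializeTools(tools: ToolDef[]): string {
  const sorted = [...tools].sort((a, b) =>
    a.name < b.name ? -1 : a.name > b.name ? 1 : 0
  );
  return stableStringify(sorted as unknown as Json);
}

// Same data, different key and tool order: identical output.
const first = serializeTools([
  { name: "weather", description: "w", input_schema: { type: "object" } },
  { name: "clock", description: "c", input_schema: { type: "object" } },
]);
const second = serializeTools([
  { description: "c", input_schema: { type: "object" }, name: "clock" },
  { input_schema: { type: "object" }, name: "weather", description: "w" },
]);
console.log(first === second); // true
```

Sorting the tool array is only safe if your prompts don't depend on tool order; if they do, keep the order fixed in code instead.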

The minimum viable observability

You don't need a metrics stack for this. You need to log four fields per request and chart them. The unhelpful version is the one most teams have:

logger.info("claude response", { tokens: r.usage.output_tokens });

The version that pays for itself in one week is:

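const u = r.usage;
// Default missing cache fields to 0 so the ratio is never NaN.
const cacheRead = u.cache_read_input_tokens ?? 0;
const cacheCreate = u.cache_creation_input_tokens ?? 0;
// Share of cacheable tokens served from cache; 0 when nothing was cacheable.
const hitRate = cacheRead / (cacheRead + cacheCreate || 1);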

logger.info("claude.usage", {
  input: u.input_tokens,
  output: u.output_tokens,
  cache_create: u.cache_creation_input_tokens ?? 0,
  cache_read: u.cache_read_input_tokens ?? 0,
  hit_rate: hitRate,
  cost_estimate: estimateCost(u, model),
});

The hit_rate field is the one that matters. Group by route, by model, by user-agent — whatever your traffic dimensions are. Anything trending toward zero on a cache-using endpoint is a money leak.
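If all you have is a pile of log lines, the grouping can be done in a few lines before any dashboard exists. A minimal sketch, assuming each record carries a route field (a hypothetical dimension; substitute whatever you log) alongside the two cache counters:

```typescript
// One logged usage record; `route` is a hypothetical grouping dimension.
interface UsageLog {
  route: string;
  cache_read: number;
  cache_create: number;
}

// Token-weighted hit rate per route: total read / (total read + total created).
function hitRateByRoute(logs: UsageLog[]): Map<string, number> {
  const totals = new Map<string, { read: number; create: number }>();
  for (const l of logs) {
    const t = totals.get(l.route) ?? { read: 0, create: 0 };
    t.read += l.cache_read;
    t.create += l.cache_create;
    totals.set(l.route, t);
  }
  const rates = new Map<string, number>();
  totals.forEach((t, route) => {
    // `|| 1` avoids dividing by zero for routes that never touch the cache.
    rates.set(route, t.read / (t.read + t.create || 1));
  });
  return rates;
}

const rates = hitRateByRoute([
  { route: "/chat", cache_read: 0, cache_create: 4180 },
  { route: "/chat", cache_read: 4180, cache_create: 0 },
  { route: "/search", cache_read: 0, cache_create: 1000 },
]);
console.log(rates.get("/chat"), rates.get("/search")); // 0.5 0
```

Weighting by tokens rather than averaging per-request ratios keeps one tiny request from skewing the picture for a route.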

The cost_estimate is what makes the dashboard land in conversations with non-engineers. Anthropic publishes pricing per token tier; the conversion is mechanical:

function estimateCost(u: Usage, model: string) {
  const p = pricing[model]; // { input, output, cache_write, cache_read }
  return (
    u.input_tokens * p.input +
    u.output_tokens * p.output +
    (u.cache_creation_input_tokens ?? 0) * p.cache_write +
    (u.cache_read_input_tokens ?? 0) * p.cache_read
  ) / 1_000_000;
}

That's it. Five lines of arithmetic and you've got per-request dollars on every Claude call your app makes.
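The snippet above leaves pricing undefined, so here is one way it might look, checked against the usage sample from the top of the post. Treat it as a sketch: the "sonnet-4.5" key is a made-up label rather than a real model ID, and the rates are the Sonnet 4.5 figures used in this post's examples, so check them against current published pricing before relying on them.

```typescript
interface Usage {
  input_tokens: number;
  output_tokens: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
}

// USD per million tokens.
interface Price {
  input: number;
  output: number;
  cache_write: number;
  cache_read: number;
}

// Hypothetical key; rates as used in this post's examples.
const pricing: Record<string, Price> = {
  "sonnet-4.5": { input: 3, output: 15, cache_write: 3.75, cache_read: 0.3 },
};

function estimateCost(u: Usage, model: string): number {
  const p = pricing[model];
  if (!p) throw new Error(`no pricing for model: ${model}`);
  return (
    (u.input_tokens * p.input +
      u.output_tokens * p.output +
      (u.cache_creation_input_tokens ?? 0) * p.cache_write +
      (u.cache_read_input_tokens ?? 0) * p.cache_read) /
    1_000_000
  );
}

// The usage object from the first example in this post.
const cost = estimateCost(
  {
    input_tokens: 312,
    cache_creation_input_tokens: 4180,
    cache_read_input_tokens: 0,
    output_tokens: 187,
  },
  "sonnet-4.5"
);
console.log(cost.toFixed(4)); // "0.0194"
```

Throwing on an unknown model is a deliberate choice here: a silent zero would make a new, unpriced model look free on the dashboard.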

Why I built a tool for this anyway

I built claudoscope because I wanted to see this data live, while the response was streaming, without instrumenting whatever app I was iterating on. The use case is "I'm about to ship a prompt change, did my cache behavior just regress?" — the slow, expensive way is deploying it and looking at logs an hour later; the fast way is pasting the request into a tool that tells you in 4 seconds.

The whole thing is a browser-only client. Bring your own key, no backend. Every event from the SSE stream is parsed and the usage object is broken out into a panel:

┌─ X-Ray ────────────────────────────────────────┐
│ input         312      $0.0009                 │
│ cache write 4,180      $0.0157  ◄─ first run  │
│ cache read      0      $0.0000                 │
│ output        187      $0.0028                 │
│ ─────────────                                  │
│ total                  $0.0194                 │
│                                                │
│ hit ratio: 0% (cold) — re-run within 5m       │
└────────────────────────────────────────────────┘

Hit "send" a second time within the TTL and the bars rearrange: cache write goes to zero, cache read fills, and the cost of the cached prefix drops by about 90%. It's the kind of thing that's obvious once you see it move and invisible if you don't.

It's about 100KB gzipped and the source is in one file. The pricing tier logic is in another. There's no third file.

What I'd actually recommend you do today

The order of operations, in increasing effort:

Right now (5 minutes): Open claudoscope, paste your most expensive prompt, run it twice. Look at the difference. If the hit ratio isn't ~99% on the second call, you have a cacheability bug, not an optimization opportunity.
This week (an afternoon): Add the usage logging block above to every Claude call site in your app. Ship it. Don't bother building a dashboard yet — grep your logs and you'll find the worst offenders in fifteen minutes.
This month (a sprint): Move the four usage fields into your real metrics pipeline (Datadog/Honeycomb/Grafana/whatever). Graph cache hit ratio by endpoint. Alert when it drops below your floor.
Optional (if you're me): Build the visualizer because seeing it move in real time is the thing that makes it stick.

Three out of four of those are configuration, not code. The interesting part isn't the implementation; it's that almost nobody has done it. The teams I've talked to who do have it — without exception — found a cache misconfiguration in the first week of dashboards and saved more than the work cost them. The teams who don't have it are usually paying the cache creation surcharge for nothing.

The Anthropic API gives you everything you need to know whether your caching is working. The only question is whether you look.

I shipped this visualization in claudoscope — bring-your-own-key, no backend, runs in the browser. Source: github.com/ferhatatagun/claudoscope.

The same SSE client also powers three sibling tools — agent-replay, prompt-lab, tool-lab. All open-source, all BYOK: ferhatatagun.com/tools.

This post is mirrored from ferhatatagun.com/blog/prompt-caching-nobody-measures — that's the canonical URL.

Building a streaming Claude client in the browser — without the SDK

Ferhat Atagün — Mon, 08 Jun 2026 11:26:32 +0000

I wanted to call Claude from a browser. The Anthropic SDK said no — sort of.

When I tried import Anthropic from "@anthropic-ai/sdk" in a Next.js app, the bundler crashed. The error pointed at node:fs/promises, deep inside the package — an agent-toolset module that reads files from disk and obviously cannot run in a browser. It isn't optional code; it's pulled in by the SDK's main client entry.

So either I waited for a browser-clean entry point (eventually, maybe), or I talked to the Messages API directly. The endpoint is HTTP. The streaming format is Server-Sent Events. I'd done this for OpenAI before — how hard could it be?

Turns out: about 150 lines of TypeScript for a usable client, and the result is cleaner than the SDK for the kind of tool I was building. Here's what that took and why I'd recommend it for anything browser-side that touches the Claude API.

TL;DR

The official SDK pulls in Node-only modules and breaks browser bundles.
Direct fetch works once you send anthropic-dangerous-direct-browser-access: true.
The streaming format is straightforward SSE — split events on \n\n, parse data: lines.
The only mild gotcha is tool_use blocks: their input arrives as input_json_delta chunks you accumulate and parse at content_block_stop.
Hand-rolled means tiny bundle, fewer abstractions, full visibility into what the protocol is doing.

The CORS unlock

Browsers won't let you fetch() https://api.anthropic.com by default. Anthropic ships a flag to allow it: send anthropic-dangerous-direct-browser-access: true and CORS opens up. The header's name is a warning — keys typed into a browser are visible to anyone with devtools open. For a bring-your-own-key developer tool that's fine; for a production app shipping a server-side key, it isn't.

With the header in place, a minimal request looks like this:

await fetch("https://api.anthropic.com/v1/messages", {
  method: "POST",
  headers: {
    "content-type": "application/json",
    "x-api-key": apiKey,
    "anthropic-version": "2023-06-01",
    "anthropic-dangerous-direct-browser-access": "true",
  },
  body: JSON.stringify({
    model,
    max_tokens: 1024,
    messages: [{ role: "user", content: "Hello." }],
    stream: true,
  }),
});

stream: true gives back a Server-Sent Events stream. The response body is a ReadableStream<Uint8Array> — chunks of bytes you decode as text. Events are delimited by a blank line; each event is a couple of lines (event: <type> and data: <json>), and the meaningful payload lives in data:.

What the stream actually looks like

For a plain text response, the SSE event sequence is:

event: message_start
data: { "type": "message_start", "message": { ..., "usage": {...} } }

event: content_block_start
data: { "type": "content_block_start", "index": 0,
        "content_block": { "type": "text", "text": "" } }

event: content_block_delta
data: { "type": "content_block_delta", "index": 0,
        "delta": { "type": "text_delta", "text": "Hello" } }

event: content_block_delta
data: { "type": "content_block_delta", "index": 0,
        "delta": { "type": "text_delta", "text": " there" } }

event: content_block_stop
data: { "type": "content_block_stop", "index": 0 }

event: message_delta
data: { "type": "message_delta", "delta": { "stop_reason": "end_turn" },
        "usage": { "output_tokens": 12 } }

event: message_stop
data: { "type": "message_stop" }

Each content_block_delta carries a fragment of text, not necessarily a whole word or token. Concatenate the text fields per index and you have the streamed message. For plain text, that's all there is to it.
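To make the concatenation concrete, here is a small sketch that folds a list of already-parsed events into one string per content block. The event types are trimmed down to the fields shown above, and collectText is a name invented for this example.

```typescript
// Minimal event shapes for this sketch; the real stream carries more fields.
interface DeltaEvent {
  type: "content_block_delta";
  index: number;
  delta: { type: string; text?: string };
}
interface OtherEvent {
  type:
    | "message_start"
    | "content_block_start"
    | "content_block_stop"
    | "message_delta"
    | "message_stop";
  index?: number;
}
type StreamEvent = DeltaEvent | OtherEvent;

// Returns the accumulated text for each content block, keyed by block index.
function collectText(events: StreamEvent[]): string[] {
  const texts: string[] = [];
  for (const evt of events) {
    if (evt.type !== "content_block_delta") continue;
    if (evt.delta.type !== "text_delta") continue; // skip non-text deltas
    texts[evt.index] = (texts[evt.index] ?? "") + (evt.delta.text ?? "");
  }
  return texts;
}

const texts = collectText([
  { type: "message_start" },
  { type: "content_block_start", index: 0 },
  { type: "content_block_delta", index: 0, delta: { type: "text_delta", text: "Hello" } },
  { type: "content_block_delta", index: 0, delta: { type: "text_delta", text: " there" } },
  { type: "content_block_stop", index: 0 },
  { type: "message_stop" },
]);
console.log(texts[0]); // "Hello there"
```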

Three things make this slightly more interesting:

Multiple content blocks per message (text plus tool_use, or several tool_use blocks).
The tool_use block's input arrives as a sequence of partial-JSON deltas, not all at once.
Aborting cleanly when the user clicks Stop.

Parsing the stream

The parser is small. Read chunks, accumulate them in a buffer, split on \n\n (the SSE event separator), and process each event:

const reader = res.body!.getReader();
const decoder = new TextDecoder();
let buffer = "";

while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  // stream: true holds back an incomplete multi-byte character until the next chunk.
  buffer += decoder.decode(value, { stream: true });

  // Drain every complete event; a trailing partial event stays in the buffer.
  let sep: number;
  while ((sep = buffer.indexOf("\n\n")) !== -1) {
    const rawEvent = buffer.slice(0, sep);
    buffer = buffer.slice(sep + 2);

    // The payload lives on the data: line; its JSON repeats the event type.
    const dataLine = rawEvent.split("\n").find((l) => l.startsWith("data:"));
    if (!dataLine) continue;

    const evt = JSON.parse(dataLine.slice(5).trim());
    handle(evt);
  }
}

TextDecoder with { stream: true } matters — without it you'll get garbled UTF-8 when a multi-byte character lands on a chunk boundary. Anthropic sends a lot of em-dashes; ask me how I know.
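The chunk-boundary problem is easy to reproduce without a network call. The sketch below wraps the same buffering logic in a small push-based parser (createSseParser is a hypothetical name, not part of any library) and feeds it one event cut in the middle of a three-byte em-dash.

```typescript
// Feed raw byte chunks in; get one parsed JSON payload per complete SSE event.
function createSseParser(onData: (payload: unknown) => void) {
  const decoder = new TextDecoder(); // one decoder for the whole stream
  let buffer = "";
  return {
    push(chunk: Uint8Array): void {
      // stream: true keeps an incomplete multi-byte character buffered
      // inside the decoder until the rest of it arrives.
      buffer += decoder.decode(chunk, { stream: true });
      let sep: number;
      while ((sep = buffer.indexOf("\n\n")) !== -1) {
        const rawEvent = buffer.slice(0, sep);
        buffer = buffer.slice(sep + 2);
        const dataLine = rawEvent.split("\n").find((l) => l.startsWith("data:"));
        if (dataLine) onData(JSON.parse(dataLine.slice(5).trim()));
      }
    },
  };
}

// One event whose text contains an em-dash (U+2014, bytes E2 80 94).
const bytes = new TextEncoder().encode(
  'event: content_block_delta\ndata: {"text":"a\u2014b"}\n\n'
);
const cut = bytes.indexOf(0xe2) + 1; // split after the first byte of the dash

const received: unknown[] = [];
const parser = createSseParser((p) => received.push(p));
parser.push(bytes.slice(0, cut));
parser.push(bytes.slice(cut));
console.log(received.length, JSON.stringify(received[0]));
```

Creating a fresh TextDecoder per chunk, or dropping stream: true, would turn the split character into replacement characters instead.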

handle(evt) switches on evt.type and updates state. For text-only, the only events that move the UI are content_block_delta (append text to the current text block) and message_delta (final usage). For a full client, I keep a blocks: Block[] array indexed by evt.index and mutate the matching block as deltas arrive.

Tool use: partial-JSON deltas

Tool calling is where this gets a little trickier. When the model decides to call a tool, you get a content_block_start with content_block: { type: "tool_use", id, name, input: {} } — the input is empty. The arguments arrive in content_block_delta events shaped like this:

event: content_block_delta
data: { "type": "content_block_delta", "index": 1,
        "delta": { "type": "input_json_delta", "partial_json": "{\"cit" } }

event: content_block_delta
data: { "type": "content_block_delta", "index": 1,
        "delta": { "type": "input_json_delta", "partial_json": "y\":\"Ist" } }

You can't JSON.parse a partial string. So I accumulate them per block index and only parse at content_block_stop:

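type TextBlock = { type: "text"; text: string };
type ToolUseBlock = { type: "tool_use"; id: string; name: string; input: unknown };
type Block = TextBlock | ToolUseBlock;

const blocks: Block[] = [];
// Raw JSON fragments per tool_use block index, parsed only at content_block_stop.
const toolUseJson: Record<number, string> = {};

function handle(evt: any) {
  switch (evt.type) {
    case "content_block_start": {
      const cb = evt.content_block;
      if (cb.type === "tool_use") {
        blocks[evt.index] = { type: "tool_use", id: cb.id, name: cb.name, input: {} };
        toolUseJson[evt.index] = "";
      } else if (cb.type === "text") {
        blocks[evt.index] = { type: "text", text: "" };
      }
      break;
    }

    case "content_block_delta": {
      const d = evt.delta;
      if (d.type === "text_delta") {
        (blocks[evt.index] as TextBlock).text += d.text;
      } else if (d.type === "input_json_delta") {
        toolUseJson[evt.index] += d.partial_json;
      }
      break;
    }

    case "content_block_stop": {
      const b = blocks[evt.index];
      if (b?.type === "tool_use") {
        try {
          b.input = JSON.parse(toolUseJson[evt.index] || "{}");
        } catch {
          // Malformed or truncated arguments: fall back to an empty input.
          b.input = {};
        }
      }
      break;
    }
  }
}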

This is the entire tool-use accommodation. The UI gets a clean callback when the block completes, with a parsed object as input ready to render.

A nice consequence of the per-block accumulation: text deltas can be rendered live — typing animation, caret blink, the whole thing — while tool_use cards appear only when their input is fully assembled. That feels right. Text is conversational; tool calls are commands.
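The accumulate-then-parse step can be exercised on its own. This sketch isolates it in a small class (ToolInputAccumulator is a hypothetical name) and replays the two fragments shown above plus a third, invented one that closes the JSON so there is something to parse.

```typescript
// Collects input_json_delta fragments per block index; parses only on stop.
class ToolInputAccumulator {
  private parts: Record<number, string> = {};

  start(index: number): void {
    this.parts[index] = "";
  }

  append(index: number, partialJson: string): void {
    this.parts[index] = (this.parts[index] ?? "") + partialJson;
  }

  // Returns the parsed input, or {} if the fragments never formed valid JSON.
  stop(index: number): unknown {
    const raw = this.parts[index] ?? "";
    delete this.parts[index];
    try {
      return JSON.parse(raw || "{}");
    } catch {
      return {};
    }
  }
}

const acc = new ToolInputAccumulator();
acc.start(1);
acc.append(1, '{"cit');
acc.append(1, 'y":"Ist');
acc.append(1, 'anbul"}'); // invented closing fragment, for illustration only
console.log(JSON.stringify(acc.stop(1))); // {"city":"Istanbul"}
```

Returning {} on a parse failure mirrors the handler above; a stricter client might surface the failure to the UI instead of hiding it.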

Abort

Don't skip this. A streaming request that the user has clicked Stop on should actually stop, not run to completion in the background:

const ac = new AbortController();
const res = await fetch(ENDPOINT, {
  // method, headers, and body as in the request above
  signal: ac.signal,
});
// later, when the user clicks Stop:
ac.abort();

After abort, the pending reader.read() rejects with an AbortError and signal.aborted becomes true. Catch it, distinguish it from a real error, and surface a clean "stopped" state:

try {
  // ... the read loop ...
  cb.onDone({ usage, stopReason });
} catch (err) {
  if (signal?.aborted) {
    cb.onDone({ usage, stopReason: "aborted" });
    return;
  }
  cb.onError(errorMessage(err));
}

The user gets the partial response they've already seen plus a "stopped" badge, instead of a generic crash.
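The "distinguish it from a real error" step can be pulled into a tiny helper so it is testable without a live stream. A sketch under the assumption that you have either the signal or the thrown value at hand; classifyFailure is a name made up for this example.

```typescript
type StreamEnd = "aborted" | "error";

// Decide whether a caught failure was a user-requested stop or a real error.
function classifyFailure(err: unknown, signal?: AbortSignal): StreamEnd {
  // The signal is the most reliable source: it flips as soon as abort() runs.
  if (signal?.aborted) return "aborted";
  // Fallback when the signal isn't in scope: abort rejections are named AbortError.
  if (
    typeof err === "object" &&
    err !== null &&
    (err as { name?: unknown }).name === "AbortError"
  ) {
    return "aborted";
  }
  return "error";
}

const ac = new AbortController();
console.log(classifyFailure(new Error("network down"), ac.signal)); // "error"
ac.abort();
console.log(classifyFailure(new Error("network down"), ac.signal)); // "aborted"
```

Checking the signal first means an unrelated error that happens to surface right after the user clicks Stop is still reported as a stop, which is usually what the user expects.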

Errors that mean something

A 401 from the API can mean several things; a 429 can mean several things. The browser hands you a Response you have to drill into. Parse the body as JSON, look for error.message, fall back to status-code messages, and present something the user can act on:

async function readError(res: Response): Promise<string> {
  try {
    const body = await res.json();
    const msg = body?.error?.message ?? body?.message;
    if (msg) return `${res.status} · ${msg}`;
  } catch {
    /* fall through */
  }
  if (res.status === 401) return "401 · Invalid API key.";
  if (res.status === 429) return "429 · Rate limited — wait a moment.";
  return `${res.status} · Request failed.`;
}

Boring, but the difference between "the app crashed" and "your key is invalid, fix it" is the difference between a tool and a toy.

What this gets you

The whole SSE client — request, parsing, tool use, abort, errors — fits in about 150 lines of TypeScript and ships in a browser bundle that is, in my case, around 100 KB gzipped including React, Tailwind v4, Framer Motion, and the rest. The SDK alone is larger than that.

The other thing it gets you is honesty. The most interesting part of working with the Claude API is the streaming behaviour — caching turning on, tokens accumulating, tool calls landing one block at a time. Hiding that behind an SDK abstraction means you have to debug the SDK before you can debug your app. With direct fetch, your client is the protocol, and when something goes wrong you read the SSE events as they arrive.

I shipped this approach in claudoscope, a browser-only x-ray for Claude API calls. The whole token-economics visualization — cache reads, cache writes, uncached input, output, cost delta — is computed straight from the stream events described above. No SDK, no backend, no server-side proxy.

src/
  app/page.tsx              orchestration
  lib/anthropic.ts          the ~150-line client from this post
  lib/pricing.ts            tier-aware cost from usage events
  components/XRayPanel.tsx  what makes the data visible

The same client now powers three sibling tools — agent-replay, prompt-lab, tool-lab — without modification. Once the SSE parsing is yours, it composes.

If you've been waiting to put the Claude API in a browser tool because the SDK fights you: it's about an afternoon's work, and the result is small, debuggable, and yours.

The four tools, all open-source and BYOK: ferhatatagun.com/tools.

Source for the SSE client described here: github.com/ferhatatagun/claudoscope/blob/main/src/lib/anthropic.ts.

This post is mirrored from ferhatatagun.com/blog/browser-only-claude-streaming — that's the canonical URL.

Why `prerender = true` wasn't enough