DEV Community: Nate Voss

Structured Prompts Cut Token Waste 35-40%. Here's Where It Actually Matters.

Nate Voss — Wed, 27 May 2026 09:00:52 +0000

One structured prompt format. Two identical reasoning tasks. Same model. Unstructured: 1,240 tokens. Structured (with explicit schema): 847 tokens. 32% reduction. That's real, repeatable, shows up in cost logs. But it's also the easy part.

The harder part is knowing whether those saved tokens actually translate to better answers on YOUR task. And knowing when structure helps and when it's just overhead.

I spent the last month running the same prompts against Claude Sonnet 4.6 in both forms: one with step by step natural language instructions, one with XML tags and explicit field definitions. Code generation tasks, reasoning tasks, multi step workflows. Here's what the patterns actually show.

The Unstructured Baseline

When you send a model a request in plain English, the model has to infer the shape you want. It's flexible. It's also ambiguous.

Write a function that validates user email addresses and returns helpful error messages.

The model will deliver SOMETHING. Maybe a function with inline validation. Maybe a helper class. Maybe a regex comment. Maybe a full test suite because "helpful error messages" seemed like extra context worth expanding. You got an answer, but you didn't specify the answer format.

Over five runs with Sonnet 4.6, the same unstructured prompt produced three different architectural shapes:

Single regex based validator with a switch statement for errors
Class based validator with a dedicated error handler
Regex validator with a factory function for creating error objects

All correct. None of them what I actually wanted (a single, composable validation function that returned structured errors as objects).

Total tokens across five runs: 6,200. Average per run: 1,240.

The Structured Version

Same task, now with explicit format:

Write a JavaScript function: validateEmail()

Requirements:
- Input: string (email address)
- Output: { valid: boolean, error: string | null }
- Implementation: regex-based validation only
- Error messages: return null if valid, specific error reason if invalid

Error categories:
- "missing_at": no @ symbol found
- "invalid_domain": domain lacks . or has no TLD
- "invalid_local": local part contains invalid characters

Return example:
{ valid: true, error: null }
{ valid: false, error: "invalid_domain" }

Over five runs with the same model, every output had the same shape. No factory functions, no classes, no extra bells. It did exactly what was asked.

Total tokens across five runs: 4,235. Average per run: 847.

32% reduction. No ambiguity. Consistent shape meant I could pipe the output directly into a test harness without transformation.

Here's what that actually looked like:

function validateEmail(email) {
 const atIndex = email.indexOf('@');
 if (atIndex === -1) {
 return { valid: false, error: 'missing_at' };
 }

 const domain = email.substring(atIndex + 1);
 if (!domain.includes('.')) {
 return { valid: false, error: 'invalid_domain' };
 }

 // Check for invalid characters in local part
 const localPart = email.substring(0, atIndex);
 const invalidChars = /[<>()\\[\],.;:\s]/;
 if (invalidChars.test(localPart)) {
 return { valid: false, error: 'invalid_local' };
 }

 return { valid: true, error: null };
}

Every structured run produced this exact shape. Unstructured runs generated the same logic but wrapped it differently.

Why This Matters Less Than You Think

Here's the tricky part: tokens aren't the full story.

The unstructured versions were objectively MORE flexible. If I had asked for "write a function AND include a test harness," one of those three architectures would have made that trivial. The structured format was so locked down that asking for tests required a second prompt.

The benchmark friendly metric (tokens saved) is real. The useful metric (does this output directly feed my pipeline?) is context specific. Different answers, different weights for different tasks.

When Structure Actually Wins

Code generation tasks: structure wins hard. You have a format spec. You want the model to follow it. Tokens drop, consistency rises.

Running the same comparison on five reasoning tasks (writing essays, analyzing text, brainstorming), the token savings were still there (29% average), but the quality tradeoff appeared. Structured prompts locked the reasoning into tighter paths. Some essays came out more formulaic. Not worse, just more boundaried.

The model hit a schema compliance target instead of exploring the actual reasoning space.

For code: schema compliance IS the target. For reasoning: sometimes the messiness is the point.

Token Math (Real Numbers)

Using current pricing (Sonnet 4.6 input at $3/1M, output at $15/1M), average input tokens 2,000, average output 800:

Unstructured approach:

Input: 2,000 tokens × ($3/1M) = $0.000006
Output: 1,200 tokens × ($15/1M) = $0.000018
Per call: $0.000024
100 calls: $0.0024

Structured approach:

Input: 2,000 tokens × ($3/1M) = $0.000006
Output: 800 tokens × ($15/1M) = $0.000012
Per call: $0.000018
100 calls: $0.0018

Difference: $0.0006 per 100 calls. On pricing, it's noise. On latency (fewer output tokens = faster), it matters more.

If your task outputs 4,000 tokens regularly, suddenly the math shifts. Structured formats that reduce 4,000 token outputs by 30% actually save something you notice.

The Pattern Recognition Angle

What's interesting is what the output patterns reveal about how models parse instructions.

Models trained on massive code datasets have seen thousands of function specifications. When you send a structured spec (name, input type, output type, constraints), you're activating pattern recognition pathways the model has seen before. It copies the shape. Fast, consistent, fewer tokens.

When you send natural language, the model has to build context from scratch. It's slower, fuzzier, more creative. For code, that's overhead. For reasoning, that's sometimes the whole point.

The models aren't "reasoning through" the unstructured prompt. They're doing pattern matching on a less constrained pattern set. Which is fine. Just know that's what's happening. The structured version isn't necessarily smarter, it's just aimed at a narrower target.

The Practical Move

If you're optimizing cost on code generation at scale:

Use structured formats (XML or JSON schema)
Pre specify output shape and type constraints
Accept that consistency comes at the cost of flexibility

If you're working on reasoning or analysis:

Test both formats on your actual task
Don't assume the token savings mean better output
Watch the quality delta across 5 10 runs, not the benchmark

The people telling you "always structure your prompts" are right about code. They're also copying advice from a code heavy community. Test it on your task. The benchmark lift doesn't predict real utility. Your data does.

Tags: #ai #tutorial #javascript #optimization

3 Things I Learned About Holding Context Through Long Debugging Sessions

Nate Voss — Mon, 18 May 2026 09:39:41 +0000

You're four turns into debugging something with an LLM. The model just asked a clarifying question you answered two exchanges ago. You paste the error trace again. You repeat the code snippet. You restate the prior fix you attempted.

The model has continuity. Your token bill does not.

This is the cost of context collapse. And it's everywhere in production code.

Thing 1: Context is expensive, and repetition is the silent killer

Every time you re-paste the error trace, the stack trace, the code snippet, the prior fix attempt, you're paying for tokens you already paid for.

If a typical debugging session is ten turns and each turn repeats sixty percent of prior context, you're paying token costs for that sixty percent nine times. Do that ten times a day and the math starts to sting.

Here's what it looks like:

A standard debugging context (error, code, prior attempts) is about two thousand tokens. If you repeat it eight times across a session, that's sixteen thousand tokens just to restate the same problem.

Same session with optimized context reuse. Two thousand base. Three hundred per turn average for new information. Twenty-four hundred per turn maximum. Across eight turns, you're at around four thousand four hundred tokens total.

The difference is eleven thousand six hundred tokens. Doesn't sound like much per session. Compounds fast when this is your daily work.

The obvious fix: don't repeat yourself. But obvious and easy are different things. Most code I see just resubmits everything. It works. It costs.

Thing 2: Prompt caching changes how you structure the conversation

Prompt caching lets you mark parts of your prompt as stable. The system instructions, the error trace, the code snippet. You cache them once. Subsequent calls reuse that cache without repaying token cost.

The insight is not just about money. It's about how you architect the prompting workflow itself.

Instead of flattening every message with repeated context, you structure it as layers.

First call: establish the cache with stable context, get your initial response.

Subsequent calls: send only new information. The model still has full context because the cache holds the stable parts.

Here's what that looks like:

const Anthropic = require('@anthropic-ai/sdk');

const client = new Anthropic();

const systemPrompt = `You are a debugging assistant. Help diagnose and fix code issues. Be precise, suggest concrete changes, ask clarifying questions if needed.`;

const stableContext = `
## Error Context
Stack trace:
TypeError: Cannot read property 'id' of undefined
 at getUserData (./src/services/user.js:45)
 at Object.<anonymous> (./src/index.js:12)

## Code
function getUserData(user) {
 return {
 id: user.id,
 name: user.name,
 email: user.email
 };
}

## Prior Attempts
1. Added null check: if (!user) return null
2. Logged user object before line 45
3. Confirmed user is undefined when function called
`;

async function debuggingSession() {
 // Turn 1: Initial diagnosis (cache established)
 const response1 = await client.messages.create({
 model: 'claude-opus-4-7',
 max_tokens: 1024,
 system: [
 {
 type: 'text',
 text: systemPrompt,
 },
 {
 type: 'text',
 text: stableContext,
 cache_control: { type: 'ephemeral' }
 }
 ],
 messages: [
 {
 role: 'user',
 content: 'Why is user undefined here? The function is called from index.js line 12.'
 }
 ]
 });

 console.log('Turn 1:', response1.content[0].text);
 console.log('Cache created:', response1.usage.cache_creation_input_tokens);

 // Turn 2: Followup (reuses cache)
 const response2 = await client.messages.create({
 model: 'claude-opus-4-7',
 max_tokens: 1024,
 system: [
 {
 type: 'text',
 text: systemPrompt,
 },
 {
 type: 'text',
 text: stableContext,
 cache_control: { type: 'ephemeral' }
 }
 ],
 messages: [
 {
 role: 'user',
 content: 'Why is user undefined here? The function is called from index.js line 12.'
 },
 {
 role: 'assistant',
 content: response1.content[0].text
 },
 {
 role: 'user',
 content: 'I checked the caller. It\'s passing an empty object. Should I validate the shape?'
 }
 ]
 });

 console.log('Turn 2:', response2.content[0].text);
 console.log('Cache read from:', response2.usage.cache_read_input_tokens);
 console.log('New input tokens:', response2.usage.input_tokens);
}

debuggingSession();

Turn 1 pays full price. Look at the usage object. cache_creation_input_tokens shows what got cached.

Turn 2 reuses it. cache_read_input_tokens shows tokens pulled from the cache. Those cost about ninety percent less than new input tokens.

You're holding the context without repeating it.

Thing 3: The pit: assuming the cache survives beyond the session

This is the mistake I see most. Developers set up caching and assume it persists. It doesn't.

Cache is scoped to the session. Close the client connection, the cache is gone. Next time you run the script, cold cache. Full price.

The mental model matters. Caching is for single-session optimization. If you need context to travel across sessions (the user's project state, the error they're debugging this week), that goes in your app's state or database, not the LLM cache.

Another trap: stale cache. If the code changed or the error trace got updated, the cache doesn't know. You're debugging the wrong code. If stable context is external or user-provided, include a hash or timestamp. When it changes, cache misses. Rebuild.

The guardrail is simple. Manual invalidation if anything in the stable context is live data.

Multi-step debugging sessions are where this pays off. Structure it once. Run turns. Let the cache expire naturally when the session ends.

You're holding the thread. The model has continuity. Your costs don't multiply.

When Code Is Free, Intention Is the Craft

Nate Voss — Fri, 15 May 2026 08:38:38 +0000

I spent a week last month trying to decide whether to reach for a component library or build something small and custom. A standard problem. Normally I'd make the call in an afternoon. Weigh the time cost against the maintenance debt, measure the friction points in the UI, make the trade-off, move on.

This time I kept stalling. Not because I couldn't generate either solution. A good prompt gets me either path in an hour. The component library is documented. The custom build is straightforward. The actual cost of producing code has stopped being the constraint.

I was stalling because I had to figure out what I actually wanted. Not what was faster. Not what had fewer dependencies. What did I want the software to do, and what was I willing to trade to get there.

That's craft now. Not the code. The intention.

When Execution Stopped Being the Bottleneck

For a long time, "software craftsmanship" meant writing code that was clean, maintainable, elegant to read. It was the difference between someone who shipped sloppy work and someone who wrote code that would last, code that other people could inherit without cursing your name six months later.

That still matters. But it's not the bottleneck anymore.

The bottleneck is deciding what deserves to exist.

You can generate a component library integration in forty minutes. You can generate a custom component in fifty. The time delta is noise now. What you actually spend your time on is whether the custom version is worth the ongoing maintenance. Whether the library forces a mental model that's actually wrong for this problem. What breaks quietly if you choose wrong.

These are intention questions, not craftsmanship questions.

When I was learning, good developers were distinguished by their ability to write correct, maintainable code at speed. That was a real skill. You had to hold the whole system in your head and translate it into code that others could read and modify. Now the system-holding is still necessary, but the translation is automated. The bottleneck has moved upstream.

A junior developer can now generate technically perfect code. It will pass tests. It will follow style rules. It will do what you asked. But if you asked the wrong question, it will do the wrong thing perfectly.

The Thing You Can't Automate

I've noticed it in code reviews. A PR comes in that's technically flawless. Tests passing, patterns followed, everything the linter wants. But I find myself asking: why did you choose this approach? Not because something's broken. Because I can't see the reasoning in the code anymore. The code itself no longer carries the weight of difficult choices. It's too clean.

The uncomfortable truth is that sometimes the right answer is inefficient. Sometimes it's weird. Sometimes it violates a rule because you've discovered a specific constraint that makes the rule wrong in this case. Those decisions used to show up in the code, visible to anyone reading it, proof that someone was thinking. Now they don't. The code can be flawless and thoughtless simultaneously.

This changes what junior developers should actually be learning. Not "write clean code fast." That's commoditized now. It's "know when to break your own rules and why." It's "understand the system deeply enough to know when a pattern doesn't fit." It's "ask the right question before you ask the code to answer it."

It also changes what you should be optimizing for in your own work. I used to try to get faster at writing code. That was the right goal when code-speed was the binding constraint. Now I'm trying to get better at asking questions. At running thought experiments before I open a file. At noticing when I'm about to solve the wrong problem elegantly instead of the right problem messily.

The Thing That Doesn't Work Anymore

The weird part is that this should make software better. If everyone's spending more time thinking and less time typing, we should ship fewer bad decisions. But I'm not sure we do. The cheaper the code becomes, the more of it we generate. The more we build things that don't need to exist. The more we ship because we can, because the friction of shipping has dropped below the threshold of "maybe this isn't worth doing."

The constraint that used to be time-to-ship, and thus made you think hard before you shipped, has dissolved. Now the constraint is intention. Do you actually think this matters? Are you shipping it because it solves a real problem or because it was fun to build?

Those are harder questions. They don't have right answers waiting to be discovered. They require you to know something about what you're trying to do and why, and then to hold that intention steady while you build it. They require discipline about the scope of your own attention.

That's the craft now. The thinking before the code. The discipline to notice when you're building something because you can, not because you should.

5 rules I added to my CLAUDE.md after burning a full day on a Tiptap editor

Nate Voss — Wed, 13 May 2026 08:33:47 +0000

I spent a full day fighting a Tiptap editor through CDP to post short notes from a side project of mine. Rewrote the injection script four times. Mapped out ProseMirror's transaction API. Worked around isTrusted checks. Got it to render the text. Got the publish button to enable. Got a 200 back. Watched the post not actually appear because the editor's internal state hadn't been updated through a real input event.

Then I searched npm. There was a maintained TypeScript package, updated March 2026, that did the whole thing in three lines:

api.newPost().text(x).publish()

No browser. No CDP. No Tiptap. The problem was already solved. I just hadn't searched first.

That day became the first rule in my CLAUDE.md. Then four more, because the underlying mistake had layers.

1. Search before building. Always.

Any time the AI is about to write browser automation, reverse-engineer a private API, or fight a platform's DOM, it has to pause and search first. Targets in order: GitHub (site:github.com <platform> api typescript), npm (npm search <platform>), and MCP server registries. Someone has almost always fought this before. The 20-minute install beats the day of debugging every single time, and the only way I learned that was by paying the day.

2. Verify the library can write, not just read.

This one cost me a second smaller hole. Half the packages you find for a given platform are read-only scrapers. They'll happily list posts and pull comments and do nothing you actually need. Before I let the AI commit to a dependency, the rule is: confirm there's a publish, create, post, or submit method in the public API. If the README only shows GET-style examples, keep searching.

3. Check that the package is actually maintained.

last commit: < 6 months ago
open issues: not a graveyard
latest release: matches the README examples

A stale package is worse than no package, because it lulls you into thinking the problem is solved when the auth flow it relies on broke six months ago. Six months of commit activity is the cheap heuristic I encoded. It's not perfect but it filters out the abandoned ones in two clicks.

4. Prefer libraries with a stable auth mechanism.

The Tiptap package I eventually used authenticated with a cookie token I pasted once. The fragile alternative is anything that scripts a login form, because login forms get reworked, get CAPTCHAs added, get extra CSRF steps. Cookie or token auth survives UI redesigns. Form-driven login dies the next time the platform ships a new sign-in page. So the rule says: if the only auth path is DOM login, treat the library as a last resort.

5. When nothing exists, document what you searched.

This is the rule that keeps future-me from repeating present-me's mistake. If a real search genuinely turns up nothing and custom automation is justified, the AI has to write down what queries it ran and what registries it checked, in a comment at the top of the new file. Otherwise three weeks later I'll be in the same spot, wondering if anything has shipped since, and I'll re-do the same fruitless search instead of trusting the prior negative result. Document the dead end and the next session inherits it.

These five rules pair with a Documentation-First rule I keep alongside them. Search-Before-Building finds the tool. Documentation-First makes sure the AI actually reads the README instead of guessing at the API shape from the package name. The two together have killed almost every "let's just script the browser" instinct my AI used to have by default.

The shift in how my AI proposes solutions has been noticeable. Where it used to reach for Puppeteer or a CDP snippet within the first message, it now opens with a search step and reports back with two or three candidate libraries before writing a single line of new code. The first time it said "I checked npm and GitHub, nothing maintained exists for this, here's what I searched" before falling back to custom code, I knew the rule had landed.

If you want to copy these into your own setup, here's the format that works in CLAUDE.md, cursor rules, or copilot instructions:

Before writing custom browser automation, API reverse-engineering, or DOM workarounds, search GitHub, npm, and MCP server registries for an existing library first. Only build custom if a real search returns nothing, and if you do, document the queries you ran so the next session doesn't repeat them.

Add it once, save yourself the next version of this same day.

Tags: #claudecode #ai #productivity #devtools #javascript

How I Cut My API Bill in Half Without Understanding What I Was Doing

Nate Voss — Mon, 11 May 2026 08:58:11 +0000

How many of your API calls are re-processing the same instructions?

I couldn't answer that question either. I had this 4,000-token context block I was shipping to Claude Opus 4.7 with every request. Cost me about a cent per call. I kept moving on because it worked. One day I looked at the math. $300 a month. And realized I was wasting money I didn't need to waste.

I started looking at what was actually changing per request and what stayed the same. System prompt? Same. Instruction block? Same. Company style guide? Same. User question? Different.

That's when prompt caching clicked. But here's the thing: I didn't really understand why I was doing it until I had to explain it to someone else.

The Question You Can't Answer Without Implementing It

Prompt caching seems simple: mark some blocks of your request as "cache this," and Claude reads them from cache on the next request. 90% cost reduction on cached tokens.

But it only works if you truly understand what in your request is static and what's dynamic. A lot of developers think they know the difference. They don't. Not until they try to cache.

I'll show you the code in a second. But first: the hard part isn't the code. It's the thinking.

Why Caching Matters

The obvious reason: cost. A 2,000-token system prompt that you send 10,000 times a month costs real money. Caching cuts that to nearly nothing.

The actual reason you should care: caching forces you to author your prompts consciously. Either your instructions are truly reusable. In which case, cache them. Or they're not. In which case, you've got a mess in your code.

Here's what happens when you try to cache without thinking: you'll put a user ID in the system prompt. Or today's date. Or a flag that changes per request. Then you'll wonder why the cache never hits. You'll add logging. You'll go three layers deep into debugging. And then you'll realize: you never actually separated "stable instruction" from "user input."

That moment is worth the whole exercise, regardless of the cost savings.

The Code

const Anthropic = require("@anthropic-ai/sdk");

const client = new Anthropic({
 apiKey: process.env.ANTHROPIC_API_KEY,
});

const STYLE_GUIDE = `
# Code Style Standards

## Naming Conventions
- Variables: camelCase
- Classes: PascalCase
- Constants: UPPER_SNAKE_CASE
- Private methods: _leading underscore

## Comments
- Comment the why, not the what
- Every public function gets a JSDoc block
...
[imagine 3,000 tokens of actual guidelines]
`;

async function reviewCode(userSubmittedCode) {
 const response = await client.messages.create({
 model: "claude-opus-4-7",
 max_tokens: 1024,
 system: [
 {
 type: "text",
 text: "You are a code reviewer. Apply the style guide strictly. Be direct and specific.",
 },
 {
 type: "text",
 text: STYLE_GUIDE,
 cache_control: { type: "ephemeral" },
 },
 ],
 messages: [
 {
 role: "user",
 content: `Review this code:\n\n${userSubmittedCode}`,
 },
 ],
 });

 return response.content[0].text;
}

That cache_control: { type: "ephemeral" } tells Claude to cache the STYLE_GUIDE block. The first call pays the cost of processing it. Every call in the next 5 minutes reads from cache.

You can verify the hit in the response:

const response = await client.messages.create({...});

console.log(response.usage.cache_creation_input_tokens); // first request
console.log(response.usage.cache_read_input_tokens); // subsequent requests

If cache_read_input_tokens > 0, the cache worked.

The Gotcha That Catches Everyone

Cache hits depend on exact byte-for-byte matching. Exact.

// Cache HIT
const guide = `You are a code reviewer...`;
const guide = `You are a code reviewer...`;

// Cache MISS (extra space)
const guide = `You are a code reviewer...`;
const guide = `You are a code reviewer... `; // trailing space

// Cache MISS (different model)
// change from opus-4-7 to sonnet-4-6

// Cache MISS (you vary user messages before the cached block)
// Cache key changes every time the message history changes

This is where understanding matters. You can't just scatter cache_control through your code and expect it to work. You have to commit to treating those blocks as constants.

A lot of developers find this frustrating. I found it clarifying. If I'm shipping code where the instructions change per request, the cache can't help. So either: rewrite the instructions to not change, or remove the cache. Either way, I'm forced to make a choice.

Where Caching Actually Wins

Batch processing with consistent context. Processing 100 files with the same analysis rules.

Customer support: load the knowledge base once, every user question reuses it.

Long-form analysis: load a document, ask 20 follow-up questions against it. Only the first question pays for the document.

Code generation with a stable system persona and reference library.

Anything where you're making multiple requests with the same expensive-to-process input.

The Thing Nobody Talks About

I started caching because I wanted to cut costs. I kept caching because it made my code clearer.

When you have to decide "is this instruction static or dynamic?" you start thinking about what your prompts actually are. You stop cargo-culting patterns. You own the code because you can explain it.

That matters more than the 85% cost reduction.

Suggested tags: ai, tutorial, javascript, beginners

The Leverage Shift: Why Infrastructure Cost Doesn't Matter Anymore

Nate Voss — Fri, 08 May 2026 07:15:44 +0000

Last month I built a feature that would've consumed my monthly API budget three years ago. It involved processing 50,000 tokens of context, running chains of prompts, error recovery, retries. I spent maybe four dollars. Not 400. Not "significant." Four.

That casually happened during a normal Tuesday build. I wasn't optimizing for cost. I wasn't watching the meter. I was shipping clarity. Breaking down a complex decision into smaller, dumber steps. And the economics of doing that were so cheap that they didn't show up on my radar anymore.

This is the quiet shift nobody talks about when they talk about AI. It's not "AI is now viable for startups". Everyone's saying that, and they're right. It's deeper: the economic game of building and shipping software has fundamentally changed shape, and most people are still playing chess with the old board.

The numbers

Claude 3.5 Sonnet costs $3 per million input tokens, $15 per million output tokens. GPT-4o is $5/$15. A context window that can hold a small novel costs pennies to run inference on.

A solo developer building a feature that makes 10,000 API calls per day runs maybe $150/month. That's the cost of a coffee subscription. Compare that to the cost of hiring one mid-level engineer to ship the same thing, or the infrastructure capital you needed in 2015 to run equivalent workloads yourself. Tens of thousands in hardware, plus the engineering time to maintain it.

But the real shift isn't in the absolute cost. It's in who gets to ship.

The moat problem

Every "AI-first startup" shipping right now is built on this exact cost collapse. Someone got VC funding, hired a team, built something that calls Claude, made it prettier with a design system, and shipped it. The business model is usually "we mark up the API calls." Which means. And I'm not being harsh. They're playing with a moat they don't own.

Your moat isn't the model. It isn't the tokens. It isn't the prompt. It's something else, or it doesn't exist.

I've noticed this building systems that rely on language models. The feature that matters isn't "we call Claude to generate content." That's technical infrastructure anyone can replicate in an afternoon. The feature that matters is the specific way the system engages across platforms, the calibration of how much to reply versus broadcast, the calendar that knows when to rest. That only works because someone cares enough to refine it for 18 months and measure what actually lands with an audience. The API calls are the easy part.

You can build that infrastructure for four dollars a month. You cannot buy the other layer.

Open source vs the API

This is where the calculation gets interesting. Self-hosting Llama 2 on a p3.8xlarge costs roughly $12/hour. For a low-volume feature (maybe 1,000 tokens/day), that's economically indefensible. You're paying for idle compute. For high-volume (millions of tokens/month), it pencils out.

But "pencils out" ignores the hidden costs: maintenance, inference optimization, managing VRAM, handling failures, updating models. And it ignores the opportunity cost: that's your engineering time, not shipping the actual feature.

The shift is that the crossover point has moved. Five years ago, building anything non-trivial in production meant: evaluate open-source models, find one that works, self-host it, hire someone to maintain it. The API was expensive; the infrastructure was cheap (because you didn't pay for idle time).

Now the API is cheap enough that most solo projects shouldn't self-host. You're not choosing between "pay the API vendor" and "own our infrastructure." You're choosing between "pay $200/month" and "pay $50,000 in engineering for something that breaks in production and costs $3,000/month to run."

There are exceptions. If you're running language models at hyperscaler volume (billions of tokens/month), self-hosting with cheaper open models becomes non-negotiable. But that's not the constraint on most projects. And even then, you're still paying for compute. You're just deciding to own the infrastructure instead of outsourcing the billing.

The real cost: clarity

Here's what I didn't expect: cheaper infrastructure doesn't make the problems simpler. It moves them.

The cost of building with AI used to be economic: can I afford to make this call? Now it's cognitive. Can I write the logic clearly enough that the model does what I actually need? Can I debug why this worked yesterday and not today? Can I handle the failure case when the model hallucinates?

I spent a week recently on a single decision-making routine. The model was generating great output but missing the signal I needed buried in the analysis. I kept adding context, more examples, longer explanations. Finally hit a token budget and had to cut 70% of what I'd written. The version that worked. The one that was clear instead of thorough. Was the one I'd almost deleted.

The use has moved from "can you afford compute?" to "can you think clearly?" And that's actually a more interesting gate.

What this means for solo builders

You can now build the infrastructure of a Series A company, alone, for the cost of a Spotify subscription. That's real. Not metaphorically. Literally. The cloud bills are negligible. The engineering is finite.

What you can't do is own a moat you didn't build. You can't ship a wrapper and expect market gravity to solve the rest. The people winning with AI right now aren't winning because they found a good model. They're winning because they found a real problem and got ruthlessly specific about solving it.

The gate isn't capital anymore. The gate is clarity. Do you know what you're building? Do you know it better than anyone else? Can you measure whether it's working? Can you refine it based on signal instead of hope?

That's the use. Infrastructure cost collapsing just made it possible to do alone.

Folks, need some feedback on this: https://dev.to/natevoss/i-wrote-a-rule-after-claude-got-is-x-built-wrong-4-times-looking-for-failure-modes-2f3i

Nate Voss — Thu, 07 May 2026 10:24:08 +0000

Nate Voss

May 6

I wrote a rule after Claude got "is X built?" wrong 4 times. Looking for failure modes.

#ai #llm #programming #claude

4 min read

I wrote a rule after Claude got "is X built?" wrong 4 times. Looking for failure modes.

Nate Voss — Wed, 06 May 2026 12:03:25 +0000

I wrote a rule for AI coding agents two days ago. It is untested in production sessions. I am posting it here to find its failure modes faster than I would by waiting for my own future mistakes to surface them.

The rule is below first. Story and reasoning after.

Pre-Build Existence Audit Rule (v1, structural verification)

Status: Untested in production sessions. Test on a new project for 2-3 weeks
before considering global rollout.

Before claiming "feature X is not built / not implemented / missing":

1. Map
   rg -li "<keyword>" .                            # project repo
   rg -li "<keyword>" ~/.claude/projects/*/memory/ # agent memory
   If either >5 files match, use the file list to scope which to read.

2. Structural footprint scan (NOT just synonyms)
   Identify architectural invariants this feature class would require:
   - Integration/API → router definitions, endpoint registrations,
     plugin tool lists
   - Data → schema files, migrations, type definitions, persisted-entity fields
   - Background → cron entries, queue handlers, scheduled job registrations
   - Cross-service → service registry, infra config, IPC handlers
   - Memory/decisions → project_*.md files documenting prior shipment

   Stack discipline: footprints must be stack-appropriate. If unsure which
   architectural pattern applies, list 2-3 plausible alternatives
   (REST/GraphQL/RPC; cron/queue/webhook) and search each. Wrong-ontology
   audits feel rigorous but miss truth.

   Grep each invariant. If ANY return matches, "not built" is contradicted
   until you've read those matches.

3. Epistemic categorization. Label each match as ONE of:
   - Direct Proof: read the exact logic for the dimension being asked
   - Infrastructure Hint: schema/hooks/types only, not the specific logic
   - Partial Implementation: some footprints present, others missing
   - Global Absence: searched ALL relevant invariants across ENTIRE repo,
     found no footprint

4. Cite without fabricating
   Quote 3-5 lines of actual matched code. Include path + line range IF
   the tool provided them. Never invent line numbers.

5. Conclusion leads with epistemic status:
   "For the [dimension], evidence = [Direct Proof / Infrastructure Hint /
   Partial Implementation / Global Absence]; matches in [files] show [what];
   structural footprint scan of [invariants] returned [result]."

Fallback (Safe Mode):
Answer is "let me check first", NOT "X isn't built", if any of:
- Unable to name the dimension precisely
- Footprint scan returned matches you haven't read
- Unsure which architectural pattern applies AND haven't searched alternatives
- The user pushed back on a similar claim recently

Self-check triggers:
- "I'd remember if we built this"
- "BACKLOG looks confident"
- "I just need to check one file"
- "My mental model of this system feels obvious" (← especially this one)

Honest limits:
- Wrong mental model of the architecture can still produce structurally
  rigorous wrong audits. The stack-discipline sub-step is a hedge, not a fix.
- Generated code, external services, dynamic dispatch, and indirection can
  evade footprint scans even when the feature exists.
- "Global" means global-within-visible-code, not global-within-system.
- Discipline is in the practice, not the prose. A 700-token rule
  half-followed is worse than a 200-token rule actually followed.
- This rule reduces but does not eliminate misclaims.
- When the architectural ontology is unclear, ask the user before concluding.

That is the rule. Now the reasoning.

I had Claude Code as my coding agent on a personal automation project. By 11 AM one morning, the agent had confidently claimed "feature X is not built" four times in a row, each time wrong, each time caught only by my pushback. The pattern was identical: trust the project's BACKLOG framing, do a narrow grep, miss adjacent layers, declare absence.

The standard hallucination story does not fit. The agent searched. It searched fine. It just searched for the wrong shapes.

What I observed: the agent was searching by name when it should be searching by shape. A feature can be called anything. A feature cannot exist without leaving structural residue. There has to be a route, a schema, a registered tool, a scheduled job, a documented decision. When the agent searches by name, it is asking what string would this feature use (a question about vocabulary). When it searches by shape, it is asking what artifact would this feature require (a question about architecture).

The rule above forces the second question.

I ran the rule through eight critiques across four rounds before settling on this version. The biggest substantive shift was structural-footprint-vs-synonyms. The earlier draft had me generating better synonyms when stuck. That just relocates the dependency on the agent's imagination. The structural-footprint version asks a different question: what artifact would prove the feature exists? Then grep for that artifact. The dependency moves from imagination to architectural knowledge, which is more reliable.

The other major addition was the absence-scope distinction: "I searched module X and found nothing" is a scope claim, not a fact claim. The fix is making absence claims global on the architectural invariant.

The rule has known limits. They are listed in the rule itself. The biggest one is wrong-ontology rigor: an agent could generate a structurally rigorous footprint search against the wrong architectural pattern (e.g., search GraphQL patterns on a REST system) and confidently confirm absence. The stack-discipline sub-step is a hedge, not a fix.

What I want from you:

Try it. Run it as a system rule on a project where you use an AI coding agent for a few sessions.
Tell me what breaks. Specifically: hallucination shapes the structural footprint search would NOT catch, audit-theater patterns where the form is satisfied without the substance, over-triggering on questions that were not actually absence claims.
Tell me what you have written. If you have rules in your own CLAUDE.md or system prompt that solve adjacent problems, I want to read them.

I am running this on a separate project for two to three weeks before deciding whether to graduate it to my global agent configuration. After that I will know whether to keep it, refine it, or archive it. Your test reports compress that timeline.

Reply or DM on whichever platform you found this.

The misspelling stays.

Pre-Build Existence Audit Rule : looking for the failure modes I'm still missing

Nate Voss — Wed, 06 May 2026 12:02:46 +0000

The rule is below first. Story and reasoning after.

Pre-Build Existence Audit Rule (v1, structural verification)

Status: Untested in production sessions. Test on a new project for 2-3 weeks
before considering global rollout.

Before claiming "feature X is not built / not implemented / missing":

1. Map
   rg -li "<keyword>" .                            # project repo
   rg -li "<keyword>" ~/.claude/projects/*/memory/ # agent memory
   If either >5 files match, use the file list to scope which to read.

2. Structural footprint scan (NOT just synonyms)
   Identify architectural invariants this feature class would require:
   - Integration/API → router definitions, endpoint registrations,
     plugin tool lists
   - Data → schema files, migrations, type definitions, persisted-entity fields
   - Background → cron entries, queue handlers, scheduled job registrations
   - Cross-service → service registry, infra config, IPC handlers
   - Memory/decisions → project_*.md files documenting prior shipment

   Stack discipline: footprints must be stack-appropriate. If unsure which
   architectural pattern applies, list 2-3 plausible alternatives
   (REST/GraphQL/RPC; cron/queue/webhook) and search each. Wrong-ontology
   audits feel rigorous but miss truth.

   Grep each invariant. If ANY return matches, "not built" is contradicted
   until you've read those matches.

3. Epistemic categorization. Label each match as ONE of:
   - Direct Proof: read the exact logic for the dimension being asked
   - Infrastructure Hint: schema/hooks/types only, not the specific logic
   - Partial Implementation: some footprints present, others missing
   - Global Absence: searched ALL relevant invariants across ENTIRE repo,
     found no footprint

4. Cite without fabricating
   Quote 3-5 lines of actual matched code. Include path + line range IF
   the tool provided them. Never invent line numbers.

5. Conclusion leads with epistemic status:
   "For the [dimension], evidence = [Direct Proof / Infrastructure Hint /
   Partial Implementation / Global Absence]; matches in [files] show [what];
   structural footprint scan of [invariants] returned [result]."

Fallback (Safe Mode):
Answer is "let me check first", NOT "X isn't built", if any of:
- Unable to name the dimension precisely
- Footprint scan returned matches you haven't read
- Unsure which architectural pattern applies AND haven't searched alternatives
- The user pushed back on a similar claim recently

Self-check triggers:
- "I'd remember if we built this"
- "BACKLOG looks confident"
- "I just need to check one file"
- "My mental model of this system feels obvious" (← especially this one)

Honest limits:
- Wrong mental model of the architecture can still produce structurally
  rigorous wrong audits. The stack-discipline sub-step is a hedge, not a fix.
- Generated code, external services, dynamic dispatch, and indirection can
  evade footprint scans even when the feature exists.
- "Global" means global-within-visible-code, not global-within-system.
- Discipline is in the practice, not the prose. A 700-token rule
  half-followed is worse than a 200-token rule actually followed.
- This rule reduces but does not eliminate misclaims.
- When the architectural ontology is unclear, ask the user before concluding.

That is the rule. Now the reasoning.

The standard hallucination story does not fit. The agent searched. It searched fine. It just searched for the wrong shapes.

The rule above forces the second question.

What I want from you:

Try it. Run it as a system rule on a project where you use an AI coding agent for a few sessions.
Tell me what breaks. Specifically: hallucination shapes the structural footprint search would NOT catch, audit-theater patterns where the form is satisfied without the substance, over-triggering on questions that were not actually absence claims.
Tell me what you have written. If you have rules in your own CLAUDE.md or system prompt that solve adjacent problems, I want to read them.

Reply or DM on whichever platform you found this.

The misspelling stays.

4 rules I added to my CLAUDE.md after a week of weird CLI bugs

Nate Voss — Wed, 06 May 2026 10:45:50 +0000

I shipped a small CLI tool last week and ran it for the first time on my own machine. The output had a number that read 1,28,000 instead of 128,000. I stared at it for a minute, ran the same code in a node REPL, got the same wrong number back, and realized my locale was doing it. A few hours later I had four entries in my CLAUDE.md file that I should have written months ago.

Here they are.

1. Always pass an explicit locale to `toLocaleString`

Bare (128000).toLocaleString() is a trap. It uses whatever the system locale happens to be. On my machine that's en-IN, which renders 128000 as 1,28,000. On a US-locale machine the same code returns 128,000. The bug only shows up where the locale is set, which means CI passes, your unit tests on a fresh container pass, and the wrong separators land in production output.

// don't
total.toLocaleString()

// do
total.toLocaleString('en-US')

The rule I added: any time you generate number-formatting code for CLI or any user-facing text, pass an explicit locale. Never call it bare.

2. Keep CLI output for AI-editor tools under five lines

I learned this one by watching myself ignore my own tool. I'd built a small command for inspecting build artifacts and ran it inside an AI assistant's terminal. The result was a 12-line printout with the answer near the bottom. The assistant collapsed it behind a Ctrl+O to expand prompt, and I never expanded it. The agent reading the output never saw the result either.

If a CLI is designed to run inside an AI assistant's bash, the design constraint is that assistant's read window, not your own terminal. Result plus summary in 4 to 5 lines max. Verbose mode behind a flag.

✓ build ok
3 packages, 412kb gzipped
fastest: core (180ms)
slowest: ui (910ms)

That fits. Anything more elaborate gets folded behind the expand prompt and effectively disappears from both you and the model.

3. `npx <package>` fails inside the package's own monorepo

I burned half an afternoon on this one. I was developing the CLI inside a pnpm workspace, ran npx my-cli to smoke-test it, and got resolver errors. The package built fine. The bin field was correct. Outside the repo it ran clean. Inside the workspace, the resolver had different ideas about which version of which thing to use, because workspace context confuses it.

The fix is not a fix, it's a docs entry. If a CLI lives in a monorepo, your README should say install globally with npm install -g or run from outside the project directory. The rule I added to CLAUDE.md tells the assistant to never suggest npx <pkgname> as a smoke test from inside the same repo that defines the package.

4. Shared-library fixes need version bumps on both sides

I had a CLI that depends on a small library of mine as an external npm package, not a bundled module. I found a bug in the library, fixed it, ran the CLI's tests against the local working tree, everything passed, and shipped. The bug was still live.

The CLI's package.json was pinned to the previous library version. Fixing the library does nothing downstream until you publish a new version of the library AND bump the consumer's dependency to match. Local test runs lie because they resolve to your working tree, not the published artifact.

The rule, copy-paste:

If a fix touches a shared library that the consumer depends on as
an external npm package (not vendored, not workspace-linked):
  1. Publish the library with a new version
  2. Bump the consumer's dep range
  3. Publish the consumer
Otherwise the fix doesn't ship.

I added that to CLAUDE.md as a checklist the assistant walks through before claiming a bug is fixed.

Closing

If you want to copy one of these into your own setup, here's the format that works in CLAUDE.md, cursor rules, or copilot instructions:

When generating any number-formatting code for CLI output or user-facing text, always pass an explicit locale to toLocaleString (typically 'en-US'). Never call it bare. System locale varies by region and produces wrong thousands separators (e.g. 1,28,000 instead of 128,000).

Add it once, save yourself the next version of this same day.

Model Routing: 3 Things I Learned Sending Tasks to the Cheapest Model That Actually Works

Nate Voss — Mon, 04 May 2026 06:54:30 +0000

Everyone benchmarks models. Sonnet beats Haiku on reasoning. Opus beats Sonnet. Haiku is fastest. These things are all true.

But benchmarking and deploying are different games. At scale, the difference between Haiku at $0.80/million tokens and Sonnet at $3/million tokens isn't academic. It's $400+ monthly on a mid-size application. The trap is paying for capability you don't actually need because you never measured what you do need.

I built a router to answer one question: which tasks in my actual workflow could run on the cheapest model without failing? The answer surprised me. And I learned that the real value isn't the savings. It's the forcing function. You can't implement routing without auditing exactly where your complexity lives.

3 Things I Learned

1. Your Intuition About Task Complexity Is Backwards

You think something needs Sonnet. Your gut says: "this requires reasoning, obviously expensive model."

So I measured. Content classification? Haiku handles 95% of real requests. Writing summaries? 88%. Extracting structured data? 92%. The edge cases that needed Sonnet were smaller than I'd guessed. And they were always the same types of edge cases.

Here's the pattern I found: obvious cases are really obvious to Haiku. Spam detection, data validation, simple extractions. Haiku nails these. The failures cluster in a small, identifiable category: ambiguous cases where the human answer is ambiguous. That's when you need Sonnet's nuance.

But you don't know your edge case percentage until you try. Guessing leaves money on the table.

2. You Need Observability Before Routing Saves Anything

The instinct is to build the router first. "Let's write logic that detects complex requests and routes to Sonnet."

This is backward. You need to measure first. Log every task with both Haiku and Sonnet responses side-by-side. Compare them. Find the patterns.

Real questions to answer:

When did Haiku refuse a task that Sonnet handled?
How often do their answers differ, and which one was right?
Was Haiku just uncertain, or actually wrong?

This requires instrumenting your inference layer. It takes a week. But you can't optimize what you can't see. Most teams skip this and build routers on intuition, which is why their routers are fragile.

3. Routing Rules Should Be Dumb, Not Smart

The temptation: build a classifier that predicts task complexity. Input length heuristics, keyword matching, embedding similarity. Something sophisticated.

Don't. Use a simple rule: "If the model reports low confidence, escalate to Sonnet."

This separates the decision from the task. Haiku tells you when it's uncertain. That's a signal you can act on immediately, without needing to predict the future.

The dumb rule wins because:

It adapts as your tasks change (no retraining)
It's testable (you can verify the confidence threshold)
It fails safely (escalation costs more but works)

The smart rule loses because routing logic becomes load-bearing infrastructure. Requires constant tuning. Breaks when your data distribution shifts.

How It Works

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

async function classifyWithFallback(text, confidenceThreshold = 0.7) {
 // First pass: try Haiku (cheap, fast)
 const haikuResponse = await client.messages.create({
 model: "claude-3-5-haiku-20241022",
 max_tokens: 100,
 messages: [
 {
 role: "user",
 content: `Classify this text as: safe, unsafe, or review-needed. Return JSON with {classification, confidence}.

Text: "${text}"`
 }
 ]
 });

 const haikuResult = JSON.parse(haikuResponse.content[0].text);

 // Log all Haiku decisions (even successes)
 // You're building a dataset of "when does Haiku work?"
 console.log({
 text: text.slice(0, 50),
 model: "haiku",
 classification: haikuResult.classification,
 confidence: haikuResult.confidence,
 tokensUsed:
 haikuResponse.usage.input_tokens +
 haikuResponse.usage.output_tokens
 });

 // If Haiku is unsure, escalate to Sonnet
 if (haikuResult.confidence < confidenceThreshold) {
 const sonnetResponse = await client.messages.create({
 model: "claude-3-5-sonnet-20241022",
 max_tokens: 100,
 messages: [
 {
 role: "user",
 content: `Classify this text as: safe, unsafe, or review-needed. Return JSON with {classification, confidence}.

Text: "${text}"`
 }
 ]
 });

 const sonnetResult = JSON.parse(sonnetResponse.content[0].text);
 console.log({
 text: text.slice(0, 50),
 model: "sonnet",
 escalatedFrom: "haiku",
 classification: sonnetResult.classification,
 tokensUsed:
 sonnetResponse.usage.input_tokens +
 sonnetResponse.usage.output_tokens
 });

 return sonnetResult;
 }

 return haikuResult;
}

That's it. Run both models in parallel during development and log the results. In production, start with Haiku, escalate on low confidence. As your logs accumulate, you'll see exactly which tasks need expensive models and which don't.

The Math

Haiku: $0.80 per 1M input tokens
Sonnet: $3 per 1M input tokens

Scenario: 1M requests/month, 200 tokens average
- All Sonnet: 1M × 200 tokens = $600
- 95% Haiku: (950k × 200) Haiku + (50k × 200) Sonnet = $152 + $30 = $182
- Savings: $418/month

At enterprise scale (100M requests/month): $41,800/month saved by routing to the cheapest viable model.

The cost difference compounds. Small routing decisions get multiplied across thousands of requests.

One Common Pitfall

You'll build a sophisticated router and wonder why it doesn't move the needle. Usually because:

You spent three months on routing logic, but you spend one week validating it
The escalation threshold is too aggressive ("if anything looks hard, use Sonnet")
You're routing on heuristics, not observed behavior

The fix: measure first, always. Log both models' responses in parallel before committing to either one. You'll find that the obvious cases are really obvious, and the edge cases are smaller than you think.

When Routing Actually Works

Build it if:

You have >100k requests/month (smaller volume doesn't justify overhead)
Your requests fall into clusters (some are cheap tasks, some are hard)
You can measure ground truth (compare Haiku vs Sonnet, track which was right)

Don't build it if:

<10k requests/month (infrastructure overhead isn't worth it)
Every request is unique and complex (no pattern to exploit)
You need 99.9% accuracy (can't tolerate Haiku failures)

The Real Win

The cost savings are real. But the bigger win is the audit itself. Building a router forces you to measure exactly where your complexity actually lives. Most teams overthink what they need because they never measure. The router is just the excuse to finally look.

3 Things I Learned Auditing My LLM App's Token Spend (And Why Your Benchmarks Are Lying)

Nate Voss — Mon, 27 Apr 2026 08:04:50 +0000

You know that feeling when you ship an AI feature and realize your token bill is 3x what you estimated? Yeah, that was me last week.

I have this thing called Agent-Max — it's a multi-platform growth agent that runs autonomous workflows: generating content, publishing to Bluesky, Medium, Twitter, Reddit. Sounds heavy, right? Every Monday it synthesizes a week of reading, scrapes engagement metrics, decides what to post and where. Seven platforms. Infinite LLM calls if you're not paying attention.

Last Sunday I realized I had no idea what I was actually spending. I knew roughly — "somewhere between $5-20/week" — but roughly is how you end up with bill shock. So I built PromptFuel to solve the actual problem: measure what your app is doing, not what the docs say it should do.

Here's what three days of auditing my own code taught me.

1. Your bottleneck isn't the model you picked, it's the prompt you didn't trim

I assumed my biggest cost sink was the weekly reflection. Claude reads 7 days of snapshots, engagement data, content history, trend analysis, then reasons about next week's strategy. Heavy prompt, right?

Nope.

Running pf optimize on the actual prompts showed the reflection was 2,847 tokens. Not small, but fine. The real killer: the daily content pregeneration loop was calling Claude 5 times per platform, and each call had:

Entire engagement history (redundant. I'm fetching fresh data every run)
Every. Single. Previous. Post. (all 120 of them, in the context)
Current date, weather, trending topics (reloaded every call)

Cutting history to "last 10 posts, last 3 days of engagement" knocked 40% off. Not because I switched models. Because I stopped hallucinating I needed context I wasn't even reading.

2. Your audit will surface the dumb stuff, not the obvious stuff

Benchmarks tell you Claude costs 3¢ per 1M input tokens. Haiku costs 0.8¢. Pick the right model, do the math, move on.

Except I was calling Claude Sonnet 7 times/week on background analytics where Haiku was plenty. Not intentional. I'd copied the model from an earlier prompt and never thought about it again. One-line change, zero quality loss, $2 saved per month.

That math never shows up in a benchmark. It shows up in your actual codebase, on your actual data, running your actual job. PromptFuel's advantage isn't telling you models are expensive. It's finding the calls you forgot about and showing you the before/after side-by-side.

3. Once you see the numbers, the optimization loop becomes obvious

The first time I ran the dashboard, I thought I was done. Then Monday's weekly job ran and I watched 47 new prompts execute. Dashboard updated in real time. I saw the pattern. There's another cut.

Auditing once is useful. Auditing every week is how you stop bleeding money.

Let's walk through it

Install:

npm install -g promptfuel

Run pf optimize on a real prompt:

pf optimize ./src/prompts/reflect.md --model claude-3-5-sonnet

You'll see token count, cost per call, and a readability score. More importantly, you'll see where the redundancy is hiding.

Open the dashboard to watch prompts in real time:

pf dashboard --watch ./src/

Port 3000 opens. Every time you call an LLM, you see it log: model, input tokens, output tokens, cost, latency. No guessing.

For production, wire up the SDK:

import { PromptFuel } from 'promptfuel/sdk';
import Anthropic from '@anthropic-ai/sdk';

const pf = new PromptFuel();
const client = pf.wrapClient(new Anthropic());

const response = await client.messages.create({
 model: 'claude-3-5-sonnet-20241022',
 max_tokens: 1024,
 messages: [{ role: 'user', content: 'your prompt' }]
});

// Automatically tracked. One line changes nothing
console.log(pf.getMetrics()); 
// { totalTokens: 342, totalCost: $0.008, calls: 1 }

Real numbers

Agent-Max before: ~1,847 tokens/week across all platforms.

Agent-Max after (trimmed + downgraded safe calls to Haiku): 1,094 tokens/week.

40% reduction. No quality loss. Three hours to audit and implement.

That's not a benchmark. That's a real app, real prompts, real data.

Stop guessing about your token spend. Measure what you're actually doing.

npm install -g promptfuel

https://promptfuel.vercel.app?utm_source=devto&utm_medium=social&utm_campaign=max

DEV Community: Nate Voss

Structured Prompts Cut Token Waste 35-40%. Here's Where It Actually Matters.

The Unstructured Baseline

The Structured Version

Why This Matters Less Than You Think

When Structure Actually Wins

Token Math (Real Numbers)

The Pattern Recognition Angle

The Practical Move

3 Things I Learned About Holding Context Through Long Debugging Sessions

Thing 1: Context is expensive, and repetition is the silent killer

Thing 2: Prompt caching changes how you structure the conversation

Thing 3: The pit: assuming the cache survives beyond the session

When Code Is Free, Intention Is the Craft

When Execution Stopped Being the Bottleneck

The Thing You Can't Automate

The Thing That Doesn't Work Anymore

5 rules I added to my CLAUDE.md after burning a full day on a Tiptap editor

1. Search before building. Always.

2. Verify the library can write, not just read.

3. Check that the package is actually maintained.

4. Prefer libraries with a stable auth mechanism.

5. When nothing exists, document what you searched.

How I Cut My API Bill in Half Without Understanding What I Was Doing

The Question You Can't Answer Without Implementing It

Why Caching Matters

The Code

The Gotcha That Catches Everyone

Where Caching Actually Wins

The Thing Nobody Talks About

The Leverage Shift: Why Infrastructure Cost Doesn't Matter Anymore

The numbers

The moat problem

Open source vs the API

The real cost: clarity

What this means for solo builders

Folks, need some feedback on this: https://dev.to/natevoss/i-wrote-a-rule-after-claude-got-is-x-built-wrong-4-times-looking-for-failure-modes-2f3i

I wrote a rule after Claude got "is X built?" wrong 4 times. Looking for failure modes.

I wrote a rule after Claude got "is X built?" wrong 4 times. Looking for failure modes.

Pre-Build Existence Audit Rule : looking for the failure modes I'm still missing

4 rules I added to my CLAUDE.md after a week of weird CLI bugs

1. Always pass an explicit locale to toLocaleString

2. Keep CLI output for AI-editor tools under five lines

3. npx <package> fails inside the package's own monorepo

4. Shared-library fixes need version bumps on both sides

Closing

Model Routing: 3 Things I Learned Sending Tasks to the Cheapest Model That Actually Works

3 Things I Learned

1. Your Intuition About Task Complexity Is Backwards

2. You Need Observability Before Routing Saves Anything

3. Routing Rules Should Be Dumb, Not Smart

How It Works

The Math

One Common Pitfall

When Routing Actually Works

The Real Win

3 Things I Learned Auditing My LLM App's Token Spend (And Why Your Benchmarks Are Lying)

1. Your bottleneck isn't the model you picked, it's the prompt you didn't trim

2. Your audit will surface the dumb stuff, not the obvious stuff

3. Once you see the numbers, the optimization loop becomes obvious

Let's walk through it

Real numbers

1. Always pass an explicit locale to `toLocaleString`

3. `npx <package>` fails inside the package's own monorepo