v. Splicer

Posted on May 18

How I Cut My Claude Code Token Usage by 60% and Got Better Output

#ai #programming #productivity #agents

The first sign something was wrong was the invoice.

Not catastrophically wrong. More like “this escalated quietly while I was having fun” wrong. I’d been using Claude Code the way most people start: throw everything at it, stay in the session, keep talking. It felt productive. It was productive. And it was also quietly burning through tokens at a rate I hadn’t tracked because tracking it felt like it would interrupt the flow.

The flow, it turned out, was expensive.

The Context Window Is Not Your Notebook

The first thing I fixed was the worst habit I’d developed without noticing: treating a session like a shared workspace that persisted naturally.

You start a session in the morning. You push through a bug. You take a break, come back, keep going. By afternoon the context window is carrying around dead weight from every failed attempt, every speculative approach, every Docker error you resolved hours ago. The model still has all of it. It’s orienting against it constantly. You’re paying for that.

I started resetting sessions aggressively. Not because Claude “forgets” things in some damaging way. Because I wanted selective continuity. Before any substantive session, I write five lines:

objective: implement JWT refresh logic relevant files: src/auth/session.ts, src/middleware/verify.ts constraints: no new libraries, preserve existing error handling what already failed: tried storing refresh tokens in memory — didn't survive restarts expected output: working refresh endpoint with test coverage

That’s the whole handoff. Nothing else goes in until I need it.

The output got sharper immediately. The model stopped hedging around context it half-remembered and started working against something clean. Entropy in the context produces entropy in the response. That relationship is consistent.

Vague Prompts Are Token Multipliers

The second expensive habit was question prompts. Casual ones, specifically.

“What do you think about this approach?” “Any suggestions here?” “How would you handle this?”

These feel conversational. They are also open invitations for the model to generate possibility space. And it will. It will generate options, tradeoffs, philosophy, alternatives, adjacent considerations, and things you will never use. All of it billed per token.

The fix was treating prompts more like instructions on industrial equipment. Specific verb, specific scope, specific constraint.

Before

"what do you think about the auth flow?"

After
"Identify the security issue in the token validation logic. Return the vulnerable line and explain why it's exploitable. Under 150 words."

The model responds to scope constraints. Tell it the output size. Tell it the output format. Tell it exactly which part of the problem you’re in. Ambiguity isn’t a kindness to the model — it’s an invitation to talk until it finds something you respond to.

Mode Separation Is Not Optional

There’s a particular kind of session that reliably costs me the most tokens per hour of useful output: mixed-mode sessions.

Planning while implementing. Debugging while also redesigning. Asking “while we’re here, should we rethink the whole structure?” during what should have been a fifteen-minute fix.

The model carries all of that. Every architectural tangent you take becomes part of the working context it reasons against. You can almost feel the session getting heavier as it accumulates undecided decisions.

I now treat modes like physical zones in a workshop. Planning sessions are their own thing. They produce a structured document. That document then seeds implementation sessions …selectively, just the relevant sections. Debugging sessions start with a structured incident format:

expected: webhook processes in under 500ms actual: times out after 30s on payloads over 10KB reproduction: any request with body > 10KB to /api/webhooks recent changes: added payload validation in middleware, PR #47 logs: [paste relevant lines only] suspected scope: likely the base64 encoding step in validatePayload()
Not narrative. Not conversational. Clean surface.

Negative Prompts Actually Work

People who generate images know this instinctively. Almost nobody applies it to coding agents, which is strange because it’s just as effective.

I started adding explicit exclusion instructions to almost every substantive prompt.

do not redesign the architecture
do not explain basics
do not add dependencies
do not touch code outside the function I specified
do not rewrite working tests

Without constraints, the model defaults toward maximal helpfulness. It interprets your request charitably and expansively. A small fix becomes a refactor. A targeted change becomes a suggestion for systemic improvement. The response is not wrong, exactly. It’s just bigger than you needed, touching things you didn’t ask it to touch.

Fencing the scope with negative constraints brings generation size down and collateral damage to zero.

Stop Pasting Entire Files

This one hurt to admit because I did it constantly for the first few months.

Large files go in as context. The model orients against the entire file even when the problem lives in twenty lines. It spends output describing sections you can already read. It generates explanations for code that doesn’t need explanation. You’re paying for that orientation work on every prompt.

Now I isolate ruthlessly. The relevant function. The relevant interface. The error message and the five lines around where it occurs. Nothing else unless the problem genuinely requires more context, and most problems don’t.

There’s a mechanical feeling to doing this correctly. Like setting up a machine to do one specific job instead of asking it to improvise across a full workshop floor.

Answer Budgets Are Real

This sounds pedantic until you watch what happens when you enforce them.

“Answer in five bullets maximum.” “Use under 200 tokens.” “One paragraph only.” “Return the patch only, no explanation.”

Quality frequently improves. Not because the model gets smarter under constraint, but because constraint forces selection. The model identifies the most important information instead of generating everything plausible and letting you filter.

Engineers who write good issue tracker comments already know this. The person who writes three sentences and identifies the problem exactly is more useful than the person who writes four paragraphs of observed behavior. Constraints sharpen communication in humans and in models.

The model defaults toward abundance because continuation probability rewards continuation. You have to interrupt that instinct intentionally.

Reusable Prompt Fragments Are Not Optional Complexity

I resisted systematizing my prompts for a long time because it felt like overhead. It wasn’t. It was already overhead…I was just paying it every session by retyping the same constraints.

The instructions I repeated constantly:

preserve all existing comments
minimal diff only — do not touch working code
TypeScript strict mode, no any
explain your reasoning before writing code
no new dependencies
mobile-first for any UI changes

I keep these in a snippets file. Each session loads the relevant subset. Prompt consistency reduces output variance. Tiny wording differences across sessions produce surprisingly different model behaviors over time — consistent inputs produce consistent, auditable outputs.

Know When Not to Use Claude

This was the hardest thing to accept and the most valuable.

Sometimes the fastest path is opening the file and fixing it yourself.

AI tools create a distortion where every task starts feeling like an AI task. But localized fixes with no ambiguity, problems you already understand completely, changes where verification overhead is higher than doing the work directly — these often cost more in prompting and review than they would cost in direct action.

There’s a pattern emerging in coworking spaces and Discord servers where people spend more cognitive energy orchestrating AI than solving the problem. Staring at generated diffs with the same expression someone brings to a slot machine resolving its animations. Waiting for the answer to materialize rather than already knowing the answer and writing it.

Not every problem is better solved through an assistant.

The Real Variable Is Attention

All the specific techniques above are downstream of one change: I stopped treating token cost as abstract.

Once cost is visible, waste is visible. You see the giant prompts. The repetitive context-loading. The speculative rewrites on code that was already fine. The conversational drift that burned two hours and produced nothing actionable.

The optimizations also improved the thinking. Forced compression forces clarity. Good engineering is compression — extracting signal, removing noise, making decisions and recording them instead of relitigating them. Systems that accumulate uncontrolled context decay, whether the system is a codebase, a team, or a conversation.

Most people are still in the abundance phase with AI coding tools. Infinite context, infinite patience, no friction awareness. That changes when the invoice arrives, or when the outputs start degrading because no one modeled the cost of entropy.

The developers who adapt earliest tend to build calmer systems. Calmer systems are easier to debug, cheaper to run, and faster to iterate on.

That’s not a coincidence.

If you want a structured version of these workflows with templates, prompt patterns, and session setups, I put together a field guide: 21 Claude Code productivity patterns, usage you can actually measure.

Claude Code for Developers: 21 Productivity Tricks That Save Hours

[claude.md] Masterclass: Build Your Own Persistent AI Brain

Top comments (11)

Mykola Kondratiuk • May 25

60% reduction is solid but the better output claim needs scrutiny. shorter context sometimes just means fewer wrong turns, not smarter reasoning - could be trading capability for predictability

Genglin Zheng • May 22

Based on your description, I've come up with a technique: could we combine claude code and chat? For code-related communication, we could use claude code for concise and direct interaction, while for communication requiring reasoning or thought, we could use chat. The goal is to ensure that the chat doesn't consume our tokens.

Theo Valmis • May 20

"Entropy in the context produces entropy in the response" is the most useful framing I've seen for why session hygiene matters. The model isn't forgetting — it's attending to everything it's been given, including the dead ends and resolved errors. Every failed attempt in the context is material the model is reasoning against when generating the next response.

The five-line handoff pattern is effective for exactly the reason you describe, but it's also doing something for you: it forces you to articulate clearly what state the project is actually in before re-engaging. That clarity benefits your own thinking as much as the model's. Vague prompts are expensive not just because they generate verbose responses — they're expensive because you can't evaluate whether the output was correct if you weren't clear about what correct meant when you started.

Harjot Singh • May 31

"Cut tokens 60% AND got better output" is the counterintuitive truth that more people need to hear: less context often means better results, not just cheaper ones. A bloated context dilutes the model's attention - it has to wade through irrelevant files and stale history to find what matters, and that noise degrades the answer. Trimming to exactly what's needed is simultaneously a cost win and a quality win. They're not a tradeoff; they're the same lever.

This is the single most important thing I've learned building agent systems, and it surprises people every time: context discipline isn't a cost sacrifice, it's a quality upgrade. Feeding the model less of the right stuff beats feeding it more of everything. It's why Moonshift (a multi-agent pipeline that ships a prompt to a deployed SaaS) gives each agent a tight, purpose-scoped context - it's how a build stays ~$3 flat AND produces better output than one bloated mega-context agent would. Excellent post, and the "better output" half is the part most cost articles miss. What was the change that gave you both wins at once - was it context scoping, or something about how you structured the prompts?

François Kiene • Jun 21

The "stop pasting entire files" and "answer budgets" bits are the ones I'd underline. Both are real, and both are also the parts you can stop doing by hand.

Disclosure: I build a compressor for this (llmtrim, open source: github.com/fkiene/llmtrim), so weight me accordingly. After a lot of measuring, the split I keep landing on is that the discipline you describe is the actual lever, but half of it is mechanical and half is judgment.

The mechanical half automates fine. Trimming a pasted file down to the relevant functions, stripping a 200-line command output to the errors before it hits the model, that kind of thing. A proxy can do it on every request so you never have to remember. One catch though, the tools bragging about 80-90% usually get there by deleting content, and it's easy to show a huge number while quietly tanking the answer. Whatever you use, check the answer still came out right, not just the token count.

The judgment half is the better part of your post, and nothing automates it. Mode separation, the five-line handoff, and especially "know when not to use Claude." No proxy saves you from staring at a generated diff like a slot machine when you already knew the answer. That's the real variable.

And measure before you optimize any of it. You can't see the giant prompts or the speculative rewrites until the cost is in front of you, and most of the fix is just that visibility.

View full discussion (11 comments)

Some comments may only be visible to logged-in visitors. Sign in to view all comments.