Sourabh Mourya

Posted on May 28

You Don’t Need More Tokens, You Need Better Thinking

#ai #webdev #productivity #automation

I got my first Anthropic bill and genuinely thought it was a mistake.

It wasn't. I was just being wasteful without realizing it.

After a week of obsessing over token usage, I cut my costs by nearly 60% with better results. Here's what I learned.

What even is a token?

A token isn't a word. It's closer to a word-chunk.

As a rough rule: 1 token ≈ 4 characters in English. So "unbelievable" is about 3 tokens. A full paragraph might be 80–120 tokens.

Every API call counts both directions:

Input tokens — everything you send (your prompt + system prompt + conversation history)
Output tokens — everything the model sends back

Output tokens cost more. Always. Keep that in mind.

Where most people bleed tokens without knowing

When I audited my prompts, I found the same mistakes every time.

Bloated system prompts. I had one that was 800 words of "be helpful, be concise, be professional..." repeated five different ways. That runs on every single call.

Dumping full context when partial context works. I was sending entire documents and saying "answer this question about section 3." The model reads all of it. Every time.

Asking for long outputs when short ones do the job. "Explain this concept" gets you 400 words. "Explain this in 2 sentences" gets you 40 words and usually a cleaner answer.

Conversation history that never gets trimmed. In multi-turn chats, every previous message gets re-sent. A 20-turn conversation can be 3,000 tokens before you've even typed your next message.

What actually works to cut costs

1. Be specific about output length.
Add "in 1 paragraph" or "in under 100 words" to your prompt. Models respect this and it forces tighter answers.

2. Chunk your context.
Instead of sending a full document, extract and send only the relevant section. A focused 200-token excerpt beats a 2,000-token dump every time.

3. Shrink your system prompt ruthlessly.
Cut anything that's vague or repeated. "Be concise" said once beats "please ensure your responses are as concise as possible and avoid unnecessary verbosity" said once. Less is genuinely more.

4. Use a cheaper model for simple tasks.
Not every call needs your most powerful model. Routing classification, summarization, or formatting tasks to a smaller model can cut costs 5–10x with zero quality loss.

5. Cache repeated context.
If you're building something and sending the same instructions or documents repeatedly, look into prompt caching. Anthropic, OpenAI, and others support it — cached tokens cost a fraction of fresh ones.

6. Ask for structured output.
"Return JSON with keys: summary, action, confidence" is shorter to process and parse than "please write a detailed explanation followed by the recommended action and your confidence level."

The mindset shift that helped most

I stopped thinking "how do I write better prompts" and started thinking "what's the minimum context this model actually needs to answer correctly?"

That question changed everything.

The model doesn't need your life story. It needs the right facts, a clear goal, and a defined output format. Everything else is noise you're paying for.

Now I want to hear from you:

Have you ever actually looked at your token usage per call, or do you just watch the bill?
What's the dumbest token waste you've caught in your own prompts?
Have you found a trick that cut costs without hurting quality?
Are you using prompt caching yet — and did it actually help? Drop your answer below. Even "I had no idea tokens worked this way" counts — we've all been there.

Top comments (1)

AudioProducer.ai • May 28

The minimum-context lens fits publishing agents the same way. AudioProducer.ai runs an hourly marketing worker that re-reads its workflow file at the start of every run rather than carrying state across runs, which sounds expensive but the file IS the cache: codified rules instead of "be concise" repeated five ways in a system prompt, structured task files instead of free-form briefs, one publish-event per run as the bounded unit.

Structured output lands hard on the audio side too. Each chapter ships with a per-line speaker map, per-paragraph soundscape annotation, per-character voice card as fixed-slot JSON the writer can re-tag in place. That schema rigidity is what keeps re-render input flat (no re-explaining the format each pass) AND collapses LLM failure modes to "schema-valid but semantically wrong," which is a much cheaper category to review than free-form text drift.

Output-length discipline matches per-publish task: one comment, one blog post, one tweet, bounded by platform norms not model preference. The "what's the minimum the model needs to answer correctly" question is the same question creative AI tools should be asking when designing the artifact the model produces.