I got my monthly LLM bill last week and decided to actually audit it. I traced the biggest line item back to a system prompt I'd written months ago and never touched. The opening line read: "Please be sure to carefully read the following instructions and make absolutely certain that you follow each and every one of them precisely and without exception."
That's 26 words. It means: "Follow these instructions." That's 3 words.
I was paying for 23 words of politeness, repeated across millions of API calls, every single month. The model doesn't care that I said "please." It doesn't appreciate the "without exception." It just... reads the instructions and follows them, or doesn't, regardless of how eloquently I begged.
The Real Problem With How We Write Prompts
We write prompts like formal emails to a manager we're trying to impress. We hedge. We pad. We add ceremonial phrases that feel important but carry zero semantic weight. "Please take the following document and carefully read through all of its contents before proceeding to generate a thorough and comprehensive summary that covers all of the main points and key details contained within."
That sentence has 34 words. It means: "Summarize this document." 3 words.
The model processes every token. You pay for every token. And a shocking percentage of the average production system prompt is filler that a good editor would cut in 30 seconds.
So I built a tool to be that editor.
Install and Start Cutting
pip install token-diet
# offline, no API key needed
echo "Please take the following document..." | token-diet
# with Claude API for smarter compression
export ANTHROPIC_API_KEY=sk-ant-...
token-diet system_prompt.txt --level aggressive
# pipe-friendly for automation
token-diet prompt.txt --quiet | pbcopy
The --quiet flag suppresses the stats output and just prints the compressed text, which makes it trivially composable in shell pipelines. Compress, copy to clipboard, done.
What the Output Looks Like
$ token-diet system_prompt.txt --level balanced
Original : 847 tokens
Compressed: 234 tokens
Saved : 613 tokens (72.4%)
Est. cost : $0.0031 saved per call (claude-haiku pricing)
Techniques applied:
- Removed 12 politeness markers ("please", "kindly", "if you could")
- Collapsed 8 redundant qualifiers ("thorough and comprehensive" -> "thorough")
- Shortened 5 verbose constructions ("in order to" -> "to")
- Removed 3 obvious intent phrases ("I would like you to", "your task is to")
Compressed output:
Follow these instructions. Summarize the document covering main points...
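The cost line is straightforward arithmetic: tokens saved times the per-token input price times the number of calls. A minimal sketch, with the per-million-token price left as a placeholder parameter (check current pricing for whatever model you actually run):

```python
def cost_saved_per_call(tokens_saved: int, price_per_mtok: float) -> float:
    """Dollars saved per API call, given an input price in $ per million tokens."""
    return tokens_saved * price_per_mtok / 1_000_000

# 613 tokens saved, at a placeholder price of $5 per million input tokens:
per_call = cost_saved_per_call(613, 5.0)

# The per-call number looks negligible until you multiply by call volume:
monthly = per_call * 1_000_000  # at a million calls per month
```

The per-call savings is fractions of a cent; the point is that it compounds linearly with call volume.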
Run it on a prompt you've had in production for more than a month and brace yourself.
How It Actually Works
There are two modes depending on whether you have an API key set.
Rule-based mode (offline, no key needed): A set of regex patterns strips known filler phrases, collapses redundant modifier pairs, rewrites verbose preposition constructions, and removes meta-instructions the model ignores anyway. This is fast, deterministic, and free to run. It catches maybe 60-70% of the low-hanging fruit.
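A minimal sketch of what a rule-based pass like this can look like. The patterns below are illustrative examples, not token-diet's actual catalog:

```python
import re

# Illustrative filler patterns -- NOT token-diet's real phrase catalog.
FILLER_PATTERNS = [
    (re.compile(r"\b(?:please|kindly)\s+", re.IGNORECASE), ""),            # politeness markers
    (re.compile(r"\bin order to\b", re.IGNORECASE), "to"),                 # verbose construction
    (re.compile(r"\bthorough and comprehensive\b", re.IGNORECASE), "thorough"),  # redundant pair
    (re.compile(r"\b(?:I would like you to|your task is to)\s+", re.IGNORECASE), ""),  # intent phrase
]

def compress(text: str) -> str:
    """Apply each pattern in order, then collapse leftover whitespace."""
    for pattern, replacement in FILLER_PATTERNS:
        text = pattern.sub(replacement, text)
    return re.sub(r"\s{2,}", " ", text).strip()

compress("Please generate a thorough and comprehensive summary in order to help.")
# -> "generate a thorough summary to help."
```

Because it's just ordered find-and-replace, the output is deterministic: the same input always compresses the same way, which matters when the result ships in a production prompt.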
API mode (Claude Haiku): The compressed text from rule-based mode gets sent to claude-haiku with a meta-prompt instructing it to compress further while preserving all semantic content and technical constraints. Haiku is the cheapest model in the Claude lineup — the irony of using an LLM to compress LLM prompts is not lost on me, but the math works out. You spend a fraction of a cent to compress a prompt that saves you dollars per thousand calls in production.
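The second stage is conceptually simple: wrap the already-compressed text in a meta-prompt and send it to a cheap model. A sketch of the approach using the `anthropic` SDK — the meta-prompt wording and the model name here are my guesses at the idea, not token-diet's actual internals:

```python
def build_meta_prompt(text: str) -> str:
    """Wrap a prompt in compression instructions. Wording is illustrative,
    not token-diet's actual meta-prompt."""
    return (
        "Compress the following prompt to as few tokens as possible. "
        "Preserve every instruction, constraint, and technical detail. "
        "Return only the compressed prompt.\n\n" + text
    )

def api_compress(text: str) -> str:
    # Requires `pip install anthropic` and ANTHROPIC_API_KEY in the env.
    # The model id is an assumption; pick whatever cheap model you prefer.
    import anthropic
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-3-5-haiku-latest",
        max_tokens=1024,
        messages=[{"role": "user", "content": build_meta_prompt(text)}],
    )
    return response.content[0].text
```

The "return only the compressed prompt" instruction is the load-bearing part: without it, the model tends to wrap its answer in commentary you'd then have to strip.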
The --level flag controls aggression:
- gentle: removes politeness markers only, won't restructure sentences
- balanced: adds hedging-language removal and obvious redundancy collapse
- aggressive: full restructuring, maximally terse output
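One plausible way to wire levels like these is a simple map from level name to the set of compression passes enabled. The pass names below are hypothetical, just to show the shape:

```python
# Hypothetical pass names -- a sketch of how aggression levels could be wired.
LEVEL_PASSES = {
    "gentle": ["politeness"],
    "balanced": ["politeness", "hedging", "redundancy"],
    "aggressive": ["politeness", "hedging", "redundancy", "restructure"],
}

def passes_for(level: str) -> list[str]:
    """Return the pass pipeline for a level, failing loudly on typos."""
    try:
        return LEVEL_PASSES[level]
    except KeyError:
        raise ValueError(f"unknown level: {level!r}")
```

Each level being a strict superset of the one below it keeps the behavior predictable: turning up aggression never disables a cheaper pass.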
The --diff Flag
$ token-diet prompt.txt --diff
- Please be sure to carefully read the following instructions and make
- absolutely certain that you follow each and every one of them precisely.
+ Follow these instructions.
- I would like you to generate a thorough and comprehensive summary that
- covers all of the main points and key details contained within the document.
+ Summarize the document covering all main points.
The diff output exists because "trust but verify" applies to automated compression. For anything customer-facing, you want to review what changed before shipping.
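You can build the same kind of review view for any before/after pair with the standard library. A sketch using `difflib` — not necessarily how token-diet renders its diff:

```python
import difflib

def show_diff(original: str, compressed: str) -> str:
    """Render a unified diff between the original and compressed prompt text."""
    return "\n".join(
        difflib.unified_diff(
            original.splitlines(),
            compressed.splitlines(),
            fromfile="original",
            tofile="compressed",
            lineterm="",
        )
    )

print(show_diff(
    "Please be sure to carefully read the following instructions.",
    "Follow these instructions.",
))
```

Dropping this into a CI step that fails when the diff is non-empty is a cheap way to force a human review before a compressed prompt ships.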
Real Numbers on the Filler Phrases
The rule-based compressor ships with a catalog of known filler patterns. A few that came up most often when I ran it against my own prompt library:
- "please" / "kindly" / "if you could" — appeared in 89% of my prompts, average 3x each
- "thorough and comprehensive" — I apparently love this phrase, it was in 14 different files
- "in order to" instead of "to" — 47 occurrences across all prompts
- "I would like you to" / "your task is to" / "you will be responsible for" — 31 occurrences of telling the model what it's about to do before telling it what to do
- "make absolutely certain" — 8 occurrences, none more effective than "ensure"
These aren't edge cases. They're the default register most people slip into when writing instructions, and they add up fast at scale.
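Auditing your own prompt library for these is a few lines of code. A sketch that counts occurrences of a small, illustrative subset of the catalog across a directory of `.txt` prompt files:

```python
import re
from collections import Counter
from pathlib import Path

# Illustrative subset of a filler-phrase catalog.
PHRASES = [
    "please", "kindly", "in order to",
    "thorough and comprehensive", "make absolutely certain",
]

def count_filler(prompt_dir: str) -> Counter:
    """Count catalog-phrase occurrences across every .txt prompt in a directory."""
    counts = Counter()
    for path in Path(prompt_dir).glob("*.txt"):
        text = path.read_text().lower()
        for phrase in PHRASES:
            counts[phrase] += len(re.findall(re.escape(phrase), text))
    return counts
```

Run it once against whatever directory holds your prompts and sort the Counter; the top entries are usually embarrassing.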
What I Learned Building This
- Tokens are not characters. The 4-chars-per-token heuristic is close enough for estimation, but tiktoken gives precise counts if you need them. token-diet uses tiktoken when available and falls back gracefully.
- Rule-based NLP is underrated. Most of the compression happens before the API call ever fires. Regex-and-replace over a well-curated phrase list handles the bulk of it, and it's instant, offline, and free.
- Meta-prompting works. Using an LLM to rewrite prompts for other LLMs feels recursive to the point of absurdity, but the quality delta between rule-based-only and rule-based+Haiku is real and measurable.
- Pipe-friendliness matters. The single best design decision was the --quiet flag. Making it trivially composable with pbcopy, xclip, curl, and other tools is what makes it actually live in a workflow rather than just get installed and forgotten.
- Production prompts drift. The reason my system prompt had 26 words of throat-clearing is that it was written once, never reviewed, and accumulated edits from multiple people over six months. A compression pass is also an editing pass.
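That graceful fallback can be as small as a try/except around the import. A minimal sketch of the pattern (not token-diet's exact code):

```python
def count_tokens(text: str) -> int:
    """Exact count via tiktoken when installed; otherwise the
    ~4-characters-per-token heuristic as a rough estimate."""
    try:
        import tiktoken
        return len(tiktoken.get_encoding("cl100k_base").encode(text))
    except ImportError:
        return max(1, len(text) // 4)
```

The heuristic overshoots on code and undershoots on unusual Unicode, but for "did compression help?" comparisons on English prose it's within shouting distance of the real count.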
Go Put Your Prompts on a Diet
The tool is free, works offline by default, and installs in ten seconds.
pip install token-diet
Source, issues, and PRs: https://github.com/LakshmiSravyaVedantham/token-diet
If you find a filler phrase pattern I haven't caught yet, open an issue. The phrase catalog is the core of the rule-based engine and it should be a community-maintained list at this point.