DEV Community

Kuba Guzik

Posted on • Originally published at Medium

I Benchmarked the Viral "Caveman" Prompt to Save LLM Tokens. Then My 6-Line Version Beat It.

Last week, an open-source project called caveman promised to save 75% on LLM tokens by making AI talk like a caveman — dropping filler words, skipping pleasantries, keeping only technical substance. The repository collected 4,000 stars on GitHub in days. Developers shared it as a breakthrough in token efficiency.

The claim, it turns out, is both true and misleading. When I benchmarked the caveman prompt on real coding tasks across Claude Sonnet and Opus, the actual token savings landed between 14 and 21 percent — meaningful, but far from the headline figure. More surprising was what happened next: a six-line micro prompt I distilled from the original, just 85 tokens instead of 552, outperformed the full skill on both models.


What are LLM tokens and why do they matter?

Every time you use ChatGPT, Claude, or any AI tool, you're spending tokens. Tokens work like a taxi meter: every chunk of text the AI reads or writes — roughly three-quarters of a word per token in English — ticks it up.

More tokens = slower answers and higher bills.

So a trick that cuts 75% of tokens with no quality loss? That's just free money.

Here's what caveman actually does. Ask the AI a simple question — "why is my app slow?" — and compare:

Normal AI:
"The performance issues you're experiencing are likely caused by the fact that your application is making multiple redundant API calls on each page load. I would recommend implementing a caching layer to store frequently accessed data, which should significantly reduce the number of network requests and improve overall response times."

Caveman AI:
"Redundant API calls each page load. Add cache. Fewer requests = faster."

Same diagnosis. Same fix. One burned 50 tokens on politeness and filler. The other said it in 12.
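The gap is easy to sanity-check with a rough character-based estimate. This is a heuristic (an assumption of ~4 characters per token for English text), not a real tokenizer — actual counts vary by model:

```python
# Rough heuristic (assumption): ~4 characters per token for English text.
# Real tokenizers give exact, model-specific counts; this is only a ballpark.
def approx_tokens(text: str) -> int:
    return max(1, round(len(text) / 4))

verbose = ("The performance issues you're experiencing are likely caused by the "
           "fact that your application is making multiple redundant API calls on "
           "each page load. I would recommend implementing a caching layer.")
caveman = "Redundant API calls each page load. Add cache. Fewer requests = faster."

# The caveman answer costs a fraction of the verbose one.
savings = 1 - approx_tokens(caveman) / approx_tokens(verbose)
```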

But here's the catch — that's a chatbot answering a question. What happens when the AI is doing actual work?


Caveman token savings: benchmark results on real tasks

To test caveman token savings on real work, I designed two tasks that a developer might hand an AI coding tool on any given Tuesday: diagnose a production incident from server logs and config files, and extract exact timeout and retry settings from source code. Each task had a verifiable correct answer. The AI either returned the right facts, or it didn't.

I ran three groups across Claude Sonnet and Claude Opus, three repetitions each. The baseline prompt already said "Be concise" and demanded structured JSON output — no fluff allowed even without caveman. The second group used the full 552-token caveman skill from the viral repo. The third used my own distilled version: six lines, 85 tokens.
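The measurement itself is simple: run each condition several times and average the output-token counts. A minimal sketch of that loop, with a hypothetical `run_model` standing in for the real API call:

```python
from statistics import mean

# Hypothetical harness sketch: `run_model` stands in for a real API call
# that returns the output-token count for one run of one condition.
def average_output_tokens(run_model, condition, reps=3):
    """Average output-token count over `reps` repetitions of one condition."""
    return mean(run_model(condition) for _ in range(reps))

# Stubbed example: three repetitions of the baseline condition.
_counts = iter([260, 258, 259])
avg = average_output_tokens(lambda cond: next(_counts), "baseline")
```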

The results were consistent across both models:

Claude Sonnet (36 runs)

  • Baseline: 259 avg output tokens
  • Caveman full (552 tok injected): 225 avg output tokens — 13% reduction
  • Caveman micro (85 tok injected): 223 avg output tokens — 14% reduction

Claude Opus (36 runs)

  • Baseline: 227 avg output tokens
  • Caveman full (552 tok injected): 207 avg output tokens — 9% reduction
  • Caveman micro (85 tok injected): 180 avg output tokens — 21% reduction
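The percentages above follow directly from the average counts — worth reproducing, since the rounding is where headline numbers often drift:

```python
# Recompute the percentage reductions from the average output-token counts.
def reduction_pct(baseline, treated):
    """Percent fewer output tokens than baseline, rounded to a whole percent."""
    return round(100 * (baseline - treated) / baseline)

sonnet_full  = reduction_pct(259, 225)  # full 552-token skill on Sonnet
sonnet_micro = reduction_pct(259, 223)  # 85-token micro prompt on Sonnet
opus_full    = reduction_pct(227, 207)  # full skill on Opus
opus_micro   = reduction_pct(227, 180)  # micro prompt on Opus
```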

Quality never dropped. Across all 72 runs, both models returned 100 percent of the correct facts. Not a single data point was lost to compression.

But 14 to 21 percent is not 75 percent. That discrepancy is the most important part of this story.


Why 75% token savings don't happen in real workflows

Picture this. You hire a writer. You pay by the word.

You say "write me a report." They deliver ten pages. Introduction, pleasantries, conclusion, three appendices. You hand it back: "rewrite this like a caveman." They cut it to three pages. You saved 70%.

Now picture a different version. You say "three pages, bullet points only, no intro." They deliver three tight pages. You hand back the same caveman instructions. They trim it to two and a half. You saved 17%.

That's the entire difference.

The original caveman benchmark used a baseline of "You are a helpful assistant." No instructions to be brief. The AI wrote essays. Caveman mode slashed them. Big gap.

Our baseline already said "Be concise. Return JSON." The AI was already tight. Less fluff to cut. The savings landed at 14–21%.

There's a second factor. The original benchmark tested explanation-heavy prompts: "explain React re-renders," "microservices vs monolith," "set up a PostgreSQL connection pool." Those prompts generate paragraphs of prose — exactly what caveman mode is built to compress. Our benchmark tested structured extraction: diagnose an incident from logs, pull timeout values from source code. The output is dense by nature. Less prose, less to cut.

The 75% number isn't wrong. It's just measuring something most people aren't doing.


The 6-line caveman micro prompt that beat the original

The full caveman skill is 552 tokens of rules, examples, and edge cases. I distilled it into a 6-line micro prompt:

```
Respond like smart caveman. Cut all filler, keep technical substance.

Drop articles (a, an, the), filler (just, really, basically, actually).
Drop pleasantries (sure, certainly, happy to).
No hedging. Fragments fine. Short synonyms.
Technical terms stay exact. Code blocks unchanged.
Pattern: [thing] [action] [reason]. [next step].
```

85 tokens. That's it.

On Opus, the micro version saved 21%. The full skill? Only 9%.

Turns out the model already knows how to be brief. It doesn't need a 552-token tutorial. It needs six lines of permission.

Six lines. Same result. One-sixth the injection cost.


How to reduce LLM token costs with caveman prompting

If you use AI to chat, learn, and brainstorm — caveman saves 40–65% of output tokens. The AI normally writes you essays. Caveman kills the essays. Big win.

If you use AI for structured work — code, data extraction, JSON outputs — expect 14–21%. Your prompts are already doing the heavy lifting. Caveman adds a nice bonus on top.

If you're running AI at scale through an API -- 14% across millions of calls is real money. And since quality stays at 100%, the cost of trying is zero.

Copy the six lines above into your system prompt. That's all you need. The prompt, benchmark code, and raw data are open source at kuba-guzik/caveman-micro.
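A minimal way to wire it in — a sketch only; how you pass the combined string to a model depends on your SDK, but any chat API with a system-prompt field works the same way:

```python
# Sketch: append the caveman-micro lines to your existing system prompt.
CAVEMAN_MICRO = """Respond like smart caveman. Cut all filler, keep technical substance.
Drop articles (a, an, the), filler (just, really, basically, actually).
Drop pleasantries (sure, certainly, happy to).
No hedging. Fragments fine. Short synonyms.
Technical terms stay exact. Code blocks unchanged.
Pattern: [thing] [action] [reason]. [next step]."""

def with_caveman(system_prompt: str) -> str:
    """Combine existing system instructions with the micro prompt."""
    return f"{system_prompt}\n\n{CAVEMAN_MICRO}"

system = with_caveman("Be concise. Return JSON.")
```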


The real source of token savings in LLM prompts

The caveman skill works. But nobody in that viral thread mentioned this:

The biggest token savings don't come from how the AI talks. They come from how you ask.

"Be concise. Return JSON." Those four words in your base prompt already handle 60% of the savings. Caveman picks up another 14–21% on top.

Most people tweak the AI's output. The real win is in how you write the input.

Start there. Then add the six lines. Collect your free savings.


Tested on Claude Sonnet and Opus via Claude Code CLI. 36 runs per model, 3 reps per condition. Quality verified by automated fact-checking against known correct values. Benchmark based on the methodology described in Brevity Constraints Reverse Performance Hierarchies in Language Models (March 2026). Full benchmark code and raw data: github.com/kuba-guzik/caveman-micro.
