A couple of weeks ago I was hammering through millions of tokens daily, hitting quotas and rate limits left and right, forcing me to switch providers and juggle subscriptions. Then I found Caveman.
Most foundation models aim to be helpful assistants, mimicking friendly support staff stuffed with pleasantries. All those "This is a brilliant idea!" and "According to my research using the internet with the playwright web browser, I did find more information regarding that topic bla bla bla..." fillers bloat the available context space.
As someone running a fleet of AI coding agents daily, what I really want from my agents is efficient communication: highlight noteworthy information, omit irrelevant, unimportant, or redundant text.
Every page of information sent back and forth between me, my agents, and the LLMs costs tokens and pollutes context space. That's where Caveman comes in.
What Caveman Does
Caveman is a SKILL.md-based plugin that hooks into your agent system and teaches it to strip away every text fragment that isn't strictly needed to convey the same semantic meaning. It compresses text the way you'd compress a lossless .bmp into a much smaller .webp: technically not the same pixels, but for all practical purposes the same image.
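The core idea can be sketched in a few lines. This is not Caveman's actual implementation — just a toy illustration of lossy, meaning-preserving compression by stripping filler phrases (the `FILLERS` list and `compress` function are hypothetical):

```python
import re

# Hypothetical filler phrases an agent might emit; the real rules are far richer.
FILLERS = [
    r"\bThis is a brilliant idea!?\s*",
    r"\bAccording to my research[^.]*\.\s*",
]

def compress(text: str) -> str:
    """Strip fragments that carry no semantic payload, then tidy whitespace."""
    for pattern in FILLERS:
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    # Collapse the whitespace left behind by the deletions.
    return re.sub(r"\s{2,}", " ", text).strip()
```

The output reads terser but says the same thing, which is exactly the .bmp-to-.webp trade described above.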
Results: 50-75% Token Reduction
Before Caveman, I exhausted my Anthropic weekly quota in 2-3 days. After two weeks of daily use, I haven't hit it once. That's easily a 50%+ reduction in total token usage, maybe closer to the postulated 75%.
The side effects are all positive:
- Better latency: less context for the model to process means faster responses
- Better throughput: more VRAM available for KV-cache when the context is smaller
- Cleaner reasoning: the model spends fewer tokens on preamble and more on the actual problem
Live Demo
The full article includes an interactive compression demo where you can paste any text and see it compressed at different levels:
- Lite: light cleanup, mostly natural language
- Full: significant compression, still readable
- Ultra: aggressive, looks like gibberish but models understand it perfectly
- Classical Chinese modes: encodes English concepts as single CJK characters
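To give a feel for how the levels escalate, here is a toy compressor with assumed behavior per level — `caveman`, its `STOPWORDS` set, and the vowel-dropping trick are illustrative guesses, not the demo's real algorithm:

```python
import re

# Small illustrative stopword set; a real compressor would use a larger one.
STOPWORDS = {"the", "a", "an", "is", "are", "that", "of", "to", "in"}

def caveman(text: str, level: str = "lite") -> str:
    # Lite: collapse whitespace, keep natural language intact.
    text = re.sub(r"\s+", " ", text).strip()
    if level == "lite":
        return text
    # Full: drop common stopwords; noticeably shorter, still readable.
    words = [w for w in text.split() if w.lower() not in STOPWORDS]
    if level == "full":
        return " ".join(words)
    # Ultra: also drop vowels from longer words; gibberish to humans,
    # but models can usually still recover the meaning.
    ultra = [w if len(w) <= 3 else w[0] + re.sub(r"[aeiou]", "", w[1:])
             for w in words]
    return " ".join(ultra)
```

Even this crude sketch shows why the ultra output looks broken to a human reader while remaining decodable.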
I was skeptical that "gibberish-looking" compressed text would produce the same quality output from models. It does. I haven't felt a drop in reasoning quality whatsoever.
Read the full article with the live compression demo at hardcore.engineer