A couple of weeks ago I was hammering through millions of tokens daily, hitting quotas and rate limits left and right, forcing me to switch providers and juggle subscriptions. Then I found Caveman.
Most foundation models aim to be helpful assistants, mimicking friendly support staff stuffed with pleasantries. All those "This is a brilliant idea!" and "According to my research using the internet with the playwright web browser, I did find more information regarding that topic bla bla bla..." fillers bloat the available context space.
As someone running a fleet of AI coding agents daily, what I really want from my agents is efficient communication: highlight noteworthy information, omit irrelevant, unimportant, or redundant text.
Every page of information sent back and forth between me, my agents, and the LLMs costs tokens and pollutes context space. That's where Caveman comes in.
What Caveman Does
Caveman is a SKILL.md-based plugin that hooks into your agent system and teaches it to strip away every text fragment that isn't strictly needed to convey the same semantic meaning. It compresses text the way you'd compress a lossless .bmp into a much smaller .webp: technically not the same pixels, but for all practical purposes the same image.
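The core idea can be sketched in a few lines. This is not Caveman's actual implementation — just a toy illustration of lossy, meaning-preserving compression by stripping filler phrases (the `FILLERS` list and `compress` function are hypothetical):

```python
import re

# Hypothetical filler phrases an agent might emit; the real rules are far richer.
FILLERS = [
    r"\bThis is a brilliant idea!?\s*",
    r"\bAccording to my research[^.]*\.\s*",
]

def compress(text: str) -> str:
    """Strip fragments that carry no semantic payload, then tidy whitespace."""
    for pattern in FILLERS:
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    # Collapse the whitespace left behind by the deletions.
    return re.sub(r"\s{2,}", " ", text).strip()
```

The output reads terser but says the same thing, which is exactly the .bmp-to-.webp trade described above.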
Results: 50-75% Token Reduction
Before Caveman, I exhausted my Anthropic weekly quota in 2-3 days. After two weeks of daily use, I haven't hit it once. That's easily a 50%+ reduction in total token usage, maybe closer to the postulated 75%.
The side effects are all positive:
- Better latency: less context for the model to process means faster responses
- Better throughput: more VRAM available for KV-cache when the context is smaller
- Cleaner reasoning: the model spends fewer tokens on preamble and more on the actual problem
Live Demo
The full article includes an interactive compression demo where you can paste any text and see it compressed at different levels:
- Lite: light cleanup, mostly natural language
- Full: significant compression, still readable
- Ultra: aggressive, looks like gibberish but models understand it perfectly
- Classical Chinese modes: encodes English concepts as single CJK characters
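To give a feel for how the levels escalate, here is a toy compressor with assumed behavior per level — `caveman`, its `STOPWORDS` set, and the vowel-dropping trick are illustrative guesses, not the demo's real algorithm:

```python
import re

# Small illustrative stopword set; a real compressor would use a larger one.
STOPWORDS = {"the", "a", "an", "is", "are", "that", "of", "to", "in"}

def caveman(text: str, level: str = "lite") -> str:
    # Lite: collapse whitespace, keep natural language intact.
    text = re.sub(r"\s+", " ", text).strip()
    if level == "lite":
        return text
    # Full: drop common stopwords; noticeably shorter, still readable.
    words = [w for w in text.split() if w.lower() not in STOPWORDS]
    if level == "full":
        return " ".join(words)
    # Ultra: also drop vowels from longer words; gibberish to humans,
    # but models can usually still recover the meaning.
    ultra = [w if len(w) <= 3 else w[0] + re.sub(r"[aeiou]", "", w[1:])
             for w in words]
    return " ".join(ultra)
```

Even this crude sketch shows why the ultra output looks broken to a human reader while remaining decodable.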
I was skeptical that "gibberish-looking" compressed text would produce the same quality output from models. It does. I haven't felt a drop in reasoning quality whatsoever.
Read the full article with the live compression demo at hardcore.engineer