DEV Community

Defluffer - reduce token usage πŸ“‰ by 45% using this one simple trick! [Earthday challenge]

GrahamTheDev on April 18, 2026

This is a submission for Weekend Challenge: Earth Day Edition. Fluffer: someone who helps people "get ready for work" in the adult film industry. De...

Ethan Frost

Ok this is genuinely funny but also makes a real point. The irony of using an LLM to save tokens sent to an LLM is something I've been thinking about too - it's like driving to the gym to use the treadmill.

The 45% reduction claim is interesting though. I wonder how much of that holds up with more technical prompts vs conversational ones? My guess is code-heavy prompts would compress less since there's less fluff to begin with.

Also the 60 GW/year math is hilarious in the best way. Even if the real number is 1/10th of that it's still a wild amount of wasted compute on "please" and "I would really appreciate if you could".


Iurii Didkovskyi

It's a real architectural point, for sure. This is how many MoE LLMs work: they may actually have 2 LLMs, a small one of around 1B params and a big one of, for example, 100-200B params. Each request goes to the small LLM first, and if the answer is OK it's just returned to the user. In practice that gives roughly a 45% token saving! Just google LLM architectures to confirm.


GrahamTheDev

Not seen a MoE architecture like that; most use the "activate just a few experts" approach for each token, which is why they are tricky to work with sometimes on mixed contexts. Is this something else, or a specific type of MoE you are talking about, so I can go look it up? πŸ’—


GrahamTheDev

Yeah its definitely a joke way of presenting a serious thing to consider. 🀣

Like you said, just stripping "please" would save a few megawatt-hours of electricity! I should have done a "remove the pleasantries" bot! haha.

And yeah, I think with a coding agent there would be far smaller results using this particular technique, maybe a 10% reduction (if we could apply it to its chain of thought). At that point the technique would be different: we'd try to reduce how much ends up in the main agent's context window.

Glad you enjoyed the submission! πŸ’—


Victor Okefie

The irony is the point. You're using an LLM to write code that reduces token usage, but the LLM that writes the longest response is the one you needed. That's not a failure of the tool, it's a signal that token efficiency and helpfulness are sometimes in tension. Defluffer works because it's rule-based, not model-based. No irony there. Saving tokens by spending tokens would be recursive waste. You avoided that. That's the actual insight. Most people would have built an LLM to shorten prompts. You built a dictionary. Simpler, cheaper, faster, and it doesn't need to be prompted not to hallucinate.


GrahamTheDev

Exactly, especially on the token efficiency vs getting the job done part!

As for a simple solution - well, I am a simple soul, so I do simple things! hahaha


Narnaiezzsshaa Truong

Nice.

LLM systems waste enormous amounts of compute on unnecessary tokens.

And because context is reloaded every turn, waste compounds.

The principles in this post map cleanly to real engineering patterns:

  • Whitespace normalization β†’ trivial, safe
  • Phrase collapsing β†’ controlled compression
  • Fluff removal β†’ domain‑specific stopwording
  • Synonymization / stemming β†’ semantic compression
  • Code‑block protection β†’ structural preservation
  • Logic symbolization β†’ compact representation of boolean intent

These are all valid pre‑processing or intermediate‑representation techniques.

I'll see if I can operationalize them with the right architecture, without breaking meaning, safety, or reliability - perhaps this weekend.
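A minimal sketch of what a few of those bullet points could look like as plain rule-based pre-processing. All the phrase tables and regexes below are invented for this comment, not taken from the Defluffer codepen:

```javascript
// Illustrative rules only - invented for this sketch.
const PHRASE_COLLAPSE = {
  "in order to": "to",
  "at this point in time": "now",
  "due to the fact that": "because",
};

const FLUFF = [
  /\bplease\b/gi,
  /\bI would really appreciate (it )?if you could\b/gi,
];

// Whitespace normalization: trivial, safe.
function normalizeWhitespace(text) {
  return text.replace(/[ \t]+/g, " ").replace(/\n{3,}/g, "\n\n").trim();
}

// Phrase collapsing: controlled compression via a lookup table.
function collapsePhrases(text) {
  for (const [long, short] of Object.entries(PHRASE_COLLAPSE)) {
    text = text.replace(new RegExp(long, "gi"), short);
  }
  return text;
}

// Fluff removal: domain-specific stopwording.
function removeFluff(text) {
  for (const re of FLUFF) text = text.replace(re, "");
  return text;
}

function defluff(text) {
  return normalizeWhitespace(removeFluff(collapsePhrases(text)));
}
```

For example, `defluff("Please do X in order to achieve Y.")` comes back as `"do X to achieve Y."` - the whole pipeline is deterministic string surgery, no model in the loop.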


GrahamTheDev

Please do give it a go and tag me in a comment / the article (if you write one) as to how you made it more "production ready", would love to see it! πŸ’—


Vladimir L

Creating a tool (using an LLM) to "reduce" the fluff factor of prompts using another LLM in order to "save" resources... but then writing an article (probably with some help of an LLM) of 300+ words... full of greenwashing-like claims... where only one sentence really matters... absolutely 10/10


GrahamTheDev

I like the tight circle it creates: problem -> use the problem to solve the problem (but not really) -> claim you are saving the world while actually building a toy! hahaha


Mykola Kondratiuk

this is the kind of thing that actually matters when you are running many agents in parallel - token bloat compounds fast. curious if you measured savings on structured prompts vs freeform, those seem to behave differently


GrahamTheDev

No, it was a silly "this is a technique you should consider" article. Defo not an actual production-ready thing, so I just did minimal testing!

Basically: take the concept, defo don't take the code! hahaha


Mykola Kondratiuk

Honest disclaimer probably saves more headaches than the article itself. Token bloat across parallel agents is one of those costs that only becomes real at scale anyway β€” concept-level awareness is exactly what most teams need first.


mote

45% token reduction is substantial β€” that's meaningful at scale, especially for long-running agentic workflows where every token has a real cost.

Curious about what the "fluff" actually consists of in your pipeline. Is it mostly semantic redundancy (repeated framing, re-explaining what was just said), formatting artifacts from previous turns, or something else? The answer changes whether this is a preprocessing step versus something that needs to happen mid-generation.

Also β€” did you measure any quality regression? The risk with aggressive deduplication is losing subtle but important distinctions in the remaining tokens.


GrahamTheDev

If you look at the codepen, it is purely phrase -> replacement. Most are removals, some are abbreviations, etc.

Bear in mind, this is a toy project, to prove a point on token management, not a production tool!
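A toy rule table in the same phrase -> replacement spirit (these entries are made up, not the codepen's actual list; matching is exact-string and case-sensitive purely to keep the sketch short):

```javascript
// Invented rules for illustration - not the codepen's actual table.
const RULES = [
  ["I would really appreciate it if you could", ""], // remove
  ["it is important to note that", ""],              // remove
  ["as soon as possible", "ASAP"],                   // abbreviate
  ["for example", "e.g."],                           // abbreviate
];

function applyRules(prompt) {
  let out = prompt;
  for (const [phrase, replacement] of RULES) {
    out = out.split(phrase).join(replacement);
  }
  // Tidy the gaps the removals leave behind.
  return out.replace(/\s{2,}/g, " ").trim();
}
```

So `applyRules("I would really appreciate it if you could reply as soon as possible.")` becomes `"reply ASAP."` - a dictionary pass, cheap and deterministic.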


Devon Torres

haha the naming convention alone is worth the click

real talk though -- the point about context window multiplication is underrated. 45% reduction on the initial prompt compounds into massive savings across a full conversation. combine that with picking the right model for each task and your api bill drops dramatically
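The compounding is easy to put rough numbers on. A toy cost model with made-up token counts, assuming the full conversation history is re-sent as input each turn:

```javascript
// Toy model: every turn re-sends the whole history as input, so a
// smaller prompt is "paid for" (i.e. saved) again on each later turn.
function cumulativeTokens(promptTokens, replyTokens, turns) {
  let total = 0;   // input tokens billed across the conversation
  let context = 0; // running history size
  for (let t = 0; t < turns; t++) {
    context += promptTokens; // user message joins the history
    total += context;        // full history sent as this turn's input
    context += replyTokens;  // model reply joins the history
  }
  return total;
}
```

With 10 turns, 200-token prompts, and 300-token replies this bills 24,500 input tokens; trimming prompts by 45% to 110 tokens drops that to 19,550. The per-prompt saving recurs every turn, though the untouched replies dilute the overall percentage.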

nice submission 🌱


GrahamTheDev

Haha yeah I had a bit of fun with the naming 🀣

The point on picking the right model is also one that most people don't spend enough time on and then wonder why their costs are astronomical!!!

Currently a Minimax m2.7 fan for actually getting stuff done at scale; hoping DeepSeek v4 is as good as the "leaked" stats show, as it will pretty much be my go-to (as v3 was when it came out, until others leapfrogged it).

With all that being said I still prefer Claude overall, just can't afford for it to do all the lifting!


Socials Megallm

the 45% reduction is impressive. did you benchmark this across different types of prompts or just specific use cases? curious how it handles code vs natural language.


GrahamTheDev

Code it does not touch, so that does not compress. In the demo codepen there are 20 test phrases behind a tab at the bottom, with their input and output, which you can check.

But bear in mind, this is a silly project designed to show an idea, not something to actually use!


Socials Megallm

i'm obsessed with token reduction lately, always looking for cleaner prompts and smarter chunking. this kind of thinking is a real cost-saver when you're running your own fine-tunes on a budget.


GrahamTheDev

Yeah, the principle is good. I hope you manage to extract the useful parts from my silly article and save yourself some dollars (and time, as we all know long context = slower responses and more corrections!) πŸ’—


GrahamTheDev

Anyone used Caveman? I have actually been using it for real, and although the actual token usage reduction is minimal, I have been fascinated by how well LLMs understand "Caveman speak".


When Notes Fly

Great article! Very useful insights on token optimization.


GrahamTheDev

Thanks, glad you enjoyed it / found it useful! πŸ’—


Mamoor Ahmad

Fantastic.


GrahamTheDev

Glad you enjoyed it! πŸ™πŸΌπŸ’—


Ratan Khurana

Awesome man, liked it.


GrahamTheDev

πŸ™πŸΌπŸ’—


Ingo Steinke, web developer

At the same time, people add more system prompts to explicitly require what only seemed self-evident: correct code, current language levels and software versions, let the AI check sources and documentation instead of hallucinating etc.

Currently everyone seems to build stuff on top of existing AI systems to optimize them to become more correct, more efficient, more whatever. Doctoring the symptoms because we can't reach the root cause. The whole idea of LLMs seems to be built on fluff, in a way. And even concise input tends to produce verbose, fluffy output.


PEACEBINFLOW

The irony of using an LLM to write code that reduces token usage for LLMs is the kind of recursive absurdity that actually makes a real point. Every token you don't send is compute you don't pay forβ€”in dollars, in watts, in water. The environmental framing is playful, but the underlying dynamic is genuine. We're all just... talking more than we need to.

What's interesting is how much of prompt "fluff" is social conditioning. We add pleasantries because we're used to talking to humans. "I would really appreciate it if you could..." is just keyboard calories. The model doesn't care. It processes the instruction the same either way. But writing a terse prompt feels rude, even when the recipient is a matrix of weights.

The code block protection is the detail that makes this actually usable. Without it, you'd strip i from every loop and break everything. With it, the compression stays safely outside the parts that matter. That's the difference between a joke and a tool you might actually run locally before pasting into a chat window.
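That protection step can be sketched in a few lines, assuming simple triple-backtick fencing (an illustration of the idea, not the codepen's actual implementation): stash each fenced block behind a placeholder, compress the prose, then restore the blocks untouched.

```javascript
// Sketch of code-block protection: fenced blocks are swapped for
// placeholders before compression and restored untouched afterwards,
// so loop variables like `i` survive.
const FENCE = "`".repeat(3); // built at runtime to avoid a literal fence

function compressOutsideCode(text, compress) {
  const blocks = [];
  const fenceRe = new RegExp(`${FENCE}[\\s\\S]*?${FENCE}`, "g");
  // Stash each fenced block behind an index placeholder.
  const masked = text.replace(fenceRe, (block) => {
    blocks.push(block);
    return `\u0000${blocks.length - 1}\u0000`;
  });
  const compressed = compress(masked);
  // Put the untouched blocks back.
  return compressed.replace(/\u0000(\d+)\u0000/g, (_, i) => blocks[Number(i)]);
}
```

The compression function never sees the code, so however aggressive the phrase rules get, the parts that must stay byte-identical do.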

I'm curious if you noticed any patterns in what kind of language inflated the token count most. Was it the polite framing, the redundant clarifications, or something else? Feels like there's a taxonomy of prompt bloat hiding in the data.


Zhijie Wong

Imagine: I can save tokens to get more tokens.