This is a submission for Weekend Challenge: Earth Day Edition
Fluffer: someone who helps people "get ready for work" in the adult film industry
Ok this is genuinely funny but also makes a real point. The irony of using an LLM to save tokens sent to an LLM is something I've been thinking about too - it's like driving to the gym to use the treadmill.
The 45% reduction claim is interesting though. I wonder how much of that holds up with more technical prompts vs conversational ones? My guess is code-heavy prompts would compress less since there's less fluff to begin with.
Also the 60 GW/year math is hilarious in the best way. Even if the real number is 1/10th of that it's still a wild amount of wasted compute on "please" and "I would really appreciate if you could".
It's a real architectural point, for sure. This is how many MoE LLMs work: they may actually have two LLMs, a small one with around 1B params and a big one with, for example, 100-200B params. Each request goes to the small LLM first, and if the answer is OK, it's just returned to the user. In fact, that's approximately a 45% token-economy profit! Just google LLM architectures to check.
Not seen an MoE architecture like that; most use the "activate just a few experts" approach for each token, which is why they can be tricky to work with on mixed contexts. Is this something else, or a specific type of MoE you are talking about, so I can go look it up? 🙂
Yeah, it's definitely a joke way of presenting a serious thing to consider. 🤣
Like you said, just stripping "please" would save a few Megawatts of electricity! I should have done a "remove the pleasantries" bot! haha.
And yeah, I think with a coding agent there would be far smaller results using this particular technique, maybe a 10% reduction (if we could apply it to its chain of thought). At that point the technique would be different, in that we'd try to reduce how much context ends up in the main agent's context window.
Glad you enjoyed the submission! 🙂
The irony is the point. You're using an LLM to write code that reduces token usage, but the LLM that writes the longest response is the one you needed. That's not a failure of the tool, it's a signal that token efficiency and helpfulness are sometimes in tension. Defluffer works because it's rule-based, not model-based. No irony there. Saving tokens by spending tokens would be recursive waste. You avoided that. That's the actual insight. Most people would have built an LLM to shorten prompts. You built a dictionary. Simpler, cheaper, faster, and it doesn't need to be prompted not to hallucinate.
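The rule-based dictionary approach described above can be sketched in a few lines. This is an illustrative sketch, not the actual Defluffer code; the phrase list and function name are made up for the example.

```javascript
// Illustrative phrase -> replacement table (not Defluffer's real rules).
// Most entries remove fluff outright; some abbreviate.
const RULES = [
  [/\bI would really appreciate it if you could\b/gi, ""],
  [/\bplease\b/gi, ""],
  [/\bcould you\b/gi, ""],
  [/\bfor example\b/gi, "e.g."],
];

function defluff(prompt) {
  let out = prompt;
  for (const [pattern, replacement] of RULES) {
    out = out.replace(pattern, replacement);
  }
  // Collapse the whitespace left behind by removals.
  return out.replace(/\s+/g, " ").trim();
}
```

No model calls, no hallucination risk: just string replacement, which is exactly why it costs nothing to run.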
Exactly, especially on the token efficiency vs getting the job done part!
As for a simple solution: well, I am a simple soul, so I do simple things! hahaha
Nice.
LLM systems waste enormous amounts of compute on unnecessary tokens.
And because context is reloaded every turn, waste compounds.
The principles in this post map cleanly to real engineering patterns:
These are all valid pre-processing or intermediate-representation techniques.
I'll see if I can operationalize them without breaking meaning, safety, or reliability, given the right architecture. Perhaps this weekend.
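The compounding point is worth making concrete: because the full history is resent every turn, a per-prompt saving multiplies across the conversation. The numbers below are invented purely for illustration.

```javascript
// Back-of-envelope model of chat token cost: the whole history is
// resent on every turn, so each turn pays for all previous turns again.
function totalPromptTokens(tokensPerTurn, turns) {
  let total = 0;
  let context = 0;
  for (let i = 0; i < turns; i++) {
    context += tokensPerTurn; // history grows each turn
    total += context;         // and the entire history is resent
  }
  return total;
}

const verbose   = totalPromptTokens(200, 10); // 200 * (1+2+...+10) = 11000
const defluffed = totalPromptTokens(110, 10); // 45% shorter turns  = 6050
```

The saving stays proportional, but the absolute number of wasted tokens grows quadratically with conversation length, which is why trimming fluff matters more for long agentic runs than for one-shot prompts.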
Please do give it a go and tag me in a comment / the article (if you write one) as to how you made it more "production ready", would love to see it! 🙂
Creating a tool (using an LLM) to "reduce" the fluff factor of prompts using another LLM in order to "save" resources... but then writing an article (probably with some help from an LLM) of 300+ words... full of greenwashing-like claims... where only one sentence really matters... absolutely 10/10
I like the tight circle it creates: problem -> use the problem to solve the problem (but not really) -> claim you are saving the world, while actually building a toy! hahaha
this is the kind of thing that actually matters when you are running many agents in parallel - token bloat compounds fast. curious if you measured savings on structured prompts vs freeform; those seem to behave differently
No, it was a silly "this is a technique you should consider" article. Defo not an actual production ready thing so just did minimal testing!
Basically take the concept, defo dont take the code! hahaha
Honest disclaimer probably saves more headaches than the article itself. Token bloat across parallel agents is one of those costs that only become real at scale anyway; concept-level awareness is exactly what most teams need first.
45% token reduction is substantial; that's meaningful at scale, especially for long-running agentic workflows where every token has a real cost.
Curious about what the "fluff" actually consists of in your pipeline. Is it mostly semantic redundancy (repeated framing, re-explaining what was just said), formatting artifacts from previous turns, or something else? The answer changes whether this is a preprocessing step versus something that needs to happen mid-generation.
Also: did you measure any quality regression? The risk with aggressive deduplication is losing subtle but important distinctions in the remaining tokens.
If you look at the codepen, it is purely phrase -> replacement. Most are removals, some are abbreviations, etc.
Bear in mind, this is a toy project, to prove a point on token management, not a production tool!
haha the naming convention alone is worth the click
real talk though -- the point about context window multiplication is underrated. 45% reduction on the initial prompt compounds into massive savings across a full conversation. combine that with picking the right model for each task and your api bill drops dramatically
nice submission 🌱
Haha yeah I had a bit of fun with the naming 🤣
The point on picking the right model is also one that most people don't spend enough time on and then wonder why their costs are astronomical!!!
Currently a Minimax m2.7 fan for actually getting stuff done at scale; hoping Deepseek v4 is as good as the "leaked" stats show, as that will pretty much be my go-to (as it was when v3 came out, until others leapfrogged it).
With all that being said I still prefer Claude overall, just can't afford for it to do all the lifting!
the 45% reduction is impressive. did you benchmark this across different types of prompts or just specific use cases? curious how it handles code vs natural language.
Code it does not touch, so code does not compress. In the demo codepen there are 20 test phrases behind a tab at the bottom, with their input and output, which you can check.
But bear in mind, this is a silly project designed to show an idea, not something to actually use!
I'm obsessed with token reduction lately, always looking for cleaner prompts and smarter chunking. This kind of thinking is a real cost-saver when you're running your own fine-tunes on a budget.
Yeah, the principle is good. I hope you manage to extract the useful parts from my silly article and save yourself some dollars (and time, as we all know long context = slower responses and more corrections)! 🙂
Anyone used Caveman? I have actually been using it for real and although the actual token usage reduction is minimal I have been fascinated by how well LLMs understand "Caveman speak"
Great article! Very useful insights on token optimization.
Thanks, glad you enjoyed it / found it useful! 🙂
Fantastic.
Glad you enjoyed it! 🙏🏼🙂
Awesome man, liked it.
🙏🏼🙂
At the same time, people add more system prompts to explicitly require what once seemed self-evident: correct code, current language and software versions, letting the AI check sources and documentation instead of hallucinating, etc.
Currently everyone seems to build stuff on top of existing AI systems to optimize them to become more correct, more efficient, more whatever. Doctoring the symptoms because we can't reach the root cause. The whole idea of LLMs seems to be built on fluffer, in a way. And even concise input tends to produce verbose, fluffy output.
The irony of using an LLM to write code that reduces token usage for LLMs is the kind of recursive absurdity that actually makes a real point. Every token you don't send is compute you don't pay for: in dollars, in watts, in water. The environmental framing is playful, but the underlying dynamic is genuine. We're all just... talking more than we need to.
What's interesting is how much of prompt "fluff" is social conditioning. We add pleasantries because we're used to talking to humans. "I would really appreciate it if you could..." is just keyboard calories. The model doesn't care. It processes the instruction the same either way. But writing a terse prompt feels rude, even when the recipient is a matrix of weights.
The code block protection is the detail that makes this actually usable. Without it, you'd strip `i` from every loop and break everything. With it, the compression stays safely outside the parts that matter. That's the difference between a joke and a tool you might actually run locally before pasting into a chat window.

I'm curious if you noticed any patterns in what kind of language inflated the token count most. Was it the polite framing, the redundant clarifications, or something else? Feels like there's a taxonomy of prompt bloat hiding in the data.
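The protection idea can be sketched simply: split the text on fenced code blocks, compress only the prose segments, and stitch everything back together. The regex and function names here are illustrative assumptions, not the tool's actual implementation.

```javascript
// Split on triple-backtick fences (the capture group keeps the fences
// in the result). Odd-indexed segments are the code blocks; leave them
// untouched and apply the compressor only to the prose in between.
function compressOutsideCode(text, compressProse) {
  const parts = text.split(/(`{3}[\s\S]*?`{3})/g);
  return parts
    .map((part, i) => (i % 2 === 1 ? part : compressProse(part)))
    .join("");
}
```

Any phrase-stripping function can be passed in as `compressProse`, so loop variables, identifiers, and string literals inside fences survive intact.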
Imagine: I can save tokens to get more tokens.