This is a submission for Weekend Challenge: Earth Day Edition
Fluffer: someone who helps people "get ready for work" in the adult film industry
Defluffer: a simple script that removes "fluff" and "filler" from your prompts! Could save over a million trees' worth of CO2 a year (but probably not).
Don't worry, this article is about the latter! It's my silly submission for the Earth Day challenge (but with a real message and principles that could have a massive environmental impact).
What I Built
I built Defluffer - a text length reduction tool to keep your prompts nice and short!
Save an average of 45% of your prompt's tokens
with near zero compute!!!!
Every token you can save in a prompt means hundreds of tokens saved over a full conversation with an LLM, because the whole context is re-sent at each step (simplified explanation).
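A quick back-of-envelope to show the compounding (the conversation length here is an assumption, not a measurement):

```javascript
// Rough illustration: the prompt sits in the context and gets re-sent
// on every later turn, so a one-off saving is multiplied by the turn count.
const tokensSavedInPrompt = 135; // the per-prompt saving used later in this post
const conversationTurns = 20;    // assumed chat length, purely illustrative
console.log(tokensSavedInPrompt * conversationTurns); // 2700 tokens never processed
```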
Defluffer is inspired by Caveman - except it uses code to reduce the payload size, as using an LLM...to save tokens sent to an LLM, well...it uses tokens...and that just seemed silly!
Is it a serious project?
Absolutely not, don't use it in production for the love of all that is mighty!
Are the principles useful to think about?
Absolutely!
Fewer tokens = fewer megawatt-hours = less pollution, less water use, less demand for the rare materials GPUs need, etc. etc.
It also saves you money when using the APIs vs subscriptions!
In theory, if you "shored up" this script and every developer in the world used it, we could save over 60 gigawatt-hours a year (fluffed numbers from Gemini, based on 40 million devs using AI, 30 prompts a day, and 135 tokens saved per prompt).
Or:
- 🏡 5,600 homes powered for a YEAR!
- 📱 3.94 billion phone charges
- 🌳 The annual CO2 absorption of 1.12 million trees!
Now THAT is how we save the planet!
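For the curious, here is roughly how those fluffed numbers roll up. The per-token energy figure is the assumption implied by Gemini's estimate, not something I measured:

```javascript
// Fluffed-numbers maths - nothing here is a measurement.
const devs = 40_000_000;
const promptsPerDay = 30;
const tokensSavedPerPrompt = 135;
const whPerToken = 0.001; // ~1 mWh per token: the assumption implied by the estimate

const tokensPerYear = devs * promptsPerDay * tokensSavedPerPrompt * 365; // ~59 trillion
const gwhPerYear = (tokensPerYear * whPerToken) / 1e9;
console.log(gwhPerYear.toFixed(0)); // ~59 GWh a year, the same ballpark as the headline figure
```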
Demo
There is a box at the top where you can enter a prompt and see how many tokens / words you can save when it is "defluffed"!
Here is a demo prompt you can copy-paste in to try it:
Hello there! I would really appreciate it if you could act as a senior backend developer. I am trying to figure out how to write a python script that connects to the database and retrieves all of the information from the user repository.
Make sure that the results are filtered so that the retry count is greater than or equal to 5, and the active status is strictly equals to true. Due to the fact that the application is currently in the production environment, it is required that you utilize the environment configurations instead of hardcoding the parameters into the functions.
Also, I have a question about the following snippet. Could you please refactor this code without using any external libraries?
```javascript
function calculateMaximum(array) {
  if (array === null) return 0;
  return Math.max(...array);
}
```
Take into consideration that the output should be formatted as a standard JSON object. If you don't mind, please provide a step by step guide on how to deploy this microservice to the kubernetes cluster at the very end. Thank you so much!
Below that, in the "impact calculator" tab, there are sliders to explore potential yearly savings in CO2 / power.
The "test results" tab shows the size-reduction results from running Defluffer over a small test suite of sample prompts!
Codepen Demo, make sure to scroll down!!!
Code
The core code is really simple.
The hard part was the list of phrases to "compress" (essentially just a list of phrases that we either replace or remove).
You can View The Code and Replace List in Codepen
Below is the entire class though!
class Defluffer {
  constructor(dictionaries) {
    this.phrasesAndLogic = { ...dictionaries.phrases, ...dictionaries.logic };
    this.synonyms = dictionaries.synonyms || {};
    this.blacklist = new Set(dictionaries.blacklist || []);
  }

  compress(prompt) {
    let text = prompt;
    let protectedItems = [];

    // 1. Extract and protect code blocks (fenced blocks and inline backticks)
    text = text.replace(/(```[\s\S]*?```|`[^`]+`)/g, (match) => {
      protectedItems.push(match);
      return `PROT${protectedItems.length - 1}PROT`;
    });

    // 2. Strip multi-word blacklist entries
    for (const entry of this.blacklist) {
      if (!entry.includes(' ')) continue;
      const escaped = entry.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
      text = text.replace(new RegExp(`\\b${escaped}\\b`, 'gi'), '');
    }

    // 3. Phrase and logic collapsing
    for (const [phrase, replacement] of Object.entries(this.phrasesAndLogic)) {
      if (!phrase) continue;
      const escaped = phrase.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
      const regex = new RegExp(`\\b${escaped}\\b`, 'gi');
      text = text.replace(regex, () => {
        if (!replacement || replacement.trim() === '') return ' ';
        // Protect the replacement so later passes don't strip it again
        protectedItems.push(replacement);
        return `PROT${protectedItems.length - 1}PROT`;
      });
    }

    // 4. Tokenize
    let tokens = text.split(/(\b[a-zA-Z0-9_'-]+\b)/);

    // 5. Apply single-word blacklist and synonyms
    tokens = tokens.map(token => {
      if (!/^[a-zA-Z0-9_'-]+$/.test(token)) return token;
      if (/^PROT\d+PROT$/.test(token)) return token;
      const lower = token.toLowerCase();
      if (this.blacklist.has(lower)) return '';
      if (this.synonyms[lower]) return this.synonyms[lower];
      return token;
    });

    // 6. Rejoin and clean
    text = tokens.join('')
      .replace(/\s+/g, ' ')
      .replace(/\s+([.,?!;:])/g, '$1')
      .trim();

    // 7. Restore protected items
    protectedItems.forEach((item, index) => {
      const placeholder = `PROT${index}PROT`;
      while (text.includes(placeholder)) {
        text = text.replace(placeholder, item);
      }
    });

    // 8. Final cleanup
    return text
      .replace(/\s+/g, ' ')
      .replace(/\s+([.,?!;:])/g, '$1')
      .trim();
  }
}
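And a minimal usage sketch. The dictionary below is a made-up stand-in; the real replace list lives in the Codepen:

```javascript
// Tiny illustrative dictionary - the real one is much, much longer.
const defluffer = new Defluffer({
  phrases: {
    'i would really appreciate it if you could': '',
    'due to the fact that': 'because',
  },
  logic: { 'is greater than or equal to': '>=' },
  synonyms: { application: 'app', utilize: 'use' },
  blacklist: ['please', 'really', 'very'],
});

const prompt =
  'I would really appreciate it if you could refactor `calculateMaximum` ' +
  'due to the fact that the application is slow.';

console.log(defluffer.compress(prompt));
// -> "refactor `calculateMaximum` because the app is slow."
```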
How I Built It
Vibe coded with Google Gemini!
I essentially:
- mapped out the problem space and the inspiration.
- went through the provided options (which included NLP libraries and other things I dismissed) until we came up with the core principles:
  - Whitespace Reduction: Pure regex changing tabs, double spaces etc. to single spaces.
  - Phrase Collapsing: Dictionary lookup of phrases and their replacements.
  - Fluff Blacklist: Hash set lookup of words to just remove ("a", "it" etc.).
  - Symbolic Logic: Dictionary lookup and replace ("not" becomes "!").
  - Stemming/Synonyms: Dictionary lookup and replace ("application" becomes "app").
- Got Gemini to write the code and create the dictionary
- Asked for more dictionary items
- Asked for even more
- Gave up and asked Claude, as it isn't stingy with message length
- added basic code exclusion (we don't want to remove `i` as a var, so we leave code intact) and key phrase exclusions ("act as a" becomes "be", but "be" is then protected so we don't remove it later)
- got Gemini to write some test phrases
- got Gemini to add a pretty UI and some basic "equivalent CO2 savings" at the bottom
Prize Categories
Best Use of Google Gemini???!???...even though I had to use Claude as it just won't do long messages?
I mean, I am asking an LLM to write code to reduce its own token usage, so the irony of wanting a long message is not lost on me... so technically Gemini was better than Claude here? haha
Top comments (44)
At the same time, people add more system prompts to explicitly require what should be self-evident: correct code, current language levels and software versions, letting the AI check sources and documentation instead of hallucinating, etc.
Currently everyone seems to build stuff on top of existing AI systems to optimize them into becoming more correct, more efficient, more whatever. Doctoring the symptoms because we can't reach the root cause. The whole idea of LLMs seems to be built on fluff in a way. And even concise input tends to produce verbose, fluffy output.
The irony is the point. You're using an LLM to write code that reduces token usage, but the LLM that writes the longest response is the one you needed. That's not a failure of the tool, it's a signal that token efficiency and helpfulness are sometimes in tension. Defluffer works because it's rule-based, not model-based. No irony there. Saving tokens by spending tokens would be recursive waste. You avoided that. That's the actual insight. Most people would have built an LLM to shorten prompts. You built a dictionary. Simpler, cheaper, faster, and it doesn't need to be prompted not to hallucinate.
Exactly, especially on the token efficiency vs getting the job done part!
As for a simple solution - well, I am a simple soul, so I do simple things! hahaha
Nice.
LLM systems waste enormous amounts of compute on unnecessary tokens.
And because context is reloaded every turn, waste compounds.
The principles in this post map cleanly to real engineering patterns - these are all valid pre-processing or intermediate-representation techniques.
Will see if I can operationalize them without breaking meaning, safety, or reliability with the right architecture, perhaps this weekend.
Please do give it a go and tag me in a comment / the article (if you write one) as to how you made it more "production ready", would love to see it! 💗
The irony of using an LLM to write code that reduces token usage for LLMs is the kind of recursive absurdity that actually makes a real point. Every token you don't send is compute you don't pay for—in dollars, in watts, in water. The environmental framing is playful, but the underlying dynamic is genuine. We're all just... talking more than we need to.
What's interesting is how much of prompt "fluff" is social conditioning. We add pleasantries because we're used to talking to humans. "I would really appreciate it if you could..." is just keyboard calories. The model doesn't care. It processes the instruction the same either way. But writing a terse prompt feels rude, even when the recipient is a matrix of weights.
The code block protection is the detail that makes this actually usable. Without it, you'd strip `i` from every loop and break everything. With it, the compression stays safely outside the parts that matter. That's the difference between a joke and a tool you might actually run locally before pasting into a chat window.
I'm curious if you noticed any patterns in what kind of language inflated the token count most. Was it the polite framing, the redundant clarifications, or something else? Feels like there's a taxonomy of prompt bloat hiding in the data.
Yeah, if I did this properly I would look at techniques to reduce code size, but it would take an understanding of what the LLM was looking for (i.e. if it is looking to see how a function works, but not intending to edit it, we could minify it).
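A rough sketch of what I mean, purely illustrative and not part of Defluffer - strip comments and blank lines from code the model only needs to read, not edit:

```javascript
// Naive "read-only" minifier for code pasted as context.
// (A real version needs a proper parser - this will happily mangle
// string literals that contain // or /* ... */.)
function shrinkForContext(code) {
  return code
    .replace(/\/\*[\s\S]*?\*\//g, '') // drop block comments
    .replace(/\/\/.*$/gm, '')         // drop line comments
    .split('\n')
    .map((line) => line.trimEnd())
    .filter((line) => line.trim() !== '') // drop now-empty lines
    .join('\n');
}
```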
Defo an interesting area to look at!
Fantastic.
Glad you enjoyed it! 🙏🏼💗
this is the kind of thing that actually matters when you are running many agents in parallel - token bloat compounds fast. curious if you measured savings on structured prompts vs freeform, those seem to behave differently
No, it was a silly "this is a technique you should consider" article. Defo not an actual production ready thing so just did minimal testing!
Basically: take the concept, defo don't take the code! hahaha
Honest disclaimer probably saves more headaches than the article itself. Token bloat across parallel agents is one of those costs that only becomes real at scale anyway — concept-level awareness is exactly what most teams need first.
45% token reduction is substantial — that's meaningful at scale, especially for long-running agentic workflows where every token has a real cost.
Curious about what the "fluff" actually consists of in your pipeline. Is it mostly semantic redundancy (repeated framing, re-explaining what was just said), formatting artifacts from previous turns, or something else? The answer changes whether this is a preprocessing step versus something that needs to happen mid-generation.
Also — did you measure any quality regression? The risk with aggressive deduplication is losing subtle but important distinctions in the remaining tokens.
If you look at the Codepen, it is purely phrase -> replacement. Most are removals, some are abbreviations, etc.
Bear in mind, this is a toy project, to prove a point on token management, not a production tool!
Creating a tool (using an LLM) to "reduce" the fluff-factor of prompts going to another LLM in order to "save" resources... but then writing an article (probably with some help of an LLM) of 300+ words... full of greenwashing-like claims... where only one sentence really matters... absolutely 10/10
I like the tight circle it creates: problem -> use the problem to solve the problem (but not really) -> claim you are saving the world while actually building a toy! hahaha
Loved this. The humor lands, but the core point is real:
small prompt cuts compound fast in multi-turn, multi-agent workflows.
Protecting code blocks is the key detail that makes this practical.
A follow-up comparing conversational/spec/code-edit prompts with a quick “meaning preserved” score would be great.
If I were to do this properly, I would go a route similar to rtk-ai.app/ and focus on core things that are safe to compress / summarise deterministically, as well as string replacement etc.
I actually do something fairly REAL and deterministic like this for transcript parsing for our Conversational Intelligence platform. There's a lot of inefficiency in the way that various transcription platforms (Whisper is HORRIBLY inefficient - every word has timing tags around it) generate their output.
also... look for "filler words" and "stammering" like repeated words, etc.
So, smoosh... fewer tokens, same quality output.
And... yes, AI helps me tune the code with more and more transcripts... but I would definitely oppose executing that transactionally. Maybe there IS viability in a super tiny model (4o-mini or something) before siccing a more expensive model on deeper analysis.
Look up Caveman (I linked it in the article); you can set that to use a tiny model - but honestly it seems stupid to use an LLM to reduce the output of an LLM when you can just instruct the LLM to be brief, skip the pleasantries, use abbreviations, etc.