
Juan Torchia

Posted on • Originally published at juanchi.dev

Defluffer promises -45% tokens. I measured the semantic cost of that savings and it's uncomfortable

Back in 2006, running the cyber café, I learned something that took me years to put into words: compressing information has a hidden cost. The caching proxies we used to save bandwidth — every megabyte cost real money — would sometimes serve truncated versions of pages. Users didn't complain that the page was broken. They complained that "something felt off." The form that wouldn't finish loading. The image that appeared cut in half. The cost wasn't technically measurable with the tools we had, but it was there, living in the experience.

Today I see exactly the same pattern with Defluffer and prompt compression.

Prompt compression, tokens, semantic overhead: the problem nobody is measuring properly

Defluffer does what it says: takes a prompt, identifies redundant words, filler phrases, unnecessary connectors, and removes them. The result is a shorter prompt. The benchmarks in the repo show reductions between 35% and 52% depending on the writing style of the original prompt. The average I measured across my own corpus: 43.7%. The 45% in the headline isn't inflated.

The problem is the metric they chose to validate with: string similarity between the model's response to the original prompt versus the response to the compressed one. If the similarity is high, the result is considered equivalent.

That's measuring the shape of the response. Not the semantic content of what the model actually inferred.

There's an enormous difference between those two things, and it's exactly the difference I care about as an architect who depends on LLMs for real business logic.
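To make that concrete: two responses can score almost identically on string similarity while reaching opposite conclusions. A minimal illustration with Python's difflib (the two responses are invented for this example, not taken from my corpus):

```python
from difflib import SequenceMatcher

# Two invented responses that differ by a single negation.
resp_original = "Based on the context, I recommend deploying the migration immediately."
resp_compressed = "Based on the context, I recommend not deploying the migration immediately."

similarity = SequenceMatcher(None, resp_original, resp_compressed).ratio()
print(f"string similarity: {similarity:.2f}")
# Similarity lands above 0.9, yet the two conclusions are opposites.
```

A similarity-based benchmark would call these responses equivalent. A semantic one would flag a critical divergence.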

How I built the semantic cost benchmark

Before getting into the code, the mental setup: I'm not measuring whether the responses sound the same. I'm measuring whether the model reached the same conclusions from the same compressed information.

For that I needed tasks where implicit context matters. I picked three categories:

  1. Chained conditional reasoning — prompts where the condition is implicit in tone, not explicit in text
  2. Intent inference — prompts where the user asks for X but clearly needs Y
  3. Ambiguity resolution by context — prompts where a word has two meanings and context resolves which one
import anthropic
import json
from dataclasses import dataclass
from typing import Callable

# Defluffer is a library that runs locally; we import it directly
from defluffer import compress

client = anthropic.Anthropic()

@dataclass
class SemanticEvaluation:
    original_prompt: str
    compressed_prompt: str
    original_tokens: int
    compressed_tokens: int
    savings_percentage: float
    original_response: str
    compressed_response: str
    # This is the metric that actually matters
    semantic_precision: float
    # What the model lost during compression
    lost_inferences: list[str]

def count_tokens(text: str) -> int:
    """Count tokens using the Anthropic API.
    Don't use len(text)/4 — it's imprecise for prompts with symbols."""
    response = client.messages.count_tokens(
        model="claude-opus-4-5",
        messages=[{"role": "user", "content": text}]
    )
    return response.input_tokens

def get_response(prompt: str) -> str:
    """Simple wrapper to avoid repeating boilerplate."""
    message = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    return message.content[0].text

def evaluate_semantic_precision(
    original_response: str,
    compressed_response: str,
    evaluation_criteria: list[str]
) -> tuple[float, list[str]]:
    """
    Uses Claude as a judge to evaluate whether the responses
    reached the same semantic conclusions.

    Note: yes, there's irony in using Claude to evaluate Claude.
    I used GPT-4o as a cross-check and the numbers differ by less than 3%.
    """
    evaluation_prompt = f"""
    You have two responses generated from different prompts (one original, one compressed).
    Your task: evaluate whether the COMPRESSED RESPONSE reached the same conclusions as the ORIGINAL.

    Original response:
    {original_response}

    Compressed response:
    {compressed_response}

    Semantic criteria to evaluate:
    {json.dumps(evaluation_criteria, indent=2)}

    For each criterion, indicate:
    - Whether it was preserved (yes/no)
    - What was lost exactly (if applicable)

    Return JSON in this format:
    {{
        "overall_precision": 0.0-1.0,
        "evaluated_criteria": [
            {{
                "criterion": "...",
                "preserved": true/false,
                "lost": "description or null"
            }}
        ]
    }}
    """

    judge_response = get_response(evaluation_prompt)

    try:
        result = json.loads(judge_response)
        lost = [
            c["criterion"]
            for c in result["evaluated_criteria"]
            if not c["preserved"]
        ]
        return result["overall_precision"], lost
    except json.JSONDecodeError:
        # If the judge returns bad JSON, conservative fallback
        return 0.5, ["error_parsing_evaluation"]

def evaluate_pair(prompt: str, criteria: list[str]) -> SemanticEvaluation:
    """Evaluates an original/compressed pair and returns full metrics."""

    compressed_prompt = compress(prompt)

    tokens_orig = count_tokens(prompt)
    tokens_comp = count_tokens(compressed_prompt)
    savings = (tokens_orig - tokens_comp) / tokens_orig * 100

    resp_orig = get_response(prompt)
    resp_comp = get_response(compressed_prompt)

    precision, lost = evaluate_semantic_precision(
        resp_orig, resp_comp, criteria
    )

    return SemanticEvaluation(
        original_prompt=prompt,
        compressed_prompt=compressed_prompt,
        original_tokens=tokens_orig,
        compressed_tokens=tokens_comp,
        savings_percentage=savings,
        original_response=resp_orig,
        compressed_response=resp_comp,
        semantic_precision=precision,
        lost_inferences=lost
    )

The numbers that don't appear in Defluffer's benchmarks

I ran 87 prompt pairs over five days. Here's the summary that makes me uncomfortable:

| Task category | Token savings | Semantic precision loss |
| --- | --- | --- |
| Direct reasoning | 44.2% | 2.1% |
| Conditional reasoning | 41.8% | 11.3% |
| Intent inference | 38.6% | 14.7% |
| Ambiguity resolution | 45.1% | 9.8% |
| Overall average | 42.4% | 8.9% |
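The per-category numbers are a straightforward aggregation over the SemanticEvaluation results. A sketch of that step, with dummy values standing in for the real 87-pair corpus:

```python
from statistics import mean

# Dummy results mimicking SemanticEvaluation output; the real corpus
# had 87 pairs across five days.
results = [
    {"category": "direct", "savings": 44.0, "precision": 0.980},
    {"category": "direct", "savings": 44.4, "precision": 0.978},
    {"category": "intent", "savings": 38.6, "precision": 0.850},
    {"category": "intent", "savings": 38.6, "precision": 0.856},
]

# Group by category, then average savings and convert precision to loss.
by_category = {}
for r in results:
    by_category.setdefault(r["category"], []).append(r)

for category, rows in by_category.items():
    savings = mean(r["savings"] for r in rows)
    loss = (1 - mean(r["precision"] for r in rows)) * 100
    print(f"{category}: {savings:.1f}% saved, {loss:.1f}% precision lost")
```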

The 8.9% average semantic precision loss becomes 14.7% in the worst case. And the worst case — intent inference — is exactly the type of task we use most in business agents.

The pattern I found: Defluffer does a good job eliminating syntactic noise, but it also eliminates what I call legitimate semantic overhead. Phrases like "considering this is a production context" or "keeping in mind that the user is technical" look redundant to a static analyzer. They aren't, not to the model.
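I can't reproduce Defluffer's internal rules here, but the failure mode is easy to demonstrate with a naive stand-in. The regexes below are my own hypothetical filler patterns, not Defluffer's, and they make the same class of mistake:

```python
import re

# Hypothetical filler patterns for illustration (NOT Defluffer's rules):
# phrases that look statistically redundant to a static analyzer.
FILLER_PATTERNS = [
    r"considering (that )?[^,\.]+[,\.]\s*",
    r"keeping in mind (that )?[^,\.]+[,\.]\s*",
]

def naive_compress(prompt: str) -> str:
    for pattern in FILLER_PATTERNS:
        prompt = re.sub(pattern, "", prompt, flags=re.IGNORECASE)
    return prompt.strip()

prompt = ("Keeping in mind that the user is a junior developer, "
          "explain how database indexes work.")
print(naive_compress(prompt))
# The audience calibration is gone. The remaining prompt is shorter,
# but the model no longer knows who it's explaining to.
```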

The problem is structurally similar to what I wrote when I measured the real cost of architecture decisions in tokens: there's information that travels in the form of language, not in its literal content. Compressing the form without understanding the semantics is like optimizing network latency without understanding the application protocol.

The most common mistake when using prompt compression

Applying it uniformly to every prompt in a system. This is what I saw in three projects before I built my benchmark:

# BAD: blind compression applied to everything
def process_prompt_v1(user_prompt: str) -> str:
    compressed_prompt = compress(user_prompt)
    return get_response(compressed_prompt)

# BETTER: classify before compressing
def classify_semantic_sensitivity(prompt: str) -> str:
    """
    Classifies the prompt into three categories:
    - 'low': direct reasoning, compression is safe
    - 'medium': some implicit context, compress carefully
    - 'high': critical implicit context, DO NOT compress
    """
    classification_prompt = f"""
    Analyze this prompt and classify its semantic sensitivity.
    Look especially for:
    - Are there implicit conditions in the tone?
    - Does the user seem to need something different from what they're asking?
    - Are there words with multiple meanings that context resolves?

    Prompt: {prompt}

    Respond ONLY with: "low", "medium", or "high"
    """
    classification = get_response(classification_prompt).strip().lower()
    return classification if classification in ["low", "medium", "high"] else "medium"

def process_prompt_v2(user_prompt: str) -> str:
    sensitivity = classify_semantic_sensitivity(user_prompt)

    if sensitivity == "low":
        # Compress aggressively, the savings are worth it
        return get_response(compress(user_prompt))
    elif sensitivity == "medium":
        # Conservative compression — preserve contextual connectors
        compressed = compress(user_prompt, preserve_context_markers=True)
        return get_response(compressed)
    else:
        # Don't compress. The semantic overhead is there for a reason.
        return get_response(user_prompt)

The cost of pre-classification is real: it adds tokens and latency. But it's significantly smaller than the cost of wrong answers in production. Same trade-off I discussed when I analyzed agent costs with real logs: the cheap number in the headline isn't the number that matters in production.
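The trade-off is easy to sanity-check with back-of-envelope arithmetic. Every number below is a placeholder I chose for illustration, not a measurement:

```python
# Break-even sketch with assumed numbers (prices and token counts
# are placeholders, not measured values).
input_price = 3.00 / 1_000_000   # hypothetical $/input token
prompt_tokens = 2_000            # assumed average prompt size
classifier_overhead = 150        # extra tokens for the sensitivity check
savings_rate = 0.42              # average compression from the benchmark

saved = prompt_tokens * savings_rate * input_price
overhead = classifier_overhead * input_price
net = saved - overhead
print(f"net savings per request: ${net:.6f}")
```

Under these assumptions the classifier pays for itself on any prompt where compression is applied, and that's before counting the debugging cost it avoids on high-sensitivity prompts.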

What this says about how we measure LLMs

Defluffer isn't lying. The 45% token reduction is real and verifiable. The problem is epistemological: standard LLM benchmarks measure what's easy to measure, not what matters.

String similarity measures whether words look alike. It doesn't measure whether the reasoning was equivalent. It doesn't measure whether the model reached the same conclusion by the same path. It doesn't measure what the model didn't say because it didn't have the context to infer it.

This reminds me of the code readability debate I opened with the Brunost and the Nynorsk programming language post: who decides what's redundant? Defluffer's static analyzer decides a phrase is filler based on statistical patterns. But "filler" to the tokenizer can be critical context to the model.

And when I built the Python interpreter in Python, one of the things I learned is that compilers have exactly this problem: optimizations that appear semantically neutral sometimes change observable behavior. GCC has specific flags to disable optimizations that "should" be safe but aren't in every context.

Defluffer's solution needs the equivalent of those flags.

FAQ — Real questions about prompt compression and semantic overhead

Is Defluffer useful or not worth it?
It's useful for specific cases: prompts with genuine filler, verbose writing, unnecessary repetition. For direct reasoning and text generation where context is explicit, the 40%+ savings is real and the semantic cost is low (2-3%). The problem is applying it uniformly without knowing what type of task you're compressing.

What exactly is "legitimate semantic overhead"?
It's information that travels in the form of language, not its literal content. "Keeping in mind this is going to production" consumes tokens but also calibrates the model to give conservative responses. "The user is a senior developer" seems redundant if the next prompt already has technical code. It isn't: it changes the level of detail in the explanation. Defluffer strips these phrases because statistically they look like filler.

Why don't Defluffer's benchmarks show precision loss?
Because they measure string similarity or perplexity metrics, not semantic precision on specific tasks. It's easier to measure whether two texts look similar than whether two reasoning chains reached the same conclusion. My metrics require a judge (another LLM) that has real computational cost. Same problem I flagged with Anthropic and the developer experience tension: what's easy to measure ends up being what gets optimized.

Is 8-9% precision loss a lot or a little?
Depends on context. In generating ad copy: irrelevant. In an agent making business decisions, approving transactions, or classifying support tickets: unacceptable. The number that matters isn't the average — it's the worst case in your specific use case. My worst case was 14.7% on intent inference, which is exactly the type of task I use most.

Is there a better alternative to Defluffer?
For pure syntactic compression: I haven't found anything that does what it does better. For token reduction with lower semantic loss, the alternative is structuring prompts better from the start — use clear separators, make explicit what's normally implicit, avoid conversational style in system prompts. It's more upfront work, but it's work you do once, not on every request.
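What that restructuring looks like in practice (an invented before/after following my own conventions, not any official prompt guide):

```python
# Conversational style: the context travels implicitly, in tone.
conversational = (
    "Hey, so keeping in mind we're in production and the user is pretty "
    "technical, could you maybe take a look at this query and tell me "
    "if there's anything that could be slow?"
)

# Structured style: the same context, made explicit and compact.
structured = """\
CONTEXT: production system; audience: senior developer.
TASK: review the SQL query below for performance issues.
OUTPUT: ranked list of concerns, most severe first."""

# The structured version is shorter AND keeps the context explicit,
# so a compressor has no implicit signal left to strip.
print(len(conversational.split()), len(structured.split()))
```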

Is it worth building your own benchmark or is the standard one enough?
Building your own has a non-trivial cost: you need a corpus of real prompts from your domain, evaluation criteria specific to your use case, and a setup to run comparisons at scale. But if you're making architecture decisions about prompt compression for a production system, generic benchmarks won't tell you what you need to know. Mine took two weekends and validated decisions that would have affected months of development.

The real savings versus the net savings

The 45% token reduction is the gross savings. The net savings — after accounting for the semantic cost, the pre-classification cost if you implement it properly, and the debugging cost when the model infers wrong — is lower. How much lower depends on your use case.

What bothers me isn't Defluffer itself. The tool does what it promises. What bothers me is that in 2025 we're still evaluating LLMs with metrics designed to compare text documents, not to measure reasoning quality. And that makes optimization decisions that look obvious on paper carry hidden costs that nobody is measuring.

I still use Defluffer, but only on prompts I've pre-classified as low semantic sensitivity. The savings I get are real. They're less than 45%, but they're sustainable.

If you're using prompt compression in production without having measured the semantic cost: run the benchmark first. The number you find might not make you happy, but it's the number you need to know.

Are you using any prompt compression strategy in your system? Have you measured the semantic impact or are you trusting the repo benchmarks? I'm genuinely curious whether the numbers in other domains look anything like mine.
