Heuristic Prompt Compression: Cut Your Context Window Usage Without Losing Key Information

#hermeschallenge #ai #python #agents

Your prompt is 40,000 tokens. Most of that is a long document the user uploaded. The relevant parts for this question are probably 5,000 tokens. You are paying for 35,000 tokens of context that the model may never use.

Semantic compression using another LLM call is accurate but expensive: you spend tokens to save tokens. Heuristic compression is faster and cheaper: remove boilerplate, deduplicate redundant sentences, drop stop-word-heavy sections, trim to what is statistically likely to matter.

llm-prompt-compress applies heuristic compression to reduce prompt size before sending.

The Shape of the Fix

from llm_prompt_compress import PromptCompress

compress = PromptCompress(
    target_tokens=8000,
    strategy="balanced",  # "aggressive" | "balanced" | "conservative"
)

long_document = load_user_document()  # 35,000 tokens of text

compressed = compress.compress(long_document)

print(f"Original: ~{compress.estimate_tokens(long_document)} tokens")
print(f"Compressed: ~{compress.estimate_tokens(compressed)} tokens")
print(f"Ratio: {compress.ratio(long_document, compressed):.1%}")

question = "What is the refund policy?"  # example question; substitute the user's own

messages = [
    {"role": "user", "content": f"Based on this document:\n\n{compressed}\n\nAnswer: {question}"},
]

Typical result: 60-75% size reduction with 85-95% information retention for long prose documents. Numbers, code, and lists compress less aggressively to preserve their content.
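The snippet above calls `estimate_tokens` and `ratio` without showing what they compute. Here is a minimal sketch of what such helpers might look like, assuming a rough rule of thumb of about four characters per token and assuming the ratio means the fraction of tokens removed. Both are assumptions for illustration; the library's own definitions may differ, and `reduction_ratio` is a hypothetical name.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate: assumes ~4 characters per token (an approximation)."""
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)


def reduction_ratio(original: str, compressed: str) -> float:
    """Fraction of estimated tokens removed (0.0 means nothing was removed)."""
    original_tokens = estimate_tokens(original)
    if original_tokens == 0:
        return 0.0
    return 1.0 - estimate_tokens(compressed) / original_tokens


original = "word " * 400    # 2,000 characters
compressed = "word " * 120  # 600 characters

print(estimate_tokens(original))                       # 500
print(estimate_tokens(compressed))                     # 150
print(f"{reduction_ratio(original, compressed):.0%}")  # 70%
```

A character-based estimate is only a budget guide. For exact counts, use the tokenizer that matches your model.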

What It Does NOT Do

llm-prompt-compress does not understand meaning. It applies text heuristics: sentence length, word frequency, paragraph structure, redundancy detection. It does not know which sentences are semantically important for your specific question. For query-aware compression, use a semantic compression approach (LLM-based summarization).

It does not compress structured content well. JSON, CSV, code, and tabular data often cannot be safely compressed with text heuristics because the structure is load-bearing. By default, code blocks and JSON-looking sections are preserved without compression.
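One way to keep structured content intact is to split the text into protected and compressible segments before any heuristic runs. The sketch below is an assumption about how such preservation could work, not the library's implementation. `split_protected` and `compress_prose_only` are hypothetical names, and only triple-backtick fences are detected.

```python
import re
from typing import Callable

FENCE = "`" * 3  # a triple-backtick code fence
# Opening fence, anything (non-greedy, across lines), closing fence.
FENCE_RE = re.compile(re.escape(FENCE) + r".*?" + re.escape(FENCE), re.DOTALL)


def split_protected(text: str) -> list[tuple[str, bool]]:
    """Split text into (segment, is_protected) pairs; fenced code is protected."""
    segments: list[tuple[str, bool]] = []
    pos = 0
    for match in FENCE_RE.finditer(text):
        if match.start() > pos:
            segments.append((text[pos:match.start()], False))
        segments.append((match.group(0), True))
        pos = match.end()
    if pos < len(text):
        segments.append((text[pos:], False))
    return segments


def compress_prose_only(text: str, compress_fn: Callable[[str], str]) -> str:
    """Apply compress_fn to prose segments; pass protected segments through."""
    return "".join(
        segment if protected else compress_fn(segment)
        for segment, protected in split_protected(text)
    )


doc = f"Intro text.\n{FENCE}\nx = 1\n{FENCE}\nOutro text."
# str.upper stands in for a real compressor so the effect is easy to see.
result = compress_prose_only(doc, str.upper)
print(result)
```

In the printed result the prose is upper-cased while `x = 1` inside the fence is unchanged. Detecting "JSON-looking" sections would need a separate check, for example attempting `json.loads` on candidate spans.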

It does not guarantee the original information survives compression. Heuristic compression removes content. If a short sentence built mostly from common words carries critical information, it may score low and be removed. The "conservative" strategy minimizes information loss at the cost of lower compression.

Inside the Library

The compression pipeline has four stages:

import re
from collections import Counter


class PromptCompress:
    def compress(self, text: str) -> str:
        # Stage 1: Remove boilerplate
        text = self._remove_boilerplate(text)

        # Stage 2: Deduplicate near-identical sentences
        text = self._dedup_sentences(text)

        # Stage 3: Score sentence importance
        sentences = self._split_sentences(text)
        scores = self._score_sentences(sentences)

        # Stage 4: Keep top-scoring sentences up to target
        return self._select_by_budget(sentences, scores)

    def _remove_boilerplate(self, text: str) -> str:
        # Remove common boilerplate patterns
        patterns = [
            r'^\s*(?:Click here|Learn more|Subscribe|Sign up).*$',
            r'^\s*(?:Copyright|All rights reserved|Terms of service).*$',
            r'^\s*(?:Navigation|Menu|Footer|Header)\s*$',
        ]
        for p in patterns:
            text = re.sub(p, '', text, flags=re.MULTILINE | re.IGNORECASE)
        return text

    def _score_sentences(self, sentences: list[str]) -> list[float]:
        # TF-IDF inspired scoring without external libraries
        all_words = [s.lower().split() for s in sentences]
        word_freq = Counter(w for words in all_words for w in words)
        total_words = sum(word_freq.values())

        scores = []
        for words in all_words:
            if not words:
                scores.append(0.0)
                continue

            # Frequency score: prefer sentences with less common words
            freq_score = sum(1 / (word_freq[w] / total_words + 0.01) for w in words) / len(words)

            # Length penalty: very short and very long sentences score lower
            length_score = 1.0 - abs(len(words) - 20) / 100

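            # Combine the two signals; the length factor is floored at 0.1
            # so very long sentences are penalized but never zeroed out.
            # (The position bonus for first/last sentences of a paragraph
            # is not shown in this excerpt.)
            scores.append(freq_score * max(0.1, length_score))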

        return scores

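The scoring heuristic: sentences with less common words score higher (they are more content-dense). Sentences near 20 words score best on length; very short sentences (fragments) and very long sentences (run-ons) are penalized in proportion to their distance from that length. First and last sentences of paragraphs receive a position bonus, although the excerpt above does not show that step.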
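The pipeline calls `_split_sentences`, `_dedup_sentences`, and `_select_by_budget` without showing them. Here is a minimal sketch of how such helpers could look, assuming a naive punctuation-based splitter, `difflib` similarity for near-duplicates, and a four-characters-per-token estimate. All three choices are assumptions for illustration, not the library's actual code.

```python
import re
from difflib import SequenceMatcher


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def dedup_sentences(sentences: list[str], threshold: float = 0.9) -> list[str]:
    """Drop a sentence when it is near-identical to one already kept."""
    kept: list[str] = []
    for sentence in sentences:
        normalized = sentence.lower()
        is_duplicate = any(
            SequenceMatcher(None, normalized, earlier.lower()).ratio() >= threshold
            for earlier in kept
        )
        if not is_duplicate:
            kept.append(sentence)
    return kept


def select_by_budget(
    sentences: list[str], scores: list[float], target_tokens: int
) -> str:
    """Greedily keep the highest-scoring sentences that fit, in original order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen: set[int] = set()
    used = 0
    for i in ranked:
        cost = max(1, len(sentences[i]) // 4)  # assumed ~4 characters per token
        if used + cost <= target_tokens:
            chosen.add(i)
            used += cost
    return " ".join(sentences[i] for i in sorted(chosen))


text = "The cache is cold. The cache is cold! Latency doubled after the deploy."
unique = dedup_sentences(split_sentences(text))
print(unique)
print(select_by_budget(unique, scores=[0.2, 0.9], target_tokens=10))
```

Pairwise similarity is quadratic in the number of sentences, so a production version would likely hash normalized sentences or use shingling instead. Emitting the chosen sentences in their original order keeps the output readable even when the ranking reorders them internally.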

When to Use It

Use it for RAG (retrieval-augmented generation) pipelines where retrieved chunks may contain more text than necessary. Compress each chunk before assembling the context window. Compression lets you fit more retrieved chunks into the same context budget without shrinking your chunk size.

Use it for user-uploaded document processing. Users upload PDFs, paste articles, submit long forms. Compressing the document before injecting it into the prompt can save significant cost on a high-volume system.

Use it as a pre-filter before semantic compression. For very long documents, apply heuristic compression first (cheap, fast, 60-75% reduction), then apply semantic compression on the smaller result (expensive, accurate). Two-stage compression is cheaper than one-stage semantic compression from scratch because the LLM call only sees the already-reduced text.

Skip it for short, curated prompts. If your system prompt is already tightly written, applying compression may remove important context. Compression is for untrusted, variable-length content (user input, retrieved documents), not your crafted instructions.
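To make the two-stage pre-filter case concrete, here is a small worked calculation. It assumes the cost of a semantic compression call scales with the number of input tokens sent to the model, and it uses a 65% heuristic reduction, a value within the range quoted above. `semantic_input_tokens` is a hypothetical helper, not part of the library.

```python
def semantic_input_tokens(document_tokens: int, heuristic_reduction: float = 0.0) -> int:
    """Tokens the semantic (LLM) stage must read after an optional heuristic pass."""
    return round(document_tokens * (1.0 - heuristic_reduction))


one_stage = semantic_input_tokens(35_000)
two_stage = semantic_input_tokens(35_000, heuristic_reduction=0.65)

print(one_stage)              # 35000
print(two_stage)              # 12250
print(one_stage - two_stage)  # 22750 fewer input tokens for the LLM stage
```

The saving applies only to the input side of the semantic call. The heuristic pass itself costs CPU time but no model tokens.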

Install

pip install git+https://github.com/MukundaKatta/llm-prompt-compress

# Or from PyPI
pip install llm-prompt-compress

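import anthropic
import structlog  # assumed: the keyword-argument logging below matches structlog's API

from llm_prompt_compress import PromptCompress

logger = structlog.get_logger()
anthropic_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment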

compress = PromptCompress(
    target_tokens=6000,
    strategy="balanced",
    preserve_code=True,    # Never compress code blocks
    preserve_json=True,    # Never compress JSON structures
)

def build_rag_context(chunks: list[str]) -> str:
    compressed_chunks = []

    for chunk in chunks:
        original_tokens = compress.estimate_tokens(chunk)
        if original_tokens > 500:  # Only compress large chunks
            compressed = compress.compress(chunk)
            ratio = compress.ratio(chunk, compressed)
            logger.debug("chunk_compressed", original=original_tokens, ratio=f"{ratio:.0%}")
            compressed_chunks.append(compressed)
        else:
            compressed_chunks.append(chunk)

    return "\n\n---\n\n".join(compressed_chunks)

def answer_with_rag(question: str, docs: list[str]) -> str:
    context = build_rag_context(docs)

    response = anthropic_client.messages.create(
        model="claude-sonnet-4-6",
        messages=[{
            "role": "user",
            "content": f"Documents:\n\n{context}\n\nQuestion: {question}",
        }],
        max_tokens=1024,
    )

    return response.content[0].text

Sibling Libraries

Library	What it solves
`llm-token-split`	Split documents into chunks before compression
`prompt-token-counter`	Count tokens to know when compression is needed
`agent-context-builder`	Section-based prompt assembly with per-section budgets
`agent-message-window`	Trim message list to fit context
`prompt-cache-warmer`	Cache the compressed prompt prefix for cheaper reruns

The context optimization stack: prompt-token-counter to measure, llm-token-split to chunk, llm-prompt-compress to reduce, agent-context-builder to assemble, agent-message-window to trim, prompt-cache-warmer to cache.

What's Next

Query-aware compression: accept the user's question alongside the document and weight sentences that contain terms from the question more highly. This is a lightweight semantic signal without requiring a full embedding model.
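A sketch of how this planned weighting might work, assuming simple term overlap between the question and each sentence. The feature is not in the library yet, so `content_terms`, `query_weighted_scores`, the `boost` parameter, and the tiny stop-word list are all hypothetical.

```python
import re

# Small illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "what", "how", "and", "for"}


def content_terms(text: str) -> set[str]:
    """Lowercased alphanumeric terms with stop words removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS}


def query_weighted_scores(
    sentences: list[str],
    base_scores: list[float],
    question: str,
    boost: float = 1.0,
) -> list[float]:
    """Scale each base score by 1 + boost * (fraction of question terms present)."""
    question_terms = content_terms(question)
    if not question_terms:
        return list(base_scores)
    weighted = []
    for sentence, score in zip(sentences, base_scores):
        overlap = len(question_terms & content_terms(sentence)) / len(question_terms)
        weighted.append(score * (1.0 + boost * overlap))
    return weighted


sentences = [
    "The refund window is 30 days from delivery.",
    "Our offices are closed on public holidays.",
]
scores = query_weighted_scores(sentences, [1.0, 1.0], "What is the refund window?")
print(scores)  # [2.0, 1.0]
```

Exact term matching misses synonyms and inflections ("refunds" versus "refund"), which is the gap an embedding model would close at higher cost.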

Extractive summary mode: instead of removing sentences, produce an extractive summary (the K most important sentences in their original order) as the compressed output. This produces a shorter but coherent version rather than a gapped text.

Benchmark mode: compress.benchmark(text, questions) that compresses the text and then evaluates how many of the questions can be correctly answered from the compressed version. Provides a quality metric for tuning the compression strategy.
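A cheap proxy for such a benchmark can be sketched without any model calls: pair each question with an expected answer string and check whether that string survives compression. This is an assumption about one possible design, not the planned `compress.benchmark` API. `benchmark_retention` and `keep_first_sentence` are hypothetical names, and a real benchmark would ask a model to answer from the compressed text.

```python
from typing import Callable


def benchmark_retention(
    text: str,
    qa_pairs: list[tuple[str, str]],
    compress_fn: Callable[[str], str],
) -> float:
    """Fraction of expected answers that still appear verbatim after compression."""
    if not qa_pairs:
        return 1.0
    compressed = compress_fn(text).lower()
    survived = sum(1 for _question, answer in qa_pairs if answer.lower() in compressed)
    return survived / len(qa_pairs)


def keep_first_sentence(text: str) -> str:
    """Stand-in compressor that keeps only the first sentence."""
    return text.split(". ")[0] + "."


text = "The invoice is due on 14 March. Payment is by bank transfer. Thanks for reading."
qa_pairs = [
    ("When is the invoice due?", "14 March"),
    ("How is payment made?", "bank transfer"),
]
print(benchmark_retention(text, qa_pairs, keep_first_sentence))  # 0.5
```

Verbatim matching undercounts retention when an answer is paraphrased and overcounts when the string survives without its context, so treat the number as a tuning signal, not a quality guarantee.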

Built as part of the agent-stack family: composable Python primitives for production LLM agents.