<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alex Alexapolskiy</title>
    <description>The latest articles on DEV Community by Alex Alexapolskiy (@metawake).</description>
    <link>https://dev.to/metawake</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3051998%2F4f45a957-857b-4f71-831d-dc6256f910b7.jpeg</url>
      <title>DEV Community: Alex Alexapolskiy</title>
      <link>https://dev.to/metawake</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/metawake"/>
    <language>en</language>
    <item>
      <title>How I Built a Prompt Compressor That Reduces LLM Token Costs Without Losing Meaning</title>
      <dc:creator>Alex Alexapolskiy</dc:creator>
      <pubDate>Tue, 15 Apr 2025 08:35:49 +0000</pubDate>
      <link>https://dev.to/metawake/how-i-built-a-prompt-compressor-that-reduces-llm-token-costs-without-losing-meaning-5gmg</link>
      <guid>https://dev.to/metawake/how-i-built-a-prompt-compressor-that-reduces-llm-token-costs-without-losing-meaning-5gmg</guid>
      <description>&lt;p&gt;Tools like LLMLingua (by Microsoft) use language models to compress prompts by learning which parts can be dropped while preserving meaning. It’s powerful — but also relies on another LLM to optimize prompts for the LLM.&lt;/p&gt;

&lt;p&gt;I wanted to try something different: a lightweight, rule-based semantic compressor that doesn't require training or GPUs — just smart heuristics, NLP tools like spaCy, and a deep respect for meaning.&lt;/p&gt;

&lt;h2&gt;The Challenge: Every Token Costs&lt;/h2&gt;

&lt;p&gt;In the world of Large Language Models (LLMs), every token comes with a price tag. For organizations running thousands of prompts daily, these costs add up quickly. But what if we could reduce these costs without sacrificing the quality of interactions?&lt;/p&gt;

&lt;h2&gt;Real Results: Beyond Theory&lt;/h2&gt;

&lt;p&gt;Our experimental Semantic Prompt Compressor has shown promising results in real-world testing. Analyzing 135 diverse prompts, we achieved:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;22.42% average compression ratio&lt;/li&gt;
&lt;li&gt;Reduction from 4,986 → 3,868 tokens&lt;/li&gt;
&lt;li&gt;1,118 tokens saved while maintaining meaning&lt;/li&gt;
&lt;li&gt;Over 95% preservation of named entities and technical terms&lt;/li&gt;
&lt;/ul&gt;
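&lt;p&gt;A quick sanity check shows these aggregate figures are internally consistent (plain Python, no dependencies; the numbers are the ones reported above):&lt;/p&gt;

```python
# Aggregate results reported above: 135 prompts, 4,986 tokens before,
# 3,868 tokens after compression.
tokens_before = 4986
tokens_after = 3868

tokens_saved = tokens_before - tokens_after              # 1118
compression_ratio = tokens_saved / tokens_before * 100

print(f"Tokens saved: {tokens_saved}")                   # 1118
print(f"Compression ratio: {compression_ratio:.2f}%")    # 22.42%
```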

&lt;h2&gt;Example 1&lt;/h2&gt;

&lt;p&gt;Original (33 tokens):&lt;br&gt;
&lt;em&gt;"I've been considering the role of technology in mental health treatment. How might virtual therapy and digital interventions evolve? I'm interested in both current applications and future possibilities."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Compressed (12 tokens):&lt;br&gt;
&lt;em&gt;"I've been considering role of technology in mental health treatment."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Compression ratio: 63.64%&lt;/p&gt;

&lt;h2&gt;Example 2&lt;/h2&gt;

&lt;p&gt;Original (29 tokens):&lt;br&gt;
&lt;em&gt;"All these apps keep asking for my location.&lt;br&gt;
What are they actually doing with this information?&lt;br&gt;
I'm curious about the balance between convenience and privacy."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Compressed (14 tokens):&lt;br&gt;
&lt;em&gt;"apps keep asking for my location. What are they doing with information."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Compression ratio: 51.72%&lt;/p&gt;

&lt;h2&gt;The Cost Impact&lt;/h2&gt;

&lt;p&gt;Let’s translate these results into real business scenarios.&lt;/p&gt;

&lt;h2&gt;Customer Support AI&lt;/h2&gt;

&lt;p&gt;(100,000 queries/day):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Avg. 200 tokens per query&lt;/li&gt;
&lt;li&gt;GPT-4 API cost: $0.03 / 1K tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without compression:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;20M tokens/day → $600/day → $18,000/month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With 22.42% compression:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;15.5M tokens/day → $465/day&lt;/li&gt;
&lt;li&gt;Monthly savings: $4,050&lt;/li&gt;
&lt;/ul&gt;
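&lt;p&gt;The arithmetic behind this scenario fits in a few lines. A sketch in plain Python (the daily figure rounds to $465, and the $4,050/month savings corresponds to that rounding):&lt;/p&gt;

```python
# Hypothetical cost model for the scenario above (constants are the
# article's assumptions, not measured values).
QUERIES_PER_DAY = 100_000
TOKENS_PER_QUERY = 200
PRICE_PER_1K_TOKENS = 0.03      # assumed GPT-4 API pricing
COMPRESSION_RATIO = 0.2242      # 22.42% of tokens removed

daily_tokens = QUERIES_PER_DAY * TOKENS_PER_QUERY        # 20M tokens/day
daily_cost = daily_tokens / 1000 * PRICE_PER_1K_TOKENS   # $600/day
compressed_cost = daily_cost * (1 - COMPRESSION_RATIO)   # ~$465.48/day
monthly_savings = (daily_cost - compressed_cost) * 30    # ~$4,035/month

print(f"Daily cost without compression: ${daily_cost:,.2f}")
print(f"Daily cost with compression:    ${compressed_cost:,.2f}")
print(f"Monthly savings (30 days):      ${monthly_savings:,.2f}")
```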

&lt;h2&gt;How It Works: A Three-Layer Approach&lt;/h2&gt;

&lt;h2&gt;Rules Layer&lt;/h2&gt;

&lt;p&gt;We implemented a configurable rule system instead of using a black-box ML model. For example:&lt;/p&gt;

&lt;p&gt;Replace “Could you explain” with “explain”&lt;/p&gt;

&lt;p&gt;Replace “Hello, I was wondering” with “I wonder”&lt;/p&gt;

&lt;p&gt;&lt;code&gt;rule_groups:&lt;br&gt;
  remove_fillers:&lt;br&gt;
    enabled: true&lt;br&gt;
    patterns:&lt;br&gt;
      - pattern: "Could you explain"&lt;br&gt;
        replacement: "explain"&lt;br&gt;
  remove_greetings:&lt;br&gt;
    enabled: true&lt;br&gt;
    patterns:&lt;br&gt;
      - pattern: "Hello, I was wondering"&lt;br&gt;
        replacement: "I wonder"&lt;/code&gt;&lt;/p&gt;
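&lt;p&gt;As a sketch of how such a rule table can be applied (the in-memory structure mirrors the YAML above, but the function and loading code are illustrative, not the project's actual API):&lt;/p&gt;

```python
import re

# Illustrative in-memory version of the rule groups above; in the real
# project these would be loaded from the YAML config file.
RULE_GROUPS = {
    "remove_fillers": {
        "enabled": True,
        "patterns": [{"pattern": "Could you explain", "replacement": "explain"}],
    },
    "remove_greetings": {
        "enabled": True,
        "patterns": [{"pattern": "Hello, I was wondering", "replacement": "I wonder"}],
    },
}

def apply_rules(text: str, rule_groups: dict) -> str:
    """Apply every enabled pattern/replacement pair, case-insensitively."""
    for group in rule_groups.values():
        if not group["enabled"]:
            continue
        for rule in group["patterns"]:
            text = re.sub(re.escape(rule["pattern"]), rule["replacement"],
                          text, flags=re.IGNORECASE)
    return text

print(apply_rules("Hello, I was wondering, could you explain tokenization?",
                  RULE_GROUPS))
# → "I wonder, explain tokenization?"
```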

&lt;h2&gt;spaCy NLP Layer&lt;/h2&gt;

&lt;p&gt;We leverage spaCy’s linguistic analysis for intelligent compression:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Named Entity Recognition to preserve key terms&lt;/li&gt;
&lt;li&gt;Dependency parsing for sentence structure&lt;/li&gt;
&lt;li&gt;POS tagging to remove non-essential parts&lt;/li&gt;
&lt;li&gt;Compound-word preservation for technical terms&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Entity Preservation Layer&lt;/h2&gt;

&lt;p&gt;We ensure critical information is not lost:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Technical terms (e.g., "5G", "TCP/IP")&lt;/li&gt;
&lt;li&gt;Named entities (companies, people, places)&lt;/li&gt;
&lt;li&gt;Numerical values and measurements&lt;/li&gt;
&lt;li&gt;Domain-specific vocabulary&lt;/li&gt;
&lt;/ul&gt;
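&lt;p&gt;In the project this layer builds on spaCy's NER; the sketch below approximates the idea with hand-written regexes so it runs without a model download (patterns and names are illustrative, and much cruder than real NER):&lt;/p&gt;

```python
import re

# Rough stand-ins for what the real layer gets from spaCy NER:
# numbers with units, acronyms, and capitalized names.
PROTECTED_PATTERNS = [
    r"\b\d+(?:\.\d+)?\s?(?:GB|MB|ms|G|%)\b",   # measurements, e.g. "5G"
    r"\b[A-Z]{2,}(?:/[A-Z]+)*\b",              # acronyms, e.g. "TCP/IP"
    r"\b[A-Z][a-z]+\b",                        # capitalized names
]

def protected_spans(text: str) -> set[str]:
    """Collect tokens that must survive compression untouched."""
    found = set()
    for pat in PROTECTED_PATTERNS:
        found.update(re.findall(pat, text))
    return found

spans = protected_spans("Microsoft ships 5G gear; TCP/IP stays intact.")
print(spans)  # contains 'Microsoft', '5G', 'TCP/IP'
```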

&lt;h2&gt;Real-World Applications&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Customer Support&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compress user queries while maintaining context&lt;/li&gt;
&lt;li&gt;Preserve product-specific language&lt;/li&gt;
&lt;li&gt;Reduce support costs, maintain quality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Content Moderation&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Efficiently process user reports&lt;/li&gt;
&lt;li&gt;Maintain critical context&lt;/li&gt;
&lt;li&gt;Cost-effective scaling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Technical Documentation&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compress API or doc queries&lt;/li&gt;
&lt;li&gt;Preserve code snippets and terms&lt;/li&gt;
&lt;li&gt;Cut costs without losing accuracy&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Beyond Simple Compression&lt;/h2&gt;

&lt;h2&gt;What makes our approach unique?&lt;/h2&gt;

&lt;p&gt;Intelligent Preservation — Maintains technical accuracy and key data&lt;/p&gt;

&lt;p&gt;Configurable Rules — Domain-adaptable, transparent, and editable&lt;/p&gt;

&lt;p&gt;Transparent Processing — Understandable and debuggable&lt;/p&gt;

&lt;h2&gt;Current Limitations&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Requires domain-specific tuning&lt;/li&gt;
&lt;li&gt;Conservative in technical contexts&lt;/li&gt;
&lt;li&gt;Manual rule editing still helpful&lt;/li&gt;
&lt;li&gt;Entity preservation may be overly cautious&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Future Development&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;ML-based adaptive compression&lt;/li&gt;
&lt;li&gt;Domain-specific profiles&lt;/li&gt;
&lt;li&gt;Real-time compression&lt;/li&gt;
&lt;li&gt;LLM platform integrations&lt;/li&gt;
&lt;li&gt;Custom vocabulary modules&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;The results from our testing show that intelligent semantic prompt compression is not only possible — it's practical.&lt;/p&gt;

&lt;p&gt;With a 22.42% average compression ratio and high semantic preservation, LLM-based systems can reduce API costs while maintaining clarity and intent.&lt;/p&gt;

&lt;p&gt;Whether you're building support bots, moderation tools, or technical assistants, prompt compression could be a key layer in your stack.&lt;/p&gt;

&lt;p&gt;Project on GitHub:&lt;br&gt;
&lt;a href="https://github.com/metawake/prompt_compressor"&gt;github.com/metawake/prompt_compressor&lt;/a&gt;&lt;br&gt;
(Open source, transparent, and built for experimentation.)&lt;/p&gt;

</description>
      <category>llm</category>
      <category>promptengineering</category>
      <category>nlp</category>
      <category>python</category>
    </item>
  </channel>
</rss>
