Ashu
How I Used Rust and Reinforcement Learning to Slash LLM Token Usage by 40%

Building AI agents that need to process massive amounts of code or text usually leads to one major bottleneck: Context Window Bloat.

When building complex RAG (Retrieval-Augmented Generation) applications, developers often resort to stuffing as much information into the context window as possible. This naive approach leads to massive token usage, slower response times, and degraded reasoning accuracy as the LLM gets "lost in the middle" of an overlong context.

I built Entroly, an open-source (MIT licensed) Context Engineering Engine, to solve exactly this problem. By using an information-theoretic approach powered by Reinforcement Learning, Entroly intelligently prunes and selects only the optimal fragments for any given prompt.

And because performance matters, I built the core engine in Rust.

Why Not Just Use Vector DBs?
Vector databases are incredible, but traditional vector similarity (Cosine, L2) only tells you if a fragment is semantically related to the prompt. It doesn't tell you if that fragment actually adds net-new information or if it's just repeating what another chunk already said.

If you feed an LLM five highly similar paragraphs about a function definition, you are wasting tokens over-explaining the same concept.
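To make the failure mode concrete, here is a small illustration with toy embedding vectors (the vectors and fragments are invented for the example, not from Entroly): a near-verbatim duplicate of a relevant chunk scores exactly as well under cosine similarity as a chunk that adds genuinely new information, so a pure top-k retriever cannot tell them apart.

```python
import math

def cosine(a, b):
    """Plain cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy embeddings: a query and three candidate fragments.
query      = [1.0, 0.2, 0.0]
frag_a     = [0.9, 0.3, 0.1]   # relevant fragment
frag_a_dup = [0.9, 0.3, 0.1]   # near-verbatim repeat of frag_a
frag_b     = [0.7, 0.0, 0.6]   # relevant AND carries a new dimension

# The duplicate scores identically to the original, so similarity
# alone cannot see that it adds zero net-new information.
print(cosine(query, frag_a))      # high
print(cosine(query, frag_a_dup))  # identical score: redundancy is invisible
print(cosine(query, frag_b))      # lower, despite adding new information
```

This is exactly the gap a novelty term is meant to close: score fragments against what is already in the context, not just against the query.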

The Information-Theoretic Approach (5D PRISM)
Instead of blindly relying on vector similarity, Entroly calculates the entropy (information value) of each fragment relative to the current context state. I implemented a 5D PRISM optimizer, which scores fragments on five dimensions:

Relevance: Cosine similarity to the query.
Novelty: Kullback-Leibler (KL) divergence to penalize redundancy.
Density: Information density vs. token bloat.
Coherence: Cross-attention similarity between chunk boundaries.
Reward: Learned historical utility of the fragment.
By maximizing this objective function, Entroly assembles a context window that packs the most diverse and relevant information into the smallest possible token footprint.
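The post doesn't publish Entroly's exact objective, so the following is a minimal sketch of the idea under stated assumptions: a greedy, submodular-style selection loop over a weighted sum of the PRISM terms under a token budget. The weights, the simplified novelty term (max-similarity against the selected set instead of true KL divergence), and the omission of the coherence term are all my simplifications, not Entroly's implementation.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb + 1e-9)

def prism_score(frag, query_vec, selected, weights):
    # Relevance: similarity to the query.
    relevance = cosine(frag["vec"], query_vec)
    # Novelty: 1 - max similarity to anything already selected
    # (a cheap stand-in for KL divergence against the context state).
    novelty = 1.0 - max((cosine(frag["vec"], s["vec"]) for s in selected), default=0.0)
    # Density: information per token (relevance amortized over length).
    density = relevance / max(frag["tokens"], 1)
    # Reward: learned historical utility, neutral by default.
    reward = frag.get("reward", 0.5)
    w_rel, w_nov, w_den, w_rew = weights
    return w_rel * relevance + w_nov * novelty + w_den * density + w_rew * reward

def greedy_select(fragments, query_vec, budget, weights=(1.0, 2.0, 0.5, 0.3)):
    """Greedily pick the highest-scoring fragment until the token budget is hit."""
    selected, used = [], 0
    pool = list(fragments)
    while pool:
        best_i = max(range(len(pool)),
                     key=lambda i: prism_score(pool[i], query_vec, selected, weights))
        best = pool.pop(best_i)
        if used + best["tokens"] > budget:
            break  # sketch: a real packer would keep scanning smaller fragments
        selected.append(best)
        used += best["tokens"]
    return selected

# A duplicate fragment loses almost all of its score once its twin is selected,
# so the novel-but-less-similar fragment wins the second slot.
frags = [
    {"vec": [0.9, 0.3, 0.1], "tokens": 50},
    {"vec": [0.9, 0.3, 0.1], "tokens": 50},   # duplicate of the first
    {"vec": [0.7, 0.0, 0.6], "tokens": 60},   # novel
]
picked = greedy_select(frags, query_vec=[1.0, 0.2, 0.0], budget=120)
```

Note how the marginal-gain structure does the work: because novelty is recomputed against the already-selected set each round, the score of a redundant fragment collapses after its twin is chosen.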

Why Rust?
When you are inserting a context-optimization layer into every single API call to your LLM, that layer cannot be slow.

Running complex submodular optimization and KL divergence calculations in Python would add hundreds of milliseconds of latency to the pipeline. By writing the core engine in Rust and exposing it via mature PyO3 bindings, Entroly keeps the average fragment-optimization overhead under 10 ms.

The Python interface remains completely pythonic and drops effortlessly into existing LangChain or custom RAG pipelines:

```python
from entroly import ContextOptimizer

optimizer = ContextOptimizer(api_key="...", target_tokens=4000)

# The engine prunes these down to the highest-information chunks
optimized_context = optimizer.optimize(
    query="How does the routing system work?",
    fragments=raw_vector_search_results,
)
```
Self-Learning with Reinforcement Learning
What really sets Entroly apart is that it learns from your specific LLM and user behavior. Entroly natively supports feeding LLM feedback (like accuracy or task success) back into the optimizer as reward signals. Over time, the internal Multi-Armed Bandit begins to prioritize the types of fragments that actually lead to successful outcomes.
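The post doesn't show the feedback API, so here is a minimal sketch of the underlying mechanism only: an epsilon-greedy multi-armed bandit that keeps a running mean reward per fragment category. The class, arm names, and update rule are my illustrative assumptions, not Entroly's actual interface.

```python
import random

class FragmentBandit:
    """Epsilon-greedy bandit over fragment categories (e.g. docstrings vs. tests)."""

    def __init__(self, arms, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = {a: 0 for a in arms}
        self.values = {a: 0.0 for a in arms}   # running mean reward per arm

    def choose(self):
        if random.random() < self.epsilon:
            return random.choice(list(self.counts))   # explore
        return max(self.values, key=self.values.get)  # exploit

    def update(self, arm, reward):
        # Incremental mean: V <- V + (r - V) / n
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

# After each LLM call, feed back whether the task succeeded (1.0) or failed (0.0);
# over many updates, high-utility fragment types accumulate higher value estimates.
bandit = FragmentBandit(["docstring", "test_code", "impl_code"])
bandit.update("docstring", 1.0)
bandit.update("test_code", 0.0)
```

The incremental-mean update is the standard sample-average bandit rule; it needs O(1) memory per arm, which is why a bandit layer can sit in the hot path without adding meaningful latency.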

The Result
In our benchmarks across large codebases, Entroly reduced token usage by roughly 40% with no measurable drop in reasoning accuracy, resulting in significant cost savings and lower latency for production agents.

Check it out on GitHub
I'm incredibly proud to fully open-source the project. If you're building AI agents, fighting huge token costs, or just interested in Rust + AI architecture, I'd love for you to check it out or contribute!

GitHub: https://github.com/juyterman1000/entroly

Let me know what you think in the comments!
