DEV Community: Ashu

What Is Context Rot? The Hidden Reason Your AI Coding Assistant Gets Worse Over Time

Ashu — Thu, 09 Jul 2026 13:47:36 +0000

Your AI assistant gives sharp, accurate answers at the start of a session. An hour later, it's hallucinating function names and confusing files.

This isn't a model problem. It's context rot.

What Is Context Rot?

Context rot is the gradual degradation of AI answer quality during a long session. The context window fills with irrelevant material from earlier turns, and the model can't pay attention to what actually matters.

Two related problems:

Quality rot - accuracy drops as irrelevant content crowds out critical evidence
Economic rot - token costs compound because every request re-sends the bloated history

Why It Happens

Transformers distribute attention across ALL input tokens. As a session grows:

Content	Relevance	Token Cost
Early conversation turns	Usually stale	High
Imported dependency files	Partially relevant	Very high
Shell command output logs	Usually irrelevant	Medium-High
Duplicate file reads	Redundant	High
The actual bug/critical file	Critical	Low - crowded out

The critical file gets pushed to the middle of the context window where attention is weakest. Research backs this up - the "Lost in the Middle" paper (Liu et al., 2023) showed LLMs perform significantly worse at retrieving information placed in the middle of long contexts.

The 4 Stages

Stage 1 - Fresh Session: Context mostly empty. Sharp, specific answers. Low cost.

Stage 2 - Accumulation: 30-40% context used. Model occasionally confuses variable names. You think it's a bad prompt.

Stage 3 - Congestion: 60-80% used. Model references wrong functions, suggests already-tried fixes. You start restarting sessions.

Stage 4 - Overflow: Context full. Tool truncates history or forces new session. Answers become generic. Hallucinations spike.

The Hidden Cost

Fresh start context:  ~2,000 tokens    = $0.006/request
Session turn 50:      ~80,000 tokens   = $0.24/request  
Session turn 100:     ~180,000 tokens  = $0.54/request

Developer doing 60+ requests/day at stage 3:
Monthly cost: $350-800

With properly selected context (2,000 relevant tokens):
Monthly cost: $9-18

You're paying $350-800/month to send irrelevant code to Claude. The model isn't using it. And context rot is making it give worse answers.

The Fix: Information-Theoretic Selection

The right fix isn't "start a new session" (you lose all progress). It's mathematically ranking every fragment by relevance to the current query:

Bad:  All files -> LLM call (irrelevant files included)
Bad:  All files -> Compress -> LLM call (wrong starting set)  
Good: All files -> Rank by query relevance -> Select under budget -> LLM call

The ranking uses three signals:

BM25 relevance - lexical match between query and each fragment
Shannon entropy scoring - high-entropy fragments (errors, anomalies) selected over boilerplate
Dependency graph traversal - if the buggy function is selected, its callees are included

Budget allocation uses knapsack dynamic programming - the theoretically optimal solution.

Results

I built this into an open-source tool called Entroly. On real workloads:

Budget	Token Reduction	Accuracy
8K tokens	99.1%	100% (NeedleInAHaystack)
32K tokens	96.7%	103% (LongBench HotpotQA)
Average	87.0%	Maintained or improved

103% accuracy on HotpotQA means the model actually gives better answers with selected context than full context - because it's no longer distracted.

The tool also includes a local hallucination detector (WITNESS) that achieves 0.844 AUROC on HaluEval-QA for $0 in ~3ms - statistically ties GPT-4o-mini as a judge.

pip install entroly
entroly simulate   # see savings estimate, no API key needed

Works with Claude Code, Cursor, Aider, Codex, and 35+ others. Apache-2.0. Local-first.

Has anyone else noticed context rot in their sessions? Curious how you deal with it.

How I Cut My AI Coding Agent's Token Bill by 85% - Without Losing Answer Quality

Ashu — Fri, 05 Jun 2026 08:20:29 +0000

If you're using AI coding agents like Claude, Cursor, Codex, or Aider on a real codebase, you've probably noticed: your bill is mostly input tokens.

The agent dumps your entire repo context into every request. You pay for all of it. And the model still misses files it can't see.

I built Entroly to fix both problems - locally, on your machine.

The Two Wastes

AI coding agents waste context in two different ways:

They send too much - large repos mean 100K+ tokens per request
They send the wrong stuff - irrelevant files crowd out the ones that matter

Compression alone only fixes the first half. You can shrink 100K tokens to 30K, but if those 30K are the wrong files, the answer is still bad.

What Entroly Does

Entroly is a local verified-context layer that sits between your agent and the LLM provider. It selects the repo evidence first, compresses noisy context second, and can audit the model's answer against supplied evidence.

Key mechanisms:

Context selection - ranks your whole repo using BM25 + entropy scoring + dependency graph analysis, then packs the most answer-relevant files under a token budget using knapsack optimization
Reversible compression - everything compressed is fully recoverable via CCR handles. Nothing is lost.
Cache alignment - keeps the injected prefix byte-stable so provider cache discounts apply (Anthropic: up to 90% off, OpenAI: 50% off)
WITNESS hallucination guard - checks the model's answer against the evidence it was given. $0, ~3ms, no extra API call.

Getting Started (60 seconds)

pip install entroly
cd /your/repo
entroly go

Or test the package first:

entroly verify-claims

No API key required.

Use It However You Work

Wrap mode: entroly wrap cursor / entroly wrap aider / entroly wrap claude
Proxy mode: entroly proxy -> point your tool at localhost:9377
MCP server: entroly serve -> works with any MCP-compatible client
Library: from entroly import compress_context

Works with Claude, Cursor, Codex, Aider, Continue, Windsurf, Cline, and 30+ more tools.

Real Numbers

Benchmark	Result
Token reduction (large repos)	70-95% fewer input tokens
Accuracy (NeedleInAHaystack)	100% retained
Hallucination detection (HaluEval-QA)	0.844 AUROC
WITNESS latency	~3ms, $0

Small prompts and tiny repos may show little or no savings. Always measure on your own repository.

Why I Built This

I was spending $200+/month on AI coding tools, and most of that cost was the agent re-reading the same files over and over. Entroly combines query-conditioned context selection, recoverable compression, and WITNESS proof certificates in one local layer.

Apache-2.0, local-first, no outbound analytics by default.

GitHub: github.com/juyterman1000/entroly

Daemon that "Dreams" about your codebase so your AI agents stop hallucinating and save tokens

Ashu — Sun, 12 Apr 2026 19:50:14 +0000

The Problem If you use Claude Code, Cursor, or GitHub Copilot on a large codebase, you've probably noticed something annoying: they hallucinate.

When you ask an agent to fix a bug in a large monorepo, it blindly stuffs the context window with as many files as it can fit (usually up to 128k tokens) and crops the rest off. This means 80% of what the AI is looking at is pure noise. It gets confused, invents APIs that don't exist, and burns through your API budget.

I got tired of this, so I built Entroly.

What is Entroly?
It’s a local proxy that acts as an Epistemic Firewall for your AI agents.

When you close your laptop for the night, Entroly's local background daemon wakes up. It crawls your repository, structurally induces the architecture, and pre-fetches answers to likely problems.

When you wake up and open Cursor the next morning, the agent responds in 0.1 seconds because Entroly already "dreamt" about your codebase all night, cached the symbolic graph, and optimized your context window.

How the Magic Works:
Entroly sits locally on localhost:9377 and intercepts the traffic between your editor and the LLM.

Instead of passing raw text, it uses a high-performance Rust backend to do the heavy lifting:

The PRISM Optimizer: It tracks a 4x4 covariance matrix over 4 scoring dimensions (recency, frequency, semantic SimHash, and Shannon entropy). It mathematically filters out context noise before the LLM even sees it.
0/1 Knapsack Context Selection: It uses Dynamic Programming to pack the absolute most critical, high-signal information into the absolute smallest token footprint.
The Live Dashboard: Entroly comes with a live localhost intelligence dashboard so you can actually watch it shredding tokens and tracking your cost-savings in real-time.
The Result:
Because the LLM is only fed mathematically perfectly-dense code snippets, hallucinations drop to near zero.

As a bonus, because you are sending thousands of fewer tokens on every request, your API costs drop by up to 90%, and the LLM responds significantly faster.

How to use it today
It's completely open-source. You don't have to change your coding habits or learn a new UI.

Install it via pip:
bash
pip install entroly
entroly start
Point your AI tool's API base URL to http://localhost:9377/v1.
Open http://localhost:9378 in your browser to watch the live dashboard.
I’d love for you to try breaking it on massive codebases. Let me know what you think in the comments!
GitHub Repo: https://github.com/juyterman1000/entroly/

How I Used Rust and Reinforcement Learning to Slash LLM Token Usage by 40%

Ashu — Wed, 01 Apr 2026 06:34:32 +0000

Building AI agents that need to process massive amounts of code or text usually leads to one major bottleneck: Context Window Bloat.

When building complex RAG (Retrieval-Augmented Generation) applications, developers often resort to stuffing as much information into the context window as possible. This naive approach leads to massive token usage, slower response times, and LLMs getting "lost in the middle" and degrading in reasoning accuracy.

I built Entroly, an open-source (MIT licensed) Context Engineering Engine, to solve exactly this problem. By using an information-theoretic approach powered by Reinforcement Learning, Entroly intelligently prunes and selects only the optimal fragments for any given prompt.

And because performance matters, I built the core engine in Rust.

Why Not Just Use Vector DBs?
Vector databases are incredible, but traditional vector similarity (Cosine, L2) only tells you if a fragment is semantically related to the prompt. It doesn't tell you if that fragment actually adds net-new information or if it's just repeating what another chunk already said.

If you feed an LLM five highly similar paragraphs about a function definition, you are wasting tokens over-explaining the same concept.

The Information-Theoretic Approach (5D PRISM)
Instead of blindly relying on vector similarity, Entroly calculates the entropy (information value) of each fragment compared to the current context state. I implemented the 5D PRISM optimizer which scores fragments based on:

Relevance: Cosine similarity to the query.
Novelty: Kullback-Leibler (KL) divergence to penalize redundancy.
Density: Information density vs. token bloat.
Coherence: Cross-attention similarity between chunk boundaries.
Reward: Learned historical utility of the fragment.
By maximizing this objective function, Entroly assembles a context window that packs the most diverse and relevant information into the smallest possible token footprint.

Why Rust?
When you are inserting a context-optimization layer into every single API call to your LLM, that layer cannot be slow.

Running complex submodular optimization and KL divergence calculations in Python would add hundreds of milliseconds to the pipeline latency. By writing the core engine in Rust and exposing it via mature PyO3 bindings, Entroly achieves <10ms overhead on average fragment optimization.

The Python interface remains completely pythonic and drops effortlessly into existing LangChain or custom RAG pipelines:

python
from entroly import ContextOptimizer
optimizer = ContextOptimizer(api_key="...", target_tokens=4000)

The engine prunes these down to the highest-information chunks

optimized_context = optimizer.optimize(
query="How does the routing system work?",
fragments=raw_vector_search_results
)
Self-Learning with Reinforcement Learning
What really sets Entroly apart is that it learns from your specific LLM and user behavior. Entroly natively supports feeding LLM feedback (like accuracy or task success) back into the optimizer as reward signals. Over time, the internal Multi-Armed Bandit begins to prioritize the types of fragments that actually lead to successful outcomes.

The Result
In our benchmarks across large codebases, utilizing Entroly reduces token usage by approximately ~40% without a measurable drop in reasoning accuracy, resulting in massive cost savings and latency reduction for production agents.

Check it out on GitHub
I'm incredibly proud to completely open-source the project. If you are building AI agents, dealing with huge token costs, or just interested in Rust + AI architecture, I'd love for you to check it out or contribute!

GitHub: https://github.com/juyterman1000/entroly

Let me know what you think in the comments!

How We Cut Our AI API Bill by 78% (And Let Cursor See Our Entire Codebase)

Ashu — Sat, 28 Mar 2026 22:14:45 +0000

The Problem Nobody Talks About

When you ask Cursor to "fix the login bug in my app," here's what actually happens:

Your query gets embedded into a vector
The embedding is compared to every file in your codebase (cosine similarity)
The top 5-10 most similar files are stuffed into the context window
Everything else is invisible

Your AI has no idea about your database schema, your configuration, your test patterns, your middleware. It's working blind on 95% of your codebase.

The Information-Theoretic Solution

We built Entroly — a context engineering engine that approaches this as an optimization problem, not a search problem.

Instead of "find the most similar files," we ask: "What's the mathematically optimal set of fragments to include in the context, given a token budget?"

Step 1: Score Every Fragment

Every piece of code gets scored by Shannon entropy — measuring information density:

H(X) = -Σ p(xᵢ) · log₂(p(xᵢ))

High-entropy code (complex logic, unique algorithms) scores high. Low-entropy code (boilerplate, imports, comments) scores low.

We also measure:

Recency: was this file recently modified?
Frequency: is this file frequently accessed?
Semantic relevance: how related to the current query?

Step 2: Build the Dependency Graph

Code is not independent. auth.py depends on auth_config.py. Your API routes call functions defined in models.py.

Entroly automatically extracts:

Import relationships
Function call chains
Type references
Module dependencies

When a fragment is selected, its dependencies get a relevance boost. This is the graph-constrained knapsack — NP-hard in general, but tractable for typical code graphs.

Step 3: Solve the Optimization Problem

This is where it gets mathematically interesting.

We use KKT bisection to find the exact Lagrange multiplier for the token-budget constraint:

f(th) = Σᵢ σ((sᵢ − th) / τ) · tokensᵢ − B = 0

30 steps of bisection give us th* — the exact dual variable. Then we greedily fill the hard budget.

The beautiful part: the same σ(·/τ) appears in the REINFORCE backward pass. Zero train/test mismatch.

Step 4: Compress at Three Levels

Not every file needs full source code:

L1 (5% budget): Skeleton map — auth.py → AuthService, login(), verify_token()
L2 (25% budget): Expanded signatures for dependency-connected files
L3 (70% budget): Full source code for the most relevant fragments

Your AI sees ALL 500 files. The important ones in detail. The rest in summary.

Step 5: Learn From Outcomes

After the AI generates a response, Entroly scores how well the context worked:

Counterfactual Shapley credit: How much did each fragment contribute?
Spectral natural gradient update: Adjust the 4D weight vector using Jacobi eigendecomposition of the gradient covariance
TD(λ) eligibility traces: Credit cascades across a 3-request window

Over time, your context selection gets better without any manual tuning.

The Numbers

78% fewer tokens per request
<10ms overhead (Rust engine)
304 unit tests in Rust, 100+ in Python
24 Rust modules, ~850KB of optimized code
Works with any OpenAI-compatible API

Try It

pip install entroly
entroly go

GitHub: github.com/juyterman1000/entroly

MIT licensed. PRs welcome.

10M agents. Zero API cost. Pure Rust swarm intelligence. Most AI frameworks today are slow wrappers around LLMs. Ebbforge solves 8 fundamental benchmarks that traditional architectures fail using SIMD physics and TD-RL, all math

Ashu — Sat, 28 Feb 2026 05:20:05 +0000

Why Your AI Agent Framework Is Basically a Hashmap (And How I Fixed It With Rust Swarm Math)

Ashu ・ Feb 28

#rust #ai #machinelearning #opensource

Why Your AI Agent Framework Is Basically a Hashmap (And How I Fixed It With Rust Swarm Math)

Ashu — Sat, 28 Feb 2026 05:08:11 +0000

Most AI agent frameworks today — the ones you’ve seen all over your feed — have a fundamental problem. If their results are within 10% of a hashmap, all they've really built is a slow, expensive wrapper around a hashmap.

They rely on sequential LLM calls, massive API bills, and brittle logic that falls apart the moment the environment gets messy.

I decided to fix that.

I built Ebbforge: A high-performance swarm intelligence engine written in Rust that runs 10 million agents locally with zero API cost.

Here are the 8 benchmarks that traditional architectures fail, but Ebbforge passes.

The Demo: 1,000 Agents, 60 FPS, Zero LLM Calls

We aren't just moving dots on a screen. Every agent in this swarm is a self-healing, learning unit powered by Temporal Difference Reinforcement Learning (TD-RL) and biologically-inspired memory decay.

The 8 Unsolved Problems

We ran Ebbforge against standard LLM-centric architectures on 8 tests that define real intelligence.

1. The Intelligence vs. Hashmap Challenge

The Goal: Catch an attacker who adds "padding" to a sequence to evade detection.

Standard RAG/LLM: Misses the pattern (Critical False Negative).
Ebbforge: Uses Longest Common Subsequence (LCS) math to recognize the structure of danger, even with noise. Blocked

2. The Groundhog Day Test

The Goal: Learn from a single failure and never repeat it.

Most Agents: Loop and fail 9 times in a row.
Ebbforge: One failure creates a persistent safety pattern. 9/9 subsequent attempts blocked

3. Cascade Failure Recovery

The Goal: Kill 30% of agents mid-flight and see if the swarm survives.

Standard Systems: Corrupt their state or crash.
Ebbforge: Survived 300 concurrent agent hard-kills and self-healed. 70% completion rate sustained

4. Organic Caste Emergence

The Goal: Can behavioral specialists emerge without hardcoded rules?

Standard Systems: Require "Specialized Prompting."
Ebbforge: Start with identical agents. After 500 ticks, they naturally split into "Brokers," "Hoarders," and "Neutrals" based on physics and reward pressure alone.

(See the other 4 benchmarks on the repo!)

Why Rust?

To handle 10 million agents, you can't be waiting for Python's Global Interpreter Lock (GIL) or $0.01-per-token API calls.

Ebbforge uses:

AVX2 SIMD for physics.
Rayon for grid-partitioned parallel processing.
Zero-Copy Memory for agent communication.

At a Glance (The TL;DR)

Challenge	Traditional Architectures	Ebbforge
Survive partial system failure	Failure: Corrupt state	Success: Self-heals
Learn from 1 failure	Failure: Repeats mistake	Success: 9/9 blocked
Traumatic memory retention	Failure: Equal decay	Success: 70,000x ratio
10M agent coordination	Failure: O(N) flood	Success: Spatial wavefront

Try It Yourself

The project is now live on GitHub as a pre-compiled binary. You can run the glassmorphism demo locally on any Linux x86_64 machine.

GitHub: juyterman1000/ebbforge-swarm-intelligence

I'm looking for feedback from the Rust and AI research community. If you've ever felt that agent frameworks are too slow or too "faked," Ebbforge is for you.

P.S. We just launched on Hacker News! Check out the discussion there too.