Ashu

How We Cut Our AI API Bill by 78% (And Let Cursor See Our Entire Codebase)

The Problem Nobody Talks About

When you ask Cursor to "fix the login bug in my app," here's what actually happens:

  1. Your query gets embedded into a vector
  2. The embedding is compared to every file in your codebase (cosine similarity)
  3. The top 5-10 most similar files are stuffed into the context window
  4. Everything else is invisible

Your AI has no idea about your database schema, your configuration, your test patterns, your middleware. It's working blind on 95% of your codebase.
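The retrieval loop above can be sketched in a few lines of pure Python (illustrative only; this is not Cursor's actual code, and real systems use a vector index rather than a linear scan):

```python
import math

def top_k_files(query_emb, file_embs, k=10):
    """Rank file embeddings by cosine similarity to the query embedding."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    # Score every file, then keep only the top-k indices.
    sims = [(cos(query_emb, emb), i) for i, emb in enumerate(file_embs)]
    return [i for _, i in sorted(sims, reverse=True)[:k]]
```

Everything outside that top-k cut simply never reaches the model.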

The Information-Theoretic Solution

We built Entroly — a context engineering engine that approaches this as an optimization problem, not a search problem.

Instead of "find the most similar files," we ask: "What's the mathematically optimal set of fragments to include in the context, given a token budget?"

Step 1: Score Every Fragment

Every piece of code gets scored by Shannon entropy — measuring information density:

H(X) = -Σ p(xᵢ) · log(p(xᵢ))

High-entropy code (complex logic, unique algorithms) scores high. Low-entropy code (boilerplate, imports, comments) scores low.
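A minimal entropy scorer over a fragment's character distribution looks like this (a sketch; Entroly's actual scorer may tokenize differently, and the base-2 log, which measures bits, is my choice here):

```python
import math
from collections import Counter

def shannon_entropy(text: str) -> float:
    """H(X) = -sum p(x) * log2 p(x) over the character distribution."""
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

Repetitive boilerplate collapses toward 0 bits per character, while dense, varied logic scores high.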

We also measure:

  • Recency: was this file recently modified?
  • Frequency: is this file frequently accessed?
  • Semantic relevance: how related to the current query?

Step 2: Build the Dependency Graph

Code is not independent. auth.py depends on auth_config.py. Your API routes call functions defined in models.py.

Entroly automatically extracts:

  • Import relationships
  • Function call chains
  • Type references
  • Module dependencies

When a fragment is selected, its dependencies get a relevance boost. This turns selection into a graph-constrained knapsack problem: NP-hard in general, but tractable for the dependency graphs typical codebases produce.
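The boost step can be sketched as follows (a hypothetical helper with an illustrative boost value; Entroly's weighting is learned, not fixed):

```python
def boost_dependencies(scores, deps, selected, boost=0.2):
    """Propagate a relevance boost from selected fragments to their dependencies.

    scores:   {fragment: relevance score}
    deps:     {fragment: list of fragments it depends on}
    selected: fragments already chosen for the context
    """
    boosted = dict(scores)
    for frag in selected:
        for dep in deps.get(frag, []):
            # A dependency of a selected fragment becomes more relevant itself.
            boosted[dep] = boosted.get(dep, 0.0) + boost
    return boosted
```

So selecting auth.py pulls auth_config.py up the ranking even if the query never mentions it.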

Step 3: Solve the Optimization Problem

This is where it gets mathematically interesting.

We use KKT bisection to find the exact Lagrange multiplier for the token-budget constraint:

f(θ) = Σᵢ σ((sᵢ − θ) / τ) · tokensᵢ − B = 0

30 steps of bisection give us θ*, the exact dual variable. Then we greedily fill the hard budget.

The beautiful part: the same σ(·/τ) appears in the REINFORCE backward pass. Zero train/test mismatch.
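The bisection itself is simple, because the soft-selection mass is monotonically decreasing in the threshold. A Python sketch (the production engine is Rust; parameter values here are illustrative):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def solve_threshold(scores, tokens, budget, tau=0.1, steps=30):
    """Bisect for theta with sum_i sigmoid((s_i - theta)/tau) * tokens_i = budget."""
    lo, hi = min(scores) - 1.0, max(scores) + 1.0
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        mass = sum(sigmoid((s - mid) / tau) * t
                   for s, t in zip(scores, tokens))
        if mass > budget:
            lo = mid  # too many soft-selected tokens: raise the threshold
        else:
            hi = mid  # under budget: lower the threshold
    return 0.5 * (lo + hi)
```

Monotonicity guarantees convergence, and 30 halvings shrink the bracket by a factor of about 10⁹.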

Step 4: Compress at Three Levels

Not every file needs full source code:

  • L1 (5% budget): Skeleton map — auth.py → AuthService, login(), verify_token()
  • L2 (25% budget): Expanded signatures for dependency-connected files
  • L3 (70% budget): Full source code for the most relevant fragments

Your AI sees ALL 500 files. The important ones in detail. The rest in summary.
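An L1 skeleton map for Python files can be extracted with the standard library alone (a sketch of the idea; Entroly parses multiple languages, presumably not via Python's ast):

```python
import ast

def skeleton(source: str) -> list[str]:
    """L1 skeleton: top-level classes with their methods, plus top-level functions."""
    tree = ast.parse(source)
    names = []
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            methods = [n.name + "()" for n in node.body
                       if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
            names.append(node.name + ": " + ", ".join(methods))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            names.append(node.name + "()")
    return names
```

A few dozen characters per file is enough for the model to know what exists and where to ask for more.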

Step 5: Learn From Outcomes

After the AI generates a response, Entroly scores how well the context worked:

  1. Counterfactual Shapley credit: How much did each fragment contribute?
  2. Spectral natural gradient update: Adjust the 4D weight vector using Jacobi eigendecomposition of the gradient covariance
  3. TD(λ) eligibility traces: Credit cascades across a 3-request window

Over time, your context selection gets better without any manual tuning.
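A heavily simplified sketch of the eligibility-trace idea, with scalar weights and made-up hyperparameters (the real update uses Shapley credit and a spectral natural gradient, which this does not reproduce):

```python
def td_lambda_update(weights, grads_history, reward, lr=0.01, lam=0.8):
    """Give recent requests exponentially decaying credit for the current outcome.

    weights:       the 4 scoring weights (entropy, recency, frequency, relevance)
    grads_history: per-request feature gradients, oldest first (e.g. 3-request window)
    """
    trace = [0.0] * len(weights)
    for grads in grads_history:
        # Decay the old trace and fold in this request's gradients.
        trace = [lam * e + g for e, g in zip(trace, grads)]
    # Move each weight in proportion to its accumulated eligibility.
    return [w + lr * reward * e for w, e in zip(weights, trace)]
```

The decay λ is what lets credit from a good answer cascade back to the context choices made a few requests earlier.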

The Numbers

  • 78% fewer tokens per request
  • <10ms overhead (Rust engine)
  • 304 unit tests in Rust, 100+ in Python
  • 24 Rust modules, ~850KB of optimized code
  • Works with any OpenAI-compatible API

Try It

pip install entroly
entroly go

GitHub: github.com/juyterman1000/entroly

MIT licensed. PRs welcome.
