Ashu

How We Cut Our AI API Bill by 78% (And Let Cursor See Our Entire Codebase)

The Problem Nobody Talks About

When you ask Cursor to "fix the login bug in my app," here's what actually happens:

  1. Your query gets embedded into a vector
  2. The embedding is compared to every file in your codebase (cosine similarity)
  3. The top 5-10 most similar files are stuffed into the context window
  4. Everything else is invisible

Your AI has no idea about your database schema, your configuration, your test patterns, your middleware. It's working blind on 95% of your codebase.
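The retrieval loop above can be sketched in a few lines of pure Python (illustrative only; this is not Cursor's actual code, and real systems use a vector index rather than a linear scan):

```python
import math

def top_k_files(query_emb, file_embs, k=10):
    """Rank file embeddings by cosine similarity to the query embedding."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    # Score every file, then keep only the top-k indices.
    sims = [(cos(query_emb, emb), i) for i, emb in enumerate(file_embs)]
    return [i for _, i in sorted(sims, reverse=True)[:k]]
```

Everything outside that top-k cut simply never reaches the model.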

The Information-Theoretic Solution

We built Entroly — a context engineering engine that approaches this as an optimization problem, not a search problem.

Instead of "find the most similar files," we ask: "What's the mathematically optimal set of fragments to include in the context, given a token budget?"

Step 1: Score Every Fragment

Every piece of code gets scored by Shannon entropy — measuring information density:

H(X) = -Σ p(xᵢ) · log(p(xᵢ))

High-entropy code (complex logic, unique algorithms) scores high. Low-entropy code (boilerplate, imports, comments) scores low.
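A minimal entropy scorer over a fragment's character distribution looks like this (a sketch; Entroly's actual scorer may tokenize differently, and the base-2 log, which measures bits, is my choice here):

```python
import math
from collections import Counter

def shannon_entropy(text: str) -> float:
    """H(X) = -sum p(x) * log2 p(x) over the character distribution."""
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

Repetitive boilerplate collapses toward 0 bits per character, while dense, varied logic scores high.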

We also measure:

  • Recency: was this file recently modified?
  • Frequency: is this file frequently accessed?
  • Semantic relevance: how related to the current query?

Step 2: Build the Dependency Graph

Code is not independent. auth.py depends on auth_config.py. Your API routes call functions defined in models.py.

Entroly automatically extracts:

  • Import relationships
  • Function call chains
  • Type references
  • Module dependencies

When a fragment is selected, its dependencies get a relevance boost. This turns selection into a graph-constrained knapsack problem: NP-hard in general, but tractable for the dependency graphs typical codebases produce.
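The boost step can be sketched as follows (a hypothetical helper with an illustrative boost value; Entroly's weighting is learned, not fixed):

```python
def boost_dependencies(scores, deps, selected, boost=0.2):
    """Propagate a relevance boost from selected fragments to their dependencies.

    scores:   {fragment: relevance score}
    deps:     {fragment: list of fragments it depends on}
    selected: fragments already chosen for the context
    """
    boosted = dict(scores)
    for frag in selected:
        for dep in deps.get(frag, []):
            # A dependency of a selected fragment becomes more relevant itself.
            boosted[dep] = boosted.get(dep, 0.0) + boost
    return boosted
```

So selecting auth.py pulls auth_config.py up the ranking even if the query never mentions it.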

Step 3: Solve the Optimization Problem

This is where it gets mathematically interesting.

We use KKT bisection to find the exact Lagrange multiplier for the token-budget constraint:

f(θ) = Σᵢ σ((sᵢ − θ) / τ) · tokensᵢ − B = 0

30 steps of bisection give us θ*, the exact dual variable. Then we greedily fill the hard budget.

The beautiful part: the same σ(·/τ) appears in the REINFORCE backward pass. Zero train/test mismatch.
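The bisection itself is simple, because the soft-selection mass is monotonically decreasing in the threshold. A Python sketch (the production engine is Rust; parameter values here are illustrative):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def solve_threshold(scores, tokens, budget, tau=0.1, steps=30):
    """Bisect for theta with sum_i sigmoid((s_i - theta)/tau) * tokens_i = budget."""
    lo, hi = min(scores) - 1.0, max(scores) + 1.0
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        mass = sum(sigmoid((s - mid) / tau) * t
                   for s, t in zip(scores, tokens))
        if mass > budget:
            lo = mid  # too many soft-selected tokens: raise the threshold
        else:
            hi = mid  # under budget: lower the threshold
    return 0.5 * (lo + hi)
```

Monotonicity guarantees convergence, and 30 halvings shrink the bracket by a factor of about 10⁹.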

Step 4: Compress at Three Levels

Not every file needs full source code:

  • L1 (5% budget): Skeleton map — auth.py → AuthService, login(), verify_token()
  • L2 (25% budget): Expanded signatures for dependency-connected files
  • L3 (70% budget): Full source code for the most relevant fragments

Your AI sees ALL 500 files. The important ones in detail. The rest in summary.
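An L1 skeleton map for Python files can be extracted with the standard library alone (a sketch of the idea; Entroly parses multiple languages, presumably not via Python's ast):

```python
import ast

def skeleton(source: str) -> list[str]:
    """L1 skeleton: top-level classes with their methods, plus top-level functions."""
    tree = ast.parse(source)
    names = []
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            methods = [n.name + "()" for n in node.body
                       if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
            names.append(node.name + ": " + ", ".join(methods))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            names.append(node.name + "()")
    return names
```

A few dozen characters per file is enough for the model to know what exists and where to ask for more.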

Step 5: Learn From Outcomes

After the AI generates a response, Entroly scores how well the context worked:

  1. Counterfactual Shapley credit: How much did each fragment contribute?
  2. Spectral natural gradient update: Adjust the 4D weight vector using Jacobi eigendecomposition of the gradient covariance
  3. TD(λ) eligibility traces: Credit cascades across a 3-request window

Over time, your context selection gets better without any manual tuning.
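A heavily simplified sketch of the eligibility-trace idea, with scalar weights and made-up hyperparameters (the real update uses Shapley credit and a spectral natural gradient, which this does not reproduce):

```python
def td_lambda_update(weights, grads_history, reward, lr=0.01, lam=0.8):
    """Give recent requests exponentially decaying credit for the current outcome.

    weights:       the 4 scoring weights (entropy, recency, frequency, relevance)
    grads_history: per-request feature gradients, oldest first (e.g. 3-request window)
    """
    trace = [0.0] * len(weights)
    for grads in grads_history:
        # Decay the old trace and fold in this request's gradients.
        trace = [lam * e + g for e, g in zip(trace, grads)]
    # Move each weight in proportion to its accumulated eligibility.
    return [w + lr * reward * e for w, e in zip(weights, trace)]
```

The decay λ is what lets credit from a good answer cascade back to the context choices made a few requests earlier.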

The Numbers

  • 78% fewer tokens per request
  • <10ms overhead (Rust engine)
  • 304 unit tests in Rust, 100+ in Python
  • 24 Rust modules, ~850KB of optimized code
  • Works with any OpenAI-compatible API

Try It

pip install entroly
entroly go

GitHub: github.com/juyterman1000/entroly

MIT licensed. PRs welcome.
