Rudekwydra

How I cut my multi-turn LLM API costs by 90% (O(N²) → O(N))

If you build multi-turn AI agents, you know the pain: API costs don't grow linearly; they grow quadratically.

Every turn in a standard agent loop replays the full conversation history. Token cost on turn N is proportional to N, so total cost across N turns is Θ(N²). I hit a wall where a single heavy day of coding consumed 97% of my weekly Anthropic quota.
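
To make that concrete, here is a rough back-of-the-envelope sketch. The per-turn token count and price are illustrative assumptions, not measurements from my sessions:

```python
# Back-of-the-envelope: input-token growth of a naive replay loop.
# TOKENS_PER_TURN and PRICE_PER_MTOK are illustrative assumptions.
TURNS = 10
TOKENS_PER_TURN = 1_500    # assumed average tokens added per turn
PRICE_PER_MTOK = 15.0      # assumed USD per million input tokens

total_input_tokens = 0
for n in range(1, TURNS + 1):
    # Turn n replays everything from turns 1..n as input.
    total_input_tokens += n * TOKENS_PER_TURN

# Sum of 1..N is N(N+1)/2, so total input grows as Theta(N^2).
assert total_input_tokens == TURNS * (TURNS + 1) // 2 * TOKENS_PER_TURN

print(f"{total_input_tokens:,} input tokens over {TURNS} turns "
      f"(~${total_input_tokens / 1e6 * PRICE_PER_MTOK:.2f})")
```

Double the turns and the input bill roughly quadruples. That's the wall.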

So I built Burnless — an open protocol and orchestration layer that flips the cost curve from O(N²) to O(N).

The result: it cut my real-world API consumption by ~16x and benchmarks 90.3% cheaper than a naive Claude Opus loop.

GitHub: Rudekwydra / burnless

Multi-turn agent loops cost O(N²). Burnless makes them O(N). 88% cheaper at turn 10. MIT, provider-agnostic.


How it works

The math relies on two specific mechanisms working together:

  1. Shared Prefix Cache: The persistent system prompt (which can be 20k+ tokens) is cached using Anthropic's prompt caching (ttl: 1h). Switching models from the same provider mid-session does not invalidate this cache if the prefix is byte-identical.
  2. Capsule History: Instead of keeping raw transcripts in the agent's memory, the "Maestro" model only holds ~80-character compressed "capsules" of prior turns.

The result is that your quadratic history term collapses into a tiny linear one, while the massive system prompt is billed at cache-read prices (which are roughly 10x cheaper than fresh input on Anthropic).
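
For intuition, here is a minimal sketch of what those two mechanisms look like as a single raw Anthropic SDK call. The system prompt, capsule strings, and model name are hypothetical placeholders (Burnless's real request construction lives in the repo), and this shows the default ephemeral cache rather than the 1-hour TTL variant mentioned above:

```python
import anthropic

client = anthropic.Anthropic()

# Placeholder for the persistent 20k+ token system prompt.
BIG_SYSTEM_PROMPT = "You are the Maestro. <tool specs, style guide, project map ...>"

# Instead of the raw transcript, the Maestro only carries ~80-char capsules
# of prior turns (these capsule strings are made up for illustration).
capsules = [
    "T1: scaffolded FastAPI service, tests green",
    "T2: added auth middleware, rate limiting still TODO",
]

response = client.messages.create(
    model="claude-opus-4-1",  # illustrative; any cache-capable model works
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": BIG_SYSTEM_PROMPT,
            # Byte-identical prefix + cache_control means every turn after
            # the first reads this block at cache-read prices.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {
            "role": "user",
            "content": (
                "History capsules:\n" + "\n".join(capsules)
                + "\n\nNext task: add rate limiting to the auth middleware."
            ),
        }
    ],
)
print(response.content[0].text)
```

The point is that per-turn input is the cached prefix plus a handful of short capsules, so the only thing that grows with N is the capsule list itself.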

The Benchmark (No Mocks)

If you want the formal derivation, I published a reproducible benchmark that uses the Anthropic SDK directly and reads raw response.usage.

Against Claude 3 Opus on a 10-turn session:

  • Standalone (no cache): $4.66
  • Standalone (+ cache): $0.65
  • Burnless Maestro: $0.45 (-90.3%)

And this math applies to any provider that exposes prompt caching and charges per input token.
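
If you want to sanity-check numbers like these yourself, the cost of a request can be reconstructed from the raw usage fields the SDK returns. A hedged sketch; the price constants are illustrative, so substitute your provider's current rates:

```python
# Rough per-request cost from an Anthropic response.usage object.
# Price constants are illustrative USD-per-million-token assumptions.
PRICE_INPUT = 15.00        # fresh input tokens
PRICE_CACHE_WRITE = 18.75  # cache-creation tokens
PRICE_CACHE_READ = 1.50    # cache-read tokens (~10x cheaper than fresh input)
PRICE_OUTPUT = 75.00       # output tokens

def request_cost(usage) -> float:
    """Compute USD cost for one call from response.usage."""
    return (
        usage.input_tokens * PRICE_INPUT
        + (getattr(usage, "cache_creation_input_tokens", 0) or 0) * PRICE_CACHE_WRITE
        + (getattr(usage, "cache_read_input_tokens", 0) or 0) * PRICE_CACHE_READ
        + usage.output_tokens * PRICE_OUTPUT
    ) / 1_000_000

# e.g. print(f"${request_cost(response.usage):.4f}") after each turn,
# then sum across turns to reproduce the session totals above.
```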

Vendor-Agnostic Orchestration

Burnless isn't a wrapper for a single API. Tiers are quality/cost bands, not vendors. You can mix and match any CLI you already have installed:

```yaml
agents:
  gold:    { command: "claude --model claude-sonnet-4-6 -p" }   # The Brain
  silver:  { command: "codex exec --sandbox workspace-write" }  # Execution
  bronze:  { command: "ollama run qwen2.5-coder" }              # Local, zero marginal cost
```

The CLI works today. If you're building agents and burning through tokens, give it a try:

```bash
pip install burnless
burnless setup
```

Would love to hear your thoughts on the architecture, especially if you're working on local encoding/decoding for privacy!
