<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rudekwydra</title>
    <description>The latest articles on DEV Community by Rudekwydra (@rudekwydra).</description>
    <link>https://dev.to/rudekwydra</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3912568%2F8c8a6c77-ace2-4e6d-abb7-10c6340112e5.png</url>
      <title>DEV Community: Rudekwydra</title>
      <link>https://dev.to/rudekwydra</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rudekwydra"/>
    <language>en</language>
    <item>
      <title>How I cut my multi-turn LLM API costs by 90% (O(N²) → O(N))</title>
      <dc:creator>Rudekwydra</dc:creator>
      <pubDate>Mon, 04 May 2026 19:20:40 +0000</pubDate>
      <link>https://dev.to/rudekwydra/how-i-cut-my-multi-turn-llm-api-costs-by-90-on-on-49df</link>
      <guid>https://dev.to/rudekwydra/how-i-cut-my-multi-turn-llm-api-costs-by-90-on-on-49df</guid>
      <description>

&lt;p&gt;If you build multi-turn AI agents, you know the pain: API costs don't grow linearly; they grow quadratically. &lt;/p&gt;

&lt;p&gt;Every turn in a standard agent loop replays the full conversation history. Token cost on turn &lt;code&gt;N&lt;/code&gt; is proportional to &lt;code&gt;N&lt;/code&gt;, so total cost across &lt;code&gt;N&lt;/code&gt; turns is &lt;code&gt;Θ(N²)&lt;/code&gt;. I hit a wall where a single heavy day of coding consumed 97% of my weekly Anthropic quota.&lt;/p&gt;
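
&lt;p&gt;To see how fast that blows up, here is a back-of-envelope sketch (the 1,000 tokens-per-turn average is purely illustrative, not a measurement):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Turn n resends all prior turns, so input on turn n is roughly n * T tokens;
# the total over N turns is the triangular sum T * N * (N + 1) / 2, i.e. Theta(N^2).
T = 1_000  # hypothetical average tokens added per turn
for n_turns in (10, 50, 100):
    total = sum(turn * T for turn in range(1, n_turns + 1))
    print(f"{n_turns} turns: {total:,} input tokens")
# 10 turns: 55,000 / 50 turns: 1,275,000 / 100 turns: 5,050,000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;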

&lt;p&gt;So I built &lt;strong&gt;Burnless&lt;/strong&gt; — an open protocol and orchestration layer that flips the cost curve from &lt;code&gt;O(N²)&lt;/code&gt; to &lt;code&gt;O(N)&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The result: it cut my real-world API consumption by ~16x and benchmarks 90.3% cheaper than a naive Claude Opus loop.&lt;/strong&gt;&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/Rudekwydra" rel="noopener noreferrer"&gt;
        Rudekwydra
      &lt;/a&gt; / &lt;a href="https://github.com/Rudekwydra/burnless" rel="noopener noreferrer"&gt;
        burnless
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Multi-turn agent loops cost O(N²). Burnless makes them O(N). 88% cheaper at turn 10. MIT, provider-agnostic.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Burnless&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Intent-compressed intelligence orchestration.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A maestro that orchestrates any LLM from any vendor. Multi-turn agent loops cost O(N²) — Burnless makes them O(N).&lt;/p&gt;
&lt;p&gt;Burnless is a vendor-agnostic orchestration layer for multi-agent workflows. You pick the model that &lt;strong&gt;conducts&lt;/strong&gt; the orchestra (Maestro / Brain) — Claude, GPT, Gemini, Mistral, a local Llama, anything — and the models that &lt;strong&gt;execute&lt;/strong&gt; each task (Workers). Tiers are quality/cost bands, not vendors: &lt;code&gt;gold&lt;/code&gt;/&lt;code&gt;silver&lt;/code&gt;/&lt;code&gt;bronze&lt;/code&gt; map to whatever CLI you put in &lt;code&gt;config.yaml&lt;/code&gt;. Mix providers freely. Run encoder and decoder on a local Ollama model for zero marginal cost on the cheap stages.&lt;/p&gt;
&lt;p&gt;On top of that independence, Burnless flips the cost curve. Every turn in a standalone agent loop replays the full conversation as input — token cost on turn &lt;code&gt;N&lt;/code&gt; is proportional to &lt;code&gt;N&lt;/code&gt;, so total cost across &lt;code&gt;N&lt;/code&gt; turns is &lt;code&gt;Θ(N²)&lt;/code&gt;. Burnless keeps only short capsules in…&lt;/p&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/Rudekwydra/burnless" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;h3&gt;
  How it works
&lt;/h3&gt;

&lt;p&gt;The math relies on two specific mechanisms working together:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Shared Prefix Cache:&lt;/strong&gt; The persistent system prompt (which can be 20k+ tokens) is cached using Anthropic's prompt caching (&lt;code&gt;ttl: 1h&lt;/code&gt;). Later turns keep hitting the cache as long as the prefix stays byte-identical; note that Anthropic scopes cache entries per model, so switching models mid-session starts a fresh cache even with an identical prefix.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capsule History:&lt;/strong&gt; Instead of keeping raw transcripts in the agent's memory, the "Maestro" model only holds ~80-character compressed "capsules" of prior turns.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The result is that your quadratic history term collapses into a tiny linear one, while the massive system prompt is billed at cache-read prices (roughly 10x cheaper than fresh input on Anthropic).&lt;/p&gt;
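
&lt;p&gt;Here is a minimal sketch of both mechanisms with the Anthropic Python SDK (assumptions: the model id matches the config shown below, &lt;code&gt;maestro_prompt.txt&lt;/code&gt; and the capsule strings are placeholders, and it uses the stable 5-minute &lt;code&gt;ephemeral&lt;/code&gt; cache rather than the 1h TTL, which sits behind a beta flag):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM_PREFIX = open("maestro_prompt.txt").read()  # the 20k+ token persistent prefix

capsules = [
    "t1: scaffolded FastAPI app, tests green",  # ~80-char capsule per prior turn
    "t2: added JWT auth middleware",
]

response = client.messages.create(
    model="claude-sonnet-4-6",  # model id as written in the config below
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": SYSTEM_PREFIX,  # must stay byte-identical to keep cache hits
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{
        "role": "user",
        "content": "History capsules:\n" + "\n".join(capsules)
                   + "\n\nNext task: wire up rate limiting.",
    }],
)

# On a hit, usage.cache_read_input_tokens shows the prefix billed at
# cache-read rates instead of fresh-input rates.
print(response.usage)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;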

&lt;h3&gt;
  The Benchmark (No Mocks)
&lt;/h3&gt;

&lt;p&gt;If you want the formal derivation, I published a reproducible benchmark that uses the Anthropic SDK directly and reads raw &lt;code&gt;response.usage&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;Against Claude 3 Opus on a 10-turn session:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Standalone (no cache): $4.66&lt;/li&gt;
&lt;li&gt;Standalone (+ cache): $0.65&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Burnless Maestro:&lt;/strong&gt; $0.45 (-90.3%)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;And this math applies to any provider that exposes prompt caching and charges per input token.&lt;/em&gt;&lt;/p&gt;
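
&lt;p&gt;The headline number checks out from the endpoint figures alone:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;naive, maestro = 4.66, 0.45  # dollars per 10-turn session, from the list above
print(f"savings: {1 - maestro / naive:.1%}")  # savings: 90.3%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;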

&lt;h3&gt;
  Vendor-Agnostic Orchestration
&lt;/h3&gt;

&lt;p&gt;Burnless isn't a wrapper for a single API. Tiers are quality/cost bands, not vendors. You can mix and match any CLI you already have installed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;agents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;gold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;    &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;--model&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;-p"&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt; &lt;span class="c1"&gt;# The Brain&lt;/span&gt;
  &lt;span class="na"&gt;silver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;codex&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;exec&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;--sandbox&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;workspace-write"&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt; &lt;span class="c1"&gt;# Execution&lt;/span&gt;
  &lt;span class="na"&gt;bronze&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ollama&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;run&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;qwen2.5-coder"&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt; &lt;span class="c1"&gt;# Local, zero marginal cost&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
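
&lt;p&gt;Conceptually, each tier is just a shell command. A hypothetical dispatcher (not Burnless's actual internals, just the shape this config implies) could be as small as:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical tier dispatcher: resolve a quality band to its configured
# CLI and pipe the prompt through it. Illustrates the config shape only.
import shlex
import subprocess
import yaml  # pip install pyyaml

def run_tier(tier, prompt, config_path="config.yaml"):
    with open(config_path) as f:
        agents = yaml.safe_load(f)["agents"]
    command = shlex.split(agents[tier]["command"])
    result = subprocess.run(command, input=prompt, text=True,
                            capture_output=True, check=True)
    return result.stdout

print(run_tier("bronze", "Summarize this diff in 80 characters."))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;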



&lt;p&gt;The CLI works today. If you're building agents and burning through tokens, give it a try:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;burnless
burnless setup
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Would love to hear your thoughts on the architecture, especially if you're working on local encoding/decoding for privacy!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>llm</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
