YHH

How a DeepSeek-only agent framework hit 85% prefix cache rate (and saved 93% vs Claude)

I've been running DeepSeek behind LangChain for a few months for a side project. It worked fine, until one day I noticed
something odd: DeepSeek's pricing page advertises cached input tokens at roughly 10% of the cache-miss price, but my bills
didn't reflect that at all.

I dug in. The cache is byte-prefix based. The moment your request's prefix differs from the previous one by even a single
character, you pay full price. And LangChain β€” along with every generic agent framework I checked β€” rebuilds the prompt
every turn. Timestamps get injected. History gets reordered. Tool schemas re-serialize with different whitespace. The prefix
drifts, the cache never hits.

So I wrote something opinionated: Reasonix β€” a TypeScript agent framework built only for DeepSeek. No multi-provider
abstraction, no orchestration graph, no RAG. Just three things done deeply.

πŸ“¦ npm install -g reasonix && reasonix chat
πŸ”— GitHub: esengine/reasonix
πŸ“œ MIT License

## The numbers up front

Measured against the live DeepSeek API, not marketing math:

| Scenario | Model | Turns | Cache hit | Cost | Same on Claude Sonnet 4.6 | Savings |
|---|---|---|---|---|---|---|
| Multi-turn chat | deepseek-chat | 5 | 85.2% | $0.000923 | $0.015174 | 93.9% |
| Tool-use (calculator) | deepseek-chat | 2 | 94.9% | $0.000142 | $0.003351 | 95.8% |
| R1 reasoning + harvest | deepseek-reasoner | 1 | 72.7% | $0.006478 | $0.044484 | 85.4% |

Numbers come straight from usage.prompt_cache_hit_tokens on real API responses. You can install Reasonix and verify in 2
minutes.
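Those per-request figures can be recomputed from the usage object on each response. A sketch of the arithmetic β€” the per-million-token prices below are placeholders (check DeepSeek's live pricing page), but the cache-hit/miss field names are the ones these measurements come from:

```typescript
interface DeepSeekUsage {
  prompt_cache_hit_tokens: number;
  prompt_cache_miss_tokens: number;
  completion_tokens: number;
}

// USD per million tokens -- placeholder prices, verify against the pricing page.
const PRICE = { cacheHit: 0.07, cacheMiss: 0.27, output: 1.1 };

function requestCostUsd(u: DeepSeekUsage): number {
  return (
    (u.prompt_cache_hit_tokens * PRICE.cacheHit +
      u.prompt_cache_miss_tokens * PRICE.cacheMiss +
      u.completion_tokens * PRICE.output) /
    1_000_000
  );
}

function cacheHitRatio(u: DeepSeekUsage): number {
  const prompt = u.prompt_cache_hit_tokens + u.prompt_cache_miss_tokens;
  return prompt === 0 ? 0 : u.prompt_cache_hit_tokens / prompt;
}
```

With an 85% hit rate, the bulk of the prompt bills at the cheap cache-hit tier, which is where the savings in the table come from.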

## Pillar 1 β€” Cache-First Loop

The problem again: DeepSeek's cache only fires on identical byte prefix. Generic frameworks rebuild prompts, so the prefix
drifts, so the cache rarely hits.

The fix is structural. Every request's context gets partitioned into three regions with strict invariants:

  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚ IMMUTABLE PREFIX                    β”‚ ← frozen at session start
  β”‚   system + tool_specs + few_shots   β”‚   this is the cache target
  β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
  β”‚ APPEND-ONLY LOG                     β”‚ ← grows monotonically
  β”‚   [user₁][assistant₁][tool₁]...     β”‚   prior turns preserved as prefix
  β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
  β”‚ VOLATILE SCRATCH                    β”‚ ← reset each turn
  β”‚   R1 thoughts, transient state      β”‚   never sent upstream
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

In code, the prefix is hashed at construction and pinned. The log exposes only append() and refuses any mutation of prior
entries. The scratch gets wiped at every turn boundary.

That's it. That single discipline is enough to push cache hit rates to 85-95% on real sessions. Nothing else in the
framework would matter if this were wrong.
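A minimal sketch of those three invariants (the class and method names here mirror the description above, not necessarily the actual Reasonix internals):

```typescript
import { createHash } from "node:crypto";

type Msg = { role: "system" | "user" | "assistant" | "tool"; content: string };

class Context {
  private readonly prefixHash: string;
  private readonly log: Msg[] = [];
  private scratch: string[] = [];

  constructor(private readonly prefix: Msg[]) {
    // Hash the serialized prefix once at construction; any later drift is a bug.
    this.prefixHash = createHash("sha256")
      .update(JSON.stringify(prefix))
      .digest("hex");
  }

  append(msg: Msg): void {
    this.log.push(msg); // append-only: no splice, no reorder, no rewrite
  }

  note(s: string): void {
    this.scratch.push(s); // volatile: never serialized upstream
  }

  endTurn(): void {
    this.scratch = []; // scratch wiped at every turn boundary
  }

  /** Serialize for the API, verifying the prefix hasn't drifted. */
  build(): Msg[] {
    const h = createHash("sha256").update(JSON.stringify(this.prefix)).digest("hex");
    if (h !== this.prefixHash) throw new Error("prefix drift detected");
    return [...this.prefix, ...this.log]; // scratch intentionally excluded
  }
}
```

Because build() always emits the frozen prefix followed by the monotonically growing log, every request is a byte-extension of the previous one, which is exactly what DeepSeek's cache rewards.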

## Pillar 2 β€” R1 Thought Harvesting

DeepSeek's reasoning model deepseek-reasoner (aka R1) emits extensive reasoning_content β€” often 1000+ tokens of
step-by-step thinking. DeepSeek's own docs recommend not feeding it back to the next turn (it hurts quality). So most
frameworks just display it or drop it.

That's leaving a plan on the table. R1's reasoning trace is literally the model thinking out loud about subgoals,
hypotheses, and uncertainties. I pipe it through a cheap secondary V3 call in JSON mode and extract structured state:

  interface TypedPlanState {
    subgoals: string[];      // concrete intermediate objectives
    hypotheses: string[];    // candidate approaches being weighed
    uncertainties: string[]; // things R1 flags as unclear
    rejectedPaths: string[]; // approaches considered and abandoned
  }

Here's R1 on a classic logic puzzle β€” "3 boxes with swapped labels; pick one fruit to determine all three contents":

  β€Ή subgoals (3): enumerate label-content permutations Β· decide which box to sample Β· verify uniqueness
  β€Ή hypotheses (3): sample from "apple" box Β· sample from "orange" box Β· sample from "mixed" box
  β€Ή uncertainties (2): can a single pick uniquely determine all? Β· does "mixed" contain equal ratios?
  β€Ή rejected (2): sampling from "apple" box (ambiguous) Β· sampling from "orange" box (symmetric)

Every field maps to actual content in R1's reasoning trace. V3 is cheap enough (~$0.0001/turn) that this is essentially
free. Opt-in via reasonix chat --harvest or /harvest on inside the TUI.
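The harvesting pass itself is just a V3 chat completion in JSON mode. A sketch of the request body β€” `response_format: { type: "json_object" }` is DeepSeek's documented JSON mode, but the prompt wording here is my guess, not Reasonix's actual prompt:

```typescript
interface TypedPlanState {
  subgoals: string[];
  hypotheses: string[];
  uncertainties: string[];
  rejectedPaths: string[];
}

// Build a cheap V3 request that extracts TypedPlanState from an R1 trace.
function buildHarvestRequest(reasoning: string) {
  return {
    model: "deepseek-chat", // V3: cheap secondary pass over R1's reasoning
    response_format: { type: "json_object" as const },
    messages: [
      {
        role: "system" as const,
        content:
          "Extract JSON with keys subgoals, hypotheses, uncertainties, " +
          "rejectedPaths (each an array of short strings) from the " +
          "reasoning trace the user provides.",
      },
      { role: "user" as const, content: reasoning },
    ],
  };
}
```

The key point is that the trace is never fed back into R1's next turn (per DeepSeek's own guidance); it only flows sideways into this extraction call.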

## Pillar 3 β€” Tool-Call Repair

DeepSeek has several known tool-use quirks that generic frameworks don't handle:

  1. Deep or wide schemas drop arguments. Tool schemas with more than ~10 leaf parameters or more than 2 levels of nesting cause V3/R1 to silently omit fields.
  2. R1 leaks tool calls into <think>. The model writes tool-call JSON inside its reasoning trace and forgets to surface it in the actual tool_calls field.
  3. JSON gets truncated. Long arguments payloads hit max_tokens mid-structure.
  4. Call storms. The model hammers the same tool with identical arguments in an infinite loop.

Reasonix's repair layer has four passes running on every turn:

  // 1. Auto-flatten deep/wide schemas
  ToolRegistry.register({
    name: "updateProfile",
    parameters: {
      type: "object",
      properties: {
        user: { type: "object", properties: {
          profile: { type: "object", properties: {
            name: { type: "string" },
            age: { type: "integer" },
          }},
        }},
      },
    },
    fn: ({ user }) => updateInDB(user),
  });
  // Internally shown to the model as a flat schema:
  //   {"user.profile.name": "...", "user.profile.age": ...}
  // On dispatch, args re-nested back to { user: { profile: { ... } } }

  // 2. Scavenge: regex + JSON parser sweeps reasoning_content for missed calls
  // 3. Truncation recovery: close braces, trim trailing commas, fill dangling keys
  // 4. Storm breaker: sliding-window dedup of (tool, args) tuples

All four are always on. No user configuration.
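The dispatch-side re-nesting from pass 1 boils down to splitting dot-path keys back into nested objects. A sketch (a hypothetical helper, not the actual Reasonix implementation):

```typescript
// Re-nest flat {"a.b.c": v} arguments back into { a: { b: { c: v } } }.
function unflatten(flat: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [path, value] of Object.entries(flat)) {
    const keys = path.split(".");
    let node = out;
    for (let i = 0; i < keys.length - 1; i++) {
      // Create intermediate objects on demand as we walk the path.
      node = (node[keys[i]] ??= {}) as Record<string, unknown>;
    }
    node[keys[keys.length - 1]] = value;
  }
  return out;
}

// unflatten({ "user.profile.name": "Ada", "user.profile.age": 36 })
// yields { user: { profile: { name: "Ada", age: 36 } } }
```

The model only ever sees the flat view, so the "deep schemas drop arguments" failure mode never gets a chance to fire; the tool's fn still receives the nested shape it declared.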

## Bonus: Self-Consistency Branching

Here's the fun one. DeepSeek is roughly 20Γ— cheaper than Claude Sonnet 4.6, which means three parallel R1 samples per turn
are still cheaper than a single Claude call. What was a research luxury (self-consistency sampling) becomes a practical
default.

  reasonix chat --branch 3
  # or inside the TUI:
  > /preset max

Three samples fire in parallel at temperatures 0.0 / 0.5 / 1.0. Each one's reasoning is harvested. The default selector
picks whichever sample has the fewest flagged uncertainties (tie-break on shorter answer length β€” Occam's razor as a
heuristic).
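The default selector is simple enough to sketch in full, going only off the description above (fewest flagged uncertainties, then shortest answer):

```typescript
interface BranchSample {
  answer: string;
  uncertainties: string[]; // harvested from this sample's own reasoning
}

// Pick the index of the sample with the fewest flagged uncertainties;
// tie-break on shorter answer length (Occam's razor as a heuristic).
function selectBranch(samples: BranchSample[]): number {
  let best = 0;
  for (let i = 1; i < samples.length; i++) {
    const a = samples[i];
    const b = samples[best];
    if (
      a.uncertainties.length < b.uncertainties.length ||
      (a.uncertainties.length === b.uncertainties.length &&
        a.answer.length < b.answer.length)
    ) {
      best = i;
    }
  }
  return best;
}
```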

TUI shows this live:

  πŸ”€ branched 3 samples β†’ picked #1   #0 T=0.0 u=2   β–Έ#1 T=0.5 u=0   #2 T=1.0 u=3

Anecdotally it lifts accuracy 10-15 percentage points on medium-difficulty reasoning, at roughly 1/5 the cost of a single
Claude pass. I haven't run a formal benchmark yet β€” that's next.

## What it's explicitly not

  • Not a LangChain replacement. No multi-provider, no graph orchestration, no RAG.
  • Not a drop-in for OpenAI-compatible code. The whole point is DeepSeek-specific.
  • Not production-ready. v0.0.6 pre-alpha, 135 passing tests, no formal benchmarks yet.

## Quick start

  npm install -g reasonix
  reasonix chat

First launch prompts for your DeepSeek API key and saves it to ~/.reasonix/config.json. Sessions auto-persist: put in two
hours of work, quit, come back tomorrow, type reasonix chat β€” you're back where you left off.

Inside the TUI, slash commands cover everything:

  /preset fast|smart|max    one-tap config (fast = default)
  /model <id>               deepseek-chat or deepseek-reasoner
  /harvest [on|off]         Pillar 2 toggle
  /branch <N|off>           N parallel samples (>=2)
  /sessions                 list saved sessions
  /forget                   delete current session
  /help                     full list

No flag-soup to memorize. A command strip under the prompt shows the top-level commands at all times.

## Library usage

  import {
    CacheFirstLoop,
    DeepSeekClient,
    ImmutablePrefix,
    ToolRegistry,
  } from "reasonix";

  const client = new DeepSeekClient(); // reads DEEPSEEK_API_KEY
  const tools = new ToolRegistry();

  tools.register({
    name: "add",
    parameters: {
      type: "object",
      properties: { a: { type: "integer" }, b: { type: "integer" } },
      required: ["a", "b"],
    },
    fn: ({ a, b }: { a: number; b: number }) => a + b,
  });

  const loop = new CacheFirstLoop({
    client,
    tools,
    prefix: new ImmutablePrefix({
      system: "You are a math helper.",
      toolSpecs: tools.specs(),
    }),
    harvest: true,
    branch: 3,
    session: "math-tutor",
  });

  for await (const ev of loop.step("What is 17 + 25?")) {
    if (ev.role === "assistant_final") console.log(ev.content);
  }

  console.log(loop.stats.summary());
  // { turns: 2, totalCostUsd: 0.0003, savingsVsClaudePct: 94, cacheHitRatio: 0.87 }

## Open questions I'd love feedback on

  1. Branching selector heuristic. The default is min(uncertainties.length) with length tie-break. That's obviously
    naive. What signals would you combine? Cross-sample answer similarity? Tool-call success rate per sample? An LLM-judge pass?

  2. Harvest cost/value trade-off. The $0.0001/turn V3 call feels negligible but it's a floor on per-turn cost. Has anyone
    tried fine-tuning R1 to output structured plan state directly?

  3. Cache continuity across config changes. Right now changing the system prompt mid-session invalidates the prefix
    cache. Is there a migration path that preserves the existing log's value?


Full source: github.com/esengine/reasonix

Install: npm install -g reasonix

Issues, PRs, and benchmarks especially welcome.
