Mahika jadhav

How I cut my LangChain agent's token costs by 93% with one import

My agent was generating the same weekly security report for the same three clients every Monday. Same context. Same reasoning structure. Same output format. I was paying full Anthropic API price every single time.

I checked the logs. Across 45 runs of three recurring workflow types — security audits, invoice processing, weekly reports — the structure of the generated plan was materially identical run after run. The LLM was re-deriving the same skeleton every time. 93% of the tokens I was spending were redundant.

This isn't a prompt engineering problem. It's a structural one.


The Problem With Stateless Frameworks

Every major agent framework — LangChain, LangGraph, CrewAI, AutoGen — is stateless by default. There is no memory of
previous executions at the plan level. Each invocation starts from zero.

This is fine for one-off queries. For recurring workflows — scheduled reports, compliance checks, data pipelines, anything that runs the same class of task repeatedly — it means you pay full LLM price every time, forever.

Prompt caching (the built-in feature from Anthropic and OpenAI) discounts input tokens on repeated prompt prefixes. It doesn't cover the parts of the prompt that vary per run, it doesn't eliminate the API call, and it does nothing for the reasoning and plan generation that happens downstream.

What you actually need is execution caching — caching at the plan level, not the prompt level.


The Solution: Cache the Execution Plan

The idea: on first run, fingerprint the execution plan and store it as segments. On subsequent runs with the same or semantically similar goal, serve the plan from cache. Skip the LLM entirely.

Two modes:

System 1 — Exact match. SHA-256 fingerprint of goal + context + inputs. If it matches a stored plan, reconstruct from local SQLite in ~2.66ms. Zero API calls. Zero tokens.

System 2 — Semantic match. Goal is similar but not identical — same workflow, different client name or date. Match the stored plan by embedding similarity. Diff the segments. Regenerate only the parts that changed. Pay for the delta, not the full plan.
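
System 1 is essentially a content-addressed lookup. Here's a minimal sketch of that idea (not mnemon's actual internals; the plans table and helper names are hypothetical):

import hashlib
import json
import sqlite3

def fingerprint(goal: str, context: dict, inputs: dict) -> str:
    # Canonicalize goal + context + inputs so identical runs hash identically
    payload = json.dumps({"goal": goal, "context": context, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def lookup_plan(db: sqlite3.Connection, key: str):
    # System 1: an exact fingerprint match means the whole plan is served locally
    row = db.execute("SELECT plan_json FROM plans WHERE fingerprint = ?", (key,)).fetchone()
    return json.loads(row[0]) if row else None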

A background process called the Retrospector quarantines failed segments so bad patterns never get reused. A signal
bus tracks latency baselines and failure rates per workflow type and feeds that back to strengthen or weaken cached patterns. The cache gets smarter over time, not just bigger.
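
The quarantine idea is straightforward to picture. A hedged sketch, assuming a hypothetical segments table with a quarantined flag (again, not the library's real schema):

def quarantine_failed_segments(db, plan_id, failed_segment_ids):
    # Flag segments whose execution failed so they are never served again
    db.executemany(
        "UPDATE segments SET quarantined = 1 WHERE plan_id = ? AND segment_id = ?",
        [(plan_id, seg_id) for seg_id in failed_segment_ids],
    )
    db.commit()

def reusable_segments(db, plan_id):
    # Only non-quarantined segments are candidates for reuse on the next run
    return db.execute(
        "SELECT segment_id, body FROM segments WHERE plan_id = ? AND quarantined = 0",
        (plan_id,),
    ).fetchall()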


In Practice

import mnemon
mnemon.init()

# everything below is unchanged
from langchain_anthropic import ChatAnthropic
llm = ChatAnthropic(model="claude-sonnet-4-6")
response = llm.invoke("Generate weekly security report for Acme Corp")

That's it. mnemon.init() patches BaseChatModel.invoke and ainvoke at import time. The first call goes to the LLM and gets cached. Every subsequent call with the same or semantically equivalent goal is served from local SQLite.
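
Under the hood, this kind of patching generally amounts to wrapping invoke with a cache check. A simplified sketch of the pattern, using an in-memory dict where mnemon uses SQLite (the key here is a crude stand-in, not the real fingerprint):

from langchain_core.language_models.chat_models import BaseChatModel

_original_invoke = BaseChatModel.invoke
_plan_cache = {}  # stand-in for mnemon's local SQLite store

def _caching_invoke(self, input, config=None, **kwargs):
    key = (type(self).__name__, str(input))   # simplistic stand-in for the real fingerprint
    if key in _plan_cache:
        return _plan_cache[key]               # hit: no API call, no tokens
    result = _original_invoke(self, input, config=config, **kwargs)
    _plan_cache[key] = result                 # first run pays, later runs don't
    return result

BaseChatModel.invoke = _caching_invoke        # applied once, when the cache is initialized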

For explicit control over what gets cached:

import mnemon
from anthropic import Anthropic

client = Anthropic()
m = mnemon.init()

def generate_report(goal, inputs, context, capabilities, constraints):
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": goal}],
    )
    return response.content[0].text

result = m.run(
    goal="weekly security audit for Acme Corp",
    inputs={"client": "Acme Corp", "week": "2026-05-14"},
    generation_fn=generate_report,
)

print(result["output"]) # the actual result
print(result["cache_level"]) # "system1" | "system2" | "miss"
print(result["tokens_saved"]) # 1250 on a hit, 0 on first run
print(result["latency_saved_ms"]) # 20000.0 on a hit

generation_fn is only called on a cache miss. On a hit, it's never invoked.


Benchmark Results

Tested across 45 runs of three recurring workflow types on claude-sonnet-4-6:

┌───────────────────────────────────┬───────────┐
│ Metric │ Result │
├───────────────────────────────────┼───────────┤
│ Cache misses (first run per type) │ 3 │
├───────────────────────────────────┼───────────┤
│ System 2 hits │ 12 │
├───────────────────────────────────┼───────────┤
│ System 1 hits │ 30 │
├───────────────────────────────────┼───────────┤
│ Token reduction │ 93.3% │
├───────────────────────────────────┼───────────┤
│ LLM call reduction │ 93% │
├───────────────────────────────────┼───────────┤
│ System 1 hit latency │ 2.66ms │
├───────────────────────────────────┼───────────┤
│ Fresh generation latency │ ~20,000ms │
├───────────────────────────────────┼───────────┤
│ Speedup │ 7,500× │
└───────────────────────────────────┴───────────┘

50 concurrent agents serving the same workflow type in a single burst: 0 LLM calls, 62,500 tokens saved, 0.18 seconds total wall time.
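
You can reproduce a burst like that with the m.run API from the earlier example, assuming the plan is already cached from a previous run (generate_report and m are the same objects defined above):

from concurrent.futures import ThreadPoolExecutor

def run_once(_):
    # Identical goal and inputs, so every call after the first resolves from cache
    return m.run(
        goal="weekly security audit for Acme Corp",
        inputs={"client": "Acme Corp", "week": "2026-05-14"},
        generation_fn=generate_report,
    )

with ThreadPoolExecutor(max_workers=50) as pool:
    results = list(pool.map(run_once, range(50)))

hits = sum(1 for r in results if r["cache_level"] != "miss")
print(f"{hits}/50 served without an LLM call")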

At scale with 80% System 1 and 15% System 2 hit rates:

┌─────────────┬────────────────────┐
│ Daily plans │ Monthly cost saved │
├─────────────┼────────────────────┤
│ 1,000 │ $503 │
├─────────────┼────────────────────┤
│ 10,000 │ $5,034 │
├─────────────┼────────────────────┤
│ 100,000 │ $50,344 │
└─────────────┴────────────────────┘

Raw data and methodology in /reports (https://github.com/smartass-4ever/Mnemon/tree/main/reports).


What It Supports

Auto-instruments at import time — no code changes needed:

┌───────────────┬─────────────────────────────────┐
│ Framework │ What gets patched │
├───────────────┼─────────────────────────────────┤
│ Anthropic SDK │ client.messages.create │
├───────────────┼─────────────────────────────────┤
│ OpenAI SDK │ client.chat.completions.create │
├───────────────┼─────────────────────────────────┤
│ LangChain │ BaseChatModel.invoke / ainvoke │
├───────────────┼─────────────────────────────────┤
│ LangGraph │ CompiledGraph.invoke / ainvoke │
├───────────────┼─────────────────────────────────┤
│ CrewAI │ crew kickoff via event bus │
├───────────────┼─────────────────────────────────┤
│ AutoGen │ ConversableAgent.generate_reply │
└───────────────┴─────────────────────────────────┘


Honest Caveats

System 2 segment-level savings require sentence-transformers. Without pip install mnemon-ai[embeddings], System 2 still works — it serves the full cached plan when goal similarity clears the threshold — but you don't get
partial-segment delta savings. System 1 is unaffected.
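
For context, this is the kind of goal-similarity check the embeddings extra enables. A minimal sketch with sentence-transformers; the model name and threshold here are assumptions, not mnemon's actual configuration:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumption: any small embedding model

def goals_match(new_goal, cached_goal, threshold=0.85):
    # System 2-style check: close enough to reuse the cached plan's structure?
    embeddings = model.encode([new_goal, cached_goal], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item() >= threshold

# Same workflow, different client: similar enough to serve the cached skeleton
goals_match("weekly security audit for Acme Corp",
            "weekly security audit for Globex Inc")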

This doesn't help for novel one-off queries. If every invocation is genuinely unique, there's nothing to cache.
The savings compound on scheduled or event-triggered workflows running the same class of task repeatedly.

The 2.66ms latency is for warm cache hits. Cold start (first run per workflow type) still goes to the LLM.


Diagnostics

m = mnemon.get() # retrieve from anywhere in your codebase
m.get_stats() # EME hits/misses, bus signals, DB size
m.drift_report() # cross-session latency degradation
m.waste_report()     # repeated queries and cumulative cost

mnemon doctor # health check
mnemon demo # live demo, no API key needed


Try It

pip install mnemon-ai
mnemon demo

No API key needed to run the demo. Source and full benchmark data on GitHub
(https://github.com/smartass-4ever/Mnemon).

Happy to answer questions on the segment diffing logic or the failure quarantine mechanism — those are the interesting parts architecturally.
