<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mahika jadhav</title>
    <description>The latest articles on DEV Community by Mahika jadhav (@smartass4ever).</description>
    <link>https://dev.to/smartass4ever</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3931774%2F4d77ad9c-0871-489f-a805-0c3e1dafb4cc.png</url>
      <title>DEV Community: Mahika jadhav</title>
      <link>https://dev.to/smartass4ever</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/smartass4ever"/>
    <language>en</language>
    <item>
      <title>How I cut my LangChain agent's token costs by 93% with one import</title>
      <dc:creator>Mahika jadhav</dc:creator>
      <pubDate>Thu, 14 May 2026 18:05:38 +0000</pubDate>
      <link>https://dev.to/smartass4ever/how-i-cut-my-langchain-agents-token-costs-by-93-with-one-import-2nc9</link>
      <guid>https://dev.to/smartass4ever/how-i-cut-my-langchain-agents-token-costs-by-93-with-one-import-2nc9</guid>
      <description>&lt;p&gt;My agent was generating the same weekly security report for the same three clients every Monday. Same context. Same reasoning structure. Same output format. I was paying full Anthropic API price every single time.&lt;/p&gt;

&lt;p&gt;I checked the logs. Across 45 runs of three recurring workflow types — security audits, invoice processing, weekly reports — the structure of the generated plan was materially identical run after run. The LLM was re-deriving the same skeleton every time. 93% of the tokens I was spending were redundant.&lt;/p&gt;

&lt;p&gt;This isn't a prompt engineering problem. It's a structural one.&lt;/p&gt;




&lt;p&gt;The Problem With Stateless Frameworks&lt;/p&gt;

&lt;p&gt;Every major agent framework — LangChain, LangGraph, CrewAI, AutoGen — is stateless by default. There is no memory of previous executions at the plan level. Each invocation starts from zero.&lt;/p&gt;

&lt;p&gt;This is fine for one-off queries. For recurring workflows — scheduled reports, compliance checks, data pipelines, anything that runs the same class of task repeatedly — it means you pay full LLM price every time, forever.&lt;/p&gt;

&lt;p&gt;Prompt caching (Anthropic's and OpenAI's built-in feature) helps with input tokens on identical prompts. It doesn't help when your inputs vary slightly per run. It doesn't eliminate the API call. And it does nothing for the reasoning and plan generation that happens downstream.&lt;/p&gt;

&lt;p&gt;What you actually need is execution caching — caching at the plan level, not the prompt level.&lt;/p&gt;




&lt;p&gt;The Solution: Cache the Execution Plan&lt;/p&gt;

&lt;p&gt;The idea: on first run, fingerprint the execution plan and store it as segments. On subsequent runs with the same or semantically similar goal, serve the plan from cache. Skip the LLM entirely.&lt;/p&gt;

&lt;p&gt;Two modes:&lt;/p&gt;

&lt;p&gt;System 1 — Exact match. SHA-256 fingerprint of goal + context + inputs. If it matches a stored plan, reconstruct from local SQLite in ~2.66ms. Zero API calls. Zero tokens.&lt;/p&gt;
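
&lt;p&gt;For intuition, the exact-match path is essentially a content-addressed lookup. Here's a minimal sketch, assuming a plain SQLite table and hypothetical names (this is not Mnemon's actual schema):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import hashlib
import json
import sqlite3

def fingerprint(goal, context, inputs):
    # Canonical JSON so dict key order can't change the hash
    payload = json.dumps({"goal": goal, "context": context, "inputs": inputs},
                         sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

con = sqlite3.connect("plan_cache.db")  # illustrative filename, not the real store
con.execute("CREATE TABLE IF NOT EXISTS plans (fp TEXT PRIMARY KEY, plan TEXT)")

def lookup(goal, context, inputs):
    row = con.execute("SELECT plan FROM plans WHERE fp = ?",
                      (fingerprint(goal, context, inputs),)).fetchone()
    return json.loads(row[0]) if row else None  # a hit means zero API calls, zero tokens
&lt;/code&gt;&lt;/pre&gt;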

&lt;p&gt;System 2 — Semantic match. Goal is similar but not identical — same workflow, different client name or date. Match the stored plan by embedding similarity. Diff the segments. Regenerate only the parts that changed. Pay for the delta, not the full plan.&lt;/p&gt;
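
&lt;p&gt;The semantic path is the more interesting one. A rough sketch of how it could work, assuming sentence-transformers for the embeddings and a hypothetical depends_on field on each cached segment (the threshold and helper names are illustrative, not the library's API):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
SIMILARITY_THRESHOLD = 0.85                      # illustrative value

def closest_cached_goal(goal, cached_goals):
    # Index of the most similar stored goal, or None if nothing clears the bar
    scores = util.cos_sim(model.encode(goal, convert_to_tensor=True),
                          model.encode(cached_goals, convert_to_tensor=True))[0]
    best = int(scores.argmax())
    return best if float(scores[best]) &gt;= SIMILARITY_THRESHOLD else None

def diff_segments(cached_segments, old_inputs, new_inputs, regenerate):
    # Reuse segments whose inputs didn't change; pay LLM tokens only for the rest
    out = []
    for seg in cached_segments:
        changed = any(old_inputs.get(k) != new_inputs.get(k) for k in seg["depends_on"])
        out.append(regenerate(seg, new_inputs) if changed else seg)
    return out
&lt;/code&gt;&lt;/pre&gt;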

&lt;p&gt;A background process called the Retrospector quarantines failed segments so bad patterns never get reused. A signal bus tracks latency baselines and failure rates per workflow type and feeds that back to strengthen or weaken cached patterns. The cache gets smarter over time, not just bigger.&lt;/p&gt;
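
&lt;p&gt;The feedback loop itself is not exotic. Sketched in Python, with made-up table and column names just to show the shape (not the real internals):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import time

def record_signal(con, workflow_type, segment_id, ok, latency_ms):
    # Every execution reports back: success or failure, plus how long it took
    con.execute(
        "INSERT INTO signals (workflow_type, segment_id, ok, latency_ms, ts) "
        "VALUES (?, ?, ?, ?, ?)",
        (workflow_type, segment_id, int(ok), latency_ms, time.time()),
    )

def retrospect(con, failure_threshold=0.3):
    # Quarantine any segment whose observed failure rate is too high to trust
    for segment_id, fail_rate in con.execute(
        "SELECT segment_id, AVG(1 - ok) FROM signals GROUP BY segment_id"
    ):
        if fail_rate &gt;= failure_threshold:
            con.execute("UPDATE segments SET quarantined = 1 WHERE id = ?", (segment_id,))
&lt;/code&gt;&lt;/pre&gt;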




&lt;p&gt;In Practice&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import mnemon
mnemon.init()

# everything below is unchanged
from langchain_anthropic import ChatAnthropic
llm = ChatAnthropic(model="claude-sonnet-4-6")
response = llm.invoke("Generate weekly security report for Acme Corp")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;That's it. mnemon.init() patches BaseChatModel.invoke and ainvoke at import time. The first call goes to the LLM and gets cached. Every subsequent call with the same or semantically equivalent goal is served from local SQLite.&lt;/p&gt;
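
&lt;p&gt;If you're wondering what "patches at import time" means mechanically, it's ordinary monkey-patching of the base class method. A simplified sketch of the pattern (the cache helpers here are hypothetical stand-ins, not Mnemon's code):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from langchain_core.language_models import BaseChatModel

_original_invoke = BaseChatModel.invoke

def _cached_invoke(self, input, config=None, **kwargs):
    key = fingerprint_of(input)        # hypothetical helper: hash the goal/context/inputs
    hit = cache_lookup(key)            # hypothetical helper: local SQLite lookup
    if hit is not None:
        return hit                     # served from cache, no API call
    result = _original_invoke(self, input, config=config, **kwargs)
    cache_store(key, result)           # hypothetical helper: persist for the next run
    return result

BaseChatModel.invoke = _cached_invoke  # every LangChain chat model call now checks the cache first
&lt;/code&gt;&lt;/pre&gt;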

&lt;p&gt;For explicit control over what gets cached:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import mnemon
from anthropic import Anthropic

client = Anthropic()
m = mnemon.init()

def generate_report(goal, inputs, context, capabilities, constraints):
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": goal}],
    )
    return response.content[0].text

result = m.run(
    goal="weekly security audit for Acme Corp",
    inputs={"client": "Acme Corp", "week": "2026-05-14"},
    generation_fn=generate_report,
)

print(result["output"])            # the actual result
print(result["cache_level"])       # "system1" | "system2" | "miss"
print(result["tokens_saved"])      # 1250 on a hit, 0 on first run
print(result["latency_saved_ms"])  # 20000.0 on a hit
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;generation_fn is only called on a cache miss. On a hit, it's never invoked.&lt;/p&gt;
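
&lt;p&gt;Conceptually, run() owns the hit/miss branch. A stripped-down sketch of that control flow, with an in-memory dict standing in for the SQLite store (figures illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;_CACHE = {}  # stand-in for the local SQLite store

def run_sketch(goal, inputs, generation_fn):
    key = (goal, tuple(sorted(inputs.items())))  # stand-in for the SHA-256 fingerprint
    if key in _CACHE:
        # Hit: generation_fn is never touched, no tokens spent
        return {"output": _CACHE[key], "cache_level": "system1", "tokens_saved": 1250}
    output = generation_fn(goal, inputs, None, None, None)  # miss: pay for the LLM once
    _CACHE[key] = output
    return {"output": output, "cache_level": "miss", "tokens_saved": 0}
&lt;/code&gt;&lt;/pre&gt;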




&lt;p&gt;Benchmark Results&lt;/p&gt;

&lt;p&gt;Tested across 45 runs of three recurring workflow types on claude-sonnet-4-6:&lt;/p&gt;

&lt;pre&gt;┌───────────────────────────────────┬───────────┐
│              Metric               │  Result   │
├───────────────────────────────────┼───────────┤
│ Cache misses (first run per type) │ 3         │
├───────────────────────────────────┼───────────┤
│ System 2 hits                     │ 12        │
├───────────────────────────────────┼───────────┤
│ System 1 hits                     │ 30        │
├───────────────────────────────────┼───────────┤
│ Token reduction                   │ 93.3%     │
├───────────────────────────────────┼───────────┤
│ LLM call reduction                │ 93%       │
├───────────────────────────────────┼───────────┤
│ System 1 hit latency              │ 2.66ms    │
├───────────────────────────────────┼───────────┤
│ Fresh generation latency          │ ~20,000ms │
├───────────────────────────────────┼───────────┤
│ Speedup                           │ 7,500×    │
└───────────────────────────────────┴───────────┘&lt;/pre&gt;

&lt;p&gt;50 concurrent agents serving the same workflow type in a single burst: 0 LLM calls, 62,500 tokens saved, 0.18 seconds total wall time.&lt;/p&gt;

&lt;p&gt;At scale with 80% System 1 and 15% System 2 hit rates:&lt;/p&gt;

&lt;pre&gt;┌─────────────┬────────────────────┐
│ Daily plans │ Monthly cost saved │
├─────────────┼────────────────────┤
│ 1,000       │ $503               │
├─────────────┼────────────────────┤
│ 10,000      │ $5,034             │
├─────────────┼────────────────────┤
│ 100,000     │ $50,344            │
└─────────────┴────────────────────┘&lt;/pre&gt;
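
&lt;p&gt;If you want to sanity-check those numbers against your own traffic, the back-of-the-envelope arithmetic is simple. This helper is mine, not part of the library, and the per-plan cost and System 2 delta fraction are assumptions you'd swap for your own pricing:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def monthly_savings(daily_plans, cost_per_plan_usd,
                    s1_rate=0.80, s2_rate=0.15, s2_delta_fraction=0.30):
    # System 1 hits skip the full plan cost; System 2 hits pay only for the regenerated delta
    saved_per_plan = (s1_rate * cost_per_plan_usd
                      + s2_rate * (1 - s2_delta_fraction) * cost_per_plan_usd)
    return 30 * daily_plans * saved_per_plan

# e.g. 1,000 plans/day at roughly $0.019 per freshly generated plan
print(round(monthly_savings(1_000, 0.019)))  # ~516, the same ballpark as the table above
&lt;/code&gt;&lt;/pre&gt;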

&lt;p&gt;Raw data and methodology in /reports (&lt;a href="https://github.com/smartass-4ever/Mnemon/tree/main/reports" rel="noopener noreferrer"&gt;https://github.com/smartass-4ever/Mnemon/tree/main/reports&lt;/a&gt;).&lt;/p&gt;




&lt;p&gt;What It Supports&lt;/p&gt;

&lt;p&gt;Auto-instruments at import time — no code changes needed:&lt;/p&gt;

&lt;pre&gt;┌───────────────┬─────────────────────────────────┐
│   Framework   │        What gets patched        │
├───────────────┼─────────────────────────────────┤
│ Anthropic SDK │ client.messages.create          │
├───────────────┼─────────────────────────────────┤
│ OpenAI SDK    │ client.chat.completions.create  │
├───────────────┼─────────────────────────────────┤
│ LangChain     │ BaseChatModel.invoke / ainvoke  │
├───────────────┼─────────────────────────────────┤
│ LangGraph     │ CompiledGraph.invoke / ainvoke  │
├───────────────┼─────────────────────────────────┤
│ CrewAI        │ crew kickoff via event bus      │
├───────────────┼─────────────────────────────────┤
│ AutoGen       │ ConversableAgent.generate_reply │
└───────────────┴─────────────────────────────────┘&lt;/pre&gt;




&lt;p&gt;Honest Caveats&lt;/p&gt;

&lt;p&gt;System 2 segment-level savings require sentence-transformers. Without pip install mnemon-ai[embeddings], System 2 still works — it serves the full cached plan when goal similarity clears the threshold — but you don't get partial-segment delta savings. System 1 is unaffected.&lt;/p&gt;

&lt;p&gt;This doesn't help for novel one-off queries. If every invocation is genuinely unique, there's nothing to cache. The savings compound on scheduled or event-triggered workflows running the same class of task repeatedly.&lt;/p&gt;

&lt;p&gt;The 2.66ms latency is for warm cache hits. Cold start (first run per workflow type) still goes to the LLM.&lt;/p&gt;




&lt;p&gt;Diagnostics&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;m = mnemon.get()       # retrieve from anywhere in your codebase
m.get_stats()          # EME hits/misses, bus signals, DB size
m.drift_report()       # cross-session latency degradation
m.waste_report         # repeated queries and cumulative cost
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;mnemon doctor          # health check
mnemon demo            # live demo, no API key needed
&lt;/code&gt;&lt;/pre&gt;




&lt;p&gt;Try It&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;pip install mnemon-ai
mnemon demo
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;No API key needed to run the demo. Source and full benchmark data on GitHub (&lt;a href="https://github.com/smartass-4ever/Mnemon" rel="noopener noreferrer"&gt;https://github.com/smartass-4ever/Mnemon&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Happy to answer questions on the segment diffing logic or the failure quarantine mechanism — those are the interesting parts architecturally.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>python</category>
      <category>langchain</category>
    </item>
  </channel>
</rss>
