<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mahika jadhav</title>
    <description>The latest articles on DEV Community by Mahika jadhav (@smartass4ever).</description>
    <link>https://dev.to/smartass4ever</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3931774%2F4d77ad9c-0871-489f-a805-0c3e1dafb4cc.png</url>
      <title>DEV Community: Mahika jadhav</title>
      <link>https://dev.to/smartass4ever</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/smartass4ever"/>
    <language>en</language>
    <item>
      <title>How to Add Caching to Any AutoGen Workflow in 2 Lines</title>
      <dc:creator>Mahika jadhav</dc:creator>
      <pubDate>Sat, 06 Jun 2026 10:45:36 +0000</pubDate>
      <link>https://dev.to/smartass4ever/how-to-add-caching-to-any-autogen-workflow-in-2-lines-3lf3</link>
      <guid>https://dev.to/smartass4ever/how-to-add-caching-to-any-autogen-workflow-in-2-lines-3lf3</guid>
      <description>&lt;p&gt;AutoGen doesn't have a built-in execution cache. Every &lt;code&gt;GroupChat&lt;/code&gt;, every &lt;code&gt;ConversableAgent&lt;/code&gt; run starts fresh. If your multi-agent workflow runs similar tasks repeatedly — research pipelines, code review agents, scheduled reports — you're paying full LLM price every time.&lt;/p&gt;

&lt;p&gt;Here's how to fix it without touching your AutoGen code.&lt;/p&gt;




&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install mnemon-ai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import mnemon
mnemon.init()

# your existing AutoGen code, completely unchanged
import autogen

assistant = autogen.AssistantAgent(
    name="assistant",
    llm_config={"model": "gpt-4o", "api_key": "..."},
)
user_proxy = autogen.UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
)

user_proxy.initiate_chat(
    assistant,
    message="Analyze Q2 sales data for Acme Corp and generate a summary report",
)
# second run with same or similar message: 2.66ms · 0 tokens · $0.00
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Mnemon's MOTH layer patches AutoGen at startup. No agent changes, no conversation changes.&lt;/p&gt;

&lt;h2&gt;
  What gets cached
&lt;/h2&gt;

&lt;p&gt;Every LLM call your agents make is intercepted. On repeat runs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exact match: same message, instant response from cache&lt;/li&gt;
&lt;li&gt;Semantic match: "Analyze Q2 sales for Acme Corp" matches "Generate Q2 sales analysis for Acme" (same task, different phrasing)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For multi-agent workflows where agents pass messages between each other, common sub-tasks (data parsing, formatting, summarization) hit the cache across different top-level goals.&lt;/p&gt;

&lt;h2&gt;
  For structured recurring workflows
&lt;/h2&gt;

&lt;p&gt;If your AutoGen setup runs the same workflow repeatedly with varying inputs, use &lt;code&gt;m.run()&lt;/code&gt; for segment-level caching:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import autogen, mnemon

m = mnemon.init()

def run_analysis(goal, inputs, context, capabilities, constraints):
    user_proxy.initiate_chat(assistant, message=goal)
    return user_proxy.last_message()["content"]

result = m.run(
    goal="Q2 sales analysis for Acme Corp",
    inputs={"quarter": "Q2", "client": "Acme Corp"},
    generation_fn=run_analysis,
)

print(result["tokens_saved"])   # tokens saved on this run
print(result["cache_level"])    # "system1" | "system2" | "miss"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  Numbers
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────┬────────────┬────────────┐
│         │ First run  │ Cached run │
├─────────┼────────────┼────────────┤
│ Tokens  │ ~1,250     │ 0          │
├─────────┼────────────┼────────────┤
│ Latency │ ~20s       │ 2.66ms     │
├─────────┼────────────┼────────────┤
│ Cost    │ full price │ $0.00      │
└─────────┴────────────┴────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;At 80% hit rate on recurring workflows: 93% token reduction.&lt;/p&gt;

&lt;h2&gt;
  Install
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install mnemon-ai           # exact match only
pip install mnemon-ai[full]     # + semantic matching (local, no API key)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import mnemon
mnemon.init()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>agents</category>
      <category>llm</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>LangChain Already Has a Cache. Here's Why I Replaced It.</title>
      <dc:creator>Mahika jadhav</dc:creator>
      <pubDate>Sat, 06 Jun 2026 10:44:50 +0000</pubDate>
      <link>https://dev.to/smartass4ever/langchain-already-has-a-cache-heres-why-i-replaced-it-1bpn</link>
      <guid>https://dev.to/smartass4ever/langchain-already-has-a-cache-heres-why-i-replaced-it-1bpn</guid>
      <description>&lt;p&gt;LangChain's built-in cache is real and it works:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langchain.globals import set_llm_cache
from langchain.cache import InMemoryCache
set_llm_cache(InMemoryCache())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Same input → instant response. I used it for months. Then I hit its ceiling.&lt;/p&gt;

&lt;h2&gt;
  The exact-match problem
&lt;/h2&gt;

&lt;p&gt;LangChain's cache is a key-value store. The key is the exact prompt string (together with the model configuration). Change one character, whether a date, a name, or a number, and it's a cache miss.&lt;/p&gt;

&lt;p&gt;For a scheduled pipeline running weekly:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Generate security report for Acme Corp, week of Jan 6"   → miss
"Generate security report for Acme Corp, week of Jan 13"  → miss
"Generate security report for Acme Corp, week of Jan 20"  → miss
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Three different strings. Three full LLM calls. The structure of that report is 90% identical every week. I was paying for the same reasoning every single week.&lt;/p&gt;

&lt;h2&gt;
  What I needed: semantic matching
&lt;/h2&gt;

&lt;p&gt;I switched to &lt;a href="https://github.com/smartass-4ever/Mnemon" rel="noopener noreferrer"&gt;mnemon-ai&lt;/a&gt;. Same two-line setup:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install mnemon-ai[full]   # includes local semantic embedder
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import mnemon
mnemon.init()

from langchain_anthropic import ChatAnthropic
llm = ChatAnthropic(model="claude-sonnet-4-6")
response = llm.invoke("Generate security report for Acme Corp, week of Jan 13")
# hits the Jan 6 cache entry: 2.66ms · 0 tokens · $0.00
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;"Week of Jan 13" matches "week of Jan 6" because they're semantically the same task: a weekly security report for the same client. Only genuinely novel inputs miss the cache.&lt;/p&gt;

&lt;h2&gt;
  Side-by-side comparison
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌───────────────────────────────────────┬────────────────────────┬───────────┐
│                                       │     set_llm_cache      │ mnemon-ai │
├───────────────────────────────────────┼────────────────────────┼───────────┤
│ Exact match caching                   │           ✅           │    ✅     │
├───────────────────────────────────────┼────────────────────────┼───────────┤
│ Semantic matching (similar inputs)    │           ❌           │    ✅     │
├───────────────────────────────────────┼────────────────────────┼───────────┤
│ Segment-level plan caching            │           ❌           │    ✅     │
├───────────────────────────────────────┼────────────────────────┼───────────┤
│ Zero code changes                     │ ❌ (need to set cache) │    ✅     │
├───────────────────────────────────────┼────────────────────────┼───────────┤
│ Works with CrewAI, AutoGen, LangGraph │           ❌           │    ✅     │
├───────────────────────────────────────┼────────────────────────┼───────────┤
│ Learning loop                         │           ❌           │    ✅     │
├───────────────────────────────────────┼────────────────────────┼───────────┤
│ Local, no external service            │           ✅           │    ✅     │
└───────────────────────────────────────┴────────────────────────┴───────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  When to use each
&lt;/h2&gt;

&lt;p&gt;Use &lt;code&gt;set_llm_cache&lt;/code&gt; if your inputs are truly identical across calls (same exact string every time) and you only use LangChain.&lt;/p&gt;

&lt;p&gt;Use mnemon-ai if your inputs vary even slightly, you run recurring workflows, or you use multiple frameworks.&lt;/p&gt;

&lt;h2&gt;
  Token savings at scale
&lt;/h2&gt;

&lt;p&gt;At 80% hit rate (typical for recurring workflows after ~10 runs):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌────────────┬─────────────────┐
│ Daily runs │ Monthly savings │
├────────────┼─────────────────┤
│ 100        │ ~$56            │
├────────────┼─────────────────┤
│ 1,000      │ ~$503           │
├────────────┼─────────────────┤
│ 10,000     │ ~$5,034         │
└────────────┴─────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  Try it
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install mnemon-ai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import mnemon
mnemon.init()
# drop-in for any existing LangChain code
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/smartass-4ever/Mnemon" rel="noopener noreferrer"&gt;smartass-4ever/Mnemon&lt;/a&gt;&lt;/p&gt;
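
&lt;p&gt;The exact-match ceiling described earlier in this post can be made concrete with a small sketch. It is not Mnemon's or LangChain's implementation: the dictionary cache, the &lt;code&gt;fuzzy_lookup&lt;/code&gt; helper, and the 0.9 threshold are all hypothetical, and plain string similarity from the standard library stands in for the embedding similarity a real semantic matcher would use.&lt;/p&gt;

```python
from difflib import SequenceMatcher

# Hypothetical cache holding one earlier run, keyed on the exact prompt.
cache = {
    "Generate security report for Acme Corp, week of Jan 6": "security report (cached)",
}


def exact_lookup(prompt):
    """Key-value lookup: any change to the prompt string is a miss."""
    return cache.get(prompt)


def fuzzy_lookup(prompt, threshold=0.9):
    """Return the entry whose key is most similar to the prompt, if similar enough.

    String similarity is only a stand-in for embedding similarity here.
    """
    best_key, best_score = None, 0.0
    for key in cache:
        score = SequenceMatcher(None, key, prompt).ratio()
        if score > best_score:
            best_key, best_score = key, score
    if best_key is not None and best_score >= threshold:
        return cache[best_key]
    return None


next_week = "Generate security report for Acme Corp, week of Jan 13"
print(exact_lookup(next_week))   # exact key differs by a date
print(fuzzy_lookup(next_week))   # near-identical prompt clears the threshold
print(fuzzy_lookup("Summarize invoice backlog for Globex"))  # unrelated task
```

&lt;p&gt;The threshold is the tuning knob in a design like this: set it too low and unrelated tasks share answers, set it too high and the lookup degrades back to exact matching.&lt;/p&gt;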

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>python</category>
    </item>
    <item>
      <title>How I Cut My LangGraph Agent's Token Costs by 93% with One Import</title>
      <dc:creator>Mahika jadhav</dc:creator>
      <pubDate>Sat, 06 Jun 2026 10:43:36 +0000</pubDate>
      <link>https://dev.to/smartass4ever/how-i-cut-my-langgraph-agents-token-costs-by-93-with-one-import-7b5</link>
      <guid>https://dev.to/smartass4ever/how-i-cut-my-langgraph-agents-token-costs-by-93-with-one-import-7b5</guid>
      <description>&lt;p&gt;I run a LangGraph pipeline that processes competitor intelligence reports every week. Same graph, same nodes, same conditional edges — just slightly different inputs each time. I was paying full LLM price on every run.&lt;/p&gt;

&lt;p&gt;After profiling it, I found that 90%+ of the graph traversal was identical across runs. The planner node always produced the same structure. The summarizer always took the same path. I was essentially paying to re-derive work my agent had already done.&lt;/p&gt;

&lt;p&gt;This is the core problem with LangGraph at scale: &lt;strong&gt;the graph is stateless by default&lt;/strong&gt;. Every invocation is a cold start.&lt;/p&gt;




&lt;h2&gt;
  
  
  The pattern that bleeds money
&lt;/h2&gt;

&lt;p&gt;If your LangGraph agent does any of the following, you're paying for redundant computation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scheduled pipelines (weekly reports, daily digests, recurring audits)&lt;/li&gt;
&lt;li&gt;Multi-step research agents that hit the same sources&lt;/li&gt;
&lt;li&gt;Document processing graphs with consistent structure&lt;/li&gt;
&lt;li&gt;Customer-facing agents that handle similar queries repeatedly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each run: full token cost. Full latency. Zero memory of previous executions.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I tried first
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prompt caching&lt;/strong&gt; — Anthropic and OpenAI both offer it. It helps with repeated &lt;em&gt;prefixes&lt;/em&gt;, not repeated &lt;em&gt;reasoning&lt;/em&gt;. When your graph re-derives a plan from slightly different inputs, prompt caching doesn't fire. You still pay.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Manual caching&lt;/strong&gt; — I added SQLite lookups at individual nodes. It worked but was brittle and broke every time I changed the graph structure.&lt;/p&gt;




&lt;h2&gt;
  
  
  The fix: execution-level caching
&lt;/h2&gt;

&lt;p&gt;I found &lt;a href="https://github.com/smartass-4ever/Mnemon" rel="noopener noreferrer"&gt;mnemon-ai&lt;/a&gt;, which caches at the plan level — not the prompt level.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install mnemon-ai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Two lines. Your existing graph stays completely unchanged:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import mnemon
mnemon.init()

# your existing LangGraph code, untouched
from langgraph.graph import StateGraph

workflow = StateGraph(MyState)
workflow.add_node("planner", planner_node)
workflow.add_node("researcher", researcher_node)
workflow.add_node("summarizer", summarizer_node)
workflow.add_edge("planner", "researcher")
workflow.add_edge("researcher", "summarizer")

app = workflow.compile()
result = app.invoke({"goal": "Competitor analysis for Acme Corp Q2"})
# second run: 2.66ms · 0 tokens · $0.00
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Mnemon auto-instruments LangGraph at import time. No wrappers, no graph restructuring.&lt;/p&gt;

&lt;h2&gt;
  How it works
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Exact match (System 1)&lt;/strong&gt;: a fingerprint of your goal + context. If your agent has solved this before, it returns the cached result in ~2.66ms. Zero LLM calls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Semantic match (System 2)&lt;/strong&gt;: if the goal is similar but not identical, Mnemon finds the closest prior execution and only regenerates the segments that actually changed. You pay for the delta, not the whole run.&lt;/p&gt;

&lt;p&gt;Enable semantic matching:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install mnemon-ai[full]  # local model, no API key needed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  Results
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌────────────────────────────┬────────┬────────┐
│           Metric           │ Before │ After  │
├────────────────────────────┼────────┼────────┤
│ Tokens per run (avg)       │ ~1,250 │ ~84    │
├────────────────────────────┼────────┼────────┤
│ LLM calls per run          │ 4      │ 0.27   │
├────────────────────────────┼────────┼────────┤
│ Latency (cache hit)        │ 18–22s │ 2.66ms │
├────────────────────────────┼────────┼────────┤
│ Monthly cost (1k runs/day) │ ~$503  │ ~$34   │
└────────────────────────────┴────────┴────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;93.3% token reduction. 7,500× faster on cache hits.&lt;/p&gt;

&lt;p&gt;The first run of a new goal still pays full cost. Every run after doesn't.&lt;/p&gt;

&lt;h2&gt;
  Try it
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install mnemon-ai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import mnemon
mnemon.init()
# your LangGraph agent now has memory across runs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Run &lt;code&gt;mnemon demo&lt;/code&gt; to see a live cache hit in 30 seconds. No API key needed.&lt;/p&gt;
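
&lt;p&gt;The headline figures can be re-derived from the results table in this post. The sketch below only redoes that arithmetic; the one assumption is that 20 seconds, the midpoint of the 18–22s range, is used as the "before" latency.&lt;/p&gt;

```python
# Figures copied from the results table (before, after).
tokens_before, tokens_after = 1250, 84
calls_before, calls_after = 4, 0.27
cost_before, cost_after = 503, 34

# Assumption: take the midpoint of the 18-22s range as the uncached latency.
latency_before_s = 20.0
latency_after_s = 2.66 / 1000  # 2.66ms expressed in seconds


def reduction(before, after):
    """Fraction of the original quantity that is no longer spent."""
    return 1 - after / before


token_reduction = reduction(tokens_before, tokens_after)
call_reduction = reduction(calls_before, calls_after)
cost_reduction = reduction(cost_before, cost_after)
speedup = latency_before_s / latency_after_s

print(f"token reduction: {token_reduction:.1%}")
print(f"call reduction:  {call_reduction:.1%}")
print(f"cost reduction:  {cost_reduction:.1%}")
print(f"speedup on hit:  {speedup:,.0f}x")
```

&lt;p&gt;Tokens, LLM calls, and monthly cost all shrink by roughly the same fraction, which is what you would expect if cost scales with tokens.&lt;/p&gt;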

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>agents</category>
      <category>tokencost</category>
    </item>
    <item>
      <title>Your AI Agent Is Wasting Tokens on Pages That Haven't Changed</title>
      <dc:creator>Mahika jadhav</dc:creator>
      <pubDate>Wed, 27 May 2026 15:55:13 +0000</pubDate>
      <link>https://dev.to/smartass4ever/your-ai-agent-is-wasting-tokens-on-pages-that-havent-changed-2fo4</link>
      <guid>https://dev.to/smartass4ever/your-ai-agent-is-wasting-tokens-on-pages-that-havent-changed-2fo4</guid>
      <description>&lt;p&gt;You built a web monitoring agent. It checks competitor pricing every hour, scans news feeds for signals, watches supplier pages for stock changes.&lt;/p&gt;

&lt;p&gt;It's working. And it's burning your API budget — on nothing.&lt;/p&gt;


&lt;h2&gt;
  The problem nobody talks about
&lt;/h2&gt;

&lt;p&gt;Here's what most web monitoring agents do on every run:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Fetch the page
2. Send the full content to the LLM
3. Ask "what changed?"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The page hasn't changed. The LLM doesn't know that. It reads 1,500 tokens of HTML, thinks carefully, and tells you: nothing changed.&lt;/p&gt;

&lt;p&gt;You paid for that. Every hour. For every URL.&lt;/p&gt;

&lt;p&gt;If you're monitoring 10 pages hourly, that's 240 LLM calls a day — most of them pointless.&lt;/p&gt;


&lt;h2&gt;
  What the right architecture looks like
&lt;/h2&gt;

&lt;p&gt;The fix is straightforward: &lt;strong&gt;don't call the LLM unless the page actually changed.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Run 1:  fetch 10 pages → analyse all 10 → cache insights
Run 2:  fetch 10 pages → 8 unchanged → 2 changed → LLM fires twice
Run 3:  fetch 10 pages → 9 unchanged → 1 changed → LLM fires once
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Over 24 hours of hourly checks across 10 pages: &lt;strong&gt;240 fetches, ~20 LLM calls&lt;/strong&gt; instead of 240. That's a reduction of more than 90% without losing a single real signal.&lt;/p&gt;

&lt;p&gt;The mechanics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fetch the page live on every run (you need fresh data, not a cache)&lt;/li&gt;
&lt;li&gt;Hash the content (SHA-256, fast, free)&lt;/li&gt;
&lt;li&gt;Compare to the last stored hash&lt;/li&gt;
&lt;li&gt;If identical → return cached insight, skip the LLM entirely&lt;/li&gt;
&lt;li&gt;If changed → extract the diff, send &lt;strong&gt;only the delta&lt;/strong&gt; to the LLM&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The second part matters as much as the first. When a page does change, you still shouldn't send the full page — you should send what's different. If a pricing page updates one number, the LLM needs 50 tokens, not 1,500.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why most agents don't do this
&lt;/h2&gt;

&lt;p&gt;Two reasons.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bot detection.&lt;/strong&gt; Consistent fetching from a static IP gets blocked fast. Most developers hit a wall here and either throttle to the point of uselessness or pay for proxy infrastructure separately. The fetching layer is genuinely hard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Boilerplate.&lt;/strong&gt; Hashing, diffing, caching, storing snapshots — it's not hard code, but it's 200 lines you have to write, test, and maintain before you've done anything useful. Most people skip it and just call the LLM every time.&lt;/p&gt;
&lt;h2&gt;
  Indra
&lt;/h2&gt;

&lt;p&gt;I built &lt;a href="https://github.com/smartass-4ever/Indra" rel="noopener noreferrer"&gt;Indra&lt;/a&gt; to solve both problems in one library.&lt;/p&gt;

&lt;p&gt;It uses &lt;a href="https://brightdata.com" rel="noopener noreferrer"&gt;Bright Data&lt;/a&gt;'s Web Unlocker on every fetch, so bot detection, CAPTCHAs, geo-blocks, and JavaScript rendering are all handled transparently. It also implements the full hash-diff-cache pipeline so you don't have to.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import indra
import anthropic

client = anthropic.Anthropic()

def llm(prompt: str) -&amp;gt; str:
    msg = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    return msg.content[0].text

agent = indra.init(brightdata_api_key="your-key")

result = agent.watch(
    url="https://competitor.com/pricing",
    question="Did any prices change? What are the implications?",
    generation_fn=llm,
)

print(result.changed)        # True / False
print(result.insight)        # LLM analysis, or cached answer if unchanged
print(result.tokens_saved)   # tokens skipped this run
print(result.cost_saved_usd) # dollar value of what was skipped
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;The &lt;code&gt;generation_fn&lt;/code&gt; is only called when the page actually changed. On unchanged runs it returns instantly with the cached answer: zero tokens, sub-millisecond.&lt;/p&gt;

&lt;h2&gt;
  Watching multiple pages
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;results = agent.watch_all(
    urls=[
        "https://openai.com/api/pricing/",
        "https://anthropic.com/pricing",
        "https://competitor.com/pricing",
    ],
    question="Did any prices change?",
    generation_fn=llm,
)

agent.print_stats()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output after a few runs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;──────────────────────────────────────────────────
  Indra Session Summary
──────────────────────────────────────────────────
  Bright Data fetches : 15
  Changes detected    : 1
  LLM calls fired     : 6
  Cache hits          : 9
  Tokens saved        : 12,000
  Cost saved          : $0.0360
  Efficiency          : 80%
──────────────────────────────────────────────────
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  Watching search results
&lt;/h2&gt;

&lt;p&gt;Indra also supports SERP monitoring, firing the LLM only when the ranking itself changes:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;result = agent.search_watch(
    query="openai new model announcement",
    question="Is there a major new release?",
    generation_fn=llm,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Same pattern: fetch live SERP results, hash them, skip the LLM if rankings haven't shifted.&lt;/p&gt;

&lt;h2&gt;
  Install
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install indra-ai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's early: rough edges exist, and we're fixing them fast. If you're building web monitoring into an agent and burning tokens on unchanged pages, give it a try.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/smartass-4ever/Indra" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;br&gt;
&lt;a href="https://pypi.org/project/indra-ai/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>agents</category>
      <category>mcp</category>
    </item>
    <item>
      <title>The Decision Machine</title>
      <dc:creator>Mahika jadhav</dc:creator>
      <pubDate>Sat, 23 May 2026 15:41:11 +0000</pubDate>
      <link>https://dev.to/smartass4ever/the-decision-machine-3kbo</link>
      <guid>https://dev.to/smartass4ever/the-decision-machine-3kbo</guid>
      <description>&lt;p&gt;by Mahika Jadhav, may 2026&lt;br&gt;
There are two paths artificial intelligence can take from here.&lt;/p&gt;

&lt;p&gt;The first is the one most labs are racing down: larger models, more training data, wider context windows. Tokens get cheaper. Attention spans get longer. The bet is that scale solves everything.&lt;/p&gt;

&lt;p&gt;It does not. Larger context windows lose the small details — the edge case from last month, the pattern that appeared once and mattered. More parameters hit diminishing returns. The architecture is the same. The ceiling is real.&lt;/p&gt;

&lt;p&gt;The second path is harder to see, but it is the only one that leads somewhere worth going.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;What Intelligence Actually Is&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The clearest measure of intelligence is adaptability: the ability to handle a situation never encountered before, using context, environment, and accumulated experience to make a decision.&lt;/p&gt;

&lt;p&gt;Not retrieve. Not predict. Decide.&lt;/p&gt;

&lt;p&gt;A decision is:&lt;/p&gt;

&lt;p&gt;(context + experience + goal) + core priors&lt;/p&gt;

&lt;p&gt;Core priors are the fixed layer — the equivalent of instinct. Everything else is learned. A system that can combine what it knows, what it has experienced, what it is trying to achieve, and what it fundamentally is — and produce a decision under novel conditions — is an adaptable system. And an adaptable system, by the only definition that matters, is an intelligent one.&lt;/p&gt;

&lt;p&gt;This is what AGI requires. Not more data. Not a bigger window. A self-growing, self-learning system that accumulates experience and uses it to make decisions.&lt;/p&gt;




&lt;p&gt;Mnemon Is One Piece&lt;/p&gt;

&lt;p&gt;Mnemon handles the experience component.&lt;/p&gt;

&lt;p&gt;Every time an agent runs a task, Mnemon records what happened. When a similar goal appears, the agent does not start from zero — it draws on accumulated execution memory. It reuses what worked. It learns from what failed. It gets better every run without retraining, without a larger model, without more parameters.&lt;/p&gt;

&lt;p&gt;The token savings are real. But the architecture is what matters: an agent that grows from experience.&lt;/p&gt;

&lt;p&gt;That is one component. The decision formula has others.&lt;/p&gt;




&lt;p&gt;Look Forward to EROS&lt;/p&gt;

&lt;p&gt;My vision is to build every component of this system — context, experience, goal alignment, core priors — into something complete.&lt;/p&gt;

&lt;p&gt;When it is finished, it will not just be an agent tool. It will be a human-machine interface — the layer through which humans and artificial intelligence interact as genuine partners, each contributing what the other cannot.&lt;/p&gt;

&lt;p&gt;EROS is that system.&lt;/p&gt;

&lt;p&gt;When it arrives, it will not just change how agents run. It will change how humans live alongside intelligence.&lt;/p&gt;

&lt;p&gt;We are building it now, one component at a time. Mnemon is the first.&lt;/p&gt;




&lt;p&gt;pip install mnemon-ai&lt;br&gt;
  &lt;a href="https://github.com/smartass-4ever/Mnemon" rel="noopener noreferrer"&gt;https://github.com/smartass-4ever/Mnemon&lt;/a&gt;&lt;br&gt;
  &lt;a href="mailto:mahikajadhav22@gmail.com"&gt;mahikajadhav22@gmail.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>python</category>
    </item>
    <item>
      <title>How I Cut My LangGraph Agent's Token Costs by 93% with One Import</title>
      <dc:creator>Mahika jadhav</dc:creator>
      <pubDate>Sun, 17 May 2026 12:07:24 +0000</pubDate>
      <link>https://dev.to/smartass4ever/how-i-cut-my-langgraph-agents-token-costs-by-93-with-one-import-4kii</link>
      <guid>https://dev.to/smartass4ever/how-i-cut-my-langgraph-agents-token-costs-by-93-with-one-import-4kii</guid>
      <description>&lt;p&gt;I run a LangGraph pipeline that processes competitor intelligence reports every week. Same graph, same nodes, same conditional edges — just slightly different inputs each time. I was paying full LLM price on every run.&lt;/p&gt;

&lt;p&gt;After profiling it, I found that 90%+ of the graph traversal was identical across runs. The planner node always produced the same structure. The summarizer always took the same path. I was essentially paying to re-derive work my agent had already done.&lt;/p&gt;

&lt;p&gt;This is the core problem with LangGraph at scale: the graph is stateless by default. Every invocation is a cold start.&lt;/p&gt;




&lt;p&gt;The pattern that bleeds money&lt;/p&gt;

&lt;p&gt;If your LangGraph agent does any of the following, you're paying for redundant computation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scheduled pipelines (weekly reports, daily digests, recurring audits)&lt;/li&gt;
&lt;li&gt;Multi-step research agents that hit the same sources&lt;/li&gt;
&lt;li&gt;Document processing graphs with consistent structure&lt;/li&gt;
&lt;li&gt;Customer-facing agents that handle similar queries repeatedly&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;What I tried first&lt;/p&gt;

&lt;p&gt;Prompt caching — helps with repeated prefixes, not repeated reasoning. When your graph re-derives a plan from slightly different inputs, prompt caching doesn't fire. You still pay.&lt;/p&gt;

&lt;p&gt;Manual caching: I added SQLite lookups at individual nodes. Brittle, framework-specific, broke every time I changed the graph.&lt;/p&gt;




&lt;p&gt;The fix&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install mnemon-ai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import mnemon
mnemon.init()

# your existing LangGraph code, untouched
from langgraph.graph import StateGraph

workflow = StateGraph(MyState)
workflow.add_node("planner", planner_node)
workflow.add_node("researcher", researcher_node)
workflow.add_node("summarizer", summarizer_node)
workflow.add_edge("planner", "researcher")
workflow.add_edge("researcher", "summarizer")

app = workflow.compile()
result = app.invoke({"goal": "Competitor analysis for Acme Corp Q2"})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Mnemon auto-instruments LangGraph at import. No wrappers, no graph restructuring.&lt;/p&gt;




&lt;p&gt;How it works&lt;/p&gt;

&lt;p&gt;System 1 (exact match): SHA-256 fingerprint of goal + context + inputs. Cache hit returns in ~2.66ms. Zero LLM calls.&lt;/p&gt;

&lt;p&gt;System 2 (semantic match): Similar but not identical goal? Finds the closest prior execution and regenerates only what changed. You pay for the delta, not the full run.&lt;/p&gt;




&lt;p&gt;Results&lt;/p&gt;

&lt;p&gt;45 executions across similar inputs:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌────────────────────────────┬────────┬────────┐
│           Metric           │ Before │ After  │
├────────────────────────────┼────────┼────────┤
│ Tokens per run             │ ~1,250 │ ~84    │
├────────────────────────────┼────────┼────────┤
│ LLM calls per run          │ 4      │ 0.27   │
├────────────────────────────┼────────┼────────┤
│ Latency (cache hit)        │ 18–22s │ 2.66ms │
├────────────────────────────┼────────┼────────┤
│ Monthly cost (1k runs/day) │ ~$503  │ ~$34   │
└────────────────────────────┴────────┴────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;93.3% token reduction. 7,500× faster on cache hits.&lt;/p&gt;




&lt;p&gt;What it doesn't fix&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Genuinely novel queries&lt;/li&gt;
&lt;li&gt;Real-time agents where freshness matters&lt;/li&gt;
&lt;li&gt;Cold starts — first run still hits the LLM&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;GitHub: smartass-4ever/Mnemon (&lt;a href="https://github.com/smartass-4ever/Mnemon" rel="noopener noreferrer"&gt;https://github.com/smartass-4ever/Mnemon&lt;/a&gt;)&lt;/p&gt;

</description>
      <category>llm</category>
      <category>langchain</category>
      <category>python</category>
      <category>ai</category>
    </item>
    <item>
      <title>How I cut my LangChain agent's token costs by 93% with one import</title>
      <dc:creator>Mahika jadhav</dc:creator>
      <pubDate>Thu, 14 May 2026 18:05:38 +0000</pubDate>
      <link>https://dev.to/smartass4ever/how-i-cut-my-langchain-agents-token-costs-by-93-with-one-import-2nc9</link>
      <guid>https://dev.to/smartass4ever/how-i-cut-my-langchain-agents-token-costs-by-93-with-one-import-2nc9</guid>
      <description>&lt;p&gt;My agent was generating the same weekly security report for the same three clients every Monday. Same context. Same reasoning structure. Same output format. I was paying full Anthropic API price every single time.&lt;/p&gt;

&lt;p&gt;I checked the logs. Across 45 runs of three recurring workflow types — security audits, invoice processing, weekly reports — the structure of the generated plan was materially identical run after run. The LLM was re-deriving the same skeleton every time. 93% of the tokens I was spending were redundant.&lt;/p&gt;

&lt;p&gt;This isn't a prompt engineering problem. It's a structural one.&lt;/p&gt;




&lt;p&gt;The Problem With Stateless Frameworks&lt;/p&gt;

&lt;p&gt;Every major agent framework (LangChain, LangGraph, CrewAI, AutoGen) is stateless by default. There is no memory of previous executions at the plan level. Each invocation starts from zero.&lt;/p&gt;

&lt;p&gt;This is fine for one-off queries. For recurring workflows — scheduled reports, compliance checks, data pipelines, anything that runs the same class of task repeatedly — it means you pay full LLM price every time, forever.&lt;/p&gt;

&lt;p&gt;Prompt caching (Anthropic's and OpenAI's built-in feature) helps with input tokens when the start of the prompt is identical across calls. It helps far less when your inputs vary per run. It doesn't eliminate the API call. And it does nothing for the reasoning and plan generation that happens downstream.&lt;/p&gt;

&lt;p&gt;What you actually need is execution caching — caching at the plan level, not the prompt level.&lt;/p&gt;




&lt;p&gt;The Solution: Cache the Execution Plan&lt;/p&gt;

&lt;p&gt;The idea: on first run, fingerprint the execution plan and store it as segments. On subsequent runs with the same or semantically similar goal, serve the plan from cache. Skip the LLM entirely.&lt;/p&gt;

&lt;p&gt;Two modes:&lt;/p&gt;

&lt;p&gt;System 1 — Exact match. SHA-256 fingerprint of goal + context + inputs. If it matches a stored plan, reconstruct from local SQLite in ~2.66ms. Zero API calls. Zero tokens.&lt;/p&gt;
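&lt;p&gt;To make the exact-match path concrete, here is a minimal sketch, assuming a canonical-JSON encoding of the three fields and a single-table SQLite layout. The &lt;code&gt;fingerprint&lt;/code&gt; helper, the &lt;code&gt;ExactPlanCache&lt;/code&gt; class and the table schema are hypothetical names for illustration, not Mnemon's actual internals.&lt;/p&gt;

```python
import hashlib
import json
import sqlite3


def fingerprint(goal, context, inputs):
    """SHA-256 over a canonical JSON encoding of goal + context + inputs."""
    payload = json.dumps(
        {"goal": goal, "context": context, "inputs": inputs},
        sort_keys=True,            # same data in a different key order hashes the same
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ExactPlanCache:
    """Hypothetical exact-match store; not Mnemon's actual schema."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS plans (fp TEXT PRIMARY KEY, plan TEXT)"
        )

    def get(self, fp):
        row = self.db.execute(
            "SELECT plan FROM plans WHERE fp = ?", (fp,)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def put(self, fp, plan):
        self.db.execute(
            "INSERT OR REPLACE INTO plans (fp, plan) VALUES (?, ?)",
            (fp, json.dumps(plan)),
        )
        self.db.commit()


cache = ExactPlanCache()
fp = fingerprint("weekly security audit", {"tier": "gold"}, {"client": "Acme Corp"})
first_lookup = cache.get(fp)                       # nothing stored yet
cache.put(fp, ["collect logs", "scan", "summarize"])
second_lookup = cache.get(fp)                      # served from SQLite, no LLM call
```

&lt;p&gt;Sorting keys before hashing matters here: without it, two dicts holding the same data in a different order would produce different fingerprints and miss the cache.&lt;/p&gt;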

&lt;p&gt;System 2 — Semantic match. Goal is similar but not identical — same workflow, different client name or date. Match the stored plan by embedding similarity. Diff the segments. Regenerate only the parts that changed. Pay for the delta, not the full plan.&lt;/p&gt;
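&lt;p&gt;The semantic path can be sketched in the same spirit. This is an illustration under stated assumptions, not Mnemon's implementation: &lt;code&gt;embed&lt;/code&gt; is a toy bag-of-words stand-in for a real sentence embedding, the 0.6 threshold is arbitrary, and the &lt;code&gt;depends_on&lt;/code&gt; field on each segment is a hypothetical way to record which inputs a segment uses.&lt;/p&gt;

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real sentence embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def changed_segments(cached_plan, old_inputs, new_inputs):
    """Names of segments that depend on an input whose value changed."""
    changed_keys = {
        key
        for key in set(old_inputs) | set(new_inputs)
        if old_inputs.get(key) != new_inputs.get(key)
    }
    return [
        seg["name"]
        for seg in cached_plan
        if changed_keys.intersection(seg["depends_on"])
    ]


SIMILARITY_THRESHOLD = 0.6  # assumed value, purely illustrative

cached_goal = "weekly security audit for Acme Corp"
cached_inputs = {"client": "Acme Corp", "week": "2026-05-07"}
cached_plan = [
    {"name": "collect_logs", "depends_on": ["client", "week"]},
    {"name": "scan_checklist", "depends_on": []},
    {"name": "report_template", "depends_on": []},
]

new_goal = "weekly security audit for Globex Inc"
new_inputs = {"client": "Globex Inc", "week": "2026-05-07"}

score = cosine(embed(cached_goal), embed(new_goal))
if score >= SIMILARITY_THRESHOLD:
    # Similar enough: reuse the plan and regenerate only the affected segments.
    to_regenerate = changed_segments(cached_plan, cached_inputs, new_inputs)
else:
    # Too different: treat as a miss and regenerate everything.
    to_regenerate = [seg["name"] for seg in cached_plan]
```

&lt;p&gt;In this example only &lt;code&gt;collect_logs&lt;/code&gt; depends on the client name, so it is the only segment that would go back to the LLM. The other two are reused as cached.&lt;/p&gt;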

&lt;p&gt;A background process called the Retrospector quarantines failed segments so bad patterns never get reused. A signal bus tracks latency baselines and failure rates per workflow type and feeds that back to strengthen or weaken cached patterns. The cache gets smarter over time, not just bigger.&lt;/p&gt;
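&lt;p&gt;As a rough picture of what quarantining could look like, the sketch below counts failures per segment and excludes a segment once it crosses a limit. The &lt;code&gt;SegmentHealth&lt;/code&gt; class, its method names and the two-failure limit are all hypothetical. How the Retrospector actually decides is not described here.&lt;/p&gt;

```python
from collections import defaultdict


class SegmentHealth:
    """Hypothetical failure tracker; names and threshold are illustrative."""

    def __init__(self, max_failures=2):
        self.max_failures = max_failures
        self.failures = defaultdict(int)
        self.quarantined = set()

    def record(self, segment_id, ok):
        """Record one outcome; quarantine the segment after repeated failures."""
        if ok:
            return
        self.failures[segment_id] += 1
        if self.failures[segment_id] >= self.max_failures:
            self.quarantined.add(segment_id)

    def usable(self, segment_ids):
        """Filter out quarantined segments before a plan is reassembled."""
        return [s for s in segment_ids if s not in self.quarantined]


health = SegmentHealth()
health.record("collect_logs", ok=False)
health.record("collect_logs", ok=False)   # second failure triggers quarantine
health.record("scan_checklist", ok=True)
usable = health.usable(["collect_logs", "scan_checklist"])
```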




&lt;p&gt;In Practice&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import mnemon
mnemon.init()

# everything below is unchanged
from langchain_anthropic import ChatAnthropic
llm = ChatAnthropic(model="claude-sonnet-4-6")
response = llm.invoke("Generate weekly security report for Acme Corp")
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;That's it. mnemon.init() patches BaseChatModel.invoke and ainvoke at import time. The first call goes to the LLM and gets cached. Every subsequent call with the same or semantically equivalent goal is served from local SQLite.&lt;/p&gt;
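&lt;p&gt;For readers unfamiliar with this kind of patching, here is a generic sketch of the technique: wrap a method on a class so a lookup runs before the original. &lt;code&gt;FakeChatModel&lt;/code&gt; and &lt;code&gt;patch_with_cache&lt;/code&gt; are made-up names, the cache is a plain dict keyed on the exact prompt, and none of this is Mnemon's or LangChain's real code.&lt;/p&gt;

```python
import functools


class FakeChatModel:
    """Stand-in for an LLM client class; not a real LangChain type."""

    calls = 0

    def invoke(self, prompt):
        FakeChatModel.calls += 1
        return f"response to: {prompt}"


def patch_with_cache(cls, method_name, store):
    """Replace cls.method_name with a wrapper that consults `store` first."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, prompt, *args, **kwargs):
        if prompt in store:
            return store[prompt]            # hit: the original is never called
        result = original(self, prompt, *args, **kwargs)
        store[prompt] = result              # miss: call through, then remember
        return result

    setattr(cls, method_name, wrapper)


store = {}
patch_with_cache(FakeChatModel, "invoke", store)

llm = FakeChatModel()
first = llm.invoke("Generate weekly security report for Acme Corp")
second = llm.invoke("Generate weekly security report for Acme Corp")
```

&lt;p&gt;After the two calls, &lt;code&gt;FakeChatModel.calls&lt;/code&gt; is 1: the second call never reaches the underlying method. Because the patch is applied to the class, every existing and future instance picks it up, which is why calling code does not need to change.&lt;/p&gt;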

&lt;p&gt;For explicit control over what gets cached:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import mnemon
from anthropic import Anthropic

client = Anthropic()
m = mnemon.init()

def generate_report(goal, inputs, context, capabilities, constraints):
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": goal}],
    )
    return response.content[0].text

result = m.run(
    goal="weekly security audit for Acme Corp",
    inputs={"client": "Acme Corp", "week": "2026-05-14"},
    generation_fn=generate_report,
)

print(result["output"])            # the actual result
print(result["cache_level"])       # "system1" | "system2" | "miss"
print(result["tokens_saved"])      # 1250 on a hit, 0 on first run
print(result["latency_saved_ms"])  # 20000.0 on a hit
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;generation_fn is only called on a cache miss. On a hit, it's never invoked.&lt;/p&gt;




&lt;p&gt;Benchmark Results&lt;/p&gt;

&lt;p&gt;Tested across 45 runs of three recurring workflow types on claude-sonnet-4-6:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Metric&lt;/th&gt;&lt;th&gt;Result&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Cache misses (first run per type)&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;System 2 hits&lt;/td&gt;&lt;td&gt;12&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;System 1 hits&lt;/td&gt;&lt;td&gt;30&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Token reduction&lt;/td&gt;&lt;td&gt;93.3%&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;LLM call reduction&lt;/td&gt;&lt;td&gt;93%&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;System 1 hit latency&lt;/td&gt;&lt;td&gt;2.66ms&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Fresh generation latency&lt;/td&gt;&lt;td&gt;~20,000ms&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Speedup&lt;/td&gt;&lt;td&gt;7,500×&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;50 concurrent agents serving the same workflow type in a single burst: 0 LLM calls, 62,500 tokens saved, 0.18 seconds total wall time.&lt;/p&gt;
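&lt;p&gt;The headline figures can be re-derived from the numbers quoted in this post. The short check below assumes that every System 1 or System 2 hit counts as a skipped call and that each hit saves 1,250 tokens (the &lt;code&gt;tokens_saved&lt;/code&gt; value shown earlier). Both are readings of the post, not confirmed methodology.&lt;/p&gt;

```python
runs = 45
misses = 3
system2_hits = 12
system1_hits = 30
assert misses + system2_hits + system1_hits == runs

hit_rate = (system1_hits + system2_hits) / runs   # 42 / 45
speedup = 20_000 / 2.66                           # fresh latency / System 1 hit latency
burst_tokens = 50 * 1250                          # 50 agents x tokens saved per hit

print(f"hit rate: {hit_rate:.1%}")                # 93.3%
print(f"speedup: {speedup:,.0f}x")                # 7,519x
print(f"burst tokens saved: {burst_tokens:,}")    # 62,500
```

&lt;p&gt;The computed speedup of roughly 7,519× is consistent with the rounded 7,500× reported above.&lt;/p&gt;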

&lt;p&gt;At scale with 80% System 1 and 15% System 2 hit rates:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Daily plans&lt;/th&gt;&lt;th&gt;Monthly cost saved&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;1,000&lt;/td&gt;&lt;td&gt;$503&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;10,000&lt;/td&gt;&lt;td&gt;$5,034&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;100,000&lt;/td&gt;&lt;td&gt;$50,344&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Raw data and methodology in /reports (&lt;a href="https://github.com/smartass-4ever/Mnemon/tree/main/reports" rel="noopener noreferrer"&gt;https://github.com/smartass-4ever/Mnemon/tree/main/reports&lt;/a&gt;).&lt;/p&gt;




&lt;p&gt;What It Supports&lt;/p&gt;

&lt;p&gt;Auto-instruments at import time — no code changes needed:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Framework&lt;/th&gt;&lt;th&gt;What gets patched&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Anthropic SDK&lt;/td&gt;&lt;td&gt;client.messages.create&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;OpenAI SDK&lt;/td&gt;&lt;td&gt;client.chat.completions.create&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;LangChain&lt;/td&gt;&lt;td&gt;BaseChatModel.invoke / ainvoke&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;LangGraph&lt;/td&gt;&lt;td&gt;CompiledGraph.invoke / ainvoke&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;CrewAI&lt;/td&gt;&lt;td&gt;crew kickoff via event bus&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;AutoGen&lt;/td&gt;&lt;td&gt;ConversableAgent.generate_reply&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;




&lt;p&gt;Honest Caveats&lt;/p&gt;

&lt;p&gt;System 2 segment-level savings require sentence-transformers. Without &lt;code&gt;pip install mnemon-ai[embeddings]&lt;/code&gt;, System 2 still works (it serves the full cached plan when goal similarity clears the threshold), but you don't get partial-segment delta savings. System 1 is unaffected.&lt;/p&gt;

&lt;p&gt;This doesn't help for novel one-off queries. If every invocation is genuinely unique, there's nothing to cache. The savings compound on scheduled or event-triggered workflows running the same class of task repeatedly.&lt;/p&gt;

&lt;p&gt;The 2.66ms latency is for warm cache hits. Cold start (first run per workflow type) still goes to the LLM.&lt;/p&gt;




&lt;p&gt;Diagnostics&lt;/p&gt;

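&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;m = mnemon.get()       # retrieve from anywhere in your codebase
m.get_stats()          # EME hits/misses, bus signals, DB size
m.drift_report()       # cross-session latency degradation
m.waste_report         # repeated queries and cumulative cost
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mnemon doctor          # health check
mnemon demo            # live demo, no API key needed
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;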




&lt;p&gt;Try It&lt;/p&gt;

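&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install mnemon-ai
mnemon demo
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;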

&lt;p&gt;No API key needed to run the demo. Source and full benchmark data on GitHub (&lt;a href="https://github.com/smartass-4ever/Mnemon" rel="noopener noreferrer"&gt;https://github.com/smartass-4ever/Mnemon&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Happy to answer questions on the segment diffing logic or the failure quarantine mechanism — those are the interesting parts architecturally.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>python</category>
      <category>langchain</category>
    </item>
  </channel>
</rss>
