<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mahika jadhav</title>
    <description>The latest articles on DEV Community by Mahika jadhav (@smartass4ever).</description>
    <link>https://dev.to/smartass4ever</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3931774%2F4d77ad9c-0871-489f-a805-0c3e1dafb4cc.png</url>
      <title>DEV Community: Mahika jadhav</title>
      <link>https://dev.to/smartass4ever</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/smartass4ever"/>
    <language>en</language>
    <item>
      <title>How I cut my LangChain agent's token costs by 93% with one import</title>
      <dc:creator>Mahika jadhav</dc:creator>
      <pubDate>Thu, 14 May 2026 18:05:38 +0000</pubDate>
      <link>https://dev.to/smartass4ever/how-i-cut-my-langchain-agents-token-costs-by-93-with-one-import-2nc9</link>
      <guid>https://dev.to/smartass4ever/how-i-cut-my-langchain-agents-token-costs-by-93-with-one-import-2nc9</guid>
      <description>&lt;p&gt;My agent was generating the same weekly security report for the same three clients every Monday. Same context. Same reasoning structure. Same output format. I was paying full Anthropic API price every single time.&lt;/p&gt;

&lt;p&gt;I checked the logs. Across 45 runs of three recurring workflow types — security audits, invoice processing, weekly reports — the structure of the generated plan was materially identical run after run. The LLM was re-deriving the same skeleton every time. 93% of the tokens I was spending were redundant.&lt;/p&gt;

&lt;p&gt;This isn't a prompt engineering problem. It's a structural one.&lt;/p&gt;




&lt;p&gt;The Problem With Stateless Frameworks&lt;/p&gt;

&lt;p&gt;Every major agent framework — LangChain, LangGraph, CrewAI, AutoGen — is stateless by default. There is no memory of previous executions at the plan level. Each invocation starts from zero.&lt;/p&gt;

&lt;p&gt;This is fine for one-off queries. For recurring workflows — scheduled reports, compliance checks, data pipelines, anything that runs the same class of task repeatedly — it means you pay full LLM price every time, forever.&lt;/p&gt;

&lt;p&gt;Prompt caching (Anthropic's and OpenAI's built-in feature) helps with input tokens on identical prompts. It doesn't help when your inputs vary slightly per run. It doesn't eliminate the API call. And it does nothing for the reasoning and plan generation that happens downstream.&lt;/p&gt;

&lt;p&gt;What you actually need is execution caching — caching at the plan level, not the prompt level.&lt;/p&gt;




&lt;p&gt;The Solution: Cache the Execution Plan&lt;/p&gt;

&lt;p&gt;The idea: on first run, fingerprint the execution plan and store it as segments. On subsequent runs with the same or semantically similar goal, serve the plan from cache. Skip the LLM entirely.&lt;/p&gt;

&lt;p&gt;Two modes:&lt;/p&gt;

&lt;p&gt;System 1 — Exact match. SHA-256 fingerprint of goal + context + inputs. If it matches a stored plan, reconstruct from local SQLite in ~2.66ms. Zero API calls. Zero tokens.&lt;/p&gt;
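
&lt;p&gt;For intuition, the exact-match path is essentially a content-addressed lookup. Here's a minimal sketch, assuming a plain SQLite table and hypothetical names (this is not Mnemon's actual schema):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import hashlib
import json
import sqlite3

def fingerprint(goal, context, inputs):
    # Canonical JSON so dict key order can't change the hash
    payload = json.dumps({"goal": goal, "context": context, "inputs": inputs},
                         sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

con = sqlite3.connect("plan_cache.db")  # illustrative filename, not the real store
con.execute("CREATE TABLE IF NOT EXISTS plans (fp TEXT PRIMARY KEY, plan TEXT)")

def lookup(goal, context, inputs):
    row = con.execute("SELECT plan FROM plans WHERE fp = ?",
                      (fingerprint(goal, context, inputs),)).fetchone()
    return json.loads(row[0]) if row else None  # a hit means zero API calls, zero tokens
&lt;/code&gt;&lt;/pre&gt;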

&lt;p&gt;System 2 — Semantic match. Goal is similar but not identical — same workflow, different client name or date. Match the stored plan by embedding similarity. Diff the segments. Regenerate only the parts that changed. Pay for the delta, not the full plan.&lt;/p&gt;
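
&lt;p&gt;The semantic path is the more interesting one. A rough sketch of how it could work, assuming sentence-transformers for the embeddings and a hypothetical depends_on field on each cached segment (the threshold and helper names are illustrative, not the library's API):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
SIMILARITY_THRESHOLD = 0.85                      # illustrative value

def closest_cached_goal(goal, cached_goals):
    # Index of the most similar stored goal, or None if nothing clears the bar
    scores = util.cos_sim(model.encode(goal, convert_to_tensor=True),
                          model.encode(cached_goals, convert_to_tensor=True))[0]
    best = int(scores.argmax())
    return best if float(scores[best]) &gt;= SIMILARITY_THRESHOLD else None

def diff_segments(cached_segments, old_inputs, new_inputs, regenerate):
    # Reuse segments whose inputs didn't change; pay LLM tokens only for the rest
    out = []
    for seg in cached_segments:
        changed = any(old_inputs.get(k) != new_inputs.get(k) for k in seg["depends_on"])
        out.append(regenerate(seg, new_inputs) if changed else seg)
    return out
&lt;/code&gt;&lt;/pre&gt;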

&lt;p&gt;A background process called the Retrospector quarantines failed segments so bad patterns never get reused. A signal bus tracks latency baselines and failure rates per workflow type and feeds that back to strengthen or weaken cached patterns. The cache gets smarter over time, not just bigger.&lt;/p&gt;
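
&lt;p&gt;The feedback loop itself is not exotic. Sketched in Python, with made-up table and column names just to show the shape (not the real internals):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import time

def record_signal(con, workflow_type, segment_id, ok, latency_ms):
    # Every execution reports back: success or failure, plus how long it took
    con.execute(
        "INSERT INTO signals (workflow_type, segment_id, ok, latency_ms, ts) "
        "VALUES (?, ?, ?, ?, ?)",
        (workflow_type, segment_id, int(ok), latency_ms, time.time()),
    )

def retrospect(con, failure_threshold=0.3):
    # Quarantine any segment whose observed failure rate is too high to trust
    for segment_id, fail_rate in con.execute(
        "SELECT segment_id, AVG(1 - ok) FROM signals GROUP BY segment_id"
    ):
        if fail_rate &gt;= failure_threshold:
            con.execute("UPDATE segments SET quarantined = 1 WHERE id = ?", (segment_id,))
&lt;/code&gt;&lt;/pre&gt;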




&lt;p&gt;In Practice&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import mnemon
mnemon.init()

# everything below is unchanged
from langchain_anthropic import ChatAnthropic
llm = ChatAnthropic(model="claude-sonnet-4-6")
response = llm.invoke("Generate weekly security report for Acme Corp")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;That's it. mnemon.init() patches BaseChatModel.invoke and ainvoke at import time. The first call goes to the LLM and gets cached. Every subsequent call with the same or semantically equivalent goal is served from local SQLite.&lt;/p&gt;
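
&lt;p&gt;If you're wondering what "patches at import time" means mechanically, it's ordinary monkey-patching of the base class method. A simplified sketch of the pattern (the cache helpers here are hypothetical stand-ins, not Mnemon's code):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from langchain_core.language_models import BaseChatModel

_original_invoke = BaseChatModel.invoke

def _cached_invoke(self, input, config=None, **kwargs):
    key = fingerprint_of(input)        # hypothetical helper: hash the goal/context/inputs
    hit = cache_lookup(key)            # hypothetical helper: local SQLite lookup
    if hit is not None:
        return hit                     # served from cache, no API call
    result = _original_invoke(self, input, config=config, **kwargs)
    cache_store(key, result)           # hypothetical helper: persist for the next run
    return result

BaseChatModel.invoke = _cached_invoke  # every LangChain chat model call now checks the cache first
&lt;/code&gt;&lt;/pre&gt;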

&lt;p&gt;For explicit control over what gets cached:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import mnemon
from anthropic import Anthropic

client = Anthropic()
m = mnemon.init()

def generate_report(goal, inputs, context, capabilities, constraints):
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": goal}],
    )
    return response.content[0].text

result = m.run(
    goal="weekly security audit for Acme Corp",
    inputs={"client": "Acme Corp", "week": "2026-05-14"},
    generation_fn=generate_report,
)

print(result["output"])            # the actual result
print(result["cache_level"])       # "system1" | "system2" | "miss"
print(result["tokens_saved"])      # 1250 on a hit, 0 on first run
print(result["latency_saved_ms"])  # 20000.0 on a hit
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;generation_fn is only called on a cache miss. On a hit, it's never invoked.&lt;/p&gt;
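
&lt;p&gt;Conceptually, run() owns the hit/miss branch. A stripped-down sketch of that control flow, with an in-memory dict standing in for the SQLite store (figures illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;_CACHE = {}  # stand-in for the local SQLite store

def run_sketch(goal, inputs, generation_fn):
    key = (goal, tuple(sorted(inputs.items())))  # stand-in for the SHA-256 fingerprint
    if key in _CACHE:
        # Hit: generation_fn is never touched, no tokens spent
        return {"output": _CACHE[key], "cache_level": "system1", "tokens_saved": 1250}
    output = generation_fn(goal, inputs, None, None, None)  # miss: pay for the LLM once
    _CACHE[key] = output
    return {"output": output, "cache_level": "miss", "tokens_saved": 0}
&lt;/code&gt;&lt;/pre&gt;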




&lt;p&gt;Benchmark Results&lt;/p&gt;

&lt;p&gt;Tested across 45 runs of three recurring workflow types on claude-sonnet-4-6:&lt;/p&gt;

&lt;pre&gt;┌───────────────────────────────────┬───────────┐
│              Metric               │  Result   │
├───────────────────────────────────┼───────────┤
│ Cache misses (first run per type) │ 3         │
├───────────────────────────────────┼───────────┤
│ System 2 hits                     │ 12        │
├───────────────────────────────────┼───────────┤
│ System 1 hits                     │ 30        │
├───────────────────────────────────┼───────────┤
│ Token reduction                   │ 93.3%     │
├───────────────────────────────────┼───────────┤
│ LLM call reduction                │ 93%       │
├───────────────────────────────────┼───────────┤
│ System 1 hit latency              │ 2.66ms    │
├───────────────────────────────────┼───────────┤
│ Fresh generation latency          │ ~20,000ms │
├───────────────────────────────────┼───────────┤
│ Speedup                           │ 7,500×    │
└───────────────────────────────────┴───────────┘&lt;/pre&gt;

&lt;p&gt;50 concurrent agents serving the same workflow type in a single burst: 0 LLM calls, 62,500 tokens saved, 0.18 seconds total wall time.&lt;/p&gt;

&lt;p&gt;At scale with 80% System 1 and 15% System 2 hit rates:&lt;/p&gt;

&lt;pre&gt;┌─────────────┬────────────────────┐
│ Daily plans │ Monthly cost saved │
├─────────────┼────────────────────┤
│ 1,000       │ $503               │
├─────────────┼────────────────────┤
│ 10,000      │ $5,034             │
├─────────────┼────────────────────┤
│ 100,000     │ $50,344            │
└─────────────┴────────────────────┘&lt;/pre&gt;
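
&lt;p&gt;If you want to sanity-check those numbers against your own traffic, the back-of-the-envelope arithmetic is simple. This helper is mine, not part of the library, and the per-plan cost and System 2 delta fraction are assumptions you'd swap for your own pricing:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def monthly_savings(daily_plans, cost_per_plan_usd,
                    s1_rate=0.80, s2_rate=0.15, s2_delta_fraction=0.30):
    # System 1 hits skip the full plan cost; System 2 hits pay only for the regenerated delta
    saved_per_plan = (s1_rate * cost_per_plan_usd
                      + s2_rate * (1 - s2_delta_fraction) * cost_per_plan_usd)
    return 30 * daily_plans * saved_per_plan

# e.g. 1,000 plans/day at roughly $0.019 per freshly generated plan
print(round(monthly_savings(1_000, 0.019)))  # ~516, the same ballpark as the table above
&lt;/code&gt;&lt;/pre&gt;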

&lt;p&gt;Raw data and methodology in /reports (&lt;a href="https://github.com/smartass-4ever/Mnemon/tree/main/reports" rel="noopener noreferrer"&gt;https://github.com/smartass-4ever/Mnemon/tree/main/reports&lt;/a&gt;).&lt;/p&gt;




&lt;p&gt;What It Supports&lt;/p&gt;

&lt;p&gt;Auto-instruments at import time — no code changes needed:&lt;/p&gt;

&lt;pre&gt;┌───────────────┬─────────────────────────────────┐
│   Framework   │        What gets patched        │
├───────────────┼─────────────────────────────────┤
│ Anthropic SDK │ client.messages.create          │
├───────────────┼─────────────────────────────────┤
│ OpenAI SDK    │ client.chat.completions.create  │
├───────────────┼─────────────────────────────────┤
│ LangChain     │ BaseChatModel.invoke / ainvoke  │
├───────────────┼─────────────────────────────────┤
│ LangGraph     │ CompiledGraph.invoke / ainvoke  │
├───────────────┼─────────────────────────────────┤
│ CrewAI        │ crew kickoff via event bus      │
├───────────────┼─────────────────────────────────┤
│ AutoGen       │ ConversableAgent.generate_reply │
└───────────────┴─────────────────────────────────┘&lt;/pre&gt;




&lt;p&gt;Honest Caveats&lt;/p&gt;

&lt;p&gt;System 2 segment-level savings require sentence-transformers. Without pip install mnemon-ai[embeddings], System 2 still works — it serves the full cached plan when goal similarity clears the threshold — but you don't get partial-segment delta savings. System 1 is unaffected.&lt;/p&gt;

&lt;p&gt;This doesn't help for novel one-off queries. If every invocation is genuinely unique, there's nothing to cache. The savings compound on scheduled or event-triggered workflows running the same class of task repeatedly.&lt;/p&gt;

&lt;p&gt;The 2.66ms latency is for warm cache hits. Cold start (first run per workflow type) still goes to the LLM.&lt;/p&gt;




&lt;p&gt;Diagnostics&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;m = mnemon.get()       # retrieve from anywhere in your codebase
m.get_stats()          # EME hits/misses, bus signals, DB size
m.drift_report()       # cross-session latency degradation
m.waste_report         # repeated queries and cumulative cost
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;mnemon doctor          # health check
mnemon demo            # live demo, no API key needed
&lt;/code&gt;&lt;/pre&gt;




&lt;p&gt;Try It&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;pip install mnemon-ai
mnemon demo
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;No API key needed to run the demo. Source and full benchmark data on GitHub (&lt;a href="https://github.com/smartass-4ever/Mnemon" rel="noopener noreferrer"&gt;https://github.com/smartass-4ever/Mnemon&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Happy to answer questions on the segment diffing logic or the failure quarantine mechanism — those are the interesting parts architecturally.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>python</category>
      <category>langchain</category>
    </item>
  </channel>
</rss>
