DEV Community: Stanislav

Debugging Benchmark: 6 LLM Models on a Real Race Condition Bug

Stanislav — Tue, 07 Jul 2026 14:01:53 +0000

A comprehensive comparison of DeepSeek V4 Pro, MiMo V2.5 Pro, DeepSeek V4 Flash, MiMo V2.5, GLM 5.2, and Kimi K2.6 on a genuine production bug — including architecture analysis of each solution

TL;DR

Model	Time	Cost	Cache Hit	Bugs Found	V2 Approach	V2 = Reference?
DeepSeek V4 Flash	~7 min	$0.040	93.3%	1 (unique)	Architectural	✅ Yes
MiMo V2.5 (cheap)	~27 min	$0.067	93.3%	1	Guaranteed cleanup	❌ No
MiMo V2.5 Pro	~15 min	$0.13	95.2%	3	Three-phase	✅ Yes
DeepSeek V4 Pro	~8 min	$0.14	92.1%	1	Lock-based atomic	✅ Yes
Kimi K2.6	~53 min	$1.04	91.5%	43 req	1	I/O before state
GLM 5.2	~28 min	$1.28	96.7%	1	Collect-then-close	✅ Yes

Key findings:

Cheapest: DeepSeek V4 Flash ($0.04) — found a unique bug no other model caught
Best debugger: MiMo V2.5 Pro (3 bugs vs 1 for everyone else)
All models converged on prevention in Round 2 — but via different architectures
Round 1 revealed a gap: all models default to cleanup thinking; prevention requires guidance

Why This Benchmark?

Most LLM benchmarks test coding ability — write a function, solve a puzzle. But debugging is harder than writing code. You need to:

Understand complex, multi-file codebases
Find non-obvious root causes
Explain the mechanism clearly
Propose a correct fix with the right architecture

We tested this on a real race condition bug from httpcore — a production library used by httpx.

The Bug: httpcore #961

Repository: encode/httpcore
Issue: #961 - Race Condition After Async Cancellations Breaks Connection Pool
Fix PR: #880 - Safe async cancellations

When async tasks are cancelled during connection operations, the pool's internal state becomes inconsistent. The pool thinks connections are still in use when they're actually cancelled, leading to pool exhaustion — new requests can never acquire a connection.

Why it's hard:

Multi-file: connection_pool.py, connection.py, http2.py
Async-specific: only manifests with asyncio/trio cancellation
Non-obvious: logs show normal operation
Real-world: production issue affecting real users

Methodology

Each model received the entire httpcore project at the commit BEFORE the fix. Two rounds:

Round 1: Find the bug, explain the mechanism, propose a fix.

Round 2: Same prompt for all models — a hint about "atomic state management" to guide them from patches to prevention.

Round 1: Finding the Bug

Every model found the core issue: orphaned connections when async tasks are cancelled. But they found different things, proposed different fixes, and had different levels of depth.

DeepSeek V4 Pro

Time: ~3 minutes | Cost: part of $0.14 total

What they found: 1 race condition — orphaned connections when task is cancelled after assignment.

Mechanism explained: 5-step walkthrough with code. Correctly identified the FIFO guard in _attempt_to_acquire_connection and why orphaned status blocks the queue.

What they missed: RC2 (ConnectionNotAvailable retry leak), RC3 (connection leak during aclose), cancellation inside initial lock block.

Proposed fix:

except BaseException as exc:
    async with self._pool_lock:
        if status in self._requests:
            self._requests.remove(status)
        if status.connection is not None:
            connection = status.connection
            if connection.is_closed() and connection in self._pool:
                self._pool.remove(connection)
            else:
                status.unset_connection()
                for s in self._requests:
                    if s.connection is None:
                        await self._attempt_to_acquire_connection(s)
                        break
    raise exc

Architecture: Reactive cleanup — detect orphan, release it, reassign to next waiter.

Pros:

✅ Handles the primary race condition
✅ Tries to reassign released connections
✅ Minimal changes to existing code

Cons:

❌ await self._attempt_to_acquire_connection(s) inside the lock — creates a NEW cancellation window
❌ Only handles RC1. Misses RC2 and RC3
❌ Doesn't address root cause — state and I/O still interleaved

Verdict: Good analysis, incomplete fix. Works for the common case but introduces a new race window. The cure partially creates the disease.

MiMo V2.5 Pro

Time: ~9 minutes | Cost: part of $0.13 total

What they found: 3 distinct race conditions (no other model found more than 1):

RC1: Status leak during initial pool lock — cancellation during _close_expired_connections() or _attempt_to_acquire_connection() inside async with pool_lock leaves status in self._requests
RC2: Status leak during ConnectionNotAvailable retry — Python exception semantics: exceptions inside except blocks bypass sibling except handlers. CancelledError during retry exits the while True loop entirely.
RC3: Connection leak during cleanup — await connection.aclose() followed by self._pool.pop(idx). If cancelled during aclose(), pop() never executes.

Key insight: Explained why existing tests don't catch it — they use single-request scenarios. Also identified Python exception semantics (sibling except blocks don't catch each other's exceptions).

Proposed fix:

# Part 1: Top-level safety net
try:
    while True:
        # ... existing logic ...
except BaseException:
    async with self._pool_lock:
        if status in self._requests:
            self._requests.remove(status)
    raise

# Part 2: Defensive sweep
async def _close_expired_connections(self) -> None:
    self._pool = [c for c in self._pool if not c.is_closed()]
    # ... rest of existing logic ...

Architecture: Defense in depth — multiple layers of protection.

Pros:

✅ Catches all three race conditions — only model to do this in Round 1
✅ Defensive sweep is idempotent
✅ Top-level handler catches cancellations from initial lock block
✅ Doesn't change existing code structure significantly

Cons:

❌ Still reactive — connections get orphaned, then cleaned up
❌ _close_expired_connections() still async inside lock — RC3 still possible during sweep
❌ More code than simpler approaches

Verdict: Best Round 1 fix. Found all 3 bugs, proposed defense for all 3. But still a patch, not prevention.

DeepSeek V4 Flash

Time: ~3 minutes | Cost: part of $0.040 total

What they found: A unique bug no other model caught:

# File: httpcore/_async/connection.py, line 97
# Bug: except Exception doesn't catch CancelledError
# CancelledError is BaseException in Python 3.8+
except Exception:  # ← BUG
# Fix:
except BaseException:  # ← 1 line

Why it matters: This is a separate bug from the pool-level race. Even with a perfect pool fix, the connection-level handler would miss CancelledError. Two bugs in one codebase.

Proposed fix: 1 line — Exception → BaseException.

Architecture: Minimal fix at the right level of abstraction.

Pros:

✅ Correct and minimal — 1 line, zero side effects
✅ Different abstraction level — while others looked at pool logic, Flash looked at connection handling
✅ Catches a bug that persists even with a perfect pool fix
✅ Zero risk — strict superset of previous behavior
✅ No new code — just a wider catch clause

Cons:

❌ Doesn't fix the pool-level race — this is a different bug
❌ Doesn't explain the pool mechanism — focused on connection level
❌ Could be seen as incomplete — found a different bug, not THE bug

Verdict: Brilliant lateral thinking. Found a bug at a different abstraction level. Combined with any pool-level fix, this makes the solution stronger. Alone, it doesn't solve pool exhaustion.

MiMo V2.5 (cheap)

Time: ~4 minutes | Cost: part of $0.067 total

What they found: 1 race condition — orphaned connections. Proposed 2 variants: (A) handler + (B) try/finally.

Proposed fix:

async with self._pool_lock:
    self._requests.append(status)
    try:
        await self._close_expired_connections()
        await self._attempt_to_acquire_connection(status)
    except BaseException:
        if status.connection is None and status in self._requests:
            self._requests.remove(status)
        raise

Architecture: Exception safety — guarantee cleanup via try/finally.

Pros:

✅ Simplest change — wraps existing code
✅ Condition-aware — only removes if connection is None
✅ No new methods

Cons:

❌ I/O still inside lock (await _close_expired_connections())
❌ Only catches RC1. Misses RC2 and RC3
❌ If connection was assigned — except block doesn't handle it
❌ Reactive — orphan created, then cleaned up

Verdict: Quick fix, not production-ready. Good enough for a hotfix.

Kimi K2.6

Time: ~23 minutes | Cost: part of $1.04 total

What they found: 1 race condition. Shortest Round 1 (57 lines). Correctly explained the FIFO guard and why orphan blocks the pool.

Proposed fix: Identical to MiMo cheap — try/finally around lock body.

Architecture: Same as MiMo cheap. Same strengths, same weaknesses.

Pros:

✅ Shortest Round 1 solution
✅ Correct for the common case

Cons:

❌ Same issues as MiMo cheap

Verdict: Clean and minimal. But same architectural weakness.

GLM 5.2

Time: ~20 minutes | Cost: part of $1.28 total

What they found: 1 race condition. Key observation — asymmetric cleanup: two exception handlers do different things.

Key insight: First handler (lines 240–248) only removes status. Second handler (lines 265–268) calls response_closed() for full cleanup. This asymmetry IS the design problem.

Proposed fix:

except BaseException as exc:
    with AsyncShieldCancellation():
        if status.connection is not None:
            await self.response_closed(status)
        else:
            async with self._pool_lock:
                if status in self._requests:
                    self._requests.remove(status)
                for s in self._requests:
                    if s.connection is None:
                        acquired = await self._attempt_to_acquire_connection(s)
                        if not acquired:
                            break
                await self._close_expired_connections()
    raise exc

Architecture: Unified cleanup — make both handlers do the same work.

Pros:

✅ Identifies the architectural problem — asymmetric design, not just a missing line
✅ Re-evaluates the queue — promotes next waiting request (fixes Queue Stall)
✅ Handles both cases — connection assigned vs not

Cons:

❌ await calls inside lock — still has cancellation windows
❌ Reactive — orphans created, then cleaned up
❌ Duplicates re-evaluation loop from response_closed()

Verdict: Best problem diagnosis in Round 1. But the fix doesn't match the quality of the analysis.

Round 1 Summary

Model	Bugs Found	Fix Type	Creates New Race Windows?	Lines
DeepSeek Pro	1	Cleanup handler	Yes (await in lock)	~80
MiMo Pro	3	Defense in depth	No	~120
Flash	1 (unique)	1-line fix	No	1
MiMo cheap	1	try/finally	No	~20
Kimi	1	try/finally	No	~20
GLM	1	Unified cleanup	Yes (await in lock)	~100

Key observation: All models defaulted to cleanup thinking in Round 1. "Detect the problem, then fix it." None proposed prevention ("restructure so the problem can't happen"). This suggests prevention is a harder reasoning pattern than cleanup.

Round 2: From Patches to Prevention

With the hint about "atomic state management," every model shifted from cleanup to prevention. The core insight — zero await points inside lock body — was universal. But the implementations differed architecturally.

DeepSeek V4 Pro — Lock-based Atomic

Time: ~5 minutes

Architecture: Move connection claiming from response_closed() (Task B) to wait_for_connection() (Task A). Each task claims its own connection inside a lock.

Before: Task B assigns connection to Task A (proactive)
After:  Task A claims connection for itself (reactive, but atomic)

Key changes:

New shared event _pool_state_changed
New method _wait_and_acquire() with manual lock management
Removed proactive assignment from response_closed()

Pros:

✅ Clear responsibility — each task manages its own lifecycle
✅ Simplest mental model: wait → claim → use → release
✅ 5 cancellation scenarios analyzed (most thorough)

Cons:

❌ Manual lock management (__aenter__/__aexit__) is fragile
❌ Busy-polling under FIFO guard
❌ Micro-window at lock release (mitigated by safety net)

Comparison to reference: Different architecture (who claims) but same guarantee. Reference is simpler.

MiMo V2.5 Pro — Three-phase Separation

Time: ~6 minutes

Architecture: 3 phases. Lock body = ZERO await points.

Phase 1: CLEANUP  — identify expired/idle connections (I/O)
Phase 2: STATE    — append, sweep, acquire (sync)
Phase 3: I/O      — wait, send request (network)

Key changes:

_attempt_to_acquire_connection → sync, returns (bool, List[connection])
_close_expired_connections → sync, returns List[connection]
AsyncShieldCancellation for Phase 1 and 2

Pros:

✅ Zero await in lock — mathematically provable
✅ Collect-then-close — I/O can be batched/retried
✅ Return value is explicit

Cons:

❌ Return type change breaks existing callers
❌ More code (~350 lines)
❌ Shield overhead

Comparison to reference: Same architecture, same guarantee. More code but more formal.

DeepSeek V4 Flash — Reference Quality

Time: ~4 minutes

Architecture: Same as reference — state under lock, I/O outside. Cleaner naming.

_attempt_to_acquire_connection → sync, returns List[connection]
_close_expired_connections → _mark_expired_connections (sync)

Key changes: Same as MiMo Pro but with better naming and less code.

Pros:

✅ Cleanest implementation (~200 lines)
✅ Fastest V2 (~4 min)
✅ No AsyncShieldCancellation needed

Cons:

❌ Same return type change as MiMo Pro
❌ Less detailed scenario analysis

Comparison to reference: Effectively the reference implementation. Same architecture, same approach.

MiMo V2.5 (cheap) — Guaranteed Cleanup

Time: ~22 minutes

Architecture: try/finally around every state mutation.

try: await aclose()
finally: pop()

Pros:

✅ Minimal changes (~50 lines)
✅ Doesn't break API
✅ Easy to understand

Cons:

❌ NOT prevention — I/O still inside lock
❌ Lock held longer (aclose blocks)
❌ Reasoning loop in Round 2 (131K output tokens)

Comparison to reference: Fundamentally different approach. Reference prevents; MiMo cheap cleans up after.

Kimi K2.6 — I/O Before State

Time: ~30 minutes

Architecture: All async operations BEFORE lock. Lock body = only sync state.

Before: Lock → await close + await acquire → Unlock
After:  await close + await capacity → Lock → sync acquire → Unlock

Key changes:

New helper _ensure_pool_has_capacity() (async, before lock)
_attempt_to_acquire_connection → sync
Lock body = 2 sync operations

Pros:

✅ Shortest lock time (2 sync ops)
✅ Clean mental model
✅ Less lock contention

Cons:

❌ Duplicated logic (_ensure_pool_has_capacity ≈ _close_expired_connections)
❌ Two await points before lock (time wasted on cancellation)

Comparison to reference: Different order of operations (I/O first → state vs state → I/O). Both prevent the race. Kimi's approach has lower lock contention.

GLM 5.2 — Collect-then-Close

Time: ~8 minutes

Architecture: Same as three-phase (MiMo Pro). Collect inside lock, close outside.

Key changes: Same as MiMo Pro. Added queue re-evaluation in exception handler.

Pros:

✅ Same correctness as MiMo Pro
✅ Queue re-evaluation fixes Queue Stall
✅ Most detailed documentation (365 lines, ASCII diagrams)

Cons:

❌ Near-clone of MiMo Pro
❌ Most expensive ($1.28)

Comparison to reference: Same architecture. More documentation, same code.

Round 2 Summary

Model	Zero await in lock?	Lock duration	Code volume	Unique aspect
DeepSeek Pro	✅	Medium	~300 lines	Manual lock, shared event
MiMo Pro	✅	Short	~350 lines	Three phases, shield
Flash	✅	Short	~200 lines	Cleanest, = reference
MiMo cheap	❌	Long	~50 lines	try/finally, not prevention
Kimi	✅	Shortest	~150 lines	I/O first approach
GLM	✅	Short	~365 lines	Queue re-evaluation, most docs

Token Usage & Cost

Model	Input	Output	Cache Hit	Cost	Requests
Flash	2,504,516	55,867	93.3%	$0.040	34
MiMo cheap	1,648,424	173,624	93.9%	$0.067	20
MiMo Pro	3,356,951	53,145	95.2%	$0.13	34
DeepSeek Pro	2,431,121	43,663	92.1%	$0.14	30
Kimi	2,283,533	130,532	91.5%	$1.04	43
GLM	1,875,842	56,476	96.7%	$1.28	29

Why Flash is cheapest: Lowest token count + no reasoning loop + same cache hit rate.

Why GLM is most expensive: Pricing: $1.40/M input, $4.40/M output (vs $0.0036 for others). Even with highest cache hit rate (96.7%), per-token cost is 400x higher.

MiMo cheap reasoning loop: 131K output tokens in one request during Round 2 — model generated massive text without reaching a solution. "More tokens, same problem."

The Reference Fix (PR #880)

Tom Christie's fix: move ALL state management into non-cancellable sections using locks.

Key insight: "The async case cannot have cancellations or context-switches midway through the state management because we hold the lock."

Files changed: 9 files, +512/-379 lines

All Round 2 models converged on this approach — zero await points inside lock body. The differences are in implementation details.

Key Findings

1. All Models Found the Bug in Round 1

Every model — from $0.04 Flash to $1.28 GLM — correctly identified orphaned connections as the core issue.

2. MiMo Pro Found 3 Bugs vs 1

MiMo V2.5 Pro identified three distinct race conditions. Other models found only the primary one. This suggests deeper code analysis capability for concurrent systems.

3. Flash Found a Unique Bug at a Different Level

DeepSeek V4 Flash found except Exception not catching CancelledError at the connection level — a bug no other model noticed. Sometimes cheaper models find different things by looking at different abstraction levels.

4. All Models Defaulted to Cleanup in Round 1

Every model proposed patches (detect → clean up) rather than prevention (restructure → can't happen). Only with the Round 2 hint did they shift to prevention. This is a meaningful finding about LLM debugging: they're good at finding and patching bugs, but need guidance to redesign architectures.

5. Two Models Created New Race Windows

DeepSeek Pro and GLM put await calls inside the lock in their Round 1 fixes, creating new cancellation points. The cure partially recreated the disease.

6. Six Models, Six Architectures

Round 2 produced six different approaches to the same problem. All are valid. The differences are trade-offs:

Priority	Best Model
Minimum code changes	MiMo cheap (try/finally)
Minimum lock time	Kimi (I/O first)
Formal correctness	MiMo Pro (three-phase)
Responsibility clarity	DeepSeek Pro (lock-based)
Best overall	Flash ($0.04, reference quality)

7. Price ≠ Quality

GLM 5.2 ($1.28) and DeepSeek V4 Flash ($0.04) both found 1 bug and proposed prevention. Flash was 32x cheaper and found a unique bug. GLM's only advantage was more detailed documentation.

Verdict

Task	Best Model	Why
Budget debugging	DeepSeek V4 Flash	$0.04, unique bug, reference V2
Deep analysis	MiMo V2.5 Pro	3 bugs, systematic approach
Code generation	DeepSeek V4 Pro	Faster, cleaner architecture
Documentation	GLM 5.2	Most detailed, ASCII diagrams
Minimum lock time	Kimi K2.6	I/O first approach
Hotfix	MiMo V2.5 cheap	50 lines, doesn't break API

The surprise: DeepSeek V4 Flash — a "cheap" model — found a bug that both Pro versions missed, proposed a reference-quality fix, and did it all for $0.04.

What's Next

This is Level 1 of a multi-level benchmark. Future levels:

Level 2: FastAPI #4719 (Async Dependencies + Middleware Hang)
Level 3: CPython #129204 (Asyncio Memory Leak)
Level 4: redis-py #2641 (Async Race Condition in Queue Mechanics)

Each level increases complexity. We'll see if cheap models can keep up.

DeepSeek V4 Pro vs MiMo V2.5 Pro - Debugging Benchmark

Stanislav — Tue, 30 Jun 2026 20:08:02 +0000

A real-world comparison of two LLMs on a genuine race condition bug from GitHub

TL;DR

Metric	DeepSeek V4 Pro	MiMo V2.5 Pro
Time	~8 min (2 rounds)	~15 min (2 rounds)
Tokens	2.43M	3.36M
Cache hit rate	92.1%	95.2%
Cost	$0.14 (6% top-up fee)	$0.13 (0% fee)
Bugs found	1 race condition	3 race conditions
Fix approach	Prevention (lock-based)	Prevention (three-phase separation)

Verdict: MiMo is better at debugging (finds more bugs, deeper analysis) AND cheaper. DeepSeek is faster and better for writing code.

Why This Benchmark?

Most LLM benchmarks test coding ability — write a function, solve a puzzle, implement an algorithm. But in real-world development, debugging is harder than writing code. You need to:

Understand complex, multi-file codebases
Find non-obvious root causes
Explain the mechanism clearly
Propose a correct fix

We wanted to test this specific skill. So we took a real race condition bug from a popular open-source library and gave it to both models.

The Bug: httpcore #961

Repository: encode/httpcore
Issue: #961 - Race Condition After Async Cancellations Breaks Connection Pool
Fix PR: #880 - Safe async cancellations

What's httpcore?

httpcore is a low-level HTTP client library used by httpx (the popular Python HTTP client). It handles connection pooling, HTTP/2, proxies, and more.

The Bug

Why It's Hard

Multi-file: Involves connection_pool.py, connection.py, and http2.py
Async-specific: Only manifests with asyncio/trio cancellation
Non-obvious: Logs show normal operation; the race window is tiny
Real-world: This is a production issue affecting real users

Methodology

Benchmark Structure

We gave each model the entire httpcore project at the commit BEFORE the fix (commit 79fa6bf). The project included:

Full source code (90 files)
README.md with bug description (no hints about the fix)
PROMPT.md with instructions
SOLUTION.md and SOLUTION.diff (hidden from models)

Round 1: Find the Bug

Prompt (identical for both models):

You are given a Python project with a bug. Your task is to find the bug
and write a detailed explanation of how to fix it.

1. Read README.md to understand the project and the bug description.

2. Analyze the source code in httpcore/_async/ and httpcore/_sync/
   to find the root cause of the race condition.

3. Run the tests to see which ones fail:
   pip install -e ".[asyncio]"
   pytest tests/ -v

4. Write your findings to SOLUTION.md with:
   - Root cause analysis (what exactly goes wrong)
   - Why it happens (the mechanism)
   - How to fix it (the approach, not necessarily the exact code)
   - Which files need to be changed

Do NOT modify the source code. Only write SOLUTION.md.

Round 2: Refine the Fix

After Round 1, both models proposed patches (cleanup handlers) rather than prevention (atomic state management). We gave them a hint:

Prompt (identical for both models):

Your previous fix handled orphaned connections in the cancellation
handler. This works, but it treats the symptom — connections still
get orphaned, you just clean them up after.

A better approach would be to prevent the race condition from
happening in the first place. The root cause is that state
management (tracking idle vs in-use connections) is interleaved
with I/O operations (queue.get(), queue.put()). When a task is
cancelled between state update and I/O, the pool loses track.

Can you find a way to make the state management atomic — so that
cancellation cannot happen midway through the acquire/release
sequence?

Write your refined solution to SOLUTION_V2.md.

Results: Round 1

DeepSeek V4 Pro (~3 minutes)

Root Cause: Found 1 race condition — orphaned connections when task is cancelled after assignment but before resume.

Key Insight: "The connection remains in the pool marked as 'in use' but the task that was supposed to use it is gone."

Proposed Fix: Handle orphaned connections in the cancellation handler — check if a connection was assigned and release it.

Quality: Excellent root cause analysis, step-by-step mechanism explanation. However, proposed fix was a patch (cleanup handler), not prevention.

MiMo V2.5 Pro (~9 minutes)

Root Cause: Found 3 distinct race conditions:

Status leak during initial pool lock
Status leak during ConnectionNotAvailable retry
Connection leak during cleanup

Key Insight: Explained why existing tests don't catch it — they use single-request scenarios.

Proposed Fix: Add cleanup handlers + defensive connection sweep.

Quality: Excellent analysis, deeper than DeepSeek (3 bugs vs 1). However, proposed fix was also a patch (cleanup handlers), not prevention.

Round 1 Comparison

Aspect	DeepSeek	MiMo
Time	~3 min	~9 min
Bugs found	1	3
Fix approach	Patch (cleanup)	Patch (cleanup)
Fix quality	🟡 Treats symptoms	🟡 Treats symptoms
Explanation quality	Excellent	Excellent

Results: Round 2

DeepSeek V4 Pro (~5 minutes)

Approach: Move connection claiming to the waiting task, make it atomic inside a lock.

Key Changes:

New _wait_and_acquire() method
Removed proactive assignment from response_closed()
Added _pool_state_changed event

Quality: 🟢 Architecturally clean, similar to the actual fix.

MiMo V2.5 Pro (~6 minutes)

Approach: Three-phase separation — CLEANUP (I/O), STATE (sync), I/O (network).

Key Changes:

_attempt_to_acquire_connection is now synchronous (no await inside lock)
AsyncShieldCancellation for critical sections
Top-level cleanup handler as safety net

Quality: 🟢 Systematic approach, analyzed 5 cancellation scenarios.

Round 2 Comparison

Aspect	DeepSeek	MiMo
Time	~5 min	~6 min
Approach	Lock-based atomic	Three-phase separation
Complexity	Medium	High
Edge cases	Good	Excellent (5 scenarios)

Token Usage & Cost

Metric	DeepSeek V4 Pro	MiMo V2.5 Pro
Total tokens	2,431,121	3,356,951
Cache hit	2,198,400	3,146,304
Cache miss	189,058	157,502
Output	43,663	53,145
Cache hit rate	92.1%	95.2%
API requests	30	34

Cost Calculation

Both models use the same pricing on OpenCode Go:

Cache hit: $0.003625/M tokens
Cache miss: $0.435/M tokens
Output: $0.87/M tokens

DeepSeek V4 Pro:

Cache hit: 2.198M × $0.003625 = $0.008
Cache miss: 0.189M × $0.435 = $0.082
Output: 0.044M × $0.87 = $0.038
Subtotal: $0.128
With 6% top-up fee: $0.14

MiMo V2.5 Pro:

Cache hit: 3.146M × $0.003625 = $0.011
Cache miss: 0.158M × $0.435 = $0.069
Output: 0.053M × $0.87 = $0.046
Subtotal: $0.126
With 0% fee: $0.13

Why MiMo Is Cheaper

Even though MiMo used 38% more tokens, it was still cheaper because:

No top-up commission (0% vs 6% for DeepSeek)
Higher cache hit rate (95.2% vs 92.1%)

The Reference Fix (PR #880)

The actual fix by Tom Christie (httpcore author) was elegantly simple:

Approach: Move ALL state management into non-cancellable sections using locks.

Key Insight: "The async case cannot have cancellations or context-switches midway through the state management because we hold the lock."

Files Changed: 9 files, +512/-379 lines

Both models converged on this approach in Round 2, though with different implementations:

DeepSeek: Lock-based atomic claiming
MiMo: Three-phase separation (CLEANUP → STATE → I/O)

Verdict

Different Profiles, Not Better/Worse

Task	Better Model	Why
Writing code	DeepSeek V4 Pro	Faster, fewer tokens, cleaner architecture
Debugging	MiMo V2.5 Pro	Finds more bugs, deeper analysis, cheaper

DeepSeek Strengths

Speed: 2x faster (8 min vs 15 min)
Efficiency: 37% fewer tokens
Architecture: Cleaner, simpler solutions
Best for: Implementing features, writing new code

MiMo Strengths

Depth: Found 3 race conditions vs 1
Analysis: 5 cancellation scenarios, explains why tests don't catch it
Cost: Cheaper despite using more tokens (0% commission)
Best for: Debugging complex issues, code review, finding bugs

Cost Efficiency

For this specific debugging task:

DeepSeek: $0.14 for 2 rounds
MiMo: $0.13 for 2 rounds

MiMo is both better at debugging AND cheaper. The higher token usage is offset by the lack of top-up commission.

Methodology Notes

Why Real Bugs?

Synthetic bugs are too easy — models solve them in seconds. Real bugs from production codebases require:

Understanding complex architectures
Tracing state through multiple files
Reasoning about async/concurrent behavior
Finding non-obvious root causes

Why Two Rounds?

In real-world debugging, you often get a quick fix first, then refine it. Testing both rounds shows:

Initial debugging ability (finding the bug)
Iterative improvement (refining with hints)

Token counting methodology

Baseline snapshot before benchmark
After snapshot when complete
Delta = tokens used for the task
Cost calculated using OpenCode Go pricing

Conclusion

This benchmark reveals that debugging and code writing are different skills. DeepSeek excels at writing clean, efficient code quickly. MiMo excels at deep analysis and finding subtle bugs.

For teams building AI-assisted development tools:

Use DeepSeek for code generation, refactoring, implementation
Use MiMo for code review, debugging, root cause analysis

The surprise finding: MiMo is cheaper for debugging despite using more tokens, thanks to zero commission on top-up. For high-volume debugging workloads, this cost difference adds up.

Benchmark conducted on June 30, 2026 using DeepSeek API and Xiaomi MiMo API platforms. Full benchmark data available in the author's GitHub repository.

LedgerMind 3.0 3.3.2: How We Turned "It Works" into "It Works Brilliantly"

Stanislav — Sat, 14 Mar 2026 22:15:52 +0000

Spoiler: 497 commits, three sleepless nights with SQLite, and one very stubborn race condition that refused to die.

Reading time: ~12 minutes · For: AI agent developers, architecture drama enthusiasts

Introduction: We Had Three Versions, Now We Have... More

If you missed the last few months of LedgerMind's life, here's the short version: we took a system that in version 3.0 simply worked, and turned it into a system that works fast, reliably, and with elements of artificial intelligence.

Sounds like marketing bullshit? I get it. So let's jump straight to the facts:

Metric	v3.0	v3.3.2	Change
Search (OPS)	~2,000	5,500+	+175%
Write (latency)	~500ms	14ms	-97%
Commits between versions	—	497	😅
Critical bugs in production	Had them	Zero now	🎉

But let's start from the beginning. Because behind these numbers lies a real engineering drama.

Chapter 1: When Everything Broke (And We Fixed It)

The Tale of One Race Condition

We had a problem. A beautiful, classic TOCTOU race (Time-Of-Check-To-Time-Of-Use). Two agents simultaneously decide to write a decision for the same target. First checks — no conflicts. Second checks — no conflicts. First writes. Second writes. Boom. Metadata corrupted.

"This rarely happens," someone said.

"Rarely isn't never," replied CI/CD at 3 AM.

The fix: real ACID transactions with BEGIN IMMEDIATE, a global lock registry, and automatic stale lock cleanup after 10 minutes. Now you can run ten agents on one project — they'll figure it out.

"Database is Locked": A Chronicle of Expected Death

SQLite is a wonderful thing until you try to write to it from a background worker and a user request simultaneously. Then it becomes... less wonderful.

sqlite3.OperationalError: database is locked

This error haunted us like a ghost. We tried:

❌ Increasing timeouts (didn't help)
❌ Adding retry logic (helped, but hacky)
❌ Praying to database gods (didn't work)

What worked: splitting enrichment batches into per-proposal transactions + worker.pid for detecting stuck workers + automatic stale lock cleanup.

Now the background worker calmly runs every 5 minutes, and users don't even notice it exists. As it should be.

Chapter 2: Features We Wanted Ourselves (And Built)

DecisionStream: Knowledge Has a Lifecycle Too

Before, knowledge in LedgerMind was static. You wrote it — it sat there until you deleted it. Boring.

Now every piece of knowledge has three life phases:

PATTERN → EMERGENT → CANONICAL

PATTERN — the system noticed a repeating event
EMERGENT — the pattern confirmed itself several times
CANONICAL — this is no longer just an observation, it's truth

LifecycleEngine manages transitions automatically. You do nothing — the system decides when knowledge has "grown up."

Why? Because after a month of operation, you accumulate hundreds of decisions. And you want to see current ones in search, not those you wrote on day one and forgot.

Trajectory-Based Reflection: The System Learns to Think Like You

This is probably my favorite feature of v3.3.

Before: you record decisions, the system stores them.

After: the system analyzes sequences of your decisions and identifies thinking patterns.

# You just record decisions
memory.record_decision("Use PostgreSQL", target="db")
memory.record_decision("Add JSONB", target="db")
memory.record_decision("Migrations via Alembic", target="db")

# The system notices the pattern:
# "When user builds API → PostgreSQL + JSONB + Alembic"
# And next time will suggest this stack automatically

This isn't magic. It's the Trajectory-based Reflection Engine, which builds graphs of your decisions and finds repeating paths in them.

Zero-Touch Automation: Fewer Clicks, More Code

We added Gemini CLI support and brought VS Code integration to "Hardcore" level:

Client	Automation Level
VS Code	Hardcore — shadow context, terminal, chats
Claude Code	Full — auto-record + RAG
Cursor	Full — auto-record + RAG
Gemini CLI	Full — auto-record + RAG

What does this mean in practice? You don't think about LedgerMind at all. It just works. Before every LLM request, the system injects context from memory. After every response — it writes the result automatically.

You work as usual. The system works for you.

Chapter 3: Optimizations, or How We Squeezed Out Milliseconds

Search: From 2,000 to 5,500+ OPS

Early in v3.3 development, we noticed something unpleasant: search slowed down. From ~4,000 OPS to ~2,000 OPS.

Cause: added linked_id validation for connections between events and decisions. Every search did a full table scan.

Fix: index on linked_id + fast-path heuristics for simple queries + metadata batching.

-- Before: slow JOIN without index
SELECT * FROM decisions WHERE linked_id IN (...)

-- After: fast lookup by index
CREATE INDEX idx_linked_id ON episodic_events(linked_id);

Result: 5,500+ OPS for semantic search, 14,000+ OPS for keyword-only.

Write: 8 OPS with Full Git Audit

Writing a decision to LedgerMind isn't just INSERT INTO. It's:

SQLite WAL write
Git commit for cryptographic audit
Vector embedding generation
Link count updates

And all this fits into 8 operations per second. For comparison: v3.0 had ~2 OPS.

How? Deferred VectorStore loading, splitting transactions into proposals, path validation caching.

Mobile Version: 4-bit GGUF on Termux

Yes, LedgerMind now runs on Android via Termux. With a 4-bit quantized model.

Metric	Mobile (GGUF)	Server (MiniLM)
Search (latency)	0.13ms	0.05ms
Write (latency)	142.7ms	14.1ms
Search (OPS)	5,153	11,019

Why? Because sometimes you need to prototype on the go. And because we can.

Chapter 4: Bugs We Conquered (And How)

"At least 2 targets" — The Error That Made No Sense

Symptom: when merging duplicates, the system returned at least 2 targets required, even when duplicates existed.

Cause: group size validation happened after transaction start, when data was already partially modified.

Fix: validate group size before transaction + randomize candidates to prevent infinite merge loops.

The Missing `vitality` Field

Symptom: CANONICAL knowledge ranks lower than fresh PATTERNs.

Cause: the vitality field needed for lifecycle ranking wasn't loaded in search fast-path.

Fix: add vitality calculation to fast-path + fix transitions in LifecycleEngine.

The Infinite Enrichment Loop

Symptom: worker processes the same proposals over and over. Tokens disappear. Time disappears.

Cause: SQL query didn't exclude already-processed records.

Fix: add enrichment_status field with pending → completed transition + stuck record detection.

Chapter 5: Refactoring Nobody Sees (But Everyone Feels)

Memory API Decomposition

Before: one huge Memory class at 2,000+ lines.

After: nine specialized services:

Memory (coordinator)
├── EpisodicStore    # short-term events
├── SemanticStore    # long-term decisions + Git
├── VectorStore      # embeddings
├── ConflictEngine   # conflict detection
├── ResolutionEngine # supersede validation
├── DecayEngine      # pruning old data
├── ReflectionEngine # pattern discovery
└── LifecycleEngine  # phase management

Why? Each component can be tested, optimized, and replaced independently. And when a new developer arrives in six months, they won't run away in horror.

Removing Legacy Settings

We removed:

preferred_language → now enrichment_language
arbitration_mode → replaced with intelligent conflict resolution
lite mode → completely cut from architecture

Why does this matter? Less dead code = fewer bugs = fewer questions like "what does this setting do?".

Epilogue: Should You Upgrade?

If you're on v3.0: Yes, immediately

Why:

Performance: writes are 35x faster (500ms → 14ms)
Reliability: race conditions and DB locks fixed
Features: DecisionStream, Trajectory Reflection, Zero-Touch for Gemini
Security: Bandit vulnerabilities patched
Migration: automatic, non-destructive

# Backup
ledgermind-mcp run --path /path/to/v3.0/memory

# Upgrade
pip install --upgrade ledgermind

# Initialize
ledgermind init

What's Next?

Judging by commits and TODOs in the code:

Feature	Confidence	Evidence
Real-time collaboration (CRDT)	Medium	Multi-agent namespacing groundwork
Cloud hosting	Medium	Docker + REST gateway ready
Knowledge graph visualization	High	DecisionStream ontology enables graph queries
LangChain/LlamaIndex integration	High	MCP protocol compatibility

Afterword: Personal Thoughts

When we started v3.3, I thought: "A few features, some optimizations, release in a month."

Reality: 497 commits, three critical bugs in production, one night debugging SQLite locking, and lots of coffee.

But when I see search running at 5,500+ OPS, the background worker doing its job without a single lock, the system automatically "understanding" patterns in my decisions — I realize: it was worth it.

LedgerMind v3.3.2 isn't just "a new version." It's a system you can trust.

Now go build something awesome.

Article written based on analysis of 497 commits between v3.0.0 and v3.3.2. The author didn't sleep for two nights but will catch up tomorrow.

P.S. If you find a bug — open an issue. We're fast. Promise.

P.P.S You can watch video tutorial on my X.com // Detect dark theme var iframe = document.getElementById('tweet-2032901678538580120-482'); if (document.body.className.includes('dark-theme')) { iframe.src = "https://platform.twitter.com/embed/Tweet.html?id=2032901678538580120&theme=dark" }

LedgerMind v3.0: Knowledge That Lives, Breathes, and Dies on Purpose

Stanislav — Fri, 27 Feb 2026 15:56:21 +0000

If you read my earlier piece on LedgerMind, you know the core premise: AI agents need persistent memory that actually survives real work — conflicts, contradictions, evolving requirements, session boundaries.

That article was written about a version of LedgerMind that, in retrospect, had a fundamental conceptual flaw. It treated knowledge as binary: a decision either exists or it doesn't. It's either active or superseded. There was no notion of knowledge being young, maturing, fading, or sleeping.

v3.0 fixes this at the ontology level. This article is about what changed, why it had to change, and what the new model looks like from the inside.

The problem with static knowledge

Here's the situation I kept running into with the old system.

An agent starts a project. It makes 20 architectural decisions in the first week — framework, database, auth approach, caching strategy. These get recorded as active decisions. Great.

Three months later, the agent is back on the same project. Half of those decisions are still relevant. Two of them have been superseded by explicit updates. And the remaining eight... no one touched them, but the reality they described has completely changed. The team switched to a different database. The auth approach was refactored. The caching strategy turned out to be wrong.

The old system had no way to distinguish:

A decision that's been continuously confirmed by recent work (healthy, trustworthy)
A decision that was made and never revisited (possibly stale)
A decision that was made, then quietly contradicted by dozens of subsequent events without anyone explicitly superseding it (actively misleading)

All three looked identical in search results. Same status=active. Same retrieval weight. The agent had no way to know which decisions to trust.

This is the problem v3.0 solves.

The new ontology: knowledge that breathes

v3.0 introduces Breathing Memory Architecture. The core idea: every piece of knowledge exists on three independent axes simultaneously.

Phase      →  PATTERN → EMERGENT → CANONICAL
Vitality   →  ACTIVE ↔ DECAYING ↔ DORMANT
Confidence →  0.0 ────────────────────── 1.0

Phase measures crystallization — how well-established is this knowledge? It only moves forward.

Vitality measures aliveness right now — is this knowledge being actively confirmed by recent events, or is it fading into the background? It moves in both directions.

Confidence is a mathematical composite: 0.4 × utility + 0.4 × removal_cost + 0.2 × stability_score.

The combination of these three dimensions creates a fundamentally richer picture than "active or superseded." A CANONICAL / DORMANT / confidence=0.3 decision tells you something completely different from EMERGENT / ACTIVE / confidence=0.8. The first is an old truth that nobody's touched in months and is starting to feel stale. The second is a newer belief that's being actively confirmed right now.

DecisionStream: the new unit of knowledge

In the old system, the unit of knowledge was a simple decision record — basically a Markdown file with a title, target, rationale, and status.

In v3.0, the unit of knowledge is a DecisionStream: a living object that carries not just what was decided, but the full lifecycle history of that decision.

class DecisionStream:
    decision_id: UUID
    target: str                    # min 3 chars: 'auth_service', 'database_migrations'

    # The three axes
    phase: Literal["pattern", "emergent", "canonical"]
    vitality: Literal["active", "decaying", "dormant"]
    confidence: float              # 0.0 → 1.0

    # Temporal signals
    first_seen: datetime
    last_seen: datetime
    frequency: int                 # how many times confirmed by events

    # Health metrics
    stability_score: float         # regularity of confirmations: 0.0 → 1.0
    coverage: float                # lifetime_days / observation_window_days
    reinforcement_density: float   # frequency / lifetime_days

    # Importance signals
    estimated_removal_cost: float  # how expensive would it be to lose this
    estimated_utility: float       # how useful is this knowledge right now
    scope: Literal["local", "system", "infra"]

    # Provenance
    evidence_event_ids: List[int]  # links to episodic events that built this
    provenance: Literal["agent", "reflection", "intervention", "external"]

The evidence_event_ids are particularly important. These are the episodic events that contributed to building this knowledge — errors, successes, Git commits, agent interactions. These links are immortal: even if the episodic events would normally be pruned by the TTL, they're kept forever because they're the evidentiary foundation of the semantic knowledge above them.

Three ways knowledge is born

In the old system, knowledge came from one place: an agent calling record_decision(). Everything had to be explicit.

v3.0 has three birth paths:

Path A: Agent (explicit)

The agent calls record_decision() directly. The DecisionStream is created immediately at phase=EMERGENT — because if an agent is explicitly recording something, it's already observed the pattern.

Path B: Reflection Engine (emergent)

The Reflection Engine runs continuously (approximately every 5 minutes via run_maintenance()). It reads recent episodic events, clusters them by target, and when it sees enough signal in a cluster — errors, successes, commits all pointing at the same area — it automatically creates a new DecisionStream at phase=PATTERN.

This is the crucial difference. The old system only knew what agents told it. v3.0 knows things agents never explicitly said, because it watches what they do.

# What the Reflection Engine actually sees:
{
    "target": "database_migrations",
    "events": [
        {"kind": "error", "content": "Migration failed: duplicate column"},
        {"kind": "commit_change", "message": "fix(database_migrations): handle idempotency"},
        {"kind": "error", "content": "Migration failed: duplicate column"},
        {"kind": "result", "success": 0.3, "target": "database_migrations"},
    ]
}

# What it creates without anyone telling it to:
DecisionStream(
    target="database_migrations",
    phase="pattern",
    vitality="active",
    confidence=0.15,  # weak — just observed, not confirmed
    frequency=1,
    evidence_event_ids=[42, 43, 44, 45]
)

Path C: Intervention (operator override)

A human operator or system process can inject a DecisionStream directly via KIND_INTERVENTION. Interventions get priority treatment: they start at phase=EMERGENT, confidence=0.7, removal_cost=0.8. The system treats human overrides as highly credible prior knowledge.

The lifecycle: from newborn pattern to canonical truth

Here's the full lifecycle of a piece of knowledge under the new system.

Stage 1: PATTERN — "I'm seeing something"

The system has observed at least one event cluster pointing to this target. It doesn't know yet if this is signal or noise.

confidence: 0.10 – 0.30
phase_weight in search: 1.0 (baseline)
Stored as: KIND_PROPOSAL in semantic store

The bar for entry is deliberately low: commits ≥ 1 OR successes ≥ 1 OR errors ≥ 1. A single event is enough to start watching. But the system makes no commitments at this stage — it's just paying attention.

Stage 2: EMERGENT — "This is real"

The pattern has accumulated enough confirmations to be considered real, not accidental.

Transition condition: frequency ≥ 3 OR removal_cost ≥ 0.4, AND lifetime > 1 day

confidence: 0.30 – 0.70
phase_weight in search: 1.2 (slight boost)

This is where most agent-recorded decisions start (Path A), because if you're explicitly recording something, you've already observed it enough to have an opinion.

Stage 3: CANONICAL — "This is truth"

The knowledge has been repeatedly confirmed over a meaningful time window, shows stable regularity, and would be costly to lose. It is the system's current best understanding of reality.

Transition condition: coverage > 0.3 AND stability_score > 0.6 AND removal_cost > 0.5 AND vitality = ACTIVE

confidence: 0.65 – 0.95
phase_weight in search: 1.5 (maximum priority)

A CANONICAL decision gets significant weight in search rankings. When an agent asks "how do we handle database migrations?", CANONICAL knowledge about that target surfaces first — not because it's newest, but because it's most thoroughly validated.

Phase only moves forward. There's no CANONICAL → PATTERN regression. Once knowledge has been thoroughly validated, it doesn't un-validate. It can decay in vitality and confidence, eventually being forgotten — but it doesn't regress to an earlier phase.

Vitality: the pulse of knowledge

Phase describes the crystallization of knowledge over its lifetime. Vitality describes its current aliveness.

These are independent. A CANONICAL decision can be DORMANT. An EMERGENT decision can be ACTIVE. The matrix of combinations matters.

Phase	Vitality	Meaning
CANONICAL	ACTIVE	Core truth, actively confirmed — highest trust
CANONICAL	DECAYING	Established truth, not recently touched — probably still valid
CANONICAL	DORMANT	Old truth nobody's working near — may be stale
EMERGENT	ACTIVE	Growing belief, being actively validated — watch this space
EMERGENT	DORMANT	Belief that was forming but went quiet — check if still relevant
PATTERN	ACTIVE	Fresh observation, still just noise or signal — too early to tell

The vitality update rules

Every reflection cycle, update_vitality() runs for every active DecisionStream:

days_since_last_event < 7  →  ACTIVE   (full weight in search: ×1.0, confidence unchanged)
7 ≤ days_since_last < 30   →  DECAYING (half weight in search: ×0.5, confidence −0.05/cycle)
days_since_last ≥ 30       →  DORMANT  (minimal weight: ×0.2, confidence −0.20/cycle)

The penalty is steep for DORMANT: −0.20 per cycle. This is intentional. Knowledge that's been silent for over a month in an active codebase is a liability — it might be describing infrastructure that no longer exists, decisions that were quietly reversed, or patterns that evolved without anyone recording the change.

Waking up from DORMANT

Crucially, vitality is not a one-way trip. If a DORMANT DecisionStream's target area sees new events in the next reflection cycle, it wakes up:

New events for target "redis_caching" detected
→ update_vitality() sees days_since_last = 0
→ DORMANT → ACTIVE
→ confidence decay stops
→ knowledge re-enters normal search weight

This matters. An agent that was shelved for six months, then reactivated to work on the same system, should see the old Redis caching knowledge gradually wake up as it starts working near those systems again — not suddenly, but as the evidence accumulates.

The new search: lifecycle-aware ranking

The old search was: vector similarity → keyword fallback → evidence boost → return.

The new search has five stages and a formula that incorporates the full lifecycle state:

① Parallel search: ANN vector search + FTS5 keyword search → limit×10 candidates each

② RRF Fusion: Reciprocal Rank Fusion merges both lists
   score[fid] += 1 / (60 + rank)
   Records appearing in both lists get natural boost

③ Batch resolve to truth: follow superseded_by chain via recursive CTE
   One SQL query instead of N round-trips

④ Apply lifecycle multiplier:
   final_score = rrf_score × (1 + evidence_boost) × lifecycle_multiplier

   Where:
   lifecycle_multiplier = phase_weight × vitality_weight

   phase_weight:   canonical=1.5, emergent=1.2, pattern=1.0
   vitality_weight: active=1.0, decaying=0.5, dormant=0.2
   status penalty: superseded=0.3, rejected/falsified=0.2

   evidence_boost = min(link_count × 0.2, 1.0)  ← capped at 2× max

⑤ Paginate, deduplicate, increment hit counter

The RRF constant of 60 means early ranks count much more than late ones. A result ranked #1 in vector search gets 1/61 = 0.016, while one ranked #100 gets 1/160 = 0.006. This is the same approach used in modern hybrid search systems like those powering Elasticsearch's learned sparse retrieval.

The recursive CTE for truth resolution deserves special mention. In the old system, resolve_to_truth() was an application-level loop — potentially N round-trips to the database. In v3.0 it's a single SQL query with a maximum chain depth of 20:

WITH RECURSIVE chain(fid, status, superseded_by, depth) AS (
  SELECT fid, status, superseded_by, 0 FROM semantic_meta WHERE fid = ?
  UNION ALL
  SELECT m.fid, m.status, m.superseded_by, c.depth + 1
  FROM semantic_meta m
  JOIN chain c ON m.fid = c.superseded_by
  WHERE c.depth < 20 AND c.status != 'active' AND c.superseded_by IS NOT NULL
)
SELECT m.* FROM semantic_meta m
JOIN chain c ON m.fid = c.fid
ORDER BY c.depth DESC LIMIT 1;

One query. Maximum depth 20 (more than enough for any realistic evolution chain). This is a meaningful performance improvement for systems with long supersede histories.

Conflict resolution: smarter auto-supersede

The old auto-supersede threshold was simple: cosine similarity > 0.85 → supersede automatically, else raise ConflictError.

v3.0 adds two refinements.

Title boost: If the titles of the old and new decisions are nearly identical (SequenceMatcher ratio > 0.90), the similarity is boosted to at least 0.71 — enough to trigger auto-supersede. This handles the case where someone updates the rationale without changing what the decision is fundamentally about.

LLM arbitration in the gray zone: When similarity falls between 0.50 and 0.70, the system can now call an optional arbiter_callback — an LLM or custom function that looks at both decisions and returns 'SUPERSEDE' or 'CONFLICT'. This is the gray zone where cosine similarity is genuinely ambiguous: the decisions are semantically related, but not obviously about the same thing.

# v3.0 auto-resolution logic
sim = cosine_similarity(new_vector, old_vector)

# Title boost
title_sim = SequenceMatcher(None, old_title, new_title).ratio()
if title_sim > 0.90:
    sim = max(sim, 0.71)

# Gray zone: optional LLM arbitration
if 0.50 <= sim < 0.70 and arbiter_callback:
    decision = arbiter_callback(new_data, old_data)
    if decision == 'SUPERSEDE':
        sim = 0.71

# Final call
if sim > 0.70:
    supersede_decision(old_id)   # auto-supersede
else:
    raise ConflictError()         # explicit resolution required

The threshold moved from 0.85 to 0.70. This is deliberate — with the title boost and LLM arbitration now available for edge cases, we can be less conservative with the pure-vector threshold.

What "Zero-Touch" actually means now

The old article's title mentioned "zero-touch memory." At the time, that referred to the client hooks — LedgerMind installing itself into Claude Code, Cursor, and Gemini CLI so that every agent interaction was automatically recorded without any code changes.

In v3.0, zero-touch goes deeper. The entire knowledge lifecycle — birth, growth, crystallization, decay, and death — happens automatically. Here's what runs without any human or agent intervention:

Every ~5 minutes (run_maintenance()):
  1. GitIndexer: read last N commits → episodic as commit_change events
  2. DistillationEngine: scan for successful trajectories → ProceduralProposals
  3. _cluster_evidence(): group events by target → stats dict
  4. LifecycleEngine: update phase, vitality, confidence for all streams
  5. DecayEngine: archive old episodic events, decay semantic confidence
  6. MergeEngine: scan for duplicate active decisions → merge proposals
  7. Self-healing: stale lock removal, meta-index reconciliation

An agent can run for months without a human touching the memory system. Patterns emerge. They mature. Stale ones fade. New ones take their place. The knowledge graph evolves to reflect the actual state of the system being built — not just what someone thought to write down.

The numbers: key thresholds and constants

For anyone building on top of v3.0 or tuning it for their use case:

Parameter	Default	Effect
`ttl_days`	30	Episodic events without immortal links archived after this many days
`decay_rate`	0.05/week	Confidence loss per week for inactive semantic records
`forget_threshold`	0.10	Below this confidence, hard deletion
PATTERN → EMERGENT	`frequency ≥ 3`	Minimum confirmations to graduate
EMERGENT → CANONICAL	`coverage > 0.3, stability > 0.6, removal_cost > 0.5`	Full crystallization criteria
ACTIVE → DECAYING	7 days without events	Start of vitality decay
DECAYING → DORMANT	30 days without events	Deep dormancy
`evidence_boost` cap	1.0 (max 2× multiplier)	Maximum search boost from episodic links
CTE depth limit	20	Maximum supersede chain length for resolve_to_truth
RRF constant	60	Controls rank score distribution in fusion
CANONICAL weight	1.5	Phase multiplier in search ranking
DORMANT weight	0.2	Vitality penalty in search ranking

The confidence formula: confidence = 0.4 × utility + 0.4 × removal_cost + 0.2 × stability_score. Not a single number but a weighted composite of three orthogonal signals — how useful, how costly to lose, how regular the confirmations have been.

What this looks like in practice

Let me walk through a realistic scenario with the new system.

An agent starts working on a project. In the first two days it hits a bunch of Redis connection errors:

Day 1: 3 Redis connection errors → cluster "redis_caching" starts
         ↓ run_maintenance()
         → _create_pattern_stream("redis_caching")
         → phase=PATTERN, vitality=ACTIVE, confidence=0.12

Day 2: 2 more Redis errors, 1 successful connection
         ↓ run_maintenance()
         → frequency=6, last_seen=today, stability improving
         → phase=PATTERN still (need lifetime > 1 day)

Day 3: Another error, agent explicitly fixes the connection config
         ↓ run_maintenance()
         → lifetime > 1 day ✓, frequency=7 ≥ 3 ✓
         → promote_stream() → phase=EMERGENT
         → confidence=0.34, lifecycle_mult=1.2×1.0=1.2

Two weeks later, after the agent has successfully run Redis operations repeatedly:

Day 17: coverage=17/30=0.57 > 0.3 ✓, stability=0.73 > 0.6 ✓, removal_cost=0.61 > 0.5 ✓
         → promote_stream() → phase=CANONICAL
         → lifecycle_mult=1.5×1.0=1.5
         → This is now the system's canonical truth about Redis caching

One month later, the team decides to remove Redis entirely and use in-memory caching:

Day 47: No Redis events for 30 days
         → vitality: ACTIVE → DECAYING → DORMANT
         → confidence dropping: 0.61 → 0.41 → 0.21...
         → search weight drops from 1.5 to 0.3 (1.5 × 0.2)

Day 90: confidence < 0.10
         → should_forget = True
         → forget() removes all traces from episodic, semantic, vector index
         → Git preserves the deletion commit for audit purposes

The system forgot about Redis on its own. Nobody had to explicitly delete anything. The knowledge served its purpose, the evidence dried up, confidence eroded, and eventually the system cleaned itself up.

What's the same

Before you ask: the foundational architecture from the original article still holds.

Hybrid storage: SQLite (episodic) + Git-backed Markdown (semantic) + NumPy vector index
Immortal Links: episodic events linked to semantic records are never pruned
Git as audit log: every semantic change is a commit, cryptographically verifiable
MCP server: 15 tools, compatible with Claude Desktop and Gemini CLI
Three-layer conflict protection: pre-flight + pre-transaction + inside-lock
Client hooks: Claude Code, Cursor, Gemini CLI, VSCode extension — all still supported for zero-touch episodic recording

The new ontology is layered on top of all of this, not replacing it.

Why this matters

Most AI memory systems are built around one question: "can I find this thing later?"

That's necessary but not sufficient. The harder questions are:

Is this thing still true?
How confident should I be in it?
Has the world changed in ways that make this outdated?
What's my evidence that this was ever true?

v3.0 is an attempt to build a memory system that can answer all four. Not perfectly — the confidence numbers are heuristics, the decay rates are tuned empirically, and the lifecycle mechanics are an approximation of something much messier in reality. But the direction feels right.

Knowledge that ages. Knowledge that wakes up when touched. Knowledge that fades when ignored. Knowledge that dies when it becomes too uncertain to trust.

Memory that breathes.

LedgerMind v3.0 — Non-Commercial Source Available License. Free for personal, research, and educational use.

If you're building something with this, or have thoughts on the lifecycle mechanics — particularly whether the CANONICAL thresholds are in the right place — I'd genuinely like to hear about it in the comments.

LedgerMind: Zero-Touch Memory That Survives Real Agent Work

Stanislav — Wed, 25 Feb 2026 16:11:07 +0000

Subtitle: A deep technical walkthrough of how LedgerMind turns fragile chat memory into a self-healing knowledge system with automatic client-side integration.

Before we dive in, here’s the short version of what I understood from the project: LedgerMind is not trying to be “just another vector memory.” It’s a full memory lifecycle engine for agents: automatic context injection, automatic action logging, conflict-aware decision evolution, and Git-backed auditability. The key differentiator is a true zero-touch integration path using native client hooks, so agents can benefit from memory without burning prompt tokens on manual tool choreography.

1) Why regular agent memory breaks in production

If you’ve built more than one serious AI workflow, you’ve probably seen this failure pattern:

The model gives good answers in session 1.
Session 2 starts drifting because context isn’t loaded consistently.
Session 3 contradicts earlier decisions.
A week later, the “memory layer” is a pile of stale embeddings and half-structured notes nobody trusts.

The root cause is usually architectural, not model quality.

Most memory stacks are still CRUD-centric:

Store a chunk.
Retrieve similar chunks.
Hope retrieval relevance is enough.

That approach misses the core problem: agents don’t just need facts. They need persistent reasoning continuity — what was tried, what failed, what was decided, why it was decided, and what superseded it later.

In other words, the useful unit is often not “message text.” It’s a structured cognitive artifact:

hypothesis
decision
confidence
consequences
supersession chain
execution outcomes over time

Without that structure, you get memory inflation and epistemic drift:

old but high-similarity context keeps resurfacing,
failed approaches are accidentally reintroduced,
decisions have no lifecycle,
and no one can audit when/why behavior changed.

There is also an operational issue: a lot of agent frameworks require the model to remember to call memory tools correctly. That means every run pays a token and reliability tax for orchestration instructions like:

1) call memory.search
2) summarize top-3
3) call memory.record after response
4) maybe run maintenance occasionally

That’s fragile. The model can skip steps. Prompts can regress. Tool schemas can drift. In a real dev workflow, this eventually fails.

What you want instead is memory that behaves like infrastructure:

always on,
automatically injected,
automatically updated,
and self-correcting when knowledge conflicts appear.

That is exactly the class of problem LedgerMind is designed to solve.

2) What LedgerMind is: a zero-touch memory lifecycle engine

LedgerMind positions itself as an autonomous memory management system for AI agents.

The core idea is simple but powerful:

Don’t ask the model to manage memory manually. Integrate memory at the client boundary with hooks, and run lifecycle intelligence in the background.

Instead of “agent calls tools when it remembers,” LedgerMind moves memory responsibility into two deterministic layers:

Client-side hook integration (before/after agent execution)
Background maintenance and reasoning (reflection, decay, conflict handling, audit sync)

This gives what the project calls true zero-touch behavior:

context retrieval happens automatically before prompts,
interaction logging happens automatically after responses,
no extra MCP choreography is required in the prompt loop.

From an engineering perspective, this is a huge reliability upgrade because it removes a stochastic control path (LLM remembers to call tools) and replaces it with deterministic runtime hooks.

At the storage level, LedgerMind uses a hybrid model:

SQLite episodic store for event-like interactions,
semantic records for decisions/proposals/rules,
Git-backed audit history for traceability and evolution.

So memory is both queryable and inspectable. You can retrieve relevant context quickly, but you also get a hard audit trail of how knowledge changed.

3) How it works under the hood

Let’s break the architecture into the actual runtime loop.

3.1 Hook-driven automatic injection

With ledgermind-mcp install <client>, LedgerMind installs native hooks for supported clients.

Conceptually:

Before prompt hook:
- take user input + workspace cues,
- retrieve relevant decisions/rules/hypotheses,
- inject compact context into the prompt payload.
After response hook:
- capture user prompt, model response, and action traces,
- record to episodic/semantic layers,
- feed future reflection and ranking.

This is why “zero-touch” matters: the agent no longer needs explicit memory tool planning.

Example install flow:

# one command from your project root
ledgermind-mcp install gemini --path ./memory

Once this is installed, memory IO is automated at the client boundary.

3.2 Bridge API as fast path

Under hooks, LedgerMind uses lightweight bridge operations (context + record) instead of forcing a full MCP round trip for every turn. That reduces latency and keeps interaction predictable for IDE/chat usage.

A conceptual pattern looks like this:

from ledgermind.core.api.bridge import IntegrationBridge

bridge = IntegrationBridge(memory_path="./memory")

# before request
context = bridge.get_context_for_prompt("How should we handle DB migrations?")

# after response
bridge.record_interaction(
    prompt="How should we handle DB migrations?",
    response="Use Alembic with reversible migration scripts.",
    success=True,
)

The hook runtime just automates this lifecycle continuously.

3.3 Action logging, not just chat logging

A subtle but important design choice: LedgerMind treats interactions as fuel for reasoning systems, not only as transcript history.

When the system records post-response artifacts, it can later derive:

repeated successful trajectories,
unstable patterns tied to errors,
candidate best-practices worth promoting,
conflicting decisions requiring supersession.

This is where memory shifts from passive retrieval to active knowledge evolution.

3.4 Self-healing and maintenance heartbeat

LedgerMind includes autonomous maintenance routines (heartbeat model).

Operationally, heartbeat tasks include things like:

repository sync and integrity checks,
reflection over episodic outcomes,
confidence-based proposal promotion,
decay of stale/low-value artifacts,
conflict resolution for semantically overlapping decisions.

This reduces manual cleanup burden and keeps memory quality from degrading over long-running projects.

3.5 Git audit as first-class memory property

Many memory systems claim “long-term memory,” but very few provide proper revision semantics.

LedgerMind’s Git-backed semantic layer enables:

traceable decision history,
reproducible state transitions,
explicit supersede chains,
and postmortem-friendly forensics.

That matters when teams ask:

“Why did the agent start doing X last Tuesday?”
“Which prior rule did this decision replace?”
“Can we inspect the exact state used in that release cycle?”

With Git history, those become inspectable questions, not guesswork.

4) The key focus: preserving hypotheses, decisions, and conclusions

The most interesting part of LedgerMind is philosophical and technical at the same time:

It prioritizes preserving reasoned artifacts over raw chat volume.

Why this matters:

Raw chat is high entropy.
Decisions are compressed intent.
Hypotheses capture uncertainty.
Conclusions encode validated state.

If your memory stores these as typed, evolving objects, you can build agent behavior that is more stable over time.

4.1 From interaction to durable knowledge

A healthy loop looks like this:

Agent acts.
Outcome is logged.
Reflection engine identifies patterns.
Pattern becomes proposal/hypothesis.
High-confidence proposal is accepted/promoted.
New decision supersedes obsolete one.
Future prompts automatically inherit the updated rule.

This creates knowledge compounding instead of transcript accumulation.

4.2 Example: conflict-aware evolution

Imagine your team initially records:

“Use SQLite for local task queue state.”

Later, incidents show write contention under concurrency. New evidence produces:

“Use PostgreSQL for queue state in multi-worker deployments.”

A naive memory system may retrieve both forever. LedgerMind’s supersession model can preserve history while promoting the newer rule as active truth, so agents stop repeating outdated guidance.

That is exactly what you want from production memory: historical completeness with operational clarity.

4.3 Why this beats “just RAG over chats”

RAG over chat logs is great for recall, but weak for governance.

When your memory includes hypotheses and decisions with lifecycle metadata, you gain:

better controllability,
safer automation,
lower contradiction rates,
and clearer debugging when outputs regress.

For teams running autonomous or semi-autonomous workflows, this is the difference between a demo and infrastructure.

5) Current status and ecosystem readiness

At the moment, LedgerMind is strongest in its hook-first client experience.

Current practical status:

Gemini CLI: 100% zero-touch and stable.
Claude Desktop: support in progress / rolling out.
Cursor: support in progress / rolling out.

This staging strategy makes sense technically: ship one fully reliable integration path first, then expand client coverage without compromising behavior guarantees.

Also worth noting: LedgerMind can still run via MCP and direct Python integration, so teams can adopt incrementally while waiting for preferred client maturity.

6) Install and try it in one command

If you want the shortest path to value, start with hook installation directly in your project:

ledgermind-mcp install gemini --path ./memory

That single command sets up zero-touch memory behavior for Gemini CLI with a project-local memory directory.

If you’re starting from scratch, install package first:

pip install ledgermind[vector]

Then run the install command above and just keep using your client normally. Context injection and interaction recording happen automatically.

Repository:

GitHub: https://github.com/sl4m3/ledgermind
PyPI: https://pypi.org/project/ledgermind/

7) Future plans that matter technically

From the current architecture and docs direction, the roadmap opportunities are clear and compelling:

Broader zero-touch client support
- Harden hook packs across more IDE/chat surfaces.
- Keep behavior parity so teams can swap clients without memory regressions.
Richer introspection and explainability
- Better visibility into why context was injected.
- Decision provenance UIs for rapid debugging.
Stronger policy controls for autonomous promotion
- Tunable thresholds by namespace/target.
- Explicit governance modes for high-risk domains.
Deeper multi-agent coordination primitives
- Shared + isolated memory zones.
- More robust conflict mediation between agent roles.
Operational hardening and benchmark transparency
- Reproducible latency/quality benchmarks under real coding workloads.
- Clear SLO-style metrics for memory freshness and contradiction rates.

If LedgerMind continues executing on these areas, it can become a canonical memory substrate for practical agent engineering, not just experimentation.

8) Conclusion: memory should be a system, not a prompt trick

LedgerMind is exciting because it reframes the problem correctly.

This is not “how do we retrieve a few old messages?”

It is:

how to keep agent knowledge coherent over time,
how to automate memory operations reliably,
how to preserve decisions and hypotheses as first-class artifacts,
and how to audit and evolve that knowledge safely.

The zero-touch hook model is the keystone: if memory depends on model compliance, it will eventually fail. If memory is enforced at the client/runtime boundary, you get repeatability.

If you’re building serious agent workflows, this project is worth testing — especially if you’ve already felt the pain of prompt-level memory orchestration.

I’d love to see feedback from teams running this under real production constraints:

Where does zero-touch integration save the most effort?
What failure modes still slip through?
Which observability primitives are most needed next?

If you try it, share benchmarks, failure cases, and architecture notes — that kind of feedback is exactly what pushes memory infra from “interesting” to “reliable.”