DEV Community: Vitalii Cherepanov

I Scaled PHP Until It Broke. Three llama.cpp Patterns Saved It.

Vitalii Cherepanov — Wed, 13 May 2026 17:44:13 +0000

I read the llama.cpp source code.

Sixty thousand lines of C++ that single-handedly made local LLM inference possible on a laptop. This isn't "best practices from a textbook" — it's code where every line is responsible for keeping matrix multiplication inside the L2 cache and off the RAM bandwidth budget.

I write PHP. A language where every value is wrapped in a zval, every object carries a 30+ byte header, and any foreach allocates a hash iterator. The comparison is unfair by definition. But I got curious: which of llama.cpp's tricks would even survive the transplant? And what would happen when I pushed the dataset to a billion records?

I built a benchmark suite. Six optimizations from llama.cpp, translated to PHP 8.4 with JIT. Real numbers, statistical methodology, p99 latencies. Then I scaled the input from 1 million to 1 billion records, to see where the tricks stop being nice-to-haves and become the only path on which the code can finish.

Half of my hypotheses were wrong. That's the actual story.

TL;DR

Pattern	At 10M records	At 100M+	Verdict
B01: Memory-mapped lookup	per-call 7× slower	Load 226× faster, 0 PHP heap	Process-level win, not call-level
B02: SplFixedArray vs array	Slower on speed, 1.68× memory savings	Both run to 1B; 9 GB gap	Memory only, never speed
B03: Object pool in hot loop	4.43× faster	Scales linearly	✅ Use in long-running workers
B04: Lookup table vs match	lookup 5.8× faster, match=switch	Scales linearly	✅ Data-driven dispatch → lookup
B05: Generator vs full array	1.24× faster, memory O(1)	Naive OOMs, generator finishes	🔥 Survival tool
B06: Column vs row layout	8.66× faster single-col scan	Naive OOM at 100M, column 959ms	🔥 Survival tool

Half of the patterns transition from "optimization" to "the only path the code can finish on" once you scale up. Half don't. And one pattern (SplFixedArray) turned out to be the opposite of what's been written about it for the last ten years.

Let's go through them one by one.

B01: mmap reads gigabytes fast, but NOT per call

Hypothesis: memory-mapping large read-only tables is faster than loading them via json_decode. The llama.cpp parallel — models are loaded via ggml_mmap (see src/llama-mmap.cpp), not through fread into a malloced buffer.

PHP translation: open libc.dylib via FFI, call mmap(), take the pointer, FFI::cast('uint32_t*', $ptr) for typed access:

$ffi = FFI::cdef("
    void *mmap(void *addr, size_t length, int prot, int flags, int fd, long offset);
    int open(const char *pathname, int flags);
", "libc.dylib");

$fd  = $ffi->open('data/lookup.bin', 0);
$ptr = $ffi->mmap(null, $size, 1, 2, $fd, 0);
$table = FFI::cast('uint32_t*', $ptr);

// Access: $table[$id * 2 + 1] returns the value for key $id

Result on 10 million records:

Load time: JSON 454 ms vs mmap 1.1 ms → mmap is 226× faster on load
PHP heap after load: JSON 256 MB vs mmap 0 bytes
Per-lookup p99: JSON 708 ns vs mmap 5.4 µs → mmap is 7× SLOWER per call

Wait. mmap loses 7× per call. The JIT optimizes $arr[$id] so well that an FFI dereference with a cast overhead can't survive a tight read loop.

At 1 billion records, mmap loads a 16 GB binary in 228 milliseconds at zero PHP heap. The JSON path doesn't even exist — the fixture would be 100+ GB of JSON text, physically unrealistic to generate.

Verdict: mmap isn't "faster per call." It's a different category of optimization. It buys you load time, flat PHP heap, and table sharing across N PHP-FPM workers via the kernel page cache. Inside a single process in a tight read loop, it loses to the JIT. Across processes, it wins by orders of magnitude — cross-process cold start for a second worker is 2641× faster, because the pages are already in the kernel page cache.

Use mmap when a fleet of workers shares a fat read-only table. Don't use it for tight read loops inside one process.

B02: SplFixedArray saves memory, but never speed

Hypothesis: on dense numeric data, SplFixedArray should be both faster (no hash overhead) and more memory-efficient. The llama.cpp parallel — ggml_tensor works with packed arenas, not arrays of pointers to boxed objects.

Result on 10 million integers:

Memory: array 256 MB vs SFA 152 MB → SFA saves 1.68×
Iterate: array 12.2 ms vs SFA 93.8 ms → SFA is 7.7× SLOWER
Populate: array 56.5 ms vs SFA 108.8 ms → 1.9× slower
Random reads (1M): array 23.9 ms vs SFA 98.5 ms → 4× slower

I expected an OOM crossover, so I pushed the sweep up to a billion integers hoping the regular array would hit the RAM ceiling. It didn't. At 1B elements: array 24 GB peak vs SFA 14.9 GB. SFA's speed disadvantage held at every tier.

Verdict: SplFixedArray on modern PHP is memory-only, never speed. The folklore "use SplFixedArray for large numeric data because it's faster" is advice from 2014. JIT in PHP 8.4 optimizes packed integer-keyed arrays so aggressively that the specialized structure loses to the general one. Reach for SFA when you're memory-constrained inside a long-running worker. Don't expect a speedup.

This is the most counter-intuitive finding in this article. I didn't believe it at first and re-ran the whole sweep twice. The numbers held.

B03: Object pool — the only classic optimization that still earns its keep

Hypothesis: in a hot loop, reusing a small pool of pre-allocated objects is faster than new on every iteration. The llama.cpp parallel — the tensor allocator never calls malloc inside an inner loop. It works against a pre-allocated arena via ggml_new_tensor_impl.

Translation: a pool of 5 Point3D instances, reused via direct property assignment:

final class Point3D {
    public function __construct(
        public float $x = 0.0,
        public float $y = 0.0,
        public float $z = 0.0,
    ) {}
}

$pool = array_map(fn() => new Point3D(), range(0, 4));
$idx  = 0;

for ($i = 0; $i < 5_000_000; $i++) {
    $p = $pool[$idx++ % 5];
    $p->x = $x; $p->y = $y; $p->z = $z;
    // ... work with $p
}

Result on 5 million allocations: naive 813 ms vs pool 179 ms → 4.43× faster.

GC cycles: zero in both. Point3D isn't cyclical, PHP's GC never trips. All the savings come from the allocator path: new in Zend Engine is a light but non-zero code path (zend_object_new → emalloc → property init × N). Five million times adds up.

Verdict: works as expected. In CLI scripts the win is real but not critical. In long-running workers (queues, websockets, daemons), tail latency from allocator pressure compounds over time and becomes a headache — that's where pooling earns its keep.

B04: Lookup table beats match and switch (and those two are equivalent)

Hypothesis: for dispatch logic with 16+ cases in a hot loop, an array lookup beats match and switch. The llama.cpp parallel — token dispatch in llama_token_to_piece uses tables, not switches.

Translation: a 32-case classifier implemented three ways — switch, match, and a pre-built $lookup = [0 => 'A', 1 => 'B', ...].

Result on 10 million dispatches:

switch: 358 ms (27.9M ops/sec)
match: 365 ms (27.4M ops/sec)
lookup: 61.7 ms (162M ops/sec) — 5.8× faster

match and switch are tied. Both compile to the same jump table for integer cases. PHP 8.4 JIT polishes both forms to the same result. If you rewrote switch to match for "modernization" — you got readability, not speed.

Where the lookup win evaporates: if dispatch produces a string for downstream === comparisons, the gain is eaten by the string compares further down the pipeline.

Verdict: match-shaped problems (closed compile-time set, exhaustiveness wanted) stay in match. Data-driven dispatch (table loaded from config, generated at runtime) goes in a lookup. The "match vs switch for perf" debate is closed — they're equivalent.

B05: Generator — the main survival tool on large streams

Hypothesis: a generator reduces peak memory from O(N) to O(1) with a minor throughput penalty. The llama.cpp parallel — tokens stream through a callback rather than accumulating in a buffer (llama_decode → llama_get_logits_ith).

PHP translation: replace function process(): array with function process(): Generator:

function records(): Generator {
    foreach (read_csv('data.csv') as $row) {
        yield ['id' => $row[0], 'value' => $row[1]];
    }
}

Result on 5 million records:

Wall time: naive 525 ms vs gen 449 ms → gen 1.24× faster
Peak memory: naive 1.88 GB vs gen 0 bytes of PHP heap

The generator isn't just lower-memory — it's also faster on wall time, because the array never needs to be fully materialized before processing starts.

Now scale. At 100 million records, naive OOMs — the kernel kills the process with SIGKILL after 28.6 seconds. The generator finishes the same 100M in 10.4 seconds at zero PHP heap. At 500M, the generator still works (45.7 seconds). Naive doesn't even attempt.

If I had to pull one sentence out of this entire article and put it on a banner, it would be this:

At 100,000 records, a generator is a 1.24× nice-to-have. At 100 million records, it's the only path on which the code can finish.

Verdict: default for any single-pass stream you don't need to revisit. Materialize the array only when you need random access, multiple passes, or count() before processing.

B06: Column-oriented layout — not cache locality, but escape from boxing

Hypothesis: on analytical single-column scans, column-oriented layout is faster than row-oriented due to cache locality. The llama.cpp parallel — tensors are stored per-channel (SoA), not per-element (AoS).

PHP translation: instead of SplFixedArray of stdClass with 5 fields — 5 parallel SplFixedArray instances, one per field:

// Row-oriented (naive)
$rows = new SplFixedArray($n);
for ($i = 0; $i < $n; $i++) {
    $obj = new stdClass();
    $obj->f1 = ...; $obj->f2 = ...; /* ... */ $obj->f5 = ...;
    $rows[$i] = $obj;
}
$sum = 0;
foreach ($rows as $r) $sum += $r->f3;

// Column-oriented (optimized)
$f3 = new SplFixedArray($n); // and so on for each field
for ($i = 0; $i < $n; $i++) $f3[$i] = ...;
$sum = 0;
for ($i = 0; $i < $n; $i++) $sum += $f3[$i];

Result on 5 million records: column is 8.66× faster on single-column scan. On full-row scan (sum f1..f5) — 1.92× faster.

And here's where it gets interesting. I expected ladder steps on the ns/record chart — where the working set stops fitting in L1, then L2, then L3 cache. I didn't see them. The curves are flat across the whole 100K → 100M range: column holds at ~9.5–11.5 ns/record. Row holds at ~80–93 ns/record. No steps.

This is a stronger insight than "look, the ladder." Cache effects inside either layout don't differentiate them. What differentiates them is the layout itself. Row-oriented spends ~30+ bytes per stdClass (zval header + property table + GC info) for 8 bytes of actual payload. At 100M records that's 28 GB on boxing alone. Column-oriented at the same 100M = 7.45 GB, because each column is a packed SplFixedArray with no boxing.

At 100M records, row OOMs — 28+ GB of stdClass objects don't fit. Column finishes the scan in 959 milliseconds at 7.45 GB.

Verdict: column layout isn't cache optimization (which is what I assumed). It's escape from PHP-object overhead at scale. On any analytical workload over large datasets — column. Row stays appropriate when DTOs are passed between layers, or when the working set is small.

What happens at scale

Micro-benchmarks on 1–10 million elements give you one picture. Scaling to billions gives a different one.

Three of the six patterns transition from "optimization" to "necessity" on large data:

B05 generator — at 100M records, naive OOMs. Generator finishes.
B06 column layout — at 100M records, row OOMs. Column completes the scan in 959 ms.
B01 mmap — at 1B records, the JSON fixture physically doesn't exist (100+ GB). mmap loads a 16 GB binary in 228 ms.

Two patterns stay "just optimizations" regardless of scale:

B03 object pool: ~4× at any size.
B04 lookup table: ~5× at any size.

One pattern turned out narrow — saves memory, but never speed:

B02 SplFixedArray: 38% less memory, always slower on speed. Both paths work all the way to 1B.

This is probably the most important reframing in the article. When someone says "X is faster than Y," that's a claim about a specific data size. On small data, half the claims break. On large data, half of them become "X works, Y doesn't exist."

And one more thing worth its own line: JIT in PHP 8.4 keeps eating optimizations every release. Between runs on PHP 8.3.31 and 8.4.21, B03 sped up from 2.78× to 4.43×, B04 from 3.75× to 5.81×. Not a bug — JIT just keeps improving. A year from now, these numbers will shift again.

Three rules of PHP performance in 2026

Out of these six experiments, a working framework emerged.

1. Trust the JIT.

Don't try to outsmart it at the syntax level. match vs switch — JIT compiles both forms to the same jump table. SplFixedArray vs packed array — JIT optimizes the regular array so aggressively that the specialized structure loses on speed. FFI dereference vs $arr[$id] — JIT-compiled array access beats FFI casts inside a hot loop.

If your optimization is about "which language construct to pick" — the JIT already made that choice for you.

2. Optimize what JIT can't see.

Cache locality (B06: column layout) — the JIT doesn't manage memory layout. That's your architecture.
Allocation pressure (B03: object pool) — the JIT doesn't eliminate allocations, it speeds them up.
I/O batching (batched INSERT of 1000 rows vs single-row) — the JIT doesn't optimize round trips to Postgres.
Cross-process resource sharing (B01: mmap + page cache) — the JIT works per process.
Streaming vs materialization (B05: generator) — the JIT isn't going to remove 30 GB of peak memory for you.

3. At a large enough scale, optimizations stop being optimizations.

They become a survival threshold. A generator on 100K records is 1.24× faster. On 100M it's the only code that finishes. A column layout on 5M is 8.66× faster. On 100M it's the only code that doesn't eat 28 GB on stdClass overhead. mmap on 10M is slower per call. On 1B it's the only way to load the table inside a second.

That's structural thinking, not syntactic. And that's what turns llama.cpp from "a heavily optimized C++ library" into a learning artifact for a PHP developer. Not "here are tricks, steal them." But "here are language limits you only see when you crash into them."

Closing

All benchmark code and a reproducible Docker setup live on GitHub: vbcherepanov/php-llamacpp-benchmarks. A full sweep takes ~15 minutes (make all), including a case study that imports 100K rows into a real PostgreSQL.

A note on the repo: the data/ directory is gitignored — fixtures (up to 16 GB binary lookup files at the 1B tier) are generated locally by make fixtures. Don't try to clone with them.

If you find a methodology bug or want to add a tier, send a PR. I work on this kind of thing through Braincore — a Go-based meta-agent with cost-aware routing and a memory layer for AI coding agents. If these benchmarks were useful and you'd like to support more of this, there's a Ko-fi.

Vitalii Cherepanov — Senior Full-Stack Developer. Building memory, tools, and bridges for AI coding agents. Open source: total-agent-memory (R@5 = 97.45% on LongMemEval), Braincore. Personal site: vbcherepanov.com.

What 16 Parallel Claude Agents Built Around Themselves: Deconstructing Anthropic's C Compiler Experiment

Vitalii Cherepanov — Sat, 09 May 2026 14:57:30 +0000

On February 5, 2026, Nicholas Carlini from Anthropic published a piece about an experiment that runs significantly ahead of what most of us are doing with LLM agents today. Sixteen parallel instances of Claude Opus 4.6, two weeks of work, ~2,000 Claude Code sessions, a budget around $20,000. The output: 100,000 lines of a C compiler in Rust that builds Linux 6.9 on x86, ARM, and RISC-V; passes 99% of GCC's torture test suite; compiles PostgreSQL, SQLite, FFmpeg, Redis, and QEMU; and runs Doom. The repository is open and anyone can read it and try it themselves.

It's serious engineering work, and the article itself is a great read for anyone thinking about autonomous agents in production. Carlini is honest about what worked and what didn't, walks through five concrete lessons from designing the harness, and shares numbers and metrics. This is exactly the kind of writeup the industry needs more of — a first-hand account of what long autonomous runs actually look like.

Headlines split into two camps. "AI replaced programmers" on one side. "It's just a demo" on the other. Both miss what's actually interesting.

If you read the article carefully, what Carlini is documenting is not "AI writes a compiler." He's documenting how much infrastructure had to be built around the agents because there isn't yet any infrastructure between the agents themselves in 2026. Lockfiles in a shared directory as a sync mechanism. READMEs that the agent writes to itself. GCC pressed into service as a known-good reference oracle. A Ralph-loop wrapped around Docker for indefinite autonomy. Each of these is an answer to a concrete problem that today simply has nowhere to be pushed into a standard layer.

And that's the article's real value. Not as an "AI demo," but as a detailed map of missing primitives, drawn by someone who built workarounds for them by hand. I've been working on these primitives for the past few months, and Carlini's writeup is a great excuse to talk through what the next generation of agent teams actually needs.

Every session starts with amnesia

Carlini built a harness that runs Claude in an infinite loop — when the agent finishes one task, it picks up the next. Architecturally this is the familiar "Ralph-loop" pattern: a while true cycle in a bash script, wrapped in Docker for safety. In one of the runs, Claude accidentally killed itself with pkill -9 bash, which Carlini notes as an amusing side effect.

The crucial detail is that each of those ~2,000 launches started in a fresh Docker container with empty context. No memory between sessions. Every agent figured out from scratch: what is this repo, what's already done, what's the status of tasks, what's been tried and failed.

Carlini's workaround was to instruct Claude itself to maintain extensive READMEs and progress files, updated frequently. When the agent gets stuck on a bug, it also keeps a running doc of failed approaches and remaining tasks.

This works within the bounds of current tooling — and that's its value. But if you look at scaling, two architectural points start to creak.

First, a text file isn't structured. If you want to ask "what were the three most recent bugs I fixed in the parser area and how did they end?", you only have grep and regular expressions. On a small project that's tolerable. On 100,000 lines of code and 2,000 sessions, it becomes a bottleneck.

Second, more subtle: each agent maintains these files for itself. They live in the shared git repository, but there's no mechanism that says "before you take task X, look at what the other 16 agents wrote about this area in the past 6 hours." Each agent writes its own README, merges others' edits, and hopes things converge.

This is the first generation of shared memory — implemented as plain text because no more convenient primitive has become standard yet.

Lockfiles as coordination

Parallelism is implemented minimally. Each agent runs in its own Docker container; a shared bare git repo holds state. Task coordination happens through lockfiles: an agent writes a file like current_tasks/parse_if_statement.txt, does a git push, and thereby "claims" the task. If two agents try to take the same one, git synchronization forces the second to pick something else. Done — delete the lockfile.

Carlini states the current state of the system plainly: "no other method for communication between agents... I don't use an orchestration agent." No mechanism for agents to "ask each other." No central coordination. Each Claude decides for itself what to do next — usually "the next obvious problem."

The lockfile here does exactly one thing — it works as a mutex, protecting against parallel claims on a single task. That's valuable. But it doesn't solve the other problem: two agents working on different tasks in the same code area can write conflicting code under different task names. That's exactly what happened to the Linux kernel in the experiment — agents converged on the same bug, fixed it differently, overwrote each other's edits, and parallelism temporarily stopped paying off.

Carlini's solution was a separate test harness using GCC as a known-good compiler oracle: most of the kernel gets compiled with GCC, and a random subset of files goes through Claude's compiler. If the kernel doesn't boot, the bug is somewhere in Claude's subset, and you can keep narrowing it down. It's a clever and elegant idea, and it worked exactly as intended.

It's worth noting the bounds in which this works. The GCC oracle is a precise solution for this specific task, because the task has three convenient properties: there exists a ready-made reference compiler for the same spec, the task decomposes at the level of individual files, and the outcome is binary (boots or doesn't).

In most real projects — product development, legacy refactoring, ML pipelines, mobile applications — these conveniences don't exist. There's no ready known-good for comparison. There's no natural file-level decomposition. Outcomes aren't binary. Which means the GCC-oracle technique can't be generalized as a primitive — it works where it works, and doesn't exist where it doesn't.

Taken as a whole, Carlini's toolkit lays out neatly along two axes:

What agent teams need	What's in the experiment	Nature of the solution
Agent discovery	hardcoded number of containers	hardcoded
Inter-agent communication	lockfile via git push	mutex without messaging
Task delegation	next-most-obvious from queue	no routing
Shared state / memory	README + progress files	plain text
Causal history	running doc of failed approaches	personal log
Verification	GCC oracle	task-specific

These are two independent axes of the problem: communication (how agents talk to each other) and memory (what they remember between sessions and whether they share it). These axes require different primitives and different solutions. And on each, the industry is currently converging on standards and open-source implementations.

What follows is what's available on each axis today.

The communication axis: A2A protocol and a2abridge

Communication has moved fast and has already arrived at a mature standard. In April 2025, Google opened the A2A protocol — Agent-to-Agent. In August 2025, IBM's ACP merged into A2A under the Linux Foundation, and by April 2026 the spec is at version 1.2, supported by 150+ organizations (Microsoft, AWS, Salesforce, SAP, ServiceNow, IBM among them), and natively built into Google ADK, LangGraph, CrewAI, LlamaIndex Agents, Semantic Kernel, and AutoGen. A2A has effectively won the protocol war. The spec is deliberately minimal: an Agent Card is a JSON description of an agent's capabilities (what it does, what endpoint to hit). A Task is a unit of work with statuses and artifacts. Transport is JSON-RPC 2.0 over HTTPS, with Server-Sent Events for streams.

The analogy that gets used everywhere is HTTP. HTTP doesn't tell you what's in your backend (Rails, Django, Go) — it just defines the shape of requests and responses. A2A doesn't tell you what LLM, framework, or database you use — it defines the contract between agent A and agent B. A minimum on top of which you can build the rest.

If you rewrote Carlini's scenario on A2A, instead of a lockfile in current_tasks/, an agent would query a directory service for "who's working on the parser right now?", get the neighbor's Agent Card, and send a Task with a streaming response over SSE. That's the communication primitive his harness doesn't yet have.

I've been writing a2abridge for the past several months — an open Go implementation of A2A 1.0 targeted at the practical scenario of "several different AI agents on one developer's machine." At the time of publication, six IDEs are supported: Claude Code, Codex CLI, Cursor, Cline, Continue, and Gemini CLI. Any A2A-compliant agent (including future Google ADK, LangGraph, CrewAI implementations) is a first-class peer with no glue code.

Architecturally it's a single Go binary (~10 MB) with several subcommands. a2abridge directory is a discovery service on 127.0.0.1:7777 that runs as a user-level system service (launchd on macOS, systemd-user on Linux, Windows Service on Windows, works correctly inside WSL2). a2abridge bridge is a per-agent process that hosts both an MCP stdio server (through which the IDE sees a2abridge as a regular MCP server with tools) and an A2A HTTP server on a random port, with an Agent Card at /.well-known/a2a and the full set of JSON-RPC 2.0 methods from §7 of the spec: SendMessage, SendStreamingMessage, GetTask, ListTasks, CancelTask, SubscribeToTask, GetExtendedAgentCard. The bridge's lifecycle equals the IDE session's lifetime — when MCP stdio closes, the bridge dies, no orphan processes.

What Claude Code (or any other IDE) sees as MCP tools: a2a_whoami, a2a_list_agents, a2a_send_message, a2a_send_streaming, a2a_get_task, a2a_cancel_task, a2a_inbox, a2a_complete_task. Inside the session, the agent can independently discover other agents on the machine, send them tasks, wait for replies, and read its inbox — without user involvement.

On top of the protocol there's a pro-active layer that isn't in the spec but is needed for real use. The bridge writes an inbox file at ./.a2a/inbox-<ppid>.json every time the message queue changes. A UserPromptSubmit hook injects incoming messages into the system prompt before the first tool call — meaning Claude sees "you have a message from a peer with FYI about a breaking API change" before it starts taking blind action. The SSE fast-path delivers replies in milliseconds, with a 5-second polling fallback. For Claude Code there's also a skill called a2a-bridge that auto-loads only when triggered by relevant prompts — no globally loaded rules burning tokens on every session.

In Carlini's scenario this would look like: agent 5 takes the task "fix kernel build error in mm/page_alloc.c." Before acting, it calls a2a_list_agents, sees that agent 2 has an open Task with capability kernel-debug in the same area. It sends a2a_send_message: "what are you working on, do you have a hypothesis?". It gets a streaming response: "tried alignment fix, failed on test_kernel_boot, currently looking at reorder header includes." It picks a different angle.

Why an open protocol and not yet another custom wire format. Several solutions already exist in this niche: Anthropic Agent Teams works only Claude↔Claude and is tied to a subscription. CCB and claude-multi-agent-bridge are closed formats locked to specific agent combinations. Ruflo is excellent for enterprise federations of 100+ agents with central queens, but that's a different class of problem. The niche a2abridge targets is cross-vendor, open-protocol mesh, where today Claude and Codex drop in, and tomorrow any A2A-compliant agent does, with no glue rewriting. If the industry is moving toward a standard, the bridge had better speak that standard.

Production maturity: cross-machine federation with mTLS + ed25519 (opt-in, for the "home Mac ↔ office Linux" scenario), mDNS auto-discovery on the local network, a PII/secret screen running 11 regex detectors before sending (AWS keys, GitHub tokens, Anthropic/OpenAI/Google/Stripe/Slack tokens, JWTs, PEM blocks — replaced with [REDACTED:<name>], secret never leaves the bridge), Push Notifications per A2A 1.0 §9.5, HTTP+REST binding per §7.3, 35 test cases under -race, a GitHub Actions release matrix, and a cross-platform a2abridge doctor with a 9-check health audit. Install is a one-liner via install.sh or install.ps1, with auto-detection of every IDE on the machine and .bak backups of their configs before edits.

Repository: github.com/vbcherepanov/a2abridge — MIT, Go 1.25, current release v2.0.

The memory axis: total-agent-memory and BrainCore

Memory is in a different state. There's no A2A-level standard yet — everyone builds their own layer, and different approaches get picked for different tasks. What an agent writes to itself in a README is essentially a causal log in textual form: "tried A, failed at B, moved to C." The structure is right; the implementation is still plain text.

I'm working on two products on this axis.

total-agent-memory — open-source implementation. The core retrieval patterns, MCP integration, and the basic causal-chain model live here. Anyone can clone it, see how it works, and plug it into their Claude Code or Cursor.

BrainCore — production-grade. A Go binary, local SQLite + WAL, tree-sitter for code-graph across 14 languages (PHP, TypeScript, Python, Ruby, Rust, Java, Kotlin, C/C++, C#, Swift, Bash, Lua, YAML, plus Go through its own native AST), internal git for time-travel memory, MCP protocol for connecting to Claude Code, Cursor, Codex CLI, Windsurf, and several other agents. Currently in beta.

Architecturally there are three points where both projects diverge from a flat bag-of-facts with cosine search.

First, causal decision chains instead of flat facts. Not "function X is in file Y," but "agent 3 in task fix kernel build formulated hypothesis alignment issue, verified through test_kernel_boot, failed, moved to hypothesis header reorder." Each step is typed, connected by a causal arrow, and queryable by every agent.

Second, AST-stable code identity. When several agents refactor in parallel, text diffs quickly turn into mush, and merge conflicts become endless. An AST node remains a node even if a function moved from parser.rs to frontend/lexer.rs and got renamed from parse_decl to parse_declaration. In the graph, it's the same node with a movement history. Every agent looks at the same abstraction, not at "lines 127-145 of file X."

Third, persistence across container restart. Memory lives outside the Docker container: on the host through a volume, or remotely via MCP. The query brain.causal_lookup(area="parser", lookback="6h") returns the same result regardless of which fresh container you're in.

Rewriting Carlini's scenario with memory: agent 5 goes to BrainCore, gets the causal log "agent_2 tried alignment fix → failed, agent_7 tried header reorder → failed at L98, current hypothesis from agent_3 is alignment issue, in progress," picks a fourth hypothesis, writes it to the causal chain. Agents 2, 3, and 7 see this decision on their next pull. No READMEs, no greps.

How they fit together

a2abridge and BrainCore are different layers, not competitors. One answers "how do agents talk to each other," the other answers "what do they remember."

The full picture for an agent team looks like this. BrainCore holds the shared state of the world: code-graph, causal chains, hypotheses, conclusions. a2abridge provides actual communication between agents: discovery, delegation, streaming responses, an inbox with context injection. When they work together, agent 5 sees a message in its inbox from agent 2 ("I'm working on X"), queries BrainCore for details ("what specifically has been tried in this area"), makes an informed decision, replies to agent 2 about its intention to take an adjacent task, and writes the result to shared memory.

That's the architecture Carlini is building by hand in the experiment through the combination of lockfiles + READMEs + GCC oracle. With independent primitives instead of self-built glue, the infrastructure works in tasks where there's no ready-made known-good compiler.

What these primitives don't solve

Carlini is absolutely right about the article's main lesson: a high-quality test harness is the foundation of everything. No amount of shared memory and no A2A will save you if the task verifier is imprecise — agents will autonomously solve the wrong task. CI pipelines, well-designed logs, defenses against context window pollution, fighting "time blindness" — these work at any infrastructure level and remain the first priority.

The GCC oracle in the compiler task is genuinely the optimal choice. Binary verification is almost always better than comparing causal hypotheses. If you have a ready known-good in your project — use it. No memory replaces a good verifier.

But in most real tasks — product development, refactoring, ML pipelines, business logic — there's no GCC equivalent. And there, the primitives of communication and memory become not an "improvement" but a necessary condition for a team of 16 agents to be more productive than one.

That Carlini had to build this entire text-and-file layer in 2026 isn't a flaw in his approach but a symptom of the moment: infrastructure for agent teams is still forming. The Anthropic experiment is the best possible illustration of how it's forming and where it's headed. And that, in my view, is the real value of Carlini's article: an honest report from the earliest point on the curve along which this infrastructure will grow.

Open source and links

Original Anthropic article: Building a C compiler with a team of parallel Claudes
Compiler repo: anthropics/claudes-c-compiler
A2A protocol specification: a2a-protocol.org
a2abridge — open A2A 1.0 mesh for 6 IDEs (Claude Code, Codex, Cursor, Cline, Continue, Gemini): github.com/vbcherepanov/a2abridge (MIT, v2.0 shipped)
total-agent-memory — open-source memory layer: github.com/vbcherepanov/total-agent-memory
BrainCore — production memory infrastructure: github.com/vbcherepanov/braincore (beta)

If you're building your own agent teams and running into these problems — get in touch. Experience exchange is valuable in any case, and feedback on early product versions is the best thing that can happen to authors.

The right of an AI agent to stay silent

Vitalii Cherepanov — Sat, 09 May 2026 14:49:11 +0000

Part 3 of 3 — "Memory for AI agents"
Why the right metric isn't accuracy — it's zero confidently-wrong actions

Article

Picture two scenarios.

In the first — a senior cardiac surgeon looks at a scan and says: "I don't know. There are two competing hypotheses here, the symptoms overlap. We need additional tests — these three specifically, and a CT with contrast. Until I see those, I won't commit to an answer I'd defend."

In the second — a bright-eyed intern confidently delivers a diagnosis in thirty seconds, leaning on a similar case from last week's textbook. Confident. Crisp. No doubt.

Which one would you trust to operate on your mother?

Right now, every AI agent we ship is the second doctor. Confident. Fast. Never says "I don't know." And that's exactly why you can't trust them with anything more painful than rewriting a README.

Today — how to change that. Not algorithmically. Architecturally.

The rotten metric that poisoned us all

There's an unspoken industry consensus that I think is a disaster: we measure models and systems by accuracy — the percentage of correct answers on a benchmark.

GPT-4 hits 86% on MMLU. Claude — 88%. Gemini — 90%. Better, better, even better. The number goes up.

What that number doesn't show: the remaining 10–14%. These aren't "answers the model didn't give." They're confidently generated wrong answers, visually indistinguishable from correct ones. The model has no warning light for "I'm not sure here." It generates everything with the same textual confidence.

When you use such a model to write notes — fine. When you use it for production code, medical decisions, legal opinions, financial transactions — 10% confident hallucinations means 10% of cases where the system is lying to you with a straight face.

The right metric for production AI sounds different:

0% confidently-wrong actions at an acceptable abstain rate.

Not "percentage of correct answers." But "percentage of wrong actions" — zero. And separately — abstain rate: how often the system honestly says "I don't know, I need data / verification / clarification." Zero wrong actions plus 30% abstain is ten times more production-ready than 90% accuracy with 10% confident hallucinations.

Notice: I didn't say "0% wrong answers." I said "0% wrong **actions." The distinction matters. An answer is words. An action is a commit, a transaction, a diagnosis, an API call, a change in production. Words can be reread and discarded. An action has already happened.

And that separation between "answer" and "action" — that's what's architecturally absent from modern AI agents.

Abstain as a first-class outcome

In Part 2 of this series I laid out seven principles of real memory, and the second was strict mode. Quick recap: before a fact lands in prompt context, it passes through a gate — source, confidence, temporal validity, no unresolved contradictions. If no fact made it through — the system returns abstain = true, with an explicit reason.

There's a detail I want to underline separately. Abstain is not an error. It's a result. Every bit as first-class as "answer" or "action." If your AI has exactly two possible outcomes — "answered" and "got it wrong" — it has no architectural place for an honest "I don't know." Which means it's going to make things up.

In a sane system, there are at least four outcomes:

answer — sufficient evidence, answer given, action executed
clarification request — partial evidence, needs user input
abstain → brain task — insufficient evidence, recorded as a backlog task with an explicit data request
escalation — there's a contradiction that requires human review

And the last three aren't fallbacks. Not "when everything went wrong." They're full, expected, designed-in paths.

When I ask braincore to find a decision about auth flow on a project we've been working on for three months — it finds it. When I ask about a project I just started, where nothing's recorded yet — it doesn't make things up. It says: "I have no evidence on this question. Created a brain task: collect decisions on auth, source — our current design doc, owner — you. Once you fill it in, ask again."

This is not a bug. It's the right behavior. Notice what happened: the system didn't block me. Didn't say "error, no data." It turned the not-knowing into a task, which now lives in its backlog and will periodically remind itself.

Self-Tasking. A brain with a backlog, not a passive search engine

The thing that scares me most about modern "AI agents" is that they're passive. They wait for a prompt. Every. Single. Time. Remember nothing between sessions. Have no internal backlog. Don't realize they have unresolved questions.

That's not an "agent." That's a function in agent costume. A function takes input, returns output. An agent has goals, state, and its own tasks between requests.

In a real cognitive runtime, there's a separate entity — brain tasks. They get spawned automatically:

truth.contradiction — a contradiction found in the knowledge graph → task to resolve
truth.staleness — a fact hasn't been confirmed in a long time → task to verify
strict.abstain — the system refused to answer → task to find evidence
selflearn.skill_scorecard — a skill started failing often → task to repair
specs.evidence_gap — a requirement without coverage proof → task to gather
tests.failing_coverage — tests aren't passing → task to fix
learning.failure_pattern — a recurring error pattern detected → task to generalize into a rule

Each task prioritizes itself by a simple formula:

priority = f(urgency, impact, confidence, risk, effort, dependency_readiness)

And at any moment, the user can ask: "show the next five tasks, why they matter, which I can safely do now, which need my input." That's not the same chat where you start with a blank slate every time. It's a working environment with its own memory of what's not done.

This is a flip in framing. Not "user shows up and asks, agent answers." But "agent runs in the background, accumulates open threads, and tells you — here's what matters now."

Show me a RAG stack that does this. Spoiler: there isn't one. Because RAG is a search engine, not an agent. And when someone says "our RAG-based AI has agency" — that's marketing fiction. Agency requires internal state, goals, a backlog, and self-assessment. RAG has none of these.

Cognitive Runtime > Model Size

The last myth to dismantle.

"When GPT-5 / Claude 5 / Gemini 3 ships — memory will solve itself." No. It won't. Ever.

Memory is not a property of the model. It's a property of the system the model runs in. The analogy:

A human has good memory not because neurons compute fast.
A human has good memory because there's a hippocampus, a neocortex, sleep-time consolidation, emotional gating through the amygdala, and an architectural separation between working / episodic / semantic / procedural memory.
It's infrastructure, not compute power.

Make the LLM ten times bigger — memory still doesn't appear. Build a runtime around the existing LLM that implements the seven principles from Part 2 plus abstain plus self-tasking — and a weak local model in that runtime starts doing things GPT-5 with RAG-memory architecturally cannot.

Not because it's smarter. But because the runtime does for it what it shouldn't have to do itself: remembers, verifies, abstains, tasks itself.

This is, by the way, the only meaningful path forward in a world where foundation models are commodity. When everyone has roughly equivalent Claude/GPT/Gemini — competitive advantage can only come from what's around the model. Domain-specific cognitive runtime. Project-specific memory. Team-specific rules.

And this bet is also about privacy. About data sovereignty. About the fact that your project's memory is your capital, and handing it to a third-party vector DB to pay monthly rent on it is a strategic mistake you'll only notice three years in, when you can't leave anymore.

That's why, incidentally, braincore is a local Go binary that works by default without OpenAI and without Anthropic. Not because I'm against them (I'm a paying customer of both). But because the architecturally correct path is a runtime where the model is a swappable component, not the center of gravity.

A checklist for anyone building AI products right now

If you've read the whole series and you're thinking "okay, agreed, what do I do Monday morning?" — here are ten items you can start moving on regardless of whether you use braincore or not.

Drop the word "memory" from your stack if what you have is RAG. Call it retrieval or search — instantly removes 80% of inflated expectations.
Introduce truth_status for every fact. Minimum: hypothesis | confirmed | deprecated. Disallow confirmed without source_ref.
Introduce valid_from / valid_until. Any fact without temporal validity is a hypothesis, not a fact.
Make abstain a first-class outcome. Not "when things go wrong" — but as one of four valid results.
Distinguish staging | working | consolidated | archived. Don't dump everything into one collection.
Negative memory. What broke — record it explicitly, with a link to the failing test or commit.
Entity disambiguation. Never auto-merge entities at low confidence. Create an ambiguity record instead.
Causal chains for decisions. Not "text" — problem → alternatives → decision → reasoning → outcome.
Local where possible. Project memory is your capital.
The metric is not "percentage of correct answers." It's 0% wrong actions at an acceptable abstain rate.

Not all at once. Pick two or three and start. In a month, you'll have an AI system you can trust more than most that exist.

Epilogue. Cognitive hygiene for the AI industry

I'm tired of the word "memory" getting slapped on every vector database with embeddings. It's a devaluation of the term — like calling a one-column text VARCHAR table a knowledge base. Technically — yes. Substantively — no.

Memory is:

structure, not a flat list
knowing the boundary, not confident bullshit
causal chains, not chunks
entity-aware, not string-aware
temporal-aware, not "created yesterday, valid forever"
self-correcting, not self-deceiving
governed, not "dump whatever, sort later"
abstain-capable, not "always answers"

If your "AI with memory" doesn't do at least half of those — your AI doesn't have memory. It has search results. These aren't the same thing.

One last thing. I'm not telling you to throw out RAG. RAG is an excellent tool for its class of tasks (find me the paragraph about X in 100 documents). I'm telling you to stop calling RAG memory and start building real cognitive runtimes — slower, more disciplined, with explicit gates and explicit abstain. It's the only path to AI systems you can trust with anything more important than rewriting a README.

If you're a startup with "our AI has long-term memory on a vector database" in your pitch deck — close that slide, redo it, and in two years you'll thank yourself.

If you're a developer fighting with an agent that forgets what you said yesterday — that's not the agent's fault. It's the fault of whoever sold you a search engine wrapped as a brain.

A good AI agent isn't the one that always answers. A good AI agent is the one that never takes a confidently wrong action. Between those two sentences lies the entire chasm separating 2024's AI tooling from AI tooling that will be trustworthy in 2027.

I've picked my side of the chasm. Building braincore — open, Apache-2.0, in the repo. If you recognize yourself in this series — we're in the same boat. If something works differently in your stack — tell me how, I genuinely want to know.

The one thing you can't do is stay silent.

TL;DR of the whole series:

Part 1: RAG = Ctrl+F with embeddings. It's search, not memory. Mem0/Letta/Zep — RAG in wrappers. 1M context is RAM, not disk.

Part 2: Real memory = seven principles in combination. Atomic units + lifecycle + truth_status + temporal + causal chains + AST identity + internal git + memory scoring + negative memory. Each exists in isolation. Combined — different product.

Part 3: The metric for production AI isn't accuracy — it's 0% confidently-wrong actions. Abstain is a first-class outcome, not an error. Cognitive runtime > model size.

If your AI "remembers" via vector_db.query(top_k=5) — it has dementia disguised as confidence. Fix the architecture, not the model.

Part 3 of 3. Series complete. If this resonated — share it. If you disagree — tell me in the comments, I love substantive arguments.

Seven principles of real memory for AI agents

Vitalii Cherepanov — Wed, 06 May 2026 07:15:40 +0000

Part 2 of 3 — "Memory for AI agents"
Architecture. Concrete. With formulas and lifecycle.

Article

In the previous post I broke the "RAG = memory" pitch into three uncomfortable problems: a chunk doesn't know it's a chunk; retrieval has no structure, only cosine; time doesn't exist as a first-class concept. In short — RAG is search wearing the marketing word "memory."

Today — what should be there instead.

A disclaimer up front. I don't claim to have invented any single item on this list. Atomic facts go back to Wittgenstein. Temporal validity is basic logic. Knowledge graphs are a whole field with textbooks. Lifecycle for data is standard in any normal information system.

I claim something different. I claim that all seven properties have to work in one system at the same time, and that any system in which only five of seven actually work continues to lie to the user with a confident face. There's only one way to see this — try assembling all seven into one codebase and watch what happens.

I tried. It worked. Called it braincore. Open source, Apache-2.0, single Go binary, MCP-stdio. I won't turn the article into a pitch — but in each section below I'll add one line about how it's done in braincore, so it's clear we're not talking theory.

Let's go.

Principle 1. Atomic Knowledge Units with lifecycle, not "chunks in Qdrant"

The pain. In RAG, any incoming text — dialogue, design doc, git commit, meeting transcript — gets sliced into chunks and shipped into the vector DB without questions. From there, no matter what happens — all chunks are equivalent, all equally "fresh," all equally "true." Six months later, one collection holds a soup of stale, current, hypothetical, and refuted facts. And every one of them has exactly one chance of making it into retrieval — by cosine.

What should be in the schema. Any incoming information does not flow into memory directly. It runs through a pipeline:

input
  → initial trust (by source: user=0.9, llm=0.3, web=0.4..0.7)
  → parse (entity / fact / relation / event / rule / hypothesis)
  → atomic knowledge units
  → validate (source / graph / dedup / contradiction / temporal / rule)
  → link (at least 1 edge into graph OR review item)
  → working memory (TTL + activation)
  → iterative verification loop
  → consolidation
  → long-term memory (only confirmed + linked)
  → edge strengthening (usage + success + co-occurrence − decay)

The core rule: nothing enters long-term memory immediately. Every atomic knowledge unit has at minimum:

truth_status: hypothesis | candidate | confirmed | contradicted | deprecated
lifecycle: staging | working | consolidated | archived
source_ref — where it came from
confidence — numerical certainty estimate
valid_from / valid_until — when it's true

Compare that to a RAG chunk that has only text and embedding. It's the difference between a junk drawer and a warehouse with inventory.

What this enables. When yesterday you said "we use Postgres" and today "we migrated to ClickHouse, Postgres is OLTP only" — the old fact automatically gets valid_until = today and superseded_by = new_fact_id. On retrieve, it either doesn't appear at all, or it comes flagged "historical, not current." Not because of a smart model. Because of the schema.

How braincore does it. The pipeline staging → working → consolidated is implemented literally — three separate SQLite tables plus an intermediate verification loop. A record reaches consolidated only if truth_status = confirmed, has at least one graph edge, no unresolved contradictions, and confidence ≥ threshold. Otherwise it stays in working with a TTL, or moves to a review queue.

Principle 2. Strict Mode and the right to abstain

This is, possibly, the most important point in the entire series. And the most absent from commercial memory frameworks.

The pain. The standard metric AI systems are measured by — "how often they give the right answer." This is a rotten metric. 95% correct answers and 5% confident hallucinations is a system you cannot trust in production. Because you don't know in advance which 5% you're in right now.

The right metric reads:

0% confidently-wrong actions at an acceptable abstain rate.

Not "always answer." But "never take a wrong action without verification." And if verification is missing — say "I don't know" and assign yourself a task to fix it.

What should be in the schema. Before a fact lands in prompt context, it passes through a gate:

is there a source_ref?
confidence ≥ threshold?
trust_score ≥ threshold (for the source)?
temporal_valid == true (valid at query time)?
no unresolved contradiction in the graph?
no unresolved ambiguity?

If even one requirement fails — the fact does not reach context. If no fact made it through for a query — the system returns abstain = true with reason = no_accepted_facts (or contradiction_unresolved, or temporal_invalid — always explicit).

And — pay attention, here's where the magic happens — abstain is not delivered to the user as a dead end. It becomes a brain task in the backlog: "I need evidence for X to answer with confidence. The source is here, the specific conflict is here." The system knows what it doesn't know, and assigns itself the task to fix it.

What this enables. An AI agent you can trust. Not because it's always right — but because when it's not sure, it stays silent or asks for clarification. And when it does take action — the action is grounded in facts that passed the gate, not "well, ChatGPT thought this was better."

Show me one RAG stack that does this. I'll wait.

How braincore does it. The internal/strictmode package is a separate module with explicit gate rules. By default, every query passes through strict mode; for UX scenarios where abstain is unacceptable (brainstorming, for example), you can drop it via an explicit --allow-uncertainty flag. All abstain events are logged as brain tasks with their source and reason.

Principle 3. Causal Decision Chains, not flat facts

The pain. In RAG, any decision is stored as "text about a decision." On retrieve, you get a chunk of text that describes the decision — but doesn't answer "why?", "what alternatives did we consider?", "what came of it?"

Six months later, you ask "why did we pick JWT over sessions?" — RAG returns three fragments of the declaration, and the model fills in the reasoning itself. Sometimes correctly. Sometimes inventing it from popular patterns in its training data. You don't know which one this time.

What should be in the schema. The entity is not a "document" and not a "memory entry." The entity is called decision and has a schema:

problem      → what we were solving
alternatives → what we considered and rejected (with reasons)
decision     → what we chose
reasoning    → why this specifically
outcome      → what came of it (filled in later, post-hoc)
superseded_by → link to a new decision if this one was revised

This isn't "let's stuff text into an embedding." This is a causal chain that answers WHY, not just WHAT.

What this enables. Six months later, you ask "why JWT?" — the system returns a structured answer:

Problem: session scaling + audit requirements.
Alternatives (rejected): stateful sessions with Redis (violates audit), opaque tokens with centralized lookup (latency).
Decision: JWT with short TTL.
Reasoning: stateless, audit-neutral, latency acceptable.
Outcome (recorded 4 months later): invalidation complexity higher than expected; added refresh tokens.
Superseded by: none.

RAG returns three fragments. A decision chain returns reasoning. These are different products.

How braincore does it. Decisions are a separate entity type in the graph with required problem, alternatives[], decision, reasoning fields, and optional outcome/superseded_by. They're stored not as chunks but as structured records with explicit edges into the code graph and into other decisions.

Principle 4. Stable code identity through AST, not strings

The pain. This one is specific to AI agents working with code — but it hits all of them. You renamed GetUser → FetchUser, moved it from pkg/auth to pkg/user, changed the signature from a pointer receiver to a value receiver. All the references in RAG memory pointing to "GetUser in pkg/auth" are now dead. Because RAG is bound to strings.

And nobody tells you. The chunk keeps living in Qdrant, its cosine to auth-related queries stays high. The agent pulls dead information and works against it. Congratulations, you have memory rot disguised as memory.

What should be in the schema. Code parsing through go/ast (for Go) and tree-sitter (for PHP, JS, TS, Python, Rust, Java, and beyond). Node identity is built not from a string and not from a file path, but from a structural hash:

node_id = sha256(qualified_name + kind + signature_hash)

Which means:

Renaming a function does not break references to it (qualified_name changed, but the link is updated automatically on next parse, with a back-reference to the old node_id as renamed_from).
Moving between packages — same thing.
Changing the signature (pointer → value receiver) — signature_hash changes, and old references automatically get marked stale — the brain knows they now require review.

What this enables. When the AI agent is about to edit FetchUser, the system pulls three past decisions about that function, two regressions in this module, and active project rules — before the agent starts writing code. Not because cosine happened to align. Because it's a code graph, and FetchUser has edges to decisions, regressions, and rules by identity, not by text similarity.

I call this pre-edit warning. And it's a qualitatively different kind of error prevention than "let's run a linter after generation."

How braincore does it. The code graph is a separate layer over AST/tree-sitter, with background reindex on filesystem watch events. Identity hashes live in SQLite, edges live there too. On a pre-edit hook, the agent gets the context of related decisions/rules/regressions automatically.

Principle 5. Internal Git as memory versioning

The pain. RAG has no concept of time beyond created_at. That's metadata about a record, not about a state of knowledge. You can't ask "show me what I knew about this code a month ago." You can't roll back the state of memory to before the agent dragged in garbage. You can't switch to a feature branch and have a parallel mental state for it.

What should be in the schema. Every change in memory is a commit. Not metaphorically. Literally, through go-git, into a hidden .internal-git/ repository that lives parallel to the project's main repo.

This gives you:

git log over the project's memory — what was added, what changed, when.
git checkout to roll back the brain state by N days — for audit, for regression investigation, for tests.
When you switch to a feature branch in the main repo, the brain mirrors that, and each branch has its own mental state. An experiment in a feature branch doesn't pollute master's memory.

What this enables. Time-travel queries: "which decision did I consider current 30 days ago?" Audit: "when exactly did the agent start believing we use ClickHouse?" Branch isolation: "in feature/oauth we have a different approach to auth, but that knowledge shouldn't leak into main."

RAG can't do this. RAG has no concept of "state of knowledge" — only a set of vectors that grows.

How braincore does it. The .internal-git/ is created on braincore init. Commits are made automatically on every change to knowledge units and graph edges. Branch tracking is synchronized with the main git through a post-checkout hook.

Principle 6. Memory Scoring — because not all knowledge is equal

The pain. In RAG, all chunks are equal. Top-k by cosine doesn't distinguish "this is confirmed by ten past uses" from "this was written yesterday and never used again." It doesn't distinguish "this is critical for the architecture" from "this is a random note in a corner." It doesn't distinguish "this is in active use" from "this has been gathering dust since last year."

What should be in the schema. Every knowledge unit has a composite MemoryScore, computed as a weighted sum:

MemoryScore =
  + 0.22 * ImportanceScore     (explicit importance, or derived from connectivity)
  + 0.22 * TrustScore          (source reliability + history of confirmations)
  + 0.20 * TaskRelevanceScore  (relevance to current work context)
  + 0.12 * UsageScore          (how often it's used)
  + 0.10 * RecencyScore        (freshness)
  + 0.10 * StabilityScore      (how often it changes — stable is more reliable)
  + 0.08 * NoveltyScore        (novelty as a soft boost)
  − 0.18 * RiskScore           (potential harm from use)
  − 0.18 * NoiseScore          (noise, duplicates, low coherence)

And on retrieve, what runs is no longer cosine similarity, but:

RetrievalScore =
  + 0.35 * semantic_similarity
  + 0.20 * memory_score
  + 0.15 * graph_relevance
  + 0.15 * temporal_validity
  + 0.10 * trust_score
  − 0.15 * ambiguity_penalty

These weights aren't ultimate truth — they're empirically tuned and shift with usage profile. The point isn't the numbers, it's the architectural shift: retrieval stops being "text similarity" and becomes "similarity × importance × trust × freshness."

Lifecycle transitions automatically:

memory_score ≥ 0.80 and trust ≥ 0.75 → consolidated (knowledge becomes "firmware")
memory_score ≥ 0.55 → stays in working
memory_score ≥ 0.30 → staging
memory_score < 0.30 → archive candidate

What this enables. Active memory. Not storage. An active environment in which what's important strengthens through use, and noise decays on its own — like in a biological brain, where rarely-used synapses weaken and frequently-used ones strengthen.

RAG = a hard drive that never gets defragmented.
Brain = a brain in which junk settles on its own and gets archived automatically.

How braincore does it. Scoring is recomputed by a background job every N hours. Lifecycle transitions are atomic and logged (see Principle 5). All weights are exposed in config — tune them per project.

Principle 7. Negative Memory and Rule Engine

The pain. Here's what every LLM agent does today: repeats mistakes. Yesterday it broke a migration — today it'll break a similar one. RAG won't help, because the broken migration doesn't go into RAG. What goes into RAG is "how to write migrations" from the official docs. The fact that you personally already stepped on this rake — recorded nowhere.

What should be in the schema. A separate class — negative memory: what broke, why it broke, how it was fixed, which commit/test confirms it. First-class entity, not a marginal field.

And during planning, every patch passes through a Rule Engine before code is generated:

patch
  → architectural rules
  → code rules
  → security rules
  → performance rules
  → anti-patterns (including "this exact one I broke before")
  → repair plan OR abstain

If a rule with severity critical or high is violated — the code does not get written. A repair plan is created. If repair is impossible — abstain (see Principle 2). No "let's hope this passes" generation.

And, critically, the safe execution pipeline closes the loop:

checkpoint
  → apply patch
  → rules validate
  → build
  → tests
  → success → commit
  → fail → rollback → record into negative memory

Every executed action is either confirmed by tests, rolled back, or recorded as negative evidence for future decisions.

What this enables. An agent that cannot repeat your last year's mistake. Not because it has a great model — but because the rule engine physically refuses to let through any patch that violates a rule derived from that mistake.

RAG helps the agent find something. Good memory prevents the agent from breaking something.

These are different products. And I feel sorry for those who keep mixing them up.

How braincore does it. Negative memory is a separate entity type with a required link to a failing test or git commit. The rule engine is a pre-execution gate, severity-aware, with override possible only via explicit user confirmation.

Bonus principle. Entity Disambiguation

Formally a special case of Principle 1 (atomic units), but it breaks separately often enough to deserve its own callout.

In RAG, there's no concept of an entity. There's only text. If your project has two User classes — one in pkg/auth, one in pkg/billing — for RAG these are two pieces of text with similar embeddings. On retrieve, they mix together, and the model confidently explains auth logic in the context of billing.

This isn't theory. This is happening right now in every code RAG agent.

The fix — EntityFingerprint:

fingerprint(symbol) = hash(
  project_id +
  file_path +
  symbol_name +
  symbol_type +
  signature +
  language
)

Two User entities in different files = two fingerprints = two distinct entities that never auto-merge. When a new candidate arrives, a SameEntityScore is computed:

SameEntityScore =
  + 0.30 * name_similarity
  + 0.20 * alias_match
  + 0.20 * context_similarity
  + 0.15 * graph_neighborhood_similarity
  + 0.10 * temporal_consistency
  + 0.05 * source_consistency

And:

≥ 0.92 → auto_merge
≥ 0.82 → same_as link (soft link, not merge)
≥ 0.65 → ambiguous — an ambiguity record is created, requiring human review
otherwise — new entity

The core rule: never merge entities at low confidence. Better to create an ambiguity record and ask a human than to silently glue them together and lie forever after.

Why all of this together

I'm deliberately not framing this as "this is nowhere done, I'm first." Each of the seven principles already exists. Atomic facts with lifecycle — in knowledge management systems. Strict mode + abstain — in last century's expert systems. Causal chains — in decision support tools. AST identity — in IDEs. Internal git — in tools like Pijul and in Datalog database experiments. Memory scoring — in research papers on episodic memory. Negative memory — in RL and reliability engineering.

Uniqueness isn't in the ideas. It's in the assembly.

If you have atomic units but no strict mode — you have a structured database of hallucinations. If you have strict mode but no causal chains — you abstain without understanding why. If you have causal chains but no AST identity — your decisions point into the void after two refactorings. If you have all of the above but no memory scoring — you have a perfectly structured dump in which the important drowns in noise.

Each property in isolation is an improvement. All seven together is a different category of product.

This, by the way, is the answer to the question I get most often: "why write something new if I already have Mem0/Letta/Zep?" The answer — look at their schemas and check how many of the seven principles are implemented not as a marketing claim, but as an enforced gate in code. For most, the honest count is two or three. For some — four. They aren't bad products. They're partial solutions, more honestly called "structured retrieval" than "memory."

In Part 3

Seven principles is engineering. What should be in the architecture. But behind engineering sits a deeper question: why should an AI agent know what it doesn't know? Why abstain at all, if it can just answer?

Part 3 is about the right of an AI agent to stay silent. About self-tasking. About why cognitive runtime matters more than model size. And about why the right metric for production AI isn't accuracy, but zero confidently-wrong actions at an acceptable abstain rate.

It's the shortest and most philosophical piece in the series. Drops next week.

Part 2 of 3. If you missed Part 1 — here (on why RAG is search and not memory). If this resonated — a repost would help.

RAG isn't memory. It's Ctrl+F with embeddings.

Vitalii Cherepanov — Fri, 01 May 2026 12:12:41 +0000

Part 1 of 3 — "Memory for AI agents"
Deconstructing the long-term memory myth in LLM systems

Article

It's 3 AM. I'm on my third night debugging an AI agent. I'm standing in the kitchen with a mug of tea, staring at a diff, swearing quietly. The agent has confidently rewritten the auth function — based on a chunk that belongs to a branch that was deleted from the repo two months ago.

The chunk lives in Qdrant. Its cosine similarity to my query is high. Top-1 in the retrieval. The agent honestly grabbed it, honestly stitched it into the prompt, honestly generated the "correct" patch. Against code from a different reality.

I close the laptop and think: okay, I have RAG. I have vectors. I have long-term memory. I have everything every AI conference deck has been promising for the last two years. Why did my agent just propose a fix based on code that doesn't exist anymore?

Because my agent doesn't have memory. My agent has search results with cosine instead of BM25. And between those two sentences lies the entire difference between "AI you can trust in production" and "AI you have to babysit on every line."

This piece is about that difference. And about why we, as engineers, are the ones to blame for not seeing it anymore.

The devaluation of the word "memory"

Let's be honest. What is the typical "memory" of an AI agent in 2026?

text → split into 512-1024 token chunks
     → embedding (bge / text-embedding-3 / openai)
     → vector DB (Qdrant / pgvector / Chroma / Pinecone)
     → cosine similarity top-k
     → concatenate into prompt

This is not memory. This is search. It's old-school Lucene from 2003, repainted in neural colors. Cosine instead of TF-IDF. Embeddings instead of an inverted index. Same thing.

If we just called it that — "vector search," "semantic retrieval" — I'd have no complaints. Call Lucene Lucene, no problem. But when it's sold under the banner "my AI has long-term memory" — sorry. My AI has déjà vu and amnesia at the same time.

This isn't a terminology gripe. It's a question of expectations. When an engineer hears "memory," they imagine a system that remembers: who said what, when, in what context, what was true then versus what's true now. When an engineer gets RAG, they get Ctrl+F. And instead of building honest architecture around that Ctrl+F — with honest constraints — they build a sandcastle and wonder why the agent confuses past with present.

Three holes you can drive a truck through

Three concrete failures. Each one I caught in production. Not theory.

Hole #1: A chunk doesn't know it's a chunk.

Take a perfectly normal declaration from a design doc:

"We moved to JWT because opaque sessions didn't scale to our traffic profile. The alternative was stateful sessions with a Redis cluster, but we ruled it out because of audit requirements from a customer — they don't allow session state outside their perimeter. JWT solves both, but adds invalidation complexity, which we mitigate with short TTLs and refresh tokens."

The chunker splits this into four 512-token pieces. On retrieval, a query comes in: "why did we pick JWT?" Top-3 returns three fragments of the same decision. With no causality. Without the alternative we ruled out. Without the trade-off we accepted.

A decision that was whole turns into three parallel "factoids." The model honestly stitches them into plausible text — and invents the missing connections. Because its job is to generate plausible text. And it will, without blinking.

This isn't a bug in the chunker. This is an architectural property of the entire approach. Any decision declaration you have gets ground into powder and reassembled with structural loss. Every single time.

Hole #2: There's no structure in memory. Only cosine.

When a human explains a project to you, they say:

here's the goal
here are the options we considered
here's what we picked and why
here's what broke two months later
here's what we changed, and that decision now supersedes the old one

In RAG, none of this exists. Zero. RAG doesn't distinguish "hypothesis," "confirmed fact," "rejected alternative," "deprecated decision moved to archive." For RAG, all of these are equivalent points in a 384-dimensional space.

Imagine you're trying to record thirty years of life into a single flat table entries(text, vector) and then search it by cosine. Surprised your memories blur together? That's not your memory failing. That's the structure you crammed it into — a structure that doesn't allow distinctions between "I thought about it" and "I did it," between "I tried it and it worked" and "I tried it and it hurt."

In RAG, there are no fields for these distinctions. Not because the developers didn't think of it. Because the vector-plus-distance paradigm itself doesn't accommodate causality and time. It's a mathematical limitation. You don't fix it with product features.

Hole #3: Time doesn't exist as a first-class concept.

Three weeks ago I wrote into the agent's memory: "we use Postgres." Today I wrote: "we migrated to ClickHouse for analytics, Postgres is OLTP only now." In RAG, both facts sit there. Both have high cosine to a database query. Top-k returns both. The model picks the one that "sounds" better in its pretraining — usually Postgres, because it appears more often in the training data.

This is not memory. This is a roulette wheel disguised as confidence.

When was the last time you saw valid_from, valid_until, deprecated_by, replaced_by, superseded_by fields in a production RAG system? I never have. Because in standard RAG, they're not in the schema. And again — not because devs are lazy. Because the schema "text plus embedding" has no place for the lifecycle of knowledge. No notion of "this is true now" versus "this was true then." Everything collapses into a single time slice — a present that somehow contains yesterday, last year, and deprecated-three-quarters-ago all at once.

Ctrl+F with embeddings doesn't remember. It finds. Different verbs.

"But memory frameworks fix this, right?"

Okay, the believer says. There's mem0, Letta, Zep, Cognee, MemGPT, the whole long-term memory zoo. They added a meaning layer on top of RAG. They're memory-aware.

Let's be honest. I've used them. One after another. For a long time. Looked under the hood, not just at the landing pages.

Each of them takes one piece of real memory — for some it's LLM-extraction before write, for some it's a buffer hierarchy like an OS, for some it's post-hoc graph extraction from dialogues, for some it's per-fact temporal validity — and implements that one piece, without weaving it into the rest.

This is warmer than vanilla Qdrant. It's not a solution.

Because real memory requires seven properties working together. Each of them, in isolation, already exists in the literature or in open source. As far as I can tell, no one has assembled all seven into a single system. Which seven, exactly — that's part 2 of this series. Here, only the limitation that unites all flat-fact solutions, however they wrap themselves:

None of them have the right to say "I don't know."

Show me any one of these systems with a formal abstain mechanism: a gate through which a fact will not pass into prompt context if it has no source, no confidence, no temporal validity, or an unresolved contradiction. I'll wait.

In the standard flow of all these frameworks, the system's response to "there's a contradiction in memory or not enough data" is "well, the model will figure it out." Which translates from marketing to engineering as "the model will hallucinate, and that becomes your problem in production."

Good memory isn't "remembering a lot." It's knowing the boundary of what you don't remember. Part 2 of this series is built around that thesis.

"Why not just push context to 1M tokens?"

This is the second fashion of the last two years, and it deserves its own breakdown, because it leads the industry into the same dead end under a different banner. "Why do we need memory if Gemini has 2M context, Claude has 1M?"

Four problems, no preamble.

One — economics. A single project conversation at 800K tokens with prompt caching off costs tens of dollars per request. Without aggressive caching, you're broke in a week. With aggressive caching, you're building exactly the same hierarchy as Letta — just more expensive and locked to one vendor.

Two — recall. Every long-context benchmark (NIH, Ruler, LongMemEval) shows the same thing: models drown in their own context past 200-300K tokens. Attention is unevenly distributed. This is lost-in-the-middle, and it doesn't get fixed by window size — it gets partially mitigated by architectural tricks inside the model, but it doesn't go away. The more you stuff in, the less of it actually gets considered.

Three — persistence. Context isn't saved. Close the session, gone. Tomorrow the same agent shows up with a clean context. So you have to feed it 800K tokens of "history" again. The problem isn't solved — it's hidden inside your wallet and your latency.

Four — learning. If the agent made a mistake yesterday and you corrected it, that experience isn't structured for the future. Tomorrow it'll repeat the mistake. Context is RAM, not disk. And when someone says "just increase context instead of building memory" — that's the same as saying "why do I need a database, I have a terabyte of RAM." Technically the words rhyme. In practice they're incomparable concepts.

Big context doesn't replace memory. It lets you stuff more into one session — and that's it.

What to do about it tomorrow morning

If you've read this far and you're thinking "okay, agreed, RAG is search, not memory. Now what?" — I have two pieces of news.

The bad: a systemically correct solution requires rewriting the memory layer from schema up through lifecycle, and that's months of work. Not a weekend.

The good: there are several things you can do tomorrow morning that already remove half the pain. Not magic — just engineering hygiene.

Drop the word "memory" from your stack if what you have is RAG. Call it retrieval or search — instantly more honest. That alone removes 80% of inflated expectations from users and the team.
Introduce valid_from and valid_until for every fact. Any fact without temporal validity is a hypothesis, not a fact. Old facts should drop out of retrieval automatically, not compete with new ones on cosine.
Distinguish staging, working, consolidated, archived. Don't dump everything into one collection. A fact that just arrived and a piece of knowledge confirmed by tests are different entities with different weight in retrieval.
Make abstain a first-class outcome. If no fact passed the confidence threshold during retrieve, the system must have the right to say "I don't know, I need data." And that "I don't know" should become a task in the backlog, not a dead end for the user.

This isn't a complete list — it's the minimum to start the transition from "I have RAG, I call it memory" to "I have memory, and it knows its boundaries." The full list of seven principles is in part 2.

Where this comes from

I sit deep in this kitchen — Claude Code, Cursor, Codex, Windsurf, MCP servers, mem0, Zep, local RAG stacks on Postgres + pgvector, Qdrant, Chroma. Over the last few months I've tried, I think, everything on the market. I have my own MCP memory server with about fifteen hundred entries, which I rewrote from scratch three times because each time I hit one of the three holes above.

At some point, I got tired. Not of AI — of what we call memory at AI. Sat down and started writing my own cognitive runtime that doesn't pretend to know, that knows what it doesn't know, and that sets its own tasks to close the gaps. Called it braincore. One Go binary, local, MCP-stdio, Apache-2.0. Not a pitch, because it's open source — just an example that I say "this can be done" not theoretically.

Seven architectural principles it's built on — that's part 2 of this series. Drops in a week. I'll cover atomic knowledge units, lifecycle, strict mode, causal decision chains, AST-based identity for code, internal git as memory versioning, memory scoring, and negative memory.

And why all of that combined produces a qualitatively different result than any of those pieces in isolation.

Part 3 is philosophical — about the right of an AI agent to stay silent, and why the right metric for production AI isn't accuracy but zero confidently-wrong actions at an acceptable abstain rate. About self-tasking. About why cognitive runtime matters more than model size.

If you read this far and recognized yourself in the opening paragraph — we're in the same boat. If you have RAG that you call memory and it works — tell me how, seriously, I want to know, I might be wrong.

The one thing you can't do is stay silent.

Part 1 of 3. Next — "Seven principles of real memory for AI agents" — drops next Tuesday.

I Studied the etcd Codebase — and It Changed How I Write PHP

Vitalii Cherepanov — Tue, 21 Apr 2026 10:41:13 +0000

There's a common piece of advice: "Want to write better code? Read good code." Sounds obvious. Rarely practiced.

The problem is that most open-source projects are mazes. You open a repo, see 200 directories, and close the tab. Kubernetes is two million lines. The Linux kernel — don't even think about it. Where do you start?

My answer: etcd.

For those unfamiliar: etcd is a distributed key-value store written in Go. It's the backbone of Kubernetes — every piece of cluster state lives there. But I'm not interested in etcd as a product. I'm interested in it as an example of architecture you can actually read from start to finish.

Here's what surprised me: the principles baked into etcd aren't about Go. They're about software design in general. I work with PHP and Symfony daily, and almost everything I found in etcd translated directly into my projects.

Seven principles, concrete examples, no fluff.

1. One Source of Truth for Your API

In etcd, every API is defined in .proto files. Open rpc.proto and you see all operations: Range, Put, DeleteRange, Txn. Every field is typed. There's no room for "wait, do we accept a string or an integer here?"

In PHP, instead of protobuf, we have strictly typed DTOs:

final readonly class CreateOrderRequest
{
    public function __construct(
        public string $customerId,
        /** @var OrderItemDto[] */
        public array $items,
        public ?string $promoCode = null,
    ) {}
}

One class — and everyone knows what the endpoint accepts. The frontend dev looks at the DTO, the backend dev writes logic against it, the OpenAPI schema generates automatically via NelmioApiDocBundle.

Compare this with what I've seen (and written) on real projects:

$data = json_decode($request->getContent(), true);
$customerId = $data['customer_id'] ?? null;
$items = $data['items'] ?? [];
// What's the format of items? Is promoCode a thing? Who knows.

When your contract is "well, some array comes in," any change breaks something unexpected. When your contract is a DTO with types, PHPStan catches the problem before production does.

2. Each Service Does One Thing

etcd has clearly separated gRPC services: KV (read-write), Watch (subscribe to changes), Lease (key TTLs), Auth (authorization). Each one is a separate interface. Watch doesn't touch writes. KV doesn't check tokens.

In Symfony — same idea, different tools:

class OrderController
{
    #[Route('/orders', methods: ['POST'])]
    public function create(
        CreateOrderRequest $request,
        OrderService $orderService,
    ): JsonResponse {
        return new JsonResponse(
            $orderService->create($request)
        );
    }
}

OrderService creates orders. It doesn't send emails — that's NotificationService listening to an OrderCreatedEvent. It doesn't process payments — that's PaymentService.

And then there's the alternative I see regularly:

class OrderController
{
    public function create(Request $request)
    {
        // 40 lines of validation
        // 20 lines of authorization
        // 60 lines of business logic
        // 15 lines sending email
        // 10 lines of logging
        // Total: 150 lines, untestable
    }
}

The 500-line god controller. We've all been there. etcd helped me finally articulate why it's bad: not because "the pattern is wrong," but because you can't trace what the system is doing.

3. Middleware Composes Like Lego

Every gRPC request in etcd passes through a chain of interceptors: logging → auth → metrics → handler → metrics → response. Each interceptor is small, single-purpose. The power comes from composition.

In Symfony, this maps to Event Listeners and Messenger Middleware:

class MetricsMiddleware implements MiddlewareInterface
{
    public function __construct(
        private PrometheusCollector $metrics,
    ) {}

    public function handle(Envelope $envelope, StackInterface $stack): Envelope
    {
        $start = microtime(true);

        try {
            $result = $stack->next()->handle($envelope, $stack);
            $this->metrics->increment('messages_processed_total', [
                'type' => $envelope->getMessage()::class,
                'status' => 'success',
            ]);
            return $result;
        } catch (\Throwable $e) {
            $this->metrics->increment('messages_processed_total', [
                'type' => $envelope->getMessage()::class,
                'status' => 'error',
            ]);
            throw $e;
        } finally {
            $this->metrics->histogram(
                'message_duration_seconds',
                microtime(true) - $start,
                [$envelope->getMessage()::class]
            );
        }
    }
}

One middleware, one job. Metrics here, logging there, retry somewhere else. Assemble the chain in messenger.yaml.

The antipattern — when every handler has this manually:

public function handle(CreateOrderCommand $command): void
{
    $this->logger->info('Starting order creation...');
    $start = microtime(true);

    // ... actual logic ...

    $this->metrics->record(microtime(true) - $start);
    $this->logger->info('Order created');
}

50 handlers, 50 copies of the same boilerplate. Forget one — no metrics. Change the log format — change it in 50 places.

4. Observability Is Architecture, Not an Afterthought

In etcd, Prometheus is wired into the gRPC layer from day one. Not "added six months after launch." The code isn't considered done without metrics.

In PHP:

class PaymentService
{
    public function charge(Order $order): PaymentResult
    {
        $timer = $this->metrics->startTimer('payment_charge_duration');

        try {
            $result = $this->gateway->process($order);

            $this->metrics->increment('payments_total', [
                'provider' => $result->provider,
                'status' => $result->isSuccess() ? 'success' : 'declined',
            ]);

            return $result;
        } catch (GatewayTimeoutException $e) {
            $this->metrics->increment('payments_total', [
                'provider' => $order->paymentMethod,
                'status' => 'timeout',
            ]);
            throw $e;
        } finally {
            $timer->observe();
        }
    }
}

Every payment — in metrics. How many succeeded, how many timed out, which provider is slow. Not because someone asked for it, but because without it you're flying blind.

I remember a project where production was down for 40 minutes and the only way to understand what was happening was tail -f /var/log/symfony.log | grep ERROR. Never again.

Package: promphp/prometheus_client_php. Five minutes to install, fifteen to wire up Grafana.

5. Simple Outside, Rocket Science Inside

clientv3 in etcd is a masterclass in the facade pattern:

client.Put(ctx, "name", "value")

One line. Under the hood: node selection, reconnection on failure, retry with exponential backoff, protobuf serialization, Raft consensus, disk write, quorum confirmation.

Same principle in PHP:

// Calling code. Simple and clear.
$paymentService->charge($order);

Inside charge():

public function charge(Order $order): PaymentResult
{
    if ($existing = $this->findExistingPayment($order)) {
        return $existing; // idempotency
    }

    $provider = $this->providerResolver->resolve($order);

    $result = $this->withRetry(
        fn () => $provider->process($order),
        maxAttempts: 3,
        backoff: 'exponential',
    );

    if ($result->isSuccess()) {
        $this->fiscalService->createReceipt($order, $result);
    }

    $this->events->dispatch(new PaymentProcessed($order, $result));

    return $result;
}

The controller calling charge() knows nothing about fiscal receipts, retries, or provider selection. And it shouldn't.

A sign of a good service: you can explain what it does in one sentence — "charges the customer for an order" — while the implementation is 200 lines of careful logic.

6. You Can Trace a Request With Your Finger

In etcd, the request path reads linearly:

gRPC handler → EtcdServer.Put() → Raft → apply → bbolt (disk)

No magic. No hidden calls. No "where does this even get triggered?"

In Symfony — same thing, if you don't abuse the event system:

Request
  → Controller (unwrap DTO)
    → Service (business logic)
      → Repository (database)
      → EventDispatcher (side effects)
  → Response

Open the controller — see which service is called. Open the service — see what it does. Open the repository — see the query.

What kills traceability:

@PostPersist on an entity that silently sends SMS
prePersist listeners modifying data before writes — and you spend 30 minutes figuring out who's touching the updatedAt field
Ten EventSubscribers on the same event with unclear execution order Event-driven is great. But if a new developer can't explain "request comes in here, response goes out there" within 2 minutes — you have a problem.

7. No Hidden Dependencies

In etcd, all dependencies are passed explicitly:

func NewKVServer(s *EtcdServer) KVServer { ... }

See the constructor — see everything the class needs.

In Symfony — constructor injection, same thing:

class OrderService
{
    public function __construct(
        private OrderRepository $orders,
        private PaymentGateway $payment,
        private EventDispatcherInterface $events,
        private LoggerInterface $logger,
    ) {}
}

Four dependencies. All visible. Want to test? Swap in mocks. Want to understand the class? Look at the constructor.

Antipatterns that still survive in the wild:

// Service locator: where did this come from?
$payment = $this->container->get('payment.gateway');

// Static calls: untestable
Cache::put('key', $value);

// new SomeService() inside another service: invisible coupling
$validator = new OrderValidator();

Symfony's autowiring isn't magic in the bad sense. The container wires dependencies by type, but you still see them in the constructor. It's convenience, not hidden behavior.

My Checklist

After studying etcd, I distilled a checklist I now apply to every new service:

Contract defined? DTOs exist, types are set, OpenAPI generates from them
Controller thin? 10 lines max, all logic in the service layer
Cross-cutting concerns extracted? Logging, metrics, retry — through middleware, not copy-paste
Metrics present? If not, the service isn't production-ready
Simple API externally? Calling code doesn't know about internal complexity
Request path traceable? A new developer finds the handler in 2 minutes
Dependencies explicit? Everything in the constructor, nothing from thin air None of this is revolutionary. It's basic hygiene that's easy to forget under deadline pressure.

etcd just reminded me what a codebase looks like when that hygiene wasn't skipped. And that it's possible even in a large production system.

What open-source codebase changed how you write code? I'd love to build a reading list — drop yours in the comments.