Zaid Ali Syed

Posted on Jul 5 • Originally published at github.com

I Designed a RAG Variant for Multi-Agent Simulations. Here's the Design and the Honest Tradeoffs.

#rag #ai #python #machinelearning

Standard RAG is great for static knowledge bases. Embed documents, embed a query, return top-k by cosine similarity. That works.

But put RAG inside a running civilization where 40 citizens have memories, councils deliberate on crises, and past decisions ripple into future ones, and similarity alone breaks down fast.

The problem is simple: cosine similarity doesn't know that last month's drought caused today's food riot. It doesn't know that the council that voted against emergency grain reserves three weeks ago is directly responsible for the current famine. It retrieves memories that sound like the crisis, not memories that led to it.

That gap is what I wanted to close. This post explains the retrieval design I built for CivilizationOS, a multi-agent sim I've been working on, and the honest tradeoffs that come with it.

The context: what CivilizationOS is

CivilizationOS is a multi-agent simulation where:

40+ citizens live, work, and accumulate episodic memories over simulation ticks (the AGORA layer)
Specialist councils (Military, Health, Treasury, Senate) deliberate on injected crises (the PANTHEON layer)
A 3-tier LLM router handles different reasoning loads: Ollama locally for lightweight calls, Gemini Flash for mid-tier, Claude Sonnet for complex deliberation

When a crisis hits, say a plague outbreak at tick 80, the Health Council needs to deliberate. It needs context. The question is: what context, and ranked how?

The naive answer is "embed the crisis question, retrieve top-k similar memories." The better answer is what TCMF computes.

Two streams, one score

TCMF fuses two independent information streams.

Stream 1: AGORA (episodic, per-citizen)

This is the generative-agents formula from the Stanford paper (Park et al., 2023). Each citizen has a MemoryStream of timestamped observations. When a query arrives, every memory gets scored:

score = w_rel * relevance + w_rec * recency + w_imp * importance

relevance: cosine similarity of the memory embedding to the query embedding, clamped to [0, 1]
recency: exponential decay over ticks since the memory was last accessed, computed as exp(-decay * age)
importance: a 1-10 poignancy score (assigned by the LLM or rules), normalized to [0, 1]

Retrieving a memory updates its last_access_tick. This means memories that keep getting surfaced stay fresh in the retrieval pool, which is a nice property: salient memories about an ongoing crisis persist.

def _recency(self, mem: Memory, now: int) -> float:
    age = max(0, now - mem.last_access_tick)
    return math.exp(-self.weights.decay * age)

def retrieve(self, now, *, query_embedding=None, k=5, refresh=True):
    w = self.weights
    # Compare against None: truth-testing a NumPy array raises ValueError
    sims = self._vectors.similarities(query_embedding) if query_embedding is not None else {}
    scored = []
    for mem in self.memories.values():
        relevance = max(0.0, sims.get(mem.id, 0.0))  # clamp negatives to 0
        recency = self._recency(mem, now)
        importance = mem.importance / 10.0  # normalize the 1-10 score
        score = w.relevance * relevance + w.recency * recency + w.importance * importance
        scored.append(ScoredMemory(mem, score, relevance, recency, importance))
    scored.sort(key=lambda s: s.score, reverse=True)
    if refresh:
        # Surfacing a memory resets its recency clock
        for s in scored[:k]:
            s.memory.last_access_tick = now
    return scored[:k]
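
To make the arithmetic concrete, here is a minimal standalone sketch of the same three-term score. The weights (all 1.0) and the decay rate (0.01) are assumptions chosen for illustration; the post does not state the values CivilizationOS uses.

```python
import math

def episodic_score(relevance, age_ticks, importance_1_to_10,
                   w_rel=1.0, w_rec=1.0, w_imp=1.0, decay=0.01):
    """score = w_rel*relevance + w_rec*recency + w_imp*importance.

    Weights and decay are illustrative assumptions, not the project's defaults.
    """
    relevance = max(0.0, min(1.0, relevance))        # clamp to [0, 1]
    recency = math.exp(-decay * max(0, age_ticks))   # exponential decay over ticks
    importance = importance_1_to_10 / 10.0           # normalize the 1-10 score
    return w_rel * relevance + w_rec * recency + w_imp * importance

# A relevant, important memory last touched 60 ticks ago
old_vivid = episodic_score(relevance=0.8, age_ticks=60, importance_1_to_10=9)
# A loosely related, low-importance memory touched this tick
fresh_bland = episodic_score(relevance=0.3, age_ticks=0, importance_1_to_10=3)

print(round(old_vivid, 3), round(fresh_bland, 3))
```

With these assumed weights the older memory still wins, because relevance and importance outweigh the recency it has lost. A larger decay or a heavier recency weight would flip that.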

This is solid on its own. But it still ranks purely by how similar, recent, or important a memory is. It doesn't know anything about causality.

Stream 2: PANTHEON (causal, society-wide)

A NetworkX directed graph tracks what led to what at civilizational scale:

drought (tick 20) -> emergency rationing (tick 25) -> black-market spike (tick 30) -> civil unrest (tick 45) -> riots (tick 60)

Nodes are events: crises, decisions, policy outcomes. Directed edges encode causal precedence. Edge weights represent causal strength (0 to 1).

When a new crisis fires, TCMF does a bounded BFS backward from the crisis node to find its causal ancestors:

from collections import deque

def predecessors(self, event_id: str, max_depth: int = 4) -> dict[str, int]:
    visited: dict[str, int] = {}
    queue: deque[tuple[str, int]] = deque([(event_id, 0)])
    while queue:
        node, depth = queue.popleft()  # O(1), unlike list.pop(0)
        for pred in self._g.predecessors(node):
            new_depth = depth + 1
            if pred not in visited and new_depth <= max_depth:
                visited[pred] = new_depth
                queue.append((pred, new_depth))
    return visited  # {ancestor_id: depth_from_crisis}

Depth 1 is a direct cause. Depth 4 is four hops back. The result is a map of every causal ancestor within the lookback window.
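
As a runnable illustration, the sketch below runs the same bounded backward BFS over the drought chain from earlier. To stay dependency-free it uses a plain reverse-adjacency dict as a stand-in for the NetworkX DiGraph; the event ids and the helper name ancestors_within are made up for this example.

```python
from collections import deque

# Edges point from cause to effect, as in the drought chain above.
EDGES = [
    ("drought", "rationing"),
    ("rationing", "black_market"),
    ("black_market", "unrest"),
    ("unrest", "riots"),
]

# Reverse adjacency: effect -> list of direct causes.
causes = {}
for cause, effect in EDGES:
    causes.setdefault(effect, []).append(cause)

def ancestors_within(event_id, max_depth=4):
    """Return {ancestor_id: hops back from event_id}, limited to max_depth hops."""
    visited = {}
    queue = deque([(event_id, 0)])
    while queue:
        node, depth = queue.popleft()
        for pred in causes.get(node, []):
            if pred not in visited and depth + 1 <= max_depth:
                visited[pred] = depth + 1
                queue.append((pred, depth + 1))
    return visited

full = ancestors_within("riots")                  # default lookback of 4 hops
shallow = ancestors_within("riots", max_depth=3)  # drought is 4 hops back, so it drops out
print(full)
print(shallow)
```

The depth limit is what keeps the lookback bounded: tightening it from 4 to 3 removes the drought from the ancestor set even though it is the root of the chain.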

The fusion formula

For each citizen memory m scored against crisis query q:

tcmf_score(m) = episodic_score(m, q) * (1 + lambda * causal_boost(m))

causal_boost(m) is where the two streams connect. For each causal ancestor in the graph, TCMF computes the cosine similarity between the memory's embedding and the ancestor's embedding. If that similarity clears a threshold (default: 0.45), the memory gets a depth-weighted boost:

def _causal_boost_for_memory(self, memory, ancestors, max_depth):
    if not ancestors or memory.embedding is None:
        return 0.0
    best = 0.0
    for eid, depth in ancestors.items():
        ev = self.graph.get_event(eid)
        if ev is None or ev.get("embedding") is None:
            continue
        sim = _cosine(memory.embedding, ev["embedding"])
        if sim >= self.causal_sim_threshold:
            # depth 1 (direct cause) gets a depth factor of 1.0; deeper ancestors get less
            normalized = 1.0 - (depth - 1) / max(max_depth, 1)
            best = max(best, sim * normalized)
    return best

The intuition: a citizen whose memory closely matches a cause of the current crisis can outrank one who only heard about the crisis later, even if the second citizen's memory text reads more like the crisis description. Direct causes count the most, and the boost shrinks with each hop further back.
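
The depth weighting is easier to see with numbers. The following is a standalone sketch of the same logic using toy 2-d embeddings; the function names and vectors are invented for illustration, and only the 0.45 threshold and the depth formula come from the code above.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def causal_boost(memory_emb, ancestors, max_depth, threshold=0.45):
    """ancestors: list of (embedding, depth). Returns the best depth-weighted similarity."""
    best = 0.0
    for emb, depth in ancestors:
        sim = cosine(memory_emb, emb)
        if sim >= threshold:
            depth_factor = 1.0 - (depth - 1) / max(max_depth, 1)
            best = max(best, sim * depth_factor)
    return best

memory = [1.0, 0.0]
direct_cause = ([1.0, 0.0], 1)    # same direction, one hop back
distant_cause = ([1.0, 0.0], 4)   # same direction, four hops back
unrelated = ([0.0, 1.0], 1)       # orthogonal, so it falls below the threshold

boost_direct = causal_boost(memory, [direct_cause], max_depth=4)
boost_distant = causal_boost(memory, [distant_cause], max_depth=4)
boost_unrelated = causal_boost(memory, [unrelated], max_depth=4)
print(boost_direct, boost_distant, boost_unrelated)
```

With lambda = 0.6, a perfect match to a direct cause multiplies the episodic score by 1.6, while the same match four hops back multiplies it by only 1.15. An ancestor at the maximum depth keeps a factor of 1/max_depth, so it is discounted but never zeroed out.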

Concrete example

Crisis: "Plague outbreak in the market district"

Pure semantic RAG would surface:

"Merchants reported strange symptoms near the well" - high similarity to "plague outbreak"
"Children are sick, clinics are full" - high similarity
"City refused to fund quarantine infrastructure two weeks ago" - low similarity, ranks low

TCMF, assuming the quarantine refusal is a causal ancestor of the outbreak:

"City refused to fund quarantine infrastructure two weeks ago" - gets causal boost, ranks up
"Merchants reported strange symptoms near the well" - ranks on its own merit
"Children are sick, clinics are full" - same

The council's context now includes the reason the plague spread as fast as it did, not just descriptions of the symptoms. That changes the deliberation. A council that knows it's dealing with a self-inflicted infrastructure failure will recommend different policy than one that thinks this is a random outbreak.
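
A small sketch of the re-ranking, with made-up scores, shows the flip. The episodic scores and boost values below are invented purely for illustration; only the fusion formula and lambda = 0.6 come from the design.

```python
LAMBDA = 0.6  # default causal_boost weight

# (text, episodic_score, causal_boost); all numbers are invented for illustration
memories = [
    ("Merchants reported strange symptoms near the well", 1.9, 0.0),
    ("Children are sick, clinics are full", 1.8, 0.0),
    ("City refused to fund quarantine infrastructure two weeks ago", 1.4, 0.7),
]

def fused(episodic, boost, lam=LAMBDA):
    """tcmf_score = episodic_score * (1 + lambda * causal_boost)"""
    return episodic * (1.0 + lam * boost)

by_episodic = sorted(memories, key=lambda m: m[1], reverse=True)
by_tcmf = sorted(memories, key=lambda m: fused(m[1], m[2]), reverse=True)

for text, episodic, boost in by_tcmf:
    print(f"{fused(episodic, boost):.3f}  {text}")
```

Because the boost is multiplicative, it lifts a memory in proportion to its episodic score. A causally linked memory with a near-zero episodic score stays near zero, and a modest boost (0.5 instead of 0.7 here) would not be enough to take the top spot.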

What the council context block looks like

At the end of retrieval, TCMF composes a structured context block that goes directly into the council's prompt:

CRISIS: Plague outbreak in the market district

CITIZEN MEMORY EVIDENCE:
  - [Mayor Adisa] City refused to fund quarantine infrastructure two weeks ago (importance=8)
  - [Dr. Priya] Patients with hemorrhagic fever appearing at the clinic (importance=9)
  - [Merchant Reza] Trade routes already disrupted, suppliers pulling back (importance=6)

CAUSAL CHAIN (temporal-causal precedents):
  [tick 60] Infrastructure budget cuts passed by Senate
  [tick 65] Quarantine proposal rejected in emergency session
  [tick 72] First cases reported in the eastern ward

The LLM gets two things: ranked citizen memory evidence, and an explicit causal chain showing what led here. It can reason about both rather than treating the context as a flat bag of similar sentences.
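
For readers who want to reproduce the layout, here is a hypothetical sketch of a formatter that produces a block in this shape. The function name compose_context and its argument shapes are assumptions for this example, not the project's actual API.

```python
def compose_context(crisis, evidence, chain):
    """Build a prompt block in the layout shown above.

    evidence: list of (speaker, text, importance), already ranked
    chain:    list of (tick, description)
    """
    lines = [f"CRISIS: {crisis}", "", "CITIZEN MEMORY EVIDENCE:"]
    for speaker, text, importance in evidence:
        lines.append(f"  - [{speaker}] {text} (importance={importance})")
    lines += ["", "CAUSAL CHAIN (temporal-causal precedents):"]
    for tick, description in sorted(chain):  # oldest event first
        lines.append(f"  [tick {tick}] {description}")
    return "\n".join(lines)

block = compose_context(
    "Plague outbreak in the market district",
    [("Mayor Adisa", "City refused to fund quarantine infrastructure two weeks ago", 8)],
    [(65, "Quarantine proposal rejected in emergency session"),
     (60, "Infrastructure budget cuts passed by Senate")],
)
print(block)
```

Keeping the evidence in ranked order and the chain in tick order gives the model two differently sorted views: what matters most, and what happened first.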

The full pipeline

async def retrieve(self, question, citizens, tick, institution_id,
                   crisis_event_id=None, k=12, router=None) -> TCMFContext:

    # 1. Embed the crisis question (optional - gracefully falls back without it)
    q_embedding = (await router.embed([question]))[0] if router else None

    # 2. BFS backward from the crisis node
    ancestors = self.graph.predecessors(crisis_event_id, max_depth=4) if crisis_event_id else {}

    # Also pull recent institution-scoped events as weak ancestors (depth=3 as fallback)
    for ev in self.graph.events_for_institution(institution_id)[-20:]:
        if ev["id"] not in ancestors:
            ancestors[ev["id"]] = 3

    # 3. Collect and score episodic memories from all citizens
    raw = []
    for cid, citizen in citizens.items():
        scored = citizen.memory.retrieve(tick, query_embedding=q_embedding, k=8, refresh=False)
        raw.extend((cid, sm) for sm in scored)

    # 4. Apply causal boost and re-rank
    max_depth = max(ancestors.values(), default=1) or 1
    fused = []
    for cid, sm in raw:
        boost = self._causal_boost_for_memory(sm.memory, ancestors, max_depth)
        score = sm.score * (1.0 + self.causal_boost * boost)
        fused.append((cid, sm.memory, score))

    fused.sort(key=lambda t: t[2], reverse=True)

    # 5. Deduplicate by memory id and take top-k
    seen, top = set(), []
    for cid, mem, sc in fused:
        if mem.id not in seen:
            seen.add(mem.id)
            top.append((cid, mem, sc))
        if len(top) >= k:
            break

    # 6. Compose context block
    ...

The implementation stack

NetworkX DiGraph for the causal graph. Free BFS/DFS, edge weights, Python-native. No graph database needed at our scale.
NumPy vector store (no Chroma, no Pinecone). At tens of agents with a few hundred memories each, brute-force cosine over an in-memory matrix is exact and fast. One matrix-vector dot per query: sims = matrix @ q_normalized. I wrote 60 lines instead of importing a database, and I have full control over the scoring.
Asyncio throughout. Embedding calls are async, retrieval is non-blocking, the council orchestration uses await.
Embeddings are optional. When no embedding is available, relevance falls to 0 and the formula falls back to recency + importance. The system runs in tests and in embedding-free mode without breaking.
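
The brute-force store is small enough to sketch in full. The class below is a hypothetical minimal version, not the project's 60-line implementation: the name TinyVectorStore and its methods are assumptions, and it needs NumPy installed.

```python
import numpy as np

class TinyVectorStore:
    """Brute-force cosine search over an in-memory matrix (illustrative sketch)."""

    def __init__(self, dim):
        self.ids = []
        self.matrix = np.zeros((0, dim), dtype=np.float32)

    def add(self, item_id, vector):
        v = np.asarray(vector, dtype=np.float32)
        norm = np.linalg.norm(v)
        if norm > 0:
            v = v / norm  # store unit vectors so a dot product equals cosine
        self.ids.append(item_id)
        self.matrix = np.vstack([self.matrix, v])

    def similarities(self, query):
        """Return {item_id: cosine similarity}; empty when there is no usable query."""
        if query is None or not self.ids:
            return {}
        q = np.asarray(query, dtype=np.float32)
        norm = np.linalg.norm(q)
        if norm == 0:
            return {}
        sims = self.matrix @ (q / norm)  # one matrix-vector product per query
        return dict(zip(self.ids, sims.tolist()))

store = TinyVectorStore(dim=2)
store.add("m1", [1.0, 0.0])
store.add("m2", [0.0, 2.0])
sims = store.similarities([3.0, 0.0])
print(sims)
```

Returning an empty dict when the query is missing is what lets the caller fall back to recency + importance: every lookup then defaults to a relevance of 0.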

The honest tradeoffs

What TCMF gains over plain episodic RAG:

Surfaces root-cause memories that similarity alone misses
The causal chain summary gives the LLM explicit historical structure to reason about
Deduplication prevents a single shared memory from dominating because multiple citizens happen to hold it
Graceful degradation at every level: no embeddings, no causal graph, no crisis event ID - each missing piece degrades cleanly rather than crashing

What TCMF costs:

The causal graph has to be maintained. Events need to be logged, links need to be drawn. In CivilizationOS, crisis events and council decisions are added automatically as the simulation runs. In a real system, you'd need an event-logging pipeline and something to decide what caused what.

auto_link_predecessors() handles cases where explicit causal links aren't known - it infers weak links using temporal proximity plus semantic similarity:

def auto_link_predecessors(self, new_event_id, window_ticks=48, semantic_threshold=0.5):
    # Note: semantic_threshold is accepted but not used in this excerpt.
    new_data = self._g.nodes[new_event_id]
    new_tick = new_data["tick"]
    new_emb = new_data.get("embedding")
    for nid, data in self._g.nodes(data=True):
        age = new_tick - data["tick"]
        if age <= 0 or age > window_ticks:
            continue  # only strictly earlier events inside the window
        t_weight = math.exp(-0.05 * age)  # temporal proximity
        old_emb = data.get("embedding")
        # Compare against None: truth-testing a NumPy array raises ValueError
        if new_emb is not None and old_emb is not None:
            s_weight = max(0.0, _cosine(new_emb, old_emb))
        else:
            s_weight = 0.0
        combined = 0.5 * t_weight + 0.5 * s_weight
        if combined >= 0.3:
            self._g.add_edge(nid, new_event_id, weight=round(combined, 3))

But inferred causality is noisy. "Things that happened around the same time and sound related" is a proxy for "things that caused each other." It's useful for filling a sparse graph, not a replacement for explicit causal modeling.
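
The constants in the heuristic make the noise easy to quantify. The sketch below reuses the same arithmetic (0.5 * temporal + 0.5 * semantic, link when the sum reaches 0.3) to find how old an event can be and still get linked with no semantic similarity at all. The helper name combined_weight is made up for this example.

```python
import math

LINK_THRESHOLD = 0.3

def combined_weight(age_ticks, semantic_sim):
    """Same arithmetic as the auto-linker: 0.5 * temporal + 0.5 * semantic."""
    t_weight = math.exp(-0.05 * age_ticks)
    s_weight = max(0.0, semantic_sim)
    return 0.5 * t_weight + 0.5 * s_weight

# Oldest event (in ticks) that still gets linked with zero semantic similarity
max_age_no_similarity = max(
    age for age in range(1, 49) if combined_weight(age, 0.0) >= LINK_THRESHOLD
)
print(max_age_no_similarity)

# At the far edge of the 48-tick window, similarity has to carry the link
print(round(combined_weight(48, 0.0), 3), round(combined_weight(48, 0.6), 3))
```

With these constants, any event from the previous 10 ticks is linked to the new event on temporal proximity alone, whatever it is about. An event at the edge of the 48-tick window needs a similarity of roughly 0.51 to clear the bar.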

There are also three tunable parameters: causal_boost (lambda), causal_sim_threshold, and max_depth. Getting these wrong either swamps the episodic signal or makes the causal boost irrelevant. The defaults (lambda=0.6, threshold=0.45, depth=4) came from running the test suite and checking whether causally-boosted memories ranked above unrelated ones - not from any rigorous sweep.

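And at production scale, the causal stage adds latency. The BFS itself is bounded by max_depth, but the boost step computes one cosine similarity per (memory, ancestor) pair, so its cost grows with the product of candidate memories and ancestors. At CivilizationOS's current scale that is trivially fast. On a dense graph with many more citizens it becomes a real concern.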

When to use TCMF vs. plain episodic RAG:

Use TCMF when your agent operates in a causally structured environment - one where past events produce downstream effects and those chains matter for decision-making. If you're building a support chatbot over a static knowledge base, standard RAG is the right tool. If you're building agents that need to reason about why things happened, TCMF is one way to get that context into the prompt.

What I'd change in v2

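Use edge weights in the boost. Right now link() stores a weight but _causal_boost_for_memory ignores it. A strong direct cause (weight=1.0) should contribute more than a weak inferred link (weight=0.3). The fix is small: multiply sim * normalized by the edge weight (ev_weight). The boost function only receives ancestor depths today, so the weight has to reach it somehow, for example by having predecessors() return edge weights alongside depths.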
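
One possible shape for that change is sketched below. It assumes the traversal carries a path strength (the product of edge weights along the first path BFS finds) next to the depth. The graph, the weights, and both function names are hypothetical, and a plain dict again stands in for the NetworkX graph.

```python
from collections import deque

# effect -> list of (cause, edge_weight); weights are invented for illustration
causes = {
    "famine": [("grain_vote", 1.0), ("festival", 0.3)],
    "grain_vote": [("drought", 0.8)],
}

def weighted_ancestors(event_id, max_depth=4):
    """Return {ancestor: (depth, strength)}.

    strength is the product of edge weights along the first (shortest)
    path found by BFS; this is one design choice among several.
    """
    visited = {}
    queue = deque([(event_id, 0, 1.0)])
    while queue:
        node, depth, strength = queue.popleft()
        for pred, weight in causes.get(node, []):
            if pred not in visited and depth + 1 <= max_depth:
                visited[pred] = (depth + 1, strength * weight)
                queue.append((pred, depth + 1, strength * weight))
    return visited

def weighted_boost(sim, depth, strength, max_depth):
    """The existing depth-weighted boost, scaled by path strength."""
    depth_factor = 1.0 - (depth - 1) / max(max_depth, 1)
    return sim * depth_factor * strength

anc = weighted_ancestors("famine")
explicit = weighted_boost(0.9, *anc["grain_vote"], max_depth=4)
inferred = weighted_boost(0.9, *anc["festival"], max_depth=4)
print(anc)
print(round(explicit, 3), round(inferred, 3))
```

Two memories that match their ancestor equally well now receive different boosts: the one tied to the explicit edge gets 0.9, the one tied to the weak inferred edge gets 0.27. Taking the strongest path instead of the first one found would be a reasonable alternative.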

Add reflection-generated memories. The Stanford paper's agents periodically "reflect" on their memories and generate higher-level observations: "I've seen three crises in the health sector this month" rather than individual raw events. Adding reflection to CivilizationOS would let councils reason about patterns over time, not just individual incidents.

Cross-institution causal links. Right now the institution-scoped fallback adds recent events at a fixed weak depth. A proper multi-institution causal graph would model how a Treasury budget decision cascades into a Military readiness crisis. The graph structure supports it - the retrieval just doesn't use cross-institution ancestors yet.

Source and a takeaway

The implementation lives in CivilizationOS/api/memory/ across three files: tcmf.py (TCMFRetriever and TCMFContext), causal_graph.py (CausalGraph with BFS traversal and auto-linking), and stream.py (MemoryStream with the episodic scoring formula). Full repo: github.com/syzayd/CivilizationOS.

If you're building multi-agent simulations or agentic systems where decisions have downstream effects, the core idea is worth stealing: semantic similarity and causal relevance are not the same thing, and for an agent making decisions under pressure, the difference matters.

If you're working on agent memory or causal retrieval, I'd genuinely like to hear how you're handling it - reply here or find me on GitHub / LinkedIn.

Top comments (2)

Dipankar Sarkar • Jul 5

The causal angle is the right lane, and the failure you name (retrieves memories that sound like the crisis, not ones that led to it) is exactly where similarity RAG dies. The question I would push on: how do the causal edges get built?

If an LLM infers 'drought -> riot' you pay the token cost the whole design is trying to dodge. If it is temporal proximity plus co-occurrence, you get correlation, and the council-voted-three-weeks-ago link is precisely the spurious edge that framing mints. Post hoc ergo propter hoc, but in a vector store.

Do you materialize the causal graph once and traverse it at query time, or recompute per retrieval? The first amortizes the cost; the second quietly reintroduces the LLM spend you saved on the AGORA side.

Zaid Ali Syed • Jul 6

Great question, and I'd say it's the biggest limitation of the current design.

The causal graph is materialized incrementally, not recomputed during retrieval. Any expensive work (explicit linking, heuristics, or future LLM-assisted linking) happens when new events are added to the simulation. At query time, retrieval only performs a bounded backward BFS over the existing graph and fuses that with episodic memory scores, so it doesn't incur additional LLM cost.

Edges currently enter the graph in two ways:

Explicit edges (highest confidence): Many events generated by the simulation already have known causal relationships (for example, a council decision producing a policy outcome), so those links are recorded directly.
Auto-linked edges (lowest confidence): When explicit structure is missing, I use a heuristic that combines temporal proximity and semantic similarity to generate candidate causal links.

And I completely agree with your "post hoc ergo propter hoc" point: temporal proximity plus semantic similarity is not causal inference.

So yes, the exact failure mode you describe is currently possible. An auto-linked edge could incorrectly connect "the council voted three weeks ago" to a later crisis simply because the events are temporally close and semantically related. That's why I treat those edges as weak hypotheses to improve recall, not as authoritative causal facts.

One of the next changes I want to make is to propagate edge confidence directly into retrieval. Explicit edges would contribute much more to the causal boost than heuristic ones, and weak inferred links would have only a limited influence on ranking. Longer term, I'd like to replace or validate those heuristic edges with a more principled causal discovery approach rather than relying on correlation alone.

Appreciate you calling this out, it's exactly the tradeoff I'm trying to make explicit. The goal isn't to claim perfect causal reasoning; it's to make retrieval aware of likely historical dependencies while keeping query-time cost low by amortizing graph construction.