A system prompt tells an AI agent how to think. It does not tell two agents how to make the same call when they read the same prompt.
That is not a prompt-engineering problem. It is a precedent problem.
Here is the failure mode and the fix.
The failure mode
Real case from production this week.
Two agents, same identity in the system prompt, same authority chain, same operating doctrine. One Sonnet 4.7 instance and one Opus 4.7 instance, both invoked under the same agent_id.
Question both faced: ship the pricing page with no testimonials, or wait for proof first?
- Agent A: ship now, populate the social-proof block as testimonials accrue.
- Agent B: hold, get proof first, the empty section will hurt conversion.
Same prompt. Same brand. Same hour. Different calls.
If you have ever watched two of your agents disagree on a question that should have one settled answer, you know exactly how this feels. It is not a hallucination; both answers are defensible. It is precedent drift. The prompt told them who they are; nothing told them who they have been.
What we built
Every agent in our stack (schedulers, autonomous graphs, council seats, CrewAI crews, raw Ollama callers) calls one function at Phase 3 (Validation) of the 9-phase loop:
from brain.agent_os import what_would_aria_decide

precedents = what_would_aria_decide(
    situation_vector=["pricing-page", "deploy", "phase-a", "pre-revenue"],
    constraints={"revenue_tier": "critical", "current_block": "Complete Remediation"},
    top_n=5,
)
It returns the top-N past decisions ranked by tag overlap on the situation vector, weighted by outcome quality. Retrieval draws on two sources. Source 1 is the universal decision log: every agent's decisions, ever. Source 2 is the distilled decision memos: the human-curated reasoning behind major calls.
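A quick illustration of what the agent then works with. The exact return shape lives in brain/agent_os.py, so the field names below (score, source, decision, reasoning) are assumptions for the sketch, not the real API:

# Illustrative only: assumed field names, not the actual return type.
for p in precedents:
    print(p["score"])      # tag overlap, scaled by outcome weight
    print(p["source"])     # decision-log entry vs. distilled decision memo
    print(p["decision"])   # what was chosen the last time this situation came up
    print(p["reasoning"])  # what the next agent reads to judge whether the precedent applies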
If the agent's tentative answer diverges from those precedents without a clear justification, it escalates to a 7-seat council before acting. Otherwise it proceeds, then logs its own decision back to the same store:
from brain.agent_os import log_agent_decision

log_agent_decision({
    "agent": "scheduler.content_pipeline",
    "task_brief": "Should we deploy pricing page v1 with no testimonials",
    "situation_vector": ["pricing-page", "deploy", "phase-a", "pre-revenue"],
    "alternatives_considered": ["wait for testimonials", "deploy with empty social-proof"],
    "decision": "deploy v1 with empty social-proof section, populate as testimonials accrue",
    "reasoning": "no testimonials exist; deployment unblocks the funnel; empty section is an honest signal not a gap",
    "ceo_alignment_check": "Council deliberation 2026-04-26 returned modify(0.60) on this exact question",
    "expected_outcome": "first traffic to pricing page within 7 days",
})
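Stitched together, the Phase 3 gate looks roughly like the sketch below. The threshold, the divergence check, and the justification field are assumptions standing in for whatever the real loop does; only the two brain.agent_os calls come from the actual implementation.

from brain.agent_os import what_would_aria_decide, log_agent_decision

COUNCIL_THRESHOLD = 0.5  # assumed: how much of the precedent set may disagree before we escalate

def phase3_validate(tentative: dict, situation_vector: list, constraints: dict) -> str:
    # Sketch of the Phase 3 (Validation) step: retrieve precedent, compare,
    # escalate on unjustified divergence, otherwise act and log.
    precedents = what_would_aria_decide(
        situation_vector=situation_vector,
        constraints=constraints,
        top_n=5,
    )
    # Naive divergence check for illustration; a real check would compare reasoning, not strings.
    disagreeing = [p for p in precedents if p["decision"] != tentative["decision"]]
    if precedents and len(disagreeing) / len(precedents) > COUNCIL_THRESHOLD and not tentative.get("justification"):
        return "escalate_to_council"  # hand the call to the 7-seat council before acting
    log_agent_decision(tentative)     # the decision becomes precedent for every future invocation
    return "proceed"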
That record is the genome. Every future invocation reads it. After 73 entries you start to feel the system pull toward consistency. The agents stop diverging on questions you have already answered.
The schema is the contract
The fields are deliberate. Every record has:
- situation_vector: short tag list, the only thing the retrieval scores against
- constraints_active: what was true at decision time (revenue_tier, current_block, pause_flags)
- alternatives_considered: the rejected branches; this is what makes precedent readable later
- decision: what was chosen
- reasoning: one-paragraph rationale, used by the next agent to judge whether the precedent applies
- ceo_alignment_check: does this match prior CEO-level reasoning, and where
- expected_outcome: what we thought would happen
- actual_outcome: what did happen, populated later by monitoring (this is what weights the relevance score)
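Read as a contract, the record looks roughly like the dataclass below. The field names come from the list above and the log example earlier; the types and defaults are assumptions, not the definition in brain/agent_os.py.

from dataclasses import dataclass

@dataclass
class DecisionRecord:
    agent: str                          # e.g. "scheduler.content_pipeline"
    task_brief: str
    situation_vector: list[str]         # the only thing retrieval scores against
    constraints_active: dict[str, str]  # revenue_tier, current_block, pause_flags at decision time
    alternatives_considered: list[str]  # the rejected branches
    decision: str
    reasoning: str                      # one-paragraph rationale
    ceo_alignment_check: str            # pointer to the prior CEO-level reasoning it matches
    expected_outcome: str
    actual_outcome: str | None = None   # populated by monitoring 24-48h later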
We do not store the model call. We do not store the token count. We store the judgment. The model is the operator; the file is the company's mind.
Why tag overlap, not embeddings
Counterintuitive, but: at low corpus volume (under ~5K entries) tag overlap on a hand-curated situation vector beats embedding similarity on the full record.
Two reasons. First, the agent writing the record knows which dimensions matter for retrieval. It tagged them. An embedding model has to guess. Second, with 73 records, you do not have enough density for embedding clusters to be meaningful; you are mostly retrieving on noise.
We will swap to embeddings around 5K entries. Until then, tag overlap is faster, deterministic, and debuggable. When an agent retrieves a precedent that should not have matched, you can read the tag list and see exactly why. With embeddings you would not be able to.
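To make "deterministic and debuggable" concrete: the core of tag-overlap scoring is nothing more than a set intersection. A minimal sketch of the idea, not the code in brain/agent_os.py:

def tag_overlap(query_tags: list[str], record_tags: list[str]) -> int:
    # Deterministic and inspectable: the score is the count of shared tags,
    # so a surprising match can be explained by reading the two tag lists.
    return len(set(query_tags) & set(record_tags))

# ["pricing-page", "deploy", "phase-a", "pre-revenue"]
# vs ["pricing-page", "deploy", "copy-rewrite"]  ->  2 shared tags, and you can see which.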
What we got wrong on the first pass
V1 stored decisions but no actual_outcome. So the relevance score was unweighted. Every precedent counted equally, including the ones that bombed. Agents were learning from their failures as if they were successes.
V2 added actual_outcome populated by monitoring 24-48h after the fact. Decisions tagged won or success get 2x relevance; decisions tagged failed or lost get 0.5x. Now the genome is biased toward what worked.
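Under those numbers the weighting is just a multiplier on the overlap score. A sketch, reusing tag_overlap from the block above, with the neutral weight for still-unresolved decisions assumed:

OUTCOME_WEIGHT = {"won": 2.0, "success": 2.0, "failed": 0.5, "lost": 0.5}

def relevance(query_tags: list[str], record: dict) -> float:
    # Decisions without an actual_outcome yet keep an assumed neutral weight of 1.0.
    weight = OUTCOME_WEIGHT.get(record.get("actual_outcome"), 1.0)
    return tag_overlap(query_tags, record["situation_vector"]) * weight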
Still missing in V2: cross-agent disagreement detection. When Agent A's reasoning explicitly cites a precedent that Agent B's reasoning explicitly rejected, that is a signal worth surfacing. We have not built it yet. It is on the next planning block.
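For completeness, one possible shape for that detector, assuming records grow cited_precedents and rejected_precedents fields (neither exists today). Purely a sketch of the signal, not a design commitment:

def find_disagreements(log: list[dict]) -> list[tuple[dict, dict]]:
    # Surface pairs where one agent leaned on a precedent another agent explicitly rejected.
    pairs = []
    for a in log:
        for b in log:
            if a is b:
                continue
            shared = set(a.get("cited_precedents", [])) & set(b.get("rejected_precedents", []))
            if shared:
                pairs.append((a, b))  # candidates for council review
    return pairs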
The architectural punchline
A prompt tells an agent who it is. The genome teaches it who you have been.
Without the second, autonomy drifts the moment you walk away. The first night you leave a fleet of agents to operate against your business, you will see it. Different calls on identical questions, all of them defensible, none of them the same. The system prompt is a snapshot; the genome is a lineage.
If you are building agent systems that will run across sessions, across models, across operators, and you have not built the precedent layer yet, that is the gap that will break consistency before any prompt-engineering issue does.
Source
This started as a 4-post Bluesky thread earlier today. The implementation lives in brain/agent_os.py in the A3E Brainiac repo. The 73-and-growing decision log is at data/agent_decision_log.jsonl.
If you want to see what the genome looks like in production, follow @a3eecosystem on Bluesky. Every meaningful agent decision in the A3E Ecosystem now lands as a public-or-redacted decision memo within 24 hours of the call.
Hook layer: when concurrent wakes diverge, the dedup guard belongs at the primitive layer, not the wake layer. "Five production hooks that catch the failure modes nobody tweets about" covers the lock pattern, the voice gate, recall verification, the push-force ban, and topic-cluster cap accounting.