DEV Community

Sean  |   Mnemox

I Gave My Trading Agent Memory and It Made Everything Worse

How similarity-based recall amplifies LLM confirmation bias, and a simple mechanism that breaks the feedback loop.


I spent two days and $73 watching an LLM trading agent destroy itself with its own memories. What I found wasn't a bug. It was a structural flaw in how every similarity-based memory system interacts with an LLM's internal beliefs — and the fix turned out to be counterintuitively simple: make the agent remember its failures, even when the retrieval system doesn't want to.

This is the story of that experiment, what went wrong, and the open-source mechanism I built to prevent it.

The Setup

I'm building TradeMemory, an episodic memory layer for AI trading agents. The idea is straightforward: store every trade the agent makes — entry, exit, P&L, market context — and retrieve relevant past trades at decision time so the agent can learn from experience. Exactly what you'd want a human trader to do.

The experimental framework is called Trade Dreaming. It runs an LLM agent through historical XAUUSD M15 bars (50,802 bars from Jan 2024 to Mar 2026), letting the agent decide on each bar whether to trade or hold. Three strategies are available: VolBreakout (VB), IntradayMomentum (IM), and PullbackEntry (PB). Starting equity is $10,000, risk is 0.25% per trade, buy-only.

Before adding memory, I ran three different models through the same framework, same prompt, same data. The results were... instructive.

(A note on costs: the full 2-day experiment cost $72.69 across 6,836 decisions and 40 trades. Sonnet runs at about $0.014 per decision, Haiku at $0.001. I mention this because "I ran experiments" sounds different when you know the entire budget was under $75.)

Three Models, Three Personalities

| Metric | Haiku 3.5 | Sonnet 4 | DeepSeek-V3 |
|---|---|---|---|
| Decisions | 200 | 2,000 | 2,000 |
| Trades executed | 22 | 6 | 0 |
| Trade rate | 11.0% | 0.3% | 0.0% |
| Win rate | 22.7% | 83.3% | N/A |
| Profit factor | 0.96 | 2.42 | N/A |
| Final equity | ~$9,980 | $10,176 | $10,000 |
| API cost | ~$0.23 | ~$28.86 | ~$2.50 |

Haiku ran 200 decisions as a preliminary screen; Sonnet and DeepSeek ran the full 2,000.

Haiku was the trigger-happy intern — 22 trades in 200 decisions, 22.7% win rate, net negative. It fired at everything. Pure System 1: fast, impulsive, undiscriminating.

Sonnet was the senior trader — 6 trades in 2,000 decisions, 83.3% win rate, profit factor 2.42. It only took 4 VolBreakout and 2 PullbackEntry setups. Zero IntradayMomentum trades. It knew what to skip.

DeepSeek-V3 was the analyst who never left the office — 2,000 consecutive HOLD outputs. Zero trades. It found uncertainty in every setup, burned 3,000+ reasoning tokens per decision, and eventually crashed from memory accumulation at decision 1,786. Final equity: $10,000.00 exactly.

A perfect behavioral spectrum: reckless → precise → paralyzed. The same prompt, the same data, and a 37x difference in trade frequency between Haiku and Sonnet. This alone is interesting — existing literature has documented that smarter models don't always trade better (GPT-4o-mini beats GPT-4o on Sharpe ratio in one benchmark) and that reasoning models overthink financial decisions. But nobody had quantified the full spectrum in a single framework before.

Sonnet was clearly the winner. So I gave it memory.

Memory Made Everything Worse

The memory system stores each closed trade as an episodic record — strategy, entry/exit prices, P&L, market regime, session, ATR, confidence level. At each new decision, the retrieval system finds the 5 most similar past trades (scored by ATR proximity, session overlap, and regime match) and injects them into the prompt.
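To make the storage and retrieval concrete, here is a minimal sketch of what such an episodic record and similarity scorer might look like. The field names, equal weighting, and ATR decay formula are my illustrative assumptions, not the actual TradeMemory schema:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TradeRecord:
    strategy: str      # e.g. "VolBreakout"
    pnl: float         # realized P&L in dollars
    regime: str        # e.g. "trending", "ranging"
    session: str       # e.g. "london", "ny", "asia"
    atr: float         # ATR at entry
    confidence: float  # model-reported confidence at entry

def similarity(query_atr: float, query_session: str, query_regime: str,
               rec: TradeRecord) -> float:
    """Score a past trade's similarity to current conditions (0..1)."""
    # ATR proximity: 1.0 when identical, decaying with relative distance
    atr_score = 1.0 / (1.0 + abs(rec.atr - query_atr) / max(query_atr, 1e-9))
    session_score = 1.0 if rec.session == query_session else 0.0
    regime_score = 1.0 if rec.regime == query_regime else 0.0
    # Equal weights are an assumption; the real scorer may weight differently
    return (atr_score + session_score + regime_score) / 3.0

def recall_top_k(records: List[TradeRecord], query_atr: float,
                 query_session: str, query_regime: str, k: int = 5) -> List[TradeRecord]:
    """Return the k most similar past trades for the current conditions."""
    return sorted(records,
                  key=lambda r: similarity(query_atr, query_session, query_regime, r),
                  reverse=True)[:k]
```

Note that nothing in this scorer looks at `pnl` — outcomes play no role in what gets retrieved, which is exactly how the bias described below sneaks in.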

Here's what happened:

| Metric | No Memory (baseline) | With Memory | Delta |
|---|---|---|---|
| Trades | 6 | 7 | +1 |
| Win rate | 83.3% | 57.1% | -26.2pp |
| Profit factor | 2.42 | 0.94 | -1.48 |
| PnL | +$176 | -$28 | -$204 |
| Strategies | VB(4) + PB(2) | VB(4) + PB(1) + IM(2) | IM appeared |

The agent went from profitable to unprofitable. Profit factor dropped below 1.0. Two IntradayMomentum trades appeared — a strategy Sonnet had correctly avoided in every single one of its 2,000 no-memory decisions. Both IM trades hit their stop-losses. Combined loss: -$437, wiping out all VB and PB gains.

And here's the kicker: both IM trades were entered with confidence 0.85 — the highest confidence of any trade in the entire run. The agent was most confident on its worst trades.

The Debugging Rabbit Hole

Getting to this point wasn't clean. The first attempt at adding memory revealed that the engine wasn't even storing closed trades — a bug where _execute_decision didn't return the closed position. I fixed that, re-ran, got 1 trade with 1 episodic record. Pipeline verified.

Then I discovered a shortcut: I could backfill episodic memory from the existing Sonnet 2000-decision JSONL log. Six trades, already completed, just needed to be converted to memory records. That saved $28 and 5 hours of re-running the full baseline.

With the backfilled memory in place, I ran the full 2,000-decision memory test. That's when the profit factor cratered from 2.42 to 0.94. Two IM trades appeared. Both lost.

My first fix attempt addressed three bugs at once: it added loss balance to the retrieval, fixed unbalanced guidance text in the prompt, and patched the regime classifier that was tagging everything as "unknown." All 44 retrieval tests and 503 engine tests passed. I re-ran 200 decisions.

IM still appeared.

It took another hour of debugging to discover the engine was using the old recall function, not my new hybrid retrieval. The hybrid.py I'd written was sitting there, fully tested, completely unused. Classic integration failure. I redesigned the engine to accept a pluggable memory_recall_fn via dependency injection, wired in the hybrid retrieval, hit a Pydantic import error, fixed it, and finally ran the validation that worked.
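The dependency-injection shape that finally made the recall path swappable looks roughly like this. The class and method names here are hypothetical, for illustration only:

```python
from typing import Any, Callable, Dict, List, Optional

class DreamEngine:
    """Sketch of an engine that takes its recall function as a constructor
    argument, so the old recall and the new hybrid recall are interchangeable."""

    def __init__(self, memory_recall_fn: Optional[Callable[[Dict[str, Any]], List]] = None):
        # Default to "no memory"; any callable with the same shape plugs in
        self.memory_recall_fn = memory_recall_fn or (lambda ctx: [])

    def decide(self, market_context: Dict[str, Any]) -> Dict[str, Any]:
        recalled = self.memory_recall_fn(market_context)
        # ... build the prompt with `recalled`, call the LLM, execute ...
        return {"recalled": len(recalled)}
```

The payoff of this pattern is that a hybrid retrieval module can be unit-tested in isolation and then wired in with one constructor argument — which is precisely the integration step that was silently missing before.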

The Root Cause: 100% Positive Recall

When I examined what the agent actually saw in its prompt at the point of the IM entries, the memory block looked like this:

```
## Past Similar Trades
1. [VolBreakout] pnl=+$92.00  Relevance: 0.97
2. [VolBreakout] pnl=+$31.10  Relevance: 0.93
3. [VolBreakout] pnl=+$105.80 Relevance: 0.78
4. [PullbackEntry] pnl=+$19.90 Relevance: 0.78
5. [PullbackEntry] pnl=+$51.60 Relevance: 0.78
```

Five trades. Five winners. Zero losses. The retrieval system had done exactly what it was designed to do — find the most similar past experiences — and returned an entirely positive sample.

Compare this to the no-memory prompt for the same decision point. Without memory, the agent sees the current bar, 20 recent bars, technical indicators (ATR, RSI, SMAs), and its recent trade history as a flat list. With memory, it gets an additional block of 5 "similar past trades," each with context, reflection text, and a relevance score. The agent reads: "In similar market conditions, here are 5 trades you made. All 5 were profitable. The most similar one (relevance 0.97) made $92."

There is no counterexample. No memory of "this setup also failed X% of the time." The agent generalizes from a perfectly biased sample.
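For what it's worth, a memory block like the one above takes only a few lines to render — the danger is entirely in which records reach the renderer, not in the formatting. A hypothetical renderer, assuming `(strategy, pnl, relevance)` tuples:

```python
from typing import List, Tuple

def render_memory_block(records: List[Tuple[str, float, float]]) -> str:
    """Format retrieved trades into a prompt block (illustrative sketch)."""
    lines = ["## Past Similar Trades"]
    for i, (strategy, pnl, relevance) in enumerate(records, 1):
        sign = "+" if pnl >= 0 else "-"
        lines.append(f"{i}. [{strategy}] pnl={sign}${abs(pnl):.2f}  Relevance: {relevance:.2f}")
    return "\n".join(lines)
```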

I initially thought this was a data problem — maybe the memory just didn't have enough losses yet. But at the point of the second IM trade (around decision 1,600), the episodic memory already contained 12 records, including 3 losses. Nine of 12 records were wins (75% positive bias), and 5 had regime tagged as "unknown" due to a classifier bug. But the real issue wasn't the database composition — it was that the retrieval system picked the top 5 by similarity, and all 5 happened to be winners.

This wasn't a coincidence. It's a structural property of similarity-based retrieval.

Why Similarity-Based Retrieval Has a Built-In Positive Bias

Think about where winning trades cluster versus where losing trades cluster:

Winning trades tend to happen in typical conditions — trending markets, London session (most liquid), normal ATR ranges, textbook setups. These are the most common market states, because strategies are designed to work in common conditions.

Losing trades concentrate in atypical conditions — range-bound markets, off-hours with thin liquidity, extreme ATR spikes, edge cases. By definition, unusual conditions are less similar to any typical query.

When you ask "find me trades in conditions similar to right now," you're querying against the most common market state. Winning trades dominate that region of the space. Losses are scattered in the tails, where similarity scores are inherently lower.

This means any similarity-based retrieval system will systematically over-retrieve positive outcomes, even with a perfectly balanced underlying database. The bias isn't in the data. It's in the geometry of retrieval itself.
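A toy model makes the geometry concrete. Take a perfectly balanced database where wins cluster near the typical market state and losses sit in the tails (all numbers invented for illustration):

```python
# Balanced toy database: each record is (outcome, market_state).
# Wins cluster near the typical state (0.0); losses sit in the tails.
wins = [("win", s / 10.0) for s in range(-6, 6)]             # 12 wins, |state| <= 0.6
losses = [("loss", sign * (1.5 + s / 10.0))                  # 12 losses, |state| >= 1.5
          for sign in (-1, 1) for s in range(6)]
db = wins + losses

query = 0.0  # "right now" looks like the most common market state
top5 = sorted(db, key=lambda rec: abs(rec[1] - query))[:5]

print([outcome for outcome, _ in top5])  # all 'win', despite a 50/50 database
```

The database is 12 wins and 12 losses, yet the 5 nearest neighbors of a typical query are all wins — no retrieval bug required.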

Resonance: When Retrieval Confirms What the LLM Already Believes

Here's where it gets dangerous. The biased retrieval doesn't operate in isolation — it feeds into an LLM that has its own beliefs.

Every LLM carries parametric memory: knowledge baked into its weights during training. For trading, this includes everything it absorbed from financial textbooks and trading forums: "breakout trading works," "momentum strategies capture intraday moves," "the trend is your friend." These beliefs are permanent, uninspectable, and always running in the background.

Current research on parametric-contextual knowledge interaction — surveyed comprehensively by Xu et al. at EMNLP 2024, with benchmarks like ConflictBank (NeurIPS 2024) and EchoQA (ICLR 2025) — focuses almost entirely on what happens when the two disagree. Six major papers and two benchmarks study the conflict axis. The implicit assumption is that agreement is good: both sources say the same thing, higher confidence, better output.

Our data shows the opposite.

When the retrieval system returns 5 winning VolBreakout trades, and Sonnet's parametric memory already believes "breakout trading works," the two signals amplify each other. I call this resonance. The mechanism follows a clear chain:

  1. Sonnet's weights contain the belief that breakout strategies are valid (absorbed from training data — breakout trading is one of the most-documented technical strategies in existence).
  2. The agent's first few closed trades happen to be VB winners. They get stored in episodic memory.
  3. On the next decision, retrieval finds the 5 most similar past trades. All 5 are VB winners (because winning VB trades cluster in the most common market state).
  4. Now the prompt says: "Here are 5 trades you made in similar conditions. All 5 were profitable."
  5. Parametric memory says: "Breakout works." External memory says: "Everything you've done works." Both signals point the same direction. Resonance.
  6. Confidence inflates beyond calibration. The agent starts taking IntradayMomentum entries — because parametric memory says "momentum is valid" and external memory says "I'm on a winning streak."

This maps directly onto documented LLM behavior. The "Chain of Evidence" paper (arXiv, Dec 2024) demonstrated that LLMs exhibit confirmation bias — they preferentially trust external evidence that aligns with their internal knowledge, regardless of whether that evidence is actually correct. ReDeEP (ICLR 2025 Spotlight) showed that Knowledge FFNs in transformer models overemphasize parametric knowledge while Copying Heads fail to properly integrate external context. And "No Free Lunch" (EMNLP 2025) found that RAG amplifies model confidence in biased answers — just 20% unfair samples in retrieval was enough to trigger amplification.

These are all pieces of the same puzzle. Nobody had assembled them into a single causal chain: similarity retrieval bias + LLM confirmation bias + parametric knowledge alignment = resonance.

The Human Parallel

The parallel to behavioral finance is not a metaphor — it's mechanistically identical.

Godker, Jiao, and Smeets published in PNAS (2021) that human investors systematically over-remember winning trades and under-remember losses. Jiang et al. in the Quarterly Journal of Economics (2025) showed that investor memory-based beliefs explain stock return expectations, with rising markets triggering positive recall feedback loops. Fudenberg, Lanzani, and Strack formalized this in the Journal of Political Economy (2024) as a "Selective Memory Equilibrium" — agents who over-remember ego-boosting experiences become overconfident.

Replace "human investor's selective forgetting" with "retrieval system's similarity bias" and you get the same outcome through a different mechanism: a biased sample of past experiences that systematically overstates the probability of success.

Nobody had connected these two literatures. The behavioral finance people study humans. The AI agent people study LLMs. They're describing the same phenomenon.

Anti-Resonance: The Fix Is Deliberate Conflict

If resonance is the problem — both memory sources agreeing, amplifying confidence — then the fix is to deliberately break the agreement. I call this anti-resonance.

When the retrieval system returns 5 winning VB trades, you force-inject at least 1 losing trade into the recall. Now the agent's prompt contains a contradiction:

  • Parametric memory: "Breakout strategies work."
  • External memory (4 wins): "Yes, they usually work here."
  • External memory (1 loss): "But sometimes they fail catastrophically."

The agent is forced to reconcile contradictory evidence instead of rubber-stamping a pre-existing belief. This is genuine reasoning — weighing competing signals, calibrating confidence, deciding whether this setup looks more like the 4 wins or the 1 loss. Without the injected loss, there's nothing to reason about.

The concept has precedents at other abstraction levels. Du et al. (ICML 2024) showed multi-agent debate improves factuality through conflicting positions. De Jong et al. (CSCW 2025) explored LLMs as "epistemic provocateurs" — challenging positions to reduce human confirmation bias. But nobody had applied deliberate conflict at the retrieval level — constructing recall results that contradict the model's parametric bias.

ensure_negative_balance: The Engineering Contribution

I implemented anti-resonance as a single, generic function:

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")

def ensure_negative_balance(
    top: List[T],
    all_candidates: List[T],
    is_negative: Callable[[T], bool],
    min_negative_ratio: float = 0.20,
    score_key: Callable[[T], float] = lambda x: getattr(x, "relevance_score", 0.0),
) -> List[T]:
    ...
```

The mechanism is post-retrieval: normal relevance ranking happens first, preserving the quality of similarity matching. Then a hard constraint is applied — at least ceil(K × min_negative_ratio) of the top-K results must be negative outcomes. If there aren't enough negatives, the lowest-scored positives get swapped out for the highest-scored negatives from the full candidate pool.
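The swap logic described above can be sketched as follows. This is my reconstruction of the mechanism under the stated constraint, not the released TradeMemory source:

```python
import math
from typing import Callable, List, TypeVar

T = TypeVar("T")

def ensure_negative_balance(
    top: List[T],
    all_candidates: List[T],
    is_negative: Callable[[T], bool],
    min_negative_ratio: float = 0.20,
    score_key: Callable[[T], float] = lambda x: getattr(x, "relevance_score", 0.0),
) -> List[T]:
    """Guarantee at least ceil(K * min_negative_ratio) negatives in the top-K.
    Sketch of the described post-retrieval constraint, not the actual source."""
    k = len(top)
    need = math.ceil(k * min_negative_ratio)
    have = sum(1 for t in top if is_negative(t))
    if have >= need:
        return top  # constraint already satisfied; leave ranking untouched
    # Highest-scored negatives from the pool that aren't already in the top-K
    spare = sorted((c for c in all_candidates if is_negative(c) and c not in top),
                   key=score_key, reverse=True)
    result = list(top)
    # Swap the lowest-scored positives out for the best available negatives
    positives = sorted((t for t in result if not is_negative(t)), key=score_key)
    for neg, pos in zip(spare[:need - have], positives):
        result[result.index(pos)] = neg
    return result
```

With K=5 and a ratio of 0.20, `need` is ceil(1.0) = 1: an all-positive top-5 loses its weakest match to the strongest negative in the candidate pool, and a top-5 that already contains a loss is returned unchanged.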

The key abstraction is the is_negative predicate. It decouples the balance mechanism from any specific domain:

```python
# Trading: losses
ensure_negative_balance(top, pool, is_negative=lambda t: t.pnl < 0)

# Customer service: bad outcomes
ensure_negative_balance(top, pool, is_negative=lambda t: t.satisfaction < 3)

# Code review: failed builds
ensure_negative_balance(top, pool, is_negative=lambda c: not c.test_passed)
```

This is domain-agnostic anti-resonance. Any system that stores outcomes, retrieves by similarity, and feeds context into an LLM with parametric knowledge will produce resonance when retrieved outcomes align with parametric beliefs. The specific domain doesn't matter.

Validation: It Works

After integrating the hybrid recall (with min_negative_ratio=0.20) into the engine, I ran a 200-decision validation — same data window, same model, new recall path:

| Metric | Old Recall (memory hurts) | Hybrid Recall (fixed) |
|---|---|---|
| Decisions | 200 | 200 |
| Trades | 2 | 1 |
| IM trades | 1 (appeared) | 0 (eliminated) |
| PnL | -$154 | +$29 |
| Memory recalls triggered | 41 | 200 |

IntradayMomentum — the strategy that only appeared with memory and caused -$437 in losses across the full run — was completely eliminated. The single trade was a clean VB winner. All 200 decisions triggered memory recall (compared to only 41 in the old version, which had a retrieval threshold that filtered out most queries), confirming the pipeline was fully operational.

The loss balance mechanism did exactly what it was designed to do: it didn't change the retrieval algorithm, didn't modify the scoring weights, didn't retrain anything. It just guaranteed that the agent would see at least one counterexample before making a decision. That single counterexample was enough to break the resonance loop and restore calibrated behavior.

Why This Matters Beyond Trading

Every LLM agent memory system has this problem. Any architecture that:

  1. Stores outcomes (positive and negative)
  2. Retrieves by similarity
  3. Feeds retrieved context into an LLM with parametric knowledge

...will produce resonance when retrieved outcomes align with parametric beliefs. Consider:

  • Customer service agent: Retrieves 5 similar tickets, all resolved successfully → overconfident in a case that actually needs escalation.
  • Code review agent: Retrieves 5 similar PRs, all passed tests → misses a subtle bug pattern.
  • Medical triage agent: Retrieves 5 similar cases, all benign → misses a rare but serious condition.

The positive bias isn't in the data — it's in the geometry of retrieval. And the LLM's confirmation bias turns that geometric artifact into a confidence amplifier.

The Model-Dependent Twist

There's one more finding worth highlighting. The severity of resonance depends on the model's parametric confidence — and the interaction is nonlinear.

Haiku (weak parametric beliefs, fast System 1) produced noise regardless of memory. It was already making bad decisions; memory didn't make them worse because there was no coherent signal to amplify.

Sonnet (calibrated beliefs, deliberate System 2) was precisely where resonance struck hardest. It had accurate enough beliefs to trade well, and the retrieval bias pushed it past calibration into overconfidence.

DeepSeek (overthinking, paralyzed System 2) was immune to resonance because it never traded at all. You can't amplify a decision that doesn't get made.

This means memory hurts most for the best-calibrated models — exactly the ones you'd want to give memory to. The relationship between model quality and memory benefit isn't monotonic. It has a danger zone at the exact performance level where you'd deploy an agent in production.

Existing literature has studied model size vs. trading performance, and memory vs. trading performance, but never the interaction. This is, as far as I can tell from an extensive prior art search, the first empirical demonstration of the model × memory interaction effect.

What I Learned

Two days, $73 in API costs, 6,836 decisions, 40 trades, and one genuinely surprising finding:

The most dangerous thing you can do to a well-calibrated LLM agent is give it memory that confirms what it already believes.

Not wrong memories. Not hallucinated memories. Accurate, relevant, correctly-retrieved memories that happen to be biased toward positive outcomes because of the geometry of similarity search. The retrieval system works perfectly. The LLM reasons coherently. And the combination produces worse decisions than no memory at all.

The fix isn't better embeddings or smarter retrieval scoring. It's a structural intervention: guarantee that recall results contain enough negative outcomes to create tension with the model's parametric beliefs. Force the agent to reason about contradictory evidence instead of confirming what it already thinks.

I've open-sourced ensure_negative_balance as part of TradeMemory. It's 40 lines of Python. It took two days to discover why it was needed, and 30 minutes to build.

The resonance problem is hiding in every RAG pipeline that feeds results into an LLM. The question is whether you'll notice before your agent gets confident enough to act on it.


All data in this article comes from actual experimental runs on XAUUSD M15 bars (Jan 2024 – Mar 2026). No results are simulated or cherry-picked. The full material pack, including trade logs, prompt comparisons, and prior art analysis, is available in the project repository.

