<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sean  |   Mnemox</title>
    <description>The latest articles on DEV Community by Sean  |   Mnemox (@mnemox).</description>
    <link>https://dev.to/mnemox</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3786702%2Fbf4cad1f-0803-4d73-8e52-a2549df97374.png</url>
      <title>DEV Community: Sean  |   Mnemox</title>
      <link>https://dev.to/mnemox</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mnemox"/>
    <language>en</language>
    <item>
      <title>I Let AI Invent Its Own Trading Strategies From Scratch. Here's What Happened.</title>
      <dc:creator>Sean  |   Mnemox</dc:creator>
      <pubDate>Mon, 16 Mar 2026 17:31:53 +0000</pubDate>
      <link>https://dev.to/mnemox/i-let-ai-invent-its-own-trading-strategies-from-scratch-heres-what-happened-1e2g</link>
      <guid>https://dev.to/mnemox/i-let-ai-invent-its-own-trading-strategies-from-scratch-heres-what-happened-1e2g</guid>
      <description>&lt;p&gt;&lt;em&gt;By Sean, CEO of Mnemox AI | March 2026&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Every AI trading bot has the same fatal flaw: amnesia.&lt;/p&gt;

&lt;p&gt;There are 200+ trading MCP servers on GitHub right now. They can execute trades, pull market data, calculate indicators. But not a single one remembers what happened yesterday. Every session starts from zero. Every mistake gets repeated. Every lesson gets lost.&lt;/p&gt;

&lt;p&gt;I spent two days running an experiment to fix this — and ended up discovering something I didn't expect at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Deeper Question
&lt;/h2&gt;

&lt;p&gt;The memory problem is real, but it's actually the &lt;em&gt;second&lt;/em&gt; problem. The first one is more fundamental: why are we teaching AI how to trade at all?&lt;/p&gt;

&lt;p&gt;Think about it. Every trading bot — from simple moving average crossovers to sophisticated ML systems — starts with a human saying "here's a strategy, go execute it." The human does the thinking. The AI does the labor. And when the strategy stops working (which it always does), the human has to go back, analyze what went wrong, redesign the strategy, and re-deploy.&lt;/p&gt;

&lt;p&gt;What if we skipped the human part entirely?&lt;/p&gt;

&lt;p&gt;Not "use machine learning to optimize parameters." I mean: give AI raw price data, give it persistent memory, give it &lt;em&gt;no&lt;/em&gt; strategies whatsoever, and see if it can invent its own from scratch.&lt;/p&gt;

&lt;p&gt;The idea isn't new. Google's AlphaEvolve uses evolutionary algorithms to discover novel solutions. The Ouroboros paper explored self-modifying agents. AZR (Absolute Zero Reasoner) showed that AI can bootstrap its own training data. DGM proposed Darwinian selection for agent populations. But nobody had applied this loop — observe, hypothesize, test, eliminate, evolve — to trading with persistent memory across sessions.&lt;/p&gt;

&lt;p&gt;My hypothesis: &lt;strong&gt;an AI with memory and the freedom to fail will converge on real market structure faster than any hand-coded strategy.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The $0 Experiment
&lt;/h2&gt;

&lt;p&gt;I started with the cheapest possible test — no trading capital, just API calls. Three months of BTC/USDT hourly candles (2,184 bars, December 2025 to March 2026). A bear market — BTC dropped 16% during this period.&lt;/p&gt;

&lt;p&gt;I fed this raw data to Claude with a single instruction: &lt;em&gt;"You don't know any technical indicators. Describe what you see in your own words."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;No RSI. No MACD. No Bollinger Bands. Just price, volume, open, high, low, close.&lt;/p&gt;
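&lt;p&gt;The post doesn't show its ingestion code, but the shape of the input matters: six fields per bar and nothing else. A minimal sketch (the function name is mine) converting Binance-style kline rows into those bare records:&lt;/p&gt;

```python
# Hypothetical sketch: the actual ingestion code isn't shown in the post.
# Binance-style kline rows are plain arrays; this turns them into the bare
# OHLCV records the model sees. No indicators attached.
def klines_to_ohlcv(rows):
    """Convert raw kline arrays [open_time, open, high, low, close, volume, ...]
    into plain dicts: just price and volume, nothing derived."""
    records = []
    for row in rows:
        records.append({
            "time": int(row[0]),
            "open": float(row[1]),
            "high": float(row[2]),
            "low": float(row[3]),
            "close": float(row[4]),
            "volume": float(row[5]),
        })
    return records

sample = [[1733011200000, "95000.1", "95400.0", "94800.5", "95210.3", "1234.5"]]
print(klines_to_ohlcv(sample)[0]["close"])  # 95210.3
```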

&lt;p&gt;It came back with seven patterns, each with its own name:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Breathing&lt;/strong&gt; (呼吸) — periodic expansion/contraction cycles&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Giant Wave&lt;/strong&gt; (巨浪) — outsized candles that appear at turning points&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Staircase&lt;/strong&gt; (階梯) — sequential directional moves&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fake Door&lt;/strong&gt; (假門) — false breakouts that reverse&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exhaustion&lt;/strong&gt; (枯竭) — declining momentum at trend ends&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tide&lt;/strong&gt; (潮汐) — time-of-day price flow patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Echo&lt;/strong&gt; (回聲) — price returning to prior levels&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What made this interesting wasn't the patterns themselves — experienced traders would recognize most of these. What was interesting was what the AI did next: it scored each pattern for tradability and killed the weak ones. Staircase got 3/10. Fake Door got 4/10. Gone.&lt;/p&gt;

&lt;p&gt;Nobody told it to do this. The prompt didn't mention anything about scoring or elimination. It just... decided some patterns weren't worth pursuing.&lt;/p&gt;

&lt;p&gt;Then it combined the surviving patterns into a trading strategy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Round 1: Failure
&lt;/h2&gt;

&lt;p&gt;The AI's first strategy was called "Giant Wave Reversal" (巨浪逆行): when an abnormally large candle appears, trade in the opposite direction.&lt;/p&gt;

&lt;p&gt;Intuitively, this makes sense. After a big move, you'd expect a pullback. Hundreds of retail traders trade this exact pattern.&lt;/p&gt;

&lt;p&gt;The backtest results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Trades&lt;/td&gt;
&lt;td&gt;39&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Win Rate&lt;/td&gt;
&lt;td&gt;30.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sharpe Ratio&lt;/td&gt;
&lt;td&gt;-1.20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Return&lt;/td&gt;
&lt;td&gt;-0.21%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Terrible. The strategy lost money.&lt;/p&gt;

&lt;p&gt;But here's what matters: the system didn't just fail — it &lt;em&gt;analyzed&lt;/em&gt; why it failed. Three specific causes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Momentum continuation&lt;/strong&gt; — big candles often signal the &lt;em&gt;start&lt;/em&gt; of a trend, not the end&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stop loss structure&lt;/strong&gt; — fixed-point stops were too tight for the volatility&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Counter-trend bias&lt;/strong&gt; — fighting the trend is statistically unfavorable&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No human provided this analysis. The AI looked at its own results, examined the losing trades, and identified structural flaws.&lt;/p&gt;

&lt;h2&gt;
  
  
  Round 2: Evolution
&lt;/h2&gt;

&lt;p&gt;I fed the failure analysis back into the system with the same raw data. "You tried counter-trend. It failed for these reasons. Look at the data again."&lt;/p&gt;

&lt;p&gt;This time, three candidate strategies emerged:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Trades&lt;/th&gt;
&lt;th&gt;Win Rate&lt;/th&gt;
&lt;th&gt;Sharpe&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;A: Ceiling Rejection&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;50%&lt;/td&gt;
&lt;td&gt;0.74&lt;/td&gt;
&lt;td&gt;Sample too small&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;B: Trend Momentum&lt;/td&gt;
&lt;td&gt;67&lt;/td&gt;
&lt;td&gt;35.8%&lt;/td&gt;
&lt;td&gt;-1.40&lt;/td&gt;
&lt;td&gt;Eliminated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;C: US Session Drain&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;21&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;47.6%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.90&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Survived&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Strategy C — which the AI named "美盤洩洪" (US Session Drain) — was a breakthrough. The rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Entry&lt;/strong&gt;: 16:00 UTC, when the 12-hour trend is down → go short&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exit&lt;/strong&gt;: Take profit at +0.5%, stop loss at -0.25%, max hold 6 hours&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Risk/Reward&lt;/strong&gt;: 2:1&lt;/li&gt;
&lt;/ul&gt;
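&lt;p&gt;The published rules are simple enough to express in a few lines. This is my own restatement of them, not the repo's code; the entry hour and direction are parameters, so the same template also covers the long variant that appears later:&lt;/p&gt;

```python
# Hedged sketch of the session-strategy rules described above.
# The helper names are mine; the numbers are the published rules.
def session_signal(hour_utc, close_now, close_12h_ago, entry_hour, direction):
    """Return True when the time window and the 12-hour trend filter agree.
    Strategy C uses entry_hour=16 with direction='short'."""
    if hour_utc != entry_hour:
        return False
    trend_up = close_now > close_12h_ago
    if direction == "long":
        return trend_up
    return not trend_up

def exit_levels(entry_price, direction):
    """TP +0.5%, SL -0.25% relative to entry: a 2:1 reward/risk structure."""
    if direction == "long":
        return entry_price * 1.005, entry_price * 0.9975
    return entry_price * 0.995, entry_price * 1.0025

# 16:00 UTC, 12-hour trend down, so the short signal fires:
print(session_signal(16, 95000.0, 96000.0, 16, "short"))  # True
```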

&lt;p&gt;Sharpe went from -1.20 to 1.90 in a single evolutionary cycle.&lt;/p&gt;

&lt;p&gt;But any quant will tell you: in-sample results mean nothing. You can curve-fit garbage to look profitable on historical data. The real test is out-of-sample.&lt;/p&gt;

&lt;h3&gt;
  
  
  Out-of-Sample Validation
&lt;/h3&gt;

&lt;p&gt;I ran Strategy C on a completely different 3-month period (August to November 2025) that the AI had never seen:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;In-Sample&lt;/th&gt;
&lt;th&gt;Out-of-Sample&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Trades&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;27&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Win Rate&lt;/td&gt;
&lt;td&gt;47.6%&lt;/td&gt;
&lt;td&gt;59.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sharpe&lt;/td&gt;
&lt;td&gt;1.90&lt;/td&gt;
&lt;td&gt;4.09&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Profit Factor&lt;/td&gt;
&lt;td&gt;1.53&lt;/td&gt;
&lt;td&gt;2.25&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The out-of-sample results were &lt;em&gt;better&lt;/em&gt; than in-sample. Every metric improved. This is the opposite of overfitting — it suggests the strategy captured a genuine market structure, not noise.&lt;/p&gt;

&lt;h2&gt;
  
  
  Can It Work in Bull Markets Too?
&lt;/h2&gt;

&lt;p&gt;One strategy in one market regime proves nothing. So I ran the same process on bull market data: BTC going from $60K to $105K over four months (October 2024 to January 2025).&lt;/p&gt;

&lt;p&gt;Same rules: raw data, no indicators, no guidance. Just "look and learn."&lt;/p&gt;

&lt;p&gt;The AI discovered different patterns this time — waterfalls, valley springs, Asian fountains. But one stood out: &lt;strong&gt;Afternoon Engine&lt;/strong&gt; (午後引擎). At 14:00 UTC, something happens. The 14:00 candle alone accumulated +14.9% over the test period, far more than any other hour.&lt;/p&gt;


&lt;p&gt;Strategy E's rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Entry&lt;/strong&gt;: 14:00 UTC, when the 12-hour trend is up → go long&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exit&lt;/strong&gt;: TP +0.5%, SL -0.25%, max hold 6 hours&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Risk/Reward&lt;/strong&gt;: 2:1&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;First-round results: &lt;strong&gt;70 trades, 50% win rate, Sharpe 4.97.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It didn't need a second round. The bull market has stronger structural bias, so the AI hit on the first try.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Surprising Part
&lt;/h3&gt;

&lt;p&gt;I validated Strategy E on a &lt;em&gt;downtrending&lt;/em&gt; market (June to September 2024, BTC -6.2%). The 14:00 UTC hour actually &lt;em&gt;lost&lt;/em&gt; money during this period (-5.84% cumulative). The raw time-of-day edge disappeared.&lt;/p&gt;

&lt;p&gt;But Strategy E still profited: &lt;strong&gt;57 trades, 56.1% win rate, Sharpe 6.06.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Why? Because the 12-hour trend filter blocked almost all counter-trend signals. The edge isn't "trade at 14:00 UTC." The edge is "trade at 14:00 UTC &lt;em&gt;when the trend agrees&lt;/em&gt;." The trend filter is the alpha source, not the time window.&lt;/p&gt;

&lt;p&gt;(A Sharpe above 6 looks suspicious — and it should. The number is inflated by ultra-short holding periods and the 2:1 RR structure filtering out most losing scenarios. It's directionally meaningful, not a production-grade Sharpe. Take it as "this works" rather than "this is a 6-Sharpe strategy.")&lt;/p&gt;

&lt;p&gt;The AI figured this out without being told. It didn't just discover a correlation — it discovered the &lt;em&gt;mechanism&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Meta-Pattern
&lt;/h2&gt;

&lt;p&gt;Here's where it gets genuinely interesting.&lt;/p&gt;

&lt;p&gt;Strategy C and Strategy E were invented independently, from different datasets, in different market regimes (bear vs. bull). Yet they converged on the same structural template:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Time-of-day bias&lt;/strong&gt; — specific UTC hours carry persistent directional edge&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trend filter&lt;/strong&gt; — 12-hour trend confirmation before entry&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Short holding period&lt;/strong&gt; — max 6 hours, in-and-out&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Asymmetric risk/reward&lt;/strong&gt; — a 2:1 TP/SL yields positive expectancy at a 50% win rate, before fees&lt;/li&gt;
&lt;/ol&gt;
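&lt;p&gt;Point 4 is just arithmetic, and worth spelling out: with TP +0.5% and SL -0.25%, a 50% win rate nets +0.125% per trade before fees, and break-even sits near a 33% win rate.&lt;/p&gt;

```python
# Checking the expectancy claim behind the 2:1 TP/SL structure.
# Fees and slippage are ignored, as in the rough claim above.
def expectancy(win_rate, tp_pct, sl_pct):
    """Expected return per trade, in percent."""
    return win_rate * tp_pct - (1.0 - win_rate) * sl_pct

print(expectancy(0.50, 0.5, 0.25))   # 0.125, i.e. +0.125% per trade
print(0.25 / (0.5 + 0.25))           # ~0.333: the break-even win rate
```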

&lt;p&gt;This meta-pattern was not programmed. It was not suggested. It emerged from two independent evolution cycles. When two completely separate experiments converge on the same solution, that's strong evidence of underlying structure.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Combined System
&lt;/h2&gt;

&lt;p&gt;Running both strategies together over 22 months (June 2024 to March 2026), spanning a complete bull-to-bear cycle:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;System&lt;/th&gt;
&lt;th&gt;Trades&lt;/th&gt;
&lt;th&gt;Win Rate&lt;/th&gt;
&lt;th&gt;Sharpe&lt;/th&gt;
&lt;th&gt;Return&lt;/th&gt;
&lt;th&gt;Max Drawdown&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;C Only (SHORT)&lt;/td&gt;
&lt;td&gt;157&lt;/td&gt;
&lt;td&gt;42.7%&lt;/td&gt;
&lt;td&gt;0.70&lt;/td&gt;
&lt;td&gt;+0.37%&lt;/td&gt;
&lt;td&gt;0.45%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;E Only (LONG)&lt;/td&gt;
&lt;td&gt;320&lt;/td&gt;
&lt;td&gt;49.4%&lt;/td&gt;
&lt;td&gt;4.10&lt;/td&gt;
&lt;td&gt;+3.65%&lt;/td&gt;
&lt;td&gt;0.27%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;C+E Combined&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;477&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;47.2%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.84&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+4.04%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.22%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Key findings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;91% of months were profitable&lt;/strong&gt; (20 out of 22)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Max drawdown 0.22%&lt;/strong&gt; — lower than either strategy alone (natural hedging)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No human-designed entry logic.&lt;/strong&gt; The AI chose which hours to trade and which direction. The framework — 2:1 RR, 6-hour max hold, ATR-based stops — was provided by the backtest engine. The &lt;em&gt;what&lt;/em&gt; and &lt;em&gt;when&lt;/em&gt; came from the AI; the &lt;em&gt;risk management structure&lt;/em&gt; came from me&lt;/li&gt;
&lt;li&gt;Strategy E is the engine (90% of profit). Strategy C is a diversifier&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The long/short combination creates a natural hedge. When the market trends up, E captures profits going long. When it trends down, C captures profits going short. Drawdown &lt;em&gt;improves&lt;/em&gt; when combined.&lt;/p&gt;

&lt;h2&gt;
  
  
  From Experiment to Product
&lt;/h2&gt;

&lt;p&gt;The manual process — give AI data, analyze patterns, backtest, evolve — took about a day of hands-on work per strategy. Interesting as a research exercise, but not scalable.&lt;/p&gt;

&lt;p&gt;So I automated the entire loop into what I call the &lt;strong&gt;Evolution Engine&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Discover&lt;/strong&gt; — LLM analyzes raw price data, proposes candidate strategies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backtest&lt;/strong&gt; — vectorized engine tests each candidate (ATR-based stops, long/short, time-based exit)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Select&lt;/strong&gt; — in-sample ranking, then out-of-sample validation (Sharpe &amp;gt; 1.0, trades &amp;gt; 30, max DD &amp;lt; 20%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evolve&lt;/strong&gt; — survivors get mutated, failures go to the graveyard (but their lessons persist). Next generation. Repeat.&lt;/li&gt;
&lt;/ol&gt;
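&lt;p&gt;In code, the loop looks roughly like this. Everything below is a toy stand-in (the real engine is in the repo); only the selection gates are taken from the list above:&lt;/p&gt;

```python
import random
from dataclasses import dataclass

# Toy stand-in for the Discover / Backtest / Select / Evolve loop.
# The Sharpe, trade-count, and drawdown gates come from the post;
# everything else here is a placeholder, not the Evolution Engine API.

@dataclass
class Result:
    sharpe: float
    trades: int
    max_dd_pct: float

def discover(n, rng):
    """Stand-in for the LLM discovery step: propose n candidate strategies."""
    return [rng.random() for _ in range(n)]

def backtest(strategy, rng):
    """Stand-in for the vectorized backtest engine."""
    return Result(sharpe=rng.gauss(strategy, 1.0),
                  trades=rng.randint(5, 80),
                  max_dd_pct=rng.uniform(1.0, 30.0))

def mutate(strategy, rng):
    """Survivors get perturbed into the next generation."""
    return strategy + rng.gauss(0.0, 0.1)

def evolve(generations=3, seed=42):
    rng = random.Random(seed)
    graveyard = []                          # failures persist as lessons
    population = discover(8, rng)
    for _ in range(generations):
        survivors = []
        for strat in population:
            oos = backtest(strat, rng)      # out-of-sample gate:
            if oos.sharpe > 1.0 and oos.trades > 30 and 20.0 > oos.max_dd_pct:
                survivors.append(strat)
            else:
                graveyard.append((strat, oos))
        population = [mutate(s, rng) for s in survivors] or discover(8, rng)
    return population, graveyard

population, graveyard = evolve()
print(len(graveyard))   # eliminated candidates accumulate across generations
```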

&lt;p&gt;The Evolution Engine runs on top of &lt;strong&gt;Outcome-Weighted Memory (OWM)&lt;/strong&gt; — a five-layer memory architecture (episodic, semantic, procedural, prospective, affective) that gives the AI persistent recall across sessions. Each memory gets scored by outcome quality, context similarity, and recency when recalled — inspired by ACT-R cognitive architecture and Kelly criterion. The details are in the repo if you're curious; the key point is that the AI doesn't just remember &lt;em&gt;what&lt;/em&gt; happened, it remembers &lt;em&gt;how relevant&lt;/em&gt; each memory is to the current situation.&lt;/p&gt;
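&lt;p&gt;To make the recall scoring concrete, here is a hedged sketch of outcome-weighted recall. The weights and half-life are illustrative choices of mine, not OWM's actual defaults:&lt;/p&gt;

```python
import math

# Illustrative sketch of outcome-weighted recall: each stored memory is scored
# by outcome quality, context similarity, and recency. The 0.4/0.4/0.2 weights
# and the 72-hour half-life are assumptions, not the OWM defaults.
def recall_score(outcome, similarity, age_hours, half_life_hours=72.0):
    """outcome and similarity in [0, 1]; newer, better-outcome,
    more-similar memories rank higher at recall time."""
    recency = math.exp(-math.log(2.0) * age_hours / half_life_hours)
    return 0.4 * outcome + 0.4 * similarity + 0.2 * recency

# A profitable, closely matching, recent memory outranks a stale near-miss:
print(recall_score(0.9, 0.8, age_hours=2) > recall_score(0.5, 0.9, age_hours=500))  # True
```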

&lt;h3&gt;
  
  
  Model Comparison
&lt;/h3&gt;

&lt;p&gt;I ran the automated pipeline with three Claude models on real Binance data:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Cost/Run&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;th&gt;Strategies Graduated&lt;/th&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Haiku&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.016&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;34.7s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Best so far&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sonnet&lt;/td&gt;
&lt;td&gt;$0.013&lt;/td&gt;
&lt;td&gt;51.9s&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Solid&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Opus&lt;/td&gt;
&lt;td&gt;$0.013&lt;/td&gt;
&lt;td&gt;72.4s&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Slowest&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Caveat: this is a small sample — a handful of runs per model. But the early signal is counterintuitive: the cheapest, fastest model produced the most graduated strategies. My working theory is that speed and diversity matter more than depth of reasoning for creative pattern discovery. A full evolution cycle costs less than two cents.&lt;/p&gt;

&lt;p&gt;The most compelling finding: the automated pipeline independently rediscovered 16:00 UTC as a key trading hour — the same edge that the manual experiments found. Convergent validation from a completely different process.&lt;/p&gt;

&lt;h3&gt;
  
  
  Known Bottlenecks
&lt;/h3&gt;

&lt;p&gt;The system isn't perfect. Two issues I'm actively working on:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Prompt over-concretization&lt;/strong&gt; — all three models tend to lock onto very specific conditions (e.g., "hour_utc == 16 AND atr &amp;gt; 2.5"). This produces strategies that trigger too rarely for statistical significance. The graduated strategies had only 2 trades in out-of-sample, far below the 30-trade minimum for confidence.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Graveyard feedback depth&lt;/strong&gt; — eliminated strategies get stored, but the feedback loop from graveyard → next generation isn't rich enough yet. The AI knows &lt;em&gt;that&lt;/em&gt; a strategy failed, but doesn't fully leverage &lt;em&gt;why&lt;/em&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. AI doesn't need to be taught strategies. It needs memory and permission to fail.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The biggest bottleneck in AI trading isn't model capability — it's the assumption that humans must provide the strategy. Give the AI raw data and a feedback loop, and it finds structure faster than any hand-designed system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Objective feedback (P&amp;amp;L) beats prompt engineering.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I tried various prompt strategies for pattern discovery. None of them mattered as much as simply feeding back the backtest results. "Return -0.21%, Sharpe -1.20" is more useful than ten paragraphs of trading wisdom.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The speed of evolution depends on the quality of failure, not the quantity of success.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Strategy C only exists because Strategy "Giant Wave Reversal" failed spectacularly and the AI could analyze &lt;em&gt;why&lt;/em&gt;. A clean failure with clear attribution is more valuable than a marginal success.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Meta-patterns are the real prize.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Individual strategies are nice. But the discovery that two independent evolution cycles converged on the same structural template (time bias + trend filter + short hold + asymmetric RR) — that's worth more than any single strategy. It suggests a universal regularity in how markets behave.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. One person + Claude Code can go from hypothesis to working product in a day.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The entire pipeline — research, backtest, analysis, Evolution Engine code, OWM memory architecture, 1,055 tests, MCP server, open source release — was built in 48 hours by one person with an AI coding assistant. That's the part I still have trouble believing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;TradeMemory Protocol is open source. The Evolution Engine, OWM memory architecture, and all 11 experiments documented in &lt;code&gt;RESEARCH_LOG.md&lt;/code&gt; are available today.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;tradememory-protocol
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;I'm not claiming this is a finished product. The over-concretization problem is real. The automated pipeline needs more diverse hypothesis generation. But the core insight — that AI can discover its own trading strategies through evolutionary memory — is validated.&lt;/p&gt;

&lt;p&gt;If you're building AI agents that make decisions in uncertain environments, the memory problem is yours too. Trading is just the most measurable version of it.&lt;/p&gt;

&lt;p&gt;Want to poke holes in the methodology? The full research log with every backtest number, every eliminated strategy, and every failed hypothesis is public. I'd rather get useful criticism now than discover blind spots later.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Update (2026-03-17):&lt;/strong&gt; Ran statistical validation against 1,000 random strategies. Both Strategy C (P96.9%) and E (P100%) beat the 95th percentile. &lt;a href="https://github.com/mnemox-ai/tradememory-protocol/blob/master/VALIDATION_RESULTS.md" rel="noopener noreferrer"&gt;Full results&lt;/a&gt;.&lt;/p&gt;
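&lt;p&gt;For readers who want to replicate the spirit of that check, a toy version: score a candidate Sharpe against a population of random strategies and report the percentile. The random-strategy model below is a placeholder, not the repo's generator:&lt;/p&gt;

```python
import random

# Toy percentile check in the spirit of the validation described above.
# Real random strategies should be backtested on the same data; here the
# null distribution is simply simulated Sharpe values.
def percentile_rank(candidate_sharpe, random_sharpes):
    """Percentage of the random population the candidate beats."""
    beaten = sum(1 for s in random_sharpes if candidate_sharpe > s)
    return 100.0 * beaten / len(random_sharpes)

rng = random.Random(0)
random_sharpes = [rng.gauss(0.0, 1.0) for _ in range(1000)]
print(percentile_rank(1.90, random_sharpes))   # a Strategy C-like Sharpe vs the null
```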




&lt;p&gt;&lt;em&gt;TradeMemory Protocol: &lt;a href="https://github.com/mnemox-ai/tradememory-protocol" rel="noopener noreferrer"&gt;github.com/mnemox-ai/tradememory-protocol&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Full research data: &lt;a href="https://github.com/mnemox-ai/tradememory-protocol/blob/master/RESEARCH_LOG.md" rel="noopener noreferrer"&gt;RESEARCH_LOG.md&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Questions, feedback, or want to run your own evolution experiment? Open an issue on GitHub or find me on &lt;a href="https://mnemox.ai" rel="noopener noreferrer"&gt;Mnemox AI&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>opensource</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>I Built an AI That Tells You If Your Idea Already Exists — And Syncs Results to Notion</title>
      <dc:creator>Sean  |   Mnemox</dc:creator>
      <pubDate>Sat, 14 Mar 2026 10:59:41 +0000</pubDate>
      <link>https://dev.to/mnemox/i-built-an-ai-that-tells-you-if-your-idea-already-exists-and-syncs-results-to-notion-3pba</link>
      <guid>https://dev.to/mnemox/i-built-an-ai-that-tells-you-if-your-idea-already-exists-and-syncs-results-to-notion-3pba</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/notion-2026-03-04"&gt;Notion MCP Challenge&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Idea Reality Tracker&lt;/strong&gt; — a dual-MCP pipeline that validates software ideas against 5 live platforms and automatically syncs structured results to a Notion database.&lt;/p&gt;

&lt;p&gt;Instead of googling "has anyone built this?" and drowning in 10 tabs of noise, you describe your idea in one sentence. In 15 seconds you get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;Reality Score&lt;/strong&gt; (0–100) measuring how crowded the space is&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Market momentum&lt;/strong&gt; analysis (accelerating / stable / declining)&lt;/li&gt;
&lt;li&gt;Competitor counts from GitHub, npm, PyPI, Hacker News, and Product Hunt&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;Build / Pivot / Kill&lt;/strong&gt; recommendation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All automatically saved to your Notion workspace as a searchable decision log.&lt;/p&gt;
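&lt;p&gt;To illustrate the shape of the computation (the real formula lives in idea-reality-mcp, and this is not it), a toy score that turns per-platform competitor counts into a 0-100 crowding number:&lt;/p&gt;

```python
# Illustrative only: not the actual Reality Score formula.
# Total competitor count across platforms saturates at a cap, so a handful
# of hits scores low and a crowded space pins near 100.
def reality_score(counts, saturation=50):
    """counts: dict mapping platform name to number of similar projects found."""
    total = sum(counts.values())
    return round(100 * min(total, saturation) / saturation)

counts = {"github": 18, "npm": 3, "pypi": 2, "hackernews": 9, "producthunt": 1}
print(reality_score(counts))  # 66
```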

&lt;h3&gt;
  
  
  The Story Behind It
&lt;/h3&gt;

&lt;p&gt;Six months ago, I asked ChatGPT if my idea for an AI trading memory system was original. It said &lt;em&gt;"This is a unique and innovative concept!"&lt;/em&gt; I believed it and spent weeks building.&lt;/p&gt;

&lt;p&gt;Then I built &lt;a href="https://github.com/mnemox-ai/idea-reality-mcp" rel="noopener noreferrer"&gt;idea-reality-mcp&lt;/a&gt; — a tool that scans actual platforms instead of relying on LLM knowledge. I ran my own idea through it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Score: 93. Momentum: Accelerating. Competitors: Mem0, FinMem, and dozens more.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That reality check forced me to pivot. Instead of building "yet another memory layer," I focused on what was actually different about my approach. That led me to a structural flaw I call Parametric-External Memory Resonance: when your RAG pipeline retrieves results that are too similar to what the LLM already believes, the model becomes overconfident and stops reasoning critically.&lt;/p&gt;

&lt;p&gt;The tool that checked my idea ended up being more valuable than the idea itself.&lt;/p&gt;

&lt;p&gt;Now every idea I consider goes through this pipeline, and results accumulate in my Notion workspace as a decision log — a searchable history of what I've validated, what I've killed, and why.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Makes This Different
&lt;/h3&gt;

&lt;p&gt;Most MCP integrations do one thing: read from a service, or write to it. This is a &lt;strong&gt;dual-MCP pipeline&lt;/strong&gt; where two independent tools collaborate through Claude to create something neither could do alone:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;idea-reality-mcp knows how to scan markets but has no persistence&lt;/li&gt;
&lt;li&gt;Notion MCP knows how to create structured pages but has no market intelligence&lt;/li&gt;
&lt;li&gt;Together, they create a persistent idea validation pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And unlike ChatGPT telling you "great idea!", this tool checks reality — with numbers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Video Demo
&lt;/h2&gt;

&lt;p&gt;No video this time; the screenshots below walk through the full end-to-end workflow.&lt;/p&gt;

&lt;p&gt;Here's a real validation session. I asked Claude to check "AI tool that generates unit tests from code comments":&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foj448gq661glyqotls5l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foj448gq661glyqotls5l.png" alt="Claude Desktop checking an idea and saving results to Notion" width="800" height="645"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The result: &lt;strong&gt;Reality Score 38/100&lt;/strong&gt; — medium duplicate likelihood. There's community buzz (47 HN discussions) but no dominant open-source solution yet. Claude recommended focusing on a specific workflow (e.g., JSDoc → Jest) rather than a generic solution, and saved everything to Notion with status "Checked."&lt;/p&gt;

&lt;p&gt;Here's what the Notion dashboard looks like after validating several ideas:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftmzw5rs31u08a9zxulqf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftmzw5rs31u08a9zxulqf.png" alt="Notion Board View showing ideas grouped by Build, Kill, and Pivot" width="800" height="299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each column represents a decision:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Build&lt;/strong&gt; (green) — low competition, go for it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kill&lt;/strong&gt; (red) — too crowded, move on&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pivot&lt;/strong&gt; (yellow) — opportunity exists but needs a different angle&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Show us the code
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/mnemox-ai/idea-reality-mcp" rel="noopener noreferrer"&gt;mnemox-ai/idea-reality-mcp&lt;/a&gt; — Python, MIT license, 318+ stars&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PyPI&lt;/strong&gt;: &lt;a href="https://pypi.org/project/idea-reality-mcp/" rel="noopener noreferrer"&gt;idea-reality-mcp&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To set up both MCP servers in Claude Desktop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"idea-reality"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"uvx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"idea-reality-mcp"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"notion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"@modelcontextprotocol/server-notion"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"NOTION_API_KEY"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"your-notion-integration-token"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then just tell Claude:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Check if this idea already exists: [your idea]. Save the results to my Notion Idea Tracker."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Claude handles the rest — calling both MCP tools and writing the structured entry.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Used Notion MCP
&lt;/h2&gt;

&lt;p&gt;The system uses two MCP servers working together through Claude:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. idea-reality-mcp&lt;/strong&gt; — scans 5 platforms in parallel and returns structured market intelligence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Notion MCP&lt;/strong&gt; (&lt;code&gt;@modelcontextprotocol/server-notion&lt;/code&gt;) — writes the results into a structured Notion database.&lt;/p&gt;

&lt;p&gt;Claude Desktop orchestrates both: it calls idea-reality-mcp first, interprets the results, then calls Notion MCP to create a database entry with all the structured data.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Notion Database
&lt;/h3&gt;

&lt;p&gt;The database schema captures everything the AI finds:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Property&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Idea&lt;/td&gt;
&lt;td&gt;Title&lt;/td&gt;
&lt;td&gt;The idea description&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reality Score&lt;/td&gt;
&lt;td&gt;Number&lt;/td&gt;
&lt;td&gt;0–100 duplicate likelihood&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Status&lt;/td&gt;
&lt;td&gt;Select&lt;/td&gt;
&lt;td&gt;Build / Pivot / Kill / Checked&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Market Momentum&lt;/td&gt;
&lt;td&gt;Select&lt;/td&gt;
&lt;td&gt;Accelerating / Stable / Declining&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitHub Repos&lt;/td&gt;
&lt;td&gt;Number&lt;/td&gt;
&lt;td&gt;Direct competitor count&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitHub Stars&lt;/td&gt;
&lt;td&gt;Number&lt;/td&gt;
&lt;td&gt;Top competitor traction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HN Posts&lt;/td&gt;
&lt;td&gt;Number&lt;/td&gt;
&lt;td&gt;Community buzz&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;npm / PyPI Packages&lt;/td&gt;
&lt;td&gt;Number&lt;/td&gt;
&lt;td&gt;Package ecosystem overlap&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Keywords&lt;/td&gt;
&lt;td&gt;Text&lt;/td&gt;
&lt;td&gt;Extracted search terms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Summary&lt;/td&gt;
&lt;td&gt;Text&lt;/td&gt;
&lt;td&gt;AI-generated strategic analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Checked At&lt;/td&gt;
&lt;td&gt;Date&lt;/td&gt;
&lt;td&gt;When the scan ran&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
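&lt;p&gt;For reference, a Notion database entry for this schema boils down to a single &lt;code&gt;properties&lt;/code&gt; payload. The sketch below is illustrative: the field names on &lt;code&gt;scan_result&lt;/code&gt; are my assumptions rather than the tool's actual output shape, but the property value objects follow Notion's public API:&lt;/p&gt;

```python
# Hypothetical sketch of the Notion "properties" payload for the schema above.
# The scan_result field names are assumptions, not idea-reality-mcp's real output.
from datetime import date

def build_notion_properties(scan_result: dict) -> dict:
    """Map a scan result onto the Idea Tracker database schema."""
    return {
        "Idea": {"title": [{"text": {"content": scan_result["idea"]}}]},
        "Reality Score": {"number": scan_result["reality_score"]},
        "Status": {"select": {"name": scan_result["status"]}},  # Build / Pivot / Kill / Checked
        "Market Momentum": {"select": {"name": scan_result["momentum"]}},
        "GitHub Repos": {"number": scan_result["github_repos"]},
        "GitHub Stars": {"number": scan_result["github_stars"]},
        "HN Posts": {"number": scan_result["hn_posts"]},
        "npm / PyPI Packages": {"number": scan_result["packages"]},
        "Keywords": {"rich_text": [{"text": {"content": ", ".join(scan_result["keywords"])}}]},
        "Summary": {"rich_text": [{"text": {"content": scan_result["summary"]}}]},
        "Checked At": {"date": {"start": date.today().isoformat()}},
    }
```

&lt;p&gt;In practice Claude assembles this payload for you when it calls Notion MCP; the sketch just shows what lands in the database.&lt;/p&gt;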

&lt;h3&gt;
  
  
  Why Notion as the Dashboard
&lt;/h3&gt;

&lt;p&gt;Notion's native views turn raw data into decision intelligence:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Board view&lt;/strong&gt; groups ideas by Build / Pivot / Kill — one glance shows your pipeline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Table view&lt;/strong&gt; lets you sort by score or filter by momentum&lt;/li&gt;
&lt;li&gt;Over time, the database becomes a &lt;strong&gt;decision journal&lt;/strong&gt;: which ideas you killed, which you pursued, and whether the market validated your choice&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Tech Stack
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/mnemox-ai/idea-reality-mcp" rel="noopener noreferrer"&gt;idea-reality-mcp&lt;/a&gt; — Python, MIT license, 318+ GitHub stars&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/makenotion/notion-mcp-server" rel="noopener noreferrer"&gt;Notion MCP&lt;/a&gt; — official Notion MCP server&lt;/li&gt;
&lt;li&gt;Claude Desktop — orchestration layer&lt;/li&gt;
&lt;li&gt;Notion — intelligence dashboard&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Background
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dev.to/mnemox/i-gave-my-trading-agent-memory-and-it-made-everything-worse-28a3"&gt;I Gave My Trading Agent Memory and It Made Everything Worse&lt;/a&gt; — the research story behind this tool&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devchallenge</category>
      <category>notionchallenge</category>
      <category>mcp</category>
      <category>ai</category>
    </item>
    <item>
      <title>I Gave My Trading Agent Memory and It Made Everything Worse</title>
      <dc:creator>Sean  |   Mnemox</dc:creator>
      <pubDate>Tue, 10 Mar 2026 13:20:11 +0000</pubDate>
      <link>https://dev.to/mnemox/i-gave-my-trading-agent-memory-and-it-made-everything-worse-28a3</link>
      <guid>https://dev.to/mnemox/i-gave-my-trading-agent-memory-and-it-made-everything-worse-28a3</guid>
      <description>&lt;p&gt;&lt;em&gt;How similarity-based recall amplifies LLM confirmation bias, and a simple mechanism that breaks the feedback loop.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;I spent two days and $73 watching an LLM trading agent destroy itself with its own memories. What I found wasn't a bug. It was a structural flaw in how every similarity-based memory system interacts with an LLM's internal beliefs — and the fix turned out to be counterintuitively simple: make the agent remember its failures, even when the retrieval system doesn't want to.&lt;/p&gt;

&lt;p&gt;This is the story of that experiment, what went wrong, and the open-source mechanism I built to prevent it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;I'm building &lt;a href="https://github.com/mnemox-ai/tradememory-protocol" rel="noopener noreferrer"&gt;TradeMemory&lt;/a&gt;, an episodic memory layer for AI trading agents. The idea is straightforward: store every trade the agent makes — entry, exit, P&amp;amp;L, market context — and retrieve relevant past trades at decision time so the agent can learn from experience. Exactly what you'd want a human trader to do.&lt;/p&gt;

&lt;p&gt;The experimental framework is called Trade Dreaming. It runs an LLM agent through historical XAUUSD M15 bars (50,802 bars from Jan 2024 to Mar 2026), letting the agent decide on each bar whether to trade or hold. Three strategies are available: VolBreakout (VB), IntradayMomentum (IM), and PullbackEntry (PB). Starting equity is $10,000, risk is 0.25% per trade, buy-only.&lt;/p&gt;
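&lt;p&gt;The 0.25% risk rule translates into fixed-fractional position sizing. This is my own sketch of the arithmetic, with illustrative names rather than TradeMemory's actual code:&lt;/p&gt;

```python
# Fixed-fractional sizing in the spirit of the experiment's risk model:
# risk 0.25% of current equity per trade, sized off the stop distance.
# My own sketch; parameter names are illustrative, not TradeMemory's API.
def position_size(equity: float, entry: float, stop: float,
                  risk_fraction: float = 0.0025) -> float:
    """Units to buy so that a stop-out loses roughly risk_fraction of equity."""
    risk_amount = equity * risk_fraction      # e.g. $25 on $10,000
    stop_distance = abs(entry - stop)         # loss per unit if stopped out
    if stop_distance == 0:
        raise ValueError("stop must differ from entry")
    return risk_amount / stop_distance
```

&lt;p&gt;On $10,000 equity with a $5 stop distance, that caps the worst case per trade at $25, which is why even a losing streak only dents the equity curve slowly.&lt;/p&gt;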

&lt;p&gt;Before adding memory, I ran three different models through the same framework, same prompt, same data. The results were... instructive.&lt;/p&gt;

&lt;p&gt;(A note on costs: the full 2-day experiment cost $72.69 across 6,836 decisions and 40 trades. Sonnet runs at about $0.014 per decision, Haiku at $0.001. I mention this because "I ran experiments" sounds different when you know the entire budget was under $75.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Models, Three Personalities
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Haiku 3.5&lt;/th&gt;
&lt;th&gt;Sonnet 4&lt;/th&gt;
&lt;th&gt;DeepSeek-V3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Decisions&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;td&gt;2,000&lt;/td&gt;
&lt;td&gt;2,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trades executed&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trade rate&lt;/td&gt;
&lt;td&gt;11.0%&lt;/td&gt;
&lt;td&gt;0.3%&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Win rate&lt;/td&gt;
&lt;td&gt;22.7%&lt;/td&gt;
&lt;td&gt;83.3%&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Profit factor&lt;/td&gt;
&lt;td&gt;0.96&lt;/td&gt;
&lt;td&gt;2.42&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Final equity&lt;/td&gt;
&lt;td&gt;~$9,980&lt;/td&gt;
&lt;td&gt;$10,176&lt;/td&gt;
&lt;td&gt;$10,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API cost&lt;/td&gt;
&lt;td&gt;~$0.23&lt;/td&gt;
&lt;td&gt;~$28.86&lt;/td&gt;
&lt;td&gt;~$2.50&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Haiku ran 200 decisions as a preliminary screen; Sonnet and DeepSeek ran the full 2,000.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Haiku&lt;/strong&gt; was the trigger-happy intern — 22 trades in 200 decisions, 22.7% win rate, net negative. It fired at everything. Pure System 1: fast, impulsive, undiscriminating.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sonnet&lt;/strong&gt; was the senior trader — 6 trades in 2,000 decisions, 83.3% win rate, profit factor 2.42. It only took 4 VolBreakout and 2 PullbackEntry setups. Zero IntradayMomentum trades. It knew what to skip.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DeepSeek-V3&lt;/strong&gt; was the analyst who never left the office — 2,000 consecutive HOLD outputs. Zero trades. It found uncertainty in every setup, burned 3,000+ reasoning tokens per decision, and eventually crashed from memory accumulation at decision 1,786. Final equity: $10,000.00 exactly.&lt;/p&gt;

&lt;p&gt;A perfect behavioral spectrum: reckless → precise → paralyzed. The same prompt, the same data, and a 37x difference in trade frequency between Haiku and Sonnet. This alone is interesting — existing literature has documented that smarter models don't always trade better (GPT-4o-mini beats GPT-4o on Sharpe ratio in one benchmark) and that reasoning models overthink financial decisions. But nobody had quantified the full spectrum in a single framework before.&lt;/p&gt;

&lt;p&gt;Sonnet was clearly the winner. So I gave it memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Memory Made Everything Worse
&lt;/h2&gt;

&lt;p&gt;The memory system stores each closed trade as an episodic record — strategy, entry/exit prices, P&amp;amp;L, market regime, session, ATR, confidence level. At each new decision, the retrieval system finds the 5 most similar past trades (scored by ATR proximity, session overlap, and regime match) and injects them into the prompt.&lt;/p&gt;
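&lt;p&gt;A minimal version of that recall step might look like the following. The weights and field names are my assumptions; the real scorer combines the same three signals (ATR proximity, session overlap, regime match) but not necessarily with this formula:&lt;/p&gt;

```python
# Minimal similarity scorer in the spirit described above.
# Field names and the 0.5 / 0.25 / 0.25 weighting are assumptions.
def relevance(query: dict, record: dict,
              w_atr: float = 0.5, w_session: float = 0.25,
              w_regime: float = 0.25) -> float:
    # ATR proximity: 1.0 when equal, shrinking toward 0 as values diverge
    hi = max(query["atr"], record["atr"])
    atr_sim = min(query["atr"], record["atr"]) / hi if hi > 0 else 1.0
    session_sim = 1.0 if query["session"] == record["session"] else 0.0
    regime_sim = 1.0 if query["regime"] == record["regime"] else 0.0
    return w_atr * atr_sim + w_session * session_sim + w_regime * regime_sim

def recall_top_k(query: dict, records: list, k: int = 5) -> list:
    """Return the k most similar past trades for this decision."""
    return sorted(records, key=lambda r: relevance(query, r), reverse=True)[:k]
```

&lt;p&gt;Note what this scorer does &lt;em&gt;not&lt;/em&gt; look at: the trade's outcome. Similarity is purely about market conditions, which is exactly why the returned sample can be all winners.&lt;/p&gt;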

&lt;p&gt;Here's what happened:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;No Memory (baseline)&lt;/th&gt;
&lt;th&gt;With Memory&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Trades&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;+1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Win rate&lt;/td&gt;
&lt;td&gt;83.3%&lt;/td&gt;
&lt;td&gt;57.1%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-26.2pp&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Profit factor&lt;/td&gt;
&lt;td&gt;2.42&lt;/td&gt;
&lt;td&gt;0.94&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-1.48&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PnL&lt;/td&gt;
&lt;td&gt;+$176&lt;/td&gt;
&lt;td&gt;-$28&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-$204&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Strategies&lt;/td&gt;
&lt;td&gt;VB(4) + PB(2)&lt;/td&gt;
&lt;td&gt;VB(4) + PB(1) + &lt;strong&gt;IM(2)&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;IM appeared&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The agent went from profitable to unprofitable. Profit factor dropped below 1.0. Two IntradayMomentum trades appeared — a strategy Sonnet had correctly avoided in every single one of its 2,000 no-memory decisions. Both IM trades hit their stop-losses. Combined loss: -$437, wiping out all VB and PB gains.&lt;/p&gt;

&lt;p&gt;And here's the kicker: both IM trades were entered with confidence 0.85 — the highest confidence of any trade in the entire run. The agent was most confident on its worst trades.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Debugging Rabbit Hole
&lt;/h2&gt;

&lt;p&gt;Getting to this point wasn't clean. The first attempt at adding memory revealed that the engine wasn't even storing closed trades — a bug where &lt;code&gt;_execute_decision&lt;/code&gt; didn't return the closed position. I fixed that, re-ran, got 1 trade with 1 episodic record. Pipeline verified.&lt;/p&gt;

&lt;p&gt;Then I discovered a shortcut: I could backfill episodic memory from the existing Sonnet 2000-decision JSONL log. Six trades, already completed, just needed to be converted to memory records. That saved $28 and 5 hours of re-running the full baseline.&lt;/p&gt;

&lt;p&gt;With the backfilled memory in place, I ran the full 2,000-decision memory test. That's when the profit factor cratered from 2.42 to 0.94. Two IM trades appeared. Both lost.&lt;/p&gt;

&lt;p&gt;My first fix attempt addressed three bugs at once: added loss balance to the retrieval, fixed unbalanced guidance text in the prompt, and patched the regime classifier that was tagging everything as "unknown." All 44 retrieval tests and 503 engine tests passed. Re-ran 200 decisions.&lt;/p&gt;

&lt;p&gt;IM still appeared.&lt;/p&gt;

&lt;p&gt;It took another hour of debugging to discover the engine was using the old recall function, not my new hybrid retrieval. The &lt;code&gt;hybrid.py&lt;/code&gt; I'd written was sitting there, fully tested, completely unused. Classic integration failure. I redesigned the engine to accept a pluggable &lt;code&gt;memory_recall_fn&lt;/code&gt; via dependency injection, wired in the hybrid retrieval, hit a Pydantic import error, fixed it, and finally ran the validation that worked.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Root Cause: 100% Positive Recall
&lt;/h2&gt;

&lt;p&gt;When I examined what the agent actually saw in its prompt at the point of the IM entries, the memory block looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Past Similar Trades
1. [VolBreakout] pnl=+$92.00  Relevance: 0.97
2. [VolBreakout] pnl=+$31.10  Relevance: 0.93
3. [VolBreakout] pnl=+$105.80 Relevance: 0.78
4. [PullbackEntry] pnl=+$19.90 Relevance: 0.78
5. [PullbackEntry] pnl=+$51.60 Relevance: 0.78
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Five trades. Five winners. Zero losses. The retrieval system had done exactly what it was designed to do — find the most similar past experiences — and returned an entirely positive sample.&lt;/p&gt;

&lt;p&gt;Compare this to the no-memory prompt for the same decision point. Without memory, the agent sees the current bar, 20 recent bars, technical indicators (ATR, RSI, SMAs), and its recent trade history as a flat list. With memory, it gets an additional block of 5 "similar past trades," each with context, reflection text, and a relevance score. The agent reads: "In similar market conditions, here are 5 trades you made. All 5 were profitable. The most similar one (relevance 0.97) made $92."&lt;/p&gt;

&lt;p&gt;There is no counterexample. No memory of "this setup also failed X% of the time." The agent generalizes from a perfectly biased sample.&lt;/p&gt;

&lt;p&gt;I initially thought this was a data problem — maybe the memory just didn't have enough losses yet. But at the point of the second IM trade (around decision 1,600), the episodic memory already contained 12 records, including 3 losses. Nine of 12 records were wins (75% positive bias), and 5 had regime tagged as "unknown" due to a classifier bug. But the real issue wasn't the database composition — it was that the retrieval system picked the top 5 by similarity, and all 5 happened to be winners.&lt;/p&gt;

&lt;p&gt;This wasn't a coincidence. It's a structural property of similarity-based retrieval.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Similarity-Based Retrieval Has a Built-In Positive Bias
&lt;/h2&gt;

&lt;p&gt;Think about where winning trades cluster versus where losing trades cluster:&lt;/p&gt;

&lt;p&gt;Winning trades tend to happen in typical conditions — trending markets, London session (most liquid), normal ATR ranges, textbook setups. These are the most common market states, because strategies are designed to work in common conditions.&lt;/p&gt;

&lt;p&gt;Losing trades concentrate in atypical conditions — range-bound markets, off-hours with thin liquidity, extreme ATR spikes, edge cases. By definition, unusual conditions are less similar to any typical query.&lt;/p&gt;

&lt;p&gt;When you ask "find me trades in conditions similar to right now," you're querying against the most common market state. Winning trades dominate that region of the space. Losses are scattered in the tails, where similarity scores are inherently lower.&lt;/p&gt;

&lt;p&gt;This means &lt;strong&gt;any similarity-based retrieval system will systematically over-retrieve positive outcomes&lt;/strong&gt;, even with a perfectly balanced underlying database. The bias isn't in the data. It's in the geometry of retrieval itself.&lt;/p&gt;
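&lt;p&gt;The geometry claim is easy to demonstrate with a synthetic, perfectly balanced database. This toy model is my own construction, not the experiment's data: one feature stands in for "market typicality," wins cluster near the common state, losses sit in the tails:&lt;/p&gt;

```python
# Toy demonstration: a 50/50 win/loss database still yields an all-win top-5
# when losses live in atypical conditions. Entirely synthetic data.
wins   = [{"x": 0.1 * i, "pnl": +50} for i in range(10)]        # near x = 0
losses = [{"x": 2.0 + 0.5 * i, "pnl": -50} for i in range(10)]  # in the tails
database = wins + losses                                        # balanced: 10 vs 10

query_x = 0.0  # "conditions similar to right now" = the typical state
top5 = sorted(database, key=lambda t: abs(t["x"] - query_x))[:5]

assert all(t["pnl"] > 0 for t in top5)  # every retrieved trade is a winner
```

&lt;p&gt;Half the database is losses, yet the top-5 recall is 100% wins, purely because of where each outcome sits in feature space.&lt;/p&gt;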

&lt;h2&gt;
  
  
  Resonance: When Retrieval Confirms What the LLM Already Believes
&lt;/h2&gt;

&lt;p&gt;Here's where it gets dangerous. The biased retrieval doesn't operate in isolation — it feeds into an LLM that has its own beliefs.&lt;/p&gt;

&lt;p&gt;Every LLM carries parametric memory: knowledge baked into its weights during training. For trading, this includes everything it absorbed from financial textbooks and trading forums: "breakout trading works," "momentum strategies capture intraday moves," "the trend is your friend." These beliefs are permanent, uninspectable, and always running in the background.&lt;/p&gt;

&lt;p&gt;Current research on parametric-contextual knowledge interaction — surveyed comprehensively by Xu et al. at EMNLP 2024, with benchmarks like ConflictBank (NeurIPS 2024) and EchoQA (ICLR 2025) — focuses almost entirely on what happens when the two disagree. Six major papers and two benchmarks study the conflict axis. The implicit assumption is that agreement is good: both sources say the same thing, higher confidence, better output.&lt;/p&gt;

&lt;p&gt;Our data shows the opposite.&lt;/p&gt;

&lt;p&gt;When the retrieval system returns 5 winning VolBreakout trades, and Sonnet's parametric memory already believes "breakout trading works," the two signals amplify each other. I call this &lt;strong&gt;resonance&lt;/strong&gt;. The mechanism follows a clear chain:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Sonnet's weights contain the belief that breakout strategies are valid (absorbed from training data — breakout trading is one of the most-documented technical strategies in existence).&lt;/li&gt;
&lt;li&gt;The agent's first few closed trades happen to be VB winners. They get stored in episodic memory.&lt;/li&gt;
&lt;li&gt;On the next decision, retrieval finds the 5 most similar past trades. All 5 are VB winners (because winning VB trades cluster in the most common market state).&lt;/li&gt;
&lt;li&gt;Now the prompt says: "Here are 5 trades you made in similar conditions. All 5 were profitable."&lt;/li&gt;
&lt;li&gt;Parametric memory says: "Breakout works." External memory says: "Everything you've done works." Both signals point the same direction. Resonance.&lt;/li&gt;
&lt;li&gt;Confidence inflates beyond calibration. The agent starts taking IntradayMomentum entries — because parametric memory says "momentum is valid" and external memory says "I'm on a winning streak."&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This maps directly onto documented LLM behavior. The "Chain of Evidence" paper (arXiv, Dec 2024) demonstrated that LLMs exhibit confirmation bias — they preferentially trust external evidence that aligns with their internal knowledge, regardless of whether that evidence is actually correct. ReDeEP (ICLR 2025 Spotlight) showed that Knowledge FFNs in transformer models overemphasize parametric knowledge while Copying Heads fail to properly integrate external context. And "No Free Lunch" (EMNLP 2025) found that RAG amplifies model confidence in biased answers — just 20% unfair samples in retrieval was enough to trigger amplification.&lt;/p&gt;

&lt;p&gt;These are all pieces of the same puzzle. Nobody had assembled them into a single causal chain: &lt;strong&gt;similarity retrieval bias + LLM confirmation bias + parametric knowledge alignment = resonance&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Human Parallel
&lt;/h2&gt;

&lt;p&gt;The parallel to behavioral finance is not a metaphor — it's mechanistically identical.&lt;/p&gt;

&lt;p&gt;Gödker, Jiao, and Smeets showed in PNAS (2021) that human investors systematically over-remember winning trades and under-remember losses. Jiang et al. in the Quarterly Journal of Economics (2025) showed that investor memory-based beliefs explain stock return expectations, with rising markets triggering positive recall feedback loops. Fudenberg, Lanzani, and Strack formalized this in the Journal of Political Economy (2024) as a "Selective Memory Equilibrium" — agents who over-remember ego-boosting experiences become overconfident.&lt;/p&gt;

&lt;p&gt;Replace "human investor's selective forgetting" with "retrieval system's similarity bias" and you get the same outcome through a different mechanism: a biased sample of past experiences that systematically overstates the probability of success.&lt;/p&gt;

&lt;p&gt;Nobody had connected these two literatures. The behavioral finance people study humans. The AI agent people study LLMs. They're describing the same phenomenon.&lt;/p&gt;

&lt;h2&gt;
  
  
  Anti-Resonance: The Fix Is Deliberate Conflict
&lt;/h2&gt;

&lt;p&gt;If resonance is the problem — both memory sources agreeing, amplifying confidence — then the fix is to deliberately break the agreement. I call this &lt;strong&gt;anti-resonance&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When the retrieval system returns 5 winning VB trades, you force-inject at least 1 losing trade into the recall. Now the agent's prompt contains a contradiction:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Parametric memory: "Breakout strategies work."&lt;/li&gt;
&lt;li&gt;External memory (4 wins): "Yes, they usually work here."&lt;/li&gt;
&lt;li&gt;External memory (1 loss): "But sometimes they fail catastrophically."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent is forced to reconcile contradictory evidence instead of rubber-stamping a pre-existing belief. This is genuine reasoning — weighing competing signals, calibrating confidence, deciding whether this setup looks more like the 4 wins or the 1 loss. Without the injected loss, there's nothing to reason about.&lt;/p&gt;

&lt;p&gt;The concept has precedents at other abstraction levels. Du et al. (ICML 2024) showed multi-agent debate improves factuality through conflicting positions. De Jong et al. (CSCW 2025) explored LLMs as "epistemic provocateurs" — challenging positions to reduce human confirmation bias. But nobody had applied deliberate conflict at the retrieval level — constructing recall results that contradict the model's parametric bias.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;code&gt;ensure_negative_balance&lt;/code&gt;: The Engineering Contribution
&lt;/h2&gt;

&lt;p&gt;I implemented anti-resonance as a single, generic function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ensure_negative_balance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;top&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;all_candidates&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;is_negative&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;min_negative_ratio&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;score_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;relevance_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The mechanism is post-retrieval: normal relevance ranking happens first, preserving the quality of similarity matching. Then a hard constraint is applied — at least &lt;code&gt;ceil(K × min_negative_ratio)&lt;/code&gt; of the top-K results must be negative outcomes. If there aren't enough negatives, the lowest-scored positives get swapped out for the highest-scored negatives from the full candidate pool.&lt;/p&gt;
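&lt;p&gt;The signature above implies a fairly direct body. This is my reconstruction of the mechanism as described (swap the lowest-scored positives for the highest-scored negatives), not the repository's actual implementation:&lt;/p&gt;

```python
import math
from typing import Callable, List, TypeVar

T = TypeVar("T")

# Reconstruction of the described mechanism, not the repo's actual code:
# enforce a minimum share of negative outcomes in the top-K results.
def ensure_negative_balance(
    top: List[T],
    all_candidates: List[T],
    is_negative: Callable[[T], bool],
    min_negative_ratio: float = 0.20,
    score_key: Callable[[T], float] = lambda x: getattr(x, "relevance_score", 0.0),
) -> List[T]:
    required = math.ceil(len(top) * min_negative_ratio)
    have = sum(1 for t in top if is_negative(t))
    if have >= required:
        return top  # already balanced; leave the similarity ranking alone
    # Best negatives not already retrieved, strongest first
    spare = sorted((c for c in all_candidates
                    if is_negative(c) and c not in top),
                   key=score_key, reverse=True)
    result = list(top)
    # Weakest positives are the first to be swapped out
    positives = sorted((t for t in result if not is_negative(t)), key=score_key)
    for neg, pos in zip(spare[:required - have], positives):
        result[result.index(pos)] = neg
    return result
```

&lt;p&gt;Because the swap happens after ranking, the four strongest similarity matches survive untouched; only the weakest slot is spent on the counterexample.&lt;/p&gt;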

&lt;p&gt;The key abstraction is the &lt;code&gt;is_negative&lt;/code&gt; predicate. It decouples the balance mechanism from any specific domain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Trading: losses
&lt;/span&gt;&lt;span class="nf"&gt;ensure_negative_balance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;top&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_negative&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pnl&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Customer service: bad outcomes
&lt;/span&gt;&lt;span class="nf"&gt;ensure_negative_balance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;top&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_negative&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;satisfaction&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Code review: failed builds
&lt;/span&gt;&lt;span class="nf"&gt;ensure_negative_balance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;top&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_negative&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;test_passed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is domain-agnostic anti-resonance. Any system that stores outcomes, retrieves by similarity, and feeds context into an LLM with parametric knowledge will produce resonance when retrieved outcomes align with parametric beliefs. The specific domain doesn't matter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Validation: It Works
&lt;/h2&gt;

&lt;p&gt;After integrating the hybrid recall (with &lt;code&gt;min_negative_ratio=0.20&lt;/code&gt;) into the engine, I ran a 200-decision validation — same data window, same model, new recall path:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Old Recall (memory hurts)&lt;/th&gt;
&lt;th&gt;Hybrid Recall (fixed)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Decisions&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trades&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IM trades&lt;/td&gt;
&lt;td&gt;1 (appeared)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0 (eliminated)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PnL&lt;/td&gt;
&lt;td&gt;-$154&lt;/td&gt;
&lt;td&gt;+$29&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory recalls triggered&lt;/td&gt;
&lt;td&gt;41&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;IntradayMomentum — the strategy that only appeared with memory and caused -$437 in losses across the full run — was completely eliminated. The single trade was a clean VB winner. All 200 decisions triggered memory recall (compared to only 41 in the old version, which had a retrieval threshold that filtered out most queries), confirming the pipeline was fully operational.&lt;/p&gt;

&lt;p&gt;The loss balance mechanism did exactly what it was designed to do: it didn't change the retrieval algorithm, didn't modify the scoring weights, didn't retrain anything. It just guaranteed that the agent would see at least one counterexample before making a decision. That single counterexample was enough to break the resonance loop and restore calibrated behavior.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters Beyond Trading
&lt;/h2&gt;

&lt;p&gt;Every LLM agent memory system has this problem. Any architecture that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Stores outcomes (positive and negative)&lt;/li&gt;
&lt;li&gt;Retrieves by similarity&lt;/li&gt;
&lt;li&gt;Feeds retrieved context into an LLM with parametric knowledge&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;...will produce resonance when retrieved outcomes align with parametric beliefs. Consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Customer service agent&lt;/strong&gt;: Retrieves 5 similar tickets, all resolved successfully → overconfident in a case that actually needs escalation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code review agent&lt;/strong&gt;: Retrieves 5 similar PRs, all passed tests → misses a subtle bug pattern.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Medical triage agent&lt;/strong&gt;: Retrieves 5 similar cases, all benign → misses a rare but serious condition.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The positive bias isn't in the data — it's in the geometry of retrieval. And the LLM's confirmation bias turns that geometric artifact into a confidence amplifier.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Model-Dependent Twist
&lt;/h2&gt;

&lt;p&gt;There's one more finding worth highlighting. The severity of resonance depends on the model's parametric confidence — and the interaction is nonlinear.&lt;/p&gt;

&lt;p&gt;Haiku (weak parametric beliefs, fast System 1) produced noise regardless of memory. It was already making bad decisions; memory didn't make them worse because there was no coherent signal to amplify.&lt;/p&gt;

&lt;p&gt;Sonnet (calibrated beliefs, deliberate System 2) was precisely where resonance struck hardest. It had accurate enough beliefs to trade well, and the retrieval bias pushed it past calibration into overconfidence.&lt;/p&gt;

&lt;p&gt;DeepSeek (overthinking, paralyzed System 2) was immune to resonance because it never traded at all. You can't amplify a decision that doesn't get made.&lt;/p&gt;

&lt;p&gt;This means &lt;strong&gt;memory hurts most for the best-calibrated models&lt;/strong&gt; — exactly the ones you'd want to give memory to. The relationship between model quality and memory benefit isn't monotonic. It has a danger zone at the exact performance level where you'd deploy an agent in production.&lt;/p&gt;

&lt;p&gt;Existing literature has studied model size vs. trading performance, and memory vs. trading performance, but never the interaction. This is, as far as I can tell from an extensive prior art search, the first empirical demonstration of the model × memory interaction effect.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;Two days, $73 in API costs, 6,836 decisions, 40 trades, and one genuinely surprising finding:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The most dangerous thing you can do to a well-calibrated LLM agent is give it memory that confirms what it already believes.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not wrong memories. Not hallucinated memories. Accurate, relevant, correctly retrieved memories that happen to be biased toward positive outcomes because of the geometry of similarity search. The retrieval system works perfectly. The LLM reasons coherently. And the combination produces worse decisions than no memory at all.&lt;/p&gt;

&lt;p&gt;The fix isn't better embeddings or smarter retrieval scoring. It's a structural intervention: guarantee that recall results contain enough negative outcomes to create tension with the model's parametric beliefs. Force the agent to reason about contradictory evidence instead of confirming what it already thinks.&lt;/p&gt;

&lt;p&gt;I've open-sourced &lt;code&gt;ensure_negative_balance&lt;/code&gt; as part of &lt;a href="https://github.com/mnemox-ai/tradememory-protocol" rel="noopener noreferrer"&gt;TradeMemory&lt;/a&gt;. It's 40 lines of Python. It took two days to discover why it was needed, and 30 minutes to build.&lt;/p&gt;

&lt;p&gt;The resonance problem is hiding in every RAG pipeline that feeds results into an LLM. The question is whether you'll notice before your agent gets confident enough to act on it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;All data in this article comes from actual experimental runs on XAUUSD M15 bars (Jan 2024 – Mar 2026). No results are simulated or cherry-picked. The full material pack, including trade logs, prompt comparisons, and prior art analysis, is available in the project repository.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key References:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://doi.org/10.1073/pnas.2026680118" rel="noopener noreferrer"&gt;Godker, Jiao &amp;amp; Smeets (2021, PNAS)&lt;/a&gt; — Investor memory is positively biased&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2403.08319" rel="noopener noreferrer"&gt;Xu et al. (EMNLP 2024)&lt;/a&gt; — Knowledge Conflicts for LLMs: A Survey&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2412.12632" rel="noopener noreferrer"&gt;Chain of Evidence (arXiv 2412.12632)&lt;/a&gt; — LLMs prefer evidence consistent with internal memory&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2410.11414" rel="noopener noreferrer"&gt;ReDeEP, Sun et al. (ICLR 2025 Spotlight)&lt;/a&gt; — Detecting hallucination via mechanistic interpretability&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2305.14325" rel="noopener noreferrer"&gt;Du et al. (ICML 2024)&lt;/a&gt; — Multi-agent debate improves reasoning&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2505.16067" rel="noopener noreferrer"&gt;Xie et al. (2025)&lt;/a&gt; — Memory management impacts LLM agents&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/1511.05952" rel="noopener noreferrer"&gt;Schaul et al. (2015)&lt;/a&gt; — Prioritized Experience Replay&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://doi.org/10.1093/qje/qjae038" rel="noopener noreferrer"&gt;Jiang et al. (2025, QJE)&lt;/a&gt; — Investor memory and biased beliefs&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2405.20138" rel="noopener noreferrer"&gt;No Free Lunch: RAG Undermines Fairness (EMNLP 2025)&lt;/a&gt; — RAG amplifies LLM confidence in biased answers&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>rag</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Add a pre-build reality check to your AI agent — one line, every project</title>
      <dc:creator>Sean  |   Mnemox</dc:creator>
      <pubDate>Sun, 01 Mar 2026 09:42:40 +0000</pubDate>
      <link>https://dev.to/mnemox/add-a-pre-build-reality-check-to-your-ai-agent-one-line-every-project-46e5</link>
      <guid>https://dev.to/mnemox/add-a-pre-build-reality-check-to-your-ai-agent-one-line-every-project-46e5</guid>
      <description>&lt;p&gt;Your AI coding agent just spent 3 hours building a DNS propagation checker. You were impressed. The code was clean, tests passed, CLI looked great. Then you searched GitHub: 47 repos doing exactly the same thing. One of them has 2,000+ stars and a published npm package.&lt;/p&gt;

&lt;p&gt;The agent never checked. You never asked it to. Nobody does.&lt;/p&gt;

&lt;p&gt;This is one of the most common failure modes of AI-assisted development. Not bad code. Not wrong architecture. Just building something that already exists, because the agent was never told to look first.&lt;/p&gt;

&lt;h2&gt;
  
  
  The blind spot
&lt;/h2&gt;

&lt;p&gt;Claude Code, Cursor, Windsurf, GitHub Copilot -- they are all excellent at writing code. Give them a spec and they will produce working software. But they have zero awareness of what already exists in the ecosystem.&lt;/p&gt;

&lt;p&gt;They don't search GitHub before scaffolding a new project. They don't check if there's already a popular npm package for what you described. They don't scan Hacker News to see if someone shipped the same idea last week.&lt;/p&gt;

&lt;p&gt;The result: you invest hours (or days) into something that already has mature alternatives. Or you ship a clone without knowing, then find out when someone drops a link in your comments section.&lt;/p&gt;

&lt;h2&gt;
  
  
  One tool, five sources, real data
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/mnemox-ai/idea-reality-mcp" rel="noopener noreferrer"&gt;idea-reality-mcp&lt;/a&gt; is an MCP server that searches five real-time sources -- GitHub, Hacker News, npm, PyPI, and Product Hunt -- and returns a &lt;code&gt;reality_signal&lt;/code&gt; score from 0 to 100.&lt;/p&gt;

&lt;p&gt;It uses live API data, not LLM opinions. Every number in the result traces back to an actual search query you can verify. The scoring formula is documented and intentionally simple: no ML, no black box.&lt;/p&gt;

&lt;p&gt;Add one line to your agent's instructions file and it checks automatically before writing any new code. Here is what that looks like in practice.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the results look like
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Example 1: "CLI tool for DNS propagation monitoring"&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;reality_signal: 75
top_competitor: dns-propagation-checker (1,847 stars)
github_repos: 47
npm_packages: 12
hn_mentions: 23
verdict: High competition. Multiple mature tools exist.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your agent sees signal 75 and warns you: "There are 47 existing repos in this space, including one with nearly 2K stars. Want me to find a differentiation angle instead?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example 2: "MCP server for pre-build idea validation"&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;reality_signal: 15
top_competitor: none with significant traction
github_repos: 3
npm_packages: 0
hn_mentions: 2
verdict: Low competition. Greenfield opportunity.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Signal 15. The agent proceeds with confidence. The space is open.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example 3: "React component library for dashboards"&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;reality_signal: 85
top_competitor: tremor (4,200 stars)
github_repos: 312
npm_packages: 89
hn_mentions: 67
verdict: Very crowded. Strong incumbents with active development.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Signal 85. The agent stops and says: "This space has 312 repos and multiple well-funded projects. I'd recommend either targeting a specific niche or contributing to an existing library."&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup: one line per platform
&lt;/h2&gt;

&lt;p&gt;The MCP tool description already tells your agent when and how to call &lt;code&gt;idea_check&lt;/code&gt;. You just need a one-line hint in your agent's instruction file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before starting any new project, run idea_check first.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Drop this into the right file for your platform:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;td&gt;&lt;code&gt;CLAUDE.md&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cursor&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.cursorrules&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Windsurf&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.windsurfrules&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitHub Copilot&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.github/copilot-instructions.md&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That is it. The tool handles scoring thresholds, competitor analysis, and pivot suggestions on its own. You do not need to spell out the logic in your instruction file -- that is the MCP server's job.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it works under the hood
&lt;/h2&gt;

&lt;p&gt;The tool connects via MCP (Model Context Protocol), so any MCP-compatible agent can call it natively. When triggered:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Your idea text goes through a 3-stage keyword extraction pipeline (90+ intent anchors, 80+ synonym expansions).&lt;/li&gt;
&lt;li&gt;Five sources are queried in parallel using async HTTP.&lt;/li&gt;
&lt;li&gt;Results are scored with a weighted formula: GitHub repo count, star concentration, npm/PyPI package density, HN discussion volume, and Product Hunt presence.&lt;/li&gt;
&lt;li&gt;The agent receives a structured response with the signal, evidence list, top competitors, and pivot suggestions.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Total latency: roughly 3 seconds for a deep scan across all five sources.&lt;/p&gt;
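&lt;p&gt;The parallel fan-out in step 2 is what keeps total latency close to the slowest single source rather than the sum of all five. A minimal sketch of that pattern with &lt;code&gt;asyncio&lt;/code&gt; (the &lt;code&gt;search_*&lt;/code&gt; stubs and their return shapes stand in for the real API clients, which are not shown here):&lt;/p&gt;

```python
import asyncio

# Illustrative stand-ins for the real source clients; each returns a hit count.
async def search_github(q): return {"source": "github", "count": 47}
async def search_hn(q): return {"source": "hn", "count": 23}
async def search_npm(q): return {"source": "npm", "count": 12}
async def search_pypi(q): return {"source": "pypi", "count": 4}
async def search_producthunt(q): return {"source": "producthunt", "count": 2}

async def query_all(idea):
    # All five sources are queried concurrently, so wall-clock time is
    # bounded by the slowest source, not the sum of all five.
    tasks = [f(idea) for f in (search_github, search_hn, search_npm,
                               search_pypi, search_producthunt)]
    return await asyncio.gather(*tasks)

results = asyncio.run(query_all("dns propagation checker"))
```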

&lt;h2&gt;
  
  
  Install
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# pip&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;idea-reality-mcp

&lt;span class="c"&gt;# uv (recommended)&lt;/span&gt;
uvx idea-reality-mcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No API key required. No account. No data storage. Works entirely through live, public API queries.&lt;/p&gt;

&lt;p&gt;Set &lt;code&gt;GITHUB_TOKEN&lt;/code&gt; for higher rate limits (optional). Set &lt;code&gt;PRODUCTHUNT_TOKEN&lt;/code&gt; to include Product Hunt data (optional).&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it now
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/mnemox-ai/idea-reality-mcp" rel="noopener noreferrer"&gt;mnemox-ai/idea-reality-mcp&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Web demo&lt;/strong&gt;: &lt;a href="https://mnemox.ai/check" rel="noopener noreferrer"&gt;mnemox.ai/check&lt;/a&gt; -- test any idea without installing anything&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent instruction templates&lt;/strong&gt;: &lt;a href="https://github.com/mnemox-ai/idea-reality-mcp/blob/master/examples/agent-instructions.md" rel="noopener noreferrer"&gt;examples/agent-instructions.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP Registry&lt;/strong&gt;: &lt;code&gt;io.github.mnemox-ai/idea-reality-mcp&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your agent does not need to guess. Make it search.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://github.com/mnemox-ai" rel="noopener noreferrer"&gt;Sean&lt;/a&gt; at Mnemox. 148 tests passing. MIT licensed.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>opensource</category>
      <category>productivity</category>
    </item>
    <item>
      <title>I asked ChatGPT if my idea was original. GitHub said 847 repos already exist.</title>
      <dc:creator>Sean  |   Mnemox</dc:creator>
      <pubDate>Fri, 27 Feb 2026 15:01:41 +0000</pubDate>
      <link>https://dev.to/mnemox/i-asked-chatgpt-if-my-idea-was-original-github-said-847-repos-already-exist-500l</link>
      <guid>https://dev.to/mnemox/i-asked-chatgpt-if-my-idea-was-original-github-said-847-repos-already-exist-500l</guid>
      <description>&lt;p&gt;Last month I mass-deleted 6 hours of code.&lt;/p&gt;

&lt;p&gt;Claude had spent the entire time enthusiastically helping me build something that already had 12 competitors on GitHub. The top one had over 1,000 stars.&lt;/p&gt;

&lt;p&gt;Here's the pattern every developer hits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Developer has an idea&lt;/li&gt;
&lt;li&gt;Asks ChatGPT: "Is this original?"&lt;/li&gt;
&lt;li&gt;ChatGPT says: "That's a great idea! Here's how to build it..."&lt;/li&gt;
&lt;li&gt;Developer spends 2 weeks building&lt;/li&gt;
&lt;li&gt;Searches GitHub → finds 847 repos doing the same thing&lt;/li&gt;
&lt;li&gt;The top one has 9,000 stars and a funded team behind it&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The AI didn't lie. It just didn't search.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why "just Google it" doesn't work
&lt;/h2&gt;

&lt;p&gt;You might think: just search before you build. But manual searching has problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;You search GitHub&lt;/strong&gt; → find repos, but miss npm packages and HN discussions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You search one query&lt;/strong&gt; → miss synonyms ("LLM monitoring" vs "AI observability" vs "model telemetry")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You check star counts&lt;/strong&gt; → but don't check PyPI/npm for existing packages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You spend 30 minutes&lt;/strong&gt; → and still aren't sure if you missed something&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The real issue: there's no standardized way to do a comprehensive market scan across all developer platforms at once.&lt;/p&gt;

&lt;h2&gt;
  
  
  What if your AI agent searched before coding?
&lt;/h2&gt;

&lt;p&gt;I built &lt;a href="https://github.com/mnemox-ai/idea-reality-mcp" rel="noopener noreferrer"&gt;idea-reality-mcp&lt;/a&gt; — an MCP server that searches real data before you build.&lt;/p&gt;

&lt;p&gt;One command. Five sources. Quantified signal.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"AI code review tool"
→ reality_signal: 90/100
→ 847 GitHub repos (top: reviewdog, 9,094 ⭐)
→ 254 Hacker News mentions
→ Verdict: "Extremely high existing coverage"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It searches GitHub, Hacker News, npm, PyPI, and Product Hunt in parallel, then returns a 0-100 reality signal based on actual API data — not LLM opinions.&lt;/p&gt;

&lt;h2&gt;
  
  
  We search. They guess.
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;ChatGPT / Copilot&lt;/th&gt;
&lt;th&gt;idea-reality-mcp&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Method&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Generates opinion from training data&lt;/td&gt;
&lt;td&gt;Searches live APIs in real-time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sources&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None (hallucination-prone)&lt;/td&gt;
&lt;td&gt;GitHub, HN, npm, PyPI, Product Hunt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"Great idea!" (usually)&lt;/td&gt;
&lt;td&gt;reality_signal: 73, 2,341 repos found&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Verifiable&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes — every number links to a real API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Speed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Instant&lt;/td&gt;
&lt;td&gt;~3 seconds (parallel async)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  How it works (30 seconds)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install&lt;/span&gt;
uvx idea-reality-mcp

&lt;span class="c"&gt;# Or add to Claude Desktop / Cursor config:&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"mcpServers"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"idea-reality"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="s2"&gt;"command"&lt;/span&gt;: &lt;span class="s2"&gt;"uvx"&lt;/span&gt;,
      &lt;span class="s2"&gt;"args"&lt;/span&gt;: &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"idea-reality-mcp"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then ask your AI agent: &lt;em&gt;"Check if anyone has built a CLI tool for DNS propagation monitoring"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The agent calls &lt;code&gt;idea_check&lt;/code&gt; automatically and gets back:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;reality_signal&lt;/strong&gt;: 0-100 score&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Top similar projects&lt;/strong&gt; with star counts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HN discussion&lt;/strong&gt; evidence&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pivot suggestions&lt;/strong&gt; if the space is crowded&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No API key needed. No account. No storage. It's a protocol, not a SaaS.&lt;/p&gt;

&lt;h2&gt;
  
  
  The scoring is intentionally simple
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Quick mode: GitHub repos (60%) + stars (20%) + HN mentions (20%)
Deep mode:  GitHub (25%) + stars (10%) + HN (15%) + npm (20%) + PyPI (15%) + PH (15%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every weight is documented. Every number comes from a real API call you can verify. No ML black box.&lt;/p&gt;

&lt;p&gt;I chose explainability over sophistication because when you're deciding whether to invest weeks into a project, you need to trust the data — not a magic number.&lt;/p&gt;
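&lt;p&gt;To make the "no black box" claim concrete, here is a sketch of quick mode as a plain weighted sum. The 60/20/20 weights are the documented ones; the saturation caps are illustrative assumptions, not the tool's actual normalization constants:&lt;/p&gt;

```python
def quick_signal(repo_count, top_stars, hn_mentions):
    """Combine quick-mode evidence into a 0-100 signal using the
    documented weights: repos 60%, stars 20%, HN mentions 20%."""
    def saturate(value, cap):
        # Map a raw count onto 0-1, flattening out beyond the cap.
        return min(value, cap) / cap

    score = (0.60 * saturate(repo_count, 500)
             + 0.20 * saturate(top_stars, 5000)
             + 0.20 * saturate(hn_mentions, 100))
    return round(score * 100)
```

&lt;p&gt;Every term is auditable: change one input count and you can predict exactly how the signal moves.&lt;/p&gt;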

&lt;h2&gt;
  
  
  Make your AI check automatically
&lt;/h2&gt;

&lt;p&gt;The most powerful pattern: add one line to your AI agent's instructions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For Claude Code&lt;/strong&gt; (&lt;code&gt;CLAUDE.md&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before starting any new project, run idea_check to verify the idea hasn't been built already.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;For Cursor&lt;/strong&gt; (&lt;code&gt;.cursorrules&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;When the user describes a new project idea, always run idea_check first.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now your agent will search before coding — every time, automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it found that surprised me
&lt;/h2&gt;

&lt;p&gt;Some results from real checks on the &lt;a href="https://mnemox.ai/check" rel="noopener noreferrer"&gt;web demo&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fymzxspahjjfs7ivz8pd4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fymzxspahjjfs7ivz8pd4.png" alt="idea-reality-mcp demo result" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;"MCP server for monitoring LLM calls"&lt;/strong&gt; → Signal 68. Turns out there are several observability tools, but none MCP-native. Worth building with differentiation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"AI-powered code review"&lt;/strong&gt; → Signal 90. Massively crowded. reviewdog alone has 9K stars. Don't.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Pet acupuncture booking app"&lt;/strong&gt; → Signal 12. Almost nothing exists. Niche, but the market might also be tiny.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The signal doesn't tell you whether to build — it tells you what you're walking into, backed by data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open source, zero storage
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;120 tests&lt;/strong&gt;, all passing&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MIT license&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero storage&lt;/strong&gt; — nothing is logged or saved&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero accounts&lt;/strong&gt; — no signup, no API key needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Works offline&lt;/strong&gt; (dictionary-based keyword extraction for MCP mode)&lt;/li&gt;
&lt;li&gt;Published on &lt;strong&gt;PyPI&lt;/strong&gt;, &lt;strong&gt;MCP Registry&lt;/strong&gt;, &lt;strong&gt;Smithery&lt;/strong&gt;, and 10+ directories&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The web demo at &lt;a href="https://mnemox.ai/check" rel="noopener noreferrer"&gt;mnemox.ai/check&lt;/a&gt; uses Claude Haiku for smarter keyword extraction, but the MCP tool itself needs zero external dependencies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/mnemox-ai/idea-reality-mcp" rel="noopener noreferrer"&gt;mnemox-ai/idea-reality-mcp&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Web demo&lt;/strong&gt;: &lt;a href="https://mnemox.ai/check" rel="noopener noreferrer"&gt;mnemox.ai/check&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Install&lt;/strong&gt;: &lt;code&gt;uvx idea-reality-mcp&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP Registry&lt;/strong&gt;: &lt;code&gt;io.github.mnemox-ai/idea-reality-mcp&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you use Claude Code or Cursor daily, add it to your agent instructions. It takes 30 seconds and saves hours.&lt;/p&gt;

&lt;p&gt;What's the worst "I should have searched first" moment you've had? Drop your idea in the comments — I'll run it through the tool and reply with the real numbers. 🔍&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://github.com/mnemox-ai" rel="noopener noreferrer"&gt;Sean&lt;/a&gt; at Mnemox.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>webdev</category>
      <category>mcp</category>
    </item>
    <item>
      <title>idea-reality-mcp v0.3.0: How We Built Chinese Language Support Into an MCP Server</title>
      <dc:creator>Sean  |   Mnemox</dc:creator>
      <pubDate>Thu, 26 Feb 2026 15:31:25 +0000</pubDate>
      <link>https://dev.to/mnemox/idea-reality-mcp-v030-how-we-built-chinese-language-support-into-an-mcp-server-4e32</link>
      <guid>https://dev.to/mnemox/idea-reality-mcp-v030-how-we-built-chinese-language-support-into-an-mcp-server-4e32</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg2jq4a7pmzs1vbyzskl2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg2jq4a7pmzs1vbyzskl2.png" alt=" " width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; — idea-reality-mcp is an MCP server that checks if your project idea already exists. v0.3.0 adds a 3-stage keyword extraction pipeline and full Chinese/mixed-language support (150+ term mappings across 15+ domains).&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;When users typed ideas in Chinese like &lt;code&gt;LINE Bot 自動客服系統&lt;/code&gt;, our v0.2 keyword extraction would either return raw Chinese characters or miss the intent entirely. Every search query was garbage.&lt;/p&gt;

&lt;p&gt;For a tool used by Taiwanese developers, this was unacceptable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: 3-Stage Pipeline
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Stage A: Clean &amp;amp; Map
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Map Chinese terms to English equivalents (150+ terms)&lt;/li&gt;
&lt;li&gt;Hard-filter boilerplate words (&lt;code&gt;ai&lt;/code&gt;, &lt;code&gt;tool&lt;/code&gt;, &lt;code&gt;platform&lt;/code&gt;, &lt;code&gt;system&lt;/code&gt;...)&lt;/li&gt;
&lt;li&gt;Normalize hyphens, extract compound terms&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Stage B: Intent Anchors
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Detect 1-2 intent signals from a curated set of 90+ anchors&lt;/li&gt;
&lt;li&gt;Covers: monitoring, evaluation, agents, RAG, scheduling, billing, scraping, deployment, and more&lt;/li&gt;
&lt;li&gt;Example: &lt;code&gt;排程任務管理工具&lt;/code&gt; → anchor: &lt;code&gt;scheduling&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Stage C: Synonym Expansion
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;80+ synonym groups generate 3-8 varied search queries&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;scheduling&lt;/code&gt; expands to: &lt;code&gt;cron&lt;/code&gt;, &lt;code&gt;job queue&lt;/code&gt;, &lt;code&gt;task scheduler&lt;/code&gt;, &lt;code&gt;periodic&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Avoids duplicate words (fixed a bug where &lt;code&gt;redis redis&lt;/code&gt; could appear)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Chinese Coverage
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;CHINESE_TECH_MAP&lt;/code&gt; isn't just tech terms. We mapped 150+ terms across 15+ domains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tech/SaaS&lt;/strong&gt;: 監控→monitoring, 爬蟲→scraping, 快取→caching&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Medical&lt;/strong&gt;: 中醫→tcm, 針灸→acupuncture, 病歷→medical record&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Legal&lt;/strong&gt;: 合約→contract, 律師→lawyer, 判決→court ruling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Education&lt;/strong&gt;: 教學→teaching, 考試→exam, 學習→learning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;And more&lt;/strong&gt;: agriculture, aerospace, religion, art, gaming, government...&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key design decisions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sort by key length (longest first) so &lt;code&gt;客戶關係&lt;/code&gt; matches before &lt;code&gt;客戶&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Never return raw Chinese — if we can't map it, we strip it cleanly&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;追蹤&lt;/code&gt; maps to &lt;code&gt;tracking&lt;/code&gt; (general), not &lt;code&gt;tracing&lt;/code&gt; (infra-specific)&lt;/li&gt;
&lt;/ul&gt;
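&lt;p&gt;The first two design decisions can be sketched in a few lines. The map slice here is illustrative (the real &lt;code&gt;CHINESE_TECH_MAP&lt;/code&gt; has 150+ entries), and the CJK range check is one simple way to enforce the no-leakage rule:&lt;/p&gt;

```python
# Illustrative slice of the mapping; the real CHINESE_TECH_MAP has 150+ entries.
CHINESE_TECH_MAP = {
    "客戶關係": "crm",
    "客戶": "customer",
    "監控": "monitoring",
    "排程": "scheduling",
}

def map_chinese_terms(text):
    # Longest key first, so 客戶關係 wins over its prefix 客戶.
    for zh in sorted(CHINESE_TECH_MAP, key=len, reverse=True):
        text = text.replace(zh, " " + CHINESE_TECH_MAP[zh] + " ")
    # Never leak raw Chinese: strip anything left in the CJK ideograph range.
    cleaned = "".join(c for c in text if ord(c) not in range(0x4E00, 0xA000))
    return " ".join(cleaned.split())
```

&lt;p&gt;So &lt;code&gt;客戶關係管理系統&lt;/code&gt; becomes &lt;code&gt;crm&lt;/code&gt; rather than &lt;code&gt;customer&lt;/code&gt; plus leftover characters.&lt;/p&gt;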

&lt;h2&gt;
  
  
  Quality Numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;pytest&lt;/td&gt;
&lt;td&gt;93/93 passing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Golden eval (54 ideas)&lt;/td&gt;
&lt;td&gt;100% anchor hit rate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Junk ratio&lt;/td&gt;
&lt;td&gt;4% average&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TW Chinese tests (99 cases)&lt;/td&gt;
&lt;td&gt;98%+ pass rate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chinese char leakage&lt;/td&gt;
&lt;td&gt;Zero&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install&lt;/span&gt;
uvx idea-reality-mcp

&lt;span class="c"&gt;# Or try online (no install)&lt;/span&gt;
&lt;span class="c"&gt;# https://mnemox.ai/check&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqei2xl7mnf3oi1899ocb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqei2xl7mnf3oi1899ocb.png" alt="Reality Check web interface with an input field showing an AI code review idea" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbjxllub0eb9hkkecbtfi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbjxllub0eb9hkkecbtfi.png" alt=" " width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fslrctry2cj2oqq9uo09r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fslrctry2cj2oqq9uo09r.png" alt="Reality Signal score of 90 with a red gauge indicating high competition" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4kgfw2c2jxm67uw5p1yf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4kgfw2c2jxm67uw5p1yf.png" alt="Evidence grid showing 664,818 GitHub repos, 24 HN posts, and 70,408 top stars with similar projects list" width="800" height="1212"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fezsad7dpa7zwiw500905.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fezsad7dpa7zwiw500905.gif" alt=" " width="640" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/mnemox-ai/idea-reality-mcp" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; — star if useful&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pypi.org/project/idea-reality-mcp/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt; — &lt;code&gt;pip install idea-reality-mcp&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/mnemox-ai/idea-reality-mcp/releases/tag/v0.3.0" rel="noopener noreferrer"&gt;Release Notes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mnemox.ai/check" rel="noopener noreferrer"&gt;Live Demo&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;MIT licensed. Built by &lt;a href="https://mnemox.ai" rel="noopener noreferrer"&gt;Mnemox AI&lt;/a&gt; in Taipei.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>mcp</category>
      <category>opensource</category>
    </item>
    <item>
      <title>From 2 sources to 5: How I upgraded my "idea reality check" MCP server in one day</title>
      <dc:creator>Sean  |   Mnemox</dc:creator>
      <pubDate>Wed, 25 Feb 2026 08:49:49 +0000</pubDate>
      <link>https://dev.to/mnemox/from-2-sources-to-5-how-i-upgraded-my-idea-reality-check-mcp-server-in-one-day-3gjh</link>
      <guid>https://dev.to/mnemox/from-2-sources-to-5-how-i-upgraded-my-idea-reality-check-mcp-server-in-one-day-3gjh</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl7az7oakpyx2l3c1bept.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl7az7oakpyx2l3c1bept.png" alt=" " width="726" height="752"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This is a follow-up to &lt;a href="https://dev.to/mnemox/stop-your-ai-agent-from-building-what-already-exists-2mdj"&gt;Stop Your AI Agent From Building What Already Exists&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  v0.1 had a blind spot
&lt;/h2&gt;

&lt;p&gt;Two weeks ago I shipped &lt;a href="https://github.com/mnemox-ai/idea-reality-mcp" rel="noopener noreferrer"&gt;idea-reality-mcp&lt;/a&gt; — an MCP server that checks if your idea already exists before your AI starts coding.&lt;/p&gt;

&lt;p&gt;It worked. But it only looked at two places: GitHub and Hacker News.&lt;/p&gt;

&lt;p&gt;That meant it missed entire categories. npm has 297,000+ packages related to MCP alone. PyPI has its own ecosystem. Product Hunt has thousands of launched products that never made it to GitHub.&lt;/p&gt;

&lt;p&gt;Two sources weren't enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  v0.2: five sources, one command
&lt;/h2&gt;

&lt;p&gt;The new version scans GitHub, Hacker News, npm, PyPI, and Product Hunt in parallel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uvx idea-reality-mcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;quick&lt;/strong&gt; — GitHub + HN only (fast, same as v0.1)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;deep&lt;/strong&gt; — all five sources at once via &lt;code&gt;asyncio.gather()&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
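That fan-out can be sketched in a few lines. The stub fetchers below are placeholders standing in for the real per-source clients (the shipped implementations query live APIs); only the `asyncio.gather()` pattern is the point:

```python
import asyncio

# Hypothetical per-source fetchers -- stubs for illustration only.
async def fetch_github(query): return {"source": "github", "count": 1359}
async def fetch_hn(query): return {"source": "hn", "count": 254}
async def fetch_npm(query): return {"source": "npm", "count": 120}
async def fetch_pypi(query): return {"source": "pypi", "count": 45}
async def fetch_producthunt(query): return {"source": "producthunt", "count": 12}

async def deep_scan(query: str):
    # All five sources are queried concurrently, so total latency is
    # roughly the slowest source, not the sum of all five.
    results = await asyncio.gather(
        fetch_github(query),
        fetch_hn(query),
        fetch_npm(query),
        fetch_pypi(query),
        fetch_producthunt(query),
    )
    return {r["source"]: r["count"] for r in results}

evidence = asyncio.run(deep_scan("AI trading bot for gold"))
```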

&lt;p&gt;Here's a real test with depth="deep":&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Query&lt;/td&gt;
&lt;td&gt;"AI trading bot for gold"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;reality_signal&lt;/td&gt;
&lt;td&gt;82 / 100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;duplicate_likelihood&lt;/td&gt;
&lt;td&gt;high&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sources_used&lt;/td&gt;
&lt;td&gt;GitHub + HN + npm + PyPI (PH skipped, no token)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitHub repos&lt;/td&gt;
&lt;td&gt;1,359&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HN mentions&lt;/td&gt;
&lt;td&gt;254 (across 3 keyword variants)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;top_similars&lt;/td&gt;
&lt;td&gt;GOLD_ORB (XAUUSD EA, 186 stars)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pivot_hints&lt;/td&gt;
&lt;td&gt;"Consider niche differentiator or plugin for existing tools"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;An 82 with 1,359 repos means: the space is crowded, but the tool also found a specific competitor (GOLD_ORB) that I could study before deciding whether to proceed.&lt;/p&gt;

&lt;h2&gt;
  
  
  What changed under the hood
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;New sources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;npm&lt;/strong&gt; — hits the registry JSON API (&lt;code&gt;/-/v1/search&lt;/code&gt;), free, no auth needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyPI&lt;/strong&gt; — scrapes search HTML with regex fallback (no official search API exists)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Product Hunt&lt;/strong&gt; — optional GraphQL v2, requires &lt;code&gt;PRODUCTHUNT_TOKEN&lt;/code&gt;. No token? Gracefully skipped, zero config stays zero config.&lt;/li&gt;
&lt;/ul&gt;
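To show how little the npm source needs, here's a hedged sketch against the registry's public search endpoint. `npm_search_url` and `npm_package_count` are illustrative helper names, not the project's actual client:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

NPM_SEARCH = "https://registry.npmjs.org/-/v1/search"

def npm_search_url(query: str, size: int = 5) -> str:
    # Plain JSON API: no token, no auth headers.
    return f"{NPM_SEARCH}?{urlencode({'text': query, 'size': size})}"

def npm_package_count(query: str) -> int:
    # Live call -- returns the registry's total match count for the query.
    with urlopen(npm_search_url(query)) as resp:
        return json.load(resp)["total"]

url = npm_search_url("mcp server")
```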

&lt;p&gt;&lt;strong&gt;Smarter keyword extraction:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;v0.1 just sorted words by length. v0.2 detects compound terms ("machine learning", "web app"), prioritizes technical keywords (React, Docker, FastAPI), and generates a 4th query variant optimized for registry searches.&lt;/p&gt;
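A rough sketch of that extraction order: compounds first, then known tech terms, then the v0.1-style length fallback. The compound-term and tech-keyword lists here are illustrative, not the shipped ones:

```python
# Illustrative vocabularies -- the real lists are larger.
COMPOUND_TERMS = {"machine learning", "web app", "trading bot"}
TECH_KEYWORDS = {"react", "docker", "fastapi", "mcp", "python"}

def extract_keywords(idea: str, limit: int = 4) -> list[str]:
    text = idea.lower()
    # 1) compound terms beat single words
    found = [t for t in COMPOUND_TERMS if t in text]
    words = [w.strip(".,") for w in text.split()]
    # 2) known technical keywords come next
    found += [w for w in words if w in TECH_KEYWORDS and w not in found]
    # 3) fall back to longest remaining words, v0.1-style
    rest = sorted((w for w in words if w not in found and len(w) > 3),
                  key=len, reverse=True)
    return (found + rest)[:limit]

keywords = extract_keywords("AI trading bot for gold with FastAPI")
```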

&lt;p&gt;&lt;strong&gt;New scoring weights for deep mode:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GitHub repos:  25%    GitHub stars: 10%
HN mentions:   15%    npm packages: 20%
PyPI packages: 15%    Product Hunt: 15%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If Product Hunt is unavailable, its weight redistributes automatically across the other sources.&lt;/p&gt;
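That redistribution amounts to renormalizing whatever weights remain. A minimal sketch, where the per-source sub-scores (0 to 1) are made-up inputs for the demo:

```python
DEEP_WEIGHTS = {
    "github_repos": 0.25, "github_stars": 0.10, "hn": 0.15,
    "npm": 0.20, "pypi": 0.15, "producthunt": 0.15,
}

def reality_signal(sub_scores: dict[str, float]) -> int:
    # Keep only sources that actually returned data, then rescale the
    # remaining weights so they still sum to 1.0.
    weights = {k: w for k, w in DEEP_WEIGHTS.items() if k in sub_scores}
    total = sum(weights.values())
    return round(sum(sub_scores[k] * w / total for k, w in weights.items()) * 100)

# Product Hunt missing (no token): its 15% spreads across the rest.
score = reality_signal({"github_repos": 0.9, "github_stars": 0.7,
                        "hn": 0.8, "npm": 0.85, "pypi": 0.6})
```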

&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;v0.1: 2 sources, 31 tests&lt;/li&gt;
&lt;li&gt;v0.2: 5 sources, 73 tests&lt;/li&gt;
&lt;li&gt;Still zero config for basic usage&lt;/li&gt;
&lt;li&gt;Still one install command&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;The scoring is better but still rule-based. v0.3 will likely add LLM-powered analysis for the "deep" mode — using the raw data from all five sources to generate a more nuanced assessment instead of just a weighted formula.&lt;/p&gt;

&lt;p&gt;If you're building with Claude, Cursor, or any MCP-compatible tool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uvx idea-reality-mcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://github.com/mnemox-ai/idea-reality-mcp" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; — MIT licensed, zero dependencies beyond Python.&lt;/p&gt;

&lt;p&gt;Built by &lt;a href="https://mnemox.ai" rel="noopener noreferrer"&gt;Mnemox&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Previously: &lt;a href="https://dev.to/mnemox/stop-your-ai-agent-from-building-what-already-exists-2mdj"&gt;Stop Your AI Agent From Building What Already Exists&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>opensource</category>
      <category>python</category>
    </item>
    <item>
      <title>Stop Your AI Agent From Building What Already Exists</title>
      <dc:creator>Sean  |   Mnemox</dc:creator>
      <pubDate>Tue, 24 Feb 2026 18:08:11 +0000</pubDate>
      <link>https://dev.to/mnemox/stop-your-ai-agent-from-building-what-already-exists-2mdj</link>
      <guid>https://dev.to/mnemox/stop-your-ai-agent-from-building-what-already-exists-2mdj</guid>
      <description>&lt;h2&gt;
  
  
  I wasted 6 hours building something that already had 847 GitHub repos
&lt;/h2&gt;

&lt;p&gt;Last month I told Claude: "Build me an AI-powered food recommendation engine."&lt;/p&gt;

&lt;p&gt;It did. Beautifully. Clean code, tests passing, README done.&lt;/p&gt;

&lt;p&gt;Then I searched GitHub. &lt;strong&gt;847 repos.&lt;/strong&gt; Twelve of them had over 100 stars. Some were updated &lt;em&gt;that same week&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;I had just mass-produced another clone.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem isn't coding speed — it's decision correctness
&lt;/h2&gt;

&lt;p&gt;Every AI coding tool in 2026 makes you build faster. Cursor, Claude Code, Copilot — they're all racing to write code at the speed of thought.&lt;/p&gt;

&lt;p&gt;But none of them ask the one question that matters:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Should you build this at all?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  So I built a reality check that lives inside the workflow
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/mnemox-ai/idea-reality-mcp" rel="noopener noreferrer"&gt;Idea Reality MCP&lt;/a&gt; is an MCP server — not a website, not a dashboard, not another SaaS validator.&lt;/p&gt;

&lt;p&gt;It's a tool your AI agent calls &lt;em&gt;before&lt;/em&gt; it starts building.&lt;/p&gt;

&lt;p&gt;Install:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uvx idea-reality-mcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add to Claude Desktop config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"idea-reality"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"uvx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"idea-reality-mcp"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then just tell Claude: "Check if this idea already exists before we build it."&lt;/p&gt;

&lt;h2&gt;
  
  
  What it returns
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"reality_signal"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;82&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"duplicate_likelihood"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"high"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"evidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"github"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"repo_count"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;847&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"github"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"high_star_repos"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"hn"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mention_count"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;34&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"top_similars"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"food-rec-ai"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"stars"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2340&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"updated"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-02-18"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"pivot_hints"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"Space is saturated. Consider vertical-specific targeting."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"Most existing tools are generic — niche wins."&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An 82 means: &lt;strong&gt;stop. Research first. Pivot or differentiate.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A 15 means: &lt;strong&gt;green light. The space is open.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why MCP, not a website?
&lt;/h2&gt;

&lt;p&gt;Idea validators already exist as websites — IdeaProof, ValidatorAI, DimeADozen, FounderPal. There are dozens.&lt;/p&gt;

&lt;p&gt;But they all require you to &lt;strong&gt;leave your workflow&lt;/strong&gt;, open a browser, type your idea, wait for a report, then go back to coding.&lt;/p&gt;

&lt;p&gt;That's the wrong architecture. The check should happen &lt;em&gt;inside the moment you decide to build&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;MCP makes this possible. Your AI agent calls &lt;code&gt;idea_check()&lt;/code&gt; the same way it calls any other tool. No context switch. No extra tab. No friction.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;IDEA → reality check → BUILD
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead of:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;IDEA → BUILD → discover competition → regret
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The scoring is intentionally simple
&lt;/h2&gt;

&lt;p&gt;v0.1.0 uses three signals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub repo count&lt;/strong&gt; (keyword search across 3 query variants)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub star/recency&lt;/strong&gt; (are top repos actively maintained?)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hacker News mentions&lt;/strong&gt; (has this been discussed in the last 12 months?)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Weighted formula: &lt;code&gt;(github_repos × 0.6) + (github_stars × 0.2) + (hn_mentions × 0.2)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Is it perfect? No. Is it better than zero signal? Absolutely.&lt;/p&gt;
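For concreteness, the formula as code, assuming each signal has already been normalized to a 0-100 scale (the example inputs are made up):

```python
# The v0.1.0 weighted formula: repo count dominates, stars and HN
# mentions each contribute a fifth.
def reality_signal_v01(github_repos: float, github_stars: float,
                       hn_mentions: float) -> float:
    return github_repos * 0.6 + github_stars * 0.2 + hn_mentions * 0.2

# A crowded space: the heavy repo signal pushes the score high.
score = reality_signal_v01(github_repos=95, github_stars=70, hn_mentions=60)
```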

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;This is v0.1.0. The roadmap includes ProductHunt scanning, deeper keyword extraction, and an opt-in "idea memory dataset" — a global record of what people have checked and what happened next.&lt;/p&gt;

&lt;p&gt;If you're building with Claude, Cursor, or any MCP-compatible tool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uvx idea-reality-mcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://github.com/mnemox-ai/idea-reality-mcp" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; — MIT licensed, zero dependencies beyond Python.&lt;/p&gt;

&lt;p&gt;Built by &lt;a href="https://mnemox.ai" rel="noopener noreferrer"&gt;Mnemox&lt;/a&gt; — we're building protocol-layer intelligence for AI builders.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Previously: &lt;a href="https://dev.to/mnemox/why-your-ai-trading-agent-needs-a-memory-and-how-we-built-one-kjo"&gt;Why Your AI Trading Agent Needs a Memory&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjs7h0pg2omwc4o7g61yi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjs7h0pg2omwc4o7g61yi.png" alt=" " width="800" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>opensource</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Why Your AI Trading Agent Needs a Memory — and How We Built One</title>
      <dc:creator>Sean  |   Mnemox</dc:creator>
      <pubDate>Mon, 23 Feb 2026 13:35:36 +0000</pubDate>
      <link>https://dev.to/mnemox/why-your-ai-trading-agent-needs-a-memory-and-how-we-built-one-kjo</link>
      <guid>https://dev.to/mnemox/why-your-ai-trading-agent-needs-a-memory-and-how-we-built-one-kjo</guid>
      <description>&lt;p&gt;Every AI trading assistant I've used has the same problem: amnesia.&lt;/p&gt;

&lt;p&gt;You ask Claude to analyze a gold trade. It gives you solid analysis — identifies the London session breakout, notes the resistance level, suggests a stop loss. Great.&lt;/p&gt;

&lt;p&gt;Next week, the exact same setup appears. And Claude has zero memory of what happened last time. Did that breakout work? Did the stop loss get hit? It doesn't know. It can't know.&lt;/p&gt;

&lt;p&gt;That's not how real traders think. A veteran trader carries thousands of pattern recognitions in their head. They call it "feel for the market" — but it's really just &lt;strong&gt;memory refined into judgment over time&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So I asked: what if we could give AI that same kind of memory?&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: AI Agents Are Stateless
&lt;/h2&gt;

&lt;p&gt;Most AI trading tools today work like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You give the AI some market data&lt;/li&gt;
&lt;li&gt;It analyzes and gives a recommendation&lt;/li&gt;
&lt;li&gt;The conversation ends&lt;/li&gt;
&lt;li&gt;Next time, it starts from zero&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There's no learning loop. No way for the AI to say "last time I saw this pattern in Asian session, it failed 4 out of 5 times — I should be cautious."&lt;/p&gt;

&lt;p&gt;Existing solutions don't solve this either. Trading journals are built for humans, not agents. Backtesting frameworks test strategies, but don't give the AI a persistent memory it can query in real-time.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: A 3-Layer Memory Architecture
&lt;/h2&gt;

&lt;p&gt;We built &lt;a href="https://github.com/mnemox-ai/tradememory-protocol" rel="noopener noreferrer"&gt;TradeMemory Protocol&lt;/a&gt; — an open-source memory layer for AI trading agents.&lt;/p&gt;

&lt;p&gt;It has three layers, inspired by how human traders actually develop expertise:&lt;/p&gt;

&lt;h3&gt;
  
  
  L1: Raw Trade Memory
&lt;/h3&gt;

&lt;p&gt;Every trade is automatically recorded with full context — entry price, exit price, stop loss, take profit, timeframe, session, outcome, and the AI's reasoning at the time.&lt;/p&gt;

&lt;p&gt;Think of it as a perfect trading journal that never forgets a detail.&lt;/p&gt;
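A sketch of what one L1 record might carry, covering the context listed above. Field names are illustrative, not the protocol's exact schema:

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class TradeRecord:
    symbol: str
    direction: str          # "long" or "short"
    entry: float
    exit: float
    stop_loss: float
    take_profit: float
    timeframe: str          # e.g. "M15"
    session: str            # "asian", "london", "ny"
    outcome: str            # "win" or "loss"
    reasoning: str          # the AI's rationale at decision time
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

trade = TradeRecord("XAUUSD", "long", 2910.5, 2924.0, 2904.0, 2926.0,
                    "M15", "london", "win",
                    "London breakout above resistance")
row = asdict(trade)   # ready to persist as one L1 journal entry
```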

&lt;h3&gt;
  
  
  L2: Pattern Memory
&lt;/h3&gt;

&lt;p&gt;This is where it gets interesting. A reflection engine periodically reviews L1 data and extracts patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"London session breakouts on XAUUSD: 73% win rate (n=41)"&lt;/li&gt;
&lt;li&gt;"Counter-trend entries during NFP: 23% win rate — avoid"&lt;/li&gt;
&lt;li&gt;"Pullback entries after strong trend days: 81% win rate when RSI &amp;lt; 40"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The AI discovers what works and what doesn't — from its own history.&lt;/p&gt;
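The L1-to-L2 step boils down to grouping closed trades by setup and computing win rates over each bucket. A toy version, with a made-up grouping key of session plus setup:

```python
from collections import defaultdict

def win_rates(trades: list[dict]) -> dict[str, tuple[float, int]]:
    # Bucket outcomes by (session, setup), then report win rate and
    # sample size per bucket -- the "73% win rate (n=41)" shape above.
    buckets = defaultdict(list)
    for t in trades:
        buckets[(t["session"], t["setup"])].append(t["outcome"] == "win")
    return {f"{s}/{p}": (sum(o) / len(o), len(o))
            for (s, p), o in buckets.items()}

history = (
    [{"session": "london", "setup": "breakout", "outcome": "win"}] * 3
    + [{"session": "london", "setup": "breakout", "outcome": "loss"}]
)
stats = win_rates(history)   # {'london/breakout': (0.75, 4)}
```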

&lt;h3&gt;
  
  
  L3: Strategy Memory
&lt;/h3&gt;

&lt;p&gt;L2 patterns get promoted into active strategy adjustments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Asian session detected → reduce position size by 0.8x (based on lower win rate)"&lt;/li&gt;
&lt;li&gt;"Strong trend day + pullback setup → increase confidence, full position"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is memory becoming real-time judgment. The AI equivalent of "feel for the market."&lt;/p&gt;

&lt;h2&gt;
  
  
  Why MCP?
&lt;/h2&gt;

&lt;p&gt;We built this on Anthropic's &lt;a href="https://modelcontextprotocol.io/" rel="noopener noreferrer"&gt;Model Context Protocol&lt;/a&gt; (MCP) because it solves distribution. Any MCP-compatible AI agent — Claude, GPT-based agents, open-source models — can plug into TradeMemory and immediately get persistent memory.&lt;/p&gt;

&lt;p&gt;The protocol exposes 7 tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;record_trade&lt;/code&gt; — Log a trade with full context&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;get_trade_history&lt;/code&gt; — Query past trades with filters&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;reflect_on_trades&lt;/code&gt; — Trigger pattern discovery&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;get_patterns&lt;/code&gt; — Retrieve discovered patterns&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;get_strategy_adjustments&lt;/code&gt; — Get real-time strategy modifications&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;get_memory_stats&lt;/code&gt; — Dashboard of memory state&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;search_memory&lt;/code&gt; — Semantic search across all memory layers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No API keys to manage, no separate dashboard. The AI talks directly to its own memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Reflection Engine
&lt;/h2&gt;

&lt;p&gt;The core innovation is the &lt;code&gt;ReflectionEngine&lt;/code&gt;. After enough L1 trades accumulate, it uses Claude's API to analyze the history and extract patterns.&lt;/p&gt;

&lt;p&gt;It's essentially the AI reflecting on its own decisions — what worked, what didn't, and why. The patterns it discovers get stored in L2, and the strongest patterns get promoted to L3 as active strategy adjustments.&lt;/p&gt;

&lt;p&gt;This is inspired by the &lt;a href="https://arxiv.org/abs/2303.11366" rel="noopener noreferrer"&gt;Reflexion framework&lt;/a&gt; and the &lt;a href="https://arxiv.org/abs/2311.13743" rel="noopener noreferrer"&gt;FinMem paper&lt;/a&gt; from 2023, which proved that layered memory architectures improve LLM trading performance. We took that academic insight and engineered it into a production-ready, pluggable protocol.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Usage
&lt;/h2&gt;

&lt;p&gt;My own quantitative trading system, NG_Gold (trading XAUUSD on MT5), is the first production user of TradeMemory Protocol. The system runs three strategies — VolBreakout, Pullback Entry, and IntradayMomentum — and every trade flows through the memory system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/mnemox-ai/tradememory-protocol.git
&lt;span class="nb"&gt;cd &lt;/span&gt;tradememory-protocol
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
python &lt;span class="nt"&gt;-m&lt;/span&gt; pytest tests/ &lt;span class="nt"&gt;-v&lt;/span&gt;  &lt;span class="c"&gt;# 36 tests passing&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See the full &lt;a href="https://github.com/mnemox-ai/tradememory-protocol/blob/master/docs/QUICK_START.md" rel="noopener noreferrer"&gt;Quick Start Guide&lt;/a&gt; for MCP integration setup.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Demo with real trade data flowing through L1 → L2 → L3&lt;/li&gt;
&lt;li&gt;More platform adapters (Interactive Brokers, crypto DEX)&lt;/li&gt;
&lt;li&gt;Multi-agent memory sharing (multiple AI agents learning from the same trade history)&lt;/li&gt;
&lt;li&gt;Community-contributed pattern libraries&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/mnemox-ai/tradememory-protocol" rel="noopener noreferrer"&gt;github.com/mnemox-ai/tradememory-protocol&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architecture docs&lt;/strong&gt;: &lt;a href="https://github.com/mnemox-ai/tradememory-protocol/blob/master/docs/ARCHITECTURE.md" rel="noopener noreferrer"&gt;ARCHITECTURE.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License&lt;/strong&gt;: MIT&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Built by &lt;a href="https://mnemox.ai" rel="noopener noreferrer"&gt;Mnemox&lt;/a&gt; in Taipei. We build memory infrastructure for AI agents.&lt;/p&gt;

&lt;p&gt;If you're working on AI trading agents or have ideas about what memory patterns would be useful, I'd love to hear from you in the comments or on GitHub.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>trading</category>
      <category>opensource</category>
      <category>python</category>
    </item>
  </channel>
</rss>
