By Sean, CEO of Mnemox AI | March 2026
Every AI trading bot has the same fatal flaw: amnesia.
There are 200+ trading MCP servers on GitHub right now. They can execute trades, pull market data, calculate indicators. But not a single one remembers what happened yesterday. Every session starts from zero. Every mistake gets repeated. Every lesson gets lost.
I spent two days running an experiment to fix this — and ended up discovering something I didn't expect at all.
The Deeper Question
The memory problem is real, but it's actually the second problem. The first one is more fundamental: why are we teaching AI how to trade at all?
Think about it. Every trading bot — from simple moving average crossovers to sophisticated ML systems — starts with a human saying "here's a strategy, go execute it." The human does the thinking. The AI does the labor. And when the strategy stops working (which it always does), the human has to go back, analyze what went wrong, redesign the strategy, and re-deploy.
What if we skipped the human part entirely?
Not "use machine learning to optimize parameters." I mean: give AI raw price data, give it persistent memory, give it no strategies whatsoever, and see if it can invent its own from scratch.
The idea isn't new. Google's AlphaEvolve uses evolutionary algorithms to discover novel solutions. The Ouroboros paper explored self-modifying agents. AZR (Absolute Zero Reasoner) showed that AI can bootstrap its own training data. DGM proposed Darwinian selection for agent populations. But nobody had applied this loop — observe, hypothesize, test, eliminate, evolve — to trading with persistent memory across sessions.
My hypothesis: an AI with memory and the freedom to fail will converge on real market structure faster than any hand-coded strategy.
The $0 Experiment
I started with the cheapest possible test — no trading capital, just API calls. Three months of BTC/USDT hourly candles (2,184 bars, December 2025 to March 2026). A bear market — BTC dropped 16% during this period.
I fed this raw data to Claude with a single instruction: "You don't know any technical indicators. Describe what you see in your own words."
No RSI. No MACD. No Bollinger Bands. Just price, volume, open, high, low, close.
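The setup can be sketched in a few lines. This is an illustrative reconstruction, not the actual Mnemox pipeline, and the candle values below are made up for demonstration:

```python
# Turn raw OHLCV candles into the indicator-free text fed to the model.
# Candle values here are hypothetical, for demonstration only.
candles = [
    # (timestamp, open, high, low, close, volume)
    ("2025-12-01T00:00Z", 96500.0, 96800.0, 96200.0, 96650.0, 1432.5),
    ("2025-12-01T01:00Z", 96650.0, 96700.0, 95900.0, 96010.0, 2011.8),
]

def candles_to_prompt(rows):
    """Render candles as plain text: no RSI, no MACD, just the six raw fields."""
    lines = ["time, open, high, low, close, volume"]
    for t, o, h, l, c, v in rows:
        lines.append(f"{t}, {o}, {h}, {l}, {c}, {v}")
    return "\n".join(lines)

prompt = (
    "You don't know any technical indicators. "
    "Describe what you see in your own words.\n\n"
    + candles_to_prompt(candles)
)
```

The whole point of the format is what it leaves out: no derived columns, no pre-chewed signals, just the six raw fields.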
It came back with seven patterns, each with its own name:
- Breathing (呼吸) — periodic expansion/contraction cycles
- Giant Wave (巨浪) — outsized candles that appear at turning points
- Staircase (階梯) — sequential directional moves
- Fake Door (假門) — false breakouts that reverse
- Exhaustion (枯竭) — declining momentum at trend ends
- Tide (潮汐) — time-of-day price flow patterns
- Echo (回聲) — price returning to prior levels
What made this interesting wasn't the patterns themselves — experienced traders would recognize most of these. What was interesting was what the AI did next: it scored each pattern for tradability and killed the weak ones. Staircase got 3/10. Fake Door got 4/10. Gone.
Nobody told it to do this. The prompt didn't mention anything about scoring or elimination. It just... decided some patterns weren't worth pursuing.
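The elimination step it improvised amounts to a score-and-filter pass. In this sketch, only the Staircase (3) and Fake Door (4) scores come from the experiment; the other scores and the cutoff of 5 are placeholders for illustration:

```python
# Score-and-filter pass over discovered patterns. Only Staircase (3) and
# Fake Door (4) are real scores from the experiment; the rest, and the
# cutoff, are assumed for illustration.
pattern_scores = {
    "Breathing": 7, "Giant Wave": 8, "Staircase": 3,
    "Fake Door": 4, "Exhaustion": 6, "Tide": 8, "Echo": 6,
}

def survivors(scores, cutoff=5):
    """Keep only patterns scored at or above the tradability cutoff."""
    return {name: s for name, s in scores.items() if s >= cutoff}
```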
Then it combined the surviving patterns into a trading strategy.
Round 1: Failure
The AI's first strategy was called "Giant Wave Reversal" (巨浪逆行): when an abnormally large candle appears, trade in the opposite direction.
Intuitively, this makes sense. After a big move, you'd expect a pullback. Hundreds of retail traders trade this exact pattern.
The backtest results:
| Metric | Result |
|---|---|
| Trades | 39 |
| Win Rate | 30.8% |
| Sharpe Ratio | -1.20 |
| Return | -0.21% |
Terrible. The strategy lost money.
But here's what matters: the system didn't just fail — it analyzed why it failed. Three specific causes:
- Momentum continuation — big candles often signal the start of a trend, not the end
- Stop loss structure — fixed-point stops were too tight for the volatility
- Counter-trend bias — fighting the trend is statistically unfavorable
No human provided this analysis. The AI looked at its own results, examined the losing trades, and identified structural flaws.
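For readers who want to sanity-check the numbers in the table above: a per-trade Sharpe is just mean over standard deviation of trade returns. The write-up doesn't specify its annualization convention, so this sketch leaves that as an optional parameter, and the sample returns are hypothetical:

```python
import math

def sharpe_ratio(trade_returns, periods_per_year=None):
    """Per-trade Sharpe: mean / stdev of returns. One common convention;
    the article does not specify how its Sharpe figures are annualized."""
    n = len(trade_returns)
    mean = sum(trade_returns) / n
    var = sum((r - mean) ** 2 for r in trade_returns) / (n - 1)
    s = mean / math.sqrt(var)
    if periods_per_year:
        s *= math.sqrt(periods_per_year)
    return s

# A losing book like Round 1 (hypothetical returns): a few TP hits, many SL hits.
rets = [0.005, -0.0025, -0.0025, -0.0025, 0.005, -0.0025, -0.0025]
s = sharpe_ratio(rets)  # negative: the book loses money on average
```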
Round 2: Evolution
I fed the failure analysis back into the system with the same raw data. "You tried counter-trend. It failed for these reasons. Look at the data again."
This time, three candidate strategies emerged:
| Strategy | Trades | Win Rate | Sharpe | Status |
|---|---|---|---|---|
| A: Ceiling Rejection | 6 | 50% | 0.74 | Sample too small |
| B: Trend Momentum | 67 | 35.8% | -1.40 | Eliminated |
| C: US Session Drain | 21 | 47.6% | 1.90 | Survived |
Strategy C — which the AI named "美盤洩洪" (US Session Drain) — was a breakthrough. The rules:
- Entry: 16:00 UTC, when the 12-hour trend is down → go short
- Exit: Take profit at +0.5%, stop loss at -0.25%, max hold 6 hours
- Risk/Reward: 2:1
Sharpe went from -1.20 to 1.90 in a single evolutionary cycle.
But any quant will tell you: in-sample results mean nothing. You can curve-fit garbage to look profitable on historical data. The real test is out-of-sample.
Out-of-Sample Validation
I ran Strategy C on a completely different 3-month period (August to November 2025) that the AI had never seen:
| Metric | In-Sample | Out-of-Sample |
|---|---|---|
| Trades | 21 | 27 |
| Win Rate | 47.6% | 59.3% |
| Sharpe | 1.90 | 4.09 |
| Profit Factor | 1.53 | 2.25 |
The out-of-sample results were better than in-sample. Every metric improved. This is the opposite of overfitting — it suggests the strategy captured a genuine market structure, not noise.
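Mechanically, out-of-sample validation just means the test window is fully disjoint from anything the model ever saw. A minimal date-based split (assuming ISO-8601 timestamps, which sort lexically):

```python
# Date-based in-sample / out-of-sample split. Assumes ISO-8601 timestamps,
# which sort correctly as strings. Candle tuples are hypothetical.
def split_by_date(candles, oos_start, oos_end):
    """candles: list of (iso_timestamp, ...) tuples.
    Everything inside [oos_start, oos_end) is held out."""
    in_sample = [c for c in candles if not (oos_start <= c[0] < oos_end)]
    out_sample = [c for c in candles if oos_start <= c[0] < oos_end]
    return in_sample, out_sample
```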
Can It Work in Bull Markets Too?
One strategy in one market regime proves nothing. So I ran the same process on bull market data: BTC going from $60K to $105K over four months (October 2024 to January 2025).
Same rules: raw data, no indicators, no guidance. Just "look and learn."
The AI discovered different patterns this time — waterfalls, valley springs, Asian fountains. But one stood out: Afternoon Engine (午後引擎). At 14:00 UTC, something happens: summed across the test period, that single hour's candles accumulated +14.9%, far more than any other hour.
Strategy E's rules:
- Entry: 14:00 UTC, when the 12-hour trend is up → go long
- Exit: TP +0.5%, SL -0.25%, max hold 6 hours
- Risk/Reward: 2:1
First-round results: 70 trades, 50% win rate, Sharpe 4.97.
It didn't need a second round. The bull market has stronger structural bias, so the AI hit on the first try.
The Surprising Part
I validated Strategy E on a downtrending market (June to September 2024, BTC -6.2%). The 14:00 UTC hour actually lost money during this period (-5.84% cumulative). The raw time-of-day edge disappeared.
But Strategy E still profited: 57 trades, 56.1% win rate, Sharpe 6.06.
Why? Because the 12-hour trend filter blocked almost all counter-trend signals. The edge isn't "trade at 14:00 UTC." The edge is "trade at 14:00 UTC when the trend agrees." The trend filter is the alpha source, not the time window.
(A Sharpe above 6 looks suspicious — and it should. The number is inflated by ultra-short holding periods and the 2:1 RR structure filtering out most losing scenarios. It's directionally meaningful, not a production-grade Sharpe. Take it as "this works" rather than "this is a 6-Sharpe strategy.")
The AI figured this out without being told. It didn't just discover a correlation — it discovered the mechanism.
The Meta-Pattern
Here's where it gets genuinely interesting.
Strategy C and Strategy E were invented independently, from different datasets, in different market regimes (bear vs. bull). Yet they converged on the same structural template:
- Time-of-day bias — specific UTC hours carry persistent directional edge
- Trend filter — 12-hour trend confirmation before entry
- Short holding period — max 6 hours, in-and-out
- Asymmetric risk/reward — 2:1 TP/SL guarantees positive expectancy at 50% win rate
This meta-pattern was not programmed. It was not suggested. It emerged from two independent evolution cycles. When two completely separate experiments converge on the same solution, that's strong evidence of underlying structure.
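The convergence is easiest to see when you write the template as a single parameterized rule; C and E are then just two instantiations. This framing is mine, not code from the repo:

```python
# The shared template as one parameterized rule (my framing, not repo code).
# Strategy C and Strategy E differ only in hour and direction.
from dataclasses import dataclass

@dataclass(frozen=True)
class TimeOfDayStrategy:
    entry_hour_utc: int        # time-of-day bias
    direction: str             # "long" or "short"
    take_profit: float = 0.005
    stop_loss: float = 0.0025  # 2:1 asymmetric RR
    max_hold_hours: int = 6    # short holding period

    def wants_entry(self, hour_utc: int, trend_12h: float) -> bool:
        """Trade the biased hour only when the 12h trend agrees."""
        aligned = trend_12h > 0 if self.direction == "long" else trend_12h < 0
        return hour_utc == self.entry_hour_utc and aligned

strategy_c = TimeOfDayStrategy(entry_hour_utc=16, direction="short")
strategy_e = TimeOfDayStrategy(entry_hour_utc=14, direction="long")

# At a 50% win rate, the 2:1 RR gives positive expectancy per trade:
# 0.5 * 0.005 - 0.5 * 0.0025 = +0.00125 before costs.
```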
The Combined System
Running both strategies together over 22 months (June 2024 to March 2026), spanning a complete bull-to-bear cycle:
| System | Trades | Win Rate | Sharpe | Return | Max Drawdown |
|---|---|---|---|---|---|
| C Only (SHORT) | 157 | 42.7% | 0.70 | +0.37% | 0.45% |
| E Only (LONG) | 320 | 49.4% | 4.10 | +3.65% | 0.27% |
| C+E Combined | 477 | 47.2% | 3.84 | +4.04% | 0.22% |
Key findings:
- 91% of months were profitable (20 out of 22)
- Max drawdown 0.22% — lower than either strategy alone (natural hedging)
- No human-designed entry logic. The AI chose which hours to trade and in which direction; the framework (2:1 RR, 6-hour max hold, ATR-based stops) was provided by the backtest engine. The what and when came from the AI; the risk-management structure came from me
- Strategy E is the engine (90% of profit). Strategy C is a diversifier
The long/short combination creates a natural hedge. When the market trends up, E captures profits going long. When it trends down, C captures profits going short. Drawdown improves when combined.
From Experiment to Product
The manual process — give AI data, analyze patterns, backtest, evolve — took about a day of hands-on work per strategy. Interesting as a research exercise, but not scalable.
So I automated the entire loop into what I call the Evolution Engine:
- Discover — LLM analyzes raw price data, proposes candidate strategies
- Backtest — vectorized engine tests each candidate (ATR-based stops, long/short, time-based exit)
- Select — in-sample ranking, then out-of-sample validation (Sharpe > 1.0, trades > 30, max DD < 20%)
- Evolve — survivors get mutated, failures go to the graveyard (but their lessons persist). Next generation. Repeat.
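The four steps reduce to a compact loop. Everything below is a stub with hypothetical names standing in for the real Evolution Engine components (the real `discover` step is an LLM call, and the real gate also checks max drawdown, omitted here for brevity):

```python
# Skeleton of the discover -> backtest -> select -> evolve loop.
# All functions are stubs with hypothetical names; the real discover step
# is an LLM call, and the real gate also enforces max DD < 20%.
import random

def discover(graveyard, n=5):
    """Stand-in for the LLM step: propose (hour, direction) candidates,
    skipping combinations already buried in the graveyard."""
    dead = {(g["hour"], g["direction"]) for g in graveyard}
    pool = [(h, d) for h in range(24) for d in ("long", "short")
            if (h, d) not in dead]
    picks = random.sample(pool, min(n, len(pool)))
    return [{"hour": h, "direction": d} for h, d in picks]

def backtest(candidate):
    """Stand-in for the vectorized engine: returns fake metrics."""
    return {"sharpe": random.uniform(-2, 3), "trades": random.randint(5, 80)}

def evolve(generations=3, graveyard=None):
    graveyard = graveyard if graveyard is not None else []
    survivors = []
    for _ in range(generations):
        for cand in discover(graveyard):
            m = backtest(cand)
            # Selection gate from the article: Sharpe > 1.0 and > 30 trades.
            if m["sharpe"] > 1.0 and m["trades"] > 30:
                survivors.append({**cand, **m})
            else:
                graveyard.append({**cand, **m})  # lessons persist
    return survivors, graveyard

winners, dead = evolve()
```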
The Evolution Engine runs on top of Outcome-Weighted Memory (OWM) — a five-layer memory architecture (episodic, semantic, procedural, prospective, affective) that gives the AI persistent recall across sessions. Each memory gets scored by outcome quality, context similarity, and recency when recalled — inspired by ACT-R cognitive architecture and Kelly criterion. The details are in the repo if you're curious; the key point is that the AI doesn't just remember what happened, it remembers how relevant each memory is to the current situation.
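To make the recall scoring concrete, here is an illustrative weighting in the spirit of OWM — not the actual formula from the repo, and the weights and half-life are assumptions:

```python
# Illustrative recall scoring in the spirit of OWM (not the actual formula).
# Weights and the one-week half-life are assumptions for demonstration.
def recall_score(outcome_quality, similarity, age_hours,
                 w_outcome=0.4, w_sim=0.4, w_recency=0.2, half_life=168.0):
    """outcome_quality and similarity in [0, 1]; recency decays exponentially."""
    recency = 0.5 ** (age_hours / half_life)
    return w_outcome * outcome_quality + w_sim * similarity + w_recency * recency
```

The key property: a recent, relevant memory of a good outcome outscores a stale, irrelevant memory of a bad one, so the most useful lessons surface first.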
Model Comparison
I ran the automated pipeline with three Claude models on real Binance data:
| Model | Cost/Run | Speed | Strategies Graduated | Verdict |
|---|---|---|---|---|
| Haiku | $0.016 | 34.7s | 2 | Best so far |
| Sonnet | $0.013 | 51.9s | 1 | Solid |
| Opus | $0.013 | 72.4s | 1 | Slowest |
Caveat: this is a small sample — a handful of runs per model. But the early signal is counterintuitive: the cheapest, fastest model produced the most graduated strategies. My working theory is that speed and diversity matter more than depth of reasoning for creative pattern discovery. A full evolution cycle costs less than two cents.
The most compelling finding: the automated pipeline independently rediscovered 16:00 UTC as a key trading hour — the same edge that the manual experiments found. Convergent validation from a completely different process.
Known Bottlenecks
The system isn't perfect. Two issues I'm actively working on:
Prompt over-concretization — all three models tend to lock onto very specific conditions (e.g., "hour_utc == 16 AND atr > 2.5"). This produces strategies that trigger too rarely for statistical significance. The graduated strategies had only 2 trades in out-of-sample, far below the 30-trade minimum for confidence.
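The statistics behind the 30-trade minimum are easy to show. With the crude normal approximation (admittedly poor at tiny n, which is exactly the point), the 95% confidence interval on a win rate estimated from 2 trades is wider than the entire [0, 1] range:

```python
# Why 2 out-of-sample trades can't establish an edge: the normal-approximation
# 95% CI on a win rate is enormous at tiny sample sizes. The approximation is
# crude at n=2, which only strengthens the point.
import math

def win_rate_ci_width(wins, trades, z=1.96):
    """Total width of the normal-approximation 95% CI for a win rate."""
    p = wins / trades
    return 2 * z * math.sqrt(p * (1 - p) / trades)

print(round(win_rate_ci_width(1, 2), 2))    # 1.39 — wider than [0, 1] itself
print(round(win_rate_ci_width(15, 30), 2))  # 0.36 — still wide, but usable
```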
Graveyard feedback depth — eliminated strategies get stored, but the feedback loop from graveyard → next generation isn't rich enough yet. The AI knows that a strategy failed, but doesn't fully leverage why.
What I Learned
1. AI doesn't need to be taught strategies. It needs memory and permission to fail.
The biggest bottleneck in AI trading isn't model capability — it's the assumption that humans must provide the strategy. Give the AI raw data and a feedback loop, and it finds structure faster than any hand-designed system.
2. Objective feedback (P&L) beats prompt engineering.
I tried various prompt strategies for pattern discovery. None of them mattered as much as simply feeding back the backtest results. "Return -0.21%, Sharpe -1.20" is more useful than ten paragraphs of trading wisdom.
3. The speed of evolution depends on the quality of failure, not the quantity of success.
Strategy C only exists because Strategy "Giant Wave Reversal" failed spectacularly and the AI could analyze why. A clean failure with clear attribution is more valuable than a marginal success.
4. Meta-patterns are the real prize.
Individual strategies are nice. But the discovery that two independent evolution cycles converged on the same structural template (time bias + trend filter + short hold + asymmetric RR) — that's worth more than any single strategy. It suggests a universal regularity in how markets behave.
5. One person + Claude Code can go from hypothesis to working product in a day.
The entire pipeline — research, backtest, analysis, Evolution Engine code, OWM memory architecture, 1,055 tests, MCP server, open source release — was built in 48 hours by one person with an AI coding assistant. That's the part I still have trouble believing.
Try It Yourself
TradeMemory Protocol is open source. The Evolution Engine, OWM memory architecture, and all 11 experiments documented in RESEARCH_LOG.md are available today.
```shell
pip install tradememory-protocol
```
I'm not claiming this is a finished product. The over-concretization problem is real. The automated pipeline needs more diverse hypothesis generation. But the core insight — that AI can discover its own trading strategies through evolutionary memory — is validated.
If you're building AI agents that make decisions in uncertain environments, the memory problem is yours too. Trading is just the most measurable version of it.
Want to poke holes in the methodology? The full research log with every backtest number, every eliminated strategy, and every failed hypothesis is public. I'd rather get useful criticism now than discover blind spots later.
TradeMemory Protocol: github.com/mnemox-ai/tradememory-protocol
Full research data: RESEARCH_LOG.md
Questions, feedback, or want to run your own evolution experiment? Open an issue on GitHub or find me on Mnemox AI.
Top comments (4)
Fascinating experiment. The behavioral differences between models (impulsive vs selective vs frozen) are especially interesting.
It almost feels like you’re observing different “trader personalities” emerging from the same prompt and dataset.
Curious — do you think adding long-term memory will stabilize decision behavior, or could it actually amplify biases over time?
Great question — and honestly, both happen.
In our experiments, memory did stabilize behavior: Strategy C only exists because the AI remembered why its first attempt (Giant Wave Reversal) failed, and course-corrected in one cycle. Without that memory, it would just reinvent the same losing strategy forever.
But the bias amplification risk is real. We already see it in the automated pipeline — what I call "prompt over-concretization." The AI locks onto one very specific condition (e.g., "only trade at exactly 16:00 UTC when ATR > 2.5") and keeps reinforcing it because early results looked good. It's basically confirmation bias with a feedback loop.
The "trader personalities" observation is spot-on. Haiku is impulsive and diverse — it throws out many hypotheses fast. Opus is cautious and tends to overthink into narrow conditions. Same data, same prompt, very different "temperaments." I didn't expect that at all.
Our current approach: the graveyard system stores why strategies died, not just that they died. The idea is that failure memory acts as a counterweight to success bias. But the feedback depth isn't rich enough yet — that's the main thing I'm working on for v0.6.0.
If you're curious about the raw data behind this, the full research log is public: RESEARCH_LOG.md
What does the graveyard feedback loop look like technically?
The current implementation has three layers:
Death Certificate — When a strategy fails (Sharpe < 1.0, or < 30 trades in out-of-sample), it gets stored with metadata: what it tried, what the backtest numbers were, and a tagged failure reason (e.g., overfitted, insufficient_trades, negative_expectancy).
Graveyard Query — Before generating new candidates, the Evolution Engine pulls recent failures and includes them in the LLM prompt: "These strategies were tried and failed for these reasons. Don't repeat them." This is the basic "don't touch the hot stove twice" mechanism.
What's Missing — Right now the graveyard only passes what failed and the tag. It doesn't pass the full backtest curve, the specific market conditions where it broke down, or structural analysis of why the failure mode exists. So the AI knows "counter-trend failed" but doesn't get "counter-trend failed specifically because momentum continuation dominated in hours 14-18 UTC during high-volume sessions."
The v0.6.0 plan is to add what I'm calling failure autopsies — the system will run a mini-analysis on each dead strategy (which trades killed it, what market regime, what the winning trades had in common vs. the losers) and feed that structured analysis into the next generation's prompt.
The hypothesis: richer failure memory → more diverse next-generation hypotheses → fewer repeated mistakes. Basically turning each death into a textbook instead of a tombstone.
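A minimal sketch of the death-certificate idea, under assumed field names (the real schema lives in the repo):

```python
# Death-certificate sketch under assumed field names; the real schema
# is in the repo. A graveyard entry records what died and why.
from dataclasses import dataclass

@dataclass
class DeathCertificate:
    strategy_desc: str
    sharpe: float
    trades: int
    failure_reason: str  # e.g. "overfitted", "insufficient_trades"

def graveyard_prompt(graveyard):
    """Fold recent failures into the next generation's LLM prompt."""
    lines = ["These strategies were tried and failed. Don't repeat them:"]
    for dc in graveyard:
        lines.append(f"- {dc.strategy_desc}: Sharpe {dc.sharpe:.2f}, "
                     f"{dc.trades} trades, reason: {dc.failure_reason}")
    return "\n".join(lines)

dead = [DeathCertificate("counter-trend on giant candles", -1.20, 39,
                         "negative_expectancy")]
```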
The architecture diagram is in the repo if you want to see how the layers connect: Evolution Engine source