
Whetlan
I Asked an LLM to Generate 20 Trading Strategies. 14 Were the Same Thing.

A few months ago I asked an LLM to generate twenty trading strategies.

Fourteen were the same thing.

Not similar ideas. Not variations on a theme. The same mean-reversion logic with different lookback windows and parameter names.

I gave it historical price data, told it to find patterns, output entry/exit rules in Python. Ten minutes later I had twenty strategies. Clean code, proper docstrings, sensible-looking parameters.

I backtested all twenty. Twelve looked profitable. Some showed 200%+ annual returns.

Then I actually read the code.

Same structure. Same assumptions. Same failure mode: in a trending market, they'd all keep buying into a falling asset with no awareness anything had changed.

That's when I stopped thinking of LLMs as strategy generators and started thinking of them as very confident interns who hand you the same report twenty times with different cover pages.

People are giving these things real money now

Since then I've watched this move from experiment to actual deployment.

RockAlpha has LLMs managing $100K stock portfolios that retail users can copy-trade. Aster DEX ran an arena where AI agents traded against humans; 43% of the human accounts got liquidated, versus 0% of the AIs. On GitHub, ai-hedge-fund has 56K stars and LLM personas of Warren Buffett and Charlie Munger debating trades.

Last October a company called nof1.ai gave six frontier LLMs $10,000 each in real money and let them trade crypto perpetual contracts. No human intervention. Same prompts, same data feeds, same execution terms. Seventeen days.

Two made money. Four got destroyed.

It's not the model. It's what the model sees.

Nof1 ran a second round on US stocks. A mystery model won with +12.1%, later revealed as Grok 4.20.

It didn't win because it was smarter. It was processing 68 million tweets per day through the X Firehose, generating signals within minutes. GPT-5.1 had 15-minute delayed news summaries. Gemini was working from SEC filings with 30+ minute delays.

RockAlpha's ongoing arena shows the same pattern. DeepSeek keeps outperforming across market regimes. Not because it reasons better, but because High-Flyer Quant built the model with time-series data and risk management baked in from training.

The model matters less than what it was trained on and what it's looking at.

People comparing "GPT vs Claude vs Gemini" for trading are asking the wrong question.

Three generations of trading software

I worked on an MT4 bridge once. About 50,000 lines of C++. FIX protocol integration, order routing, the whole thing.

MQL4 was designed for people who think in moving averages and crossover signals. You write a rule, attach it to a chart, watch it go. That was Gen 1: human writes the rule, machine executes it.

But here's what people forget. Before MT4, indicators like moving averages and RSI were tribal knowledge. Stuff you picked up from other traders, maybe a book if you were lucky. MT4 turned that into reusable components. Click a button, get an RSI.

That's Gen 1's real legacy. It crystallized the indicators.


Gen 2 built on top of those indicators.

Frameworks like vnpy (14K stars), backtrader, freqtrade. Python-based. You define your indicators, the framework runs genetic algorithms or grid search to find the best parameters.

Gen 2 crystallized the strategy. Not just individual indicators, but combinations of them—entry rules, exit rules, position sizing, all packaged into something reusable.

I evaluated a few of these when I was looking at Python quant stacks. They work, technically. But Python's GIL kills you on anything latency-sensitive, and the optimization loop is a trap. You can always find parameters that backtest beautifully.
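To make that trap concrete, here's a toy sketch (all numbers and parameter grids are mine, not from any real framework): run a small grid search for a z-score mean-reversion rule on a pure random walk, where by construction there is no edge to find, and the best in-sample parameters still tend to look profitable.

```python
import numpy as np

rng = np.random.default_rng(0)
prices = np.cumsum(rng.standard_normal(2000)) + 100.0  # pure random walk: no real edge

def backtest(prices, window, threshold):
    """In-sample PnL of a long-only z-score mean-reversion rule, held one step."""
    pnl = 0.0
    for t in range(window, len(prices) - 1):
        win = prices[t - window:t]
        z = (prices[t] - win.mean()) / (win.std() + 1e-9)
        if z < -threshold:                 # "oversold": buy, hold one step
            pnl += prices[t + 1] - prices[t]
    return pnl

# 20 parameter pairs -- the kind of sweep a naive optimizer would run
results = {(w, th): backtest(prices, w, th)
           for w in (5, 10, 14, 20, 30)
           for th in (1.0, 1.5, 2.0, 2.5)}
best_params, best_pnl = max(results.items(), key=lambda kv: kv[1])
print(best_params, round(best_pnl, 2))
```

The point isn't the specific numbers; it's that picking the best of even twenty variants on noise almost always produces something that looks like a strategy.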


Gen 3: ML enters. QuantConnect, WorldQuant BRAIN.

Instead of optimizing parameters on fixed indicators, you build a feature pool and let XGBoost or LightGBM figure out which combinations matter.

Gen 3 crystallized the system. The whole pipeline. Data ingestion, feature engineering, model training, risk management, execution.

What Renaissance Technologies needed hundreds of PhDs and decades to build, QuantConnect turned into a platform.


Each generation left something behind for the next one to stand on. Indicators. Strategies. Systems.

Each one also hit the same wall: the gap between backtest and live performance.

The interesting question isn't why they all hit this wall. It's what happens when someone can build on all three layers at once.

Two failure modes

After watching arena results and looking at how LLMs behave with financial data, I think there are two distinct failure modes that keep showing up.

Strategy Hallucination

The LLM generates strategies that look structurally valid but encode no real market insight.

My twenty mean-reversion clones were this. Proper entry/exit logic, proper position sizing, proper risk management. Also all exploiting the same artifact in the training data.

A human quant would have caught it in five minutes. I caught it in two hours. Someone less experienced might not catch it at all.

The arena results suggest this happens at model level too. GPT-5 and Gemini generated plausible-looking trading behavior that fell apart under real conditions. The strategies "made sense" the way a hallucinated Wikipedia article makes sense. Coherent. Confident. Wrong.

Backtest Overfitting Blindness

The LLM doesn't understand that a beautiful backtest is a warning sign, not a success metric.

When I asked it to generate strategies with "strong backtesting performance," it optimized for exactly that. Curve-fitted parameters, lookahead bias in feature construction, survivorship bias in asset selection. Every quant knows these traps. The LLM walked into all of them with total confidence.

Here's what one looked like:

# What the LLM generated (looks clean):
import pandas as pd

def signal(prices: pd.Series, window=14, threshold=2.0):
    zscore = (prices - prices.rolling(window).mean()) / prices.rolling(window).std()
    return zscore < -threshold  # buy when "oversold"

# What it didn't tell you:
# - window=14 was fit to this specific dataset
# - threshold=2.0 maximized backtest returns
# - this exact pattern appears in 14 of 20 "different" strategies
# - in a trending market, zscore stays below -threshold for weeks
#   and you keep buying into a falling knife

These two failure modes compound. The LLM hallucinates strategies, then fits them perfectly to historical data. Results look incredible on paper.

The worst part: the more strategies you generate, the more likely it is that at least one shows amazing backtest results purely by chance.
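That selection effect is easy to demonstrate with a toy simulation (this is synthetic, not data from any arena): give a couple hundred strategies purely random daily returns, and the best one still posts a Sharpe ratio that would impress anyone.

```python
import numpy as np

rng = np.random.default_rng(42)
n_strategies, n_days = 200, 252

# Every "strategy" is coin-flip noise: zero true edge by construction
daily = rng.normal(0.0, 0.01, size=(n_strategies, n_days))
sharpe = daily.mean(axis=1) / daily.std(axis=1) * np.sqrt(252)

print(f"average Sharpe: {sharpe.mean():+.2f}")          # hovers near zero, as it should
print(f"best of {n_strategies}: {sharpe.max():+.2f}")   # the one you'd get excited about
```

The average is honest. The maximum is a lie you told yourself by searching.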

What seems to work so far

I don't have this figured out yet. But after the twenty-clones incident and watching six months of arena results, a few things seem consistent.

Don't let the LLM pick parameters

Use it to generate structure—indicator combinations, entry logic, risk rules. Then run parameter optimization through something that understands walk-forward testing, out-of-sample validation, and transaction costs.

The LLM proposes. Something that can actually do math evaluates.
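A minimal sketch of that split of responsibilities (function names and the toy backtest are mine, not any particular framework's): the parameter search only ever sees the training window, and the score you trust comes only from the window after it.

```python
import numpy as np

def backtest(prices, window, threshold):
    """Toy stand-in for strategy evaluation: PnL of a z-score mean-reversion rule."""
    pnl = 0.0
    for t in range(window, len(prices) - 1):
        win = prices[t - window:t]
        z = (prices[t] - win.mean()) / (win.std() + 1e-9)
        if z < -threshold:
            pnl += prices[t + 1] - prices[t]
    return pnl

def walk_forward(prices, param_grid, train=500, test=100):
    """Fit parameters on each train window, score only on the window after it."""
    oos = []
    for start in range(0, len(prices) - train - test, test):
        fit = prices[start:start + train]
        hold = prices[start + train:start + train + test]
        best = max(param_grid, key=lambda p: backtest(fit, *p))  # chosen in-sample only
        oos.append(backtest(hold, *best))                        # judged out-of-sample
    return oos

rng = np.random.default_rng(7)
prices = np.cumsum(rng.standard_normal(2000)) + 100.0
grid = [(w, th) for w in (5, 14, 30) for th in (1.0, 2.0)]
oos = walk_forward(prices, grid)
print(len(oos), "out-of-sample folds, total PnL", round(sum(oos), 2))
```

The LLM can propose the structure inside `backtest`. It never gets to touch the loop that decides whether the structure survives.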

Treat outputs as hypotheses

This sounds obvious, but it's not how most people use them.

When the LLM hands you a strategy with a 180% annual return, the natural reaction is to start looking for reasons it might work. Flip it. Start by assuming it doesn't work and look for reasons the backtest is lying to you.

Check for actual diversity

Before running a batch of LLM strategies, cluster them.

If your "50 different strategies" collapse into four underlying patterns, you don't have diversification. You have four strategies wearing costumes.

I should have done this before getting excited about twelve profitable backtests.
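One cheap way to run that check (a sketch; the synthetic signals below just recreate my fourteen-clones situation): compute pairwise correlations between each strategy's position series and greedily group anything that correlates strongly.

```python
import numpy as np

def cluster_by_correlation(signals, threshold=0.5):
    """Greedy grouping: a strategy joins the first cluster whose
    representative it correlates with above `threshold`."""
    corr = np.corrcoef(signals)
    reps, labels = [], []
    for i in range(len(signals)):
        for c, rep in enumerate(reps):
            if corr[i, rep] > threshold:
                labels.append(c)
                break
        else:                          # no existing cluster fits: start a new one
            reps.append(i)
            labels.append(len(reps) - 1)
    return labels

# Synthetic stand-ins: 14 near-clones of one pattern plus 6 genuinely different ones
rng = np.random.default_rng(1)
base = rng.integers(0, 2, size=500).astype(float)    # shared long/flat position pattern
clones = [np.where(rng.random(500) < 0.05, 1 - base, base) for _ in range(14)]
others = [rng.integers(0, 2, size=500).astype(float) for _ in range(6)]
labels = cluster_by_correlation(np.array(clones + others))
print(len(set(labels)), "real clusters among", len(labels), "strategies")
```

If the cluster count comes back far below the strategy count, stop and read the code before you backtest anything.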

What I keep thinking about

I wrote something a while back about a metaphor that stuck with me. A rider on an ox, carrying a trident. The ox is raw power, the trident is precision.

The three generations of trading software? That's the trident being forged. Indicators, strategies, systems. Each layer more precise than the last.

The arenas skipped all of it. Gave the LLM money and said go. Most crashed.

But here's what keeps me thinking. Building a Gen 3 system used to require a team, serious funding, years of work. The kind of thing only a Medallion Fund or a QuantConnect could pull off.

With AI, one person can realistically assemble their own version. Not a toy version. An actual pipeline with data ingestion, feature engineering, walk-forward validation, risk controls.

AI doesn't replace the three generations of accumulated knowledge. It makes them accessible.

I don't know what Gen 4 looks like yet. But if enough people are standing on Gen 3 instead of being locked out of it, we'll probably find out faster than anyone expects.

I've been building in this direction. Slowly figuring out what the right pieces are. Still making mistakes, but at least they're new mistakes.

I might be overfitting my own conclusions here. But from what I've seen, the pattern holds.

What does your setup look like? Has anyone else tried running LLM-generated strategies through traditional backtesting infrastructure and actually survived? Curious what failure modes you hit.
