I spent the better part of six months trying to make Deep Reinforcement Learning work for stock trading.
PPO, SAC, A2C — I tried them all. Tweaked reward functions until 3 AM. Watched my agent learn to do absolutely nothing because that minimized drawdown. Then I switched to genetic algorithms, and within two weeks I had something that actually traded profitably.
This isn't a "GA is always better" post. It's a "here's what I learned the hard way" post.
The DRL Dream vs. Reality
The pitch for Deep RL in trading sounds amazing. Your agent learns optimal actions from raw market data. No hand-crafted features needed. It adapts to changing markets. AlphaGo beat the world champion, so surely it can beat the S&P 500, right?
Here's what actually happens.
Reward shaping is black magic. Do you reward realized PnL? Unrealized? Risk-adjusted returns? Sharpe ratio? Every choice creates perverse incentives. I had an agent that learned to open and close positions every single bar because the transaction cost wasn't weighted correctly in the reward. Fix that, and it learns to never trade at all.
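To make that churning failure concrete, here's a toy per-bar reward. Everything here (names, the 5 bps cost, the weights) is my own illustration, not code from any real system: under-weight the transaction cost and flipping positions every bar scores positive; weight it honestly and the exact same behavior scores negative.

```python
def step_reward(pnl, traded, cost_bps=5, cost_weight=1.0):
    """Per-bar reward: realized PnL minus a weighted transaction cost."""
    cost = cost_bps / 10_000 if traded else 0.0
    return pnl - cost_weight * cost

# An agent that flips in and out every single bar, scalping 4 bps of edge
# per round trip -- exactly the pathological behavior described above.
churn_pnl = [0.0004] * 100

naive  = sum(step_reward(p, traded=True, cost_weight=0.1) for p in churn_pnl)
honest = sum(step_reward(p, traded=True, cost_weight=1.0) for p in churn_pnl)

print(naive > 0, honest > 0)  # True False: churning only "works" under the naive weighting
```

Same policy, same market, opposite sign on the reward. That's why every weighting choice creates a different pathological optimum.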
Catastrophic forgetting is real. Train your agent on 2020 data and it learns COVID-crash patterns. Fine-tune it on 2023 bull-market data and it forgets everything about crashes. You can mitigate this with replay buffers and curriculum learning, but now you're spending more time on training infrastructure than on the actual strategy.
You can't explain anything. Your agent makes a trade. Why? The neural net said so. Good luck debugging that at 2 AM when it just took a 15% loss. Good luck explaining to yourself why it suddenly started buying every dip in a sector it previously avoided.
The data hunger is insane. Financial markets give you maybe 5,000 daily bars over 20 years for a single stock. That's nothing for deep learning. You can use minute data, but then you're paying for data feeds, dealing with microstructure noise, and your training time explodes.
What Made GA Click
I didn't come to genetic algorithms from some grand theoretical insight. I was frustrated and looking for something simpler.
The core idea: instead of training a neural network to make decisions, you define a "strategy DNA" — a set of numerical parameters that control when to buy, what to buy, and when to sell. Then you breed thousands of these strategies, keeping the ones that perform well and mutating the rest.
Here's what an actual strategy DNA looks like in my system (simplified from the real 40+ dimensional version):
```json
{
  "rsi_buy_threshold": 32.5,
  "rsi_sell_threshold": 71.0,
  "hold_days": 5,
  "stop_loss_pct": 3.2,
  "take_profit_pct": 18.7,
  "max_positions": 4,
  "w_momentum": 0.0842,
  "w_mean_reversion": 0.1205,
  "w_macd": 0.0631,
  "w_bollinger": 0.0974,
  "w_kdj": 0.0518,
  "w_obv": 0.0293,
  "w_volume_profile": 0.0157,
  "w_atr": 0.0412,
  "w_cci": 0.0688,
  "w_mfi": 0.0334
}
```
That's it. Every single decision the strategy makes is controlled by these numbers. RSI below 32.5? Potential buy. Weighted score above threshold? Execute. Hold for 5 days or until stop-loss hits.
Compare this to a DRL policy network with 50,000 parameters you can't inspect.
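The decision rule those parameters drive can be sketched in a few lines. The indicator names mirror the DNA above, but the gate-then-score structure and the score threshold are my illustration of the idea, not FinClaw's actual code:

```python
# Illustrative decision rule: RSI acts as a hard gate, then the weighted
# indicator score decides whether to actually execute.
dna = {
    "rsi_buy_threshold": 32.5,
    "w_momentum": 0.0842,
    "w_mean_reversion": 0.1205,
    "w_macd": 0.0631,
}

def buy_signal(indicators, dna, score_threshold=0.05):
    """Gate on RSI, then score the remaining indicators by their DNA weights."""
    if indicators["rsi"] >= dna["rsi_buy_threshold"]:
        return False  # not oversold enough: never a buy candidate
    score = sum(dna[f"w_{name}"] * value
                for name, value in indicators.items() if name != "rsi")
    return score > score_threshold

print(buy_signal({"rsi": 28.0, "momentum": 0.3,
                  "mean_reversion": 0.5, "macd": 0.1}, dna))  # True
```

Every branch in that function is one number in the DNA. Nothing is hidden in a weight matrix.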
Where GA Actually Wins (For Trading Specifically)
You can read the strategy. When generation 89 produces a strategy with a Sharpe ratio of 6.36, I can look at it and say "oh, it's weighting mean reversion and Bollinger bands heavily, with a tight stop-loss and short holding period." That makes intuitive sense for the Chinese A-share market. A DRL agent with identical performance would just be a matrix of floats.
Mutation is debugging-friendly. When a child strategy performs worse than its parent, the diff is a handful of parameter changes. You can track exactly which mutation broke it. Try doing that with gradient updates across 50,000 neural network weights.
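That parent-to-child diff is trivially computable, which is the whole point. A minimal sketch, with a mutation rate and perturbation scale I invented for illustration:

```python
import random

def mutate(parent, rate=0.2, scale=0.1):
    """Perturb a random subset of parameters; return the child and the exact diff."""
    child = dict(parent)
    for key, value in parent.items():
        if random.random() < rate:
            child[key] = round(value * (1 + random.uniform(-scale, scale)), 4)
    diff = {k: (parent[k], child[k]) for k in parent if parent[k] != child[k]}
    return child, diff

random.seed(7)
parent = {"rsi_buy_threshold": 32.5, "stop_loss_pct": 3.2, "w_momentum": 0.0842}
child, diff = mutate(parent)
print(diff)  # the entire parent-to-child change, readable at a glance
```

If the child's fitness drops, `diff` is the complete list of suspects. There is no 50,000-weight gradient step to untangle.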
Non-stationarity is handled naturally. Markets change. A strategy DNA that worked in 2020 won't work in 2024. But here's the thing — evolution handles this. Each generation is backtested on the data you have. If the market regime shifts, the population adapts over a few generations. No catastrophic forgetting because there's nothing to "forget" — the DNA either works on the backtest or it doesn't.
Walk-forward validation is trivial. I split data into 70% train / 30% validation. Fitness is weighted 40% train, 60% validation. If a strategy crushes training but bombs validation, it gets a harsh penalty — fitness multiplied by 0.3. This one rule alone kills most overfitting. Implementing the equivalent in DRL requires careful environment design, multiple evaluation seeds, and a lot of prayer.
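The 40/60 weighting and the 0.3 penalty translate almost directly into code. One assumption on my part: the post doesn't define "crushes training but bombs validation," so I use "validation Sharpe below half the training Sharpe" as a concrete trigger:

```python
def fitness(train_sharpe, val_sharpe):
    """Walk-forward fitness: 40% in-sample, 60% out-of-sample, harsh overfit penalty."""
    score = 0.4 * train_sharpe + 0.6 * val_sharpe
    if val_sharpe < 0.5 * train_sharpe:  # crushes training, bombs validation
        score *= 0.3                     # the harsh penalty from the rule above
    return score

print(fitness(3.0, 2.8))  # generalizes: 0.4*3.0 + 0.6*2.8 = 2.88
print(fitness(3.0, 0.5))  # overfit: (1.2 + 0.3) * 0.3 = 0.45
```

Note the asymmetry: validation weighs more than training, so a strategy can't win by memorizing the past.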
Compute cost is predictable. One generation of backtesting 30 strategies across 500 stocks takes about 45 seconds on my machine. I know exactly how long 100 generations will take. DRL training? Could converge in 2 hours. Could diverge after 20 hours. No way to know in advance.
Where GA Falls Short (And I'm Not Gonna Pretend Otherwise)
No guarantee of optimality. Gradient descent at least follows a principled path downhill. Evolution is random search with selection pressure. You might get stuck in local optima. I mitigate this with tournament selection, diversity bonuses, and random DNA injection when fitness stagnates for 15 generations — but it's still fundamentally stochastic.
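Two of those mitigations, tournament selection and random-DNA injection on stagnation, fit in a few lines. The tournament size and the 20% replacement fraction are illustrative assumptions, not my production values:

```python
import random

def tournament_select(population, fitnesses, k=3):
    """Pick k random candidates and keep the fittest: selection pressure
    without sorting the whole population every generation."""
    contenders = random.sample(range(len(population)), k)
    best = max(contenders, key=lambda i: fitnesses[i])
    return population[best]

def maybe_inject(population, stagnant_generations, make_random_dna, limit=15):
    """After `limit` stagnant generations, replace the weakest 20% with
    fresh random DNA (assumes the population is sorted best-first)."""
    if stagnant_generations >= limit:
        n = max(1, len(population) // 5)
        population[-n:] = [make_random_dna() for _ in range(n)]
    return population

random.seed(0)
pop = [{"id": i} for i in range(30)]
fit = [i * 0.1 for i in range(30)]
print(tournament_select(pop, fit)["id"])
```

Tournament selection keeps weaker candidates in the gene pool occasionally, and injection restarts exploration when the population converges prematurely. Neither guarantees escape from a local optimum; they just make getting stuck less likely.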
Overfitting is still a risk. I ran my system on 4 crypto coins and got 25,300% returns with a Sharpe of 13.37. Sounds amazing until you realize 4 coins is basically no diversification. That strategy was overfitted to those specific assets. I'm currently re-running with 17 coins to get honest numbers.
The search space is huge. My current DNA has 40+ built-in weights plus ~440 dynamically discovered factor weights. That's nearly 500 dimensions. Evolution works, but it's exploring a massive space. Each generation only tests 30 candidates. You need patience.
It can't discover truly novel strategies. GA optimizes within the structure you define. If the winning strategy requires a completely different architecture — say, pairs trading or options strategies — evolution within the current DNA structure will never find it. DRL, in theory, could discover arbitrary patterns. In practice, it usually doesn't, but the theoretical edge is there.
Related Work
I'm not the only one thinking about this. The CGA-Agent paper (arXiv:2510.07943) explores using LLM-guided genetic algorithms for trading — essentially letting GPT-4 suggest new mutation operators and factor definitions. The marriage of LLMs and evolutionary search for trading strategies is genuinely interesting research.
My own project, FinClaw, takes a similar but more hands-on approach: a Python engine that evolves strategy DNA across 484 factors in 33 categories, with walk-forward validation and dynamic factor discovery. It's not a paper — it's a running system. Generation 127 on A-shares, fitness 3253, 3060% annual return, Sharpe 6.36. Those are real backtest numbers, not hypothetical.
So Which One?
Neither approach is a money printer. Both will overfit if you let them.
The difference is what happens when things go wrong. When a GA strategy blows up, I open the DNA, look at the weights, and see that w_momentum spiked to 0.4 while the stop-loss relaxed to 8%. I can fix that. When a DRL policy blows up, I stare at a loss curve and wonder.
Interpretability isn't a nice-to-have when your money is on the line.
I build open-source tools for quantitative trading. If you're interested in strategy evolution, check out FinClaw on GitHub.