Backtesting feels productive. You write a strategy, run it against 5 years of historical data, watch the equity curve trend up and to the right, and start daydreaming about returns.
Then you paper trade it. And it falls apart within the first two weeks.
I've been through this cycle enough times to know that backtesting and live trading are fundamentally different problems. Backtesting is optimization. Live trading is survival.
## What Backtesting Actually Tests
Backtesting tests whether your strategy would have worked on data you implicitly shaped your strategy around. Even with walk-forward validation and out-of-sample testing, there's selection bias baked in: you chose the instruments, the time period, the features. The strategy is always somewhat aware of the future it's supposedly predicting.
The four biases I kept running into (and eventually fixed in TradeSight):
- Look-ahead bias — using daily close prices as if they're available intraday
- Survivorship bias — only testing on stocks that still exist
- Overfitting — RSI + MACD + Bollinger + custom filter = it works on exactly this data
- Slippage blindness — assuming fills at the exact signal price
These aren't edge cases. They're the default state of most backtesting setups.
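The simplest defense against the first bias, look-ahead, is to lag every signal by one bar so a signal computed from today's close can only be acted on at the next bar. A minimal sketch (the function name is illustrative, not TradeSight's actual API):

```python
# Hypothetical helper: shift signals forward one bar so a signal computed
# on bar N is only tradable on bar N+1. Prevents the backtest from
# "filling" at a price that wasn't knowable when the signal fired.

def lag_signals(signals: list[int]) -> list[int]:
    """Shift signals forward one bar; the first bar carries no signal (0)."""
    return [0] + signals[:-1]

# A buy signal computed on bar 2 becomes actionable on bar 3:
raw = [0, 0, 1, 0, -1]
lagged = lag_signals(raw)  # [0, 0, 0, 1, 0]
```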
## The Tournament Idea
The shift I made: instead of running one strategy and judging it on historical performance, I run multiple strategies simultaneously against live paper trading data and let them compete.
The setup is simple. Each strategy gets the same starting capital. They all receive the same market data in real-time via Alpaca's paper trading API. At the end of a session (I run overnight — markets are less noisy for my RSI-based signals), I score each strategy on:
- Net return
- Max drawdown
- Win rate
- Sharpe ratio
- Number of trades (penalizes churning)
The top performers roll forward. The losers get replaced or adjusted. It's evolutionary selection on a small scale.
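A composite score over those five metrics might look like the sketch below. The weights and the churn penalty are assumptions for illustration, not TradeSight's actual formula:

```python
import math

# Illustrative composite score: reward net return, Sharpe, and win rate;
# penalize max drawdown and excessive trading. Weights are assumptions.

def score_strategy(returns: list[float], max_trades: int = 50) -> float:
    if not returns:
        return 0.0
    equity, peak, max_dd, wins = 1.0, 1.0, 0.0, 0
    for r in returns:
        equity *= 1 + r
        peak = max(peak, equity)
        max_dd = max(max_dd, (peak - equity) / peak)
        wins += r > 0
    net_return = equity - 1
    win_rate = wins / len(returns)
    mean = sum(returns) / len(returns)
    var = sum((r - mean) ** 2 for r in returns) / len(returns)
    sharpe = mean / math.sqrt(var) if var > 0 else 0.0
    churn_penalty = max(0, len(returns) - max_trades) * 0.01
    return net_return + 0.5 * sharpe + 0.2 * win_rate - 2 * max_dd - churn_penalty
```

The drawdown weight is deliberately the largest term, echoing the point below that strategies which survive drawdowns compound better.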
## Why This Works Better
A few things surprised me when I switched to this approach:
Failure is fast. A strategy that's broken in live conditions fails within days, not months. With backtesting, I'd build elaborate validation pipelines to catch flaws. With tournaments, the market does it for you.
Signal quality becomes obvious. When a strategy is profitable in backtesting but losing in live trading, it's almost always overfitting or look-ahead bias. The gap between backtest and live performance is diagnostic data.
You stop over-engineering. When you're running 4-6 strategies at once with real-time data, simple beats complex almost every time. RSI crossover with clean entry/exit rules usually outperforms my "improved" version with 8 extra filters.
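For concreteness, an RSI crossover with clean entry/exit rules can be this small. A sketch in the spirit of the strategy described above, not TradeSight's implementation:

```python
# Minimal RSI crossover: buy when RSI crosses up through an oversold
# threshold, sell when it crosses down through an overbought one.
# Thresholds and period are the conventional defaults, used here as
# illustrative assumptions.

def rsi(closes: list[float], period: int = 14) -> float:
    """Simple-average RSI over the last `period` price changes."""
    deltas = [b - a for a, b in zip(closes[-period - 1:], closes[-period:])]
    gains = sum(d for d in deltas if d > 0)
    losses = sum(-d for d in deltas if d < 0)
    if losses == 0:
        return 100.0
    return 100 - 100 / (1 + gains / losses)

def crossover_signal(prev_rsi: float, curr_rsi: float,
                     oversold: float = 30, overbought: float = 70) -> int:
    if prev_rsi < oversold <= curr_rsi:
        return 1    # entry: RSI crossed up out of oversold
    if prev_rsi > overbought >= curr_rsi:
        return -1   # exit: RSI crossed down out of overbought
    return 0        # no signal
```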
## The Implementation
The core of TradeSight is a signal loop that runs every 30 minutes during market hours and overnight for swing setups:
```python
for strategy in active_strategies:
    signals = strategy.generate_signals(market_data)
    for signal in signals:
        if passes_risk_guard(signal, portfolio_state):
            execute_paper_trade(signal)
    strategy.score = calculate_score(strategy.trade_history)

# Tournament: keep top performers
ranked = sorted(active_strategies, key=lambda s: s.score, reverse=True)
active_strategies = ranked[:keep_top_n]
```
The risk guard is important — it enforces position sizing limits and a correlation guard that prevents the portfolio from doubling up on correlated instruments (which backtests almost never catch).
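A risk guard with those two checks might look like this. The thresholds, field names, and correlation source are assumptions for the sketch:

```python
# Illustrative risk guard: caps any single position as a fraction of
# equity, and rejects a new position that is highly correlated with one
# already held. All thresholds and dict keys here are assumptions.

MAX_POSITION_FRACTION = 0.10    # no single position above 10% of equity
MAX_PAIRWISE_CORRELATION = 0.8  # reject near-duplicates of held exposure

def passes_risk_guard(signal: dict, portfolio: dict,
                      correlations: dict[tuple, float]) -> bool:
    # Position-sizing limit
    if signal["notional"] > MAX_POSITION_FRACTION * portfolio["equity"]:
        return False
    # Correlation guard: reject if any held symbol moves with this one
    for held in portfolio["positions"]:
        pair = tuple(sorted((signal["symbol"], held)))
        if correlations.get(pair, 0.0) > MAX_PAIRWISE_CORRELATION:
            return False
    return True
```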
## What I've Learned From Running Tournaments
After running this for a few months:
- RSI with sector-adjusted thresholds consistently outperforms vanilla RSI
- Mean reversion beats momentum on overnight holding periods for the instruments I trade
- Drawdown limits matter more than return targets — strategies that survive drawdowns compound better over time
- The correlation guard saves you regularly — tech sector moves together and naive strategies pile into correlated positions
The most useful output isn't "this strategy made 3.2% this week." It's "this strategy's drawdown profile looks unstable — here's why."
## The Actual Code
TradeSight is open source: github.com/rmbell09-lang/tradesight
It runs on Python 3.9+, connects to Alpaca for paper trading data, and stores tournament results in SQLite. Setup takes about 10 minutes if you have an Alpaca paper account (free to create).
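Persisting session results in SQLite needs nothing beyond the standard library. The schema below is a hypothetical sketch, not TradeSight's actual table layout:

```python
import sqlite3

# Hypothetical schema for per-session tournament results; column names
# mirror the scoring metrics described earlier but are assumptions.
conn = sqlite3.connect(":memory:")  # use a file path in practice
conn.execute("""
    CREATE TABLE IF NOT EXISTS tournament_results (
        session_date TEXT,
        strategy     TEXT,
        net_return   REAL,
        max_drawdown REAL,
        win_rate     REAL,
        sharpe       REAL,
        num_trades   INTEGER,
        score        REAL
    )
""")
conn.execute(
    "INSERT INTO tournament_results VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    ("2024-01-15", "rsi_crossover", 0.012, 0.03, 0.58, 1.1, 14, 0.7),
)
# Rank strategies for the next round of the tournament
top = conn.execute(
    "SELECT strategy FROM tournament_results ORDER BY score DESC LIMIT 3"
).fetchall()
```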
If you're tired of backtesting giving you false confidence, this approach is worth trying. It won't make bad strategies good — but it'll tell you which ones are actually bad a lot faster.
Built a similar system? Running paper trading tournaments against something other than RSI? I'd like to hear what's working.