
Why Your Paper Trading Backtests Are Lying to You (And How to Fix It)

You ran the backtest. The Sharpe ratio looks great. CAGR of 34%. Max drawdown a reasonable 12%. You switch to paper trading and watch your strategy bleed out over the next two weeks.

If this sounds familiar, you're not doing anything wrong. Backtests lie by design — and paper trading in isolation doesn't fix the problem. Here's what's actually happening, and a better way to evaluate strategies before you put real capital on the line.


Problem 1: Survivorship Bias in Your Data

If you're testing against S&P 500 constituents, you're testing against today's winners. The companies that went bankrupt, got delisted, or dropped out of the index over your test period aren't in your dataset.

This matters more than people realize. A strategy that "buys the dip on large-caps" looks spectacular if you're only testing on companies that survived. In reality, buying dips on Lehman Brothers in 2008 was a one-way ticket to zero.

The fix: Use point-in-time data that includes delisted stocks, or at minimum acknowledge this bias when evaluating results. If your edge depends on survivorship-biased data, it's not a real edge.
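As a sketch of what point-in-time filtering looks like, here's a minimal universe table with hypothetical listing and delisting dates (the tickers and dates are illustrative, not pulled from a real dataset):

```python
import pandas as pd

# Hypothetical universe table: listing and delisting dates per ticker.
# (Dates are illustrative; a far-future delist date means "still listed".)
universe = pd.DataFrame({
    "ticker": ["AAPL", "LEH", "ENRN"],
    "listed": pd.to_datetime(["1980-12-12", "1994-05-02", "1990-01-02"]),
    "delisted": pd.to_datetime(["2200-01-01", "2008-09-17", "2001-11-28"]),
})

def tradable_on(universe: pd.DataFrame, date: str) -> list:
    """Return tickers that were actually tradable on `date`."""
    d = pd.to_datetime(date)
    mask = (universe["listed"] <= d) & (universe["delisted"] > d)
    return universe.loc[mask, "ticker"].tolist()

print(tradable_on(universe, "2008-01-02"))  # ['AAPL', 'LEH'], Lehman still trades
print(tradable_on(universe, "2009-01-02"))  # ['AAPL'], survivorship kicks in
```

A backtest that asks `tradable_on` for each rebalance date can't accidentally trade names that had already disappeared by then.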


Problem 2: Look-Ahead Bias

This one is sneaky. It happens when your strategy accidentally accesses future data during backtesting.

Classic examples:

  • Using adjusted closing prices (which get recalculated retroactively when splits or dividends happen)
  • Rebalancing at the exact open price of the next day (impossible in practice)
  • Using a moving average that's recalculated with data you wouldn't have had
```python
# This is fine: the signal only uses data available at the close
df['ma_200'] = df['close'].rolling(200).mean()
df['signal'] = df['close'] > df['ma_200']

# This is look-ahead bias: shift(-1) pulls tomorrow's close into today's signal
df['signal'] = df['close'].shift(-1) > df['ma_200']  # BAD
```

The second version is using tomorrow's close to make today's decision. Your real portfolio can't do that.

The fix: Always check that your signal is based on data available at signal generation time. When in doubt, add an explicit .shift(1) to your price data before generating signals.
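A minimal sketch of that defensive shift, using a toy close series in place of real OHLCV data:

```python
import pandas as pd

# Toy close series; in practice this is your real OHLCV data.
df = pd.DataFrame({"close": [100.0, 102.0, 101.0, 105.0, 107.0]})

# Lag prices by one bar BEFORE computing indicators and signals,
# so today's signal only ever sees yesterday's close.
df["close_lagged"] = df["close"].shift(1)
df["ma_3"] = df["close_lagged"].rolling(3).mean()
df["signal"] = df["close_lagged"] > df["ma_3"]
```

The cost is a one-bar lag on every signal; the benefit is that the backtest can't see anything your live system wouldn't have.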


Problem 3: Transaction Cost Blindness

Backtests routinely ignore:

  • Commissions (even "zero commission" brokers have payment for order flow)
  • Slippage (you don't always fill at the price you wanted)
  • Market impact (larger orders move the price)
  • Spread (bid-ask spread on less liquid names)

A strategy that trades 50 times a month with a theoretical 0.5% edge can easily get eaten alive by 0.1% slippage per trade. Run the math:

```python
gross_return = 0.005        # 0.5% edge per trade
slippage_per_trade = 0.001  # 0.1% slippage per trade
num_trades = 50
net_return = (gross_return - slippage_per_trade) * num_trades
# 0.004 * 50 = 0.20 (20%) vs 0.005 * 50 = 0.25 (25%)
# Those 5 percentage points are real money
```

The fix: Use realistic transaction cost assumptions. For Alpaca paper trading, simulate slippage by assuming fills 0.05–0.1% worse than the midpoint. This is conservative but forces your strategy to prove it has real edge.
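Here's a small sketch of that assumption applied to individual fills. The 0.075% figure is just the midpoint of the range above, and `fill_price` is an illustrative helper, not part of any broker API:

```python
# Conservative slippage haircut on every fill: 0.075%, the midpoint of
# the 0.05-0.1% range, applied against you on both entries and exits.
SLIPPAGE = 0.00075

def fill_price(mid: float, side: str, slippage: float = SLIPPAGE) -> float:
    """Simulate a fill slightly worse than the quoted midpoint."""
    if side == "buy":
        return mid * (1 + slippage)  # pay up on buys
    return mid * (1 - slippage)      # receive less on sells

entry = fill_price(100.00, "buy")      # ~100.075
exit_ = fill_price(101.00, "sell")     # ~100.924
round_trip_return = exit_ / entry - 1  # ~0.85% instead of the gross 1%
```

If a strategy's edge survives this haircut on every round trip, it has a much better chance of surviving real execution.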


Problem 4: Single-Path Testing

Most backtests test one path through history. But markets have many possible paths. The specific sequence of events from 2020-2023 — COVID crash, meme stock mania, rate hike cycle — was one realization of many possible worlds.

Your strategy might have crushed that specific path but would have failed in 60% of plausible alternative scenarios.

This is where tournament-style testing changes the picture. Instead of one backtest, you run dozens — varying:

  • Start dates (offset by weeks or months)
  • Market regimes (bull, bear, high vol, low vol)
  • Parameter ranges (walk-forward optimization)

If your strategy only looks good in the one path you happened to test, that's not a strategy — it's curve fitting.
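The start-date variation is the easiest of these to sketch. Here a synthetic return series stands in for a real backtest engine, and the same 500-bar window is re-run from start dates offset by roughly one month:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic daily return series as a stand-in for a real backtest engine.
returns = pd.Series(rng.normal(0.0005, 0.01, 750))

def total_return(rets: pd.Series) -> float:
    """Compound a series of simple returns into one total return."""
    return float((1 + rets).prod() - 1)

# Re-run the "same" 500-bar backtest from start dates offset by ~21 bars.
results = [total_return(returns.iloc[start:start + 500])
           for start in range(0, 250, 21)]

# If performance swings wildly with the start date, the edge is fragile.
spread = max(results) - min(results)
```

A robust strategy should produce a tight cluster of results here; a wide spread means the headline number depended on where you happened to press start.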


How TradeSight Approaches This

I built TradeSight to address exactly these problems with paper trading evaluation.

The core idea is tournament mode: instead of running one strategy against one period, you run multiple strategies head-to-head across the same market conditions on Alpaca paper trading. The strategies compete for the same tickers on the same days, with identical transaction cost assumptions.

Key features:

  • Simultaneous multi-strategy comparison — no cherry-picking which strategy to test when
  • Consistent transaction cost modeling — same slippage assumptions across all strategies
  • Walk-forward validation — out-of-sample periods baked in, not bolted on
  • Monte Carlo simulation — resampling return sequences to stress-test equity curves
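For the Monte Carlo piece, a minimal bootstrap over daily returns looks something like this (synthetic returns stand in for a real strategy's output):

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic daily strategy returns; in practice, use your backtest's output.
daily_returns = rng.normal(0.0008, 0.012, 252)

# Resample return sequences with replacement and rebuild equity curves.
n_paths = 1000
samples = rng.choice(daily_returns, size=(n_paths, daily_returns.size),
                     replace=True)
equity = (1 + samples).cumprod(axis=1)

# Distribution of outcomes across simulated paths, not just one history.
final = equity[:, -1]
drawdowns = (equity / np.maximum.accumulate(equity, axis=1) - 1).min(axis=1)
p5_final = float(np.percentile(final, 5))  # pessimistic-but-plausible outcome
```

Instead of one equity curve, you get a distribution of final values and worst drawdowns, which is a much better basis for deciding how much capital a strategy deserves.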

A simple comparison run looks like this:

```shell
python tradesight.py --mode tournament \
  --strategies momentum_basic,mean_reversion_rsi,bollinger_squeeze \
  --tickers AAPL,MSFT,NVDA,GOOGL \
  --period 90d \
  --paper-account your_alpaca_key
```

The output tells you not just which strategy won, but why — which market conditions favored it, how it performed in drawdown periods, and whether the edge holds across parameter variations.


The Honest Backtest Checklist

Before trusting any backtest result:

  • [ ] Is your data survivorship-bias free?
  • [ ] If you use adjusted close prices, have you accounted for retroactive adjustments?
  • [ ] Have you verified no look-ahead bias in signal generation?
  • [ ] Are transaction costs modeled realistically (including slippage)?
  • [ ] Have you tested across multiple start dates, not just one?
  • [ ] Have you done out-of-sample validation on data the strategy never "saw"?
  • [ ] Does the strategy perform consistently across different market regimes?

If you can check all seven boxes, your backtest result means something. If you can't, treat it as a hypothesis — not a conclusion.


Paper Trading Is Still Useful — Just Not for Validation

Paper trading with Alpaca is great for testing execution — does your order routing work? Does your position sizing math hold up? Are there bugs in your live data feed handling?

It's not great for strategy validation because you're only seeing one more path through market history. Combine it with rigorous backtesting (with the above fixes) and tournament comparison, and you have something worth trusting.

The market doesn't care how good your backtest looks. It only cares what you do with real capital.

GitHub: https://github.com/rmbell09-lang/tradesight
