What this guide assumes
You can write Python. You understand the basics of equity markets — what a price series is, what total return means, what rebalancing means. You haven't done a backtest before, or you've done one and weren't sure if the result was real.
The goal: walk through building a minimal but correct backtest of a simple strategy (12-month momentum on the S&P 500 constituents), so you can spot the three biases that destroy 90% of amateur backtests and understand why your "Sharpe ratio of 4.2" almost certainly isn't real.
The strategy
We'll test a textbook momentum strategy:
- Each month, rank the S&P 500 constituents by their trailing 12-month total return (excluding the most recent month).
- Hold the top 10% (~50 stocks), equal-weighted.
- Rebalance monthly.
- Compare to buying and holding the S&P 500.
This is a real strategy that has been published in academic literature (Jegadeesh & Titman 1993). The "exclude the most recent month" detail is to avoid short-term reversal effects.
We expect (based on prior research) a Sharpe ratio improvement over buy-and-hold of ~0.2-0.4. If your backtest shows a Sharpe ratio of 1.5 or higher, something is wrong.
The setup
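import pandas as pd
import numpy as np
import yfinance as yf

# Get data
tickers = pd.read_csv('sp500_historical_constituents.csv')  # see note below
prices = yf.download(tickers['ticker'].unique().tolist(),
                     start='2010-01-01', end='2025-12-31',
                     auto_adjust=True)['Close']
prices = prices.dropna(axis=1, how='all')  # drop tickers that returned no data at all
# yfinance returns daily bars, but the signal code assumes one row per month,
# so keep the last price of each month ('ME' needs pandas >= 2.2; use 'M' on older versions)
prices = prices.resample('ME').last()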
The CSV referenced above is critical. We need historical S&P 500 constituents (the list of companies that were in the index at each historical date), not today's S&P 500 list applied retroactively. The latter introduces survivorship bias and it's the #1 reason amateur backtests look unrealistically good.
If you backtest against today's S&P 500 list using historical prices, you're only testing companies that survived until today. The companies that went bankrupt or got acquired (Lehman, WaMu, Sun Microsystems) silently drop out of your universe. The strategy "looks better" than it would have in real life because you've eliminated the losers a priori.
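To get a feel for the size of the effect, a toy simulation helps. The sketch below uses synthetic returns with no real edge (not market data) and makes the arbitrary assumption that the worst 10% of names vanish from "today's list"; the parameters are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_stocks, n_months = 500, 120

# Synthetic monthly returns: mean 0.5%, volatility 8% (illustrative assumptions)
monthly = rng.normal(0.005, 0.08, size=(n_months, n_stocks))

# Total return of each stock over the whole period
total_return = np.prod(1 + monthly, axis=0) - 1

# Pretend the worst 10% of stocks were delisted and are missing from today's list
survived = total_return > np.quantile(total_return, 0.10)

full_universe_mean = total_return.mean()
survivors_only_mean = total_return[survived].mean()

print(f"all stocks:     {full_universe_mean:.1%}")
print(f"survivors only: {survivors_only_mean:.1%}")
```

The survivors-only average is higher by construction, even though no stock-picking skill was involved. That gap is what a today's-list universe quietly adds to every strategy tested on it.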
Sources for historical constituents: WRDS (academic), CRSP (paid), some Kaggle datasets (sketchy but free). If you can't get historical constituents, your backtest is fundamentally compromised — call it an "illustrative" backtest at best.
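Having the historical list is only half the job: the universe also has to be restricted, on each rebalance date, to the names that were in the index on that date. A minimal sketch, assuming a hypothetical file layout with one row per membership spell (columns ticker, start_date, end_date); your constituents file may be organized differently, and the toy data below stands in for it.

```python
import pandas as pd

# Hypothetical layout: one row per membership spell. The real CSV may differ.
constituents = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC"],
    "start_date": pd.to_datetime(["2009-01-01", "2009-01-01", "2010-03-01"]),
    # NaT means "still in the index"
    "end_date": pd.to_datetime(["2010-02-28", None, None]),
})

def membership_mask(constituents, dates, tickers):
    """Boolean DataFrame: True where a ticker was an index member on a date."""
    mask = pd.DataFrame(False, index=dates, columns=tickers)
    for row in constituents.itertuples(index=False):
        if row.ticker not in mask.columns:
            continue
        end = row.end_date if pd.notna(row.end_date) else dates.max()
        in_index = (dates >= row.start_date) & (dates <= end)
        mask.loc[in_index, row.ticker] = True
    return mask

month_ends = pd.to_datetime(["2010-01-31", "2010-02-28", "2010-03-31", "2010-04-30"])
mask = membership_mask(constituents, month_ends, ["AAA", "BBB", "CCC"])

# Toy signal; in the guide this would be the momentum signal DataFrame
toy_signal = pd.DataFrame(0.1, index=month_ends, columns=["AAA", "BBB", "CCC"])
point_in_time_signal = toy_signal.where(mask)  # non-members become NaN
```

Because the ranking step drops NaN values, masking the signal this way is enough to keep non-members out of the portfolio on each date.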
The momentum signal
# Compute trailing 12-month total return, lagged by 1 month
# (assumes prices has one row per month; fill_method=None stops pandas forward-filling gaps)
returns = prices.pct_change(fill_method=None)
# 12-month compounded return ending 1 month ago
trailing_12m = (1 + returns).rolling(12).apply(np.prod, raw=True) - 1
signal = trailing_12m.shift(1)  # exclude most recent month
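That .shift(1) is the second critical detail. It makes the signal on each row stop one month earlier, which is the skip-month from the strategy definition. The other half of the lagging happens in the portfolio loop, where holdings picked on one date earn the next period's return. Pair a signal with the return from the same row instead and you're using current-month information to make decisions you'd have made at the start of the month. That's look-ahead bias: using future data to make past decisions. It will inflate your Sharpe.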
The general rule: any decision at time T can only use information available at T (or before). When in doubt, shift your signals by 1 period and check whether your backtest still works. If it doesn't, you had look-ahead bias and your previous result was illusory.
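The shift-and-recheck rule can be seen on pure noise. The sketch below is an illustration on synthetic random returns, where no signal should work at all; the helper name top_decile_return is hypothetical and not part of the guide's code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
months = pd.period_range("2015-01", periods=60, freq="M")
# 100 synthetic stocks with zero expected return: nothing here is predictable
returns = pd.DataFrame(rng.normal(0.0, 0.05, size=(60, 100)), index=months)

def top_decile_return(signal, returns):
    """Mean same-row return of the names in the top signal decile, per period."""
    threshold = signal.quantile(0.9, axis=1)
    picked = signal.ge(threshold, axis=0)
    return returns.where(picked).mean(axis=1)

leaky_signal = returns            # BUG: the signal is the return it tries to predict
lagged_signal = returns.shift(1)  # only information available before the period

leaky = top_decile_return(leaky_signal, returns).mean()
lagged = top_decile_return(lagged_signal, returns).mean()

print(f"leaky signal:  {leaky:.2%} per month")
print(f"lagged signal: {lagged:.2%} per month")
```

The leaky version reports a large monthly return on data that contains no edge, and the lagged version collapses toward zero. A real backtest that behaves like this when you add a shift had the same bug.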
The portfolio formation
def get_top_decile(date, signal_df):
    """Tickers at or above the 90th percentile of the signal on a given date."""
    universe = signal_df.loc[date].dropna()
    threshold = universe.quantile(0.9)
    return universe[universe >= threshold].index

# Build portfolio returns
portfolio_returns = []
holdings_history = []  # tickers held each period, reused by the cost layer
return_dates = []      # the period in which each return is actually earned
dates = signal.index[13:]  # 12-month window plus the 1-month skip
for date, next_date in zip(dates[:-1], dates[1:]):
    selected = get_top_decile(date, signal)
    if len(selected) == 0:
        continue  # no valid signal on this date
    # Equal-weight return next period
    next_returns = returns.loc[next_date, selected].dropna()
    portfolio_returns.append(next_returns.mean())
    holdings_history.append(selected)
    return_dates.append(next_date)
# Label each return with the month it was earned in, not the formation date
portfolio_series = pd.Series(portfolio_returns, index=return_dates)
Note: this code is naive on purpose. Real implementations would use a vectorized backtest library (vectorbt, bt, zipline). For a first backtest, the explicit loop is easier to reason about and harder to silently introduce bugs into.
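Once the loop is trusted, the same top-decile logic can be written in plain pandas without reaching for a library. This is a sketch under the same assumptions as the loop (monthly rows, equal weights, missing next-period returns dropped); the function name is hypothetical, and the toy inputs stand in for the real signal and returns.

```python
import numpy as np
import pandas as pd

def vectorized_top_decile_returns(signal, returns):
    """Equal-weight return of the top signal decile, earned one row after selection."""
    threshold = signal.quantile(0.9, axis=1)
    picked = signal.ge(threshold, axis=0)      # True for names at/above the 90th percentile
    held = picked.shift(1, fill_value=False)   # selected on row t, held during row t+1
    return returns.where(held).mean(axis=1)    # mean over held names, NaN returns ignored

# Toy inputs: random returns and a stand-in signal (not the momentum signal itself)
rng = np.random.default_rng(1)
months = pd.period_range("2020-01", periods=30, freq="M")
toy_returns = pd.DataFrame(rng.normal(0.01, 0.05, size=(30, 20)), index=months)
toy_signal = toy_returns.rolling(3).sum().shift(1)

vectorized = vectorized_top_decile_returns(toy_signal, toy_returns)
```

A useful habit is to keep the explicit loop around and assert that both versions agree on a small sample before deleting it.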
The transaction cost layer
This is the third bias: ignoring transaction costs. A monthly-rebalanced momentum strategy turns over a lot of its portfolio every month — typically 30-60% of holdings change each rebalance. At those turnover rates, transaction costs matter.
For a realistic backtest:
TRANSACTION_COST_BPS = 10  # 10 basis points per trade, each way

def compute_turnover(prev_holdings, current_holdings):
    """Rough fraction of the portfolio replaced at a rebalance."""
    if prev_holdings is None:
        return 0.5  # initial buy: buys only, so half of a full sell-and-buy swap
    prev_set = set(prev_holdings)
    curr_set = set(current_holdings)
    if not curr_set:
        return 0.5 if prev_set else 0.0  # liquidating everything is sells only
    # rough: fraction that changed
    return len(prev_set.symmetric_difference(curr_set)) / (2 * len(curr_set))

# Apply costs
# holdings_history: the selected tickers behind each entry of portfolio_returns, in order
prev_holdings = None
adjusted_returns = []
for raw_return, selected in zip(portfolio_returns, holdings_history):
    turnover = compute_turnover(prev_holdings, selected)
    cost = turnover * (TRANSACTION_COST_BPS / 10000) * 2  # buy + sell
    adjusted_returns.append(raw_return - cost)
    prev_holdings = selected
10 bps per trade is optimistic for retail. Actual retail costs (spread + commission + slippage) are 15-50 bps depending on the stocks and your broker. Bid-ask spreads on small caps are wider than you think.
Apply realistic costs and watch your Sharpe ratio drop by 0.3-0.5. If your strategy survives that with positive alpha, you might have something. If it doesn't, you had a "before costs" strategy, which is a strategy that doesn't exist.
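The arithmetic behind that drop is worth doing by hand once. The sketch below uses the turnover and cost ranges quoted above and a hypothetical helper, annual_cost_drag; it ignores compounding and market impact, so treat it as a back-of-the-envelope estimate.

```python
def annual_cost_drag(monthly_turnover, cost_bps_per_side, rebalances_per_year=12):
    """Approximate yearly return lost to trading costs.

    monthly_turnover: fraction of the portfolio replaced at each rebalance.
    Each replaced position is one sell plus one buy, hence the factor of 2.
    """
    per_rebalance = monthly_turnover * 2 * cost_bps_per_side / 10_000
    return per_rebalance * rebalances_per_year

for turnover in (0.30, 0.60):
    for bps in (10, 15, 50):
        drag = annual_cost_drag(turnover, bps)
        print(f"turnover {turnover:.0%}, {bps} bps/side -> {drag:.2%} per year")
```

Under these assumptions, 30% monthly turnover at 10 bps per side costs about 0.72% a year, while 60% turnover at 50 bps per side costs about 7.2% a year, which is more than the excess return a momentum strategy is expected to earn.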
What "good" looks like
After applying all the corrections (survivorship-free universe, properly lagged signals, realistic transaction costs), a textbook 12-month momentum strategy on US large-caps from 2010-2025 should produce roughly:
- Annualized excess return over S&P 500: 1-3% (highly variable by sub-period)
- Sharpe ratio: ~0.6-0.8 (vs. S&P 500's ~0.5-0.7)
- Maximum drawdown: comparable to or worse than the index
- Years where it underperforms: roughly 40% of the time
If your backtest produces Sharpe 1.5+ or 10%+ excess returns annually, assume you have a bug and go find it. The most likely candidate: you accidentally re-introduced survivorship bias or look-ahead bias somewhere.
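To compare your own numbers against that list, the metrics need to be computed the same way every time. A minimal sketch for monthly return series, assuming a zero risk-free rate by default and a date-like index for the yearly comparison; the function names are illustrative, not from any library.

```python
import numpy as np
import pandas as pd

def annualized_sharpe(monthly_returns, monthly_risk_free=0.0):
    """Annualized Sharpe ratio from monthly returns (sqrt(12) scaling)."""
    excess = monthly_returns - monthly_risk_free
    return np.sqrt(12) * excess.mean() / excess.std(ddof=1)

def max_drawdown(monthly_returns):
    """Worst peak-to-trough loss, as a negative fraction."""
    wealth = (1 + monthly_returns).cumprod()
    running_peak = wealth.cummax().clip(lower=1.0)  # starting capital counts as a peak
    return (wealth / running_peak - 1).min()

def share_of_years_underperforming(strategy, benchmark):
    """Fraction of calendar years in which the strategy trails the benchmark."""
    strat_yearly = (1 + strategy).groupby(strategy.index.year).prod() - 1
    bench_yearly = (1 + benchmark).groupby(benchmark.index.year).prod() - 1
    return (strat_yearly < bench_yearly).mean()

# Toy usage with made-up numbers
months = pd.period_range("2020-01", periods=24, freq="M")
toy_strategy = pd.Series([0.01] * 12 + [0.00] * 12, index=months)
toy_benchmark = pd.Series(0.005, index=months)
underperformance = share_of_years_underperforming(toy_strategy, toy_benchmark)
```

Run the same functions on both the strategy and the benchmark series, over the same dates, before comparing anything.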
The dangerous failure mode of backtesting is the plausible-looking result. A Sharpe of 0.9 from a momentum strategy is plausible-looking: it's hard to spot as wrong by inspection. But if your strategy "shouldn't" produce 0.9 (because the published research says 0.6-0.8), there's likely a subtle bias inflating it. The right reaction is "investigate the bug," not "publish the result."
The minimal viable backtest harness
For a first backtest, what you actually need:
- Survivorship-bias-free historical universe: this is the hardest data to source for free. Without it, treat all results as illustrative.
- Vectorized return computation: pandas + numpy is fine, don't pre-optimize.
- Proper signal lagging: explicit .shift(1) everywhere, audit by spot-checking that decisions don't use same-period data.
- Transaction cost layer: configurable bps per trade, applied to turnover.
- Baseline comparison: every result is meaningful only relative to a benchmark. Always show your strategy's return alongside buy-and-hold of the same universe.
That's it. Don't add walk-forward optimization, Bayesian parameter selection, or regime detection to your first backtest. Those features hide bugs by giving you more degrees of freedom to fit historical noise.
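The baseline comparison from the list above can be sketched in a few lines. This assumes the benchmark is an equal-weight average of the same universe, rebalanced every period, which is a simplification of true buy-and-hold; compare_to_benchmark is a hypothetical helper name.

```python
import pandas as pd

def compare_to_benchmark(strategy_returns, universe_returns):
    """Side-by-side cumulative growth of the strategy and an equal-weight universe."""
    benchmark = universe_returns.mean(axis=1)  # equal-weight, rebalanced each period
    aligned = pd.concat(
        {"strategy": strategy_returns, "benchmark": benchmark}, axis=1
    ).dropna()  # keep only periods where both series exist
    growth = (1 + aligned).cumprod()
    growth["excess"] = growth["strategy"] - growth["benchmark"]
    return growth

# Toy usage with made-up numbers
periods = pd.period_range("2021-01", periods=2, freq="M")
toy_universe = pd.DataFrame({"AAA": [0.10, 0.00], "BBB": [0.00, 0.00]}, index=periods)
toy_strategy = pd.Series([0.10, 0.00], index=periods)
growth = compare_to_benchmark(toy_strategy, toy_universe)
```

Dropping periods where either series is missing keeps the two growth curves on the same dates, which is the part of the comparison that is easiest to get wrong.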
What to use after this
Once your minimal backtest works:
- vectorbt (free, open-source) for faster iteration on parameter sweeps
- backtrader (free) for event-driven backtests with realistic order modeling
- QuantConnect (free for community, paid for production) for cloud backtests with paid data feeds
- Zipline-Reloaded (free) if you want institutional-grade infrastructure
For data: free options (Yahoo, Alpha Vantage) have known quality issues that will bite you. Paid options (Polygon.io, EOD Historical Data) are worth it once you're past the learning stage.
Verdict
A correct first backtest is harder than the code suggests. The bugs aren't in the syntax — they're in the unstated assumptions (survivorship bias, look-ahead, transaction costs) that destroy Sharpe ratios. Build the minimal harness, validate against published research expectations, and assume any spectacular result is wrong until proven otherwise.
The skill that takes most of the time to develop isn't writing the backtest. It's developing the paranoid intuition for "this looks too good — what bug is inflating it?" That intuition saves you from publishing or trading on illusory edges.
Originally published at pickuma.com.