DEV Community

Z S


How I Used Python to Analyze S&P 500 Returns Since 1928


I have always heard financial advice like "the market averages 10% per year" and "time in the market beats timing the market." But I wanted to verify these claims with actual data. So I pulled 97 years of S&P 500 data and analyzed it with Python.

Here is what I found — and some of the results challenge common assumptions.

Getting the Data

I used the historical S&P 500 annual returns dataset from NYU Stern professor Aswath Damodaran, which covers 1928-2024. You can replicate this with Yahoo Finance data using yfinance or download Damodaran's dataset directly.

import pandas as pd
import numpy as np

# Load annual returns (1928-2024)
# Columns: Year, S&P 500 Return (including dividends)
df = pd.read_csv('sp500_returns.csv')

print(f"Years of data: {len(df)}")
print(f"Date range: {df['Year'].min()} - {df['Year'].max()}")

Finding 1: The "Average 10%" Is Misleading

The arithmetic mean annual return of the S&P 500 since 1928 is 11.7%. But the geometric mean (what you actually earn) is 9.8%.

arithmetic_mean = df['Return'].mean()
geometric_mean = (np.prod(1 + df['Return'])) ** (1/len(df)) - 1

print(f"Arithmetic mean: {arithmetic_mean:.1%}")  # 11.7%
print(f"Geometric mean:  {geometric_mean:.1%}")   # 9.8%
print(f"Difference:      {arithmetic_mean - geometric_mean:.1%}")

Why the gap? Volatility drag. If you lose 50% and then gain 50%, you do not break even — you are down 25%. The geometric mean accounts for this compounding effect.
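The arithmetic behind volatility drag is easy to sanity-check in a few lines:

```python
# Volatility drag: symmetric percentage moves do not cancel out.
start = 100.0
after_crash = start * (1 - 0.50)          # a -50% year leaves 50.0
after_rebound = after_crash * (1 + 0.50)  # a +50% year only gets back to 75.0
loss = after_rebound / start - 1
print(f"Ending value: {after_rebound:.0f} ({loss:.0%})")  # down 25%
```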

Takeaway: When someone says "10% average returns," the number you will actually experience is closer to 9.8% before inflation (about 7% after inflation).

Finding 2: Negative Years Are More Common Than You Think

negative_years = df[df['Return'] < 0]
positive_years = df[df['Return'] >= 0]

print(f"Positive years: {len(positive_years)} ({len(positive_years)/len(df):.0%})")
print(f"Negative years: {len(negative_years)} ({len(negative_years)/len(df):.0%})")
print(f"Worst year: {df.loc[df['Return'].idxmin(), 'Year']} ({df['Return'].min():.1%})")
print(f"Best year:  {df.loc[df['Return'].idxmax(), 'Year']} ({df['Return'].max():.1%})")

Results:

  • Positive years: 70 out of 97 (72%)
  • Negative years: 27 out of 97 (28%)
  • Worst year: 1931 (-43.8%)
  • Best year: 1954 (+52.6%)

The market is down roughly 1 in every 4 years. If you invested in 2024, you should statistically expect about 8 negative years over the next 30.
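Treating each year as an independent draw at the historical loss rate (a simplification; real returns are not independent), the expected count falls out directly:

```python
# Expected number of down years over a 30-year horizon, assuming each
# year independently matches the historical 27-out-of-97 loss rate.
p_negative = 27 / 97
horizon = 30
expected_down = p_negative * horizon
print(f"Expected down years over {horizon}: {expected_down:.1f}")
```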

Finding 3: Rolling Returns Tell the Real Story

This is where it gets interesting. I calculated rolling returns for every possible holding period:

def rolling_returns(returns, window):
    """Annualized and cumulative returns for every window-length period."""
    results = []
    for i in range(len(returns) - window + 1):
        period = returns.iloc[i:i + window]
        cumulative = np.prod(1 + period)
        annualized = cumulative ** (1 / window) - 1
        results.append({
            'start_year': df['Year'].iloc[i],  # first year of the window (df from above)
            'annualized_return': annualized,
            'cumulative_return': cumulative - 1,
        })
    return pd.DataFrame(results)

for window in [1, 5, 10, 15, 20]:
    roll = rolling_returns(df['Return'], window)
    pct_positive = (roll['annualized_return'] > 0).mean()
    worst = roll['annualized_return'].min()
    best = roll['annualized_return'].max()
    print(f"{window:2d}-year: {pct_positive:5.1%} positive | "
          f"Worst: {worst:+.1%}/yr | Best: {best:+.1%}/yr")
Holding Period | % Positive | Worst Annualized | Best Annualized
---------------|------------|------------------|----------------
1 year         | 72%        | -43.8%           | +52.6%
5 years        | 87%        | -12.4%           | +28.6%
10 years       | 94%        | -1.4%            | +20.1%
15 years       | 100%       | +0.6%            | +18.9%
20 years       | 100%       | +3.1%            | +17.9%

There has never been a 15-year period where the S&P 500 lost money. Not during the Great Depression, not during the 2008 financial crisis, not during the dot-com bust.

Finding 4: Dollar-Cost Averaging vs Lump Sum

I simulated both strategies for every possible 20-year period:

def simulate_dca_vs_lump(returns, annual_investment, years):
    total_invested = annual_investment * years

    # DCA: invest same amount each year
    dca_balance = 0
    for i in range(years):
        dca_balance = (dca_balance + annual_investment) * (1 + returns.iloc[i])

    # Lump sum: invest everything in year 1
    lump_balance = total_invested
    for i in range(years):
        lump_balance *= (1 + returns.iloc[i])

    return dca_balance, lump_balance

Lump sum beats DCA approximately 68% of the time over 20-year periods. This aligns with Vanguard's 2012 study across US, UK, and Australian markets.

However — DCA is what most of us actually do. We invest from each paycheck. So the right comparison is really "invest immediately when you have money" vs "save up and invest later."
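For completeness, here is a sketch of tallying the lump-sum win rate across every rolling 20-year window. It redefines the simulation so the block is self-contained, and uses synthetic stand-in returns rather than the real df['Return'] series, so the exact percentage it prints will differ from the 68% figure above:

```python
import numpy as np
import pandas as pd

def simulate_dca_vs_lump(returns, annual_investment, years):
    # Same logic as above: DCA invests each year, lump sum invests all up front.
    total_invested = annual_investment * years
    dca_balance = 0.0
    for i in range(years):
        dca_balance = (dca_balance + annual_investment) * (1 + returns.iloc[i])
    lump_balance = float(total_invested)
    for i in range(years):
        lump_balance *= (1 + returns.iloc[i])
    return dca_balance, lump_balance

# Synthetic annual returns with roughly S&P-like mean and volatility.
rng = np.random.default_rng(0)
returns = pd.Series(rng.normal(0.10, 0.18, 97))

years = 20
windows = len(returns) - years + 1
lump_wins = 0
for start in range(windows):
    dca, lump = simulate_dca_vs_lump(returns.iloc[start:start + years], 10_000, years)
    lump_wins += lump > dca

print(f"Lump sum wins {lump_wins / windows:.0%} of {windows} windows")
```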

Finding 5: The Cost of Missing the Best Years

Missing just the 10 best years out of 97 reduces your ending balance by over 60%. The problem with market timing is that the best years often come right on the heels of the worst ones, during exactly the periods when you are most tempted to sell.
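A sketch of how you could measure that effect yourself, again with synthetic stand-in returns rather than the real series (so the printed percentage is illustrative only):

```python
import numpy as np
import pandas as pd

# Compare full-period growth vs growth with the 10 best years removed.
rng = np.random.default_rng(1)
returns = pd.Series(rng.normal(0.10, 0.18, 97))  # stand-in for df['Return']

full_growth = np.prod(1 + returns)
without_best = np.prod(1 + returns.drop(returns.nlargest(10).index))

reduction = 1 - without_best / full_growth
print(f"Missing the 10 best years cuts the ending balance by {reduction:.0%}")
```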

What This Data Means Practically

After running all this analysis, here is what I took away:

  1. 10% is roughly right, but plan for 7% after inflation to be conservative
  2. Hold for 15+ years and you have historically never lost money in the S&P 500
  3. Do not try to time the market — missing a few good years devastates returns
  4. Invest immediately when you have money rather than waiting for a dip
  5. Expect downturns — they happen 28% of the time and are normal

Running Your Own Scenarios

If you want to model specific scenarios with your own numbers — like "what if I invest $500/month for 25 years at 8%?" — the compound interest and investment calculators at aihowtoinvest.com let you play with different assumptions without writing code. Useful for quick what-if analysis.
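That monthly scenario is also only a few lines of Python. This sketch assumes contributions at the start of each month and monthly compounding of the 8% nominal rate, which is one of several conventions such calculators use:

```python
# $500 invested at the start of every month for 25 years,
# growing at 8% nominal annual, compounded monthly (an assumption).
monthly = 500
rate = 0.08 / 12
months = 25 * 12

balance = 0.0
for _ in range(months):
    balance = (balance + monthly) * (1 + rate)

print(f"Ending balance: ${balance:,.0f}")
```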

Full Code

The complete analysis notebook is about 200 lines of Python with pandas and numpy. The key insight from working with almost a century of data: the math strongly favors patience and consistency over cleverness.


Have you done your own analysis of market data? What surprised you most? I would love to see what other developers have found.
