DEV Community

Ayrat Murtazin

Building a Statistical Arbitrage Strategy from Scratch in Python

Statistical arbitrage sits at the intersection of statistics, signal processing, and market microstructure. Rather than predicting the future outright, it identifies systematic deviations from an estimated fair value and exploits the mean-reversion or continuation that follows. One of the cleanest entry points into this world is polynomial regression — fitting a smooth trend curve to price history and trading the spread between actual price and that estimated trend. It sounds simple, and the mathematics genuinely is, but the implementation details — particularly avoiding lookahead bias — separate a publishable backtest from a profitable one.

In this article you will build a complete, bias-free trading system around a cubic polynomial trend filter. You will download real price data with yfinance, fit expanding-window polynomial regressions to eliminate any forward-looking contamination, generate long/flat signals based on price position and slope direction, run a full backtest against a buy-and-hold benchmark, and visualize every step with production-quality charts. All performance metrics — CAGR, Sharpe ratio, maximum drawdown, and more — are computed from scratch so you understand exactly what each number means.


Most algo trading content gives you theory.
This gives you the code.

3 Python strategies. Fully backtested. Colab notebook included.
Plus a free ebook with 5 more strategies the moment you subscribe.

5,000 quant traders already run these:

Subscribe | AlgoEdge Insights


This article covers:

  • **Section 1 — Core Concept:** What polynomial trend filtering is, the intuition behind cubic regression, and why walk-forward fitting is non-negotiable for honest backtesting
  • **Section 2 — Python Implementation:** End-to-end code covering setup and parameters (2.1), expanding-window cubic regression (2.2), signal generation and backtest engine (2.3), and visualization (2.4)
  • **Section 3 — Results and Analysis:** What the strategy actually delivers, how it compares to buy-and-hold, and what the performance metrics reveal
  • **Section 4 — Use Cases:** Where this approach fits in a real quant workflow
  • **Section 5 — Limitations and Edge Cases:** Honest constraints, failure modes, and what to watch for in live deployment

1. Polynomial Trend Filtering as a Trading Signal

A moving average is essentially a linear opinion about where price "should" be right now. It works, but it is rigid — it cannot capture curvature. When a stock is in an accelerating uptrend or a decelerating selloff, a straight-line estimate lags badly and generates false signals. A polynomial regression of degree three — a cubic — adds two extra degrees of freedom, allowing the fitted curve to bend and flex with the actual shape of the price series. Think of it as upgrading from a ruler to a flexible drafting curve.

The core idea is straightforward. For each day in the backtest, you fit a cubic polynomial to all price data available up to and including that day. The fitted value at the final point of that window is your "fair value" estimate. If the actual closing price is above fair value and the slope of the polynomial is positive (momentum confirmation), you hold a long position. If either condition breaks, you step to cash. The position is binary: fully invested or fully flat.

The mathematical form is P(t) = a₀ + a₁t + a₂t² + a₃t³, where the coefficients are solved by ordinary least squares. NumPy's polyfit handles this in one line. What matters more than the algebra is the discipline of never using future data. Every coefficient must be estimated using only the observations that a real trader would have had access to on that calendar date. This is the walk-forward, or expanding-window, constraint — and ignoring it is the single most common source of inflated backtest results.

The slope of the polynomial at the current time point is simply the first derivative evaluated at the endpoint: P'(t) = a₁ + 2a₂t + 3a₃t². A positive slope means the trend curve is still climbing, which is your momentum filter. Combining position-relative-to-curve with slope direction gives a two-condition signal that filters out many of the whipsaws you would get from price-versus-trend alone.
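
Before wiring this into a backtest loop, it helps to see both quantities on toy data. The sketch below uses a synthetic series and illustrative variable names; it fits one cubic with np.polyfit and reads off the fitted "fair value" and first derivative at the final bar:

```python
import numpy as np

# synthetic accelerating price path: quadratic trend plus a small wiggle
t = np.arange(100, dtype=float)
prices = 50 + 0.02 * t**2 + np.sin(t / 5.0)

coeffs = np.polyfit(t, prices, deg=3)    # coefficients, highest degree first
poly   = np.poly1d(coeffs)
fair   = float(poly(t[-1]))              # fitted fair value at the last bar
slope  = float(poly.deriv()(t[-1]))      # P'(t) evaluated at the last bar

print(f"fair value ≈ {fair:.2f}, slope ≈ {slope:.3f}")
```

For this accelerating series the endpoint slope comes out clearly positive, so the momentum condition of the signal would be satisfied.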

2. Python Implementation

2.1 Setup and Parameters

The strategy has a small set of configurable parameters. TICKER and START_DATE define the universe and history window. MIN_WINDOW sets the minimum number of observations required before the first regression is attempted — too small and the cubic fit is meaningless. POLY_DEG is fixed at 3 throughout, but you can easily sweep it. TC is the one-way transaction cost applied on every position change.

# ── dependencies ──────────────────────────────────────────────────────────────
import numpy as np
import pandas as pd
import yfinance as yf
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from matplotlib.patches import Patch
import warnings
warnings.filterwarnings("ignore")

# ── configurable parameters ───────────────────────────────────────────────────
TICKER     = "PLTR"        # any yfinance-compatible symbol
START_DATE = "2021-01-01"
END_DATE   = None          # None → download through the latest available bar
POLY_DEG   = 3             # cubic polynomial
MIN_WINDOW = 60            # minimum bars before first fit (≈ 3 months)
TC         = 0.001         # one-way transaction cost (0.1 %)

# ── data download ─────────────────────────────────────────────────────────────
raw = yf.download(TICKER, start=START_DATE, end=END_DATE, auto_adjust=True)
price = raw["Close"].squeeze().dropna()
price.index = pd.to_datetime(price.index)
print(f"Downloaded {len(price)} daily bars for {TICKER} "
      f"({price.index[0].date()}{price.index[-1].date()})")

Implementation chart

2.2 Expanding-Window Cubic Regression

This is the engine of the strategy. For every bar from MIN_WINDOW onward, the function fits a degree-3 polynomial to the integer time index [0, 1, 2, ..., t] and returns the fitted value and slope at the rightmost point. Using an integer index rather than raw dates keeps the conditioning of the design matrix well-behaved and makes the derivative calculation trivial.

def fit_cubic(prices: np.ndarray) -> tuple[float, float]:
    """
    Fit a cubic polynomial to `prices` using an integer time index.
    Returns (fitted_value_at_end, slope_at_end).
    """
    t = np.arange(len(prices), dtype=float)
    coeffs = np.polyfit(t, prices, deg=POLY_DEG)   # highest degree first
    poly   = np.poly1d(coeffs)
    dpoly  = poly.deriv()                           # first derivative
    t_end  = t[-1]
    return float(poly(t_end)), float(dpoly(t_end))


# ── expanding-window loop ─────────────────────────────────────────────────────
n = len(price)
fitted_vals = np.full(n, np.nan)
slopes      = np.full(n, np.nan)

for i in range(MIN_WINDOW, n):
    window_prices = price.iloc[: i + 1].values
    fitted_vals[i], slopes[i] = fit_cubic(window_prices)

fitted_series = pd.Series(fitted_vals, index=price.index)
slope_series  = pd.Series(slopes,      index=price.index)

print(f"First valid signal date: {fitted_series.dropna().index[0].date()}")

2.3 Signal Generation and Backtest Engine

The signal is binary. A value of 1 means long (invested), 0 means flat (cash). The two conditions must both be true simultaneously: price is above the cubic fit, and the slope is positive. Transaction costs are applied on every day where the position changes — entering or exiting. The daily strategy return is the product of the position (lagged by one day to prevent same-bar execution) and the asset's log return, minus any cost incurred on that day.

# ── signal logic ──────────────────────────────────────────────────────────────
above_trend    = (price.values > fitted_vals).astype(float)
positive_slope = (slopes > 0).astype(float)
raw_signal     = above_trend * positive_slope        # 1 = long, 0 = flat
signal         = pd.Series(raw_signal, index=price.index)

# lag by one bar: today's signal executed at tomorrow's open (approximated)
position = signal.shift(1).fillna(0)

# ── returns and transaction costs ─────────────────────────────────────────────
log_ret  = np.log(price / price.shift(1)).fillna(0)
trades   = position.diff().abs().fillna(0)           # 1 on entry/exit days
cost     = trades * TC

strat_ret = position * log_ret - cost
bh_ret    = log_ret.copy()

# ── cumulative performance ────────────────────────────────────────────────────
cum_strat = np.exp(strat_ret.cumsum()) - 1
cum_bh    = np.exp(bh_ret.cumsum()) - 1

# ── performance metrics ───────────────────────────────────────────────────────
def performance_metrics(log_returns: pd.Series, label: str) -> dict:
    ann_factor = 252
    total_ret  = np.exp(log_returns.sum()) - 1
    n_years    = len(log_returns) / ann_factor
    cagr       = (1 + total_ret) ** (1 / n_years) - 1
    ann_vol    = log_returns.std() * np.sqrt(ann_factor)
    sharpe     = (log_returns.mean() / log_returns.std()) * np.sqrt(ann_factor)
    roll_max   = np.exp(log_returns.cumsum()).cummax()
    drawdown   = (np.exp(log_returns.cumsum()) / roll_max) - 1
    max_dd     = drawdown.min()
    return {"Strategy": label, "Total Return": f"{total_ret:.1%}",
            "CAGR": f"{cagr:.1%}", "Ann. Volatility": f"{ann_vol:.1%}",
            "Sharpe (rf=0)": f"{sharpe:.2f}", "Max Drawdown": f"{max_dd:.1%}"}

results = pd.DataFrame([
    performance_metrics(strat_ret, f"{TICKER} Cubic Trend"),
    performance_metrics(bh_ret,    f"{TICKER} Buy & Hold"),
]).set_index("Strategy")

print("\n── Performance Summary ──────────────────────────────────────────")
print(results.to_string())

2.4 Visualization

The chart below plots four layers on a dark background: the raw price series in white, the cubic trend filter in amber, buy/sell signal markers, and a shaded region highlighting periods when the strategy is in a long position. A separate lower panel shows cumulative returns for the strategy versus buy-and-hold so you can immediately see when the trend filter adds or destroys value.

plt.style.use("dark_background")
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 9),
                                gridspec_kw={"height_ratios": [3, 1.5]},
                                sharex=True)
fig.suptitle(f"{TICKER} — Cubic Polynomial Trend Strategy", fontsize=14,
             fontweight="bold", color="white", y=0.98)

# ── upper panel: price + trend + signals ──────────────────────────────────────
ax1.plot(price.index, price.values, color="white",  lw=0.9, label="Price")
ax1.plot(fitted_series.index, fitted_series.values,
         color="#FFA500", lw=1.6, label="Cubic Fit", alpha=0.85)

# shade in-position periods
in_pos = position.astype(bool)
for start_i, end_i in zip(
        price.index[in_pos & ~in_pos.shift(1, fill_value=False)],
        price.index[in_pos & ~in_pos.shift(-1, fill_value=False)]):
    ax1.axvspan(start_i, end_i, color="#00FF7F", alpha=0.07)

# entry / exit markers
entries = price.index[(position.diff() ==  1)]
exits   = price.index[(position.diff() == -1)]
ax1.scatter(entries, price.loc[entries], marker="^", color="#00FF7F",
            s=55, zorder=5, label="Long Entry")
ax1.scatter(exits,   price.loc[exits],   marker="v", color="#FF4444",
            s=55, zorder=5, label="Exit")

ax1.set_ylabel("Price (USD)", color="white")
ax1.legend(loc="upper left", fontsize=8)
ax1.grid(alpha=0.15)

# ── lower panel: cumulative returns ──────────────────────────────────────────
ax2.plot(cum_strat.index, cum_strat.values * 100,
         color="#00BFFF", lw=1.6, label="Strategy")
ax2.plot(cum_bh.index,    cum_bh.values * 100,
         color="#888888", lw=1.0, label="Buy & Hold", linestyle="--")
ax2.axhline(0, color="white", lw=0.5, linestyle=":")
ax2.set_ylabel("Cumulative Return (%)", color="white")
ax2.set_xlabel("Date", color="white")
ax2.legend(loc="upper left", fontsize=8)
ax2.grid(alpha=0.15)
ax2.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m"))
plt.xticks(rotation=30, ha="right")

plt.tight_layout()
plt.savefig("cubic_trend_strategy.png", dpi=150, bbox_inches="tight")
plt.show()

Figure 1. Upper panel: PLTR daily close price (white), expanding-window cubic trend fit (amber), long entry markers (green triangles), and exit markers (red triangles) with shaded long-position windows. Lower panel: cumulative return of the cubic trend strategy (blue) versus passive buy-and-hold (grey dashed), net of 0.1% one-way transaction costs.


Enjoying this strategy so far? This is only a taste of what's possible.

Go deeper with my newsletter: longer, more detailed articles + full Google Colab implementations for every approach.

Or get everything in one powerful package with AlgoEdge Insights: 30+ Python-Powered Trading Strategies — The Complete 2026 Playbook — it comes with detailed write-ups + dedicated Google Colab code/links for each of the 30+ strategies, so you can code, test, and trade them yourself immediately.

Exclusive for readers: 20% off the book with code MEDIUM20.

Join newsletter for free or Claim Your Discounted Book and take your trading to the next level!

3. Results and Performance Analysis

On PLTR daily data from January 2021 through mid-2025, the cubic trend strategy demonstrates its primary purpose: capital preservation during sustained drawdowns. The strategy steps to cash during extended downtrends — a critical property for a high-volatility single-name stock that experienced an 80%+ peak-to-trough decline in 2021–2022. Buy-and-hold captures full upside but forces the investor to ride every correction. The trend filter sacrifices some of that upside in exchange for a meaningfully lower maximum drawdown and more consistent risk-adjusted returns.

In terms of concrete metrics, expect the strategy Sharpe ratio to improve over buy-and-hold on volatile names simply because the annualized volatility shrinks when you are flat for extended periods — even if the total return figure is comparable or slightly lower. The transaction cost assumption of 0.1% per side is conservative for liquid US equities and represents the realistic friction of a small retail account. For institutional sizing, costs would be lower; for illiquid names, they would be higher.

The most instructive output is not the final return number but the periods of divergence between the two equity curves. When the strategy underperforms, it is almost always during a sharp V-shaped recovery — the cubic fit is slow to respond and keeps you in cash while price rebounds. That is the fundamental trade-off: smoothness versus responsiveness. Adjusting MIN_WINDOW downward or switching from a cubic to a quadratic (POLY_DEG=2) makes the filter more responsive but also noisier, increasing trade count and friction costs. This parameter sensitivity is worth exploring systematically via a grid search before committing to any configuration.
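
As a starting point for that exploration, a minimal grid-search sketch might look like the following. `backtest_sharpe` is a hypothetical helper of my own naming that condenses the logic of Sections 2.2–2.3 into one function; the grid values are illustrative, and a serious sweep should score each configuration on out-of-sample data, not the full history.

```python
import numpy as np
import pandas as pd

def backtest_sharpe(price: pd.Series, deg: int, min_window: int,
                    tc: float = 0.001) -> float:
    """Rerun the expanding-window strategy for one (degree, window)
    pair and return the annualized Sharpe ratio (rf = 0)."""
    n = len(price)
    fitted = np.full(n, np.nan)
    slopes = np.full(n, np.nan)
    for i in range(min_window, n):
        t = np.arange(i + 1, dtype=float)
        p = np.poly1d(np.polyfit(t, price.iloc[: i + 1].values, deg=deg))
        fitted[i] = p(t[-1])
        slopes[i] = p.deriv()(t[-1])
    signal = pd.Series(((price.values > fitted) & (slopes > 0)).astype(float),
                       index=price.index)
    position = signal.shift(1).fillna(0)             # lag to avoid lookahead
    log_ret  = np.log(price / price.shift(1)).fillna(0)
    strat    = position * log_ret - position.diff().abs().fillna(0) * tc
    sd = strat.std()
    return float(strat.mean() / sd * np.sqrt(252)) if sd > 0 else 0.0

# illustrative grid — sweep degree and minimum window together
grid = [(deg, w) for deg in (1, 2, 3) for w in (40, 60, 90)]
# results = {(deg, w): backtest_sharpe(price, deg, w) for deg, w in grid}
```

Sorting the resulting dictionary by Sharpe gives a quick sensitivity map, but treat any single best cell with suspicion: a configuration that wins only at one grid point is usually noise.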

4. Use Cases

  • Single-name equity trend following. The cubic filter works well on high-beta growth names (semiconductors, software, biotech) where trends are persistent but punctuated by sharp drawdowns. It acts as a systematic stop-loss that re-enters when momentum resumes.

  • ETF rotation systems. Apply the signal independently to a basket of sector or factor ETFs and allocate capital to whichever names pass both conditions. This converts a single-instrument strategy into a cross-sectional momentum engine with a natural diversification benefit.

  • Pre-processing layer for ensemble models. The cubic fit value and its slope are clean, continuous features that can be fed into a machine learning classifier (Random Forest, XGBoost) alongside other technical indicators. They encode both trend level and trend momentum in two numbers.

  • Regime filtering for other strategies. Use the slope condition alone — positive or negative — as a market regime indicator to switch a mean-reversion strategy on or off. Many mean-reversion approaches fail in trending markets; gating them with a polynomial slope filter recovers a significant portion of those losses.
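
The rotation and regime-filter ideas above need nothing beyond the two conditions already defined. A hedged sketch (function names are mine, not from any library): given a basket of price series, report which names currently pass the filter and would receive capital in a rotation system.

```python
import numpy as np
import pandas as pd

def passes_trend_filter(prices: pd.Series, deg: int = 3) -> bool:
    """True if the latest close sits above the polynomial fit AND the
    endpoint slope is positive — the two-condition signal from Section 2.3."""
    t = np.arange(len(prices), dtype=float)
    p = np.poly1d(np.polyfit(t, prices.values, deg=deg))
    return bool(prices.iloc[-1] > p(t[-1]) and p.deriv()(t[-1]) > 0)

def rotation_candidates(basket: dict[str, pd.Series]) -> list[str]:
    """Hypothetical rotation step: list the names that currently pass."""
    return [name for name, px in basket.items() if passes_trend_filter(px)]
```

A rotation layer would then equal-weight the survivors (or sit in cash when the list is empty), rebalancing at whatever cadence the cost budget allows.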

5. Limitations and Edge Cases

Curve-fitting risk at short windows. When MIN_WINDOW is small, a cubic polynomial has enough flexibility to fit noise rather than trend. The fitted value becomes unstable and the slope can flip sign on consecutive bars, generating excessive trades. Always validate that your minimum window is large enough relative to the dominant cycle length in your target asset.
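
One cheap diagnostic for this, sketched below with illustrative names: measure how often the endpoint slope flips sign from one bar to the next as the window expands. A smooth trend should give a flip rate near zero; a rate of several percent suggests the window is too short for the chosen degree.

```python
import numpy as np

def slope_flip_rate(prices: np.ndarray, min_window: int, deg: int = 3) -> float:
    """Fraction of bars where the expanding-window fit's endpoint slope
    changes sign versus the previous bar (a whipsaw proxy)."""
    slopes = []
    for i in range(min_window, len(prices)):
        t = np.arange(i + 1, dtype=float)
        p = np.poly1d(np.polyfit(t, prices[: i + 1], deg=deg))
        slopes.append(p.deriv()(t[-1]))
    s = np.sign(slopes)
    return float(np.mean(s[1:] != s[:-1])) if len(s) > 1 else 0.0
```

Comparing this rate across candidate MIN_WINDOW values on your target asset gives a data-driven floor for the window before any return-based optimization.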

Computational cost at scale. The expanding-window loop runs a least-squares fit on an ever-growing array. For a single ticker over a few years this takes seconds. For a universe of 500 names over 20 years, the naive Python loop becomes a bottleneck. Consider vectorizing with np.lib.stride_tricks or parallelizing across tickers with concurrent.futures.
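
A minimal sketch of the per-ticker parallelization (function names and the worker count are illustrative; the inner loop mirrors Section 2.2):

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def expanding_cubic(prices: np.ndarray, min_window: int = 60):
    """Expanding-window cubic fit for one price array (same loop as 2.2)."""
    n = len(prices)
    fitted = np.full(n, np.nan)
    slopes = np.full(n, np.nan)
    for i in range(min_window, n):
        t = np.arange(i + 1, dtype=float)
        p = np.poly1d(np.polyfit(t, prices[: i + 1], deg=3))
        fitted[i], slopes[i] = p(t[-1]), p.deriv()(t[-1])
    return fitted, slopes

def fit_universe(universe: dict[str, np.ndarray], workers: int = 4):
    """Fan the per-ticker loops out across processes; tickers are
    independent, so the problem is embarrassingly parallel."""
    with ProcessPoolExecutor(max_workers=workers) as ex:
        futures = {name: ex.submit(expanding_cubic, px)
                   for name, px in universe.items()}
        return {name: f.result() for name, f in futures.items()}
```

Because each ticker's loop touches no shared state, throughput scales close to linearly with cores; joblib or a vectorized rolling-regression formulation are reasonable alternatives.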

Non-stationarity of the optimal degree. A cubic may be appropriate for PLTR in a bull market but overfit for a utility stock with mean-reverting prices. The polynomial degree and window length should be re-estimated periodically using out-of-sample validation, not fixed permanently.

Gap risk on binary positions. The strategy is fully invested or fully flat. A large overnight gap down — an earnings miss, a macro shock — hits a long position with no partial mitigation. Adding position sizing (e.g., scaling exposure by inverse volatility) would materially reduce tail risk.
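
One way to add that sizing layer, sketched under the assumption that `position` and `log_ret` exist as defined in Section 2.3 (the 20% volatility target and 20-day lookback below are illustrative defaults, not recommendations):

```python
import numpy as np
import pandas as pd

def vol_scaled_position(position: pd.Series, log_ret: pd.Series,
                        target_vol: float = 0.20, lookback: int = 20,
                        max_leverage: float = 1.0) -> pd.Series:
    """Scale the binary long/flat signal by inverse realized volatility:
    exposure shrinks whenever recent vol exceeds the annualized target."""
    realized = log_ret.rolling(lookback).std() * np.sqrt(252)
    scale = (target_vol / realized).clip(upper=max_leverage)
    return (position * scale).fillna(0)
```

The strategy return then becomes `sized * log_ret` minus costs charged on `|Δsized|`, so turnover costs accrue on fractional rebalances as well as on full entries and exits.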

Survivorship and selection bias. Testing on PLTR, which survived and ultimately rallied, understates the real risk of applying this to individual stocks. Backtest on a diversified basket and include delisted names to get an honest picture of expected performance.

Concluding Thoughts

Polynomial regression trend filtering is one of the most transparent quantitative strategies you can build. Every assumption is explicit in the code, the mathematics is undergraduate-level, and the bias-free expanding-window construction means the backtest results are genuinely informative rather than optimistically backward-fitted. That combination of simplicity and intellectual honesty makes it an excellent foundation for more sophisticated systems.

The natural next experiments are: sweeping POLY_DEG from 1 to 5 and comparing Sharpe ratios out-of-sample, adding a volatility scaling layer so position size contracts during high-VIX regimes, and replacing the binary long/flat signal with a continuous allocation proportional to the normalized distance between price and the cubic fit. Each of these extensions is a single function addition to the code you already have.
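
The third of those extensions fits in a few lines. The sketch below uses names of my own choosing and assumes `price`, `fitted_series`, and `slope_series` from Section 2.2; the 20-day normalization window is illustrative.

```python
import numpy as np
import pandas as pd

def continuous_allocation(price: pd.Series, fitted: pd.Series,
                          slope: pd.Series, lookback: int = 20,
                          cap: float = 1.0) -> pd.Series:
    """Replace the binary signal with an allocation proportional to the
    z-scored distance between price and the fit, gated by slope sign."""
    dist = price - fitted
    z = dist / dist.rolling(lookback).std()
    alloc = z.clip(lower=0.0, upper=2.0) / 2.0    # map [0, 2σ] → [0, 1]
    alloc = alloc.where(slope > 0, 0.0)           # slope gate, else flat
    return alloc.clip(upper=cap).fillna(0.0)
```

Feeding this through the same lag-and-cost accounting as Section 2.3 gives a smoother equity curve at the price of more frequent fractional rebalancing.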

If you found this walkthrough useful, the same notebook-style treatment is applied to nineteen other strategies in the AlgoEdge Colab Vault — including a 3-State Hidden Markov Model volatility filter, a Kalman adaptive trend tracker, and a pairs cointegration engine. Each comes as a fully executable Google Colab notebook with real data, out-of-sample tests, and CSV export. Follow for the next article in this series, where we build the HMM volatility filter from scratch.

