DEV Community

Bill Tu

Writing 89 Tests for a Quantitative Trading Framework: Strategy and Trade-offs

Adding a test suite to an existing codebase is a different exercise than writing tests alongside new code. You're reverse-engineering the implicit contracts, discovering which behaviors are intentional and which are accidental, and deciding what's worth testing versus what's noise. This article covers how we built the test suite for QuantFlow, an open-source quantitative trading framework in Python, and the decisions behind each layer of tests.

The Testing Problem in Quant Systems

Quantitative trading code has a testing problem that most software doesn't: the outputs are floating-point numbers derived from financial time series, and "correct" is often a matter of degree. An SMA of [1, 2, 3, 4, 5] with period 5 should be exactly 3.0 — that's easy. But what should the RSI of a 50-bar synthetic price series be? You can't hardcode an expected value without coupling the test to the random seed and the exact implementation.

This creates a tension between two testing philosophies:

  1. Test against known reference values (precise but brittle)
  2. Test invariants and contracts (robust but less specific)

We use both, choosing based on the indicator.

Test Architecture

The suite is organized into five files, each targeting a different layer:

tests/
├── conftest.py          # Shared fixtures
├── test_indicators.py   # 46 tests — all 16 indicators
├── test_risk.py         # 12 tests — risk manager rules
├── test_portfolio.py    # 16 tests — execution and position tracking
├── test_strategies.py   # 8 tests  — built-in strategy signals
└── test_engine.py       # 9 tests  — end-to-end backtest integration

The dependency flows downward: indicators have no dependencies, risk and portfolio depend on core models, strategies depend on indicators, and the engine depends on everything. Tests follow the same order — if indicator tests fail, strategy and engine tests are meaningless.

Fixtures: Deterministic Randomness

The conftest.py provides four shared fixtures that generate synthetic market data with a fixed random seed:

@pytest.fixture
def sample_prices() -> np.ndarray:
    np.random.seed(42)
    returns = np.random.normal(0.001, 0.02, 50)
    prices = 100.0 * np.cumprod(1 + returns)
    return prices

The seed 42 makes tests deterministic — same data every run. The parameters (0.1% daily drift, 2% daily volatility) produce realistic-looking price series without being tied to any real market data. This matters because tests that depend on real market data break when the data source changes, and they can't be run offline.
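That determinism is easy to verify on its own — two calls with the same seed must produce identical arrays. A standalone check mirroring the fixture's recipe:

```python
import numpy as np

def make_prices(seed: int = 42, n: int = 50) -> np.ndarray:
    # Same recipe as the fixture: 0.1% daily drift, 2% daily volatility,
    # geometric compounding from a base price of 100
    np.random.seed(seed)
    returns = np.random.normal(0.001, 0.02, n)
    return 100.0 * np.cumprod(1 + returns)

assert np.array_equal(make_prices(), make_prices())       # same seed -> identical series
assert not np.array_equal(make_prices(), make_prices(7))  # different seed -> different series
```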

The sample_ohlcv fixture generates correlated OHLCV data where high > close > low holds approximately (with small random noise). This is important for indicators like ATR and Stochastic that use all four price fields — feeding them random uncorrelated data would test the math but not the real-world behavior.

Indicator Tests: Four Categories

The 46 indicator tests fall into four categories, plus a pair of regression tests targeting specific v0.2.0 bugs.

Warm-up Guards

Every indicator should return None when given insufficient data. This is the most important contract — it prevents strategies from acting on garbage values during the first few bars of a backtest.

def test_insufficient_data_returns_none(self):
    sma = SMA(period=20)
    assert sma.calculate(np.array([1.0, 2.0, 3.0])) is None

We test this for every indicator. It's repetitive, but it catches a common class of bugs: an indicator that returns 0.0 or NaN instead of None when it doesn't have enough data.
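The repetition can be collapsed with pytest.mark.parametrize. A sketch — the minimal SMA class here is a stand-in so the example runs on its own; in the real suite you would parametrize over QuantFlow's actual indicator classes:

```python
import numpy as np
import pytest

class SMA:
    """Minimal stand-in indicator, included only so this sketch is runnable."""
    def __init__(self, period: int):
        self.period = period

    def calculate(self, prices: np.ndarray):
        if len(prices) < self.period:
            return None  # the warm-up guard under test
        return float(np.mean(prices[-self.period:]))

@pytest.mark.parametrize("indicator_cls, period", [(SMA, 5), (SMA, 14), (SMA, 20)])
def test_insufficient_data_returns_none(indicator_cls, period):
    # Three bars is less than every period in the table above
    assert indicator_cls(period=period).calculate(np.array([1.0, 2.0, 3.0])) is None
```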

Known-Value Tests

For indicators with simple formulas, we test against hand-calculated values:

def test_known_value(self):
    prices = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    sma = SMA(period=5)
    assert sma.calculate(prices) == pytest.approx(3.0)

def test_roc_known_value(self):
    prices = np.array([100.0] * 13)
    prices[-1] = 110.0
    roc = ROC(period=12)
    assert roc.calculate(prices) == pytest.approx(10.0)

These are the strongest tests — they verify the actual computation. But they're only practical for indicators where you can compute the expected value by hand. For complex indicators like ADX or Ichimoku, hand-calculation is error-prone, so we use invariant tests instead.

Invariant Tests

For indicators where exact values are hard to predict, we test mathematical invariants that must always hold:

def test_range_0_to_100(self, sample_prices):
    rsi = RSI(period=14)
    val = rsi.calculate(sample_prices)
    assert 0.0 <= val <= 100.0

def test_all_gains_returns_100(self):
    prices = np.arange(1.0, 20.0)  # monotonically increasing
    rsi = RSI(period=14)
    assert rsi.calculate(prices) == pytest.approx(100.0)

def test_all_losses_returns_0(self):
    prices = np.arange(20.0, 1.0, -1.0)  # monotonically decreasing
    rsi = RSI(period=14)
    assert rsi.calculate(prices) == pytest.approx(0.0, abs=0.01)

RSI must be between 0 and 100. All-gain series must produce RSI of 100. All-loss series must produce RSI near 0. These invariants hold regardless of the implementation details.

For Bollinger Bands: upper > middle > lower (when prices have non-zero variance). For Stochastic: %K between 0 and 100. For Williams %R: between -100 and 0. These are mathematical properties of the formulas, not implementation details.
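The Bollinger ordering takes a few lines to check — the implementation below is a minimal reference version written for this sketch, not QuantFlow's:

```python
import numpy as np

def bollinger(prices: np.ndarray, period: int = 20, num_std: float = 2.0):
    """Minimal reference Bollinger Bands over the trailing window."""
    window = prices[-period:]
    middle = float(np.mean(window))
    offset = num_std * float(np.std(window))
    return middle + offset, middle, middle - offset

np.random.seed(42)
prices = 100.0 * np.cumprod(1 + np.random.normal(0.001, 0.02, 50))
upper, middle, lower = bollinger(prices)
assert upper > middle > lower  # strict ordering when variance is non-zero

flat = np.full(30, 100.0)
u, m, l = bollinger(flat)
assert u == m == l == 100.0    # bands collapse when variance is zero
```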

Consistency Tests

One particularly valuable test pattern verifies that calculate() and series() agree:

def test_calculate_matches_series_last(self, sample_prices):
    ema = EMA(period=12)
    calc_val = ema.calculate(sample_prices)
    series_val = ema.series(sample_prices)[-1]
    assert calc_val == pytest.approx(series_val, rel=1e-10)

After the v0.2.0 optimization where EMA.calculate() was rewritten to avoid building the full series, this test ensures the optimized path produces the same result as the reference implementation. It's a regression test disguised as a unit test.

The NaN Leakage Test

One test specifically targets a bug we fixed in v0.2.0:

def test_no_nan_leakage(self):
    prices = np.arange(1.0, 40.0)
    macd = MACD()
    m, s, h = macd.calculate(prices)
    if m is not None:
        assert not np.isnan(m)
        assert not np.isnan(s)
        assert not np.isnan(h)

The contract is: calculate() returns either valid floats or None. Never NaN. This test exists because NaN is insidious — it passes through arithmetic silently and makes comparisons return False, causing strategies to silently skip signals without any error.
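The failure mode is easy to demonstrate — NaN compares False against everything, so threshold logic silently stops firing:

```python
rsi_value = float("nan")  # an indicator that leaked NaN instead of returning None

# Every ordered comparison against NaN is False, so neither branch ever fires:
assert not (rsi_value < 30)    # "oversold" signal silently skipped
assert not (rsi_value > 70)    # "overbought" signal silently skipped
assert rsi_value != rsi_value  # NaN is not even equal to itself

# A None return fails fast instead: `None < 30` raises TypeError in Python 3,
# which is exactly the loud failure you want during development.
```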

The Stochastic Alignment Test

Another test targets the %D alignment bug from v0.2.0:

def test_d_alignment(self, sample_ohlcv):
    stoch = Stochastic(k_period=5, d_period=3)
    k_series, d_series = stoch.series(...)
    first_k = np.where(~np.isnan(k_series))[0][0]
    first_d = np.where(~np.isnan(d_series))[0][0]
    assert first_d == first_k + stoch.d_period - 1

%D is a moving average of %K, so it must start exactly d_period - 1 bars after %K begins. This is a structural invariant — if the alignment is wrong, the %D values are computed from the wrong %K windows.
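The arithmetic behind the invariant can be reproduced standalone. With a naive NaN-padded implementation (a sketch written for this check, not QuantFlow's code), the first valid %K lands at index k_period - 1 and the first valid %D exactly d_period - 1 bars later:

```python
import numpy as np

def stochastic_series(high, low, close, k_period=5, d_period=3):
    """Naive NaN-padded %K/%D, written only to check the alignment invariant."""
    n = len(close)
    k = np.full(n, np.nan)
    for i in range(k_period - 1, n):
        hh = high[i - k_period + 1:i + 1].max()
        ll = low[i - k_period + 1:i + 1].min()
        k[i] = 100.0 * (close[i] - ll) / (hh - ll)
    d = np.full(n, np.nan)
    for i in range(d_period - 1, n):
        window = k[i - d_period + 1:i + 1]
        if not np.isnan(window).any():
            d[i] = window.mean()
    return k, d

np.random.seed(42)
close = 100.0 * np.cumprod(1 + np.random.normal(0.001, 0.02, 30))
high, low = close * 1.01, close * 0.99
k, d = stochastic_series(high, low, close)
first_k = int(np.where(~np.isnan(k))[0][0])  # k_period - 1 = 4
first_d = int(np.where(~np.isnan(d))[0][0])  # first_k + d_period - 1 = 6
assert first_d == first_k + 3 - 1
```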

Risk Manager Tests: State Machine Behavior

The risk manager is a stateful component — once the drawdown circuit breaker triggers, it stays triggered until explicitly reset. The tests model this as a state machine:

def test_drawdown_halts_trading(self):
    signal = Signal(SignalType.BUY, size=10)
    result = self.rm.check(signal, 100.0, 75000.0, 100000.0, 0)
    assert result.type == SignalType.HOLD

def test_drawdown_halt_is_sticky(self):
    self.rm.check(Signal(SignalType.BUY, size=10), 100.0, 75000.0, 100000.0, 0)
    # Even if equity recovers, still halted
    result = self.rm.check(Signal(SignalType.BUY, size=10), 100.0, 95000.0, 100000.0, 0)
    assert result.type == SignalType.HOLD

def test_drawdown_reset(self):
    self.rm.check(Signal(SignalType.BUY, size=10), 100.0, 75000.0, 100000.0, 0)
    self.rm.reset()
    result = self.rm.check(Signal(SignalType.BUY, size=10), 100.0, 95000.0, 100000.0, 0)
    assert result.type == SignalType.BUY

Three tests, three states: normal → halted → reset. The "sticky" test is critical — it verifies that the halt persists even when equity recovers. Without this test, a bug that resets the halt flag on every call would go undetected.

The position sizing test uses exact arithmetic:

def test_position_sizing_caps_size(self):
    # equity=100000, max_pct=10% → max $10000 → at $100/share → max 100 shares
    signal = Signal(SignalType.BUY, size=500)
    result = self.rm.check(signal, 100.0, 100000.0, 100000.0, 0)
    assert result.size <= 100

The comment documents the math. This is intentional — risk management tests should be readable by someone who isn't a developer, because the rules they encode are business logic that a portfolio manager or risk officer should be able to verify.
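The math in that comment is the entire sizing rule, spelled out (the parameter name max_position_pct is assumed for illustration; the article only gives the percentage):

```python
equity = 100_000.0
max_position_pct = 0.10  # hypothetical config name; the article says "10%"
price = 100.0

# Cap: dollar exposure per position must not exceed 10% of equity
max_shares = int(equity * max_position_pct / price)
assert max_shares == 100  # so a requested size of 500 is capped to 100
```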

Portfolio Tests: Isolating Execution Mechanics

Portfolio tests use zero slippage to make assertions predictable:

def setup_method(self):
    self.portfolio = Portfolio(
        initial_capital=100000.0,
        commission_rate=0.001,
        slippage_rate=0.0,  # zero slippage for predictable tests
    )

This is a deliberate choice. Slippage adds randomness to fill prices, which makes exact assertions impossible. By setting it to zero, we can test the execution logic in isolation. Slippage behavior is tested separately at the engine level.
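To see why zero slippage matters, here is the cash arithmetic for a 100-share fill, done by hand using the setup's own parameters:

```python
initial_capital = 100_000.0
commission_rate = 0.001
fill_price = 100.0  # identical to the bar price because slippage_rate = 0.0
size = 100

cost = size * fill_price             # 10,000.00
commission = cost * commission_rate  # 10.00
cash_after = initial_capital - cost - commission

# With slippage, the fill price would be perturbed and this could only be
# asserted with a tolerance; with zero slippage it is exact:
assert cash_after == 89_990.0
```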

The most important portfolio test verifies the v0.2.0 fix for equity calculation:

def test_equity_reflects_current_price(self):
    signal = Signal(SignalType.BUY, size=100)
    self.portfolio.execute_signal(signal, "TEST", 100.0, self.ts)
    self.portfolio.update_market_prices("TEST", 120.0)
    assert self.portfolio.equity > self.portfolio.cash + 100 * 100.0

If equity used entry price (the old bug), equity would equal cash + 100 * 100. With the fix, it equals cash + 100 * 120, which is strictly greater. The test doesn't check the exact value — it checks the relationship, which is more robust.

The short margin test verifies the v0.2.0 safety fix:

def test_short_margin_check(self):
    self.portfolio.short_margin_rate = 1.0
    signal = Signal(SignalType.SELL, size=2000)  # $200k margin > $100k cash
    result = self.portfolio.execute_signal(signal, "TEST", 100.0, self.ts)
    assert result is None

This test exists because the absence of a margin check was a real bug. The test documents the expected behavior: shorts that exceed available margin are rejected.

Strategy Tests: Signal Generation

Strategy tests face a unique challenge: strategies are stateful (they track previous indicator values for crossover detection), and their outputs depend on the full price history. Testing them requires constructing price series that produce known indicator states.

def test_buy_on_oversold(self):
    strategy = RSIMeanReversion(period=14, oversold=30, size=100)
    portfolio = Portfolio()
    prices = np.linspace(200, 100, 50)  # strong downtrend
    bar = make_bar(100.0, prices)
    signal = strategy.on_bar(bar, portfolio)
    assert signal.type == SignalType.BUY

A monotonically decreasing price series guarantees RSI will be very low (near 0), which is well below the oversold threshold of 30. We don't need to know the exact RSI value — we just need to know it's below 30, which is guaranteed by the price pattern.
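make_bar is a test helper the article doesn't show; a plausible sketch (the Bar fields here are assumptions made for illustration, not QuantFlow's actual model):

```python
from dataclasses import dataclass
from datetime import datetime
import numpy as np

@dataclass
class Bar:
    # Hypothetical bar shape for this sketch; the real framework's Bar may differ
    timestamp: datetime
    close: float
    history: np.ndarray  # all closes up to and including this bar

def make_bar(price: float, history: np.ndarray) -> Bar:
    """Build a bar whose close is `price`, carrying the price history the
    strategy needs for its indicator calculations."""
    return Bar(timestamp=datetime(2024, 1, 1), close=price,
               history=np.asarray(history))
```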

The SMA crossover test is more involved because it requires a crossover event:

def test_generates_buy_on_golden_cross(self):
    strategy = SMACrossover(fast_period=3, slow_period=5, size=100)
    portfolio = Portfolio()
    prices_down = np.linspace(120, 100, 20)
    prices_up = np.linspace(100, 130, 10)
    prices = np.concatenate([prices_down, prices_up])
    signals = []
    for i in range(len(prices)):
        bar = make_bar(prices[i], prices[:i + 1])
        signals.append(strategy.on_bar(bar, portfolio))
    buy_signals = [s for s in signals if s.type == SignalType.BUY]
    assert len(buy_signals) > 0

The price series goes down then sharply up. At some point during the upturn, the fast SMA (3-period) will cross above the slow SMA (5-period). We don't assert exactly when — just that it happens at least once. This makes the test resilient to minor changes in the crossover detection logic.

Engine Tests: End-to-End Integration

Engine tests verify that all components work together. They use an InMemoryDataFeed to avoid file I/O:

class InMemoryDataFeed(DataFeed):
    def __init__(self, df: pd.DataFrame):
        self._df = df

    def load(self) -> pd.DataFrame:
        return self._df

The most valuable engine test verifies that commission has a measurable effect:

def test_commission_reduces_equity(self, data_feed):
    r1 = BacktestEngine(..., commission=0.0, ...).run()
    r2 = BacktestEngine(..., commission=0.01, ...).run()
    assert r2.equity_curve[-1] <= r1.equity_curve[-1]

This is an invariant test at the system level: higher commission should never increase final equity. It doesn't test the exact commission calculation — that's covered in portfolio tests — but it verifies that commission flows through the entire pipeline correctly.

Edge case tests verify the system doesn't crash on degenerate inputs:

def test_single_bar_no_crash(self):
    df = pd.DataFrame({...}, index=pd.DatetimeIndex([datetime(2024, 1, 1)]))
    result = BacktestEngine(data_feed=InMemoryDataFeed(df), ...).run()
    assert len(result.equity_curve) == 1

def test_zero_volume_no_crash(self):
    df = pd.DataFrame({..., "volume": np.zeros(n), ...})
    result = BacktestEngine(data_feed=InMemoryDataFeed(df), ...).run()
    assert result.total_return == pytest.approx(0.0)

These tests exist because real market data contains edge cases: single-bar datasets from API errors, zero-volume bars from illiquid instruments, gaps in timestamps. The system should handle all of these without raising exceptions.
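Constructing those degenerate inputs is cheap with pandas — sketches of the two frames (column names are assumed to match the feed's schema):

```python
from datetime import datetime
import numpy as np
import pandas as pd

# A one-bar OHLCV frame, the kind an API error might hand you
single_bar = pd.DataFrame(
    {"open": [100.0], "high": [101.0], "low": [99.0],
     "close": [100.5], "volume": [0.0]},
    index=pd.DatetimeIndex([datetime(2024, 1, 1)]),
)

# A flat, zero-volume frame for an illiquid instrument
n = 10
zero_volume = pd.DataFrame(
    {"open": np.full(n, 100.0), "high": np.full(n, 100.0),
     "low": np.full(n, 100.0), "close": np.full(n, 100.0),
     "volume": np.zeros(n)},
    index=pd.date_range("2024-01-01", periods=n, freq="D"),
)

assert len(single_bar) == 1
assert (zero_volume["volume"] == 0).all()
```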

What We Chose Not to Test

A few deliberate omissions:

We don't test the YahooDataFeed because it makes HTTP requests to an external API. Testing it would require either mocking the HTTP layer (which tests the mock, not the code) or hitting the real API (which is slow, flaky, and rate-limited). The CSVDataFeed is tested indirectly through the engine tests.

We don't test the LiveEngine because it runs an infinite loop with time.sleep(). Testing it properly would require threading, timeouts, and mock brokers — significant infrastructure for a component that's architecturally identical to the backtest engine but with a different data source.

We don't test the plot() method on PerformanceReport because visual output is best verified by humans. We could assert that it doesn't raise an exception, but that's a weak test that adds maintenance burden.

Running the Suite

pip install pytest
python -m pytest tests/ -v
============================= 89 passed in 4.87s ==============================

All 89 tests run in under 5 seconds, with no external dependencies, no network calls, and no file I/O — the fixtures generate everything in memory. This matters for CI: fast tests get run; slow tests get skipped.


The test suite is part of QuantFlow v0.3.0. Contributions and additional test cases are welcome.
