I have been running an ICT-based reversal strategy live on US500 for a few months. The strategy itself is fine, but the bottleneck was nowhere near the strategy logic. It was in the backtest harness. A 30-day single-instrument simulation took 27 minutes when I wrote the first version. Iterating on parameters was painful, exploring alternative setups was effectively impossible.
After two evenings of profiling and one targeted change, the same 30-day backtest now runs in 8.9 seconds. That is a 184× speedup, and the change was almost embarrassingly small.
This is the story of what was slow, why it was slow, and the cache-plus-bisect pattern that fixed it. If you write your own backtesting code in Python, you are very probably leaving a similar speedup on the table.
The setup
The strategy is a Smart Money Reversal style entry with LRB (liquidity-run break) re-entries. The harness is a fairly standard event-driven loop: for each minute bar in the historical data, we evaluate signal conditions, manage open positions, check pyramid re-entries, and update P&L. The data is roughly 7000 minute bars per US500 trading day; across 30 days that is around 210k bars per simulation.
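To make the shape of that loop concrete, here is a hypothetical, stripped-down sketch of an event-driven harness. The `Bar` class and the entry/exit rules are illustrative stand-ins, not the actual strategy:

```python
# A toy event-driven backtest loop: one pass over the bars, with
# per-bar signal evaluation and position management. The rules here
# (enter after a down-bar, exit on an up-bar) are placeholders.
from dataclasses import dataclass

@dataclass
class Bar:
    open: float
    high: float
    low: float
    close: float

def run_backtest(bars):
    position = None   # entry price, or None when flat
    pnl = 0.0
    for bar in bars:
        # manage the open position: toy exit rule, close on any up-bar
        if position is not None and bar.close > bar.open:
            pnl += bar.close - position
            position = None
        # toy entry rule: go long after a down-bar
        elif position is None and bar.close < bar.open:
            position = bar.close
    return pnl

bars = [Bar(10.0, 11.0, 9.0, 9.0), Bar(9.0, 11.0, 8.5, 10.0)]
print(run_backtest(bars))  # entered at 9.0, exited at 10.0 -> 1.0
```

Everything that follows is about making one pass of a loop like this cheap, because the loop body runs once per bar, 210k times per simulation.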
210k bars in 27 minutes is 130 bars per second, which is laughable for what is essentially a tight numeric loop in Python. Even with pandas overhead I expected 10× better. Time to profile.
The profiler told a clear story
I dropped cProfile in front of the harness and got the breakdown. The top function by cumulative time was not the strategy evaluator or the order manager. It was pandas.tslib.tz_convert, called from inside the bar iterator. Specifically:
```python
for ts, bar in bars.iterrows():
    local_ts = ts.tz_convert('America/New_York')
    if is_in_session(local_ts):
        ...
```
The naive code converts the bar timestamp to NY time on every single iteration, and pandas timestamp conversion is not free: it runs through a tzdata lookup, calculates DST offsets, and allocates new Timestamp objects. For a single call that is microseconds, no problem. Called 210k times per backtest, you are suddenly spending eight or nine minutes inside pandas' C extensions before even hitting your own code.
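For reference, reproducing this kind of breakdown takes only the standard library. A minimal sketch, with a dummy `hot_loop` standing in for the harness:

```python
import cProfile
import io
import pstats

def hot_loop():
    # stand-in for the backtest's per-bar work
    total = 0
    for i in range(100_000):
        total += i % 7
    return total

profiler = cProfile.Profile()
profiler.enable()
hot_loop()
profiler.disable()

# Sort by cumulative time so the functions that dominate the run float
# to the top -- this is how tz_convert showed up in the original harness
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats(5)
print(stream.getvalue())
```

Sorting by cumulative rather than internal time matters here: it surfaces the call chains that own the runtime, not just the leaf functions.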
The second-slowest spot was the lookup into a sorted list of session boundaries, which I had written naively as a linear scan instead of a bisect. That was eating another four minutes per simulation. The third was unnecessary DataFrame slicing to find the previous N bars: I had written it as df.loc[prev_ts:ts], which does an index lookup on every call.
So three independent issues, all rooted in the same mistake: I was doing in the hot loop what should have been done once at the start.
The fix, part one: timezone cache
Instead of converting every bar timestamp on the fly, I precomputed a single column of NY-local timestamps when loading the historical data, and dropped the conversion entirely from the hot loop.
```python
# Before (per-iteration conversion, killing perf)
for ts, bar in bars.iterrows():
    local_ts = ts.tz_convert('America/New_York')
    minute = local_ts.hour * 60 + local_ts.minute
    if NY_OPEN <= minute <= NY_CLOSE:
        ...

# After (one-shot conversion at load, then plain int comparison)
bars['ny_minute'] = (
    bars.index
        .tz_convert('America/New_York')
        .map(lambda ts: ts.hour * 60 + ts.minute)
)
NY_MINUTES = bars['ny_minute'].to_numpy()

# In the hot loop:
for i in range(len(bars)):
    if NY_OPEN <= NY_MINUTES[i] <= NY_CLOSE:
        ...
```
The session check becomes a single integer comparison against a numpy int: zero pandas overhead, zero timezone object allocation, zero tz-name lookup. The precomputation is essentially free; it runs once at the start of the simulation, in under 200 ms for a month of data.
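The same cache-then-compare idea works outside pandas too. A self-contained sketch with only the standard library (the bar times and session bounds here are illustrative, not the real data):

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

NY = ZoneInfo("America/New_York")
NY_OPEN, NY_CLOSE = 9 * 60 + 30, 16 * 60  # 09:30-16:00 as minutes-of-day

# Illustrative bar timestamps: 500 one-minute bars from 14:00 UTC
start = datetime(2024, 1, 2, 14, 0, tzinfo=timezone.utc)
bar_times = [start + timedelta(minutes=i) for i in range(500)]

# One-shot precomputation: minute-of-day in NY time for each bar.
# All timezone work happens here, once, outside the hot loop.
ny_minutes = []
for ts in bar_times:
    local = ts.astimezone(NY)
    ny_minutes.append(local.hour * 60 + local.minute)

# Hot loop: plain integer comparisons, no timezone objects involved
in_session = sum(NY_OPEN <= m <= NY_CLOSE for m in ny_minutes)
print(in_session)
```

The structure is the point: everything timezone-related is hoisted out of the loop, and the loop body touches only ints.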
This change alone took the backtest from 27 minutes down to about 4 minutes. A nice 7× speedup, but I was not done.
The fix, part two: bisect over sorted boundaries
The strategy uses session-relative reference points (NY session open, midnight UTC, last hour of trading, etc.). My naive implementation rebuilt these references for every bar by walking back through the data. The right fix is to precompute boundary timestamps as a sorted array and bisect into them.
```python
import bisect

# Precompute once
ny_session_starts = bars[bars['ny_minute'] == NY_OPEN].index.to_list()

# In the hot loop, find the most recent session start
def session_start_for(ts):
    idx = bisect.bisect_right(ny_session_starts, ts) - 1
    return ny_session_starts[idx] if idx >= 0 else None
```

bisect_right is O(log n), where n is the number of session starts. For 30 days that is around 22 (US500 trading days), so log2(22) is about 4.5 comparisons per lookup, against an average of 11 for the original linear walk. The win per call is modest, but the constant factor is large: bisect is a C-level builtin, while my original loop ran at interpreter speed.
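A runnable demonstration that the bisect lookup returns exactly what the linear walk did (the session-start dates here are synthetic):

```python
import bisect
from datetime import datetime, timedelta

# Synthetic session starts, one per day, already sorted
session_starts = [datetime(2024, 1, 1) + timedelta(days=d) for d in range(22)]

def session_start_linear(ts):
    # The original O(n) walk: scan until we pass ts
    best = None
    for s in session_starts:
        if s <= ts:
            best = s
        else:
            break
    return best

def session_start_bisect(ts):
    # O(log n) with a C-level builtin
    idx = bisect.bisect_right(session_starts, ts) - 1
    return session_starts[idx] if idx >= 0 else None

ts = datetime(2024, 1, 5, 13, 45)
assert session_start_linear(ts) == session_start_bisect(ts)
print(session_start_bisect(ts))  # most recent start at or before ts
```

The only precondition bisect needs is that the list stays sorted, which session boundaries are by construction.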
This brought the backtest down to about 45 seconds. 36× total speedup. Still not done.
The fix, part three: numpy-native bar windows
The strategy needs to evaluate features over rolling windows of recent bars (last 5, last 20, last 60). My original code was doing bars.loc[prev_ts:ts] for each window for each bar, which does an index lookup and returns a DataFrame slice. DataFrame slicing has noticeable per-call overhead in pandas.
The fix was to precompute the entire OHLC data as numpy arrays at load time, and then slice them by integer index in the hot loop:
```python
# Precompute once at load time
OPENS = bars['open'].to_numpy()
HIGHS = bars['high'].to_numpy()
LOWS = bars['low'].to_numpy()
CLOSES = bars['close'].to_numpy()

# In the hot loop (i is the current bar index)
last_20_highs = HIGHS[max(0, i - 20):i]
last_20_lows = LOWS[max(0, i - 20):i]
```
NumPy slicing is O(1) view creation with no copy, while the equivalent pandas slice on a DatetimeIndex allocates intermediate objects on every call. The difference for a single call is small; multiplied by 210k bars across multiple window sizes per bar, it is dramatic.
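You can verify the no-copy claim directly: a basic NumPy slice is a view that shares memory with its parent array. A small sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
HIGHS = rng.random(1000)  # synthetic stand-in for the highs array

i = 500
window = HIGHS[max(0, i - 20):i]  # O(1) view creation, no copy

# A view shares memory with the parent: mutating the parent is
# visible through the slice, proving no data was copied
assert window.base is HIGHS
HIGHS[490] = 123.0
print(window[10])  # index 10 of the view is parent index 490 -> 123.0
```

This is also why the windows are safe to take repeatedly per bar: each one costs a few dozen nanoseconds of view construction, independent of window size.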
This last fix brought the final number to 8.9 seconds. From 27 minutes at the start to 8.9 seconds at the end, the total speedup is 182×, or 184× depending on how you round the original measurement.
What this unlocks
A 184× speedup is not just nice to have. It changes what is possible in strategy research. With a 27-minute baseline, exploring a parameter grid of 20 combinations took 9 hours. You think hard before launching the run, you wait until next morning, you batch experiments carefully. With a 9-second baseline, the same 20-combination grid finishes in 3 minutes. You explore freely, you try ideas that would have been too expensive to test before, you actually see the parameter landscape.
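The grid arithmetic in concrete form, with hypothetical parameter names standing in for the real strategy knobs:

```python
# 20-combination parameter grid; the parameter names and values are
# illustrative, not the actual strategy's
from itertools import product

stop_points = [5, 10, 15, 20]
target_points = [10, 20, 30, 40, 50]

grid = list(product(stop_points, target_points))
print(len(grid))                 # 20 combinations
print(len(grid) * 27)            # minutes at the old 27-min baseline: 540
print(len(grid) * 9)             # seconds at the new ~9-s baseline: 180
```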
For me, the practical consequence has been a faster cycle on the live strategy that runs at tgsignals.com, the production system I run on US500 NY session. Strategy ideas that would have taken a week of backtest babysitting now take an afternoon. That difference compounds.
The general lesson
The bigger pattern here is that Python performance bottlenecks for backtesting almost always live in the same three places: timezone handling, slow lookups inside hot loops, and pandas slicing where numpy slicing would do. None of these are exotic. Any decent Python developer profiling the code would find them. The reason they survive in real codebases is that the first version of a backtest is written to be correct, not fast, and once it is correct nobody bothers to optimize.
Profile your hot loop. Convert timezones once. Bisect into sorted arrays. Use numpy slicing instead of pandas slicing when you can. None of these are hard, and any one of them might give you the 10× that turns "I will run this overnight" into "I will run it now."
The 184× I got was the lucky combination of all three landing on the same codebase. Your mileage will vary, but most backtest harnesses I have seen have at least one of these wins waiting to be picked up.