
NydarTrading

Originally published at nydar.co.uk

Rebuilding Our Prediction Engine From the Ground Up

The Problem Nobody Talks About

There's a dirty secret in trading platforms that use ML: most of them are slow. Not "could be faster" slow — genuinely, painfully slow. The kind of slow where you click a button and wonder if the page crashed.

We had this problem. Our daily picks, market analysis, and morning briefing endpoints were taking 25+ seconds on a cold cache. For a platform that's supposed to help you make timely trading decisions, that's unacceptable. You'd click "Daily Picks" and have time to make a cup of tea before the results appeared. That's not a user experience — that's a test of patience.

The really frustrating part? The actual predictions were good. Our 13,500-model-fit research had produced models with genuine predictive power. The ML side was working. But none of that matters if users give up and close the tab before the results load.

We could have thrown hardware at it. Spun up bigger servers, added GPU instances, horizontally scaled across a cluster. That's the standard playbook — and it's what most platforms do. But that's not engineering — that's spending money to avoid thinking. Bigger hardware doesn't fix a fundamentally inefficient architecture. It just makes the inefficiency more expensive.

We wanted to understand why it was slow and fix the actual problem.

That turned into weeks of work. Multiple failed approaches. A complete rewrite of core systems. What we found was worse than we expected — and the fix required rethinking how our entire prediction pipeline was built from the ground up.


What Our Pipeline Actually Does

Before diving into the optimisation, it's worth understanding the sheer scope of what happens when you load the Daily Picks page. This isn't a database query. There's no pre-computed table of recommendations sitting on a server waiting to be read. Every signal is generated in real-time from live market data, every time you load the page.

For each of the 55+ symbols we track across crypto and equities, the system runs this full analysis pipeline:

  1. Multi-timeframe data fetching — pulls OHLCV (open, high, low, close, volume) candles across four timeframes: 15-minute, 1-hour, 4-hour, and daily. Each timeframe tells a different story about price action, and you need all of them to form a complete picture of where an asset stands.

  2. Technical indicator computation — calculates dozens of mathematical indicators across momentum (RSI, Stochastic RSI, CCI, Williams %R), trend (ADX, EMA crossovers, MACD), volatility (ATR, Bollinger Bands, Keltner Channels), volume (OBV, VWAP, MFI), and price action (candle body ratios, wick analysis, pattern detection). Each indicator requires a full pass over the price series with rolling-window calculations.

  3. ML feature extraction — transforms raw indicators into normalised, model-ready features. This isn't just passing numbers through — features need to be scaled relative to price, normalised across different volatility regimes, and structured to capture the relationships between indicators, not just their raw values.

  4. Sentiment integration — for crypto assets, the system ingests Fear & Greed Index data, computes moving averages, zone classifications, contrarian signals, and divergence metrics. Sentiment features turned out to be among the most predictive in our research.

  5. Market regime detection — classifies the current market environment into one of six states: trending up, trending down, ranging, volatile, breakout, or consolidating. Each regime has different implications for how predictions should be interpreted and which trading strategies are appropriate.

  6. Model loading and inference — loads the pre-trained XGBoost model for each symbol-timeframe combination and generates probability distributions across bullish, bearish, and neutral outcomes. The model was trained using walk-forward validation on 2,000+ candles to ensure it generalises to unseen data.

  7. Meta-labeling — a secondary model that acts as a quality gate. It evaluates whether the primary prediction is reliable enough to act on, based on the current feature state and the primary model's confidence level. Low-quality signals get filtered to neutral, reducing false positives.

  8. Multi-timeframe alignment — analyses whether the 15-minute, 1-hour, 4-hour, and daily timeframes agree on direction. A bullish signal on the 1-hour chart means very little if the daily and 4-hour are screaming bearish. The alignment score quantifies this cross-timeframe consensus and is a key factor in the final confidence ranking.

  9. Confidence scoring and ranking — combines all signals into a final confidence score, applies regime-aware adjustments (predictions in volatile regimes get lower confidence), ranks all symbols, and generates natural-language explanations of why each pick was selected.

That's nine distinct computational stages, each with its own data dependencies and processing requirements, for each of 55+ symbols. A single Daily Picks request triggers well over 400 individual analyses. The morning briefing is even more complex — it adds sector analysis, correlation assessment, and macro context on top of all of the above.

It's a lot of work — and it has to be, because the quality of the output depends on doing all of it thoroughly. Cut any corner and accuracy suffers. We'd proved that in our research.
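To make the shape of the pipeline concrete, here's a minimal sketch of the nine stages as functions threaded through a shared context dict. All names and stage bodies are illustrative stubs (several stages are collapsed together), not our actual implementation:

```python
# Hypothetical sketch of the per-symbol pipeline; stage bodies are stubs.

def fetch_data(ctx):
    ctx["candles"] = {tf: [] for tf in ("15m", "1h", "4h", "1d")}
    return ctx

def compute_indicators(ctx):
    ctx["rsi"] = 55.0          # stand-in for dozens of indicators
    return ctx

def extract_features(ctx):
    ctx["features"] = {"rsi": ctx["rsi"] / 100}
    return ctx

def detect_regime(ctx):
    ctx["regime"] = "ranging"  # one of six regime states
    return ctx

def run_inference(ctx):
    ctx["p_bullish"] = 0.5     # stand-in for the XGBoost model
    return ctx

def score_and_rank(ctx):
    # Regime-aware adjustment: volatile regimes get lower confidence.
    penalty = 0.8 if ctx["regime"] == "volatile" else 1.0
    ctx["confidence"] = ctx["p_bullish"] * penalty
    return ctx

PIPELINE = [fetch_data, compute_indicators, extract_features,
            detect_regime, run_inference, score_and_rank]

def analyse(symbol):
    ctx = {"symbol": symbol}
    for stage in PIPELINE:
        ctx = stage(ctx)
    return ctx

result = analyse("BTC/USDT")
```

The point of the sketch is the data flow: every stage depends on the output of earlier stages, which is exactly why redundant work between stages (as we'll see) is so easy to miss.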


The False Starts

Before we found the right approach, we went down several dead ends. It's worth mentioning these because they represent the obvious solutions that don't work.

"Just cache everything." This was the first suggestion. Cache the predictions for longer, and most requests will be instant. The problem: caching doesn't solve the underlying issue, it hides it. The first user to hit an expired cache still waits 25 seconds. With a 5-minute cache TTL and bursty traffic, that's a lot of users getting a terrible experience. Caching is a complement to fast computation, not a substitute for it.

"Analyse fewer symbols." Why scan 55 symbols when you could scan 20 and be 3x faster? Because the whole point of daily picks is breadth. You want the system to find the best opportunities across a wide universe. Restricting to 20 symbols means potentially missing the best trade of the day because it was symbol #21. Users trust us to do the comprehensive scan — that's the product.

"Pre-filter with a cheap heuristic." Run a quick, rough analysis first, pick the top 10 candidates, then only run the full pipeline on those 10. This sounds clever, and it would be fast. But it's fundamentally flawed. The cheap heuristic has to be cheap for a reason — it uses less data, fewer indicators, simpler logic. Which means it will rank symbols differently than the full pipeline would. The symbols it selects as "top 10 candidates" may not be the same symbols the full pipeline would have ranked highest. You end up with a fast system that gives worse answers. We tried this approach, saw how the candidate lists diverged from the full analysis, and scrapped it.

"Switch to a GPU server." Our models are XGBoost, not deep learning. XGBoost doesn't benefit from GPU acceleration in the way neural networks do (at least not for inference — prediction on a trained model is a series of tree traversals, not matrix multiplications). And the bottleneck wasn't model inference anyway. A GPU would have made no measurable difference.

Each dead end taught us something: the solution had to be architectural, not infrastructural. We couldn't buy or shortcut our way out of this. We had to understand the actual problem.


Phase 1: Profiling — Measuring Before Cutting

With the quick fixes exhausted, we did what we should have done first: measured.

The instinct when something is slow is to guess where the time goes. "It's probably the API calls." "It's probably model training." "Maybe we need a faster server." Everyone had a theory. Every theory felt plausible. And every theory was wrong.

Guessing where performance bottlenecks are is one of the most common mistakes in software engineering. Developers' intuitions about bottlenecks are notoriously unreliable — they're shaped by what feels expensive (network calls, disk I/O, complex algorithms) rather than what is expensive (often something mundane that runs far more often than you'd expect).

We instrumented the entire pipeline with high-resolution timing markers at every stage. Not just endpoint-level timings that tell you the total request took 25 seconds — we measured each individual step within each symbol's analysis, independently, with microsecond precision. Feature extraction time. Regime detection time. Model inference time. Data fetching time. Sentiment data retrieval. Macro feature computation. Model loading from disk. Every single step, timed and logged.

Then we hit the endpoints on production, collected the timing data across hundreds of symbol analyses, and built a complete breakdown of where every millisecond was going.
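The instrumentation itself can be simple. Something along these lines — a context manager around each stage, accumulating wall-clock times per stage name (the harness below is a sketch, not our exact code):

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulates wall-clock durations per pipeline stage.
stage_times = defaultdict(list)

@contextmanager
def timed(stage: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_times[stage].append(time.perf_counter() - start)

# Usage inside a symbol's analysis:
with timed("feature_extraction"):
    sum(i * i for i in range(10_000))  # stand-in for the real work

# Aggregate into a per-stage breakdown after the run.
breakdown = {stage: sum(times) for stage, times in stage_times.items()}
```

`time.perf_counter()` gives sub-microsecond resolution, which matters when some stages (model inference) turn out to take low single-digit milliseconds.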

What We Expected

Our assumptions going in:

  • Network I/O would dominate — fetching live data from exchanges over the internet
  • Model training would be expensive — fitting XGBoost on 2,000+ candles with feature selection
  • Database/disk operations — loading cached models, reading configuration, writing prediction logs

What We Actually Found

| Stage | Time per symbol | % of total |
| --- | --- | --- |
| Feature extraction | ~1.3s | 47% |
| Regime detection | ~1.4s | 51% |
| Fear & Greed fetch | ~0.03s | 1% |
| Model loading | ~0.01s | <1% |
| XGBoost inference | ~0.002s | <1% |
| Meta-labeling | ~0.001s | <1% |
| Everything else | ~0.02s | <1% |

The API calls were fast. Model inference was essentially instantaneous — XGBoost predictions take single-digit milliseconds once the model is loaded. Network I/O was not the bottleneck. Disk access was not the bottleneck. Model training only happens once per symbol, so it's amortised to near-zero on most requests.

98% of the time was spent in two systems: feature extraction and regime detection. Specifically, the heavy numerical computation that turns raw price data into mathematical indicators — ATR, Bollinger Bands, ADX, EMA crossovers, RSI, MACD, CCI, Stochastic RSI. Each of these requires a full rolling-window calculation over thousands of data points, involving standard deviations, exponential moving averages, and statistical aggregations.

And then came the real insight — the one that made us want to put our heads through the desk.

We were computing many of the same indicators multiple times per symbol. Feature extraction computed RSI, ATR, Bollinger Bands, EMA20, EMA50, ADX. Then regime detection computed RSI, ATR, Bollinger Bands, EMA20, EMA50, ADX again, completely from scratch, because it was a separate system that didn't know what feature extraction had already done.

These weren't different concerns that happened to share names. They were the exact same mathematical operations on the exact same data, running independently in two different parts of the codebase. Six expensive indicator computations, duplicated, for every single symbol, on every single request.

For 30 crypto symbols, that's 180 redundant indicator computations per request. Each one taking 50-200ms. The duplication alone accounted for nearly 15 seconds of wasted computation.

The irony is that the code was well-structured. Feature extraction was a clean, self-contained module. Regime detection was a clean, self-contained module. Each had a clear API, clear responsibilities, clear separation of concerns. By every software engineering principle, the architecture was correct. But the performance characteristics were catastrophic, precisely because those clean boundaries prevented the systems from sharing work.


Phase 2: Data Layer Redesign

With profiling data in hand, we attacked the problem layer by layer, starting at the top: data fetching.

The original implementation fetched data serially by symbol, then serially by timeframe within each symbol. Symbol 1 would request 15-minute candles, wait for the response, request 1-hour candles, wait, request 4-hour, wait, request daily, wait. Then symbol 2 would do the same. For 55 symbols across 4 timeframes, that's 220 sequential API calls — even though every single one is independent and could happen simultaneously.

On top of that, different parts of the pipeline were fetching the same data independently. The prediction service would fetch 1-hour candles for BTC/USDT. Then the multi-timeframe analyser would fetch 1-hour candles for BTC/USDT again, plus the other three timeframes. Two separate round trips to the exchange for the same data.

We redesigned the data layer to prefetch everything in one coordinated burst. All 55 symbols, all 4 timeframes, fired concurrently in a single asynchronous sweep. The exchange has rate limits, so we implemented intelligent batching that maximises throughput without triggering throttling — firing requests in waves, respecting per-second limits, retrying failures automatically.

The results come back as a unified data cache — a dictionary mapping each symbol-timeframe combination to its OHLCV array. Every downstream system pulls from this cache instead of making its own API calls. Zero duplicate fetches. Zero sequential waiting.
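The prefetch pattern looks roughly like this: every symbol-timeframe pair fired concurrently through `asyncio.gather`, throttled by a semaphore so the burst stays under exchange rate limits. `fetch_ohlcv` here is a stand-in for the real exchange client, and the concurrency limit is illustrative:

```python
import asyncio

async def fetch_ohlcv(symbol: str, timeframe: str) -> list:
    # Stand-in for a real exchange API call.
    await asyncio.sleep(0.01)  # simulated network latency
    return [symbol, timeframe]  # stand-in for an OHLCV array

async def prefetch_all(symbols, timeframes, max_concurrent=10):
    # Semaphore caps in-flight requests to respect rate limits.
    sem = asyncio.Semaphore(max_concurrent)

    async def one(symbol, tf):
        async with sem:
            return (symbol, tf), await fetch_ohlcv(symbol, tf)

    tasks = [one(s, tf) for s in symbols for tf in timeframes]
    results = await asyncio.gather(*tasks)
    # Unified data cache: every downstream system reads from this dict.
    return dict(results)

cache = asyncio.run(
    prefetch_all(["BTC/USDT", "ETH/USDT"], ["15m", "1h", "4h", "1d"])
)
```

With serial fetching, 8 calls at 10ms each would take ~80ms; fired concurrently they complete in roughly one round-trip. At 220 real calls with real latency, the difference is the bulk of the 25-to-11-second improvement.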

This required a fundamental change in how data flows through the system. Previously, each analysis function was self-contained: it accepted a symbol name, fetched its own data, and returned a result. That's a clean API, but it means every function independently decides when and how to fetch data, with no coordination.

The new design separates data acquisition from data processing. First, fetch everything. Then, process everything. The analysis functions now accept pre-fetched data as input rather than fetching their own. This is a less elegant API — the caller has to know about the data cache — but the performance difference is dramatic.

The multi-timeframe analyser, which previously made 4 separate API calls per symbol (one per timeframe), now makes zero. It receives all the data it needs as a parameter. This single change eliminated hundreds of redundant network calls per request.

Data-layer redesign cut total endpoint time from 25 seconds to about 11 seconds. A meaningful improvement, but still far from acceptable — because the computational bottleneck was still untouched.


Phase 3: Concurrency Model Overhaul

With network I/O largely solved, the profiling data pointed squarely at CPU-bound computation. Our indicator calculations aren't waiting on anything external — they're doing intensive numerical work on arrays of thousands of data points. Rolling standard deviations across 2,000-candle windows. Exponential moving averages with dynamic weighting. Bollinger Band calculations requiring nested standard deviations. All of it pure mathematics.

Python's Global Interpreter Lock (GIL) is a well-known limitation for CPU-bound work. In standard Python, only one thread can execute Python bytecode at a time. Threading gives you concurrency for I/O, but not parallelism for computation. This is why many teams reach for multiprocessing instead — separate processes, each with their own GIL.

But our situation had a subtlety that changes the equation. Our indicator calculations don't run in Python. They run in C. NumPy, pandas, and the pandas_ta library all execute their heavy computation in compiled C code that releases the GIL during execution. When pandas computes a rolling standard deviation, the GIL is released for the duration of that C-level computation, allowing other threads to genuinely run in parallel on separate CPU cores.

This means that with the right threading model, we can get genuine CPU parallelism without the overhead and complexity of multiprocessing. Each symbol's indicator computation runs in its own thread, and because the C libraries release the GIL, multiple symbols are genuinely computed simultaneously.

We restructured the prediction pipeline to offload all CPU-heavy indicator work to a managed thread pool using asyncio.to_thread(). The async orchestration layer fires off 30 symbol analyses concurrently. Each one runs in the thread pool, computes its indicators in C (releasing the GIL), and returns the results. The async layer collects the results as they complete.

Getting this right required more than just wrapping functions in to_thread(). Several challenges emerged:

Thread safety. The original code had shared mutable state in places we didn't expect. Singleton instances caching intermediate results. Class-level attributes being modified during computation. All of this had to be audited and either eliminated or made thread-safe. In a serial pipeline, shared state is convenient. In a concurrent pipeline, it's a race condition waiting to happen.

Error isolation. When 30 analyses run concurrently and 3 of them fail (bad data, missing candles, exchange errors), the failures can't be allowed to poison the 27 successful results. We used the return_exceptions=True pattern: exceptions are collected alongside results, failed symbols are logged and skipped, and the user still gets picks from the symbols that succeeded.
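In sketch form (function names are illustrative): gather with `return_exceptions=True`, then partition outcomes so a handful of failed symbols never sink the batch:

```python
import asyncio

async def analyse(symbol: str) -> dict:
    # Stand-in for the full per-symbol analysis; one symbol fails.
    if symbol == "BAD/USDT":
        raise ValueError("missing candles")
    return {"symbol": symbol, "confidence": 0.7}

async def analyse_all(symbols):
    outcomes = await asyncio.gather(
        *(analyse(s) for s in symbols),
        return_exceptions=True,  # exceptions come back as values
    )
    picks, failures = [], []
    for symbol, outcome in zip(symbols, outcomes):
        if isinstance(outcome, Exception):
            failures.append((symbol, outcome))  # logged and skipped in practice
        else:
            picks.append(outcome)
    return picks, failures

picks, failures = asyncio.run(
    analyse_all(["BTC/USDT", "BAD/USDT", "ETH/USDT"])
)
```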

Thread pool tuning. Python's default thread pool size is min(32, os.cpu_count() + 4), which on our 2-core VPS gives us 6 worker threads. Too few threads and you're not saturating the CPU. Too many and context-switching overhead eats into computation time. We tested various pool sizes and found the default was reasonable for our hardware, but the profiling revealed something interesting: later symbols in the batch showed inflated wall-clock times not because they were individually slower, but because they were queued behind earlier symbols in the thread pool. Understanding thread pool queuing behaviour was essential for interpreting our timing data correctly.

Threading brought total endpoint time from 11 seconds down to about 7.5 seconds. Better, but still not where we needed to be. The per-symbol computation time was still ~2.8 seconds — we were just running more of them simultaneously.


Phase 4: The Unified Computation Engine

This was the big one. The single change that delivered the most dramatic improvement, and the one that required the deepest understanding of the codebase.

Remember the profiling results: feature extraction (47% of per-symbol time) and regime detection (51%) were each independently computing the same expensive indicators. The same RSI. The same ATR. The same Bollinger Bands. The same EMA20 and EMA50. The same ADX with plus/minus directional indicators. Computed twice, independently, from the same raw data, producing the same numbers, in two separate code paths that had no knowledge of each other.

The naive fix would be to add a per-symbol indicator cache. Compute RSI once, store the result keyed by symbol, look it up when needed again. But this adds complexity in all the wrong places. Cache invalidation logic. Memory management. Thread-safe cache access. Cache lifetime management. All for something that should be structurally impossible — we shouldn't need a cache to avoid computing the same thing twice in the same request.

Our approach was more fundamental: redesign the feature extraction system to produce everything downstream systems need in a single computational pass. One set of indicator calculations serving both the ML feature pipeline and the regime detection system. Compute once, use everywhere.

This sounds straightforward. It was anything but. The two systems expected data in completely different formats:

The ML feature system needs normalised time-series features. RSI as a full column in a DataFrame spanning the entire lookback window. ATR as a percentage of price, computed at every candle. Bollinger Band width as a normalised percentage, also across the full window. These are used as input to the XGBoost model, which needs a consistent row-per-candle structure.

The regime detection system needs point-in-time scalar values. What is the current ADX right now? Is price above or below EMA20 at this moment? What's the ATR ratio compared to the 50-period average today? What's the BB squeeze ratio right now? These are used for rule-based classification — a series of threshold comparisons that determine whether the market is trending, ranging, or volatile.

Same underlying indicators. Completely different representations. The RSI computation is identical in both cases — it's ta.rsi(close, length=14). But feature extraction wants the entire resulting Series (1,900+ values) while regime detection wants just the last value as a float.

We built a unified computation layer that computes each indicator once and produces both representations simultaneously. The core function makes 8 pandas_ta calls (instead of the previous 15+ for features alone, plus 6 more for regime). From those 8 calls, it constructs:

  1. A features DataFrame with all columns the ML model needs — normalised, properly indexed, ready for prediction
  2. A regime metrics dictionary with all scalar values the regime detector needs — pre-extracted from the same indicator series

The regime detector was then modified to accept pre-computed metrics as an optional parameter. When metrics are provided, it skips its entire _calculate_metrics() method — all 6 pandas_ta calls, the DataFrame construction, everything — and jumps straight to classification logic. Classification is pure conditional checks: if ADX > 25 and trend direction is positive, it's trending up. If ATR ratio > 2.0, it's volatile. These comparisons take microseconds.

The result: regime detection went from ~1.4 seconds per symbol to effectively zero. Not "fast" — zero. The computation isn't cached or deferred. It simply doesn't happen, because someone else already did the work and passed the results forward.


Phase 5: Intelligent Feature Pruning

Our 13,500-model-fit research had already told us exactly which features matter for prediction accuracy. Out of the 60+ features our original extraction pipeline computed, only 15 had meaningful predictive power for crypto, and just 6 for equities. The other 45+ features were computed, assembled into a DataFrame, then immediately discarded by the feature selection step that runs before model inference.

Think about what that means. We were computing Stochastic RSI, Keltner Channels, Williams %R, OBV slope, volume trend, upper/lower wick ratios, inside bars, outside bars, doji patterns, engulfing patterns, hour-of-day cyclical encoding — all of it mathematically intensive, all of it computed for every single symbol, all of it thrown away before the model ever sees it. The feature selection step was working correctly — it kept only the proven features — but the extraction step upstream was doing massive amounts of work for no reason.

The original design made sense when we built it. We didn't know which features would matter. We computed everything, tested everything, and let the research tell us what worked. That's good science. But once the research was done and the winners were identified, continuing to compute the losers on every request was pure waste.

We rebuilt the extraction pipeline to compute only what the model actually uses. This required mapping the full dependency tree from the 15 model features back to their underlying pandas_ta indicator calls:

  • atr_pct and volatility_ratio both derive from ta.atr() — one call, two features
  • bb_width derives from ta.bbands() — one call, one feature
  • macd and macd_hist both derive from ta.macd() — one call, two features
  • rsi derives from ta.rsi() — one call, one feature
  • price_vs_ema50 derives from ta.ema(length=50) — one call, one feature
  • cci derives from ta.cci() — one call, one feature

For crypto, that's 6 unique pandas_ta calls to produce the 15 model features (plus ta.ema(length=20) and ta.adx() for regime metrics — 8 total). For equities, it's even fewer since only 6 features matter.

The remaining features — vwap_diff, volume_ratio, price_change_5, roc_5/10/20, return_std_10/20, close lags, sentiment features — are computed with plain pandas operations (rolling means, percentage changes, shift operations) that are extremely fast compared to full pandas_ta indicator calculations.

8 pandas_ta indicator calls instead of 15+. Each one still involves a full rolling-window calculation over the price series, but nearly halving the count translates directly to nearly halving the per-symbol feature extraction time.
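The dependency map from model features back to unique indicator calls can be expressed as plain data — roughly like this (names mirror the list above; the structure is a sketch of the idea, not our actual configuration):

```python
# Maps each unique pandas_ta call to the model features it produces
# (crypto feature set, per the dependency list above).
FEATURE_SOURCES = {
    "ta.atr":            ["atr_pct", "volatility_ratio"],
    "ta.bbands":         ["bb_width"],
    "ta.macd":           ["macd", "macd_hist"],
    "ta.rsi":            ["rsi"],
    "ta.ema(length=50)": ["price_vs_ema50"],
    "ta.cci":            ["cci"],
}

# Two further calls are needed only for regime metrics.
REGIME_ONLY = ["ta.ema(length=20)", "ta.adx"]

unique_calls = len(FEATURE_SOURCES) + len(REGIME_ONLY)  # 8 total
```

Encoding the mapping explicitly also makes the pruning auditable: if a future research run promotes a new feature, its indicator call gets added here rather than resurrecting the compute-everything pipeline.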


Phase 6: Pipeline Reordering and Macro Feature Elimination

Two smaller but meaningful optimisations completed the picture.

Operation reordering

The original pipeline ran every analysis stage in a fixed order regardless of intermediate results. Regime detection ran after multi-timeframe analysis, even though regime information is useful for prioritising symbols earlier in the pipeline. A symbol in a clearly consolidating regime (low ADX, tight range, low volume) is unlikely to produce a high-confidence directional pick — but the system wouldn't know that until after it had already run the most expensive analysis stages.

We restructured the pipeline to put regime detection first. Since regime detection now uses pre-computed metrics and completes in microseconds, there's no cost to running it earlier. The regime information is then available to downstream stages for smarter scheduling and prioritisation.

Macro feature elimination

Our profiling also revealed that we were computing macro-economic features (FRED data, market-wide indicators) and injecting them into the feature DataFrame — only for the feature selection step to immediately discard them. None of the macro features made it into the top 15. They were computed, fetched from external APIs, aligned to the candle index, concatenated into the DataFrame, and then removed by the feature selection step on the very next line.

We removed the macro feature computation from the prediction path entirely. The training pipeline still computes them (since future research might identify valuable macro features), but the hot prediction path no longer wastes time on features that don't contribute to the final output.


The Results

After all six phases of work, here's where we ended up:

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Daily Picks (30 crypto) | 25s | ~2s | 12x faster |
| Morning Briefing | 25s | ~2s | 12x faster |
| Market Analysis | 18s | ~1.5s | 12x faster |
| Per-symbol prediction | 2.8s | <0.1s | 28x faster |
| Cached responses | 0.3s | 0.3s | (already fast) |

That's not a typo. Per-symbol prediction went from 2.8 seconds to under 100 milliseconds — a 28x improvement. The endpoint-level gains are somewhat less dramatic because network I/O for fetching live market data still takes real time (you can't make the speed of light faster), but the computational work that we control is now a tiny fraction of overall request time.

To put it in perspective: the time spent on feature extraction and regime detection went from 2.7 seconds per symbol to roughly 0.4 seconds. The 0.4s that remains is genuine, non-duplicated computation — the 8 pandas_ta calls that produce features the model actually uses. There's no more fat to cut without reducing analysis quality.

Where the time goes now

| Stage | Before | After |
| --- | --- | --- |
| Feature extraction | 1.3s | ~0.4s |
| Regime detection | 1.4s | ~0s |
| Model inference | 0.002s | 0.002s |
| Everything else | 0.06s | 0.06s |
| Total per symbol | 2.8s | ~0.5s |

Regime detection is now effectively free — pre-computed metrics mean it does zero indicator work. Feature extraction is ~70% faster thanks to computing only the indicators that matter. Everything else was already fast and remains fast.


What We Refused to Do

Throughout this process, we had plenty of opportunities to take shortcuts. We rejected all of them.

We didn't reduce the number of symbols. Analysing 20 symbols instead of 55 would have been faster, but it would have given users a worse product. The whole point of daily picks is to scan a broad universe and surface the best opportunities. The best trade of the day might be in a mid-cap altcoin that a restricted scan would have excluded entirely.

We didn't cut analysis depth. Every symbol still gets the full treatment: multi-timeframe analysis across four timeframes, regime detection with six regime classifications, ML prediction with confidence scoring, meta-labeling to filter low-quality signals, multi-timeframe alignment scoring, and natural-language explanation generation. The analysis is identical in quality — it's just faster.

We didn't add caching as a primary strategy. Yes, we have intelligent caching — predictions are cached for 5 minutes since market conditions don't change that fast. But caching was already in place before these optimisations. It was never the solution to the underlying problem, and we refused to lean on it as one. The first user to hit an expired cache should get a fast response too.

We didn't switch to a bigger server. We're still running on the same modest VPS we've always used. Same CPU, same RAM, same everything. The improvements came entirely from doing less redundant work and doing necessary work more efficiently. This is an important distinction — hardware scaling has diminishing returns and ongoing costs, but algorithmic improvements are permanent and free.

We didn't sacrifice accuracy. At several points, we considered heuristic approaches — approximating indicator values, reducing lookback windows, using fewer candles for computation. Every one of those shortcuts would have degraded prediction quality. Our research showed that accuracy depends on specific features computed over specific windows with specific lookback periods. The features that survived 13,500+ model fits survived because they capture genuine market dynamics. Computing them incorrectly defeats the purpose.


Lessons Learned

This project reinforced several principles that we'll carry forward into every future engineering decision.

Profile before you optimise. Always. This is the oldest advice in software engineering, and it's the most frequently ignored — including by us, initially. Our intuitions about where the time was going were completely wrong. We assumed network I/O and model inference. The actual bottleneck was duplicated numerical computation. Without microsecond-level profiling of every pipeline stage, we would have spent weeks optimising the wrong systems and wondering why nothing improved.

Architectural boundaries hide redundancy. Clean separation of concerns is generally good engineering. Modular systems are easier to understand, test, and maintain. But when separate modules independently compute the same expensive operations on the same data, the boundaries themselves become a performance problem. The feature extraction module and the regime detection module were both well-designed, well-tested, self-contained units. They were also doing 50% redundant work between them, invisible from either side of the boundary.

Sometimes you need to break clean abstractions to eliminate waste. The unified computation engine is architecturally messier than two independent modules — there's now a data dependency flowing from feature extraction to regime detection that didn't exist before. But it's 28x faster. That's a trade-off we'll take every time.

Python can be fast enough. There's a persistent belief in the quantitative finance community that Python is too slow for real-time computation. That anything performance-sensitive needs to be in C++, Rust, or at minimum Java. This is wrong — or at least, it's wrong when you understand where Python's actual computation happens.

Our indicator calculations don't run in Python. They run in C, via NumPy and pandas. Python is the orchestration layer — it decides what to compute and how to combine results. C does the actual number-crunching. With proper threading (exploiting GIL release during C-level computation), we get genuine multi-core parallelism from a Python codebase. The language isn't the bottleneck. The architecture is.

The best optimisation is doing less work. We didn't make any single computation faster. We didn't find a faster algorithm for RSI or a more efficient Bollinger Band implementation. We didn't rewrite anything in C or Cython. We just stopped computing things we didn't need (45+ unused features), and stopped computing things we did need more than once (6 duplicated indicators). The fastest code is the code that doesn't run.

Research pays compound interest. The 13,500-model-fit experiment we ran weeks ago had an unexpected second payoff. It wasn't just about finding which features predict prices — it also told us which features don't. That negative knowledge turned out to be just as valuable for performance as the positive knowledge was for accuracy. Without knowing which features to cut, the intelligent feature pruning phase wouldn't have been possible.


What's Next

Performance work is never truly done. Now that the computational pipeline is lean, we're looking at the next tier of improvements:

  • Incremental indicator updates — when new candles arrive, recompute only the affected tail of each indicator rather than recalculating the entire series from scratch. A new 1-hour candle shouldn't require recalculating RSI over the entire 2,000-candle window.

  • Adaptive cache invalidation — instead of fixed 5-minute cache TTL, invalidate predictions based on how much the underlying price has moved. If BTC moves 3% in 2 minutes, the prediction is stale. If it moves 0.1% in 10 minutes, it's probably still valid.

  • Warm-start model updates — incrementally update existing XGBoost models with new data rather than retraining from scratch. This would eliminate the occasional cold-start latency when a model expires and needs retraining.

  • WebSocket-pushed predictions — rather than waiting for users to request predictions, push updated signals proactively as market conditions change. The pipeline is now fast enough that this becomes feasible.
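The adaptive-invalidation idea, for instance, reduces to a small staleness check — something like the sketch below, where the thresholds are illustrative and the function name is hypothetical:

```python
def is_stale(cached_at: float, cached_price: float,
             now: float, current_price: float,
             max_age_s: float = 300, max_move_pct: float = 1.0) -> bool:
    # Hard TTL still applies as a backstop.
    if now - cached_at > max_age_s:
        return True
    # Invalidate early if price has moved more than the threshold
    # since the prediction was generated.
    move_pct = abs(current_price - cached_price) / cached_price * 100
    return move_pct > max_move_pct
```

A 3% BTC move two minutes after caching would invalidate immediately; a 0.1% drift would let the prediction live out its TTL.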

But those are future projects. For now, the daily picks load in 2 seconds, the morning briefing is ready when you are, and you don't need to wonder if the page crashed.

That's what good engineering should feel like — you shouldn't notice it at all.


Want to see the faster pipeline in action? Check out the Daily Picks widget or start paper trading to get ML-powered signals on 55+ assets.


Originally published at Nydar. Nydar is a free trading platform with AI-powered signals and analysis.
