Sean Gordon

Building a Real-Time Crypto Volatility Surface System

See the PoC/MVP here:

https://dashboard.derivasys.com

Over the past few months I’ve been building a real-time crypto options volatility surface system from scratch.

At a high level, the idea sounds fairly straightforward:

- ingest live options market data
- compute implied vols
- fit an SVI surface (see the sketch below)
- stream smiles, skews, risk reversals, butterflies, and surface diagnostics to a frontend

Simple enough on paper.
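
For reference, the per-expiry slice is an SVI fit. Here's a minimal sketch of the standard raw-SVI total-variance parameterisation (Gatheral's form); the parameter values are purely illustrative, not anything from the production calibration:

```python
import numpy as np

def svi_total_variance(k, a, b, rho, m, sigma):
    """Raw-SVI total implied variance for log-moneyness k:
    w(k) = a + b * (rho * (k - m) + sqrt((k - m)^2 + sigma^2)).
    Implied vol at expiry T is then sqrt(w(k) / T)."""
    k = np.asarray(k)
    return a + b * (rho * (k - m) + np.sqrt((k - m) ** 2 + sigma ** 2))

# Example: one smile slice on a log-moneyness grid (parameters illustrative)
k = np.linspace(-0.5, 0.5, 11)
w = svi_total_variance(k, a=0.02, b=0.4, rho=-0.3, m=0.0, sigma=0.2)
iv = np.sqrt(w / 0.25)   # implied vols for a 3-month expiry
```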

In practice, almost none of the complexity ended up being the fit itself.

The interesting part was everything around it.

The First Version

The original system connected to a single exchange and maintained a live surface reasonably well.

It:

- consumed websocket market data
- tracked order book state
- recalculated implied vols
- maintained smile state per expiry
- periodically recalibrated an arbitrage-aware SVI surface
- streamed live updates to a frontend dashboard

It all ran on a medium EC2 instance.

Not perfectly, but well enough that it felt like the architecture was fundamentally sound.

Then I added a second exchange.

That’s when things started breaking.

The Problem Looked Like Websocket Reliability
Initially it appeared to be a feed stability issue.

Connections started dropping more frequently.
Heartbeat handling became unreliable.
Reconnects became messy.
The UI became inconsistent and occasionally stale.

At first glance, it looked like a networking problem caused by higher message throughput.

But after enough profiling and observation, it became obvious that the websocket layer itself wasn’t really the issue.

The system had quietly transitioned from being I/O-bound to CPU-bound.

And once that happened, everything downstream started to collapse.
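
One cheap way to see that transition, assuming an asyncio-style single-process event loop (which is how I'd sketch it here), is to schedule a timer and measure how late it fires. When downstream compute is starving the loop, the lag grows even though the network is healthy:

```python
import asyncio
import time

async def monitor_event_loop_lag(interval: float = 0.5, warn_ms: float = 100.0):
    """Sleep for `interval` seconds and measure how late we actually wake up.

    Growing lag means the loop is CPU-starved by downstream work --
    which looks, from the outside, exactly like 'flaky websockets'."""
    while True:
        start = time.perf_counter()
        await asyncio.sleep(interval)
        lag_ms = (time.perf_counter() - start - interval) * 1000
        if lag_ms > warn_ms:
            print(f"event loop lag: {lag_ms:.1f} ms")  # swap for real metrics/alerting
```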

Why Adding One More Venue Changed Everything
The important lesson was that adding another exchange didn’t just mean “more messages”.

It multiplied the amount of work triggered by those messages.

Every additional venue increased:

- order book aggregation work
- implied volatility recalculations
- ATM updates
- Greeks recalculation
- smile updates
- SVI fit preparation
- arbitrage validation
- surface patch generation
- frontend websocket broadcasts
- persistence writes
- lifecycle logging

The feeds themselves weren't failing.

The event loop simply stopped keeping up with the amount of compute happening behind the scenes.

Once CPU became saturated:

- heartbeat responses became delayed
- reconnect handling degraded
- market data became stale
- websocket queues backed up
- frontend latency increased

The visible symptom looked like a websocket problem.

The real problem was compute starvation.

That distinction ended up being one of the most valuable lessons of the entire project.

In real-time market data systems, reliability failures are often not caused by the connection layer itself.

They’re caused by downstream computation stealing enough time that the connection layer can no longer behave reliably.

There are ~4,000 instruments (with book depth) for BTC, so a single spot move means 40k+ vols have to be recalculated… no wonder we were getting disconnects.
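
To make the per-vol cost concrete: in the naive path, each of those recalculations is a bracketed root-find over a pricing function, something like the sketch below (Black-76 on the forward, used here purely for illustration; the production pricer and crypto quoting conventions may differ):

```python
from math import exp, log, sqrt
from scipy.optimize import brentq
from scipy.stats import norm

def black76_call(F, K, T, r, vol):
    """Black-76 call price on a forward F."""
    d1 = (log(F / K) + 0.5 * vol * vol * T) / (vol * sqrt(T))
    d2 = d1 - vol * sqrt(T)
    return exp(-r * T) * (F * norm.cdf(d1) - K * norm.cdf(d2))

def implied_vol(price, F, K, T, r=0.0):
    """Naive implied vol: one bracketed root-find per quote.

    Cheap individually, but running this 40k+ times on every spot tick
    is exactly the kind of work that starves the event loop."""
    return brentq(lambda vol: black76_call(F, K, T, r, vol) - price, 1e-4, 5.0)
```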

The Wrong Solution Would Have Been “Rewrite It”
At that point, my first instinct was probably the same one many people would have:

rewrite the hot paths in Rust or C++.

And to be fair, there are absolutely parts of the system where lower-level languages would help.

But before doing that, I wanted to understand how much performance was actually being lost to architecture rather than raw language overhead.

It turned out: a lot.

The deeper issue was that the system was doing far too much unnecessary work.

The Optimisation Phase
The next stage of the project became much less about “making code faster” and much more about reducing how much work happened per update.

Ignoring Small Moves
One of the biggest wins was realising that not every underlying move deserved a full recomputation chain.

Very small spot movements often had negligible impact on the displayed surface.

So instead of immediately recalculating everything, the system started ignoring extremely small moves or routing them through approximation paths.

That alone removed a huge amount of unnecessary churn.
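
A minimal sketch of the gating idea (the threshold and the `full_recompute` helper are illustrative stand-ins, not the production values or code path):

```python
REL_SPOT_THRESHOLD = 0.0005   # illustrative: ignore moves smaller than 5 bps

last_processed_spot = None

def full_recompute(spot: float) -> None:
    """Hypothetical stand-in for the full chain: IVs, greeks, smiles, SVI prep, broadcasts."""
    ...

def on_spot_update(spot: float) -> None:
    global last_processed_spot
    if last_processed_spot is not None:
        rel_move = abs(spot - last_processed_spot) / last_processed_spot
        if rel_move < REL_SPOT_THRESHOLD:
            return             # too small to matter for the displayed surface
    last_processed_spot = spot
    full_recompute(spot)
```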

Approximation Paths
For small spot changes, it became possible to approximate updates instead of running full implied vol recalculations and downstream refreshes.

The key insight was that perfect precision on every tick was less important than maintaining realtime system behaviour overall.

The system became much healthier once it stopped trying to fully recompute the world on every tiny movement.
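
One common flavour of approximation path (not necessarily the exact one used here) is a first-order update: if the quoted price hasn't changed but spot has moved slightly, the implied vol consistent with that quote shifts by roughly -delta * dS / vega, which avoids re-running the root-finder on every tiny tick. A sketch using plain Black-Scholes greeks for illustration:

```python
from math import log, sqrt
from scipy.stats import norm

def bs_delta_vega(S, K, T, r, vol):
    """Black-Scholes call delta and vega."""
    d1 = (log(S / K) + (r + 0.5 * vol * vol) * T) / (vol * sqrt(T))
    return norm.cdf(d1), S * norm.pdf(d1) * sqrt(T)

def approx_iv_after_small_move(iv, S, dS, K, T, r=0.0):
    """First-order IV update for a small spot move, holding the quote price fixed.

    dC ≈ delta * dS + vega * d_iv = 0  =>  d_iv ≈ -delta * dS / vega"""
    delta, vega = bs_delta_vega(S, K, T, r, iv)
    return iv - delta * dS / vega
```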

Batching Updates
Another large improvement came from batching work together rather than reacting independently to every incoming message.

Instead of:

- message arrives
- recompute
- publish
- repeat thousands of times

the system began accumulating updates and processing them in controlled batches.

This dramatically reduced scheduler pressure and duplicate recomputation.
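
A minimal sketch of the batching idea, assuming an asyncio queue feeding a windowed worker (`process_batch` is a hypothetical stand-in for the recompute-and-publish step):

```python
import asyncio

def process_batch(latest: dict) -> None:
    """Hypothetical stand-in: recompute vols/smiles once per coalesced batch, then publish."""
    ...

async def batch_worker(queue: asyncio.Queue, window: float = 0.05) -> None:
    """Drain the queue on a fixed window and coalesce updates per instrument,
    so N messages for the same strike collapse into a single recompute."""
    while True:
        await asyncio.sleep(window)
        latest: dict = {}
        while not queue.empty():
            update = queue.get_nowait()
            latest[update["instrument"]] = update   # keep only the newest update per instrument
        if latest:
            process_batch(latest)
```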

Separating Fitting From Display
Originally, too much of the system shared the same hot path.

Eventually the architecture started separating:

- ingestion
- fitting
- persistence
- frontend broadcasting

because those components have very different latency and throughput requirements.

Realtime display updates do not necessarily require the exact same cadence as surface fitting.

That separation became extremely important.
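
A minimal sketch of what different cadences look like once the hot paths are split, assuming independent asyncio tasks (the intervals and the `state` methods are illustrative, not the production numbers or API):

```python
import asyncio

async def broadcast_loop(state):
    """Cheap and frequent: push the current smiles/surface snapshot to the UI."""
    while True:
        await asyncio.sleep(0.25)
        await state.broadcast_snapshot()

async def fit_loop(state):
    """Expensive and infrequent: full arbitrage-aware SVI recalibration."""
    while True:
        await asyncio.sleep(5.0)
        await state.refit_surface()

async def persist_loop(state):
    """Slowest cadence: durable writes of system state."""
    while True:
        await asyncio.sleep(30.0)
        await state.persist()

async def main(state):
    await asyncio.gather(broadcast_loop(state), fit_loop(state), persist_loop(state))
```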

Where It Ended Up
After enough optimisation work, the system eventually stabilised at roughly:

~5,000 market data messages per second

while still:

- maintaining live smile state
- updating the surface in realtime
- broadcasting frontend updates
- persisting system state

Just.

And honestly, the “just” is probably the important part.

Because the interesting thing about realtime systems is that the bottleneck is rarely where you initially expect it to be.

You start by thinking about websocket throughput.

Then eventually you’re thinking about:

- event loop starvation
- scheduler pressure
- recomputation graphs
- batching windows
- cache invalidation
- state propagation
- downstream fan-out costs
- whether a 0.05% move is even worth processing immediately

At some point the project stopped feeling like a quant modelling exercise and started feeling much more like distributed systems engineering.

What Comes Next
The current architecture still wouldn’t scale cleanly forever.

It would struggle with:

- any more exchanges
- any more currencies
- substantially higher throughput

The next stage will involve a more distributed ingestion and processing model using Kafka or Redpanda-style fan-out.

The direction now looks more like:

- independent ingestion services
- distributed state management
- asynchronous fit workers
- decoupled persistence
- scalable broadcast infrastructure

rather than one large realtime process trying to do everything.
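
One possible topic layout for that fan-out model, as a sketch of the direction rather than a settled design (topic names and partition keys here are hypothetical):

```python
# Hypothetical Kafka/Redpanda topic layout -- illustrative only.
TOPICS = {
    "md.options.raw":     {"partition_by": "venue+instrument"},  # per-venue ingestion services produce here
    "vols.implied":       {"partition_by": "currency+expiry"},   # IV workers consume raw data, emit vols
    "surface.fits":       {"partition_by": "currency"},          # async fit workers publish calibrated SVI params
    "frontend.broadcast": {"partition_by": "currency"},          # fan-out service pushes to websocket clients
}
```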

But that evolution is part of what has made the project so interesting.

The quant model matters.

But the systems engineering around the model matters just as much.

Live dashboard:
https://dashboard.derivasys.com

If you work on realtime options infrastructure, volatility systems, market data pipelines, or high-frequency analytics systems, I’d genuinely be interested to compare notes.
