<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sean Gordon</title>
    <description>The latest articles on DEV Community by Sean Gordon (@gdnaes).</description>
    <link>https://dev.to/gdnaes</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3921944%2Fabe04cc9-943c-441f-a81f-beeb952048ef.png</url>
      <title>DEV Community: Sean Gordon</title>
      <link>https://dev.to/gdnaes</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/gdnaes"/>
    <language>en</language>
    <item>
      <title>Building a Real-Time Crypto Volatility Surface System</title>
      <dc:creator>Sean Gordon</dc:creator>
      <pubDate>Sat, 09 May 2026 14:07:19 +0000</pubDate>
      <link>https://dev.to/gdnaes/building-a-real-time-crypto-volatility-surface-system-3nbf</link>
      <guid>https://dev.to/gdnaes/building-a-real-time-crypto-volatility-surface-system-3nbf</guid>
      <description>&lt;p&gt;See the PoC/MVP here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dashboard.derivasys.com" rel="noopener noreferrer"&gt;https://dashboard.derivasys.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F16o3zlbpvw196dytoh1t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F16o3zlbpvw196dytoh1t.png" alt=" " width="800" height="519"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Over the past few months I’ve been building a real-time crypto options volatility surface system from scratch.&lt;/p&gt;

&lt;p&gt;At a high level, the idea sounds fairly straightforward:&lt;/p&gt;

&lt;p&gt;ingest live options market data&lt;br&gt;
compute implied vols&lt;br&gt;
fit an SVI surface&lt;br&gt;
stream smiles, skews, risk reversals, butterflies, and surface diagnostics to a frontend&lt;/p&gt;

&lt;p&gt;Simple enough on paper.&lt;/p&gt;
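&lt;p&gt;For reference, the fit in step three targets the SVI family. A minimal sketch of the standard raw SVI parameterisation (Gatheral), with illustrative parameters; the system here fits an arbitrage-aware variant of this form:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import math

def svi_total_variance(k, a, b, rho, m, sigma):
    """Raw SVI: total implied variance w(k) at log-moneyness k."""
    return a + b * (rho * (k - m) + math.sqrt((k - m) ** 2 + sigma ** 2))

def svi_implied_vol(k, t, params):
    """Implied vol from total variance: w = vol**2 * t."""
    return math.sqrt(svi_total_variance(k, *params) / t)

# Illustrative parameters for a 30-day expiry: ~43% ATM vol
print(svi_implied_vol(0.0, 30 / 365, (0.005, 0.1, -0.3, 0.0, 0.1)))
&lt;/code&gt;&lt;/pre&gt;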

&lt;p&gt;In practice, almost none of the complexity ended up being the fit itself.&lt;/p&gt;

&lt;p&gt;The interesting part was everything around it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The First Version&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F34zi4kgt9tg3cppjlqdm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F34zi4kgt9tg3cppjlqdm.png" alt=" " width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The original system connected to a single exchange and maintained a live surface reasonably well.&lt;/p&gt;

&lt;p&gt;It:&lt;/p&gt;

&lt;p&gt;consumed websocket market data&lt;br&gt;
tracked order book state&lt;br&gt;
recalculated implied vols&lt;br&gt;
maintained smile state per expiry&lt;br&gt;
periodically recalibrated an arbitrage-aware SVI surface&lt;br&gt;
streamed live updates to a frontend dashboard&lt;/p&gt;

&lt;p&gt;It all ran on a medium EC2 instance.&lt;/p&gt;

&lt;p&gt;Not perfectly, but well enough that it felt like the architecture was fundamentally sound.&lt;/p&gt;
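&lt;p&gt;In shape, that first version was a single event-loop process. A minimal sketch of that shape, assuming asyncio and the websockets library (the helper names are illustrative, not the actual codebase):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import asyncio
import json
import websockets  # pip install websockets

def update_order_book(msg): ...       # track book state per instrument
def recompute_implied_vols(msg): ...  # re-invert quotes to vols
def refresh_smile_state(msg): ...     # per-expiry smile cache

async def run(url, subscribe):
    # V1: one event loop does everything, ingest through compute.
    async with websockets.connect(url) as ws:
        await ws.send(json.dumps(subscribe))
        async for raw in ws:
            msg = json.loads(raw)
            update_order_book(msg)
            recompute_implied_vols(msg)  # CPU work on the hot path
            refresh_smile_state(msg)
&lt;/code&gt;&lt;/pre&gt;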

&lt;p&gt;Then I added a second exchange.&lt;/p&gt;

&lt;p&gt;That’s when things started breaking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Problem Looked Like Websocket Reliability&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Initially it appeared to be a feed stability issue.&lt;/p&gt;

&lt;p&gt;Connections started dropping more frequently.&lt;br&gt;
Heartbeat handling became unreliable.&lt;br&gt;
Reconnects became messy.&lt;br&gt;
The UI became inconsistent and occasionally stale.&lt;/p&gt;

&lt;p&gt;At first glance, it looked like a networking problem caused by higher message throughput.&lt;/p&gt;

&lt;p&gt;But after enough profiling and observation, it became obvious that the websocket layer itself wasn’t really the issue.&lt;/p&gt;

&lt;p&gt;The system had quietly transitioned from being I/O-bound to CPU-bound.&lt;/p&gt;

&lt;p&gt;And once that happened, everything downstream started to collapse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Adding One More Venue Changed Everything&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The important lesson was that adding another exchange didn’t just mean “more messages”.&lt;/p&gt;

&lt;p&gt;It multiplied the amount of work triggered by those messages.&lt;/p&gt;

&lt;p&gt;Every additional venue increased:&lt;/p&gt;

&lt;p&gt;order book aggregation work&lt;br&gt;
implied volatility recalculations&lt;br&gt;
ATM updates&lt;br&gt;
Greeks recalculation&lt;br&gt;
smile updates&lt;br&gt;
SVI fit preparation&lt;br&gt;
arbitrage validation&lt;br&gt;
surface patch generation&lt;br&gt;
frontend websocket broadcasts&lt;br&gt;
persistence writes&lt;br&gt;
lifecycle logging&lt;/p&gt;

&lt;p&gt;The feeds themselves weren’t failing.&lt;/p&gt;

&lt;p&gt;The event loop simply stopped keeping up with the amount of compute happening behind the scenes.&lt;/p&gt;

&lt;p&gt;Once CPU became saturated:&lt;/p&gt;

&lt;p&gt;heartbeat responses became delayed&lt;br&gt;
reconnect handling degraded&lt;br&gt;
market data became stale&lt;br&gt;
websocket queues backed up&lt;br&gt;
frontend latency increased&lt;/p&gt;

&lt;p&gt;The visible symptom looked like a websocket problem.&lt;/p&gt;

&lt;p&gt;The real problem was compute starvation.&lt;/p&gt;
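&lt;p&gt;One way to see this directly is to measure event-loop scheduling lag rather than watching connection errors. A minimal diagnostic sketch (the 50 ms threshold is illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import asyncio
import time

async def monitor_loop_lag(interval=0.1):
    """Measure how late the event loop wakes this task.

    When CPU-bound work hogs the loop, the lag grows; the same
    delay is what makes heartbeats late and feeds go stale.
    """
    while True:
        start = time.monotonic()
        await asyncio.sleep(interval)
        lag = time.monotonic() - start - interval
        if lag &gt; 0.05:  # more than 50 ms behind schedule
            print(f"event loop lag: {lag * 1000:.1f} ms")
&lt;/code&gt;&lt;/pre&gt;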

&lt;p&gt;That distinction ended up being one of the most valuable lessons of the entire project.&lt;/p&gt;

&lt;p&gt;In real-time market data systems, reliability failures are often not caused by the connection layer itself.&lt;/p&gt;

&lt;p&gt;They’re caused by downstream computation stealing enough time that the connection layer can no longer behave reliably.&lt;/p&gt;

&lt;p&gt;There are ~4,000 instruments (with book depth) for BTC; a spot move means 40k+ vols have to be recalculated. No wonder we were getting disconnects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Wrong Solution Would Have Been “Rewrite It”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At that point, my first instinct was probably the same one many people would have:&lt;/p&gt;

&lt;p&gt;rewrite the hot paths in Rust or C++.&lt;/p&gt;

&lt;p&gt;And to be fair, there are absolutely parts of the system where lower-level languages would help.&lt;/p&gt;

&lt;p&gt;But before doing that, I wanted to understand how much performance was actually being lost to architecture rather than raw language overhead.&lt;/p&gt;

&lt;p&gt;It turned out: a lot.&lt;/p&gt;

&lt;p&gt;The deeper issue was that the system was doing far too much unnecessary work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Optimisation Phase&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The next stage of the project became much less about “making code faster” and much more about reducing how much work happened per update.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ignoring Small Moves&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the biggest wins was realising that not every underlying move deserved a full recomputation chain.&lt;/p&gt;

&lt;p&gt;Very small spot movements often had negligible impact on the displayed surface.&lt;/p&gt;

&lt;p&gt;So instead of immediately recalculating everything, the system started ignoring extremely small moves or routing them through approximation paths.&lt;/p&gt;

&lt;p&gt;That alone removed a huge amount of unnecessary churn.&lt;/p&gt;
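&lt;p&gt;A minimal sketch of that gating, assuming simple relative-move thresholds (the basis-point cutoffs are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;class SpotGate:
    """Route spot ticks: ignore tiny moves, approximate small
    ones, and fully recompute only when the move is material."""

    def __init__(self, ignore_bps=1.0, full_bps=5.0):
        self.ignore_bps = ignore_bps
        self.full_bps = full_bps
        self.ref_spot = None

    def classify(self, spot):
        if self.ref_spot is None:
            self.ref_spot = spot
            return "full"
        move_bps = abs(spot / self.ref_spot - 1.0) * 10_000
        if move_bps &gt;= self.full_bps:
            self.ref_spot = spot   # reset reference on full recompute
            return "full"
        if move_bps &gt;= self.ignore_bps:
            return "approximate"   # cheap path, next section
        return "ignore"

gate = SpotGate()
print(gate.classify(50_000.0))  # full (first tick)
print(gate.classify(50_004.0))  # ignore (~0.8 bps)
print(gate.classify(50_015.0))  # approximate (~3 bps)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Resetting the reference only on a full recompute means a slow drift still accumulates until it crosses the threshold, instead of being ignored forever.&lt;/p&gt;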

&lt;p&gt;&lt;strong&gt;Approximation Paths&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For small spot changes, it became possible to approximate updates instead of running full implied vol recalculations and downstream refreshes.&lt;/p&gt;

&lt;p&gt;The key insight was that perfect precision on every tick was less important than maintaining realtime system behaviour overall.&lt;/p&gt;

&lt;p&gt;The system became much healthier once it stopped trying to fully recompute the world on every tiny movement.&lt;/p&gt;
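&lt;p&gt;A minimal sketch of one such approximation, assuming the smile is locally linear in log-moneyness (parameters illustrative; the production path may differ):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import math

def approx_vol_after_spot_move(strike, new_forward, atm_vol, skew):
    """First-order smile re-read after a small forward move.

    Assumes sigma(k) = atm_vol + skew * k locally, with
    k = ln(strike / forward). Re-reading k at the new forward
    is far cheaper than re-inverting every option price.
    """
    k_new = math.log(strike / new_forward)
    return atm_vol + skew * k_new

# Forward drifts from 50_000 to 50_020: the 52_000 strike's vol
# slides along the smile instead of triggering a full recompute.
print(approx_vol_after_spot_move(52_000, 50_020, 0.45, -0.10))
&lt;/code&gt;&lt;/pre&gt;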

&lt;p&gt;&lt;strong&gt;Batching Updates&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Another large improvement came from batching work together rather than reacting independently to every incoming message.&lt;/p&gt;

&lt;p&gt;Instead of:&lt;/p&gt;

&lt;p&gt;message arrives&lt;br&gt;
recompute&lt;br&gt;
publish&lt;br&gt;
repeat thousands of times&lt;/p&gt;

&lt;p&gt;the system began accumulating updates and processing them in controlled batches.&lt;/p&gt;

&lt;p&gt;This dramatically reduced scheduler pressure and duplicate recomputation.&lt;/p&gt;
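&lt;p&gt;A minimal sketch of that pattern with asyncio, coalescing to the newest message per instrument inside a short window (the 50 ms window is illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import asyncio

async def batch_worker(queue, recompute, window_s=0.05):
    """Coalesce updates: keep only the newest message per
    instrument within a window, then recompute once per batch.

    Queue items are assumed to be (instrument, message) tuples.
    """
    loop = asyncio.get_running_loop()
    while True:
        instrument, msg = await queue.get()   # first update of batch
        batch = {instrument: msg}
        deadline = loop.time() + window_s
        while deadline - loop.time() &gt; 0:
            try:
                instrument, msg = await asyncio.wait_for(
                    queue.get(), timeout=deadline - loop.time())
                batch[instrument] = msg       # newer message wins
            except asyncio.TimeoutError:
                break
        recompute(batch)  # one pass over the coalesced batch
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Latest-wins coalescing also bounds batch size: a batch can never hold more entries than there are instruments.&lt;/p&gt;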

&lt;p&gt;&lt;strong&gt;Separating Fitting From Display&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Originally, too much of the system shared the same hot path.&lt;/p&gt;

&lt;p&gt;Eventually the architecture started separating:&lt;/p&gt;

&lt;p&gt;ingestion&lt;br&gt;
fitting&lt;br&gt;
persistence&lt;br&gt;
frontend broadcasting&lt;/p&gt;

&lt;p&gt;because those components have very different latency and throughput requirements.&lt;/p&gt;

&lt;p&gt;Realtime display updates do not necessarily require the exact same cadence as surface fitting.&lt;/p&gt;

&lt;p&gt;That separation became extremely important.&lt;/p&gt;
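&lt;p&gt;A minimal sketch of that decoupling: each stage runs at its own cadence instead of sharing the per-message hot path (the stage callables and intervals are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import asyncio

async def run_at_cadence(interval_s, step):
    """Run one pipeline stage at its own pace."""
    while True:
        await step()
        await asyncio.sleep(interval_s)

async def main(ingest, fit_surface, broadcast, persist):
    # Ingestion runs at feed speed; everything else is scheduled
    # at the cadence its consumers actually need.
    await asyncio.gather(
        ingest(),                          # as fast as the feed
        run_at_cadence(1.0, fit_surface),  # SVI refit ~once a second
        run_at_cadence(0.2, broadcast),    # frontend at ~5 Hz
        run_at_cadence(5.0, persist),      # persistence every few seconds
    )
&lt;/code&gt;&lt;/pre&gt;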

&lt;p&gt;&lt;strong&gt;Where It Ended Up&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After enough optimisation work, the system eventually stabilised at roughly:&lt;/p&gt;

&lt;p&gt;~5,000 market data messages per second&lt;/p&gt;

&lt;p&gt;while still:&lt;/p&gt;

&lt;p&gt;maintaining live smile state&lt;br&gt;
updating the surface in realtime&lt;br&gt;
broadcasting frontend updates&lt;br&gt;
persisting system state&lt;/p&gt;

&lt;p&gt;Just.&lt;/p&gt;

&lt;p&gt;And honestly, the “just” is probably the important part.&lt;/p&gt;

&lt;p&gt;Because the interesting thing about realtime systems is that the bottleneck is rarely where you initially expect it to be.&lt;/p&gt;

&lt;p&gt;You start by thinking about websocket throughput.&lt;/p&gt;

&lt;p&gt;Then eventually you’re thinking about:&lt;/p&gt;

&lt;p&gt;event loop starvation&lt;br&gt;
scheduler pressure&lt;br&gt;
recomputation graphs&lt;br&gt;
batching windows&lt;br&gt;
cache invalidation&lt;br&gt;
state propagation&lt;br&gt;
downstream fan-out costs&lt;br&gt;
whether a 0.05% move is even worth processing immediately&lt;/p&gt;

&lt;p&gt;At some point the project stopped feeling like a quant modelling exercise and started feeling much more like distributed systems engineering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Comes Next&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The current architecture still wouldn’t scale cleanly forever.&lt;/p&gt;

&lt;p&gt;It would struggle with:&lt;/p&gt;

&lt;p&gt;any more exchanges&lt;br&gt;
any more currencies&lt;br&gt;
substantially higher throughput&lt;/p&gt;

&lt;p&gt;The next stage will involve a more distributed ingestion and processing model using Kafka or Redpanda-style fan-out.&lt;/p&gt;

&lt;p&gt;The direction now looks more like:&lt;/p&gt;

&lt;p&gt;independent ingestion services&lt;br&gt;
distributed state management&lt;br&gt;
asynchronous fit workers&lt;br&gt;
decoupled persistence&lt;br&gt;
scalable broadcast infrastructure&lt;/p&gt;

&lt;p&gt;Rather than one large realtime process trying to do everything.&lt;/p&gt;
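&lt;p&gt;A sketch of what one of those independent ingestion services might look like, assuming aiokafka and an illustrative topic layout (read_feed is a hypothetical feed reader, not part of the current system):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import json
from aiokafka import AIOKafkaProducer  # pip install aiokafka

async def ingest_venue(venue, read_feed, bootstrap="localhost:9092"):
    """One ingestion service per venue: normalise messages and fan
    them out on a per-venue topic. Fit workers, persistence, and
    broadcasters consume downstream at their own pace."""
    producer = AIOKafkaProducer(bootstrap_servers=bootstrap)
    await producer.start()
    try:
        async for msg in read_feed():  # hypothetical websocket reader
            await producer.send_and_wait(
                f"md.options.{venue}", json.dumps(msg).encode())
    finally:
        await producer.stop()
&lt;/code&gt;&lt;/pre&gt;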

&lt;p&gt;But that evolution is part of what has made the project so interesting.&lt;/p&gt;

&lt;p&gt;The quant model matters.&lt;/p&gt;

&lt;p&gt;But the systems engineering around the model matters just as much.&lt;/p&gt;

&lt;p&gt;Live dashboard:&lt;br&gt;
&lt;a href="https://dashboard.derivasys.com" rel="noopener noreferrer"&gt;https://dashboard.derivasys.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you work on realtime options infrastructure, volatility systems, market data pipelines, or high-frequency analytics systems, I’d genuinely be interested to compare notes.&lt;/p&gt;

</description>
      <category>cryptocurrency</category>
      <category>vol</category>
      <category>distributedsystems</category>
      <category>python</category>
    </item>
  </channel>
</rss>
