DEV Community: hpc group

How We Built a Sub-Millisecond Crypto Market Data Feed in C++

hpc group — Tue, 14 Apr 2026 13:10:05 +0000

Every crypto exchange speaks its own dialect. Binance sends depthUpdate messages with "b" and "a" arrays. Coinbase wraps updates in a channel/type envelope. OKX gzip-compresses its WebSocket frames. Bybit uses a different snapshot synchronization protocol than any of them. If you want to build anything that consumes order book data from multiple exchanges, you are stuck writing and maintaining a bespoke parser for each one, each with its own reconnect logic, snapshot sync state machine, and symbol naming convention.

We built Microverse to solve this problem: a single C++ pipeline that normalizes real-time order book data from 20 exchanges into a uniform stream, and serves it over a free WebSocket API. This article walks through the architecture, the hard engineering problems we hit, and the techniques we used to keep end-to-end latency under one millisecond.

The Pipeline at a Glance

The data path from exchange to client has seven stages:

Exchange WS → WS Driver → Parser → Book → SHM Ring → mdf_server → Gateway → Client

Each exchange runs as its own handler process. The handler connects to the exchange, parses messages, maintains a local order book, and writes normalized updates into a shared-memory ring buffer. A central mdf_server process reads from all 20 ring buffers and distributes updates to downstream consumers: a web dashboard, a WebSocket gateway for external clients, and internal analytics. No message broker. No serialization framework. Just lock-free shared memory and TCP.

Let us walk through each stage.

Stage 1: WebSocket Driver

Each handler spawns an SSL WebSocket connection to its exchange. The driver (mcast_websocket.cpp) handles the full lifecycle: TLS handshake, WebSocket frame decoding, ping/pong keepalives, and transparent decompression for exchanges that gzip their payloads (HTX, OKX, and others).

When a complete text frame arrives, the driver writes the raw JSON into a buffer and tags it with a port number: 1 for incremental depth updates, 2 for snapshots. Every message is also written to a binary capture file (24-byte header plus JSON payload) so we can replay production traffic through the pipeline deterministically during development.

The driver also runs a separate snapshot thread. On initial subscription or when a sequence gap is detected, it makes an HTTPS REST call to fetch a full book snapshot and pushes it into an internal message queue, which the main recv() loop picks up on the next iteration.

Stage 2: Parser (simdjson)

Each exchange has a dedicated parser class (e.g., BinanceParser, CoinbaseParser, KrakenParser) that implements a common interface: processPacket(buffer, len, channel). The parser's job is to extract price level updates from exchange-specific JSON and translate them into uniform levelAdd / levelDelete calls on the book.

We use simdjson for JSON parsing. It processes JSON at gigabytes per second using SIMD instructions, which matters when you are parsing hundreds of thousands of messages per second across all exchanges. One critical lesson we learned the hard way: simdjson's on-demand parser modifies the buffer in-place during string unescaping. The escaped bytes \"bids\" get rewritten to bids\0..., destroying the structural quote characters. A second iterate() call over the same buffer silently returns zero results. Every parser must do exactly one parse pass per buffer.

The parser also handles the trickiest part of exchange integration: price normalization. All prices are converted to fixed-point integers with 8 decimal places. The string "98500.12" becomes the integer 9850012000000 (98500.12 × 10^8). This eliminates floating-point comparison issues entirely and reduces every price comparison in the book to a plain integer compare.
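
To make the conversion concrete, here is a minimal sketch of a string-to-fixed-point routine. The name parseFixedPrice and its error handling are assumptions for illustration, not the Microverse implementation. It works digit by digit so the value never passes through a double, and it truncates any digits beyond the eighth decimal place.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical helper: converts a decimal string such as "98500.12" into a
// fixed-point integer with 8 implied decimal places. Returns std::nullopt on
// malformed input or when the value would not fit in an int64_t.
constexpr int kPriceDecimals = 8;

inline std::optional<int64_t> parseFixedPrice(std::string_view s) {
    if (s.empty()) return std::nullopt;
    bool negative = false;
    std::size_t i = 0;
    if (s[0] == '-') { negative = true; i = 1; }

    constexpr int64_t kLimit = INT64_MAX / 10 - 1;  // conservative overflow guard
    int64_t value = 0;
    int fracDigits = 0;
    bool seenDot = false;
    bool seenDigit = false;

    for (; i < s.size(); ++i) {
        const char c = s[i];
        if (c == '.') {
            if (seenDot) return std::nullopt;  // second decimal point
            seenDot = true;
            continue;
        }
        if (c < '0' || c > '9') return std::nullopt;
        seenDigit = true;
        if (seenDot) {
            if (fracDigits == kPriceDecimals) continue;  // truncate extra precision
            ++fracDigits;
        }
        if (value > kLimit) return std::nullopt;
        value = value * 10 + (c - '0');
    }
    if (!seenDigit) return std::nullopt;

    // Pad with zeros so exactly kPriceDecimals fractional digits are stored.
    for (; fracDigits < kPriceDecimals; ++fracDigits) {
        if (value > kLimit) return std::nullopt;
        value *= 10;
    }
    return negative ? -value : value;
}
```

Calling parseFixedPrice("98500.12") yields 9850012000000, the value quoted above. Truncating rather than rounding is a choice made for this sketch; a production parser may prefer to reject over-precise input.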

Stage 3: Snapshot Synchronization

Every exchange uses a variant of the same pattern: subscribe to a WebSocket stream of incremental updates, fetch a REST snapshot to establish a baseline, then apply only those incremental updates whose sequence numbers come after the snapshot.

The devil is in the details. Binance gives you a lastUpdateId on the snapshot and a first/final update ID range on each delta; you buffer deltas until the snapshot arrives, discard any whose final update ID is <= snapshot.lastUpdateId, and apply the rest in order. If you detect a gap (a delta that does not start at the last applied update ID + 1), you need to re-snapshot. OKX uses a checksum field you can validate against. Coinbase has a completely different sequencing model.

Each parser maintains per-symbol sync state:

struct SymbolState {
    bool snapshot_synced;
    bool needs_resnapshot;
    uint64_t seq_last_applied;
    std::vector<PendingUpdate> pending_updates;
};

When the parser detects a gap or stale data, it sets needs_resnapshot = true. The handler's main loop polls for this via popResnapshot() and triggers a new REST snapshot fetch. Until the snapshot arrives and sync is re-established, all incremental updates for that symbol are silently dropped. This is a deliberate design choice: we would rather show stale data for a fraction of a second than apply updates to a book that is out of sync, which would produce silently wrong prices.
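
The sync rules described in this section can be sketched as a small state machine. This is an illustration under assumptions, not the actual parser code: PendingUpdate is reduced to a hypothetical sequence range, and onDelta / onSnapshot are made-up names. SymbolState keeps the fields shown above, with default initializers added.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical delta header: the update covers sequence numbers
// [first_seq, last_seq]. Price levels are omitted to keep the sketch short.
struct PendingUpdate {
    uint64_t first_seq;
    uint64_t last_seq;
};

struct SymbolState {
    bool snapshot_synced = false;
    bool needs_resnapshot = false;
    uint64_t seq_last_applied = 0;
    std::vector<PendingUpdate> pending_updates;
};

enum class DeltaAction { Buffered, Dropped, Applied, Gap };

// Decide what to do with one incremental update.
inline DeltaAction onDelta(SymbolState& st, const PendingUpdate& u) {
    if (!st.snapshot_synced) {
        if (st.needs_resnapshot) return DeltaAction::Dropped;  // resync in progress
        st.pending_updates.push_back(u);                       // initial sync: buffer
        return DeltaAction::Buffered;
    }
    if (u.last_seq <= st.seq_last_applied) return DeltaAction::Dropped;  // stale
    if (u.first_seq > st.seq_last_applied + 1) {                         // gap
        st.snapshot_synced = false;
        st.needs_resnapshot = true;
        return DeltaAction::Gap;
    }
    // A real parser would call levelAdd / levelDelete on the book here.
    st.seq_last_applied = u.last_seq;
    return DeltaAction::Applied;
}

// Install a snapshot taken at snapshot_seq, then replay buffered deltas.
// Returns how many buffered deltas were applied.
inline int onSnapshot(SymbolState& st, uint64_t snapshot_seq) {
    std::vector<PendingUpdate> buffered = std::move(st.pending_updates);
    st.pending_updates.clear();
    st.seq_last_applied = snapshot_seq;
    st.snapshot_synced = true;
    st.needs_resnapshot = false;
    int applied = 0;
    for (const PendingUpdate& u : buffered) {
        if (onDelta(st, u) == DeltaAction::Applied) ++applied;
    }
    return applied;
}
```

In this sketch, deltas are buffered only during the initial sync and dropped during a re-snapshot, which follows the behaviour described above. Exchanges with different sequencing models would need their own variant of the gap test.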

Stage 4: The Order Book

The book (mdf_book.h) stores price levels in a sorted linked list per side (bid/ask). When a parser calls levelAdd, the book finds or inserts the price level, updates its quantity, and calls a virtual priceLevelChanged() callback. When levelDelete is called (quantity goes to zero), the level is removed from the list and the same callback fires.

The linked list uses a slab allocator (SlabbedVector) rather than std::vector to avoid pointer invalidation on growth. Slabs are allocated in fixed-size chunks (128 elements) and never freed until the container is destroyed. This gives us O(1) allocation, zero reallocation copies, and stable pointers.
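
The post does not show SlabbedVector's interface, so the class below is a hypothetical reconstruction of the idea only: storage grows one fixed-size slab at a time, and existing elements never move, so pointers into the container stay valid as it grows.

```cpp
#include <array>
#include <cstddef>
#include <memory>
#include <vector>

// Sketch of a slab-backed container (not the real SlabbedVector).
// T must be default-constructible and copy-assignable.
template <typename T, std::size_t SlabSize = 128>
class SlabbedVector {
public:
    // Appends a copy of value and returns a reference that remains valid
    // for the lifetime of the container.
    T& push_back(const T& value) {
        if (size_ == slabs_.size() * SlabSize) {
            // Only the table of slab pointers may reallocate here;
            // the slabs themselves, and the elements in them, stay put.
            slabs_.push_back(std::make_unique<std::array<T, SlabSize>>());
        }
        T& slot = (*slabs_[size_ / SlabSize])[size_ % SlabSize];
        slot = value;
        ++size_;
        return slot;
    }

    T& operator[](std::size_t i) {
        return (*slabs_[i / SlabSize])[i % SlabSize];
    }

    std::size_t size() const { return size_; }

private:
    std::vector<std::unique_ptr<std::array<T, SlabSize>>> slabs_;
    std::size_t size_ = 0;
};
```

With std::vector, a pointer taken to element 0 can dangle after a later push_back triggers reallocation. Here it cannot, which is what lets linked-list nodes hold raw pointers to one another. A free list for reusing deleted levels is left out of this sketch.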

Stage 5: Shared-Memory Ring Buffers

This is where the latency story gets interesting. Each handler writes to a shared-memory ring buffer mapped at /dev/shm/<exchange>_response. The ring is a single-producer, single-consumer (SPSC) lock-free queue implemented with two cache-aligned atomic counters:

Offset 0:    [header]
Offset 64:   atomic<long> r   // reader position  (CACHE_ALIGNED)
Offset 128:  atomic<long> w   // writer position   (CACHE_ALIGNED)
Offset 256+: [data: variable-length MDFMsg records]

The reader and writer positions are on separate cache lines (64-byte aligned) to eliminate false sharing. The writer advances with store(release), the reader reads with load(acquire). There are no locks, no syscalls, and no kernel involvement in the hot path. A Linux futex is used only when the reader has no data and wants to sleep rather than spin.
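
The acquire/release handshake is easier to see in code. The sketch below is an in-process SPSC ring with fixed-size slots, which is a simplification: the ring described here lives in shared memory and carries variable-length records. The class and method names are assumptions.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

// Single-producer, single-consumer ring. Exactly one thread may call
// tryPush and exactly one thread may call tryPop.
template <typename T, std::size_t Capacity>
class SpscRing {
    static_assert(Capacity > 0 && (Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of two");

public:
    // Producer side. Returns false when the ring is full.
    bool tryPush(const T& item) {
        const long w = w_.load(std::memory_order_relaxed);
        const long r = r_.load(std::memory_order_acquire);
        if (w - r == static_cast<long>(Capacity)) return false;
        slots_[static_cast<std::size_t>(w) & (Capacity - 1)] = item;
        w_.store(w + 1, std::memory_order_release);  // publish the slot
        return true;
    }

    // Consumer side. Returns std::nullopt when the ring is empty.
    std::optional<T> tryPop() {
        const long r = r_.load(std::memory_order_relaxed);
        const long w = w_.load(std::memory_order_acquire);
        if (r == w) return std::nullopt;
        T item = slots_[static_cast<std::size_t>(r) & (Capacity - 1)];
        r_.store(r + 1, std::memory_order_release);  // hand the slot back
        return item;
    }

private:
    // Each counter sits on its own 64-byte cache line to avoid false sharing.
    alignas(64) std::atomic<long> r_{0};
    alignas(64) std::atomic<long> w_{0};
    alignas(64) std::array<T, Capacity> slots_{};
};
```

The release store on w_ guarantees that the slot contents are visible to the reader before the new writer position is. The release store on r_ does the same in the other direction, so the writer never overwrites a slot that is still being read.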

Messages are variable-length and written directly as packed C structs. A price level change is 48 bytes:

struct MDFPriceLevelChangeMsg {
    uint16_t  size;       // message size
    uint8_t   type;       // 42
    secid_t   secid;      // symbol ID
    side_t    side;       // BID=0, ASK=1
    price_t   price;      // fixed-point, 8 decimals
    int64_t   shares;     // quantity
    int       num_orders; // order count at level
    timestamp_t timestamp;// nanoseconds
};

No serialization, no deserialization. The mdf_server reads the struct directly out of shared memory. This is true zero-copy: the data written by the handler is the exact byte layout read by the server.
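
A layout like this is worth pinning down with compile-time checks, because writer and reader must agree on every byte. The sketch below assumes widths for secid_t, side_t, price_t, and timestamp_t, which the post does not give. With these particular guesses and the compiler's natural alignment on x86-64, the struct comes out at 48 bytes, matching the figure above, but the real typedefs may differ.

```cpp
#include <cstdint>
#include <cstring>
#include <type_traits>

// Assumed widths, for illustration only.
using secid_t     = uint32_t;
using side_t      = uint8_t;
using price_t     = int64_t;
using timestamp_t = uint64_t;

struct MDFPriceLevelChangeMsg {
    uint16_t    size;        // message size in bytes
    uint8_t     type;        // 42
    secid_t     secid;       // symbol ID
    side_t      side;        // BID=0, ASK=1
    price_t     price;       // fixed-point, 8 decimals
    int64_t     shares;      // quantity
    int         num_orders;  // order count at level
    timestamp_t timestamp;   // nanoseconds
};

// Fail the build if the layout drifts from what the other process expects.
static_assert(std::is_trivially_copyable_v<MDFPriceLevelChangeMsg>,
              "message must be safe to move as raw bytes");
static_assert(sizeof(MDFPriceLevelChangeMsg) == 48,
              "unexpected message size");

// Write the struct's bytes into a ring slot.
inline void writeMsg(unsigned char* dst, const MDFPriceLevelChangeMsg& m) {
    std::memcpy(dst, &m, sizeof m);
}

// Read a message back out of a ring slot.
inline MDFPriceLevelChangeMsg readMsg(const unsigned char* src) {
    MDFPriceLevelChangeMsg m;
    std::memcpy(&m, src, sizeof m);
    return m;
}
```

This sketch uses memcpy to stay clear of alignment and aliasing pitfalls. A zero-copy reader would instead access the record in place, which requires every record in the ring to start at a suitably aligned offset.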

The handler also batches writes to amortize the cost of the atomic store. It accumulates messages in a local buffer and flushes to the shared-memory ring when a threshold is reached or the main loop goes idle.

Stage 6: mdf_server (Aggregator)

The mdf_server process attaches to all 20 handler ring buffers and runs a tight poll loop:

for each ring:
    while ring has data:
        read MDFMsg from ring
        route to subscribed clients via TCP

It maintains a subscription table mapping symbol IDs to connected clients. When a web dashboard or gateway subscribes to "binance:BTCUSDT", the server writes a subscription request into the handler's request ring (/dev/shm/binance_request). The handler picks it up, fetches a snapshot, builds the initial book, and writes a full MDFRefreshMsg (containing all bid and ask levels) back through the response ring. From that point on, incremental updates flow automatically.
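
A subscription table of this kind can be sketched in a few lines. Everything here is hypothetical (the SecId and ClientId types, the class name, the return conventions); the point is only to show how the first subscriber for a symbol can trigger the request to the handler, while later subscribers simply join the fan-out list.

```cpp
#include <cstdint>
#include <set>
#include <unordered_map>
#include <vector>

using SecId    = uint32_t;  // assumed symbol ID type
using ClientId = int;       // for example, a socket descriptor

class SubscriptionTable {
public:
    // Returns true if this is the first subscriber for the symbol, meaning
    // the server should forward a subscription request to the handler.
    bool subscribe(SecId secid, ClientId client) {
        std::set<ClientId>& clients = table_[secid];
        const bool first = clients.empty();
        clients.insert(client);
        return first;
    }

    // Returns true if the last subscriber for the symbol just left.
    bool unsubscribe(SecId secid, ClientId client) {
        auto it = table_.find(secid);
        if (it == table_.end()) return false;
        if (it->second.erase(client) == 0) return false;
        if (!it->second.empty()) return false;
        table_.erase(it);
        return true;
    }

    // Clients that should receive an update for this symbol.
    std::vector<ClientId> route(SecId secid) const {
        auto it = table_.find(secid);
        if (it == table_.end()) return {};
        return std::vector<ClientId>(it->second.begin(), it->second.end());
    }

private:
    std::unordered_map<SecId, std::set<ClientId>> table_;
};
```

In the poll loop, each message read from a ring would be passed through route(msg.secid) and sent to every client returned. Returning a vector keeps the sketch simple; a hot path would more likely iterate the set in place.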

The server also handles heartbeats, connection management, and a subscription protocol that lets clients dynamically add and remove symbols.

Stage 7: WebSocket Gateway

The gateway (mdf_gateway.cpp) connects to mdf_server as an internal TCP client, maintains its own in-memory copy of every book it subscribes to, and serves external clients over WebSocket with JSON payloads. It supports per-exchange subscriptions, consolidated cross-exchange views, and top-of-book snapshots.

The gateway includes an embedded HTML test page, so you can point a browser at it and immediately see live order books rendered with a cyberpunk-themed dashboard. But more practically, you can connect with any WebSocket client and get structured JSON updates.

Performance Characteristics

The pipeline achieves sub-millisecond end-to-end latency from exchange WebSocket receipt to client delivery. Here is where the time goes:

SSL read + WebSocket decode: ~50-100us
simdjson parse + book update: ~10-30us
SHM ring write + read: ~1-5us
TCP send to gateway/viewer: ~50-200us

Summed, these stages come to roughly 110 to 335 microseconds, which leaves headroom under the one-millisecond target.

The key design decisions that keep latency low:

No serialization layer. Messages are packed C structs written directly to shared memory and read directly by the consumer. No protobuf, no flatbuffers, no JSON encoding between internal components.
SPSC lock-free rings. The only synchronization primitive in the hot path is a pair of atomic load/store operations on cache-aligned counters. No mutexes, no condition variables.
Slab allocation. The order book does not call malloc or free on each update. Price levels come from slabs that are allocated a chunk at a time, grow when needed, and never shrink.
Fixed-point arithmetic. All prices and quantities are 64-bit integers. No floating-point comparison, no rounding issues, no epsilon checks.
Per-exchange process isolation. Each handler is a separate OS process. A crash or hang in the Kraken parser does not affect Binance. The mdf_server simply stops seeing updates on that ring until the watchdog restarts the handler.

20 Exchanges, One API

The system currently normalizes data from: Ascendex, Binance, BingX, Bitfinex, Bitget, Bitmart, Bybit, Coinbase, CoinEx, Crypto.com, Gate.io, Gemini, HTX, Kraken, KuCoin, LBank, MEXC, OKX, Phemex, and Upbit. Each required writing a dedicated parser, figuring out its snapshot sync protocol, handling its compression scheme, and mapping its symbol naming convention to our normalized format.

Adding a new exchange typically takes a day of work: study the WebSocket API docs, write the parser class, add snapshot sync logic, test against captures, and deploy. The driver, book, ring buffer, and distribution layers are all reusable.

Try It

The WebSocket API is free and requires no authentication. Connect and subscribe to any symbol across any supported exchange:

const ws = new WebSocket('wss://api.microversesystems.com');

ws.onopen = () => {
  // Subscribe to BTC/USDT books from Binance and Coinbase
  ws.send(JSON.stringify({
    op: 'subscribe',
    symbols: ['binance:BTCUSDT', 'coinbase:BTC-USD']
  }));
};

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  if (msg.type === 'book') {
    console.log(`${msg.exchange}:${msg.symbol}`);
    console.log(`  Best bid: ${msg.bids[0][0]} @ ${msg.bids[0][1]}`);
    console.log(`  Best ask: ${msg.asks[0][0]} @ ${msg.asks[0][1]}`);
  }
};

You can see live order books from all 20 exchanges on the dashboard, read the API documentation, or learn more at microversesystems.com.

Wrapping Up

Building a low-latency market data feed is not about any single optimization. It is about eliminating unnecessary work at every stage: no serialization overhead, no lock contention, no memory allocation in the hot path, no floating-point arithmetic. Each decision compounds.

The hardest part was not the C++ or the performance work. It was the exchange integration: 20 different WebSocket APIs, 20 different snapshot sync protocols, 20 different ways of naming BTC/USDT. That is the real engineering work, and it is the reason we built this as a service rather than a library. You should not have to reverse-engineer Phemex's sequence number semantics just to get a clean order book.

If you are building trading systems, analytics, or dashboards that need real-time crypto data, give the API a try. It is free, it is fast, and it covers 20 exchanges with a single WebSocket connection.

How We Built a Sub-Millisecond Crypto Feed in C++

hpc group — Thu, 02 Apr 2026 21:13:22 +0000

Most crypto market data APIs give you top-of-book prices with 100ms+ latency. We wanted full L2 order books from 21 exchanges, all normalized into a single WebSocket stream, at sub-millisecond speed. So we built it.

This post covers the core engineering decisions behind Microverse Systems — a free, real-time order book API that aggregates depth-of-market data across major crypto exchanges.

The Problem

If you're building a trading bot, an arbitrage scanner, or even a simple price dashboard, you hit the same wall: every exchange has its own WebSocket protocol, its own message format, its own rate limits. Binance sends JSON. Bybit sends JSON but structures it differently. Some exchanges batch updates, others stream individual changes.

Normalizing all of this in Python or Node means you're spending more time parsing messages than actually using the data. And if latency matters to your strategy, the language overhead alone puts you at a disadvantage.

Why C++

We went with C++ for the core feed handler, not because we enjoy debugging segfaults, but because it was the only way to hit our latency targets:

Zero-copy message parsing: Incoming WebSocket frames are parsed in-place using pointer arithmetic rather than deserializing into intermediate objects. This avoids heap allocations on the hot path.
Lock-free order book structures: Each exchange's order book is maintained in a lock-free data structure that allows readers (subscriber threads) to access snapshots without blocking the writer (the feed handler thread).
Kernel bypass networking: On our production boxes, we use DPDK to bypass the kernel's TCP/IP stack entirely. This shaves off ~15 microseconds per packet compared to standard socket reads.

The result is internal tick-to-publish latency under 50 microseconds for most exchanges. The bottleneck is almost always the exchange's own WebSocket server, not our processing.

Architecture Overview

Exchange WS Feeds ──► C++ Feed Handlers ──► Normalized Book Builder
                                                    │
                                                    ▼
                                            Snapshot Cache (shared memory)
                                                    │
                                            ┌───────┴───────┐
                                            ▼               ▼
                                    WebSocket Gateway   REST API
                                    (user-facing)       (historical)

Each exchange gets its own feed handler process. These are independent — if Bybit's feed dies, it doesn't take down Binance. The handlers write normalized book updates into a shared-memory ring buffer that the WebSocket gateway reads from.

The gateway fans out to subscribers. When a new client connects and requests, say, BTC/USDT on Binance, it gets an immediate full-depth snapshot from the cache, then a stream of incremental updates.

The Normalization Layer

This is where most of the complexity lives. Every exchange represents order books slightly differently:

Binance sends an initial snapshot + diff updates with firstUpdateId / lastUpdateId for sequencing
Bybit sends periodic snapshots + delta updates with a sequence number
OKX batches multiple instruments in a single message with checksums
Kraken uses a completely different depth model with republish flags

Our normalization layer maintains a state machine per exchange per instrument. It handles:

Initial sync (requesting a snapshot, buffering diffs until the snapshot arrives)
Sequence validation (detecting gaps and re-syncing)
Cross-exchange normalization (converting all price/qty to the same decimal format)

We checksum the book state after every update and compare it against exchange-provided checksums where available (OKX, Kraken). If there's a mismatch, we force a full re-sync.
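
As an illustration of the validation step, the sketch below computes a standard CRC-32 (the reflected 0xEDB88320 polynomial used by zlib and Ethernet) over a string assembled from the top book levels. The string layout in checksumInput is an assumption: each exchange documents its own format and depth, and the helper names are made up for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Bitwise CRC-32 (IEEE 802.3). Table-free to keep the sketch short.
inline uint32_t crc32(std::string_view data) {
    uint32_t crc = 0xFFFFFFFFu;
    for (unsigned char byte : data) {
        crc ^= byte;
        for (int bit = 0; bit < 8; ++bit) {
            // Shift right; XOR in the polynomial when the low bit was set.
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

// A level as (price, quantity) strings exactly as the exchange sent them.
using Level = std::pair<std::string, std::string>;

// Hypothetical layout: bid and ask levels interleaved, joined with ':'.
inline std::string checksumInput(const std::vector<Level>& bids,
                                 const std::vector<Level>& asks,
                                 std::size_t depth) {
    std::string out;
    auto append = [&out](const Level& level) {
        if (!out.empty()) out += ':';
        out += level.first;
        out += ':';
        out += level.second;
    };
    for (std::size_t i = 0; i < depth; ++i) {
        if (i < bids.size()) append(bids[i]);
        if (i < asks.size()) append(asks[i]);
    }
    return out;
}

// True when the locally built book matches the exchange's checksum.
inline bool bookMatches(const std::vector<Level>& bids,
                        const std::vector<Level>& asks,
                        std::size_t depth, uint32_t expected) {
    return crc32(checksumInput(bids, asks, depth)) == expected;
}
```

Keeping the original price and quantity strings matters here: a checksum over re-formatted numbers will not match if the exchange sent trailing zeros or a different precision.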

What We Ship to Users

The API is intentionally simple. Connect via WebSocket, send a subscribe message:

{
  "action": "subscribe",
  "exchange": "binance",
  "symbol": "BTC/USDT",
  "depth": 25
}

You get back a full snapshot, then a stream of incremental updates. All exchanges use the same message format — no need to learn 21 different APIs.

No API key required. No rate limits on the WebSocket stream. We want this to be the easiest way to get institutional-grade market data without paying institutional-grade prices (it's free).

Lessons Learned

Shared memory is underrated. We initially tried passing data between the feed handlers and the gateway over Unix sockets. Switching to mmap-backed ring buffers cut our internal latency by 10x and eliminated a whole class of backpressure issues.

Exchange WebSocket connections are fragile. We've seen Binance silently stop sending updates without closing the connection. We now have heartbeat monitors on every feed that force a reconnect if no message arrives within 2x the expected interval.
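
A monitor like this needs very little code. The sketch below is one possible shape, with a hypothetical FeedWatchdog class that takes the current time as a parameter so it can be tested without sleeping. It uses a monotonic clock so that wall-clock adjustments cannot cause a spurious reconnect.

```cpp
#include <chrono>

// Flags a feed as stale when no message has arrived within twice the
// expected interval between messages.
class FeedWatchdog {
public:
    using Clock = std::chrono::steady_clock;

    explicit FeedWatchdog(Clock::duration expectedInterval,
                          Clock::time_point now = Clock::now())
        : timeout_(2 * expectedInterval), lastMessage_(now) {}

    // Call for every frame received from the exchange.
    void onMessage(Clock::time_point now = Clock::now()) {
        lastMessage_ = now;
    }

    // Poll from the main loop; true means tear down and reconnect.
    bool shouldReconnect(Clock::time_point now = Clock::now()) const {
        return now - lastMessage_ > timeout_;
    }

private:
    Clock::duration timeout_;
    Clock::time_point lastMessage_;
};
```

A handler would call onMessage() on each received frame and check shouldReconnect() once per loop iteration. The expected interval has to be chosen per feed, since a quiet symbol on a small exchange can legitimately go silent for longer than a busy one.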

Don't trust exchange timestamps. Some exchanges report timestamps in seconds, some in milliseconds, some with timezone offsets, some without. We stamp everything with our own receive time and treat exchange timestamps as advisory.

Try It

The API is live now at microversesystems.com. The docs have code samples for Python, Node, and Rust. There's also a live dashboard where you can see the order books updating in real time.

If you're building anything that needs crypto market data — trading bots, analytics dashboards, academic research — give it a shot. We'd love feedback.

Built by Microverse Systems. Questions? Drop a comment or open an issue on our GitHub.