DEV Community

BaldQuant

How to capture gap-free L2 order book data from Binance

Most homemade order book recorders are subtly wrong. They work fine in a terminal demo, produce files that open in pandas, and then quietly hand you garbage data — crossed books, missing updates, phantom price levels — that only shows up when a backtest produces an edge that evaporates live.

This post explains the failure modes, why they happen, and the protocol that prevents them. At the end I'll show the open-source tool I built that implements all of it.


Why order book capture is harder than it looks

Binance doesn't give you a live order book. It gives you two things:

  1. A REST snapshot — a point-in-time full book you fetch on demand
  2. A WebSocket diff stream — a sequence of incremental updates

Your job is to merge them into a coherent, continuously-updated book. That merge is where every homemade implementation goes wrong.

The common failures

Connecting to the diff stream after fetching the snapshot. If you fetch the snapshot first, then subscribe to diffs, you've already missed the updates that happened between the two. The gap is silent — the book just drifts wrong from the start.

Not buffering diffs during the snapshot fetch. The snapshot fetch takes 50–200ms over the network. You need to subscribe to the diff stream first, buffer every event that arrives while the snapshot is in flight, then replay the buffer. If you don't buffer, you drop updates.

Ignoring the sequence IDs. Every diff event has a U field (the first update ID it covers) and a u field (the last). The snapshot has a lastUpdateId. Only a diff whose U..u range brackets the snapshot's position is a valid seed; Step 4 below gives the exact condition. After the seed, every event must chain onto the one before it. If one doesn't, you're looking at missing data and your book is wrong.

Logging gaps instead of halting. A sequence break should be fatal. A book that's missing updates is not a book with a warning attached — it's a bad book. Writing it to disk with a log entry is worse than not writing it at all, because you won't see the warning in a year when you're building a model on the data.


The correct protocol for Binance USDT-M Futures

Binance documents a six-step process. Here it is, with the parts they underemphasize:

Step 1: Subscribe to the diff stream first

ws = connect("wss://fstream.binance.com/stream?streams=btcusdt@depth@100ms")

Start collecting events immediately. Don't wait for the snapshot. Don't process them yet — just buffer them.
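
A minimal sketch of the buffering step, assuming the combined-stream payload shape `{"stream": ..., "data": {...}}` that the `/stream?streams=` endpoint delivers. `DiffBuffer` and `on_message` are hypothetical names, and the WebSocket client that would call `on_message` is left out:

```python
import json
from collections import deque


class DiffBuffer:
    """Holds raw diff events until the snapshot has arrived (hypothetical helper)."""

    def __init__(self) -> None:
        self.events: deque = deque()

    def on_message(self, raw: str) -> None:
        msg = json.loads(raw)
        # Combined streams wrap the event in {"stream": ..., "data": {...}};
        # fall back to the message itself for a raw single-stream connection.
        event = msg.get("data", msg)
        if event.get("e") == "depthUpdate":
            self.events.append(event)


buf = DiffBuffer()
buf.on_message(
    '{"stream": "btcusdt@depth@100ms", '
    '"data": {"e": "depthUpdate", "U": 10, "u": 15, "pu": 9, "b": [], "a": []}}'
)
buf.on_message('{"result": null, "id": 1}')  # not a depth event, so it is ignored
print(len(buf.events))
```

Nothing is applied to a book here. The buffer only preserves arrival order so the later steps can replay it.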

Step 2: Fetch the REST snapshot

snapshot = GET("https://fapi.binance.com/fapi/v1/depth?symbol=BTCUSDT&limit=1000")
# snapshot["lastUpdateId"] is your seed

While this request is in flight, your WebSocket buffer is filling up with diffs. That's correct.
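
Once the response arrives, it has to become a structure the diffs can be applied to. The sketch below assumes the response carries `lastUpdateId` plus `bids` and `asks` as lists of `[price, qty]` string pairs, and uses `Decimal` so prices work as exact dictionary keys. `book_from_snapshot` is a hypothetical helper and the HTTP call itself is omitted:

```python
from decimal import Decimal


def book_from_snapshot(snapshot: dict) -> dict:
    """Build {"bids": {price: qty}, "asks": {price: qty}, "last_u": id}."""
    return {
        "bids": {Decimal(p): Decimal(q) for p, q in snapshot["bids"]},
        "asks": {Decimal(p): Decimal(q) for p, q in snapshot["asks"]},
        "last_u": snapshot["lastUpdateId"],
    }


# Hand-written sample in the shape of a depth snapshot (values are made up).
sample = {
    "lastUpdateId": 1027024,
    "bids": [["64000.10", "1.500"], ["64000.00", "0.250"]],
    "asks": [["64000.20", "2.000"]],
}
book = book_from_snapshot(sample)
print(book["last_u"], max(book["bids"]), min(book["asks"]))
```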

Step 3: Discard stale buffered events

Any buffered diff where u < lastUpdateId is older than your snapshot. Discard it.
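
As a sketch, the discard rule is a single filter over the buffer. `drop_stale` is a hypothetical name and the events are trimmed to the two fields that matter here:

```python
def drop_stale(buffered: list, last_update_id: int) -> list:
    """Keep only events whose final update ID is not older than the snapshot."""
    return [ev for ev in buffered if ev["u"] >= last_update_id]


buffered = [{"U": 90, "u": 95}, {"U": 96, "u": 100}, {"U": 101, "u": 108}]
fresh = drop_stale(buffered, last_update_id=100)
print(fresh)  # the first event (u=95) is gone
```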

Step 4: Find the first applicable diff

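You need the first buffered event where:

U <= lastUpdateId <= u

This is the first diff whose range covers the snapshot's last update. Binance's futures documentation states the test against lastUpdateId itself; the lastUpdateId + 1 form you see in many tutorials is the spot rule, and on futures it can reject every event when one diff ends exactly at lastUpdateId. If no buffered event satisfies the condition, your buffer does not reach back to the snapshot: drop everything and restart from Step 1.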
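
The search for the seed event can be sketched as a linear scan over the buffer. `find_seed_index` is a hypothetical helper, and `offset` is a hypothetical parameter that selects which form of the condition to test: 0 checks `U <= lastUpdateId <= u`, 1 checks `U <= lastUpdateId + 1 <= u`.

```python
from typing import Optional


def find_seed_index(buffered: list, last_update_id: int, offset: int) -> Optional[int]:
    """Return the index of the first event that brackets the snapshot, or None."""
    target = last_update_id + offset
    for i, ev in enumerate(buffered):
        if ev["U"] <= target <= ev["u"]:
            return i
    return None  # caller should restart from Step 1


buffered = [{"U": 96, "u": 100}, {"U": 104, "u": 108}]
print(find_seed_index(buffered, last_update_id=106, offset=0))
print(find_seed_index(buffered, last_update_id=200, offset=0))
```

Returning `None` rather than raising keeps the restart decision with the caller, which is also where the buffer gets cleared.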

Step 5 (futures-specific): Verify the pu field

This is where futures differs from spot, and where most implementations copied from spot tutorials fail.

On USDT-M futures, every diff event has a pu field: the u value of the previous event. For every event after the first:

event["pu"] == previous_event["u"]

If this breaks, you have a gap — a missed event — and the book is corrupted. Halt and resync.

On spot, the equivalent check is U == last_u + 1. On futures, use pu == last_u. Don't mix them up.
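
A minimal sketch of both continuity checks side by side, written so that a break raises instead of logging. `SequenceGap` and `next_last_u` are hypothetical names and the event values are made up:

```python
class SequenceGap(Exception):
    """Raised when an event does not chain onto the previous one."""


def next_last_u(last_u: int, event: dict, futures: bool = True) -> int:
    """Return the new last_u, or raise SequenceGap if the chain is broken."""
    if futures:
        ok = event["pu"] == last_u      # futures: pu must equal the previous u
    else:
        ok = event["U"] == last_u + 1   # spot: U must directly follow the previous u
    if not ok:
        raise SequenceGap(f"expected an event chaining onto u={last_u}, got {event}")
    return event["u"]


last_u = 100  # u of the seed event from Step 4 (made-up value)
stream = [
    {"U": 104, "u": 108, "pu": 100},
    {"U": 110, "u": 115, "pu": 108},
    {"U": 130, "u": 131, "pu": 120},  # pu skips 115: an event was lost
]
applied = 0
try:
    for ev in stream:
        last_u = next_last_u(last_u, ev)
        applied += 1
except SequenceGap as exc:
    print(f"halt and resync after {applied} events: {exc}")
```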

Step 6: Apply diffs and maintain the book

For each diff event, update your price levels. Quantities in a diff are absolute values for that price level, not deltas:

  • Qty > 0: set level
  • Qty == 0: remove level

After every update, check for a crossed book: if best bid >= best ask, something is wrong. Halt.
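
The update rule and the crossed-book check fit in one small function. This is a sketch under two assumptions: diff events carry bids under `"b"` and asks under `"a"` as `[price, qty]` string pairs, and the book is a pair of `Decimal`-keyed dicts. `apply_diff` and `CrossedBook` are hypothetical names:

```python
from decimal import Decimal


class CrossedBook(Exception):
    """Raised when best bid >= best ask after an update."""


def apply_diff(book: dict, event: dict) -> None:
    for side, key in (("bids", "b"), ("asks", "a")):
        levels = book[side]
        for price, qty in event[key]:
            p, q = Decimal(price), Decimal(qty)
            if q == 0:
                levels.pop(p, None)  # removing an absent level is not an error
            else:
                levels[p] = q        # overwrite: the quantity is absolute
    if book["bids"] and book["asks"] and max(book["bids"]) >= min(book["asks"]):
        raise CrossedBook(f"bid {max(book['bids'])} >= ask {min(book['asks'])}")


book = {"bids": {Decimal("100.0"): Decimal("2")}, "asks": {Decimal("100.5"): Decimal("1")}}
apply_diff(book, {"b": [["100.0", "0"], ["100.2", "3"]], "a": [["100.5", "4"]]})
print(book)

try:
    apply_diff(book, {"b": [["100.6", "1"]], "a": []})
    crossed = False
except CrossedBook as exc:
    crossed = True
    print(f"halt: {exc}")
```

Taking `max` and `min` over the dict keys on every update is linear in the number of levels. That is fine for a sketch, though a sorted structure would be the usual choice for a long-running recorder.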


Invariants that must halt capture (not log-and-continue)

These are not warnings. If any of these fire, you stop writing data and resync:

Invariant                    What it catches
---------------------------  ----------------------------------------------
pu != last_u                 Missed diff event: book has a hole
best_bid >= best_ask         Crossed book: merge logic is wrong
Out-of-order u               Stale or duplicate event
Clock skew > 1s              Local timestamps are unreliable
Stream silence > threshold   Dead connection that didn't disconnect cleanly

The key point: log-and-continue produces a file that looks valid. Halt-and-resync produces a gap you can see. A visible gap is always better than invisible corruption.
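
The two timing invariants can be sketched as a pure function over timestamps. `timing_violations` is a hypothetical helper and the 5-second silence threshold is an assumed value, since the post does not give one. Note that the difference between local receive time and exchange event time also includes network latency, so it is only a proxy for clock skew:

```python
def timing_violations(
    event_time_ms: int,
    received_at_ms: int,
    prev_received_at_ms: int,
    max_skew_ms: int = 1_000,     # "Clock skew > 1s"
    max_silence_ms: int = 5_000,  # assumed threshold, not from the post
) -> list:
    """Return the names of the timing invariants this event violates."""
    problems = []
    if abs(received_at_ms - event_time_ms) > max_skew_ms:
        problems.append("clock_skew")
    if received_at_ms - prev_received_at_ms > max_silence_ms:
        problems.append("stream_silence")
    return problems


print(timing_violations(1_000, 1_200, 1_100))    # healthy event
print(timing_violations(1_000, 3_500, 3_400))    # local clock is 2.5 s off
print(timing_violations(10_000, 10_050, 1_000))  # 9 s since the previous event
```

Checking silence only when an event arrives cannot catch a connection that never delivers another event. A real recorder would also run this comparison on a timer.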


What clean data looks like

After implementing this correctly, you get three streams per symbol, written to Parquet, partitioned by date:

books — full L2 snapshot at every diff event (~100ms cadence)

Field                     Notes
------------------------  -----------------------------------------------------
timestamp_ms              Exchange event time
received_at_ms            Local receive time; received_at_ms − timestamp_ms is your capture latency
update_id                 Sequence ID for gap verification
microprice                (bid_qty × ask + ask_qty × bid) / (bid_qty + ask_qty)
imbalance                 bid_qty / (bid_qty + ask_qty) at best level
mid, spread               Convenience columns
bid_price_N, bid_qty_N    Full ladder, N levels per side

trades — aggregated trade events with taker_sign (+1 taker bought, −1 taker sold)

mark_price — Binance mark price, index price, and next funding rate at 1-second intervals

Having all three lets you correlate order flow imbalance with trade aggression and funding dynamics — the combination that most signal research requires.
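
As a worked example of the microprice and imbalance formulas in the books table, here is a small sketch over the best bid and ask. `top_of_book_features` is a hypothetical helper, and the definitions of `mid` (average of best bid and ask) and `spread` (ask minus bid) are the conventional ones, assumed rather than taken from the tool:

```python
def top_of_book_features(bid: float, bid_qty: float, ask: float, ask_qty: float) -> dict:
    total = bid_qty + ask_qty
    return {
        "mid": (bid + ask) / 2,
        "spread": ask - bid,
        # Each price is weighted by the opposite side's quantity, so a heavy
        # bid pulls the microprice toward the ask.
        "microprice": (bid_qty * ask + ask_qty * bid) / total,
        "imbalance": bid_qty / total,
    }


f = top_of_book_features(bid=100.0, bid_qty=3.0, ask=101.0, ask_qty=1.0)
print(f)  # microprice 100.75, imbalance 0.75, mid 100.5, spread 1.0
```

With three times as much size on the bid, the microprice sits at 100.75, above the 100.5 mid.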


The tool

I built binance-l2-capture to implement exactly this protocol. It runs on Python 3.11+ and is self-hosted: you bring your own API key, and the data never leaves your machine.

git clone https://github.com/Balleing/binance-l2-capture.git
cd binance-l2-capture
pip install -e .
cp .env.example .env   # add BINANCE_API_KEY
l2cap run              # data starts landing in ./data/

It implements the full six-step merge protocol, checks pu continuity on every event, halts on invariant violations rather than logging them, and auto-resyncs after a gap. On a $6/month VPS it captures two symbols continuously with no intervention.

The code is MIT-licensed and the core will stay free. If you're running more symbols or want a monitoring dashboard, I'm building a Pro tier; star the repo to follow along.


The one-line test for your existing capture

If you have an existing order book recorder, run this against a day of data:

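import polars as pl

df = pl.scan_parquet("data/BTCUSDT/books/2024-06-01/*.parquet").collect()
# Consecutive u values are not contiguous, so test ordering rather than diff == 1.
bad = (df["update_id"].diff().drop_nulls() <= 0).sum()
print(f"Duplicate or out-of-order update IDs: {bad}")
if "pu" in df.columns:  # assumption: your recorder also stored each event's pu
    gaps = (df["pu"].slice(1) != df["update_id"].shift(1).slice(1)).sum()
    print(f"Sequence gaps: {gaps}")

If either number is above 0, your book has holes or replayed events. Note that update_id alone cannot prove continuity: each diff covers a range of IDs, so consecutive u values legitimately differ by more than 1. If both checks pass but you weren't checking pu at capture time, re-read Step 5.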


Questions, corrections, or "my backtest still blows up" — I'm @BaldQuant on X. The repo issues tab works too.
