BaldQuant

Posted on Jun 2

How to capture gap-free L2 order book data from Binance

#python #algotrading #datascience #opensource

Most homemade order book recorders are subtly wrong. They work fine in a terminal demo, produce files that open in pandas, and then quietly hand you garbage data — crossed books, missing updates, phantom price levels — that only shows up when a backtest produces an edge that evaporates live.

This post explains the failure modes, why they happen, and the protocol that prevents them. At the end I'll show the open-source tool I built that implements all of it.

Why order book capture is harder than it looks

Binance doesn't give you a live order book. It gives you two things:

A REST snapshot — a point-in-time full book you fetch on demand
A WebSocket diff stream — a sequence of incremental updates

Your job is to merge them into a coherent, continuously-updated book. That merge is where every homemade implementation goes wrong.

The common failures

Connecting to the diff stream after fetching the snapshot. If you fetch the snapshot first, then subscribe to diffs, you've already missed the updates that happened between the two. The gap is silent — the book just drifts wrong from the start.

Not buffering diffs during the snapshot fetch. The snapshot fetch takes 50–200ms over the network. You need to subscribe to the diff stream first, buffer every event that arrives while the snapshot is in flight, then replay the buffer. If you don't buffer, you drop updates.

Ignoring the sequence IDs. Every diff event has a U field (the first update ID it covers) and a u field (the last). The snapshot has a lastUpdateId. Only a diff whose range U..u straddles the snapshot's lastUpdateId is a valid seed. After that, every event must chain directly onto the one before it; if it doesn't, an event went missing and your book is wrong.

Logging gaps instead of halting. A sequence break should be fatal. A book that's missing updates is not a book with a warning attached — it's a bad book. Writing it to disk with a log entry is worse than not writing it at all, because you won't see the warning in a year when you're building a model on the data.

The correct protocol for Binance USDT-M Futures

Binance documents a six-step process. Here it is, with the parts they underemphasize:

Step 1: Subscribe to the diff stream first

ws = connect("wss://fstream.binance.com/stream?streams=btcusdt@depth@100ms")

Start collecting events immediately. Don't wait for the snapshot. Don't process them yet — just buffer them.
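As a rough sketch of what "just buffer them" can look like, here is a small holding pen for raw events with no network code attached. `DiffBuffer`, `on_message`, and `drain` are names invented for illustration, and the size cap is an arbitrary assumption; `on_message` would be called from whatever WebSocket client you use.

```python
from collections import deque


class DiffBuffer:
    """Holds diff events in arrival order until the snapshot has been fetched."""

    def __init__(self, max_events: int = 10_000) -> None:
        # Bounded so a stalled snapshot fetch cannot exhaust memory.
        self._events: deque = deque(maxlen=max_events)
        self.overflowed = False

    def on_message(self, event: dict) -> None:
        # Called from the WebSocket read loop for every diff event.
        if len(self._events) == self._events.maxlen:
            # The oldest event is about to be evicted: the buffer is no longer complete.
            self.overflowed = True
        self._events.append(event)

    def drain(self) -> list:
        # Hand over everything buffered so far, oldest first, and reset.
        events = list(self._events)
        self._events.clear()
        return events
```

If `overflowed` is set by the time the snapshot arrives, the buffer no longer holds every event, so the safe move is to restart from Step 1.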

Step 2: Fetch the REST snapshot

import json, urllib.request
url = "https://fapi.binance.com/fapi/v1/depth?symbol=BTCUSDT&limit=1000"
snapshot = json.load(urllib.request.urlopen(url, timeout=10))
# snapshot["lastUpdateId"] is your seed

While this request is in flight, your WebSocket buffer is filling up with diffs. That's correct.
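To make the seed concrete, here is a minimal sketch that turns a snapshot body into an in-memory book. It assumes the response shape Binance documents for the depth endpoint (`lastUpdateId`, plus `bids` and `asks` as lists of `[price, qty]` strings). `Book` and `parse_snapshot` are hypothetical names, and using `Decimal` keys is a design choice of this sketch, not something the protocol requires.

```python
import json
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass
class Book:
    last_update_id: int
    bids: dict = field(default_factory=dict)  # price -> quantity
    asks: dict = field(default_factory=dict)  # price -> quantity


def parse_snapshot(body: str) -> Book:
    """Turn the JSON body of a depth snapshot into a price -> quantity book."""
    raw = json.loads(body)
    book = Book(last_update_id=int(raw["lastUpdateId"]))
    # Prices and quantities arrive as strings. Decimal keeps them exact, so
    # "67000.10" in the snapshot and "67000.1" in a later diff map to the same key.
    for price, qty in raw["bids"]:
        book.bids[Decimal(price)] = Decimal(qty)
    for price, qty in raw["asks"]:
        book.asks[Decimal(price)] = Decimal(qty)
    return book
```

Float keys would also work most of the time, but exact decimal keys remove one way for a level to be set under one key and deleted under another.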

Step 3: Discard stale buffered events

Any buffered diff where u < lastUpdateId is older than your snapshot. Discard it.

Step 4: Find the first applicable diff

You need the first buffered event where:

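U <= lastUpdateId <= u

This is the first diff whose range covers the snapshot's last update, so applying it picks up where the snapshot left off. The `lastUpdateId + 1` form common in spot tutorials is the spot rule; Binance's futures documentation states the bound without the +1. If no buffered event satisfies this, your buffer window was too short, so drop everything and restart from Step 1.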
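Steps 3 and 4 together fit in one small function. The sketch below uses the seed bound as Binance's USDT-M futures documentation states it, `U <= lastUpdateId <= u`. `ResyncRequired` and `events_to_replay` are hypothetical names, and events are assumed to be plain dicts carrying the `U` and `u` fields.

```python
class ResyncRequired(Exception):
    """Raised when the buffer cannot be joined to the snapshot."""


def events_to_replay(buffered: list, last_update_id: int) -> list:
    """Return the buffered events to apply on top of the snapshot, in order."""
    # Step 3: anything that ends before the snapshot is already reflected in it.
    fresh = [ev for ev in buffered if ev["u"] >= last_update_id]
    if not fresh:
        raise ResyncRequired("no buffered event reaches the snapshot")
    # Step 4: the first surviving event must straddle the snapshot's ID.
    first = fresh[0]
    if not (first["U"] <= last_update_id <= first["u"]):
        raise ResyncRequired(
            f"first event covers {first['U']}..{first['u']}, "
            f"snapshot is at {last_update_id}"
        )
    return fresh
```

Raising instead of returning an empty list keeps the "restart from Step 1" path explicit: the caller cannot accidentally carry on with a book that was never seeded.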

Step 5 (futures-specific): Verify the `pu` field

This is where futures differs from spot, and where most implementations copied from spot tutorials fail.

On USDT-M futures, every diff event has a pu field: the u value of the previous event. For every event after the first:

event["pu"] == previous_event["u"]

If this breaks, you have a gap — a missed event — and the book is corrupted. Halt and resync.

On spot, the equivalent check is U == last_u + 1. On futures, use pu == last_u. Don't mix them up.
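The futures check is small enough to keep as a stateful guard in front of the book. This is a minimal sketch: `ChainChecker` and `SequenceGap` are illustrative names, and it assumes each event is a dict with the `pu` and `u` fields described above.

```python
from typing import Optional


class SequenceGap(Exception):
    """A diff event does not chain onto the one before it."""


class ChainChecker:
    """Verifies pu == previous u for every event after the seed."""

    def __init__(self) -> None:
        self.last_u: Optional[int] = None

    def accept(self, event: dict) -> None:
        # The seed event has no predecessor, so only later events are compared.
        if self.last_u is not None and event["pu"] != self.last_u:
            raise SequenceGap(f"expected pu={self.last_u}, got pu={event['pu']}")
        # Advance only after the event has passed the check.
        self.last_u = event["u"]
```

Call `accept` before applying an event to the book, so a broken chain is caught before it can alter any price level.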

Step 6: Apply diffs and maintain the book

For each diff event, update your price levels:

Qty > 0: set level
Qty == 0: remove level

After every update, check for a crossed book: if best bid >= best ask, something is wrong. Halt.
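Here is one way the update rule and the crossed-book check could look. It assumes the depth event carries bid updates under `"b"` and ask updates under `"a"`, each a list of `[price, qty]` strings, and that each side of the book is a plain dict keyed by `Decimal` price. `apply_side`, `apply_diff`, and `CrossedBook` are hypothetical names.

```python
from decimal import Decimal


class CrossedBook(Exception):
    """Best bid is at or above best ask after applying an update."""


def apply_side(levels: dict, updates: list) -> None:
    """Apply [price, qty] pairs to one side of the book, in place."""
    for price, qty in updates:
        p, q = Decimal(price), Decimal(qty)
        if q == 0:
            # Removing a level we never had is normal; pop() with a default ignores it.
            levels.pop(p, None)
        else:
            # Quantities are absolute, not deltas: overwrite, never add.
            levels[p] = q


def apply_diff(bids: dict, asks: dict, event: dict) -> None:
    """Apply one depth event to both sides, then check the book is not crossed."""
    apply_side(bids, event["b"])
    apply_side(asks, event["a"])
    if bids and asks and max(bids) >= min(asks):
        raise CrossedBook(f"best bid {max(bids)} >= best ask {min(asks)}")
```

`max()` and `min()` scan every level, which is fine for a sketch; a sorted container would be the usual choice for a long-running recorder.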

Invariants that must halt capture (not log-and-continue)

These are not warnings. If any of these fire, you stop writing data and resync:

| Invariant | What it catches |
|---|---|
| `pu != last_u` | Missed diff event: book has a hole |
| `best_bid >= best_ask` | Crossed book: merge logic is wrong |
| Out-of-order `u` | Stale or duplicate event |
| Clock skew > 1s | Local timestamps are unreliable |
| Stream silence > threshold | Dead connection that didn't disconnect cleanly |

The key point: log-and-continue produces a file that looks valid. Halt-and-resync produces a gap you can see. A visible gap is always better than invisible corruption.
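The control flow for halt-and-resync can be sketched in a few lines. Everything here is an assumption made for illustration: `start_session` stands in for Steps 1 to 6 and yields rows ready to write, `sink` stands in for the Parquet writer, and the explicit gap-marker row is just one possible way to make the hole visible.

```python
from typing import Callable, Iterator


class InvariantViolation(Exception):
    """Any condition that means the in-memory book can no longer be trusted."""


def capture(
    start_session: Callable[[], Iterator[dict]],
    sink: list,
    max_resyncs: int,
) -> int:
    """Write rows until an invariant fires, then resync; return resyncs used."""
    resyncs = 0
    while True:
        try:
            for row in start_session():
                sink.append(row)
            return resyncs  # the session ended without a violation
        except InvariantViolation as exc:
            # Stop writing book rows at once and record the hole explicitly.
            sink.append({"gap": True, "reason": str(exc)})
            resyncs += 1
            if resyncs > max_resyncs:
                raise  # repeated failures need a human, not another retry
```

The important property is that no row from the broken session is written after the violation: the next book row in the output comes from a fresh snapshot.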

What clean data looks like

After implementing this correctly, you get three streams per symbol, written to Parquet, partitioned by date:

books — full L2 snapshot at every diff event (~100ms cadence)

| Field | Notes |
|---|---|
| `timestamp_ms` | Exchange event time |
| `received_at_ms` | Local receive time; `received_at_ms − timestamp_ms` is your capture latency |
| `update_id` | Sequence ID for gap verification |
| `microprice` | `(bid_qty × ask_price + ask_qty × bid_price) / (bid_qty + ask_qty)` at best level |
| `imbalance` | `bid_qty / (bid_qty + ask_qty)` at best level |
| `mid`, `spread` | Convenience columns |
| `bid_price_N`, `bid_qty_N` | Full ladder, N levels per side |
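The derived columns follow directly from the best level on each side. A small sketch, with `top_of_book_features` as a hypothetical helper; `mid` and `spread` are assumed to use the usual definitions (average of best bid and ask, and ask minus bid), since the table does not spell them out.

```python
from decimal import Decimal


def top_of_book_features(
    bid: Decimal, bid_qty: Decimal, ask: Decimal, ask_qty: Decimal
) -> dict:
    """Derive mid, spread, microprice and imbalance from the best level on each side."""
    total = bid_qty + ask_qty
    return {
        "mid": (bid + ask) / 2,
        "spread": ask - bid,
        # Each price is weighted by the opposite side's size, so the microprice
        # leans toward the ask when the bid queue is larger.
        "microprice": (bid_qty * ask + ask_qty * bid) / total,
        "imbalance": bid_qty / total,
    }
```

For example, a best bid of 100 with quantity 3 against a best ask of 101 with quantity 1 gives a mid of 100.5, a microprice of 100.75, and an imbalance of 0.75.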

trades — aggregated trade events with taker_sign (+1 taker bought, −1 taker sold)

mark_price — Binance mark price, index price, and next funding rate at 1-second intervals

Having all three lets you correlate order flow imbalance with trade aggression and funding dynamics — the combination that most signal research requires.
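For the `taker_sign` column in the trades stream, one plausible derivation is from the `m` flag on Binance aggregated trade events, which is true when the buyer was the maker. This sketch assumes that is the source of the column; `taker_sign` as a function name is illustrative.

```python
def taker_sign(agg_trade: dict) -> int:
    """Map an aggregated trade event to +1 (taker bought) or -1 (taker sold)."""
    # "m" is True when the buyer was the maker, which means the taker was the seller.
    return -1 if agg_trade["m"] else 1
```

The sign is easy to get backwards, because the flag describes the maker side while the column describes the taker side.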

The tool

I built binance-l2-capture to implement exactly this protocol. It runs on Python 3.11+, self-hosted, bring your own API key. The data never leaves your machine.

git clone https://github.com/Balleing/binance-l2-capture.git
cd binance-l2-capture
pip install -e .
cp .env.example .env   # add BINANCE_API_KEY
l2cap run              # data starts landing in ./data/

It implements the full six-step merge protocol, checks pu continuity on every event, halts on invariant violations rather than logging them, and auto-resyncs after a gap. On a $6/month VPS it captures two symbols continuously with no intervention.

The code is MIT-licensed, and the core will stay free. If you're running more symbols or want a monitoring dashboard, I'm building a Pro tier; star the repo to follow along.

The one-line test for your existing capture

If you have an existing order book recorder, run this against a day of data:

import polars as pl

df = pl.scan_parquet("data/BTCUSDT/books/2024-06-01/*.parquet").collect()
gaps = (df["update_id"].diff().drop_nulls() != 1).sum()
print(f"Sequence gaps: {gaps}")

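If gaps > 0, your book has holes. This check assumes update_id advances by exactly 1 per row. Raw Binance diff events each cover a range of update IDs, so if your recorder stores the event's u directly, differences greater than 1 are expected; on futures, store pu alongside u and count the rows where pu differs from the previous row's u instead. If it prints 0 but you weren't checking pu, re-read Step 5.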

Questions, corrections, or "my backtest still blows up" — I'm @BaldQuant on X. The repo issues tab works too.

Top comments (1)

Oliver Zehentleitner • Jun 11

Really good explanation of the snapshot/buffer/diff protocol, especially the distinction between Spot sequence checks and Futures pu.

I recently investigated a related failure mode that appears after all of this is working correctly: a local Binance order book can remain gap-free, pass sequence validation, and still accumulate stale price levels over time if retention is not explicitly bounded.

In a 25-hour BTCUSDT test, a naive cache grew to 20,758 bid levels, while only 24.09% of them still matched the REST snapshot. The issue was not a missed diff, but ghost levels outside the actively maintained depth corridor.

I wrote up the experiment and the pruning/resync implications here:

dev.to/oliverzehentleitner/your-bi...

Your article explains how to prevent gaps; mine covers why gap-free alone is not a sufficient long-running correctness guarantee. They complement each other nicely.
