<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: BaldQuant</title>
    <description>The latest articles on DEV Community by BaldQuant (@baldquant).</description>
    <link>https://dev.to/baldquant</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3963220%2F42501970-0dc8-4272-bad5-289f2252723a.png</url>
      <title>DEV Community: BaldQuant</title>
      <link>https://dev.to/baldquant</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/baldquant"/>
    <language>en</language>
    <item>
      <title>How to capture gap-free L2 order book data from Binance</title>
      <dc:creator>BaldQuant</dc:creator>
      <pubDate>Tue, 02 Jun 2026 13:16:47 +0000</pubDate>
      <link>https://dev.to/baldquant/how-to-capture-gap-free-l2-order-book-data-from-binance-3a70</link>
      <guid>https://dev.to/baldquant/how-to-capture-gap-free-l2-order-book-data-from-binance-3a70</guid>
      <description>&lt;h1&gt;
  
  
  How to capture gap-free L2 order book data from Binance
&lt;/h1&gt;

&lt;p&gt;Most homemade order book recorders are subtly wrong. They work fine in a terminal demo, produce files that open in pandas, and then quietly hand you garbage data — crossed books, missing updates, phantom price levels — that only shows up when a backtest produces an edge that evaporates live.&lt;/p&gt;

&lt;p&gt;This post explains the failure modes, why they happen, and the protocol that prevents them. At the end I'll show the open-source tool I built that implements all of it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why order book capture is harder than it looks
&lt;/h2&gt;

&lt;p&gt;Binance doesn't give you a live order book. It gives you two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A &lt;strong&gt;REST snapshot&lt;/strong&gt; — a point-in-time full book you fetch on demand&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;WebSocket diff stream&lt;/strong&gt; — a sequence of incremental updates&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Your job is to merge them into a coherent, continuously updated book. That merge is where most homemade implementations go wrong.&lt;/p&gt;

&lt;h3&gt;
  
  
  The common failures
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Connecting to the diff stream after fetching the snapshot.&lt;/strong&gt; If you fetch the snapshot first, then subscribe to diffs, you've already missed the updates that happened between the two. The gap is silent — the book just drifts wrong from the start.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not buffering diffs during the snapshot fetch.&lt;/strong&gt; The snapshot fetch takes 50–200ms over the network. You need to subscribe to the diff stream &lt;em&gt;first&lt;/em&gt;, buffer every event that arrives while the snapshot is in flight, then replay the buffer. If you don't buffer, you drop updates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ignoring the sequence ID.&lt;/strong&gt; Every diff event has a &lt;code&gt;U&lt;/code&gt; field (the first update ID it covers) and a &lt;code&gt;u&lt;/code&gt; field (the last). The snapshot has a &lt;code&gt;lastUpdateId&lt;/code&gt;. Only a diff where &lt;code&gt;U &amp;lt;= lastUpdateId + 1 &amp;lt;= u&lt;/code&gt; is a valid seed, meaning the first event you apply on top of the snapshot. After that, every event must chain directly onto the previous event's &lt;code&gt;u&lt;/code&gt;. If it doesn't, updates are missing and your book is wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Logging gaps instead of halting.&lt;/strong&gt; A sequence break should be fatal. A book that's missing updates is not a book with a warning attached — it's a bad book. Writing it to disk with a log entry is worse than not writing it at all, because you won't see the warning in a year when you're building a model on the data.&lt;/p&gt;




&lt;h2&gt;
  
  
  The correct protocol for Binance USDT-M Futures
&lt;/h2&gt;

&lt;p&gt;Binance documents a six-step process. Here it is, with the parts they underemphasize:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Subscribe to the diff stream first
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
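&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;websockets.sync.client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;connect&lt;/span&gt;  &lt;span class="c1"&gt;# assumption: the third-party websockets package
&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wss://fstream.binance.com/stream?streams=btcusdt@depth@100ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;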
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start collecting events immediately. Don't wait for the snapshot. Don't process them yet — just buffer them.&lt;/p&gt;
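&lt;p&gt;As a minimal sketch of the buffering side, assuming each diff arrives as an already-decoded &lt;code&gt;dict&lt;/code&gt;: the &lt;code&gt;DiffBuffer&lt;/code&gt; class and its &lt;code&gt;max_events&lt;/code&gt; cap are hypothetical names for illustration, not part of any Binance client.&lt;/p&gt;

```python
from collections import deque


class DiffBuffer:
    """Holds raw diff events until a snapshot is available (hypothetical helper)."""

    def __init__(self, max_events: int = 10_000) -> None:
        # Bounded so a stalled snapshot fetch cannot grow memory without limit.
        self._events = deque(maxlen=max_events)
        self.overflowed = False

    def push(self, event: dict) -> None:
        # A full deque silently evicts its oldest item on append,
        # so record that the buffer can no longer be trusted.
        if len(self._events) == self._events.maxlen:
            self.overflowed = True
        self._events.append(event)

    def drain(self) -> list:
        """Return buffered events in arrival order and empty the buffer."""
        events = list(self._events)
        self._events.clear()
        return events
```

&lt;p&gt;If &lt;code&gt;overflowed&lt;/code&gt; is set by the time the snapshot arrives, the oldest diffs are gone, so restarting is safer than replaying.&lt;/p&gt;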

&lt;h3&gt;
  
  
  Step 2: Fetch the REST snapshot
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
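&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;  &lt;span class="c1"&gt;# assumption: any HTTP client works here
&lt;/span&gt;&lt;span class="n"&gt;snapshot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://fapi.binance.com/fapi/v1/depth?symbol=BTCUSDT&amp;amp;limit=1000&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# snapshot["lastUpdateId"] is your seed
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;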

&lt;/div&gt;



&lt;p&gt;While this request is in flight, your WebSocket buffer is filling up with diffs. That's correct.&lt;/p&gt;
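&lt;p&gt;A small sketch of turning the response into something diffs can be applied to. It assumes the payload is already JSON-decoded and carries &lt;code&gt;lastUpdateId&lt;/code&gt; plus &lt;code&gt;bids&lt;/code&gt; and &lt;code&gt;asks&lt;/code&gt; as lists of string pairs; &lt;code&gt;load_snapshot&lt;/code&gt; and the sample values are made up for illustration.&lt;/p&gt;

```python
from decimal import Decimal


def load_snapshot(payload: dict) -> tuple:
    """Turn a depth snapshot payload into (lastUpdateId, bids, asks).

    Prices and quantities arrive as strings. Decimal keeps them exact, so
    "100.10" and "100.1" land on the same key when a later diff removes a level.
    """
    bids = {Decimal(p): Decimal(q) for p, q in payload["bids"]}
    asks = {Decimal(p): Decimal(q) for p, q in payload["asks"]}
    return payload["lastUpdateId"], bids, asks


# Hand-written sample in the shape of the REST response (values are invented).
sample = {
    "lastUpdateId": 1027024,
    "bids": [["100.10", "2.5"], ["100.00", "1.0"]],
    "asks": [["100.20", "0.7"]],
}
last_update_id, bids, asks = load_snapshot(sample)
print(last_update_id, max(bids), min(asks))
```

&lt;p&gt;Plain dicts keyed by price are enough here because the best bid and ask can be read with &lt;code&gt;max&lt;/code&gt; and &lt;code&gt;min&lt;/code&gt;; a sorted container would be a reasonable swap for deeper books.&lt;/p&gt;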

&lt;h3&gt;
  
  
  Step 3: Discard stale buffered events
&lt;/h3&gt;

&lt;p&gt;Any buffered diff where &lt;code&gt;u &amp;lt; lastUpdateId&lt;/code&gt; is older than your snapshot. Discard it.&lt;/p&gt;
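&lt;p&gt;In code this filter is a one-liner. &lt;code&gt;drop_stale&lt;/code&gt; is a hypothetical helper name, and events are assumed to be decoded dicts carrying the &lt;code&gt;u&lt;/code&gt; field.&lt;/p&gt;

```python
def drop_stale(events: list, last_update_id: int) -> list:
    """Keep only events whose final update ID is not older than the snapshot."""
    # "u" is the last update ID an event covers; anything below the
    # snapshot's lastUpdateId is already reflected in the snapshot.
    return [e for e in events if e["u"] >= last_update_id]
```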

&lt;h3&gt;
  
  
  Step 4: Find the first applicable diff
&lt;/h3&gt;

&lt;p&gt;You need the first buffered event where:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;U &amp;lt;= lastUpdateId + 1 &amp;lt;= u
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the first diff that picks up exactly where the snapshot left off. If no buffered event satisfies this, the buffer and the snapshot don't overlap, so drop everything and restart from Step 1.&lt;/p&gt;
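&lt;p&gt;One way to locate the seed, written directly from the condition above. &lt;code&gt;find_seed_index&lt;/code&gt; is a hypothetical helper; it returns &lt;code&gt;None&lt;/code&gt; when nothing matches so the caller can restart.&lt;/p&gt;

```python
from typing import Optional


def find_seed_index(events: list, last_update_id: int) -> Optional[int]:
    """Index of the first event satisfying the Step 4 condition, else None."""
    target = last_update_id + 1
    for i, event in enumerate(events):
        # Step 4 condition: the event's range [U, u] must contain target.
        if target >= event["U"] and event["u"] >= target:
            return i
    return None  # no overlap: the caller should restart from Step 1
```

&lt;p&gt;Everything before the returned index can be thrown away; replay starts at the seed.&lt;/p&gt;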

&lt;h3&gt;
  
  
  Step 5 (futures-specific): Verify the &lt;code&gt;pu&lt;/code&gt; field
&lt;/h3&gt;

&lt;p&gt;This is where futures differs from spot, and where most implementations copied from spot tutorials fail.&lt;/p&gt;

&lt;p&gt;On USDT-M futures, every diff event has a &lt;code&gt;pu&lt;/code&gt; field: the &lt;code&gt;u&lt;/code&gt; value of the &lt;em&gt;previous&lt;/em&gt; event. For every event after the first:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;pu&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nx"&gt;previous_event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;u&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If this breaks, you have a gap — a missed event — and the book is corrupted. Halt and resync.&lt;/p&gt;

&lt;p&gt;On spot, the equivalent check is &lt;code&gt;U == last_u + 1&lt;/code&gt;. On futures, use &lt;code&gt;pu == last_u&lt;/code&gt;. Don't mix them up.&lt;/p&gt;
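&lt;p&gt;A sketch of the futures check as a tiny stateful object. &lt;code&gt;ContinuityChecker&lt;/code&gt; and &lt;code&gt;SequenceGap&lt;/code&gt; are hypothetical names; the only assumption about the data is that each event dict carries &lt;code&gt;pu&lt;/code&gt; and &lt;code&gt;u&lt;/code&gt;.&lt;/p&gt;

```python
class SequenceGap(Exception):
    """Raised when an event does not chain onto the previous one."""


class ContinuityChecker:
    def __init__(self) -> None:
        self.last_u = None  # no event accepted yet

    def accept(self, event: dict) -> None:
        # The first accepted event (the seed) has no predecessor to compare with.
        if self.last_u is not None and event["pu"] != self.last_u:
            raise SequenceGap(f"expected pu={self.last_u}, got pu={event['pu']}")
        self.last_u = event["u"]
```

&lt;p&gt;Raising instead of returning a flag makes it hard for calling code to carry on by accident after a gap.&lt;/p&gt;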

&lt;h3&gt;
  
  
  Step 6: Apply diffs and maintain the book
&lt;/h3&gt;

&lt;p&gt;For each diff event, update your price levels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Qty &amp;gt; 0: set level&lt;/li&gt;
&lt;li&gt;Qty == 0: remove level&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After every update, check for a crossed book: if best bid &amp;gt;= best ask, something is wrong. Halt.&lt;/p&gt;
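&lt;p&gt;The update rule and the crossed-book check can be sketched as follows. It assumes the book is two dicts keyed by &lt;code&gt;Decimal&lt;/code&gt; price and that each diff carries its bid and ask updates under the keys &lt;code&gt;b&lt;/code&gt; and &lt;code&gt;a&lt;/code&gt; as string pairs; the function and exception names are hypothetical.&lt;/p&gt;

```python
from decimal import Decimal


class CrossedBook(Exception):
    """Raised when the best bid is at or above the best ask."""


def apply_side(levels: dict, updates: list) -> None:
    for price, qty in updates:
        price, qty = Decimal(price), Decimal(qty)
        if qty == 0:
            # pop with a default: removing a level we never had is not an error here.
            levels.pop(price, None)
        else:
            levels[price] = qty  # set the level to the quantity in the event


def apply_diff(bids: dict, asks: dict, event: dict) -> None:
    apply_side(bids, event["b"])
    apply_side(asks, event["a"])
    # Only meaningful when both sides have at least one level.
    if bids and asks and max(bids) >= min(asks):
        raise CrossedBook(f"bid {max(bids)} >= ask {min(asks)}")
```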




&lt;h2&gt;
  
  
  Invariants that must halt capture (not log-and-continue)
&lt;/h2&gt;

&lt;p&gt;These are not warnings. If any of these fire, you stop writing data and resync:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Invariant&lt;/th&gt;
&lt;th&gt;What it catches&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pu != last_u&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Missed diff event — book has a hole&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;best_bid &amp;gt;= best_ask&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Crossed book — merge logic is wrong&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Out-of-order &lt;code&gt;u&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Stale or duplicate event&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Clock skew &amp;gt; 1s&lt;/td&gt;
&lt;td&gt;Local timestamps are unreliable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stream silence &amp;gt; threshold&lt;/td&gt;
&lt;td&gt;Dead connection that didn't disconnect cleanly&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key point: log-and-continue produces a file that looks valid. Halt-and-resync produces a gap you can see. A visible gap is always better than invisible corruption.&lt;/p&gt;
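&lt;p&gt;The last two rows of the table are the easiest to forget, so here is a rough sketch of both as pure functions. &lt;code&gt;CaptureHalt&lt;/code&gt; and the function names are hypothetical; the 1000 ms default comes from the table, while the silence threshold is left as a required argument because no value is given above.&lt;/p&gt;

```python
class CaptureHalt(Exception):
    """Any invariant violation; the caller stops writing and resyncs."""


def check_clock_skew(event_time_ms: int, received_at_ms: int, max_skew_ms: int = 1000) -> None:
    skew = abs(received_at_ms - event_time_ms)
    if skew > max_skew_ms:
        raise CaptureHalt(f"clock skew {skew} ms exceeds {max_skew_ms} ms")


def check_silence(last_event_monotonic: float, now_monotonic: float, max_silence_s: float) -> None:
    # Callers should pass time.monotonic() readings so that a wall-clock
    # adjustment cannot fake or hide a silent period.
    silent_for = now_monotonic - last_event_monotonic
    if silent_for > max_silence_s:
        raise CaptureHalt(f"no events for {silent_for:.1f} s")
```

&lt;p&gt;Sharing one exception type across all invariants keeps the capture loop simple: catch it once, close the current file, and resync.&lt;/p&gt;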




&lt;h2&gt;
  
  
  What clean data looks like
&lt;/h2&gt;

&lt;p&gt;After implementing this correctly, you get three streams per symbol, written to Parquet, partitioned by date:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;books&lt;/strong&gt; — full L2 snapshot at every diff event (~100ms cadence)&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;timestamp_ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Exchange event time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;received_at_ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Local receive time — &lt;code&gt;received_at_ms − timestamp_ms&lt;/code&gt; is your capture latency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;update_id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Sequence ID for gap verification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;microprice&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(bid_qty × ask + ask_qty × bid) / (bid_qty + ask_qty)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;imbalance&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;bid_qty / (bid_qty + ask_qty)&lt;/code&gt; at best level&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;mid&lt;/code&gt;, &lt;code&gt;spread&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Convenience columns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;bid_price_N&lt;/code&gt;, &lt;code&gt;bid_qty_N&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Full ladder, N levels per side&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
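&lt;p&gt;For reference, the derived columns can be computed from the best level on each side like this. The microprice and imbalance formulas are the ones in the table; &lt;code&gt;mid&lt;/code&gt; and &lt;code&gt;spread&lt;/code&gt; use the usual definitions, which is an assumption since the table only calls them convenience columns. The function name is hypothetical.&lt;/p&gt;

```python
def top_of_book_features(bid: float, bid_qty: float, ask: float, ask_qty: float) -> dict:
    """Derived columns from the best bid and ask (both sides assumed non-empty)."""
    total_qty = bid_qty + ask_qty
    return {
        "mid": (bid + ask) / 2,
        "spread": ask - bid,
        # Each price is weighted by the size resting on the opposite side.
        "microprice": (bid_qty * ask + ask_qty * bid) / total_qty,
        "imbalance": bid_qty / total_qty,
    }


features = top_of_book_features(bid=100.0, bid_qty=3.0, ask=101.0, ask_qty=1.0)
print(features)
```

&lt;p&gt;With more size on the bid than on the ask, the microprice in this example sits above the mid, closer to the ask.&lt;/p&gt;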

&lt;p&gt;&lt;strong&gt;trades&lt;/strong&gt; — aggregated trade events with &lt;code&gt;taker_sign&lt;/code&gt; (+1 taker bought, −1 taker sold)&lt;/p&gt;
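&lt;p&gt;A sketch of how such a sign can be derived, assuming the input is the boolean &lt;code&gt;m&lt;/code&gt; flag ("buyer is the market maker") that Binance aggregated trade events carry. The function name is hypothetical.&lt;/p&gt;

```python
def taker_sign(buyer_is_maker: bool) -> int:
    """+1 when the taker bought, -1 when the taker sold."""
    # If the buyer was the resting (maker) order, the aggressor was the seller.
    return -1 if buyer_is_maker else 1
```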

&lt;p&gt;&lt;strong&gt;mark_price&lt;/strong&gt; — Binance mark price, index price, and next funding rate at 1-second intervals&lt;/p&gt;

&lt;p&gt;Having all three lets you correlate order flow imbalance with trade aggression and funding dynamics — the combination that most signal research requires.&lt;/p&gt;




&lt;h2&gt;
  
  
  The tool
&lt;/h2&gt;

&lt;p&gt;I built &lt;a href="https://github.com/Balleing/binance-l2-capture" rel="noopener noreferrer"&gt;&lt;code&gt;binance-l2-capture&lt;/code&gt;&lt;/a&gt; to implement exactly this protocol. It runs on Python 3.11+, is self-hosted, and uses your own API key. The data never leaves your machine.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Balleing/binance-l2-capture.git
&lt;span class="nb"&gt;cd &lt;/span&gt;binance-l2-capture
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env   &lt;span class="c"&gt;# add BINANCE_API_KEY&lt;/span&gt;
l2cap run              &lt;span class="c"&gt;# data starts landing in ./data/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It implements the full six-step merge protocol, checks &lt;code&gt;pu&lt;/code&gt; continuity on every event, halts on invariant violations rather than logging them, and auto-resyncs after a gap. On a $6/month VPS it captures two symbols continuously with no intervention.&lt;/p&gt;

&lt;p&gt;The code is MIT-licensed and the core will stay free. If you're running more symbols or want a monitoring dashboard, I'm building a Pro tier; star the repo to follow along.&lt;/p&gt;




&lt;h2&gt;
  
  
  The one-line test for your existing capture
&lt;/h2&gt;

&lt;p&gt;If you have an existing order book recorder, run this against a day of data:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;polars&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scan_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/BTCUSDT/books/2024-06-01/*.parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;gaps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;update_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;diff&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;drop_nulls&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sequence gaps: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;gaps&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



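&lt;p&gt;This assumes &lt;code&gt;update_id&lt;/code&gt; advances by exactly 1 from row to row. If your recorder stores Binance's &lt;code&gt;u&lt;/code&gt; directly, one event can cover several update IDs, so a step larger than 1 is not by itself a gap; in that case compare each row's &lt;code&gt;pu&lt;/code&gt; with the previous row's &lt;code&gt;u&lt;/code&gt; instead. If &lt;code&gt;gaps &amp;gt; 0&lt;/code&gt; under the right check, your book has holes. If it prints 0 but you weren't checking &lt;code&gt;pu&lt;/code&gt;, re-read Step 5.&lt;/p&gt;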
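&lt;p&gt;If the recorded rows also keep &lt;code&gt;pu&lt;/code&gt; and &lt;code&gt;u&lt;/code&gt; (an assumption, since the schema above lists only &lt;code&gt;update_id&lt;/code&gt;), the Step 5 check can be replayed offline. This is a plain-Python sketch over a list of row dicts with a hypothetical helper name; the sample rows are invented.&lt;/p&gt;

```python
def count_chain_breaks(rows: list) -> int:
    """Count rows whose pu does not equal the previous row's u."""
    breaks = 0
    for prev, cur in zip(rows, rows[1:]):
        if cur["pu"] != prev["u"]:
            breaks += 1
    return breaks


rows = [
    {"pu": 10, "u": 14},
    {"pu": 14, "u": 20},  # chains onto the previous row even though u jumped by 6
    {"pu": 23, "u": 27},  # does not chain: the event ending at u=23 is missing
]
print(f"Chain breaks: {count_chain_breaks(rows)}")
```

&lt;p&gt;Rows must be in capture order for the count to mean anything, and resync points recorded by a halt-and-resync recorder will show up as breaks by design.&lt;/p&gt;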




&lt;p&gt;&lt;em&gt;Questions, corrections, or "my backtest still blows up" — I'm &lt;a href="https://x.com/BaldQuant" rel="noopener noreferrer"&gt;@BaldQuant on X&lt;/a&gt;. The repo issues tab works too.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>algotrading</category>
      <category>datascience</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
