88% of the order-book rows in my dataset were fake. Here's how I caught it.

#datascience #dataset #python #trading

I've been collecting Polymarket prediction-market data on a single Mac for 89 days. As of this morning the SQLite file holds:

18,207,844 price snapshots (15-minute OHLC)
21,871 markets
1,816,392 order-book rows
span: 2026-03-28 → 2026-06-25 (89 days)

For weeks I described it as "18M price snapshots + 1.8M order-book records." That second number was a lie I was telling by accident. This post is the autopsy, because if you buy or build on market data, the failure mode I hit is one you will hit too.

The smell test that should have run on day one

I went to compute quoted spreads (best_ask - best_bid) over time, expecting a tight distribution for liquid markets. Instead almost every row looked like this:

best_bid | best_ask | spread | mid
   0.001 |    0.999 |  0.998 | 0.5
   0.001 |    0.999 |  0.998 | 0.5
   0.001 |    0.999 |  0.998 | 0.5

A 99.8-cent spread on a contract that can only trade between $0 and $1 is not a quote. It's the placeholder my collector wrote when the CLOB returned an empty book, and I had been counting every one of those as a "record."

One query settled it:

SELECT
  COUNT(*) AS total,
  SUM(CASE WHEN (best_ask - best_bid) >= 0.99 THEN 1 ELSE 0 END) AS effectively_empty
FROM orderbooks;
-- total            = 1,816,392
-- effectively_empty = 1,602,616   (88.2%)

88.2% of my order-book rows have a spread of 0.99 or wider, which is the signature of the 0.001 / 0.999 placeholder. Only ~11.8% have a spread tight enough to even be a candidate for a real two-sided quote, and the genuinely tradeable subset is smaller still. The "1.8M order-book records" headline counted real rows in a table, but as information those rows were almost entirely empty.
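The same check can be wrapped in a few lines of Python so it runs on every collection pass instead of once by accident. This is a minimal sketch, assuming the table is named orderbooks with best_bid and best_ask columns as in the query above; the function name book_coverage and its max_spread parameter are illustrative, not part of my collector.

```python
import sqlite3

def book_coverage(conn, max_spread=0.99):
    """Return (total_rows, empty_rows, empty_fraction) for the orderbooks table.

    A row counts as "effectively empty" when best_ask - best_bid >= max_spread.
    """
    total, empty = conn.execute(
        """
        SELECT COUNT(*),
               COALESCE(SUM(CASE WHEN (best_ask - best_bid) >= ? THEN 1 ELSE 0 END), 0)
        FROM orderbooks
        """,
        (max_spread,),
    ).fetchone()
    fraction = empty / total if total else 0.0
    return total, empty, fraction

# Tiny in-memory demo with made-up rows (not real data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orderbooks (best_bid REAL, best_ask REAL)")
conn.executemany(
    "INSERT INTO orderbooks VALUES (?, ?)",
    [(0.001, 0.999), (0.001, 0.999), (0.001, 0.999), (0.42, 0.44)],
)
total, empty, fraction = book_coverage(conn)
print(f"{empty}/{total} rows effectively empty ({fraction:.1%})")
```

Pointed at the real file, the same function should reproduce the 1,602,616 / 1,816,392 split from the query above.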

Why this happens (and why it's not Polymarket's fault)

Prediction markets are thin. Most of the 21,871 markets are long-tail contracts — a niche election, a sports prop, a "will X tweet by Friday." At any given 15-minute poll, most of those books are genuinely empty. The collector dutifully recorded "no bid, no ask" as a row. Nothing was broken. The bug was in how I summarized it.

The price series, by contrast, is dense and reliable: a market's last-trade / OHLC ticks along even when the book is empty, because it reflects executed prices, not resting orders. 18.2M price rows are 18.2M real observations. 1.8M book rows are ~200K real observations wearing a costume.

What I changed

Dropped "order-book records" as a headline feature everywhere I controlled — README, the dataset article, the listing copy. Selling an 88%-empty column as a feature is how you earn refund requests.
Re-led with the price series, which is what the dataset is actually good for: backtesting, calibration studies, favorite-longshot analysis.
Shipped the sparse book anyway, labeled honestly: as a sparse bonus with its real coverage stated, not as a selling point. At most about 214K real top-of-book snapshots (1,816,392 - 1,602,616 = 213,776) still have uses; pretending there are 1.8M does not.
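One way to label the book honestly without deleting anything is to leave the raw table alone and expose a view that flags placeholder rows. The sketch below is hypothetical: the view name orderbooks_labeled, the is_placeholder column, and the market_id / ts columns in the demo table are assumptions for illustration, not the actual schema I ship.

```python
import sqlite3

def create_labeled_view(conn, max_spread=0.99):
    """Create a view that keeps every raw row but flags placeholder quotes."""
    # SQLite does not allow bound parameters inside CREATE VIEW, so the
    # threshold is interpolated after being forced to a float.
    conn.execute(
        f"""
        CREATE VIEW IF NOT EXISTS orderbooks_labeled AS
        SELECT *, (best_ask - best_bid) >= {float(max_spread)} AS is_placeholder
        FROM orderbooks
        """
    )

# In-memory demo with made-up rows (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orderbooks (market_id TEXT, ts TEXT, best_bid REAL, best_ask REAL)"
)
conn.executemany(
    "INSERT INTO orderbooks VALUES (?, ?, ?, ?)",
    [
        ("m1", "2026-03-28T00:00", 0.001, 0.999),  # empty-book placeholder
        ("m2", "2026-03-28T00:00", 0.42, 0.44),    # real two-sided quote
    ],
)
create_labeled_view(conn)

real_rows = conn.execute(
    "SELECT market_id, best_bid, best_ask "
    "FROM orderbooks_labeled WHERE NOT is_placeholder"
).fetchall()
print(real_rows)
```

A buyer can then query the view for real quotes, or count is_placeholder to check the stated coverage for themselves.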

The transferable lesson

COUNT(*) is not coverage. Before you trust — or sell, or backtest on — any market dataset:

Run the spread/sanity distribution, not just the row count. A column can be 100% populated and ~0% informative.
Separate observations from rows. Placeholders are rows. Only non-degenerate values are observations.
State coverage as a percentage of plausible values, not as a raw total. "1.8M rows, 12% non-empty" is honest; "1.8M records" is marketing.
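The first check on that list, looking at the spread distribution instead of the row count, takes only a few lines. This is a rough sketch: the function name spread_histogram and the bucket edges are arbitrary choices for illustration, and the sample quotes are made up.

```python
from collections import Counter

def spread_histogram(quotes, edges=(0.01, 0.05, 0.20, 0.99)):
    """Count quotes per spread bucket.

    quotes is an iterable of (best_bid, best_ask) pairs. Each quote lands in
    the first bucket whose upper edge exceeds its spread, or in the final
    ">= last edge" bucket if none does.
    """
    counts = Counter()
    for bid, ask in quotes:
        spread = ask - bid
        label = next(
            (f"< {edge:g}" for edge in edges if spread < edge),
            f">= {edges[-1]:g}",
        )
        counts[label] += 1
    return counts

# Made-up sample: three placeholders and two plausible quotes.
quotes = [(0.001, 0.999)] * 3 + [(0.42, 0.44), (0.10, 0.13)]
hist = spread_histogram(quotes)
n = sum(hist.values())
for label, count in hist.most_common():
    print(f"{label:>8}: {count} rows ({count / n:.0%})")
```

If one bucket at the wide end holds most of the rows, the column is populated but not informative, and the headline number should be the share outside that bucket.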

I'd rather a buyer learn this from my article than discover it after paying. The dense, verified layer — 18.2M price snapshots across 89 days — is the thing worth having, and it's the thing I now lead with.

The full dataset (price series + the honestly-labeled sparse book) is on Gumroad: Polymarket Quant Toolkit. There's a free sample so you can run your own sanity queries before paying — which, after reading this, you absolutely should.