I've been collecting Polymarket prediction-market data on a single Mac for 89 days. As of this morning the SQLite file holds:
- 18,207,844 price snapshots (15-minute OHLC)
- 21,871 markets
- 1,816,392 order-book rows
- span: 2026-03-28 → 2026-06-25 (89 days)
For weeks I described it as "18M price snapshots + 1.8M order-book records." That second number was a lie I was telling by accident. This post is the autopsy, because if you buy or build on market data, the failure mode I hit is one you will hit too.
The smell test that should have run on day one
I went to compute realized spreads — best_ask - best_bid over time — expecting a tight distribution for liquid markets. Instead almost every row looked like this:
best_bid | best_ask | spread | mid
0.001 | 0.999 | 0.998 | 0.5
0.001 | 0.999 | 0.998 | 0.5
0.001 | 0.999 | 0.998 | 0.5
A 99.8-cent spread on a contract that can only trade between $0 and $1 is not a quote. It's the placeholder my collector wrote when the CLOB returned an empty book, and I had been counting every one of those as a "record."
One query settled it:
SELECT
    COUNT(*) AS total,
    SUM(CASE WHEN (best_ask - best_bid) >= 0.99 THEN 1 ELSE 0 END) AS effectively_empty
FROM orderbooks;
-- total = 1,816,392
-- effectively_empty = 1,602,616 (88.2%)
88.2% of my order-book rows have a spread of 0.99 or wider, which in practice means the 0.001 / 0.999 placeholder. Only ~11.8% have a spread tight enough to even be a candidate for a real two-sided quote, and the genuinely tradeable subset is smaller still. The "1.8M order-book records" headline counted real rows in a table, but as information those rows were almost entirely empty.
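The same check can be scripted so it runs on every refresh instead of once by hand. Below is a minimal Python sketch using the standard-library sqlite3 module. It assumes only what the query above shows: a table named orderbooks with best_bid and best_ask columns. The function names, the demo rows, and the exact-match placeholder comparison are illustrative assumptions, not the collector's actual code.

```python
import sqlite3

# Placeholder values taken from the sample rows shown above.
PLACEHOLDER_BID, PLACEHOLDER_ASK = 0.001, 0.999
WIDE_SPREAD = 0.99  # same threshold as the SQL query above


def book_coverage(conn: sqlite3.Connection) -> dict:
    """Split order-book rows into exact placeholders, wide spreads, and candidates."""
    total, placeholder, wide, candidates = conn.execute(
        """
        SELECT
            COUNT(*),
            COALESCE(SUM(best_bid = ? AND best_ask = ?), 0),
            COALESCE(SUM((best_ask - best_bid) >= ?), 0),
            COALESCE(SUM((best_ask - best_bid) < ?), 0)
        FROM orderbooks
        """,
        (PLACEHOLDER_BID, PLACEHOLDER_ASK, WIDE_SPREAD, WIDE_SPREAD),
    ).fetchone()
    pct = round(100.0 * candidates / total, 1) if total else 0.0
    return {
        "total": total,
        "placeholder": placeholder,        # exactly 0.001 / 0.999
        "effectively_empty": wide,         # spread >= 0.99
        "candidates": candidates,          # spread < 0.99
        "candidate_pct": pct,
    }


def demo_connection() -> sqlite3.Connection:
    """Build a tiny in-memory table with made-up rows (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orderbooks (best_bid REAL, best_ask REAL)")
    conn.executemany(
        "INSERT INTO orderbooks VALUES (?, ?)",
        [(0.001, 0.999)] * 3 + [(0.001, 0.995), (0.42, 0.44)],
    )
    return conn


if __name__ == "__main__":
    print(book_coverage(demo_connection()))
```

On the five made-up rows, this reports 3 exact placeholders, 4 effectively empty rows, and 1 candidate quote. Keeping the exact-placeholder count separate from the wide-spread count is a design choice: it shows how many "empty" rows are the collector's own sentinel versus genuinely wide quotes. The equality test on floats only works if the collector wrote exactly those literals, which is an assumption here.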
Why this happens (and why it's not Polymarket's fault)
Prediction markets are thin. Most of the 21,871 markets are long-tail contracts — a niche election, a sports prop, a "will X tweet by Friday." At any given 15-minute poll, most of those books are genuinely empty. The collector dutifully recorded "no bid, no ask" as a row. Nothing was broken. The bug was in how I summarized it.
The price series, by contrast, is dense and reliable: a market's last-trade / OHLC ticks along even when the book is empty, because it reflects executed prices, not resting orders. 18.2M price rows are 18.2M real observations. 1.8M book rows are at most ~214K real observations (1,816,392 - 1,602,616 = 213,776) wearing a costume.
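To see how the emptiness is spread across the long tail, the count can be broken down per market. This is a sketch under a stated assumption: the orderbooks table is taken to have a market_id column, which the post does not show, so treat that name (and the demo rows) as hypothetical.

```python
import sqlite3


def per_market_coverage(conn: sqlite3.Connection, wide_spread: float = 0.99) -> list:
    """Return (market_id, n_rows, n_candidates) per market, best-covered first."""
    return conn.execute(
        """
        SELECT
            market_id,                       -- hypothetical column name
            COUNT(*) AS n_rows,
            COALESCE(SUM((best_ask - best_bid) < ?), 0) AS n_candidates
        FROM orderbooks
        GROUP BY market_id
        ORDER BY n_candidates DESC, market_id
        """,
        (wide_spread,),
    ).fetchall()


def always_empty_markets(rows: list) -> list:
    """Markets whose book never showed a spread below the threshold."""
    return [market_id for market_id, _, n_candidates in rows if n_candidates == 0]


def demo_connection() -> sqlite3.Connection:
    """Tiny in-memory table with made-up markets A, B, C (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orderbooks (market_id TEXT, best_bid REAL, best_ask REAL)"
    )
    conn.executemany(
        "INSERT INTO orderbooks VALUES (?, ?, ?)",
        [
            ("A", 0.40, 0.42), ("A", 0.41, 0.43),                      # liquid
            ("B", 0.001, 0.999), ("B", 0.001, 0.999), ("B", 0.001, 0.999),
            ("C", 0.10, 0.15), ("C", 0.001, 0.999),                    # mixed
        ],
    )
    return conn


if __name__ == "__main__":
    rows = per_market_coverage(demo_connection())
    print(rows)
    print("never quoted:", always_empty_markets(rows))
```

A breakdown like this separates "a few liquid markets with real books" from "thousands of markets that never had a quote," which a single table-wide percentage hides.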
What I changed
- Dropped "order-book records" as a headline feature everywhere I controlled: README, the dataset article, the listing copy. Selling an 88%-empty table as a feature is how you earn refund requests.
- Re-led with the price series, which is what the dataset is actually good for: backtesting, calibration studies, favorite-longshot analysis.
- Shipped the mostly empty book anyway, labeled honestly: as a sparse bonus with its real coverage stated, not as a selling point. The roughly 200K real top-of-book snapshots still have uses; pretending there are 1.8M of them does not.
The transferable lesson
COUNT(*) is not coverage. Before you trust — or sell, or backtest on — any market dataset:
- Run the spread/sanity distribution, not just the row count. A column can be 100% populated and ~0% informative.
- Separate observations from rows. Placeholders are rows. Only non-degenerate values are observations.
- State coverage as a percentage of plausible values, not as a raw total. "1.8M rows, 12% non-empty" is honest; "1.8M records" is marketing.
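The checklist above can be sketched as two small helpers: one that buckets spreads into a distribution, and one that formats coverage as a percentage next to the raw total. This is a minimal illustration, again assuming a table named orderbooks with best_bid and best_ask columns; the bucket edges, function names, and demo rows are my own choices, not anything prescribed by the data.

```python
import sqlite3
from bisect import bisect_right

# Assumed bucket boundaries for the spread distribution (left-inclusive).
EDGES = [0.01, 0.05, 0.10, 0.50, 0.99]
LABELS = ["<0.01", "0.01-0.05", "0.05-0.10", "0.10-0.50", "0.50-0.99", ">=0.99"]


def spread_histogram(conn: sqlite3.Connection) -> dict:
    """Count rows per spread bucket; rows with a NULL bid or ask are skipped."""
    counts = {label: 0 for label in LABELS}
    query = """
        SELECT best_ask - best_bid
        FROM orderbooks
        WHERE best_bid IS NOT NULL AND best_ask IS NOT NULL
    """
    for (spread,) in conn.execute(query):
        counts[LABELS[bisect_right(EDGES, spread)]] += 1
    return counts


def coverage_statement(total_rows: int, observations: int) -> str:
    """State coverage as rows plus the share that are real observations."""
    pct = 100.0 * observations / total_rows if total_rows else 0.0
    return f"{total_rows:,} rows, {pct:.1f}% non-empty"


def demo_connection() -> sqlite3.Connection:
    """Tiny in-memory table with made-up rows (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orderbooks (best_bid REAL, best_ask REAL)")
    conn.executemany(
        "INSERT INTO orderbooks VALUES (?, ?)",
        [(0.001, 0.999)] * 3 + [(0.42, 0.44), (0.30, 0.38)],
    )
    return conn


if __name__ == "__main__":
    print(spread_histogram(demo_connection()))
    # Figures from this post: 1,816,392 rows, of which 213,776 are not empty.
    print(coverage_statement(1_816_392, 213_776))
```

A healthy book puts most of its mass in the low buckets. A distribution that piles up in the ">=0.99" bucket is the signal that the rows are placeholders rather than quotes.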
I'd rather a buyer learn this from my article than discover it after paying. The dense, verified layer — 18.2M price snapshots across 89 days — is the thing worth having, and it's the thing I now lead with.
The full dataset (price series + the honestly-labeled sparse book) is on Gumroad: Polymarket Quant Toolkit. There's a free sample so you can run your own sanity queries before paying — which, after reading this, you absolutely should.