manja316
I archived 17.9M Polymarket price snapshots. Three things the data shows that the order book hides.

If you only ever look at a prediction market through its live order book, you see one number: the current price. That number is the market's best guess at a probability — "73¢ = ~73% YES." Useful, but it throws away everything about how the market got there.

I've been archiving Polymarket order-book snapshots every 15 minutes for about ten weeks. The dataset is now 17.9M+ price points across ~17,900 markets, most of them resolved (so we know the ground-truth outcome). Looking at prediction markets in bulk, instead of one at a time, surfaces structure that's invisible tick-by-tick. Three findings worth your time.

1. Favorite-longshot bias is real here too

One of the oldest results in betting-market research is the favorite-longshot bias: longshots are systematically overpriced and heavy favorites are systematically underpriced. People overpay for the lottery-ticket thrill of a 5¢ "it could happen" and underpay for the boring 95¢ near-certainty.

It shows up cleanly when you bucket thousands of resolved markets by their pre-resolution price and compare each bucket's implied probability to its actual hit rate:

  • Contracts trading in the 2–10¢ band resolve YES less often than their price implies — the longshot tax.
  • Contracts in the 90–98¢ band resolve YES slightly more often than their price implies — favorites are a hair cheap.

You cannot see this in one market. You can only see it across a thousand resolved ones — which is exactly what a bulk archive is for.
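A minimal sketch of that bucketing, using the two bands above and the column names listed in the reproduction section. Several things here are assumptions rather than facts about the archive: `resolved_outcome` is taken to be 1 for YES and 0 for NO, "pre-resolution" is approximated as the last snapshot at least `lead_hours` before each market's final archived snapshot, `calibration_at_lead` is a hypothetical helper name, and the toy rows are synthetic and only exercise the code.

```python
import pandas as pd

def calibration_at_lead(df: pd.DataFrame, lead_hours: int = 24) -> pd.DataFrame:
    """Implied vs. actual YES rate per price band, measured lead_hours before the end."""
    d = df.dropna(subset=["resolved_outcome"]).copy()
    d["timestamp"] = pd.to_datetime(d["timestamp"])
    d["resolved_outcome"] = d["resolved_outcome"].astype(float)
    # Assumption: the last archived snapshot stands in for the resolution time.
    last_seen = d.groupby("market_id")["timestamp"].transform("max")
    early = d[d["timestamp"] <= last_seen - pd.Timedelta(hours=lead_hours)]
    # Latest snapshot per market that is still at least lead_hours before the end.
    snap = early.sort_values("timestamp").groupby("market_id").tail(1).copy()
    # Band edges follow the 2-10 cent and 90-98 cent ranges discussed in the text.
    edges = [0.0, 0.02, 0.10, 0.90, 0.98, 1.0]
    snap["band"] = pd.cut(snap["price_yes"], bins=edges, include_lowest=True)
    return snap.groupby("band", observed=True).agg(
        implied=("price_yes", "mean"),
        actual=("resolved_outcome", "mean"),
        n=("market_id", "size"),
    )

# Synthetic toy data: four markets, three daily snapshots each.
stamps = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"])
rows = []
for market_id, prices, outcome in [
    ("a", [0.08, 0.05, 0.01], 0),
    ("b", [0.06, 0.05, 0.01], 0),
    ("c", [0.92, 0.95, 0.99], 1),
    ("d", [0.94, 0.95, 0.99], 1),
]:
    for ts, price in zip(stamps, prices):
        rows.append({"market_id": market_id, "timestamp": ts,
                     "price_yes": price, "resolved_outcome": outcome})
toy = pd.DataFrame(rows)

table = calibration_at_lead(toy, lead_hours=24)
print(table)
```

The lead time matters: if the archive keeps polling right up to resolution, the very last snapshot is already close to 0 or 1, and every bucket looks well calibrated. Four toy markets say nothing about the bias itself; the gap between `implied` and `actual` only means something when `n` is large.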

2. Convergence is late and lumpy, not smooth

Intuition says a market should drift smoothly toward 0 or 1 as resolution nears. The data says otherwise: most markets sit in a noisy band for the majority of their life and then convergence happens in a short burst near the resolving event — a debate, an earnings print, an election night.

The practical implication: time-to-resolution matters more than price level when you're reasoning about how much a contract can still move. A 60¢ market with three weeks left and a 60¢ market with three hours left are completely different objects, and a single live quote can't tell them apart. A timestamped history can.
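One way to check the late-and-lumpy pattern on a long-format table is to rescale each resolved market's life to 0–1 and average the distance between price and final outcome within each tenth of that life. This is a rough sketch, not the exact method behind the numbers in this post. The assumptions: `distance_by_life_stage` is a hypothetical helper, the last archived snapshot stands in for the resolution time, `resolved_outcome` is 1 for YES and 0 for NO, and the toy frame is synthetic.

```python
import numpy as np
import pandas as pd

def distance_by_life_stage(df: pd.DataFrame, n_bins: int = 10) -> pd.Series:
    """Mean |outcome - price| within each slice of a market's observed lifetime."""
    d = df.dropna(subset=["resolved_outcome"]).copy()
    d["timestamp"] = pd.to_datetime(d["timestamp"])
    stamps = d.groupby("market_id")["timestamp"]
    start = stamps.transform("min")
    span = (stamps.transform("max") - start).dt.total_seconds()
    # Fraction of the market's observed life elapsed at each snapshot (0 = first, 1 = last).
    d["life_frac"] = (d["timestamp"] - start).dt.total_seconds() / span
    # Markets with a single snapshot have zero span and yield NaN; drop them.
    d = d.dropna(subset=["life_frac"]).copy()
    d["stage"] = np.minimum((d["life_frac"] * n_bins).astype(int), n_bins - 1)
    d["distance"] = (d["resolved_outcome"].astype(float) - d["price_yes"]).abs()
    return d.groupby("stage")["distance"].mean()

# Synthetic toy data: two markets that sit flat, then converge in the last two snapshots.
ts = pd.date_range("2024-01-01", periods=11, freq="15min")
toy = pd.DataFrame({
    "market_id": ["a"] * 11 + ["b"] * 11,
    "timestamp": list(ts) * 2,
    "price_yes": [0.6] * 9 + [0.9, 0.99] + [0.4] * 9 + [0.1, 0.01],
    "resolved_outcome": [1] * 11 + [0] * 11,
})

profile = distance_by_life_stage(toy)
print(profile)
```

If convergence were smooth, the profile would fall roughly in a straight line. A profile that stays flat and then drops in the last one or two stages is the late-burst pattern described above.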

3. Volume ≠ movement

The markets with the biggest 24h price moves are usually not the ones with the biggest volume. High-volume markets are liquid and efficient — lots of participants, tight spreads, slow to move. The violent re-pricings happen in thinner markets where a single piece of news has nobody on the other side to absorb it.

If you screen for "interesting" markets by volume, you'll mostly find markets that have already finished being interesting. Screening by realized movement relative to liquidity finds the ones still in motion. (I built a free screener around exactly this idea — separate post.)
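To make "movement relative to liquidity" concrete, here is one possible scoring, offered as a sketch and not as the screener's actual formula. It takes each market's price range over its most recent 24 hours of snapshots and divides by the log of its volume. The assumptions: `volume` is a per-market volume figure reported on each snapshot, `movement_screen` and the `move_per_log_volume` score are hypothetical, and the toy data is synthetic.

```python
import numpy as np
import pandas as pd

def movement_screen(df: pd.DataFrame, window_hours: int = 24) -> pd.DataFrame:
    """Rank markets by recent price range relative to (log) volume."""
    d = df.copy()
    d["timestamp"] = pd.to_datetime(d["timestamp"])
    d = d.sort_values(["market_id", "timestamp"])
    # Keep only each market's trailing window, measured from its own latest snapshot.
    latest = d.groupby("market_id")["timestamp"].transform("max")
    recent = d[d["timestamp"] >= latest - pd.Timedelta(hours=window_hours)]
    g = recent.groupby("market_id")
    out = pd.DataFrame({
        "move": g["price_yes"].max() - g["price_yes"].min(),  # realized range
        "volume": g["volume"].last(),                         # most recent volume figure
    })
    # Log-scale the volume so large markets are damped, not zeroed out;
    # clip at 1 so near-zero volume cannot blow up the ratio.
    out["move_per_log_volume"] = out["move"] / np.log1p(out["volume"]).clip(lower=1.0)
    return out.sort_values("move_per_log_volume", ascending=False)

# Synthetic toy data: one liquid, quiet market and one thin market that re-prices.
stamps = pd.date_range("2024-01-01", periods=5, freq="360min")
toy = pd.DataFrame({
    "market_id": ["thick"] * 5 + ["thin"] * 5,
    "timestamp": list(stamps) * 2,
    "price_yes": [0.50, 0.51, 0.50, 0.52, 0.51, 0.30, 0.31, 0.30, 0.65, 0.70],
    "volume": [1_000_000.0] * 5 + [5_000.0] * 5,
})

screen = movement_screen(toy)
print(screen)
```

Sorting the same table by `volume` puts the quiet market first, and sorting by the ratio puts the thin, moving one first. The log scaling is a design choice: dividing by raw volume would let tiny markets dominate the ranking on noise alone.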

How to reproduce this yourself

Every claim above is just pandas over a long-format table:

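import pandas as pd

df = pd.read_parquet("polymarket_snapshots.parquet")
# columns: market_id, timestamp, price_yes, volume, resolved_outcome
# (resolved_outcome is assumed to be 1 for YES, 0 for NO, NaN if unresolved)

# Favorite-longshot calibration: take each resolved market's last archived price,
# bucket it, and compare implied probability to the actual YES rate.
# Caveat: if snapshots continue right up to resolution, the last price is already
# near 0 or 1, so pick a snapshot some fixed time before the end for a fair test.
last = (df.dropna(subset=["resolved_outcome"])
          .sort_values("timestamp")
          .groupby("market_id").tail(1)
          .copy())  # copy so the assignment below does not write to a slice of df
last["bucket"] = (last["price_yes"] * 10).round() / 10  # nearest 10-cent bucket
calib = last.groupby("bucket").agg(
    implied=("price_yes", "mean"),         # average quoted probability in the bucket
    actual=("resolved_outcome", "mean"),   # share of markets that resolved YES
    n=("market_id", "size"))               # markets per bucket
print(calib)  # actual < implied for low buckets = longshot bias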

The hard part isn't the analysis — it's getting clean, timestamped, resolution-labeled history out of an API that's built for live trading, not bulk export. That collection problem is the whole reason this dataset exists.

The dataset

If you want the raw archive instead of building the collector yourself, it's here:

👉 Polymarket Full Dataset — 17.9M+ price snapshots, ~17,900 markets

Parquet + CSV, resolution labels included, one-time purchase. If you just want to poke at the live numbers first, the market index is free at protodex.io.

What would you check first with 17.9M labeled prediction-market snapshots? I'll run reader requests against the archive and post the results.
