When Polymarket says 70%, does it happen 70% of the time? Why price-only data can't answer that, and what it can

#machinelearning #python #datascience #statistics

Correction (2026-06-27): An earlier version of this post implied the dataset ships with per-market 0/1 resolution labels and that I had measured calibration directly. That was wrong, and I'd rather fix it loudly than quietly. This is a price dataset — it does not contain settled outcome labels. True calibration needs realized outcomes you have to join in from an external source. Below is the honest version: what price-only data can tell you, the legit proxy it gives you, and exactly what you'd have to add to score real calibration.

If you trade, model, or just read prediction markets, there's one question that decides whether the price means anything: when the market says 70%, does the thing actually happen about 70% of the time?

That's calibration, and it's the single most decision-relevant property of any probabilistic forecaster. A market can be liquid, popular, and heavily traded and still be systematically wrong in a way that's invisible until you score it against what actually resolved.

Here's the honest catch I have to lead with: measuring calibration needs two things — a dense price history AND the realized outcome of each market. I have the first cleanly. The second is not in this dataset, and any vendor (me included) who shows you a "calibration curve" derived from prices alone is measuring the market against itself, not against reality.

The dataset (what's actually in it)

Since late March I've logged Polymarket every 15 minutes. The frozen export holds:

22,410 markets
18,611,636 price snapshots (≈831 per market with a series)
1,856,388 order-book snapshots
15-minute cadence, 92 continuous days (2026-03-28 → 2026-06-28)
Each row: market id, timestamp, yes-price, plus volume / liquidity / best bid-ask features

What it does not carry: a per-market settled 0/1 outcome. SELECT COUNT(*) FROM markets WHERE resolved=1 returns 0, meaning none of the 22,410 markets in the export is flagged as resolved. These are price paths, not graded results. So you cannot, from this file alone, compute "of the times the price sat at 70%, how often did the event happen." You need the resolution labels, and those live outside the price feed (Polymarket's resolution / the on-chain settlement).
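
If you want to run that check yourself, here is a minimal sketch using Python's built-in sqlite3 module. The markets table and resolved column are taken from the query above. Everything else is an assumption: the stand-in schema is invented so the snippet runs without the archive, and the file path in the comment is a placeholder.

```python
import sqlite3

def count_resolved(conn: sqlite3.Connection) -> tuple[int, int]:
    """Return (total markets, markets flagged resolved=1)."""
    total = conn.execute("SELECT COUNT(*) FROM markets").fetchone()[0]
    resolved = conn.execute(
        "SELECT COUNT(*) FROM markets WHERE resolved = 1"
    ).fetchone()[0]
    return total, resolved

# Stand-in database so the sketch runs anywhere. With the real archive you
# would call sqlite3.connect("path/to/archive.sqlite") instead (placeholder path).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE markets (id TEXT PRIMARY KEY, resolved INTEGER)")
conn.executemany(
    "INSERT INTO markets VALUES (?, ?)",
    [("m1", 0), ("m2", 0), ("m3", 0)],
)

total, resolved = count_resolved(conn)
print(f"{resolved} of {total} markets carry a resolved flag")
```

Printing both numbers together keeps the denominator visible, which matters again in the next section.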

What price-only data CAN give you: the convergence proxy

You don't get true calibration, but you do get a useful, honest proxy. Of the 7,101 markets that ended inside the window and still carry a terminal price series, 6,836 (96.3%) closed decisively: last yes-price ≥ 0.95 or ≤ 0.05. Only 265 (3.7%) were still mushy in the middle at the end. (Denominator matters: measured across all 19,584 markets whose end-date fell in the window, many with no live terminal quote, the decisive share drops sharply, so always state which denominator you mean.)

That terminal price is a noisy stand-in for the outcome: ~96.3% of the time the market made up its mind hard enough that "did it close near 1?" is a defensible label. It's not ground truth (the 3.7% ambiguous tail and any post-close revision are exactly where it breaks), but it lets you study the shape of price convergence — how and when a market sharpens — which is genuinely informative on its own.
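
A minimal sketch of that proxy in plain Python follows. The 0.95 / 0.05 thresholds are the ones stated above. The function name, the input shape (a dict of price lists ordered oldest to newest), and the toy price paths are assumptions made for illustration, not the dataset's actual layout.

```python
def terminal_proxy(series, hi=0.95, lo=0.05):
    """Map each market's last yes-price to a proxy label.

    series: {market_id: [yes_price, ...]} ordered oldest to newest.
    Returns {market_id: 1, 0, or None}; None marks the ambiguous middle.
    """
    labels = {}
    for market_id, prices in series.items():
        if not prices:
            continue  # no terminal quote: excluded from this denominator
        last = prices[-1]
        labels[market_id] = 1 if last >= hi else 0 if last <= lo else None
    return labels

# Toy price paths, invented for illustration only.
toy = {
    "a": [0.40, 0.70, 0.98],  # closed near 1
    "b": [0.30, 0.10, 0.02],  # closed near 0
    "c": [0.50, 0.55, 0.60],  # still ambiguous at the end
    "d": [],                  # no series at all
}
labels = terminal_proxy(toy)
decisive = sum(v is not None for v in labels.values())
print(f"decisive share: {decisive}/{len(labels)}")
```

Market "d" is dropped before the share is computed, which is the same denominator choice as the 7,101 figure: only markets with a terminal price series are counted.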

The real calibration measurement (what you'd add)

The classic check is a reliability diagram, and the method is fine — it just needs labels you join in:

Get realized outcomes for resolved markets from an external resolution source (the price feed won't give them to you).
Bin every historical price into deciles (0–10%, 10–20%, … 90–100%).
For each bin, compute the empirical resolution rate — of all the times the price sat in that bin, how often did the event actually happen?
Plot empirical rate vs. stated price. Perfect calibration is the 45° diagonal.

import pandas as pd
import requests

BASE = "https://api.protodex.io"   # free price API, no signup

# Prices come from the dataset/API. LABELS DO NOT — you must supply them.
# `resolutions` here is an external {market_id: 0/1} you join in yourself
# (Polymarket resolution / on-chain settlement). It is NOT in this price feed.
resolutions = load_external_resolution_labels()   # <-- the part you provide

rows = []
for market_id, label in resolutions.items():
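    resp = requests.get(f"{BASE}/prices", params={"market_id": market_id}, timeout=30)
    resp.raise_for_status()   # fail loudly on HTTP errors instead of parsing an error body
    prices = resp.json()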
    for p in prices:
        rows.append((p["yes_price"], label))

df = pd.DataFrame(rows, columns=["price", "outcome"])
df["bin"] = (df["price"] * 10).clip(0, 9).astype(int)
reliability = df.groupby("bin")["outcome"].mean()   # empirical rate per decile
print(reliability)   # compare each row to its bin midpoint -> the diagonal
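
The snippet above prints only the per-decile rate. Comparing against the diagonal is easier with the bin midpoint, the gap, and the row count side by side. Here is a sketch under the same assumptions: a frame with price and outcome columns, where the outcomes come from labels you supply. The helper name and the toy rows are invented for illustration.

```python
import pandas as pd

def reliability_table(df: pd.DataFrame) -> pd.DataFrame:
    """Per-decile empirical rate, snapshot count, and gap to the bin midpoint."""
    bins = (df["price"] * 10).astype(int).clip(0, 9)  # price 1.0 falls into the top bin
    table = df.groupby(bins)["outcome"].agg(rate="mean", n="count")
    table.index.name = "bin"
    table["midpoint"] = table.index.to_numpy() / 10 + 0.05
    # gap > 0: the event happened more often than the price implied
    table["gap"] = table["rate"] - table["midpoint"]
    return table

# Toy rows, invented for illustration; real rows need external labels.
toy = pd.DataFrame({
    "price":   [0.12, 0.18, 0.72, 0.75, 0.78, 0.79],
    "outcome": [0,    0,    1,    1,    1,    0],
})
table = reliability_table(toy)
print(table)
```

Keep an eye on n. It counts snapshots, not independent events, so a single long-lived market can dominate a bin.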

If you only have the price feed and substitute the convergence proxy for the label, be explicit that you're scoring the market against its own terminal price, which is a different, weaker claim than calibration against reality.

What to look for (and the trap)

The literature on real-money markets has a well-documented signature: the favorite–longshot bias — longshots tend to be overpriced, heavy favorites slightly underpriced. But you cannot confirm it from prices alone. It's a statement about realized outcomes vs. price, so it lives or dies on the labels you join in. Same for any Brier-score-over-time trajectory: great question, needs ground truth.
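
For reference, the Brier score is the mean squared difference between the stated probability and the realized 0/1 outcome. The sketch below shows one way to turn it into a trajectory once labels exist. Bucketing by days before close is my assumption about what "over time" should mean here, and every number in the toy data is invented.

```python
def brier_score(pairs):
    """Mean squared error between stated price and realized 0/1 outcome.

    pairs: iterable of (price, outcome). 0.0 is perfect; a constant 0.5
    forecast scores 0.25.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (price, outcome) pair")
    return sum((p - y) ** 2 for p, y in pairs) / len(pairs)

# Toy (price, outcome) pairs keyed by days before close; values are invented.
by_days_to_close = {
    30: [(0.50, 1), (0.50, 0)],
    7:  [(0.80, 1), (0.20, 0)],
    1:  [(0.90, 1), (0.10, 0)],
}
trajectory = {d: brier_score(rows) for d, rows in by_days_to_close.items()}
print(trajectory)
```

Feeding this the convergence proxy instead of real outcomes would make the score fall toward zero near close almost by construction, since the proxy is derived from the final price.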

Honest caveats

No resolution labels in the dataset. The single biggest limit — restated because it's the one most listings hide.
Convergence ≠ truth. The 96.3% terminal-decisiveness proxy mislabels the ambiguous tail and ignores any post-close correction.
Survivorship / selection. Ended markets ≠ all markets; conditioning on resolution biases some analyses.
Mid ≠ executable price. Fees and spread are real the moment you make a trading claim.
92 days is one window, not multiple regimes. Treat any "markets are calibrated" conclusion as provisional.
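
To make the mid-versus-executable caveat concrete, here is a small arithmetic sketch. It assumes a YES share pays 1 if the event happens, that you buy at the best ask, and that the order fits within the top of the book. The function name is invented, and fee_per_share is a hypothetical flat cost, not Polymarket's actual fee schedule.

```python
def edge_per_share(fair_prob, best_bid, best_ask, fee_per_share=0.0):
    """Expected profit per YES share under two pricing assumptions.

    fair_prob: your own probability estimate for YES.
    fee_per_share: hypothetical flat cost; set it from the venue's real schedule.
    """
    if not 0.0 <= best_bid <= best_ask <= 1.0:
        raise ValueError("expected 0 <= bid <= ask <= 1")
    mid = (best_bid + best_ask) / 2
    return {
        "at_mid": fair_prob - mid,                        # paper edge, not tradable
        "at_ask": fair_prob - best_ask - fee_per_share,   # edge after crossing the spread
    }

edge = edge_per_share(fair_prob=0.75, best_bid=0.68, best_ask=0.72)
print(edge)
```

With a 4-cent spread, a 5-cent edge measured at the mid shrinks to 3 cents at the ask, before any fee.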

Get the data

The free read-only API reproduces the price side with no signup:

https://api.protodex.io — endpoints /stats, /markets, /market/{id}, /prices, /orderbook, /categories.

If you'd rather not page the API market-by-market and want the full 18.6M-snapshot, 92-day price history as one indexed SQLite file for offline work, the one-time archive is here: Polymarket Historical Price Dataset. It's price history, honestly scoped — bring your own resolution labels if calibration is the goal.

I'd genuinely like to know how others source clean resolution labels for prediction-market calibration work — that's the open edge here. If you've benchmarked a market's calibration, where did your ground truth come from?