DEV Community

John Leslie

Posted on • Originally published at polymarket-calibration.vercel.app

Are Prediction Markets Well-Calibrated? I Analyzed 7,661 Resolved Polymarket Markets

By John Leslie

Prediction markets are booming. Polymarket alone processes billions in volume. But a fundamental question remains: when a market says there's a 60% chance of something happening, does that event actually happen 60% of the time?

I pulled 7,661 resolved binary markets from Polymarket's API, grabbed price history for the top 2,000 by volume, and built an interactive calibration tracker to find out.

Try it here: Polymarket Calibration Tracker

What is calibration?

A forecaster is well-calibrated if their predicted probabilities match observed frequencies. If you say "70% chance" for 100 different events, roughly 70 of them should happen. On a calibration plot of predicted probability against observed frequency, perfect calibration means every point sits on the diagonal line where predicted = actual.

Calibration is a standard check of forecast quality, used everywhere from weather forecasting (the NWS publishes calibration curves) to machine learning model evaluation.

The results

Markets are excellent at the extremes

Events priced at 0-5% almost never happen (0.1% actual resolution rate). Events priced at 95-100% happened every time in this sample (100% actual). This is reassuring: when the crowd is very confident, the crowd is right.

Systematic underpricing in the middle range

This is where it gets interesting. Markets trading at 65-75% actually resolved YES 96.5% of the time (29 markets). At 45-55%, the actual rate was 60% (25 markets). The sample sizes are small in these middle bins, so interpret cautiously, but the pattern is consistent: mid-range markets seem to underprice the YES outcome.
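To put rough error bars on those two bins, here is a minimal sketch of a Wilson score interval for a binomial proportion. The YES counts (28 of 29 and 15 of 25) are my inference from the reported rates, not figures taken from the dataset, and the function name is illustrative.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# ASSUMPTION: counts inferred from the reported rates (28/29 ~ 96.5%, 15/25 = 60%).
for label, yes, n in [("65-75%", 28, 29), ("45-55%", 15, 25)]:
    lo, hi = wilson_interval(yes, n)
    print(f"{label}: observed {yes / n:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Under these inferred counts, the interval for the 45-55% bin should still include 50%, while the lower bound for the 65-75% bin should sit above 75%, so the evidence for underpricing is much stronger in the higher bin. The interval also treats markets as independent, which contracts from the same multi-outcome event are not.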

One hypothesis: multi-outcome markets inflate the low-probability bins. In an election with 10 candidates, each candidate is a separate binary market; the nine who lose mostly trade at 0-5%, while the winner trades higher. Losers therefore pile up in the lowest bin and winners cluster higher, creating a distributional skew that looks like underpricing in the middle.

Calibration improves closer to resolution

This makes intuitive sense. The Brier score measured 24 hours before resolution (0.025) is markedly lower, and therefore better, than the score 30 days out (0.042). As resolution approaches, uncertainty resolves and prices converge to 0 or 1.

For context: always forecasting 50% gives a Brier score of 0.25 regardless of outcome, and a perfect forecaster scores 0. A Brier score of 0.025 is excellent.
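The Brier score is the mean squared difference between the forecast probability and the 0/1 outcome. A minimal sketch, with an illustrative function name and made-up inputs rather than the tracker's actual code:

```python
def brier_score(forecasts: list[float], outcomes: list[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / len(forecasts)

# A constant 50% forecast scores 0.25 on any set of binary outcomes.
print(brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))
# Confident and correct forecasts score close to 0.
print(brier_score([0.9, 0.1], [1, 0]))
```

Lower is better, and each market contributes at most 1 (a forecast of 0 or 1 that turns out wrong).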

Category breakdown

The tool lets you filter by category. Some highlights:

  • Politics (1,312 markets): Well-calibrated at extremes, biggest mid-range deviation
  • Sports (1,314 markets): Similar pattern to politics, slightly tighter calibration
  • Crypto (752 markets): More volatile, noisier calibration curve
  • Geopolitics (404 markets): Small sample but interesting patterns
  • Science/Tech (163 markets): Too few markets for reliable conclusions

Technical details

Data collection: Polymarket's Gamma API for resolved market metadata. CLOB (Central Limit Order Book) midpoint prices at 24h, 7d, and 30d before resolution for the top 2,000 markets by volume.
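How a price "at 24h before resolution" is picked from a price history is not spelled out above. One plausible approach, sketched here, takes the last observation at or before the target time. The history format (a list of `(unix_timestamp, price)` pairs sorted by time) and all names are hypothetical, not the API's response shape.

```python
from bisect import bisect_right

# Horizons in seconds before resolution.
HORIZONS = {"24h": 24 * 3600, "7d": 7 * 24 * 3600, "30d": 30 * 24 * 3600}

def price_before(history, resolution_ts, horizon_s):
    """Last price at or before (resolution_ts - horizon_s), or None if no data.

    ASSUMPTION: history is a list of (unix_timestamp, price) sorted ascending.
    """
    target = resolution_ts - horizon_s
    timestamps = [t for t, _ in history]
    i = bisect_right(timestamps, target)  # number of points with t <= target
    return history[i - 1][1] if i > 0 else None

def snapshot(history, resolution_ts):
    """Price at each horizon; None where the market did not exist yet."""
    return {name: price_before(history, resolution_ts, s) for name, s in HORIZONS.items()}

day = 86400
history = [(0, 0.30), (25 * day, 0.40), (29 * day, 0.80)]
print(snapshot(history, resolution_ts=30 * day))
```

Markets that opened less than 30 days before resolution have no 30-day price under this rule, so each horizon is scored on a different subset of markets.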

Methodology: Markets are binned into probability ranges (0-5%, 5-15%, ..., 95-100%). For each bin, the actual resolution rate is compared to the predicted probability. Dot size is proportional to the number of markets in the bin.
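The binning step can be sketched roughly as follows. The bin edges follow the ranges listed above; treating each bin as half-open (a price of exactly 5% falls in 5-15%) and using the mean price as the bin's predicted probability are my assumptions, and the function name is illustrative.

```python
from bisect import bisect_right

# Edges: 0, 0.05, 0.15, ..., 0.95, 1.0 (11 bins).
EDGES = [0.0] + [round(0.05 + 0.10 * i, 2) for i in range(10)] + [1.0]

def calibration_table(prices, outcomes):
    """Per-bin market count, mean price, and observed YES rate.

    ASSUMPTION: bins are [low, high), except the last, which includes 1.0.
    """
    n_bins = len(EDGES) - 1
    counts = [0] * n_bins
    price_sums = [0.0] * n_bins
    yes_counts = [0] * n_bins
    for p, y in zip(prices, outcomes):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"price out of range: {p}")
        i = min(bisect_right(EDGES, p) - 1, n_bins - 1)
        counts[i] += 1
        price_sums[i] += p
        yes_counts[i] += y
    return [
        {
            "bin": (EDGES[i], EDGES[i + 1]),
            "n": counts[i],
            "mean_price": price_sums[i] / counts[i],
            "yes_rate": yes_counts[i] / counts[i],
        }
        for i in range(n_bins)
        if counts[i] > 0  # skip empty bins
    ]

# Made-up example data: price before resolution, and 1 if the market resolved YES.
for row in calibration_table([0.02, 0.03, 0.70, 0.72, 1.0], [0, 0, 1, 1, 1]):
    print(row)
```

Plotting `mean_price` against `yes_rate` for each row, with marker size scaled by `n`, gives a calibration curve of the kind described here.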

Caveats:

  • 75% of price-tracked markets fall in the 0-5% bin, driven by multi-outcome market losers
  • Small sample sizes in mid-range bins (20-50 markets each) limit statistical significance
  • Only resolved markets are included, so survivorship bias is possible

Why this matters

Prediction markets are being used for increasingly high-stakes decisions. DARPA studied them for intelligence forecasting. The Federal Reserve has explored them for inflation expectations. Companies like Google and HP have used internal prediction markets for product planning.

If these markets are well-calibrated, their prices can be treated as genuine probability estimates. If they're systematically biased, traders and decision-makers need to adjust.

The data suggests Polymarket is well-calibrated overall (24-hour Brier score of 0.025), with some interesting biases in the middle range worth investigating further as sample sizes grow.

Explore the data yourself: polymarket-calibration.vercel.app


I write The Market Oracle, a weekly prediction market intelligence newsletter. Read more at our site.
