
manja316

Free Polymarket Dataset on Kaggle: 13,963 Active Markets + 100K Price Sample (May 2026)

I just pushed a fresh sample of the Polymarket dataset to Kaggle. Free download, Apache 2.0, no email gate.

👉 https://www.kaggle.com/datasets/luciferforge/polymarket-markets-prices-sample-2026

What's in it

Two CSV files:

File | Rows | What it gives you
markets.csv | 13,963 | Every currently-active Polymarket market with question, category, volume, liquidity, status, end date, slug
prices_sample.csv | 100,000 | The most recent 15-minute price snapshots across the universe; a preview of the full 10.8M-snapshot corpus

8 MB compressed. Loads instantly in pandas, DuckDB, Polars, or any spreadsheet.
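
For a first look without installing anything, here is a minimal sketch using only Python's standard library. The toy rows are made up, and the column names are guesses based on the file description above rather than the real header, so print the actual columns before relying on any of them.

```python
import csv
import io


def read_rows(source):
    """Read CSV text from an open file (or any iterable of lines) into a list of dicts."""
    return list(csv.DictReader(source))


# Tiny stand-in for markets.csv. These rows are invented, and the column
# names are assumptions; the real file may name them differently.
toy_csv = io.StringIO(
    "question,category,volume,liquidity,slug\n"
    "Will it rain tomorrow?,weather,1200.5,300.0,rain-tomorrow\n"
    "Will team A win?,sports,98000,4100.25,team-a-win\n"
)

rows = read_rows(toy_csv)
columns = list(rows[0].keys())
print(columns)
print(len(rows), "rows")

# With the real download, pass an open file instead:
# with open("markets.csv", newline="", encoding="utf-8") as f:
#     rows = read_rows(f)
```

Every value comes back as a string with `csv.DictReader`, so convert numeric columns yourself before sorting or filtering. pandas, DuckDB, and Polars handle that type inference for you.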

What you can build in 20 minutes

  • A screener. Filter markets by category × volume × spread × days-to-resolution.
  • A volume leaderboard. markets.csv has volume, volumeNum, volume24hr — rank, sort, group by category.
  • A "lottery ticket" finder. Find low-price (<$0.10) outcomes with non-trivial liquidity. There are hundreds.
  • A category dashboard. What's the average spread in sports vs politics vs crypto? The CSV has the data.
  • A correlation map. Even with 100K snapshots, you can compute simple correlations between sub-markets in the same event (semi-final A vs final winner, for example).
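
Two of the ideas above, the "lottery ticket" finder and the volume leaderboard, can be sketched in a few lines. Treat this as an illustration under assumptions: `volumeNum` is named in the post, but `price`, `liquidity`, `category`, and `slug` are hypothetical column names, the $1,000 liquidity floor is an arbitrary threshold, and the rows are invented.

```python
def to_float(value, default=0.0):
    """Parse a CSV string as a float, falling back to a default for blanks or junk."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return default


def lottery_tickets(rows, max_price=0.10, min_liquidity=1000.0):
    """Rows priced below max_price with at least min_liquidity, most liquid first."""
    hits = [
        r for r in rows
        if 0.0 < to_float(r.get("price")) < max_price
        and to_float(r.get("liquidity")) >= min_liquidity
    ]
    return sorted(hits, key=lambda r: to_float(r.get("liquidity")), reverse=True)


def volume_by_category(rows):
    """Total volumeNum per category, largest first."""
    totals = {}
    for r in rows:
        category = r.get("category") or "unknown"
        totals[category] = totals.get(category, 0.0) + to_float(r.get("volumeNum"))
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Invented rows shaped like csv.DictReader output (all values are strings).
toy_rows = [
    {"slug": "a", "category": "politics", "price": "0.04", "liquidity": "5000", "volumeNum": "200000"},
    {"slug": "b", "category": "sports", "price": "0.08", "liquidity": "200", "volumeNum": "50000"},
    {"slug": "c", "category": "crypto", "price": "0.55", "liquidity": "9000", "volumeNum": "120000"},
    {"slug": "d", "category": "sports", "price": "0.02", "liquidity": "1500", "volumeNum": "30000"},
]

print([r["slug"] for r in lottery_tickets(toy_rows)])
print(volume_by_category(toy_rows))
```

The screener is the same pattern with more conditions: add one predicate per filter (category, spread, days to resolution) once you know which columns the real file exposes.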

What it's NOT

This is the sample, not the full historical corpus.

If you need:

  • Every 15-minute snapshot for 43+ days (10.8M+ rows)
  • Orderbook depth (1.07M+ snapshots)
  • The full SQLite database for joins
  • Continuous updates

...that's the $9 Polymarket Full Dataset on Gumroad. The Kaggle sample is the on-ramp.

How it was collected

Automated pipeline running 24/7 since March 2026:

  • Gamma API → market metadata + prices for every active market
  • CLOB API → orderbook depth (top 10 levels) for top 200 markets by volume
  • SQLite storage with indexed timestamps for analytical queries
  • Sample CSVs regenerated when the dataset is republished
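
The storage step could look roughly like the sketch below. It is not the actual collector schema: the table name, column names, and sample rows are all hypothetical. The point is only to show what "indexed timestamps" buys you, which is time-range queries that can use an index instead of scanning every row.

```python
import sqlite3

# In-memory database for the demo; a real pipeline would pass a file path.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per market per snapshot.
conn.execute(
    """
    CREATE TABLE price_snapshots (
        market_slug TEXT    NOT NULL,
        ts          INTEGER NOT NULL,  -- Unix seconds, UTC
        price       REAL    NOT NULL
    )
    """
)
# One index for "everything since time T", one for "this market over time".
conn.execute("CREATE INDEX idx_snapshots_ts ON price_snapshots (ts)")
conn.execute("CREATE INDEX idx_snapshots_market_ts ON price_snapshots (market_slug, ts)")

# Invented snapshots, 900 seconds (15 minutes) apart.
snapshots = [
    ("team-a-win", 1_700_000_000, 0.41),
    ("team-a-win", 1_700_000_900, 0.43),
    ("rain-tomorrow", 1_700_000_900, 0.12),
]
conn.executemany("INSERT INTO price_snapshots VALUES (?, ?, ?)", snapshots)
conn.commit()

# Time-range query: all prices from the latest snapshot round onward.
latest = conn.execute(
    """
    SELECT market_slug, price
    FROM price_snapshots
    WHERE ts >= ?
    ORDER BY market_slug
    """,
    (1_700_000_900,),
).fetchall()
print(latest)
```

Storing timestamps as integer Unix seconds is one choice among several; ISO 8601 strings also sort correctly in SQLite as long as they all share the same format and timezone.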

Source: protodex.io — the MCP-server directory with security scores. I built the Polymarket collector to feed a separate trading bot project; the dataset is the byproduct.

Why a Kaggle release at all

Two reasons:

  1. It's where the prediction-market quant crowd lives. Kaggle has a dedicated "prediction markets" search niche. The other dataset I have there (polymarket-historical-prices) has reached 68 downloads with zero promotion. Putting structured data where researchers already are has been higher-leverage than waiting for Reddit posts to go viral.

  2. Sample-to-paid funnels work. If you find an edge in the free sample, paying $9 for the full corpus is an easy yes. If the sample is unusable to you, you'd never have paid $9 for the full corpus anyway. That aligns incentives.

Limitations to know

  • The 100K-row price sample is the most recent 100K snapshots only — not a uniform random sample across the full 43-day window.
  • "Active markets" here means markets that were live as of the last collection run. Resolved markets aren't included in markets.csv.
  • Polymarket pulled a few markets between the snapshot and publication. Cross-check by slug if you hit any 404s.
  • The dataset is updated when I refresh it — not in real time. For live data, hit the Gamma API directly (it's free and rate-limited generously).
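
To see how much the first limitation matters, you can measure the time span the sample covers. The sketch below assumes ISO 8601 timestamp strings, and the format in prices_sample.csv may differ. The sample timestamps are invented. The back-of-envelope estimate at the end uses only numbers from this post (10.8M rows, 43 days, 15-minute cadence, 100K sample rows) and assumes an even number of rows per snapshot round, which is a simplification.

```python
from datetime import datetime


def sample_span(timestamps):
    """Return (first, last, hours_covered) for an iterable of ISO 8601 strings."""
    parsed = [datetime.fromisoformat(t) for t in timestamps]
    first, last = min(parsed), max(parsed)
    hours = (last - first).total_seconds() / 3600
    return first, last, hours


# Invented timestamps; with the real file, pass its timestamp column instead.
toy_timestamps = [
    "2026-05-01T10:00:00+00:00",
    "2026-05-01T10:15:00+00:00",
    "2026-05-01T13:00:00+00:00",
]
first, last, hours = sample_span(toy_timestamps)
print(f"sample covers {hours:.1f} hours")

# Rough estimate from the numbers in this post, assuming evenly sized rounds.
ROWS_FULL = 10_800_000          # rows in the full corpus
DAYS = 43                       # days covered by the full corpus
ROUNDS_PER_DAY = 24 * 4         # one snapshot round every 15 minutes
SAMPLE_ROWS = 100_000

rows_per_round = ROWS_FULL / (DAYS * ROUNDS_PER_DAY)
estimated_hours = (SAMPLE_ROWS / rows_per_round) * 0.25
print(f"estimated sample window: about {estimated_hours:.1f} hours")
```

Under those assumptions the sample spans hours rather than days, so anything time-based you build on it (correlations included) describes a single short window, not the full 43 days.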

License

Apache 2.0. Use commercially, modify, redistribute. Credit Protodex if you publish derived work; no obligation otherwise.

What I'd love feedback on

If you download this and build something — even a half-finished notebook — drop a comment with what you tried. I'm watching which use cases come up so I know what the V2 corpus should prioritize (more orderbook depth? more historical reach? resolved-market history? a labeled crash-event dataset?).


The dataset is live now: Polymarket Markets + Price Sample (2026) on Kaggle.

If it's useful, an upvote on Kaggle helps it surface in the prediction-markets search — that's the only ask.
