I just pushed a fresh sample of the Polymarket dataset to Kaggle. Free download, Apache 2.0, no email gate.
👉 https://www.kaggle.com/datasets/luciferforge/polymarket-markets-prices-sample-2026
What's in it
Two CSV files:
| File | Rows | What it gives you |
|---|---|---|
| `markets.csv` | 13,963 | Every currently-active Polymarket market with question, category, volume, liquidity, status, end date, slug |
| `prices_sample.csv` | 100,000 | The most recent 15-minute price snapshots across the universe; a preview of the full 10.8M-snapshot corpus |
8 MB compressed. Loads instantly in pandas, DuckDB, Polars, or any spreadsheet.
What you can build in 20 minutes
- A screener. Filter markets by category × volume × spread × days-to-resolution.
- A volume leaderboard. `markets.csv` has `volume`, `volumeNum`, `volume24hr`: rank, sort, group by category.
- A "lottery ticket" finder. Find low-price (<$0.10) outcomes with non-trivial liquidity. There are hundreds.
- A category dashboard. What's the average spread in sports vs politics vs crypto? The CSV has the data.
- A correlation map. Even with 100K snapshots, you can compute simple correlations between sub-markets in the same event (semi-final A vs final winner, for example).
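To make the first few ideas concrete, here is a minimal sketch of a screener and a "lottery ticket" finder using only the Python standard library. The column names `category`, `liquidity`, and especially `price` are assumptions for illustration (only `volume`, `volumeNum`, `volume24hr`, and `slug` are named above), and the inline rows are made up so the snippet runs without the download. Check the real header row before relying on any of it.

```python
import csv
import io

def to_float(value, default=0.0):
    """Parse a CSV cell as float; blank or malformed cells fall back to default."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return default

def screen(rows, category=None, min_volume=0.0, volume_col="volumeNum"):
    """Return rows matching a category and a minimum volume, largest volume first."""
    hits = [
        r for r in rows
        if (category is None or (r.get("category") or "").lower() == category.lower())
        and to_float(r.get(volume_col)) >= min_volume
    ]
    return sorted(hits, key=lambda r: to_float(r.get(volume_col)), reverse=True)

def lottery_tickets(rows, max_price=0.10, min_liquidity=1000.0,
                    price_col="price", liquidity_col="liquidity"):
    """Return rows priced below max_price that still carry some liquidity.

    price_col and liquidity_col are hypothetical column names.
    """
    return [
        r for r in rows
        if 0.0 < to_float(r.get(price_col)) < max_price
        and to_float(r.get(liquidity_col)) >= min_liquidity
    ]

# Made-up stand-in for markets.csv; the real file has more columns.
SAMPLE = """slug,category,volumeNum,liquidity,price
team-a-wins,Sports,50000,12000,0.07
candidate-x,Politics,900000,40000,0.55
coin-above-100k,Crypto,20000,300,0.04
team-b-wins,Sports,150000,8000,0.62
"""

markets = list(csv.DictReader(io.StringIO(SAMPLE)))
sports = screen(markets, category="sports", min_volume=10000)
cheap = lottery_tickets(markets)
print([r["slug"] for r in sports])
print([r["slug"] for r in cheap])
```

To point it at the real file, swap the `io.StringIO(SAMPLE)` reader for `csv.DictReader(open("markets.csv", newline=""))`. If prices only live in `prices_sample.csv`, the lottery-ticket filter would need a join on the market identifier first.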
What it's NOT
This is the sample, not the full historical corpus.
If you need:
- Every 15-minute snapshot for 43+ days (10.8M+ rows)
- Orderbook depth (1.07M+ snapshots)
- The full SQLite database for joins
- Continuous updates
...that's the $9 Polymarket Full Dataset on Gumroad. The Kaggle sample is the on-ramp.
How it was collected
Automated pipeline running 24/7 since March 2026:
- Gamma API → market metadata + prices for every active market
- CLOB API → orderbook depth (top 10 levels) for top 200 markets by volume
- SQLite storage with indexed timestamps for analytical queries
- Sample CSVs regenerated when the dataset is republished
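For a feel of what "SQLite storage with indexed timestamps" can look like, here is a rough sketch. It is not the actual collector schema: the table name, the column names, and the helper functions are all hypothetical, and the API fetch step is left out so the example runs offline against an in-memory database.

```python
import sqlite3

# Hypothetical schema: one row per market per snapshot, indexed by time.
SCHEMA = """
CREATE TABLE IF NOT EXISTS price_snapshots (
    slug  TEXT    NOT NULL,   -- market identifier
    ts    INTEGER NOT NULL,   -- snapshot time, Unix seconds (UTC)
    price REAL    NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_snapshots_ts ON price_snapshots (ts);
CREATE INDEX IF NOT EXISTS idx_snapshots_slug_ts ON price_snapshots (slug, ts);
"""

def open_store(path=":memory:"):
    """Open (or create) the database and make sure the schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def store_snapshots(conn, ts, prices):
    """Insert one row per market for a single snapshot. `prices` maps slug -> price."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO price_snapshots (slug, ts, price) VALUES (?, ?, ?)",
            [(slug, ts, price) for slug, price in prices.items()],
        )

def latest_prices(conn):
    """Return the most recent price per market as a dict."""
    rows = conn.execute(
        """
        SELECT s.slug, s.price
        FROM price_snapshots AS s
        JOIN (SELECT slug, MAX(ts) AS ts FROM price_snapshots GROUP BY slug) AS m
          ON m.slug = s.slug AND m.ts = s.ts
        """
    ).fetchall()
    return dict(rows)

conn = open_store()
# Two made-up snapshots, 15 minutes (900 seconds) apart.
store_snapshots(conn, 1_700_000_000, {"team-a-wins": 0.40, "candidate-x": 0.55})
store_snapshots(conn, 1_700_000_900, {"team-a-wins": 0.42, "candidate-x": 0.53})
latest = latest_prices(conn)
print(latest)
```

The composite `(slug, ts)` index is what keeps per-market time-range queries cheap as the table grows into millions of rows; the plain `ts` index helps with "everything in the last hour" style scans.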
Source: protodex.io — the MCP-server directory with security scores. I built the Polymarket collector to feed a separate trading bot project; the dataset is the byproduct.
Why a Kaggle release at all
Two reasons:
- It's where the prediction-market quant crowd lives. Kaggle has a dedicated "prediction markets" search niche. The other dataset I have there (`polymarket-historical-prices`) is sitting at 68 downloads with zero promo. Putting structured data where researchers already are has been higher-leverage than waiting for Reddit posts to go viral.
- Sample-to-paid funnels work. If you find an edge in the free sample, paying $9 for the full corpus is an easy yes. If the sample data is unusable to you, you'd never have paid $9 anyway. That aligns incentives.
Limitations to know
- The 100K-row price sample is the most recent 100K snapshots only, not a uniform random sample across the full 43-day window.
- "Active markets" here means markets that were live as of the last collection run. Resolved markets aren't included in `markets.csv`.
- Polymarket pulled a few markets between the snapshot and publish. Cross-check by `slug` if you find any 404s.
- The dataset is updated when I refresh it, not in real time. For live data, hit the Gamma API directly (it's free and rate-limited generously).
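Two of these caveats are easy to check once the files are loaded. The sketch below measures how much time the price sample actually spans, and lists slugs that have price rows but no matching market row (one possible way to do the slug cross-check). It assumes `prices_sample.csv` has `slug` and ISO 8601 `timestamp` columns, which is a guess; the rows here are made up.

```python
from datetime import datetime

def sample_window(price_rows, ts_col="timestamp"):
    """Return (earliest, latest, span_in_hours) of the snapshot timestamps."""
    stamps = [datetime.fromisoformat(r[ts_col]) for r in price_rows]
    first, last = min(stamps), max(stamps)
    return first, last, (last - first).total_seconds() / 3600

def orphan_slugs(price_rows, market_rows, key="slug"):
    """Slugs that appear in the price rows but not in the market rows."""
    known = {r[key] for r in market_rows}
    return sorted({r[key] for r in price_rows} - known)

# Made-up rows standing in for prices_sample.csv and markets.csv.
prices = [
    {"slug": "team-a-wins", "timestamp": "2026-04-20T12:00:00+00:00"},
    {"slug": "team-a-wins", "timestamp": "2026-04-20T12:15:00+00:00"},
    {"slug": "pulled-market", "timestamp": "2026-04-20T12:30:00+00:00"},
]
markets = [{"slug": "team-a-wins"}]

first, last, hours = sample_window(prices)
missing = orphan_slugs(prices, markets)
print(f"sample spans {hours:.2f} hours")
print("no market row for:", missing)
```

A short span confirms the sample is recency-biased, so anything that needs behavior across the full window (seasonality, long-horizon drift) should not be estimated from it.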
License
Apache 2.0. Use commercially, modify, redistribute. Credit Protodex if you publish derived work; no obligation otherwise.
What I'd love feedback on
If you download this and build something — even a half-finished notebook — drop a comment with what you tried. I'm watching which use cases come up so I know what the V2 corpus should prioritize (more orderbook depth? more historical reach? resolved-market history? a labeled crash-event dataset?).
The dataset is live now: Polymarket Markets + Price Sample (2026) on Kaggle.
If it's useful, an upvote on Kaggle helps it surface in the prediction-markets search — that's the only ask.