Most prediction-market datasets you find online are one-time dumps that someone scraped, posted, and abandoned. They go stale the day after they're published. I wanted a living archive of Polymarket (every market, sampled every 15 minutes, running continuously), and I wanted it to cost nothing to operate.
Here's what it looks like today, counted straight from the database this morning:
- 17,536,192 price snapshots
- 20,931 markets tracked
- 1,749,212 order-book snapshots
- ~86 days of continuous 15-minute history (and counting)
All of it runs on a single Mac, with a recurring infrastructure bill of $0/month. No cloud database, no managed queue, no Kubernetes. This post is how.
The architecture is boring on purpose
The whole thing is three moving parts:
- A collector — a Python process on a 15-minute timer (launchd, the macOS-native scheduler — no cron daemon, no external trigger). Each tick pulls the live market list from Polymarket's public API, then for each active market records the current YES/NO prices and the top of the order book.
- One SQLite file, not Postgres, not a warehouse. SQLite handles tens of millions of rows on a laptop without complaint as long as you index the access path you actually use. The entire archive is a single .db file you can scp anywhere.
- A daily exporter that dumps the SQLite tables to Parquet, so the dataset is portable and loads in one line of pandas/Polars.
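One tick of the collector writing into SQLite can be sketched roughly as below. This is a minimal illustration rather than the code behind the archive: the column names are assumptions, and `fetch_markets` is a hypothetical stub that returns hard-coded rows instead of calling Polymarket's API.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: the post names a prices table but does not show its columns.
SCHEMA = """
CREATE TABLE IF NOT EXISTS prices (
    market_id TEXT NOT NULL,
    ts        TEXT NOT NULL,   -- ISO-8601 UTC timestamp
    yes_price REAL,
    no_price  REAL
);
"""


def fetch_markets():
    """Stand-in for the real API call; returns hard-coded sample rows."""
    return [
        {"market_id": "demo-market-1", "yes": 0.62, "no": 0.38},
        {"market_id": "demo-market-2", "yes": 0.15, "no": 0.85},
    ]


def run_tick(conn, markets, now=None):
    """Append one snapshot row per market and return how many were written."""
    now = now or datetime.now(timezone.utc)
    ts = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    rows = [(m["market_id"], ts, m["yes"], m["no"]) for m in markets]
    with conn:  # one transaction per tick: all rows land or none do
        conn.executemany(
            "INSERT INTO prices (market_id, ts, yes_price, no_price) "
            "VALUES (?, ?, ?, ?)",
            rows,
        )
    return len(rows)


conn = sqlite3.connect(":memory:")  # the real archive would be a file on disk
conn.executescript(SCHEMA)
written = run_tick(conn, fetch_markets())
total = conn.execute("SELECT COUNT(*) FROM prices").fetchone()[0]
print(written, total)
```

Wrapping each tick in a single transaction keeps a crash mid-tick from leaving a half-written snapshot behind.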
That's it. The "$0/month" isn't a trick: it's a laptop you already own, the OS scheduler you already have, and a file format that doesn't bill you.
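For the scheduler piece, a launchd job is described by a property list. The sketch below builds one with Python's standard plistlib. The label and file paths are placeholders, and the post does not say which launchd key the real job uses. `StartInterval` (seconds between runs) is one way to get a 15-minute timer.

```python
import plistlib

# Hypothetical label and paths; substitute your own.
job = {
    "Label": "com.example.polymarket-collector",
    "ProgramArguments": ["/usr/bin/python3", "/Users/me/collector/collect.py"],
    "StartInterval": 900,  # seconds between runs: 15 minutes
    "StandardErrorPath": "/tmp/polymarket-collector.err",
}

# Serialize to the XML plist format launchd reads.
plist_bytes = plistlib.dumps(job)
print(plist_bytes.decode("utf-8"))
```

Saved under ~/Library/LaunchAgents and loaded with launchctl, a file like this would start the collector every 900 seconds. If the ticks need to land on exact quarter-hour marks, launchd's `StartCalendarInterval` key is the alternative to look at.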
The two decisions that actually mattered
Index the query, not the table. The naive mistake is to over-index and watch your write throughput collapse every 15 minutes. The prices table has exactly one composite index — (market_id, ts) — because the only read pattern that matters is "give me the price history of this market over this window." One index, sized to the actual query, keeps both the 15-minute writes and the backtest reads fast.
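A quick way to check that one composite index really serves the read pattern is to ask SQLite for its query plan. The sketch below uses an assumed column layout for the prices table and a made-up index name; only the (market_id, ts) index itself comes from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE prices (
    market_id TEXT NOT NULL,
    ts        TEXT NOT NULL,
    yes_price REAL,
    no_price  REAL
);
CREATE INDEX idx_prices_market_ts ON prices (market_id, ts);
""")

# The one read pattern that matters: one market's history over a time window.
query = """
SELECT ts, yes_price, no_price
FROM prices
WHERE market_id = ? AND ts >= ? AND ts < ?
ORDER BY ts
"""

plan = conn.execute(
    "EXPLAIN QUERY PLAN " + query,
    ("demo-market-1", "2026-04-01T00:00:00Z", "2026-04-08T00:00:00Z"),
).fetchall()

# The human-readable description is the fourth column of each plan row.
plan_text = " ".join(row[3] for row in plan)
print(plan_text)
```

The plan should mention `idx_prices_market_ts` and show no separate sorting step, because the index already stores each market's rows in timestamp order.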
Append, never mutate. Every snapshot is an immutable row stamped with an ISO-8601 UTC timestamp. Nothing is ever updated in place. That means the archive is a true time series: you can reconstruct the recorded top of the order book as it looked at any 15-minute mark in the last 86 days, not just the "latest state." Mutable rows would have quietly destroyed the history I was trying to capture.
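Here is a small sketch of what append-only storage buys you: an "as of" lookup that returns the snapshot in force at any past moment. The table and column names and the sample prices are invented for illustration. The trigger is an optional guard of my own choosing, not something the archive is stated to use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE book_snapshots (
    market_id TEXT NOT NULL,
    ts        TEXT NOT NULL,   -- ISO-8601 UTC, e.g. 2026-04-03T14:15:00Z
    best_bid  REAL,
    best_ask  REAL
);
-- Optional guard: make in-place edits fail loudly.
CREATE TRIGGER book_snapshots_no_update
BEFORE UPDATE ON book_snapshots
BEGIN
    SELECT RAISE(ABORT, 'book_snapshots is append-only');
END;
""")

rows = [
    ("demo-market-1", "2026-04-03T14:00:00Z", 0.60, 0.62),
    ("demo-market-1", "2026-04-03T14:15:00Z", 0.61, 0.63),
    ("demo-market-1", "2026-04-03T14:30:00Z", 0.58, 0.60),
]
conn.executemany("INSERT INTO book_snapshots VALUES (?, ?, ?, ?)", rows)


def book_as_of(conn, market_id, ts):
    """Return the latest snapshot at or before ts, or None if there is none."""
    return conn.execute(
        "SELECT ts, best_bid, best_ask FROM book_snapshots "
        "WHERE market_id = ? AND ts <= ? ORDER BY ts DESC LIMIT 1",
        (market_id, ts),
    ).fetchone()


# 14:20 falls between two ticks, so the 14:15 snapshot is the one in force.
snapshot = book_as_of(conn, "demo-market-1", "2026-04-03T14:20:00Z")
print(snapshot)

try:
    conn.execute("UPDATE book_snapshots SET best_bid = 0.99")
    update_blocked = False
except sqlite3.DatabaseError:
    update_blocked = True
print(update_blocked)
```

Fixed-width ISO-8601 UTC strings sort in chronological order, which is why a plain text comparison on `ts` is enough here.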
Why this is hard to copy (and why that matters if you trade)
The dataset isn't valuable because the code is clever — it's three boring parts. It's valuable because of the one thing you cannot backfill: time. You cannot retroactively collect the order book as it stood on April 3rd at 14:15 UTC. Either a process was running and recording it, or that moment is gone forever.
So if you want 86 days of 15-minute Polymarket history to backtest a mean-reversion or calibration strategy, you have two options: stand up a collector today and wait three months, or start from the archive that's already been running since March 28th.
Get the data
- Free sample + schema + loader: github.com/LuciferForge/polymarket-historical-data — grab the sample, check the schema, run the example query.
- Full archive (17.5M snapshots, one-time): on Gumroad.
If you'd want this kept fresh automatically — a quarterly refresh so your backtests never run on a frozen file — that's the thing I'm deciding whether to build next. There's an open roadmap thread on the repo; tell me what cadence and format you'd actually use.
All figures in this post were counted from the live database on 2026-06-22 and are not estimates.