18.6 million Polymarket price snapshots, collected on one Mac for $0/month

#python #datascience #sqlite #opensource

Most prediction-market datasets you find online are one-time dumps that someone scraped, posted, and abandoned. They go stale the day after they're published. I wanted a living archive of Polymarket (every market, sampled every 15 minutes, running continuously), and I wanted it to cost nothing to operate.

Here's what the archive holds today, counted straight from the database:

18,611,636 price snapshots
22,410 markets tracked
~92 days of continuous 15-minute history (and counting)

All of it runs on a single Mac, with a recurring infrastructure bill of $0/month. No cloud database, no managed queue, no Kubernetes. This post is how.

The architecture is boring on purpose

The whole thing is three moving parts:

A collector — a Python process on a 15-minute timer (launchd, the macOS-native scheduler — no cron daemon, no external trigger). Each tick pulls the live market list from Polymarket's public API, then for each active market records the current YES/NO prices and, where one exists, the top of the order book.
One SQLite file — not Postgres, not a warehouse. SQLite handles tens of millions of rows on a laptop without complaint as long as you index the access path you actually use. The entire archive is a single .db file you can scp anywhere.
A daily exporter — dumps the SQLite tables to Parquet so the dataset is portable and loads in one line of pandas/Polars.

That's it. The "$0/month" isn't a trick: it's a laptop you already own, the OS scheduler you already have, and a file format that doesn't bill you.
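
For the scheduling piece, here is a minimal sketch of what a 15-minute launchd job could look like, generated with Python's standard plistlib. The label, interpreter path, script path, and log path are all hypothetical placeholders, not the archive's actual configuration; StartInterval is the launchd key that sets the number of seconds between runs.

```python
import plistlib

# Hypothetical job definition: label and paths are placeholders.
job = {
    "Label": "com.example.polymarket-collector",
    "ProgramArguments": ["/usr/bin/python3", "/Users/me/collector/collect.py"],
    "StartInterval": 900,  # seconds between runs: 15 minutes
    "RunAtLoad": True,     # also run once when the job is loaded
    "StandardErrorPath": "/tmp/polymarket-collector.err",
}

# Serialize to the XML property-list format that launchd reads.
plist_bytes = plistlib.dumps(job)
print(plist_bytes.decode("utf-8"))
```

Saved as a .plist file under ~/Library/LaunchAgents/ and loaded with launchctl, a job like this would be started by the OS itself, with no long-running Python daemon to babysit.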

One honest caveat: the price series is the real product, not the book

The collector also samples top-of-book (best bid / best ask) every tick — but be clear-eyed about it: most Polymarket markets are thin, so a live two-sided book simply doesn't exist at most sample times. Counted straight from the DB, only about 6% of the 1.86M book rows captured a real two-sided quote; the rest are markets with no live book at that moment. So treat the dense, reliable layer as the price time series (18.6M real snapshots) — that's what you backtest on. The book samples are a sparse bonus for the handful of liquid markets, not a full reconstructable order book for everything. I'd rather tell you that up front than have you discover it after download.
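
If you want to check that share on your own copy, a query along these lines would do it. This is a sketch under assumptions: the table name book and the nullable best_bid / best_ask columns are hypothetical, so check the published schema for the real names. The rows below are made-up toy data, not figures from the archive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical book table: a missing side of the book is stored as NULL.
conn.execute("""
    CREATE TABLE book (
        market_id TEXT NOT NULL,
        ts        TEXT NOT NULL,
        best_bid  REAL,
        best_ask  REAL
    )
""")

# Toy rows: one two-sided quote, two one-sided, one empty.
conn.executemany("INSERT INTO book VALUES (?, ?, ?, ?)", [
    ("m1", "2026-04-03T14:15:00Z", 0.41, 0.43),
    ("m2", "2026-04-03T14:15:00Z", None, 0.90),
    ("m3", "2026-04-03T14:15:00Z", None, None),
    ("m4", "2026-04-03T14:15:00Z", 0.10, None),
])

# In SQLite a boolean expression evaluates to 0 or 1, so SUM counts matches.
total, two_sided = conn.execute("""
    SELECT COUNT(*),
           SUM(best_bid IS NOT NULL AND best_ask IS NOT NULL)
    FROM book
""").fetchone()

share = two_sided / total
print(f"{two_sided} of {total} book rows are two-sided ({share:.0%})")
```

Filtering backtests to rows that pass this two-sided check is one way to avoid treating an empty book as a tradable quote.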

Index the query, not the table. The naive mistake is to over-index and watch your write throughput collapse every 15 minutes. The prices table has exactly one composite index — (market_id, ts) — because the only read pattern that matters is "give me the price history of this market over this window." One index, sized to the actual query, keeps both the 15-minute writes and the backtest reads fast.
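
A minimal sketch of that idea, using an in-memory database: the (market_id, ts) index is the one described above, but the table layout, column names, and index name here are assumptions for illustration. EXPLAIN QUERY PLAN is SQLite's way of showing whether a query actually uses the index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical prices table; only (market_id, ts) comes from the post.
conn.execute("""
    CREATE TABLE prices (
        market_id TEXT NOT NULL,
        ts        TEXT NOT NULL,
        yes_price REAL,
        no_price  REAL
    )
""")

# The single composite index, matched to the one read pattern that matters.
conn.execute("CREATE INDEX idx_prices_market_ts ON prices (market_id, ts)")

# "Price history of this market over this window."
query = """
    SELECT ts, yes_price FROM prices
    WHERE market_id = ? AND ts >= ? AND ts < ?
    ORDER BY ts
"""
params = ("m1", "2026-04-01T00:00:00Z", "2026-04-08T00:00:00Z")

# Each plan row ends with a human-readable description of the step.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, params).fetchall()
details = [row[-1] for row in plan]
for detail in details:
    print(detail)
```

If the plan names the index rather than reporting a full table scan, the query walks only the rows for that market and window. Every extra index beyond this one has to be updated on every insert, which is the write cost the paragraph above warns about.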

Append, never mutate. Every snapshot is an immutable row stamped with an ISO-8601 UTC timestamp. Nothing is ever updated in place. That means the archive is a true time series — you can reconstruct the YES/NO price as it stood at any 15-minute mark in the last 92 days, not just "latest state." Mutable rows would have quietly destroyed the history I was trying to capture.
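
The append-only pattern and the "as it stood at time T" lookup can be sketched like this. The schema and the helper names record and price_as_of are hypothetical, and the prices are toy values. Comparing timestamps as strings only works because every row uses the same ISO-8601 UTC format, which sorts chronologically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema, for illustration only.
conn.execute("""
    CREATE TABLE prices (
        market_id TEXT NOT NULL,
        ts        TEXT NOT NULL,
        yes_price REAL,
        no_price  REAL
    )
""")
conn.execute("CREATE INDEX idx_prices_market_ts ON prices (market_id, ts)")


def record(conn, market_id, ts, yes_price, no_price):
    """Append one snapshot. There is deliberately no UPDATE path."""
    conn.execute(
        "INSERT INTO prices VALUES (?, ?, ?, ?)",
        (market_id, ts, yes_price, no_price),
    )


def price_as_of(conn, market_id, ts):
    """Return the latest (ts, yes, no) snapshot at or before ts, or None."""
    return conn.execute(
        "SELECT ts, yes_price, no_price FROM prices "
        "WHERE market_id = ? AND ts <= ? "
        "ORDER BY ts DESC LIMIT 1",
        (market_id, ts),
    ).fetchone()


# Three consecutive 15-minute ticks for one toy market.
record(conn, "m1", "2026-04-03T14:00:00Z", 0.40, 0.60)
record(conn, "m1", "2026-04-03T14:15:00Z", 0.42, 0.58)
record(conn, "m1", "2026-04-03T14:30:00Z", 0.47, 0.53)

# Ask what the market looked like at 14:20: the 14:15 snapshot applies.
snapshot = price_as_of(conn, "m1", "2026-04-03T14:20:00Z")
print(snapshot)
```

Because earlier rows are never overwritten, the same lookup returns the same answer no matter how many later ticks have been appended.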

Why this is hard to copy (and why that matters if you trade)

The dataset isn't valuable because the code is clever — it's three boring parts. It's valuable because of the one thing you cannot backfill: time. You cannot retroactively collect the price as it stood on April 3rd at 14:15 UTC. Either a process was running and recording it, or that moment is gone forever.

So if you want 92 days of 15-minute Polymarket history to backtest a mean-reversion or calibration strategy, you have two options: stand up a collector today and wait three months, or start from the archive that's already been running since March 28th.

Get the data

Free sample + schema + loader: github.com/LuciferForge/polymarket-historical-data — grab the sample, check the schema, run the example query before you commit to anything.
Full dataset (one-time download): on Gumroad.

If you want this kept fresh automatically (a recurring refresh so your backtests never run on a frozen file), that's the thing I'm deciding whether to build next. There's an open roadmap thread on the repo; tell me what cadence and format you'd actually use.

All figures in this post were counted from the database export built on 2026-06-28 and are not estimates.