DEV Community

manja316


17 million Polymarket price snapshots, collected on one Mac for $0/month

Most prediction-market datasets you find online are one-time dumps that someone scraped, posted, and abandoned. They go stale the day after they're published. I wanted a living archive of Polymarket (every market, sampled every 15 minutes, running continuously), and I wanted it to cost nothing to operate.

Here's what it looks like today, counted straight from the database this morning:

  • 17,536,192 price snapshots
  • 20,931 markets tracked
  • 1,749,212 order-book snapshots
  • ~86 days of continuous 15-minute history (and counting)

All of it runs on a single Mac, with a recurring infrastructure bill of $0/month. No cloud database, no managed queue, no Kubernetes. This post is how.

The architecture is boring on purpose

The whole thing is three moving parts:

  1. A collector — a Python process on a 15-minute timer (launchd, the macOS-native scheduler — no cron daemon, no external trigger). Each tick pulls the live market list from Polymarket's public API, then for each active market records the current YES/NO prices and the top of the order book.
  2. One SQLite file — not Postgres, not a warehouse. SQLite handles tens of millions of rows on a laptop without complaint as long as you index the access path you actually use. The entire archive is a single .db file you can scp anywhere.
  3. A daily exporter — dumps the SQLite tables to Parquet so the dataset is portable and loads in one line of pandas/Polars.
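To make the collector step concrete, here is a minimal sketch of what a single 15-minute tick could look like. Everything named here is an assumption for illustration: the `prices` table layout, the `run_tick` function, and the shape of the market dictionaries are hypothetical, and `fake_fetch` stands in for the real call to Polymarket's public API, which is not shown. Order-book capture is left out to keep it short.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: the post does not publish the real table layout.
SCHEMA = """
CREATE TABLE IF NOT EXISTS prices (
    market_id TEXT NOT NULL,
    ts        TEXT NOT NULL,   -- ISO-8601 UTC timestamp
    yes_price REAL NOT NULL,
    no_price  REAL NOT NULL
);
"""


def utc_now_iso():
    """Current time as a second-resolution ISO-8601 UTC string."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def run_tick(conn, fetch_markets, now=utc_now_iso):
    """Record one snapshot per active market; return the number of rows written."""
    ts = now()  # one timestamp for the whole tick, so rows line up across markets
    rows = [
        (m["id"], ts, m["yes"], m["no"])
        for m in fetch_markets()
        if m.get("active")
    ]
    with conn:  # a single transaction per tick keeps the write cheap
        conn.executemany(
            "INSERT INTO prices (market_id, ts, yes_price, no_price) "
            "VALUES (?, ?, ?, ?)",
            rows,
        )
    return len(rows)


def fake_fetch():
    """Stand-in for the real API call (hypothetical payload shape)."""
    return [
        {"id": "m1", "yes": 0.62, "no": 0.38, "active": True},
        {"id": "m2", "yes": 0.10, "no": 0.90, "active": False},
    ]


conn = sqlite3.connect(":memory:")  # a real collector would open the .db file
conn.executescript(SCHEMA)
written = run_tick(conn, fake_fetch, now=lambda: "2026-04-03T14:15:00Z")
```

Passing the fetch function and the clock in as arguments is a design choice of this sketch, not something the post describes; it makes the tick easy to exercise without network access.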

That's it. The "$0/month" isn't a trick — a laptop you already own, the OS scheduler you already have, and a file format that doesn't bill you.

The two decisions that actually mattered

Index the query, not the table. The naive mistake is to over-index and watch your write throughput collapse every 15 minutes. The prices table has exactly one composite index — (market_id, ts) — because the only read pattern that matters is "give me the price history of this market over this window." One index, sized to the actual query, keeps both the 15-minute writes and the backtest reads fast.
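As a rough illustration of the single composite index, the sketch below builds a toy `prices` table, runs the "history of this market over this window" query, and asks SQLite for its query plan. The column names, the index name `idx_prices_market_ts`, and the helper `price_history` are assumptions; only the `(market_id, ts)` index itself comes from the post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE prices (
    market_id TEXT NOT NULL,
    ts        TEXT NOT NULL,
    yes_price REAL NOT NULL,
    no_price  REAL NOT NULL
);
-- The one composite index, matching the one read pattern.
CREATE INDEX idx_prices_market_ts ON prices (market_id, ts);
""")

conn.executemany("INSERT INTO prices VALUES (?, ?, ?, ?)", [
    ("m1", "2026-04-03T14:00:00Z", 0.60, 0.40),
    ("m1", "2026-04-03T14:15:00Z", 0.62, 0.38),
    ("m1", "2026-04-03T14:30:00Z", 0.61, 0.39),
    ("m2", "2026-04-03T14:15:00Z", 0.10, 0.90),
])

HISTORY_SQL = (
    "SELECT ts, yes_price FROM prices "
    "WHERE market_id = ? AND ts >= ? AND ts < ? ORDER BY ts"
)


def price_history(conn, market_id, start, end):
    """Price history of one market over the half-open window [start, end)."""
    return conn.execute(HISTORY_SQL, (market_id, start, end)).fetchall()


history = price_history(
    conn, "m1", "2026-04-03T14:00:00Z", "2026-04-03T14:30:00Z"
)

# Ask SQLite how it would run the query; the last column is the plan text.
plan = conn.execute(
    "EXPLAIN QUERY PLAN " + HISTORY_SQL, ("m1", "a", "b")
).fetchall()
plan_text = " ".join(row[3] for row in plan)
print(plan_text)
```

The printed plan should mention `idx_prices_market_ts`, which is a quick way to confirm that the equality on `market_id` and the range on `ts` are both served by the index instead of a full table scan.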

Append, never mutate. Every snapshot is an immutable row stamped with an ISO-8601 UTC timestamp. Nothing is ever updated in place. That means the archive is a true time series — you can reconstruct the order book as it looked at any 15-minute mark in the last 86 days, not just "latest state." Mutable rows would have quietly destroyed the history I was trying to capture.
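One way to picture what append-only buys you is an "as of" lookup: the latest row at or before a chosen moment. The sketch below uses a simplified, hypothetical `prices` table rather than the order-book table, and the helper `price_as_of` is illustrative. The two triggers are my own addition, not something the post describes; they simply make the database reject updates and deletes. Comparing timestamps as strings works here only because every row uses the same fixed-width ISO-8601 UTC format.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE prices (
    market_id TEXT NOT NULL,
    ts        TEXT NOT NULL,   -- fixed-width ISO-8601 UTC, so text order = time order
    yes_price REAL NOT NULL
);
-- Optional guard (an assumption of this sketch): refuse in-place changes.
CREATE TRIGGER prices_no_update BEFORE UPDATE ON prices
BEGIN SELECT RAISE(ABORT, 'prices is append-only'); END;
CREATE TRIGGER prices_no_delete BEFORE DELETE ON prices
BEGIN SELECT RAISE(ABORT, 'prices is append-only'); END;
""")

conn.executemany("INSERT INTO prices VALUES (?, ?, ?)", [
    ("m1", "2026-04-03T14:00:00Z", 0.60),
    ("m1", "2026-04-03T14:15:00Z", 0.62),
    ("m1", "2026-04-03T14:30:00Z", 0.61),
])


def price_as_of(conn, market_id, as_of):
    """Latest snapshot at or before `as_of`, or None if there is none yet."""
    return conn.execute(
        "SELECT ts, yes_price FROM prices "
        "WHERE market_id = ? AND ts <= ? "
        "ORDER BY ts DESC LIMIT 1",
        (market_id, as_of),
    ).fetchone()


snapshot = price_as_of(conn, "m1", "2026-04-03T14:20:00Z")

# An in-place edit is rejected by the trigger instead of rewriting history.
try:
    conn.execute("UPDATE prices SET yes_price = 0.99 WHERE market_id = 'm1'")
    update_blocked = False
except sqlite3.DatabaseError:
    update_blocked = True
```

Because no row is ever overwritten, the same lookup returns the same answer no matter how many later snapshots have been appended.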

Why this is hard to copy (and why that matters if you trade)

The dataset isn't valuable because the code is clever — it's three boring parts. It's valuable because of the one thing you cannot backfill: time. You cannot retroactively collect the order book as it stood on April 3rd at 14:15 UTC. Either a process was running and recording it, or that moment is gone forever.

So if you want 86 days of 15-minute Polymarket history to backtest a mean-reversion or calibration strategy, you have two options: stand up a collector today and wait three months, or start from the archive that's already been running since March 28th.

Get the data

If you'd like this kept fresh automatically (a quarterly refresh, so your backtests never run on a frozen file), that's the thing I'm deciding whether to build next. There's an open roadmap thread on the repo; tell me what cadence and format you'd actually use.

All figures in this post were counted from the live database on 2026-06-22 and are not estimates.
