If you want to study mean-reversion on prediction markets, the data you actually need does not exist publicly. Most "Polymarket datasets" are either:
- Synthetic — generated for academic papers, no real money behind them.
- Aggregate — hourly volume and last-price across thousands of markets. Useless for tactical signal research.
So I built one and open-sourced it: cross-signal-data.
```bash
pip install cross-signal-data
```

```python
from cross_signal_data import load

df = load()                        # pandas DataFrame, 308 rows
print(df["is_profitable"].mean())  # 0.802
```
These are the labeled outcomes of 308 closed trades from a live Polymarket crash-recovery bot, with the signal features and the resolved outcome for each trade.
Also mirrored on HuggingFace: huggingface.co/datasets/LuciferForge/cross-signal-data.
## What's in the dataset
19 columns, one row per closed trade:
| Column | Description |
|---|---|
| `trade_id` | Sequential, 0-indexed |
| `market_id` | Polymarket market ID (queryable via gamma-api.polymarket.com) |
| `question` | Market question text |
| `outcome_label` | YES/NO outcome the bot bet on |
| `entry_time` | When the crash signal fired (ISO-8601 UTC) |
| `exit_time` | When the position closed |
| `entry_price` | Per-share price at entry (0–1; Polymarket prices are probabilities) |
| `exit_price` | Per-share price at exit |
| `pre_crash_high` | Recent local-window high before the crash trigger |
| `drop_pct` | (pre_crash_high − entry_price) / pre_crash_high × 100 |
| `size_usd` | USD allocated (typically $5) |
| `shares` | Share count purchased |
| `hold_hours` | Wall-clock hours from entry to exit |
| `pnl_usd` | Realized P&L (theoretical; see below) |
| `is_profitable` | 1 if pnl_usd > 0, else 0 |
| `exit_reason` | RECOVERY / TIMEOUT_48H / TIMEOUT |
| `entry_hour_utc` | Hour of day at entry |
| `entry_dow` | Day of week at entry (0 = Monday) |
| `recovered_to_pct_of_high` | exit_price / pre_crash_high × 100 |
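The two derived columns follow directly from the price columns, so they are easy to sanity-check. A minimal sketch on toy rows that mimic the schema (these are made-up numbers, not real trades; the real frame comes from `cross_signal_data.load()`):

```python
import pandas as pd

# Toy rows mimicking the dataset schema (not real trades)
df = pd.DataFrame({
    "entry_price":    [0.12, 0.20],
    "exit_price":     [0.27, 0.15],
    "pre_crash_high": [0.30, 0.25],
    "pnl_usd":        [1.25, -0.50],
})

# Recompute the derived columns exactly as defined in the table above
df["drop_pct"] = (df["pre_crash_high"] - df["entry_price"]) / df["pre_crash_high"] * 100
df["recovered_to_pct_of_high"] = df["exit_price"] / df["pre_crash_high"] * 100
df["is_profitable"] = (df["pnl_usd"] > 0).astype(int)

print(df[["drop_pct", "recovered_to_pct_of_high", "is_profitable"]])
```

Recomputing these against the shipped columns is a cheap integrity check before building anything on top.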
## Aggregate stats
- 308 trades, 247 profitable (80.2% WR)
- Date range: March 2026 – April 2026
- Median hold: ~3 hours
- Average drop_pct at entry: ~22%
- Average recovery: ~85% of pre-crash high
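All of the headline numbers above reduce to one-liners on the frame. A sketch of the calls, on a toy stand-in with the same column names (real values come from `load()`):

```python
import pandas as pd

# Toy stand-in for cross_signal_data.load() (not the real 308 trades)
df = pd.DataFrame({
    "is_profitable":             [1, 1, 1, 0, 1],
    "hold_hours":                [2.0, 3.0, 48.0, 48.0, 1.5],
    "drop_pct":                  [25.0, 30.0, 21.0, 40.0, 22.0],
    "recovered_to_pct_of_high":  [92.0, 90.0, 40.0, 35.0, 91.0],
})

print(len(df), df["is_profitable"].mean())    # trade count, win rate
print(df["hold_hours"].median())              # median hold
print(df["drop_pct"].mean())                  # average drop at entry
print(df["recovered_to_pct_of_high"].mean())  # average recovery
```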
## Exit reason distribution
| Reason | Count | What it means |
|---|---|---|
| RECOVERY | 235 | Price climbed back to ~90% of pre-crash high. Took profit. |
| TIMEOUT_48H | 62 | Held 48 hours without recovery. Sold at whatever the bid offered. |
| TIMEOUT | 11 | Older shorter-window timeout from earlier in the dataset. |
Sports markets where the team had already lost the underlying game often end up in TIMEOUT_48H. So do political markets that crashed because the resolution fundamentals shifted, not just because of momentary panic. The bot's job is to filter those out before entering; the dataset shows where the filter fails.
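One quick way to see where the filter fails is to split win rate by exit path. A sketch on toy rows (the real counts are 235 / 62 / 11, as in the table above):

```python
import pandas as pd

# Toy rows; the real frame comes from cross_signal_data.load()
df = pd.DataFrame({
    "exit_reason":   ["RECOVERY", "RECOVERY", "RECOVERY", "TIMEOUT_48H", "TIMEOUT"],
    "is_profitable": [1, 1, 1, 0, 0],
})

# Count and win rate per exit path in one pass
summary = df.groupby("exit_reason")["is_profitable"].agg(count="count", win_rate="mean")
print(summary)
```

On the real data this makes the asymmetry explicit: RECOVERY exits carry the win rate, the timeout buckets carry the losses.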
## How I used it
I loaded the data with the bundled loader, ran a logistic regression and a random forest, and got 79.9% cross-validated accuracy from 7 features:
| Feature | RF importance |
|---|---|
| `drop_pct` | 0.254 |
| `shares` | 0.200 |
| `entry_price` | 0.174 |
| `pre_crash_high` | 0.171 |
| `entry_hour_utc` | 0.110 |
| `entry_dow` | 0.059 |
| `size_usd` | 0.031 |
Translation: the bot's trigger filter is doing 100% of the work. A simple model that just learns "crashes with bigger drop_pct in the right time-of-day window are more likely to recover" basically reproduces the bot's actual win rate. There's no obvious feature engineering trick that beats the trigger.
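The experiment above is a standard cross-validation plus feature-importance run. A sketch of the shape of it, on synthetic stand-in features (NOT the real data, so the scores here mean nothing; it assumes scikit-learn is installed):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 308

# Synthetic stand-ins for the 7 features (same order as the table above)
X = np.column_stack([
    rng.uniform(20, 60, n),       # drop_pct
    rng.uniform(10, 120, n),      # shares
    rng.uniform(0.04, 0.30, n),   # entry_price
    rng.uniform(0.05, 0.40, n),   # pre_crash_high
    rng.integers(0, 24, n),       # entry_hour_utc
    rng.integers(0, 7, n),        # entry_dow
    np.full(n, 5.0),              # size_usd
])
# Label loosely tied to drop_pct, mimicking "bigger drops recover more often"
y = (X[:, 0] + rng.normal(0, 10, n) > 38).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(rf, X, y, cv=5)
print(f"cross-validated accuracy: {scores.mean():.3f}")

rf.fit(X, y)
for name, imp in zip(
    ["drop_pct", "shares", "entry_price", "pre_crash_high",
     "entry_hour_utc", "entry_dow", "size_usd"],
    rf.feature_importances_,
):
    print(f"{name:15s} {imp:.3f}")
```

Swap the synthetic `X`/`y` for the columns from `load()` and you reproduce the table above.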
The diurnal pattern is interesting. Hours 16, 21, 22 UTC have ~100% WR (small samples). Hour 8 UTC dips to ~55%. Off-peak hours (when US/EU traders are asleep, books are thin) are punishing.
```python
df.groupby("entry_hour_utc")["is_profitable"].mean()
```
Run that yourself and see.
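Since several of those hours are thin, it helps to pull the sample count alongside the mean so ~100%-WR-on-three-trades hours are obvious. A sketch on toy data (the real frame comes from `load()`):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Toy stand-in with the two relevant columns
df = pd.DataFrame({
    "entry_hour_utc": rng.integers(0, 24, 300),
    "is_profitable":  rng.integers(0, 2, 300),
})

# Count and win rate per hour, so thin hours are visible next to their WR
by_hour = df.groupby("entry_hour_utc")["is_profitable"].agg(n="count", win_rate="mean")
print(by_hour)
```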
## Important caveat: theoretical P&L vs on-chain P&L
The pnl_usd column is theoretical — computed from the bot's recorded entry_price and exit_price. This assumes you got every share filled at those prices. In practice on thin Polymarket books, fills come in slightly worse, especially for TIMEOUT exits.
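To make the assumption concrete: theoretical P&L is just shares times the recorded price delta, and a per-share slippage haircut on both legs eats into it directly. A sketch with assumed numbers (the prices and 1-cent haircut here are illustrative, not from the dataset):

```python
# One trade's theoretical P&L from recorded prices (assumed numbers)
shares = 40.0
entry_price, exit_price = 0.125, 0.270

theoretical = shares * (exit_price - entry_price)

# A per-share slippage haircut on both legs (1 cent assumed here):
# you buy slightly above the recorded entry and sell slightly below the recorded exit
slip = 0.01
actual = shares * ((exit_price - slip) - (entry_price + slip))

print(round(theoretical, 2), round(actual, 2))
```

On penny-priced shares even a 1-cent haircut per side is a large fraction of the edge, which is why the aggregate gap below is so big.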
I built a separate audit tool that reconciles the bot's records against on-chain fills: pnl-truthteller. On this same 308-trade dataset, it surfaces:
```
Theoretical P&L:  +$33.49
Actual P&L:       -$89.01
Slippage cost:   -$122.50  (-365.8% of theoretical)
```
So the bot has an 80.2% trigger-level win rate but is underwater once slippage is included. That gap is worth more than the trigger itself: it tells you the exit ladder strategy was walking thin books down. Interesting research question, and exactly the kind of thing the dataset enables.
```bash
pip install pnl-truthteller
pnl-truthteller --wallet 0xYourProxyAddress
```
If you build a strategy on top of the dataset, run pnl-truthteller against your live wallet too. Otherwise you'll think you're profitable when you aren't.
## What this dataset is good for
- **Mean-reversion alpha studies**: does crash-recovery actually work? At what drop_pct does it start working? The data has all the inputs.
- **Time-of-day effects**: cross `entry_hour_utc` with `is_profitable` to reveal diurnal patterns.
- **Hold-time tradeoffs**: the win-rate vs hold-hours curve is in here.
- **Feature-engineering exercises**: if you can predict `is_profitable` with better than 80% accuracy from these features, you've found something.
- **Backtesting frameworks**: real labeled data with real prices, suitable for cross-validation.
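The hold-time tradeoff mentioned above is a one-liner with binned hours. A sketch on toy data (bin edges are my choice, not from the dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
# Toy stand-in; the real columns come from cross_signal_data.load()
df = pd.DataFrame({
    "hold_hours":    rng.exponential(5.0, 300).clip(0.1, 48.0),
    "is_profitable": rng.integers(0, 2, 300),
})

# Bucket hold time, then win rate and count per bucket
bins = pd.cut(df["hold_hours"], [0, 1, 3, 6, 12, 24, 48])
curve = df.groupby(bins, observed=True)["is_profitable"].agg(n="count", win_rate="mean")
print(curve)
```

On the real data the long-hold buckets are dominated by the timeout exits, so the curve falls off sharply past the median hold.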
## What it's NOT good for
- General Polymarket research. Too narrow a slice (one bot, one signal, two months).
- High-frequency studies. Only entry/exit timestamps, not tick-level.
- Counterfactuals ("what would a different bot have done?"). Only triggered trades are recorded.
## Known biases
1. **Survivorship in the trigger.** The dataset only contains markets where the trigger fired (>20% drop, $0.04–$0.30 entry range). A different threshold would surface different markets.
2. **Selection in the entry-price band.** Most rows are concentrated in $0.04–$0.30. Markets that crashed from $0.80 → $0.50 are absent (above the range); markets at $0.02 are absent (below the floor).
3. **Theoretical PnL ≠ realized PnL.** See the caveat above; use pnl-truthteller for slippage-adjusted analysis.
4. **Time period.** March–April 2026, which includes one Polymarket V1 → V2 migration window, political events specific to the period, and Polygon-specific gas conditions.
Don't assume the patterns extrapolate forward indefinitely. Re-run the dataset extraction quarterly as it grows.
## Reproducibility
The script that generated the dataset from the bot's positions.json is checked in: scripts/extract.py. Anyone with the bot's source data can rerun it and get the same output.
```bash
git clone https://github.com/LuciferForge/cross-signal-data
cd cross-signal-data
python scripts/extract.py \
    --positions /path/to/positions.json \
    --output data/crashes_v1.csv
```
The dataset file is also bundled inside the pip package — cross_signal_data.load() returns the data without any external download.
## License & citation
MIT. Use it, fork it, train on it, build a competitor strategy. The chain is public; the data is public; the code is public.
If you publish research using it:
```bibtex
@dataset{cross_signal_data_2026,
  title  = {cross-signal-data: Polymarket crash-recovery labeled dataset},
  author = {LuciferForge},
  year   = {2026},
  url    = {https://github.com/LuciferForge/cross-signal-data}
}
```
## Resources
- Repo: github.com/LuciferForge/cross-signal-data
- PyPI: `pip install cross-signal-data` (pypi.org/project/cross-signal-data)
- HuggingFace mirror: huggingface.co/datasets/LuciferForge/cross-signal-data
- Slippage audit tool: `pip install pnl-truthteller` (GitHub)
- Bot source: github.com/LuciferForge/polymarket-crash-bot, the same bot that produced this data
If you build a model that beats 80% on this dataset, I want to know what feature you used. The bot's edge is mine until someone finds a better one.
LuciferForge runs a public-audited Polymarket trading bot, protodex.io (5,800+ MCP servers indexed), and the free Polymarket data API.