Blockchain Rust Engineer

Posted on Jun 27

I Backtested My Polymarket Bot Against Real Order Book Data. Here's What I Found.

#rust #trading #polymarket #backtesting

Six months of running a Rust trading bot live, then replaying historical tick data through the same code. The results were not what I expected.

The bot had been running for six months. It felt like it was working. P&L was positive. Win rate looked decent. And then I actually backtested it.

Turns out I had been lucky on a few large positions that masked two strategies that were quietly bleeding. I wouldn't have known without replaying real historical data through the same code that runs in production.

This is how I built the backtester, what data I used, and what I found.

The data problem

First thing I learned: Polymarket has no historical endpoint on its public API. The CLOB API gives you the live order book state and a WebSocket feed for real-time updates, but nothing historical. So you can't just hit an endpoint and get six months of order book history.

Three options I found:

pmdata.dev - tick-level L2 order book data from February 2026 onward, served as parquet files. One HTTP request per market slug downloads the whole history. This is what I ended up using.

telonex.io - similar, tick-by-tick order book depth captured on every change, not interval-sampled. More complete but costs more.

Polygon subgraph - on-chain trade fills only, no order book depth. Fine for trade history, useless if you want to simulate fills against the actual book.

For backtesting a strategy that cares about spread and order book depth, you need L2 data. Trade-only data hides slippage and gives you an unrealistically clean picture.

Download looks like this:

// fetch historical parquet for a market slug
let slug = "btc-updown-5m-1778803200";
let url = format!("https://api.pmdata.dev/download/poly_l2/{}.parquet", slug);

// stream it down, deserialize into your event format
// pmdata returns: timestamp, side, price, size

I wrote a small Rust binary that downloads the parquet, deserializes it into the same FeedEvent enum the live bot uses, and writes it to a local file. One-time setup per market.
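The mapping step can be sketched roughly like this. It's a minimal, stdlib-only sketch: `RawRow`, the column types, the accepted side strings, and the single `BookUpdate` variant are all assumptions for illustration (the only thing known about the schema is the four column names above), and it uses `f64` where the real bot would use a decimal type.

```rust
// Hypothetical row shape: only the four column names are known,
// so the types and the side encoding here are assumptions.
#[derive(Debug, Clone)]
pub struct RawRow {
    pub timestamp_us: i64,
    pub side: String,
    pub price: f64,
    pub size: f64,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Side {
    Bid,
    Ask,
}

// Stand-in for the bot's real FeedEvent enum (the real variants are not shown in the post).
#[derive(Debug, Clone, PartialEq)]
pub enum FeedEvent {
    BookUpdate { ts_us: i64, side: Side, price: f64, size: f64 },
}

/// Convert one raw row into a feed event, rejecting rows that look corrupt.
pub fn to_feed_event(row: &RawRow) -> Result<FeedEvent, String> {
    let side = match row.side.to_ascii_lowercase().as_str() {
        "bid" | "buy" => Side::Bid,
        "ask" | "sell" => Side::Ask,
        other => return Err(format!("unknown side: {}", other)),
    };
    // Outcome shares trade between 0 and 1; NaN also fails this check.
    if !(0.0..=1.0).contains(&row.price) {
        return Err(format!("price out of range: {}", row.price));
    }
    if !row.size.is_finite() || row.size < 0.0 {
        return Err(format!("bad size: {}", row.size));
    }
    Ok(FeedEvent::BookUpdate {
        ts_us: row.timestamp_us,
        side,
        price: row.price,
        size: row.size,
    })
}
```

Rejecting bad rows at this boundary keeps the strategy code free of data-cleaning branches, which matters when the same code runs live.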

The backtester structure

The key decision was to not write a separate backtester. The live bot already has a clean interface:

FeedEvent → strategy engine → Option<Signal>

The feed task is the only thing that changes between live and backtest. In production it reads from a WebSocket. In backtest it reads from a file. Everything downstream - order book, strategy, risk manager - is identical.

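use std::path::PathBuf;
// assumed: tokio's async mpsc channel (any async mpsc with the same shape works)
use tokio::sync::mpsc;

pub enum DataSource {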
    Live { url: String },
    Backtest { path: PathBuf },
}

pub async fn run(source: DataSource, tx: mpsc::Sender<FeedEvent>) {
    match source {
        DataSource::Live { url } => run_websocket(url, tx).await,
        DataSource::Backtest { path } => replay_file(path, tx).await,
    }
}

replay_file reads events from disk and sends them through the same channel the live feed uses. The strategy task never knows the difference. This is the part I'm happiest with - there's no "backtest mode" flag scattered through the codebase. It's just a different data source.
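For a feel of what the replay side involves, here is a simplified sketch. It is not the code from the repo: it swaps the async channel for the blocking `std::sync::mpsc` so it runs with no dependencies, and it assumes a hypothetical one-event-per-line text format (`ts_us,side,price,size`) for the local file. The `FeedEvent` struct here is a stand-in as well.

```rust
use std::io::BufRead;
use std::sync::mpsc::Sender;

// Stand-in event type; the real bot uses its own FeedEvent enum.
#[derive(Debug, Clone, PartialEq)]
pub struct FeedEvent {
    pub ts_us: i64,
    pub is_bid: bool,
    pub price: f64,
    pub size: f64,
}

/// Parse one line of the assumed `ts_us,side,price,size` format.
fn parse_line(line: &str) -> Option<FeedEvent> {
    let mut parts = line.split(',');
    let ts_us: i64 = parts.next()?.trim().parse().ok()?;
    let is_bid = match parts.next()?.trim() {
        "bid" => true,
        "ask" => false,
        _ => return None,
    };
    let price: f64 = parts.next()?.trim().parse().ok()?;
    let size: f64 = parts.next()?.trim().parse().ok()?;
    Some(FeedEvent { ts_us, is_bid, price, size })
}

/// Push every parseable event into the channel, in file order.
/// Malformed lines are skipped; returns how many events were sent.
pub fn replay<R: BufRead>(reader: R, tx: &Sender<FeedEvent>) -> std::io::Result<usize> {
    let mut sent = 0;
    for line in reader.lines() {
        let line = line?;
        if let Some(event) = parse_line(&line) {
            if tx.send(event).is_err() {
                break; // receiver hung up, nothing left to feed
            }
            sent += 1;
        }
    }
    Ok(sent)
}
```

Taking any `BufRead` rather than a path means the same function can be fed from a file in a real run and from an in-memory buffer in a unit test.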

One thing you have to handle: timing. In production, events arrive in real time. In backtest, you're replaying them as fast as the CPU can process them. You need to either:

1. Replay at wall-clock speed (slow, but simulates real latency)
2. Replay as fast as possible with timestamps preserved in events (what I do)

I went with option 2. The strategy uses event timestamps, not wall-clock time, so the math stays correct no matter how much faster than real time the replay runs. Running six months of data takes about 40 seconds.
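To make "uses event timestamps, not wall-clock time" concrete, here is a small sketch of a time-based rolling window driven entirely by the timestamp carried on each event. `RollingWindow` and its methods are hypothetical names for illustration, not types from the bot. Nothing in it ever reads the system clock, so it behaves identically live and in replay.

```rust
use std::collections::VecDeque;

/// Keeps samples from the last `span_us` microseconds of *event* time.
pub struct RollingWindow {
    span_us: i64,
    samples: VecDeque<(i64, f64)>, // (event timestamp, value)
}

impl RollingWindow {
    pub fn new(span_us: i64) -> Self {
        Self { span_us, samples: VecDeque::new() }
    }

    /// Add a sample and evict anything older than the window,
    /// measured against the new event's timestamp rather than the clock.
    pub fn push(&mut self, ts_us: i64, value: f64) {
        self.samples.push_back((ts_us, value));
        while let Some(&(oldest_ts, _)) = self.samples.front() {
            if ts_us - oldest_ts > self.span_us {
                self.samples.pop_front();
            } else {
                break;
            }
        }
    }

    pub fn len(&self) -> usize {
        self.samples.len()
    }

    /// Mean of the values currently in the window, if any.
    pub fn mean(&self) -> Option<f64> {
        if self.samples.is_empty() {
            return None;
        }
        let sum: f64 = self.samples.iter().map(|&(_, v)| v).sum();
        Some(sum / self.samples.len() as f64)
    }
}
```

The sketch assumes events arrive in timestamp order, which holds for a replayed file but is worth asserting if you merge several markets into one stream.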

Simulating fills

This is where most backtests lie to you.

The naive approach: if your signal says buy at $0.38 and the historical data shows a trade at $0.38, assume you filled. This is wrong for two reasons.

First, you weren't the only one trying to buy at $0.38. Queue position matters. In a thin market, by the time your order would have reached the exchange, that level might be gone.

Second, your order moves the book. A $500 position in a market with $2,000 in liquidity is meaningful. The naive approach pretends you're a ghost that trades without impact.

What I do instead: simulate fills against the L2 snapshot at the time of the signal.

fn simulate_fill(
    book: &OrderBook,
    side: Side,
    size: Decimal,
) -> Option<(Decimal, Decimal)> { // (fill_price, actual_size)
    let levels = match side {
        Side::Buy => &book.asks,
        Side::Sell => &book.bids,
    };

    // a zero-size order would divide by zero when averaging below
    if size.is_zero() {
        return None;
    }

    let mut remaining = size;
    let mut cost = Decimal::ZERO;

    for level in levels {
        if remaining.is_zero() { break; }

        let take = remaining.min(level.size);
        cost += take * level.price;
        remaining -= take;
    }

    if remaining > Decimal::ZERO {
        // not enough liquidity to fill fully
        // partial fill or no fill depending on your strategy
        return None;
    }

    let fill_price = cost / size;
    Some((fill_price, size))
}

This walks the order book and simulates the actual fill price including slippage. It's still not perfect - you're still assuming you could have taken all that liquidity - but it's much closer to reality than assuming a perfect fill at the best price.
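If you'd rather model partial fills than return nothing, one possible variant looks like the sketch below. It is an illustration, not the repo's code: it handles the buy side only, adds a limit price, and uses integer thousandths of a dollar instead of a decimal type so it runs without dependencies. `Level`, `Fill`, and `walk_asks` are hypothetical names, and the asks are assumed to be sorted best price first.

```rust
/// One ask level; prices are in thousandths of a dollar to avoid floats.
#[derive(Debug, Clone, Copy)]
pub struct Level {
    pub price_milli: u64,
    pub size: u64,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Fill {
    pub filled: u64,     // shares actually filled
    pub cost_milli: u64, // total cost in thousandths of a dollar
}

impl Fill {
    /// Volume-weighted average price in dollars, or None if nothing filled.
    pub fn avg_price(&self) -> Option<f64> {
        if self.filled == 0 {
            None
        } else {
            Some(self.cost_milli as f64 / self.filled as f64 / 1000.0)
        }
    }
}

/// Buy up to `size` shares from `asks` (assumed sorted best price first),
/// never paying more than `limit_milli` per share. May fill partially.
pub fn walk_asks(asks: &[Level], size: u64, limit_milli: u64) -> Fill {
    let mut remaining = size;
    let mut cost_milli = 0u64;
    for level in asks {
        if remaining == 0 || level.price_milli > limit_milli {
            break;
        }
        let take = remaining.min(level.size);
        cost_milli += take * level.price_milli;
        remaining -= take;
    }
    Fill { filled: size - remaining, cost_milli }
}
```

As a worked example: buying 500 shares against asks of 200 @ $0.380, 200 @ $0.390 and 300 @ $0.400 with a $0.400 limit costs 76 + 78 + 40 = $194, an average of $0.388 against a best ask of $0.380. Tighten the limit to $0.390 and only 400 shares fill.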

I also added an 800µs delay to every simulated fill to account for the actual round-trip time my live bot experiences. Sounds pedantic. It actually mattered on a few strategies that depended on very short windows.
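One way to model that delay in an event-driven replay is to park each order and only let it reach the fill simulator once event time has moved past signal time plus latency. The sketch below is one interpretation under assumptions, not necessarily how the repo does it: `LatencyQueue` and `PendingOrder` are hypothetical names, latency is a fixed constant, and orders are assumed to be submitted in timestamp order.

```rust
use std::collections::VecDeque;

/// Fixed round-trip latency in microseconds (the 800µs figure from the post).
pub const LATENCY_US: i64 = 800;

#[derive(Debug, Clone, PartialEq)]
pub struct PendingOrder {
    pub id: u64,
    pub eligible_at_us: i64, // event time at which the order "arrives"
}

#[derive(Default)]
pub struct LatencyQueue {
    pending: VecDeque<PendingOrder>,
}

impl LatencyQueue {
    /// Record an order generated by a signal at `signal_ts_us`.
    pub fn submit(&mut self, id: u64, signal_ts_us: i64) {
        self.pending.push_back(PendingOrder {
            id,
            eligible_at_us: signal_ts_us + LATENCY_US,
        });
    }

    /// Call BEFORE applying the event stamped `event_ts_us` to the book.
    /// Returns orders whose arrival time is strictly earlier than that event,
    /// so they fill against the book as it stood when they arrived.
    pub fn release_before(&mut self, event_ts_us: i64) -> Vec<PendingOrder> {
        let mut out = Vec::new();
        while let Some(eligible) = self.pending.front().map(|o| o.eligible_at_us) {
            if eligible >= event_ts_us {
                break;
            }
            if let Some(order) = self.pending.pop_front() {
                out.push(order);
            }
        }
        out
    }
}
```

Orders still queued when the file ends never fill in this sketch, so drain or discard them explicitly at the end of a run.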

What I found

Six strategies I'd been running. Here's the honest breakdown:

Mean reversion on spread widening - positive, 61% win rate over 4,200 trades. This is the core strategy. Backtest matched live performance closely, which gave me more confidence in it.

Pre-resolution snipe (last 60 seconds) - positive but much thinner than I thought. The live P&L looked good because I'd caught a few large moves. Backtest showed the median trade on this strategy barely covers fees. Still running it but at reduced size.

Momentum on correlated markets - flat to slightly negative. I'd convinced myself there was a signal here. There isn't, or at least not a consistent one. Turned this off.

Kelly-sized position scaling - neutral effect. The Kelly sizing doesn't generate alpha, it just changes the variance profile. Expected, but good to confirm.

Spread capture / market making - negative. I knew this was experimental. Confirmed it doesn't work at my size in these markets. Turned off.

News event repricing - not enough data to evaluate. Only 23 trades in six months. Keeping it on but can't draw conclusions yet.
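For reference on the Kelly item: the textbook Kelly fraction for buying a binary share at price c that pays $1 with probability p is f = (p - c) / (1 - c). The sketch below implements that standard formula plus a fractional-Kelly multiplier. It is not necessarily the sizing rule my bot uses, it ignores fees, and the function names are made up for illustration.

```rust
/// Textbook Kelly fraction for a binary share bought at `price`
/// that pays 1.0 with probability `p_win`. Returns 0.0 when there is
/// no edge or the inputs are out of range. Fees are ignored.
pub fn kelly_fraction(p_win: f64, price: f64) -> f64 {
    let price_ok = price > 0.0 && price < 1.0;
    let p_ok = (0.0..=1.0).contains(&p_win);
    if !price_ok || !p_ok {
        return 0.0;
    }
    ((p_win - price) / (1.0 - price)).max(0.0)
}

/// Dollar position for a given bankroll, scaled by a fractional-Kelly
/// multiplier in [0, 1] (for example 0.5 for half Kelly).
pub fn position_size(bankroll: f64, p_win: f64, price: f64, kelly_multiplier: f64) -> f64 {
    bankroll * kelly_fraction(p_win, price) * kelly_multiplier.clamp(0.0, 1.0)
}
```

The output is only as good as the `p_win` estimate going in, which fits the finding above: sizing reshapes variance, it doesn't create edge.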

The uncomfortable finding: two of the five strategies I thought were working were either flat or negative. The positive P&L I'd been seeing was almost entirely the mean reversion strategy plus some fortunate sizing on a few outlier trades.
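The cheapest way to catch this is to report the median trade next to the mean for every strategy. A few outliers drag the mean up while the median stays put. Here is a minimal sketch, with `Summary` and `summarize` as hypothetical names rather than the repo's actual P&L tracker:

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Summary {
    pub trades: usize,
    pub win_rate: f64,   // share of trades with positive P&L
    pub mean_pnl: f64,   // sensitive to a few big winners
    pub median_pnl: f64, // what a typical trade looked like
}

/// Summarize per-trade P&L for one strategy. Returns None for no trades.
pub fn summarize(pnls: &[f64]) -> Option<Summary> {
    if pnls.is_empty() {
        return None;
    }
    let n = pnls.len();
    let wins = pnls.iter().filter(|&&p| p > 0.0).count();
    let mean_pnl = pnls.iter().sum::<f64>() / n as f64;

    // Median: sort a copy, then take the middle (or average the middle two).
    let mut sorted = pnls.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let median_pnl = if n % 2 == 1 {
        sorted[n / 2]
    } else {
        (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0
    };

    Some(Summary {
        trades: n,
        win_rate: wins as f64 / n as f64,
        mean_pnl,
        median_pnl,
    })
}
```

With made-up numbers, five trades of +0.1, -0.1, +0.2, -0.2 and +50 give a mean of 10 but a median of 0.1. That is the shape of a strategy that looks great on total P&L and barely covers fees on a typical trade.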

One thing the backtest can't tell you

Whether the edge still exists.

Backtesting against historical data tells you whether a strategy would have worked on past data. It doesn't tell you whether the market has changed, whether more bots have entered the same trades, or whether the conditions that created the edge in January still exist in June.

The mean reversion strategy shows a slight decay in win rate from Feb to June - 64% in Q1, 58% in Q2. Could be noise. Could be the edge compressing as more participants find it. I don't know yet.
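A rough first check on "could be noise" is a two-proportion z-test on the win counts from each period. The sketch below is the standard pooled formula and nothing specific to my setup; `two_proportion_z` is a hypothetical helper name. It assumes every trade is an independent coin flip, which trades clustered in the same markets are not, so treat the result as an optimistic bound rather than a verdict.

```rust
/// Pooled two-proportion z statistic for win rates in two periods.
/// Positive means period A had the higher win rate.
/// Returns None for empty periods, inconsistent counts, or zero variance.
pub fn two_proportion_z(wins_a: u64, n_a: u64, wins_b: u64, n_b: u64) -> Option<f64> {
    if n_a == 0 || n_b == 0 || wins_a > n_a || wins_b > n_b {
        return None;
    }
    let (na, nb) = (n_a as f64, n_b as f64);
    let p_a = wins_a as f64 / na;
    let p_b = wins_b as f64 / nb;
    // Pooled win rate under the null hypothesis that nothing changed.
    let pooled = (wins_a + wins_b) as f64 / (na + nb);
    let std_err = (pooled * (1.0 - pooled) * (1.0 / na + 1.0 / nb)).sqrt();
    if std_err == 0.0 {
        return None;
    }
    Some((p_a - p_b) / std_err)
}
```

Feed it the actual per-quarter win and trade counts. An absolute value well above 2 suggests more than sampling noise under the independence assumption, but it says nothing about why the rate moved.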

This is the thing nobody says loudly enough about backtesting: a good backtest raises questions, it doesn't answer them. Finding out two strategies don't work is useful. The real question is whether the one that does work keeps working.

Setup if you want to replicate this

Grab an API key from pmdata.dev (they have a free tier). Download parquet files for whatever markets you want. Write a deserializer that maps their schema to your FeedEvent type. Swap your data source enum. Run.

Full code including the replay harness and fill simulator:

github.com/casatrick/polymarket-trading-bot

The backtester is in src/backtest/. It's about 300 lines including the fill simulator and the P&L tracker.

If you've built something similar or found a better data source, drop it in the comments. Especially interested in whether anyone has found a way to get clean pre-February 2026 data - the Polygon subgraph has it but the order book reconstruction from on-chain data is painful.