Simon Paxton

Originally published at novaknown.com

FutureSim Exposes Polymarket AI's Narrow Wins and Failures

Max Planck Institute researchers recently released FutureSim, a benchmark for Polymarket-style AI forecasting that tests whether agents can predict real-world events from a frozen slice of past web history rather than the live internet.

According to the FutureSim project page, the setup replays daily news over a three-month simulation and asks agents to decide for themselves when to search, when to update, and which forecasts deserve attention.

The project targets a familiar claim in AI forecasting: that if you give a model enough search, memory, and tools, it should start behaving a bit like a prediction market trader. FutureSim turns that into a controlled test by feeding agents timestamped news snapshots from CommonCrawl and blocking any information from beyond the simulated date, which is a neat way to stop hindsight from quietly doing all the work.
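To make that cut-off concrete, here is a minimal sketch of a date-gated replay loop. None of these names come from FutureSim; the snapshot structure, field names, and agent interface are assumptions for illustration.

```python
from datetime import timedelta

def visible_docs(snapshot, sim_date):
    """Return only documents published on or before the simulated date,
    so nothing from the agent's 'future' can leak into its context."""
    return [d for d in snapshot if d["published"] <= sim_date]

def replay(snapshot, start_date, days, agent):
    """Step through the simulation one day at a time; each day the agent
    sees a date-gated view of the web and may revise its forecasts."""
    forecasts = {}
    for i in range(days):
        today = start_date + timedelta(days=i)
        corpus = visible_docs(snapshot, today)
        # The agent decides for itself whether to search, update, or abstain.
        forecasts.update(agent.update(corpus, today))
    return forecasts
```

The point is that hindsight gets removed mechanically by the gate, rather than by trusting the model not to peek.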

In the first release, the team benchmarked frontier agents running in harnesses including Codex and Claude Code. The benchmark includes markets and event questions spanning sports, elections, and culture, including some that overlap with Polymarket-style contracts where prices are meant to represent probabilities for future outcomes.

The early results were mixed in exactly the way hype tends not to be. On some overlapping markets, GPT-5.5 running in Codex reportedly beat the human aggregate by a wide margin: the researchers highlighted a near-perfect Brier skill score of 0.90 on the Super Bowl LX market and similarly strong performance on the Portugal presidential runoff. But the same model performed badly on UK election questions and the Grammys, which is less “general oracle” and more “occasionally very good specialist”.

FutureSim measures performance by asking agents to issue probabilistic forecasts as the simulated world unfolds. On Polymarket, by contrast, a price such as $0.65 implies roughly a 65 per cent chance of an event happening, according to Polymarket’s own documentation. That makes the comparison legible: the benchmark is testing whether agents can produce probabilities that are better calibrated than a live, money-backed crowd.
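As a rough illustration of that comparison (the numbers below are invented, not results from the benchmark), a market price is read directly as a probability, and both sides can be scored with the Brier score, where lower is better:

```python
def brier(p: float, outcome: int) -> float:
    """Brier score for one binary question: (p - outcome)**2, lower is better."""
    return (p - outcome) ** 2

price = 0.65          # $0.65 per $1 contract ...
market_prob = price   # ... read as a 65% chance the event resolves YES

# Suppose the event happens (outcome = 1): an agent at 0.80 beats the market.
print(brier(0.80, 1))         # 0.04
print(brier(market_prob, 1))  # 0.1225
```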

The broader aggregate numbers were less flattering than the headline wins. Research notes tied to the release said the best system reached only about 25 per cent accuracy overall, and some agents were so poorly calibrated that their probability estimates were worse than abstaining. Better memory, search, tool use, and more inference compute helped, but did not turn forecasting into a solved problem.
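"Worse than abstaining" has a concrete reading in Brier skill terms: skill is measured against a reference forecast, and a blanket 0.5 ("no idea") on every question is a natural abstention baseline. A sketch, again with invented numbers:

```python
def brier(p, outcome):
    return (p - outcome) ** 2

def brier_skill(forecasts, baseline, outcomes):
    """Brier skill score: 1 - BS_forecast / BS_baseline.
    Positive beats the baseline; negative is worse than abstaining."""
    bs_f = sum(brier(p, o) for p, o in zip(forecasts, outcomes)) / len(outcomes)
    bs_b = sum(brier(p, o) for p, o in zip(baseline, outcomes)) / len(outcomes)
    return 1 - bs_f / bs_b

outcomes = [1, 0, 0, 1]
abstain = [0.5] * len(outcomes)          # answer "don't know" everywhere
overconfident = [0.95, 0.9, 0.1, 0.05]   # confidently wrong half the time
print(brier_skill(overconfident, abstain, outcomes))  # -0.725: worse than 0.5s
```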

The released materials include no named outside expert reaction, so the framing rests on the team's own argument: adaptive agents should be judged on how they update beliefs over time, not just on one-shot trivia questions disguised as forecasting tasks.

That caveat matters because FutureSim is still a simulation built from historical web data, not a live trading system. Bloomberg reported on 6 May that most AI bots tested in trading contests were still losing money, which is a useful reminder that good scores on some event classes do not automatically cash out into broad market skill.

The FutureSim team says the benchmark can also be used with custom chronological datasets and to study memory, search, test-time adaptation, and multi-agent self-play. The paper, titled “FutureSim: Replaying World Events to Evaluate Adaptive Agents,” is listed on alphaXiv.

Key Takeaways

  • Max Planck researchers released FutureSim to test whether agents can forecast real-world events from replayed web history.
  • The benchmark uses dated CommonCrawl news snapshots to prevent models from seeing information from the future.
  • GPT-5.5 in Codex reportedly scored a Brier skill score of 0.90 on the Super Bowl LX market but did poorly on some election and entertainment questions.
  • The released materials suggest current systems show narrow wins and obvious failures rather than broad forecasting skill.
  • FutureSim is a historical simulation, not evidence that AI systems can reliably make money in live markets.

