DEV Community

Octavian Cristea

I Placed #2 in a Prediction Market Challenge Using Autoresearch

My first strategy lost $17 every simulation. My last one earned $47. The gap wasn't closed by me — it was closed by a loop.

Paradigm's Automated Research Hackathon was built around Karpathy's autoresearch idea: let AI do the research, not just the coding. The challenge — build a market-making strategy for a simulated prediction market. The constraint — 8 hours. The question I wanted to answer: can an AI-driven research loop compete with deep domain expertise?

It placed #2. The winner used 1,039 strategies and 20 parallel agents. I used 110 iterations and a tighter loop. The gap: $1.23.

Full source code on GitHub

The Loop

The pipeline was simple:

Anara for deep research — analyze the problem, read the simulation source code, identify unexplored regimes, propose structural changes to the strategy.

Claude Code for implementation — write the code, run parameter sweeps, test across simulations, measure the score.

The cycle: Anara proposes a breakthrough idea → Claude Code implements and iterates until the score plateaus → back to Anara for the next breakthrough.

This isn't "AI writes code." It's "AI does research, then AI implements the findings." Two different jobs, two different tools, one loop.

Here's how that loop played out across 8 hours.

The Problem (2 minutes of context)

A simulated binary prediction market. A contract pays $1 or $0. You place limit orders on an order book. Three agents trade against you:

  • Arbitrageur — knows the true probability, sweeps every mispriced order before anyone else
  • Retail — random uninformed traders, arriving ~0.25 times per step with ~$4.5 mean notional
  • Competitor — static ladder, always present

Your score is edge: how good your price was vs true probability at fill time. Positive = you priced better than reality. The arb punishes bad prices. Retail rewards good ones.
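To make the scoring concrete, here's a minimal sketch of the edge calculation as described above (`fill_edge` and its argument names are my own illustration, not the simulator's API):

```python
def fill_edge(side: str, fill_price: float, size: float, true_prob: float) -> float:
    """Edge on a single fill: positive when your quote beat fair value.

    Buying YES below the true probability earns edge; selling above it does too.
    """
    if side == "buy":
        return (true_prob - fill_price) * size
    return (fill_price - true_prob) * size

# Buying 10 shares at 0.40 when truth is 0.55: about +$1.50 of edge (a retail-style win).
# Selling 10 shares at 0.50 when truth is 0.55: about -$0.50 (the arb takes that trade).
```

The arb only trades when your side of this number is negative, which is why bad prices get punished immediately.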

Cycle 1: Foundation → Plateau at -$3

Anara's first pass analyzed the simulation source code and proposed a standard market-making setup: quote inside the competitor's spread, track volatility with an EMA, use inventory skew to mean-revert positions.

Claude Code implemented it, iterated through sizing, skew rates, and threshold parameters. 50 versions later:

  • v1: -$17 (losing $24 to arb, earning $7 from retail)
  • v10: +$4 (asymmetric skew cut arb losses to $5)
  • v50: -$3 (z-score filter tripled retail to $21, but arb crept back to $24)

Plateau. Every parameter tweak that reduced arb losses also reduced retail fills. Claude Code ran dozens of sweeps — sigma priors from 30 to 50, skew rates from 0.04 to 0.12, size formulas from flat to probability-scaled. Nothing broke through breakeven.
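For reference, the Cycle 1 machinery can be sketched in a few lines (a rough illustration with names like `ema_vol`, `net_inv`, and `skew_rate` of my own choosing; the real strategy lives in the repo):

```python
def quote(prob_est, last_move, ema_vol, net_inv, alpha=0.2, skew_rate=0.08):
    """One quoting step: EMA volatility sets the spread, inventory skews the mid."""
    # Track volatility with an exponential moving average of recent price moves.
    ema_vol = (1 - alpha) * ema_vol + alpha * abs(last_move)
    half_spread = max(0.01, 2.0 * ema_vol)  # quote wider when volatile
    # Inventory skew: shift the mid against your position so fills mean-revert it.
    mid = prob_est - skew_rate * net_inv / 100.0
    bid, ask = max(0.0, mid - half_spread), min(1.0, mid + half_spread)
    return bid, ask, ema_vol
```

With zero inventory the quotes sit symmetrically around the probability estimate; a long position pushes both quotes down, making sells more likely. The plateau came from exactly this coupling: every knob that tightened quotes against the arb also pulled them away from retail.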

Back to Anara.

Cycle 2: The Monopoly Breakthrough → +$41

I sent Anara back to the simulation code with one question: what are we missing?

Anara identified something I'd overlooked entirely. When the true probability moves to extremes (near 0% or 100%), the competitor's quotes vanish on one side of the book. My strategy was doing nothing in this regime — just canceling orders and waiting.

Anara's insight: this is the highest-edge moment in the entire simulation. When you're the only liquidity provider, retail has no choice but to trade with you. And the arbitrageur has nothing to sweep — your prices are already on the right side of fair value.

Claude Code implemented monopoly mode in one iteration:

```python
# Monopoly mode: the competitor has vanished from one side of the book,
# so quote aggressively. Near prob ≈ 0, YES shares cost almost nothing,
# and sizing scales as 1/prob.
base_size = max(20.0, 85.0 / max(0.005, prob_est))

# Ladder bids across the empty ticks below the competitor's ask:
# full size on the two best ticks, half size beyond.
for tick in range(1, min(6, comp_ask)):
    frac = 1.0 if tick <= 2 else 0.5
    sz = min(base_size * frac, max(0.0, max_pos - net_inv))
    # ...place a bid of sz shares at this tick (order placement elided)
```

v74: +$41. From negative to dominant in one cycle. This regime accounted for 60% of all edge in the final strategy.

This is the kind of discovery that's hard to make incrementally. I was optimizing parameters in the normal regime for 50 iterations. Anara stepped back, read the simulation mechanics, and found an entire regime I wasn't playing in.

Cycle 3: Retail Matching → +$43

Back to Claude Code for implementation work. The normal regime (when both sides of the book are present) was still leaving money on the table.

Anara analyzed the retail flow mechanics and found that retail fills ~$4.5 mean notional regardless of your order size. At prob=0.5, that's ~9 shares. If you post 50 shares, 9 get filled by retail (+edge) and 41 sit there for the arb to sweep (-edge).

The fix: size = 14/prob. At p=0.5 → 28 shares. At p=0.05 → 280. Match your order size to expected retail at every price level.
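As a function (the 0.005 clamp is my illustrative choice, mirroring the monopoly-mode code):

```python
def retail_matched_size(prob_est: float, numerator: float = 14.0) -> float:
    """Match order size to expected retail flow: ~$4.5 notional means shares ~ 1/price.

    Anything beyond the retail-sized order just sits on the book for the arb.
    """
    return numerator / max(0.005, prob_est)  # clamp to avoid division blow-up

print(retail_matched_size(0.5))   # → 28.0 shares
print(retail_matched_size(0.05))  # ≈ 280 shares
```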

Claude Code swept the numerator from 7 to 20. Results:

  • 10/prob: $46.02 (too small, missing retail)
  • 12/prob: $46.15
  • 14/prob: $46.70 (sweet spot)
  • 16/prob: $46.45 (excess arb exposure)
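The sweep pattern itself is generic; here's a toy version, with a stand-in objective that peaks at 14 in place of the real 200-simulation benchmark:

```python
def sweep(candidates, score_fn):
    """Score every candidate, return the best one and the full result map."""
    results = {k: score_fn(k) for k in candidates}
    return max(results, key=results.get), results

# Stand-in objective peaking at 14, just to exercise the loop.
best, _ = sweep(range(7, 21), lambda k: -(k - 14) ** 2)
print(best)  # → 14
```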

Cycle 4: Final Tuning → +$47

The last cycle was Claude Code grinding marginal gains:

  • Monopoly sizing from 38/prob to 85/prob (+$2)
  • Position limit from 1,000 to 3,000 (+$2)
  • Tiered z-score threshold (+$0.35)
  • 100/prob tested and rejected (cash constraints bind)
  • 120/prob tested and rejected (-$9, catastrophic)

Each change validated, each rejection logged. The score plateaued at $46.70 on my local seed, which translated to $41.09 on the final fresh-seed evaluation.

What the Loop Found That I Wouldn't Have

Looking back, the autoresearch loop made three discoveries I wouldn't have reached manually in 8 hours:

1. The monopoly regime. I was stuck optimizing parameters in the normal regime. Anara read the simulation source and identified an entire game state I was ignoring. This was +$44 in one shot.

2. The retail-matching formula. I would have guessed "bigger is better" or "flat size is safest." Anara analyzed the retail flow distribution and derived that sizing should be inversely proportional to probability to match expected fill sizes.

3. The volatility convergence. The empirically-discovered formula phi_factor * 39.9 / sqrt(steps_remaining) independently converged on the same structure as Paradigm's analytical solution from their pm-AMM paper. The AI-driven parameter search arrived at the same answer as the math — without knowing the math existed.
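That discovered term is a one-liner. `phi_factor` below stands in for the strategy's probability-dependent factor; the 1/sqrt shape is the part that matched the pm-AMM math:

```python
import math

def vol_width(phi_factor: float, steps_remaining: int) -> float:
    """Empirically found width term: grows as 1/sqrt(time to resolution)."""
    return phi_factor * 39.9 / math.sqrt(max(1, steps_remaining))

# Per-step uncertainty rises near resolution: the width at 100 steps
# remaining is double the width at 400 steps remaining.
```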

What Failed

Failures discovered by the loop, confirmed across hundreds of simulations:

| What | Expected | Actual | Why |
| --- | --- | --- | --- |
| Multi-level quoting (5 levels) | More fills | -$7.50 | Arb swept all 5 levels before retail arrived |
| Smaller sizes (10/prob) | Less arb damage | -$1.00 | Retail fill reduction exceeded arb savings |
| No cash buffer (100%) | More capital | -$1.20 | Blew up on bad seeds |
| Monopoly 120/prob | More edge | -$9.00 | Cash constraints bound hard |
| Adaptive z-threshold | Smarter | -$0.30 | Added noise, no signal |

Me vs #1: Two Approaches to Autoresearch

The winner used a different autoresearch architecture: 20 parallel Claude Code agents, each exploring independently, 1,039 total strategies, and a 900-line final strategy he says he doesn't fully understand.

My approach: sequential Anara → Claude Code loop, 110 iterations, 80-line final strategy I can explain line by line.

| | #1 | Me |
| --- | --- | --- |
| Strategies | 1,039 | 110 |
| Parallel agents | 20 | 1 loop |
| Final strategy | ~900 lines | ~80 lines |
| Understands it | "I barely read the problem" | Every line documented |
| Score | $42.32 | $41.09 |

Both approaches independently discovered the same core insights — monopoly regime, skip tight spreads, retail-matching sizing. 10x more compute converged on the same answer.

The $1.23 gap likely came from his multi-seed validation (16 seeds vs my single-seed testing) — his strategy was more robust to the fresh-seed final evaluation, not fundamentally better.

What I'd Do Differently

  1. Multi-seed validation from the start. Test on 4+ seeds, not just one. The winner's robustness came from this discipline.

  2. Add a "start from scratch" cycle. The winner's biggest jump came from telling an agent to ignore all existing code. I never broke out of incremental iteration. Sometimes you need to escape the local optimum.

  3. Spend more Anara cycles on the normal regime. 60% of edge came from monopoly, but monopoly was already near-optimal by cycle 3. The normal regime had more room.
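The first point is a small harness change; a sketch, with `run_strategy` standing in for whatever your benchmark entry point is:

```python
import statistics

def multi_seed_score(run_strategy, seeds=(0, 1, 2, 3)):
    """Average across several seeds so one lucky seed can't drive tuning."""
    scores = [run_strategy(seed=s) for s in seeds]
    return statistics.mean(scores), statistics.pstdev(scores)
```

Tune against the mean, and treat a large spread across seeds as a warning that the change won't survive a fresh-seed evaluation.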

Try It Yourself

The full code is open source:

```shell
uv sync --dev
uv run orderbook-pm run strategies/strategy.py --simulations 200 --workers 4
```

GitHub: octavi42/prediction-market-maker

7 milestone versions, documented strategy, benchmark scripts, and every failed attempt.
