How I Use 19 AI Agents to Design Physics Engines (Tournament Architecture)

#ai #programming #architecture #showdev

I'm building an engine simulator called PISTON. It predicts horsepower and torque from first principles — real thermodynamics, no curve-fitting, no fudge factors. Currently at 8.08% HP error across 22 validated engines, from a Honda Beat kei car to a Chevrolet LT4 supercharged V8.

The interesting part isn't the physics. It's how I build it.

Every major feature goes through a tournament: 8 planners → 8 reviewers → 3 judges. Nineteen AI agents, each working independently, competing to produce the best implementation.

Here's why, and how it works.

The Problem with Single-Agent Development

When one AI agent designs and implements a complex feature, you get:

Anchoring bias: The first approach it thinks of dominates
Blind spots: No one challenges the assumptions
Local optima: It optimizes within its initial framing instead of exploring alternatives
Groupthink with itself: The same biases compound across design → implementation → testing

For something like a predictive combustion model (where getting the burn rate equation wrong means 30% error), one agent isn't enough.

The Tournament Structure

Phase 1: Planning (8 Agents)

Eight independent planners each receive an identical brief:

What the feature is (e.g., "Exhaust Tuning Model")
Technical requirements (e.g., "Method of Characteristics wave propagation")
Integration constraints (how it fits the existing codebase)
Validation targets (what accuracy improvement is expected)

Each planner produces a complete design document: data structures, algorithms, equations, file organization, test strategy. They work in isolation — no planner sees another planner's output.

Why 8? Enough for genuine diversity of approach. With fewer, you get variations on a theme. With 8, you reliably get 3-4 fundamentally different architectures.

Phase 2: Review (8 Agents)

Eight independent reviewers each receive all 8 plans. Their job:

Score each plan on 5 dimensions (physics accuracy, code quality, performance, maintainability, integration risk)
Identify the strongest elements across all plans
Recommend which elements to combine into a hybrid
Flag any physics errors or misconceptions

The reviews are brutal. Reviewers routinely catch things like:

"Plan C uses adiabatic flame temperature without dissociation corrections — this will overpredict NOx by 40%"
"Plan F's data structure requires O(n²) traversal per crank angle step — unacceptable at 720 steps per cycle"
"Plans A, D, and G all use the same Woschni correlation but with different coefficient conventions — only D's is correct"

Phase 3: Judging (3 Agents)

Three judges receive all 8 plans AND all 8 reviews. They each independently:

Select a winner (or recommend a hybrid of specific elements from multiple plans)
Write a detailed justification
Provide specific implementation guidance

If all 3 judges agree → we go with that plan.
If 2/3 agree → we go with the majority, noting the dissent.
If all 3 disagree → we run a second round with clarified criteria.

Real Example: Predictive Combustion

The combustion model tournament was the most consequential. This feature replaced our Wiebe curve-fitting (which is essentially a lookup table) with physics-based burn rate prediction.

8 planners produced:

2 plans using Tabaczynski entrainment-burnup (the winner)
2 using fractal flame models
1 using quasi-dimensional with PDF
1 using Blizard-Keck
1 using eddy-burnup with k-ε turbulence
1 hybrid approach

Key reviewer findings:

Tabaczynski with Zimont turbulent flame speed was the strongest physics foundation
Fractal approaches had theoretical elegance but 3x the implementation complexity
Two plans had errors in the laminar flame speed correlation (Metghalchi-Keck vs Gülder — reviewers caught that Gülder needed different curve-fit coefficients)

Judges unanimously selected Tabaczynski entrainment-burnup with:

Zimont turbulent flame speed (calibration coefficient A_z = 0.56)
k-K turbulence model (tumble/swirl-aware, C_K = 0.50)
Metghalchi-Keck laminar flame speed
Sensitivity tests: spark timing, compression ratio, cam timing

Two independent calibration runs later converged to A_z = 0.52 and 0.56. The final model predicts combustion from engine geometry alone — no per-engine tuning required.

Result: 8.3% HP MAPE — within 1% of the previous curve-fitted approach, but now it generalizes to engines it hasn't seen.

Why This Works

1. Genuine Diversity

Eight agents independently tackling the same problem produce genuinely different solutions. Not "8 slightly different versions of GPT's first instinct" — fundamentally different algorithmic approaches.

2. Adversarial Review

Reviewers have every incentive to find flaws. They're not reviewing their own work. They're comparing 8 approaches and their reputation (within the tournament) depends on catching real issues.

3. Synthesis Over Selection

The best outcomes are often hybrids. "Take Plan C's data structures, Plan A's core algorithm, and Plan F's error handling" produces something better than any single plan.

4. Documented Reasoning

Every tournament produces ~100 pages of technical documents. When future-me needs to understand why we chose Tabaczynski over fractal flame models, the reasoning is preserved with citations and quantitative comparisons.