I'm building an engine simulator called PISTON. It predicts horsepower and torque from first principles — real thermodynamics, no curve-fitting, no fudge factors. Currently at 8.08% HP error across 22 validated engines, from a Honda Beat kei car to a Chevrolet LT4 supercharged V8.
The interesting part isn't the physics. It's how I build it.
Every major feature goes through a tournament: 8 planners → 8 reviewers → 3 judges. Nineteen AI agents, each working independently, competing to produce the best implementation.
Here's why, and how it works.
The Problem with Single-Agent Development
When one AI agent designs and implements a complex feature, you get:
- Anchoring bias: The first approach it thinks of dominates
- Blind spots: No one challenges the assumptions
- Local optima: It optimizes within its initial framing instead of exploring alternatives
- Groupthink with itself: The same biases compound across design → implementation → testing
For something like a predictive combustion model (where getting the burn rate equation wrong means 30% error), one agent isn't enough.
The Tournament Structure
Phase 1: Planning (8 Agents)
Eight independent planners each receive an identical brief:
- What the feature is (e.g., "Exhaust Tuning Model")
- Technical requirements (e.g., "Method of Characteristics wave propagation")
- Integration constraints (how it fits the existing codebase)
- Validation targets (what accuracy improvement is expected)
Each planner produces a complete design document: data structures, algorithms, equations, file organization, test strategy. They work in isolation — no planner sees another planner's output.
Why 8? Enough for genuine diversity of approach. With fewer, you get variations on a theme. With 8, you reliably get 3-4 fundamentally different architectures.
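As a concrete sketch of the fan-out phase: the brief fields below mirror the list above, but the `Brief` and `fan_out` names are hypothetical, not PISTON's actual API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Brief:
    feature: str            # e.g. "Exhaust Tuning Model"
    requirements: str       # e.g. "Method of Characteristics wave propagation"
    constraints: str        # how it fits the existing codebase
    validation_target: str  # what accuracy improvement is expected

def fan_out(brief: Brief, planners):
    # Every planner receives the identical brief and works in isolation:
    # no planner ever sees another planner's output.
    return [planner(brief) for planner in planners]
```

The `frozen=True` is the point: the brief is immutable, so no planner can leak state into another's input.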
Phase 2: Review (8 Agents)
Eight independent reviewers each receive all 8 plans. Their job:
- Score each plan on 5 dimensions (physics accuracy, code quality, performance, maintainability, integration risk)
- Identify the strongest elements across all plans
- Recommend which elements to combine into a hybrid
- Flag any physics errors or misconceptions
The reviews are brutal. Reviewers routinely catch things like:
- "Plan C uses adiabatic flame temperature without dissociation corrections — this will overpredict NOx by 40%"
- "Plan F's data structure requires O(n²) traversal per crank angle step — unacceptable at 720 steps per cycle"
- "Plans A, D, and G all use the same Woschni correlation but with different coefficient conventions — only D's is correct"
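Aggregating the reviews is mechanical once every reviewer has scored every plan on the five dimensions. A minimal sketch (the data shape and `aggregate` helper are illustrative, not PISTON's actual tooling):

```python
# Hypothetical aggregation: each reviewer scores each plan per dimension
# (say 1-10), and we take the mean total score per plan across reviewers.
DIMENSIONS = ("physics", "code_quality", "performance",
              "maintainability", "integration_risk")

def aggregate(reviews):
    """reviews: {reviewer: {plan: {dimension: score}}} -> mean total per plan."""
    totals, counts = {}, {}
    for per_plan in reviews.values():
        for plan, scores in per_plan.items():
            totals[plan] = totals.get(plan, 0) + sum(scores[d] for d in DIMENSIONS)
            counts[plan] = counts.get(plan, 0) + 1
    return {plan: totals[plan] / counts[plan] for plan in totals}
```

The numeric scores only rank the field; the qualitative findings (like the three quoted above) are what actually shape the hybrid.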
Phase 3: Judging (3 Agents)
Three judges receive all 8 plans AND all 8 reviews. They each independently:
- Select a winner (or recommend a hybrid of specific elements from multiple plans)
- Write a detailed justification
- Provide specific implementation guidance
If all 3 judges agree → we go with that plan.
If 2/3 agree → we go with the majority, noting the dissent.
If all 3 disagree → we run a second round with clarified criteria.
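That three-way decision rule fits in a few lines. A sketch (the `decide` helper is hypothetical; it just encodes the rule above):

```python
from collections import Counter

def decide(verdicts):
    """verdicts: one selected plan (or hybrid label) per judge, e.g. ["C", "C", "A"].
    Returns (winner, outcome) per the unanimous / majority / rerun rule."""
    top, votes = Counter(verdicts).most_common(1)[0]
    if votes == len(verdicts):
        return top, "unanimous"
    if votes * 2 > len(verdicts):          # strict majority (2 of 3)
        return top, "majority-with-dissent"
    return None, "second-round"            # all judges disagree
```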
Real Example: Predictive Combustion
The combustion model tournament was the most consequential. This feature replaced our Wiebe curve-fitting (which is essentially a lookup table) with physics-based burn rate prediction.
8 planners produced:
- 2 plans using Tabaczynski entrainment-burnup (the winner)
- 2 using fractal flame models
- 1 using quasi-dimensional with PDF
- 1 using Blizard-Keck
- 1 using eddy-burnup with k-ε turbulence
- 1 hybrid approach
Key reviewer findings:
- Tabaczynski with Zimont turbulent flame speed was the strongest physics foundation
- Fractal approaches had theoretical elegance but 3x the implementation complexity
- Two plans had errors in the laminar flame speed correlation (Metghalchi-Keck vs Gülder — reviewers caught that Gülder needed different curve-fit coefficients)
Judges unanimously selected Tabaczynski entrainment-burnup with:
- Zimont turbulent flame speed (calibration coefficient A_z = 0.56)
- k-K turbulence model (tumble/swirl-aware, C_K = 0.50)
- Metghalchi-Keck laminar flame speed
- Sensitivity tests: spark timing, compression ratio, cam timing
Two independent calibration runs later converged on A_z values of 0.52 and 0.56. The final model predicts combustion from engine geometry alone — no per-engine tuning required.
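For orientation, here is roughly what the two flame-speed pieces the judges selected look like. This is a sketch, not PISTON's code: the Metghalchi-Keck coefficients are the published gasoline (indolene) fit from the combustion literature, and the function signatures are my own.

```python
def laminar_flame_speed(T_u, p, phi):
    """Metghalchi-Keck correlation: S_L = S_L0 * (T_u/298)**a * (p/101325)**b.
    Coefficients are the published gasoline (indolene) fit; SI units, m/s.
    T_u: unburned-gas temperature [K], p: pressure [Pa], phi: equivalence ratio."""
    B_m, B_2, phi_m = 0.2632, -0.8472, 1.13          # m/s, m/s, dimensionless
    S_L0 = B_m + B_2 * (phi - phi_m) ** 2            # reference speed at 298 K, 1 atm
    a = 2.18 - 0.8 * (phi - 1.0)                     # temperature exponent
    b = -0.16 + 0.22 * (phi - 1.0)                   # pressure exponent
    return S_L0 * (T_u / 298.0) ** a * (p / 101325.0) ** b

def zimont_turbulent_flame_speed(u_prime, S_L, alpha_u, l_t, A_z=0.56):
    """Zimont closure: S_t = A_z * u'**0.75 * S_L**0.5 * alpha_u**-0.25 * l_t**0.25.
    u_prime: turbulence intensity [m/s], alpha_u: unburned-gas thermal
    diffusivity [m^2/s], l_t: integral length scale [m]."""
    return A_z * u_prime**0.75 * S_L**0.5 * alpha_u**-0.25 * l_t**0.25
```

The input values in any real run come from the k-K turbulence model and the cylinder state, which is what makes the prediction geometry-driven rather than tuned.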
Result: 8.3% HP MAPE — within 1% of the previous curve-fitted approach, but now it generalizes to engines it hasn't seen.
Why This Works
1. Genuine Diversity
Eight agents independently tackling the same problem produce genuinely different solutions. Not "8 slightly different versions of GPT's first instinct" — fundamentally different algorithmic approaches.
2. Adversarial Review
Reviewers have every incentive to find flaws. They're not reviewing their own work. They're comparing 8 approaches and their reputation (within the tournament) depends on catching real issues.
3. Synthesis Over Selection
The best outcomes are often hybrids. "Take Plan C's data structures, Plan A's core algorithm, and Plan F's error handling" produces something better than any single plan.
4. Documented Reasoning
Every tournament produces ~100 pages of technical documents. When future-me needs to understand why we chose Tabaczynski over fractal flame models, the reasoning is preserved with citations and quantitative comparisons.
The Numbers
Across 12 tournaments (combustion, knock, forced induction, VE/Helmholtz, exhaust tuning, heat transfer, friction, emissions, and more):
- Average plans per tournament: 8
- Average reviews per tournament: 8
- Judge agreement rate: 83% unanimous, 17% 2-1 majority
- Zero second-round judging required (all resolved on first pass)
- Physics errors caught by reviewers: 34 across all tournaments
- Overall engine count validated: 22 engines, 44 data points (HP + TQ each)
When NOT to Use This
This is overkill for:
- Simple features (add a CLI flag, fix a typo)
- Well-understood problems with clear best practices
- Time-critical fixes
Use it for:
- Features where wrong physics = wrong results
- Architecture decisions that are expensive to reverse
- Anything where "good enough" isn't good enough
Try It Yourself
The approach works with any AI capable of technical writing. The key ingredients:
- Identical briefs — every planner gets the same information
- True isolation — planners don't see each other's work
- Cross-review — reviewers see ALL plans, not just one
- Independent judging — judges don't consult each other
- Preserved artifacts — keep everything for future reference
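Putting those five ingredients together, the whole tournament is a short loop. A sketch, assuming a `run_agent` placeholder for whatever model call you use:

```python
def tournament(brief, run_agent, n_planners=8, n_reviewers=8, n_judges=3):
    # Identical briefs + true isolation: each planner call is independent.
    plans = [run_agent("planner", brief) for _ in range(n_planners)]
    # Cross-review: every reviewer sees ALL plans.
    reviews = [run_agent("reviewer", brief, plans) for _ in range(n_reviewers)]
    # Independent judging: judges see plans and reviews, never each other.
    verdicts = [run_agent("judge", brief, plans, reviews) for _ in range(n_judges)]
    # Preserved artifacts: return everything for future reference.
    return {"plans": plans, "reviews": reviews, "verdicts": verdicts}
```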
The PISTON codebase is at github.com/0x000NULL/PISTON. 1,141 tests. 22 validated engines. All built through tournaments.