Mirror, Mirror on the Wall: Who’s the Smartest AI of Them All?
Explore a new approach to AI evaluation: Peer Supervision — a multi-agent role-play setup where AI systems take on the roles of questioners, respondents, and judges, thereby minimizing human bias.
In this post, we introduce the Ultimate Clash of AI, a simple app that demonstrates how peer-supervised competitions can offer a more dynamic and thorough way to evaluate AI capabilities.
Try the app live at: ultimate-clash-of-ai.streamlit.app
Before exploring how this framework works, let’s first examine why conventional benchmarks often fail to capture the full spectrum of AI intelligence.
Introduction
In artificial intelligence — particularly in the realm of Large Language Models — performance is typically assessed using human-curated test sets that measure language comprehension, reasoning, and other skills. Well-known LLM benchmarks include SQuAD, GLUE, SuperGLUE, MMLU, and LAMBADA. While these benchmarks have helped shape the field, they remain constrained by human-defined metrics, static data, and the biases of their creators.
Despite their value, traditional benchmarks have shortcomings:
- Human Bias: Datasets reflect the viewpoints and cultural perspectives of their creators.
- Static Testing: Once a dataset is released, models can be tuned to exploit its quirks, artificially boosting scores.
- Potential Data Leakage: Because LLMs are often trained on extensive internet corpora, it can be challenging to ensure these test sets (or closely related data) weren’t included in the training pipeline — possibly inflating performance.
- Narrow Scope: Current tests rarely measure how AI adapts to fresh, challenging questions generated in real time.
- Limited Assessment: Creativity, strategic thinking, and multi-step reasoning are seldom thoroughly tested.
Because of these gaps, a new paradigm is proposed — one that shifts the supervisory role from humans to AI models themselves, offering a broader view of AI capabilities.
The Peer-Supervision Concept
Peer supervision is the central idea behind this approach. Instead of relying on static, human-generated tests, AI systems create, answer, and judge their own questions in real time. This model minimizes human biases and focuses on objective, verifiable content, enabling a richer understanding of each AI’s strengths and limitations.
The Ultimate Clash of AI app serves as a demonstration of how peer supervision can be turned into a head-to-head competition. In this setting, each AI must not only provide correct, well-reasoned answers to challenges but also design meaningful questions and fairly judge its peers.
How Ultimate Clash of AI Works
In the Ultimate Clash of AI application, three (or more) top AI models compete directly, with no human oversight. Each model takes a turn in the following roles (a minimal sketch of the rotation follows the list):
- Asker: Generates a deterministic, factual, verifiable question aimed at exploring potential weaknesses of the Responder.
- Responder: Provides the answer and explains the reasoning behind it.
- Judge: Evaluates the question (for determinism and complexity) and the answer (for accuracy, reasoning, and clarity).
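To make the rotation concrete, here is a minimal sketch of how a full round-robin over the three roles could be generated. The model labels and the `round_robin_roles` helper are illustrative assumptions, not the app's actual code.

```python
from itertools import permutations

# Hypothetical model labels; the real app uses whichever LLMs are configured.
MODELS = ["model_a", "model_b", "model_c"]

def round_robin_roles(models):
    """Yield every (asker, responder, judge) assignment so that, over a full
    cycle, each model plays each of the three roles against its peers."""
    for asker, responder, judge in permutations(models, 3):
        yield {"asker": asker, "responder": responder, "judge": judge}

if __name__ == "__main__":
    for assignment in round_robin_roles(MODELS):
        print(assignment)
```

With three models this yields six assignments per cycle, so every model asks, answers, and judges more than once before the scores are tallied.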
What Is a Valid Question?
A question must be deterministic, factual, and verifiable:
- Deterministic: It should yield a single, definitive correct answer (or a small set of correct answers).
- Factual: It should rely on established information — not opinion or speculation.
- Verifiable: It should be validated via widely recognized facts, logical proofs, or authoritative data sources.
Subjective or ambiguous questions (e.g., “Which movie is the best?”) are automatically excluded from the game, as they cannot be objectively scored.
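As a concrete illustration of how such a filter might be applied, the sketch below phrases the three criteria as a prompt and asks a model to return a verdict. The prompt wording and the `ask_llm` callable are assumptions for illustration, not the app's implementation.

```python
import json

VALIDITY_PROMPT = """You are judging whether a quiz question is valid.
A valid question is:
- Deterministic: it has one definitive answer (or a small closed set of answers)
- Factual: it relies on established information, not opinion or speculation
- Verifiable: it can be checked against recognized facts, proofs, or sources

Question: {question}

Reply with JSON only: {{"valid": true/false, "reason": "..."}}"""

def is_valid_question(question: str, ask_llm) -> bool:
    """ask_llm is any callable that sends a prompt to a model and returns its text reply."""
    reply = ask_llm(VALIDITY_PROMPT.format(question=question))
    try:
        verdict = json.loads(reply)
        return bool(verdict.get("valid", False))
    except json.JSONDecodeError:
        # An unparseable reply is treated as a failed validation.
        return False
```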
Evaluation Criteria and Game Overview
The competition unfolds in multiple rounds, cycling through the Asker, Responder, and Judge roles among the AI participants.
Scoring
Each role is scored on a 0–10 scale:
Responder (evaluated by the Asker and Judge):
- Accuracy: Is the answer factually correct?
- Reasoning: Is the explanation logically coherent and well-structured?
- Communication: Is the answer clear and compelling?
Asker (evaluated by the Responder and Judge):
- Strategy: Does the question probe the Responder’s potential weaknesses?
- Creativity: Is the question original, engaging, and thought-provoking?
Because each AI takes turns being the Asker, Responder, and Judge, scores provide a 360-degree snapshot of the models’ capabilities, including creativity, strategic thinking, and reasoning.
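To illustrate how these marks could be combined, the sketch below scores each criterion on the 0–10 scale and averages the two evaluators per role. The dataclass layout and the plain-average weighting are assumptions made for this example.

```python
from dataclasses import dataclass

@dataclass
class ResponderScores:
    accuracy: float       # 0-10: is the answer factually correct?
    reasoning: float      # 0-10: is the explanation logically coherent?
    communication: float  # 0-10: is the answer clear and compelling?

@dataclass
class AskerScores:
    strategy: float    # 0-10: does the question probe the Responder's weaknesses?
    creativity: float  # 0-10: is the question original and thought-provoking?

def responder_round_score(by_asker: ResponderScores, by_judge: ResponderScores) -> float:
    """Average the Responder criteria across both evaluators (Asker and Judge)."""
    marks = [
        by_asker.accuracy, by_asker.reasoning, by_asker.communication,
        by_judge.accuracy, by_judge.reasoning, by_judge.communication,
    ]
    return sum(marks) / len(marks)

def asker_round_score(by_responder: AskerScores, by_judge: AskerScores) -> float:
    """Average the Asker criteria across both evaluators (Responder and Judge)."""
    marks = [
        by_responder.strategy, by_responder.creativity,
        by_judge.strategy, by_judge.creativity,
    ]
    return sum(marks) / len(marks)
```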
Non-Deterministic Question Penalty
If an AI (while serving as the Asker) fails to provide a valid deterministic question after three attempts, it scores 0 for both creativity and strategy in that round. The game proceeds, upholding the emphasis on verifiable content.
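A minimal sketch of this retry-then-penalize flow is shown below. The three-attempt limit comes from the rule above; the `generate_question` helper and the returned dictionary shape are hypothetical.

```python
MAX_ATTEMPTS = 3  # the rule above: three tries to produce a valid question

def asker_turn(generate_question, is_valid_question) -> dict:
    """Return a valid question, or a zeroed Asker score for this round."""
    for _ in range(MAX_ATTEMPTS):
        question = generate_question()
        if is_valid_question(question):
            return {"question": question, "penalized": False}
    # No valid question after three attempts: creativity and strategy are zeroed,
    # and the game moves on to the next round.
    return {"question": None, "penalized": True, "creativity": 0, "strategy": 0}
```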
Why Peer Supervision Offers Advantages Over Traditional Benchmarks
✅ Reduced Human Bias: Removing direct human intervention inherently limits cultural and linguistic biases.
✅ Adaptive Difficulty: AIs craft questions targeting each other’s weaknesses, preventing stagnation around a fixed dataset.
✅ Holistic Assessment: The approach measures a model’s skill at asking, answering, and judging, providing a multifaceted view of AI intelligence.
✅ Self-Improvement: Through continuous questioning, responding, and evaluating, each AI has the opportunity to refine its skills in real time.
✅ Creativity & Reasoning: The competition design incentivizes novel, strategic questions and well-supported answers — dimensions often overlooked by static tests.
Relation to Other Multi-Agent and Self-Play Methods
While multi-agent or adversarial approaches are not entirely new, the peer-supervision model refines and integrates several concepts:
- Debate and Adversarial Training: OpenAI’s Debate Game explored AI-to-AI challenges, but often relied on subjective evaluation rather than deterministic questions.
- GANs (Generative Adversarial Networks): Unlike GANs, which pit a generator against a discriminator, peer supervision focuses on verifiable question-answer scoring.
- Self-Play in Reinforcement Learning: Similar to AlphaZero, but with structured role rotation and verifiability.
- Multi-Agent Role-Play: Similar to some recent LLM experiments, but enforcing structured questioning and scoring.
- Chain-of-Thought & Self-Evaluation: Unlike Constitutional AI, which still relies on human-defined guidelines, peer supervision minimizes human involvement in the evaluation loop.
Taken together, peer supervision stands out for its role rotation, scoring system, and emphasis on deterministic, factual questions — offering a distinct, well-rounded way to measure AI capabilities in a dynamic environment.
Future Research Directions
🔹 Improving the Judge Role: Further refine how the Judge’s performance is measured, minimizing any residual bias.
🔹 Complex Reasoning Tasks: Introduce multi-step reasoning to test deeper logical comprehension.
🔹 Scaling Up: Expand to large networks of specialized AIs, comparing domain experts with generalists.
🔹 Hybrid Metrics: Combine peer supervision with targeted human audits or authoritative references for ambiguous cases.
🔹 Ethical and Safety Measures: Enforce fact-based questions to avoid misinformation or harmful content.
Conclusion
The Ultimate Clash of AI application provides a practical demonstration of peer supervision in action — a dynamic, self-regulating framework for measuring AI performance across multiple dimensions. By rotating roles among Asker, Responder, and Judge, it highlights attributes like creativity, strategic depth, and reasoning that traditional benchmarks can miss.
As AI continues to evolve, so must our methods for evaluating it. Why rely solely on static, human-defined tests when AI can challenge and refine itself in a collaborative and competitive environment?