Most AI is trained to be polite. I built a real-time distributed system (NestJS + Flutter + Ollama) that forces local LLMs to debate aggressively. It uses RAG for facts, a deterministic judge for scoring, and strict "roast" prompts to penalize vagueness. Here's how it works.
Modern conversational AI is optimized to be agreeable.
Thanks to RLHF, most LLMs are polite, conflict-averse, and eager to converge on consensus. That’s great for customer support—but terrible for debate, critical reasoning, and stress-testing ideas.
When challenged, most AI systems either soften their stance or collapse into “both sides have valid points.”
MrArgue was built to do the opposite.
It is a real-time adversarial debate simulator that forces language models into rigid, opposing roles—compelling them to defend explicit claims under pressure, answer aggressive counter-arguments, and lose when they fail to do so.
This project started as a side experiment and turned into a full multi-agent system designed to explore how far small, local LLMs can be pushed when you remove their safety-oriented conversational defaults.
What MrArgue Solves
The problem isn’t that LLMs “can’t reason.”
It’s that they are trained to avoid conflict, not withstand it.
Most debate-like systems fail because:
- models agree too easily
- arguments stay abstract
- no one is forced to commit to falsifiable claims
- “judging” is vague and diplomatic
MrArgue enforces pressure instead of politeness.
Each agent:
- takes a fixed adversarial stance
- must state one explicit claim per turn
- is penalized for vagueness
- must directly answer counter-questions
- cannot agree or converge
Debates continue until one side fails by contradiction, evasion, repetition, or score collapse.
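The per-turn rules and loss conditions above can be sketched as a validation pass. This is an illustrative stand-in, not the actual MrArgue implementation; the function and regex heuristics are assumptions:

```typescript
// Illustrative per-turn rule enforcement (names and heuristics are
// hypothetical, not MrArgue's real code).
type TurnVerdict = { ok: boolean; reason?: string };

function validateTurn(text: string, previousTurns: string[]): TurnVerdict {
  // "One explicit claim per turn": require at least one declarative marker.
  const hasClaim = /\b(is|are|will|must|cannot)\b/i.test(text);
  if (!hasClaim) return { ok: false, reason: "no explicit claim" };

  // "Cannot agree or converge": reject agreement phrases outright.
  if (/\b(i agree|you're right|both sides)\b/i.test(text)) {
    return { ok: false, reason: "agreement detected" };
  }

  // Repetition loss condition: reject verbatim reuse of an earlier turn.
  const normalized = text.toLowerCase().trim();
  for (const prev of previousTurns) {
    if (prev.toLowerCase().trim() === normalized) {
      return { ok: false, reason: "repetition" };
    }
  }
  return { ok: true };
}
```

A turn that fails validation counts against the offending agent, feeding the evasion and repetition loss conditions.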
System Architecture
MrArgue is built as a low-latency, event-driven distributed system.
Core stack:
- Backend: NestJS (Node.js) for orchestration and type safety
- Database: SQLite for development via Prisma ORM (the schema ports to PostgreSQL)
- Frontend: Flutter (mobile + web)
- LLM Inference: Ollama for running quantized local models
- Infrastructure: Docker + Docker Compose
The system was optimized to run entirely on consumer hardware while maintaining real-time interactivity.
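A minimal compose file for this stack might look like the sketch below. Service names, ports, and build paths are assumptions for illustration; only the `ollama/ollama` image name is the published upstream image:

```yaml
# Illustrative docker-compose sketch -- names and ports are assumptions.
services:
  api:
    build: ./backend        # hypothetical NestJS backend path
    ports: ["3000:3000"]
    depends_on: [ollama]
  ollama:
    image: ollama/ollama
    volumes: ["ollama:/root/.ollama"]  # persist pulled model weights
volumes:
  ollama:
```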
Debate Event Loop
At the center is a strict state machine:
- Initialization – debate topic + roles assigned
- RAG retrieval – relevant context fetched from vector storage
- Inference – prompt injected into the correct agent (Proponent / Opponent)
- Evaluation – response scored asynchronously
- Streaming – tokens streamed to clients via SSE
- Termination – debate ends when a loss condition is triggered
This loop allows debates to feel live without requiring expensive cloud inference.
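The steps above can be sketched as a small state machine. State names mirror the loop; the transition table is a simplified illustration, not the actual NestJS orchestrator:

```typescript
// Simplified sketch of the debate event loop as a state machine.
enum DebateState {
  Init = "init",
  Retrieval = "retrieval",
  Inference = "inference",
  Evaluation = "evaluation",
  Streaming = "streaming",
  Terminated = "terminated",
}

// Each state advances to the next phase; Streaming loops back to
// Retrieval for the opposing agent's turn.
const transitions: Record<DebateState, DebateState> = {
  [DebateState.Init]: DebateState.Retrieval,
  [DebateState.Retrieval]: DebateState.Inference,
  [DebateState.Inference]: DebateState.Evaluation,
  [DebateState.Evaluation]: DebateState.Streaming,
  [DebateState.Streaming]: DebateState.Retrieval,
  [DebateState.Terminated]: DebateState.Terminated,
};

function nextState(state: DebateState, lossConditionMet: boolean): DebateState {
  if (lossConditionMet) return DebateState.Terminated;
  return transitions[state];
}
```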
Model Selection & Role Design
Instead of one large model, MrArgue uses multiple small, specialized models to create contrasting debate behavior.
| Model | Size | Role | Reason |
|---|---|---|---|
| Llama 3.2 | 3B | Proponent | Better long-context consistency and claim defense |
| Gemma 2 | 2B | Opponent | Higher temperature sensitivity and aggressive countering |
Both models are quantized to 4-bit precision, achieving:
- sub-50ms token latency
- low memory usage
- stable real-time streaming on consumer CPUs/GPUs
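Role-specific behavior can be baked into the models with an Ollama Modelfile. This is a hypothetical example (the parameter value and system text are assumptions; the `gemma2:2b` tag pulls Ollama's default 4-bit quantized build):

```
# Hypothetical Modelfile for the Opponent role.
FROM gemma2:2b
PARAMETER temperature 0.9
SYSTEM """You are the OPPONENT. State one explicit claim per turn. Never agree."""
```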
Scoring & Evaluation Logic
Judging subjective debate requires structure.
MrArgue uses a hybrid deterministic scoring system instead of LLM-only evaluation.
1. Deterministic NLP Scoring
Using the natural library, each response is scored on:
- Lexical diversity
- Logical connector density
- Direct counter-question handling
- Repetition detection
- Vagueness penalties
Example vagueness penalty logic (with case-insensitive matching):

```typescript
const vagueWords = ["mystery", "subjective", "nuanced", "interconnected"];

let penalty = 0;
for (const word of vagueWords) {
  // Each hedge word found in the response costs 30 points.
  if (text.toLowerCase().includes(word)) penalty += 30;
}
```
This forces agents to commit instead of hiding behind abstraction.
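Lexical diversity is the simplest of these metrics: a type-token ratio over the response. A minimal plain-TypeScript sketch, standing in for the natural-library tokenizer actually used:

```typescript
// Type-token ratio as a lexical-diversity score in [0, 1].
// Plain-TS stand-in for the natural-based scorer; the regex tokenizer
// is an assumption for illustration.
function lexicalDiversity(text: string): number {
  const tokens = text.toLowerCase().match(/[a-z']+/g) ?? [];
  if (tokens.length === 0) return 0;
  const unique = new Set(tokens);
  return unique.size / tokens.length;
}
```

Low scores flag responses that recycle the same words, which feeds the repetition penalty.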
2. Retrieval-Augmented Generation (RAG)
To reduce hallucinations and repetition, debate context is grounded via RAG.
Implementation:
- Embedding model: nomic-embed-text
- Vector storage: pgvector-style embeddings
- Similarity metric: cosine similarity
This ensures:
- arguments reference real concepts
- repeated talking points are discouraged
- debates evolve instead of looping
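Retrieval ranks stored context by cosine similarity between embedding vectors. A self-contained sketch of the metric (the embeddings themselves would come from nomic-embed-text):

```typescript
// Cosine similarity between two embedding vectors: dot(a, b) / (|a| * |b|).
// Returns 0 for zero-length vectors to avoid division by zero.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}
```

Chunks scoring closest to the current argument are injected into the next prompt; chunks too similar to what was already said can be filtered to discourage looping.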
Adversarial Prompt Injection
Standard system prompts encourage safety and agreement.
MrArgue deliberately overrides this.
Example opponent instruction:

```
You are the OPPONENT.
Your goal is to defeat the proponent logically and rhetorically.

Hard rules:
- State one explicit claim.
- No vague abstractions.
- Attack a flaw and add a brief roast.
- Never agree.
```
This constraint-based prompting produced far more adversarial and coherent debates than temperature tuning alone.
Interaction Modes
MrArgue supports:
- AI vs AI
- Human vs AI
- Human vs Human (AI-judged)
This allows:
- passive spectatorship
- active participation
- replayable, shareable debates as posts
No human ego is harmed—only ideas.
Practical Use Cases
Beyond entertainment, this system has real applications:
- Legal training: simulate hostile opposing counsel
- Corporate red-teaming: stress-test strategies
- Critical thinking education: force students to defend claims
- Sales training: practice objection handling
- Policy analysis: challenge proposals under adversarial logic
Anywhere ideas need pressure, not validation.
Technical Challenges
Key problems encountered:
- Context window limits: solved via aggressive summarization
- Latency: masked with SSE streaming and optimistic UI
- Hallucinations: dramatically reduced via RAG grounding
- Cost: controlled by local inference and strict constraints
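On the latency point: Server-Sent Events are just text frames over HTTP, so each generated token can be flushed as soon as it exists. A sketch of the wire format the streaming endpoint emits (the event name is an assumption):

```typescript
// Format one token as an SSE frame: an "event:" line, a "data:" line,
// and a blank line terminating the frame.
function sseEvent(event: string, data: unknown): string {
  return `event: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
}
```

Clients render tokens as frames arrive, so perceived latency is the time to the first token rather than to the full response.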
This project is intentionally infra-heavy—but deliberately constrained.
Future Work
- Multi-agent debate panels
- Voice interface (Whisper STT)
- User-defined RAG sources (PDF / docs)
- Adaptive scoring weights
Repository
🔗 GitHub: https://github.com/TrendySloth1001/argumentbot
Final note
MrArgue isn’t about finding truth.
It’s about finding weakness.
And in engineering—as in thinking—that’s often more valuable.