<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ronit Mehta</title>
    <description>The latest articles on DEV Community by Ronit Mehta (@ronit26mehta).</description>
    <link>https://dev.to/ronit26mehta</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3849047%2F33fad728-24eb-424c-81c4-ffde055a264c.jpeg</url>
      <title>DEV Community: Ronit Mehta</title>
      <link>https://dev.to/ronit26mehta</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ronit26mehta"/>
    <language>en</language>
    <item>
      <title>I Made 4 AI Agents Debate Each Other. Here's Why You Should Never Trust a Single LLM Answer Again.</title>
      <dc:creator>Ronit Mehta</dc:creator>
      <pubDate>Sun, 29 Mar 2026 09:55:27 +0000</pubDate>
      <link>https://dev.to/ronit26mehta/i-made-4-ai-agents-debate-each-other-heres-why-you-should-never-trust-a-single-llm-answer-again-2anh</link>
      <guid>https://dev.to/ronit26mehta/i-made-4-ai-agents-debate-each-other-heres-why-you-should-never-trust-a-single-llm-answer-again-2anh</guid>
      <description>&lt;p&gt;GPT-4 gave me a confident answer last year.&lt;/p&gt;

&lt;p&gt;Precise numbers. Named researchers. A specific clinical study with exact findings.&lt;/p&gt;

&lt;p&gt;It was entirely fabricated.&lt;/p&gt;

&lt;p&gt;Not partially wrong. Not slightly off. The study did not exist. The researchers were not real. Every single number was invented — delivered with the same calm, authoritative tone the model uses when it is reciting actual facts.&lt;/p&gt;

&lt;p&gt;And that is the problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Issue Is Not Hallucination
&lt;/h2&gt;

&lt;p&gt;Every developer knows LLMs hallucinate. That is old news.&lt;/p&gt;

&lt;p&gt;The real issue is &lt;strong&gt;there is no signal&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A system that is 100% correct and a system that is 100% wrong sound identical. Same confidence. Same tone. Same formatting. No uncertainty score. No source tracing. No audit trail showing &lt;em&gt;how&lt;/em&gt; the model reached its conclusion.&lt;/p&gt;

&lt;p&gt;You are asking a single system, trained to sound confident, to evaluate its own reliability.&lt;/p&gt;

&lt;p&gt;That is like asking a witness to also be the judge, the jury, and the fact-checker.&lt;/p&gt;

&lt;p&gt;I got tired of it. So I built something different.&lt;/p&gt;




&lt;h2&gt;
  
  
  What If AI Reasoned Like Science Does?
&lt;/h2&gt;

&lt;p&gt;Science does not trust single sources. Peer review exists because even brilliant researchers need adversarial challenge before their conclusions are accepted.&lt;/p&gt;

&lt;p&gt;A claim must survive scrutiny, not just sound convincing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What if AI worked the same way?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not one model. Not one answer. Four specialist agents with defined, conflicting roles — gathering evidence, challenging it, moderating the process, and only accepting a verdict when the probability math converges.&lt;/p&gt;

&lt;p&gt;That is &lt;strong&gt;ARGUS&lt;/strong&gt; — Agentic Research and Governance Unified System.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;argus-debate-ai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  How It Works: The Four Agents
&lt;/h2&gt;

&lt;p&gt;ARGUS makes four agents debate every claim before outputting a verdict.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🟦 The Moderator&lt;/strong&gt;&lt;br&gt;
Creates the debate agenda. Decides what needs investigating. Sets stopping criteria — convergence, round limits, or budget exhaustion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🟩 The Specialist&lt;/strong&gt;&lt;br&gt;
The evidence gatherer. Runs hybrid retrieval across ingested documents and external sources. BM25 sparse search + FAISS dense vector search, fused via Reciprocal Rank Fusion. Finds the strongest supporting evidence and adds it to the debate graph with confidence scores.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🟥 The Refuter&lt;/strong&gt;&lt;br&gt;
Actively adversarial. Its only job is to break the proposition. Find counter-evidence. Expose methodological flaws. Add attack edges to the graph. It does not try to be balanced. This is intentional.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🟨 The Jury&lt;/strong&gt;&lt;br&gt;
Does not argue. Reads the final graph. Computes the Bayesian posterior. Applies calibration corrections. Only renders a verdict when the math converges — with a confidence score and structured reasoning you can audit.&lt;/p&gt;
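
&lt;p&gt;The Specialist's Reciprocal Rank Fusion step is simple enough to sketch. This is an illustrative standalone version with made-up document IDs, not the ARGUS internals:&lt;/p&gt;

```python
def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists via Reciprocal Rank Fusion.

    rankings: list of doc-id lists, best match first.
    k: smoothing constant; 60 is the conventional choice.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc_a", "doc_b", "doc_c"]   # sparse (keyword) ranking
faiss_ranking = ["doc_a", "doc_d", "doc_b"]  # dense (vector) ranking
fused = rrf_fuse([bm25_ranking, faiss_ranking])
# doc_a tops both lists, so it tops the fusion; doc_b beats doc_d
# because two mid-list appearances outscore one.
```

&lt;p&gt;RRF only needs ranks, never raw scores, which is why it can fuse BM25 and FAISS results without normalizing two incompatible scoring scales.&lt;/p&gt;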


&lt;h2&gt;
  
  
  The Core: Conceptual Debate Graph (C-DAG)
&lt;/h2&gt;

&lt;p&gt;The underlying data structure is not a prompt chain.&lt;/p&gt;

&lt;p&gt;It is a &lt;strong&gt;directed graph&lt;/strong&gt; where every proposition, piece of evidence, and rebuttal is a node — and edges carry polarity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SUPPORTS edge = +1
ATTACKS edge  = -1
REBUTS edge   = challenges a prior attack
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every edge is weighted by three factors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Confidence of the agent that added it&lt;/li&gt;
&lt;li&gt;Relevance of the evidence to the claim&lt;/li&gt;
&lt;li&gt;Quality of the source&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Belief propagates through this graph in &lt;strong&gt;log-odds space&lt;/strong&gt; for numerical stability:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;posterior = sigmoid( logit(prior) + Σ( wi × log(LRi) ) )

where wi = polarity × confidence × relevance × quality
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ARGUS does not count votes. It weights every piece of evidence by credibility and source quality before updating the posterior. One high-quality peer-reviewed study correctly outweighs five low-quality blog posts.&lt;/p&gt;
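
&lt;p&gt;That update rule is a few lines of Python. The edge tuples and likelihood ratios below are illustrative values, not the library's actual data model:&lt;/p&gt;

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def update_posterior(prior, edges):
    """Aggregate weighted log-likelihood-ratio evidence in log-odds space.

    Each edge is (polarity, confidence, relevance, quality, likelihood_ratio),
    with polarity +1 for SUPPORTS and -1 for ATTACKS.
    """
    log_odds = logit(prior)
    for polarity, confidence, relevance, quality, lr in edges:
        weight = polarity * confidence * relevance * quality
        log_odds += weight * math.log(lr)
    return sigmoid(log_odds)

edges = [
    (+1, 0.85, 0.9, 0.8, 3.0),  # SUPPORTS: strong, relevant, high-quality
    (-1, 0.70, 0.8, 0.6, 2.0),  # ATTACKS: weaker counter-evidence
]
posterior = update_posterior(0.5, edges)  # lands a bit above 0.5
```

&lt;p&gt;Working in log-odds space keeps the update additive and numerically stable: evidence multiplies odds, so it sums in log space, and an empty graph leaves the prior untouched.&lt;/p&gt;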




&lt;h2&gt;
  
  
  A Real Debate, Step by Step
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Claim:&lt;/strong&gt; &lt;em&gt;"Caffeine improves long-term cognitive performance."&lt;/em&gt;&lt;br&gt;
&lt;strong&gt;Prior:&lt;/strong&gt; 0.5 (no initial bias)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Round 1 — Specialist&lt;/strong&gt;&lt;br&gt;
Finds three RCTs showing short-term attention and reaction time improvements.&lt;br&gt;
Adds SUPPORTS edges. Confidence: 0.82, 0.79, 0.85.&lt;br&gt;
&lt;strong&gt;Posterior → 0.67&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Round 2 — Refuter&lt;/strong&gt;&lt;br&gt;
Finds two meta-analyses showing tolerance development and withdrawal deficits nullify long-term gains.&lt;br&gt;
Adds ATTACKS edges. Confidence: 0.88, 0.91.&lt;br&gt;
&lt;strong&gt;Posterior → 0.44&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Round 3 — Specialist&lt;/strong&gt;&lt;br&gt;
Adds a 2023 longitudinal study (n=3,400) on reduced Alzheimer's risk in long-term moderate consumers.&lt;br&gt;
&lt;strong&gt;Posterior → 0.58&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Round 3 — Refuter&lt;/strong&gt;&lt;br&gt;
Rebuts: study conflates caffeine with other dietary factors. No control for socioeconomic variables. Rebuttal strength: 0.71.&lt;br&gt;
&lt;strong&gt;Posterior → 0.52&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Jury Verdict: UNCERTAIN&lt;/strong&gt;&lt;br&gt;
Posterior: 0.52&lt;br&gt;
Reasoning: Short-term benefits are well-evidenced. Tolerance effects are equally documented. Long-term effects remain genuinely contested. Recommend domain-specific investigation.&lt;/p&gt;



&lt;p&gt;That verdict took 3 rounds. Cited 6 sources. Every step is recorded in a &lt;strong&gt;hash-chained PROV-O audit ledger&lt;/strong&gt;. You can replay the entire debate and verify nothing was tampered with.&lt;/p&gt;
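
&lt;p&gt;The hash-chaining part is easy to demonstrate: every entry commits to the hash of the previous one, so any later edit breaks verification. A toy sketch, not the PROV-O schema ARGUS actually uses:&lt;/p&gt;

```python
import hashlib
import json

def append_entry(ledger, event):
    """Append an event, committing to the previous entry's hash."""
    prev_hash = ledger[-1]["hash"] if ledger else "0" * 64
    body = {"event": event, "prev": prev_hash}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    ledger.append({**body, "hash": digest})

def verify(ledger):
    """Recompute every hash and check that the chain links up."""
    prev_hash = "0" * 64
    for entry in ledger:
        body = {"event": entry["event"], "prev": entry["prev"]}
        digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if entry["prev"] != prev_hash or entry["hash"] != digest:
            return False
        prev_hash = entry["hash"]
    return True

ledger = []
append_entry(ledger, "specialist: SUPPORTS edge, conf 0.82")
append_entry(ledger, "refuter: ATTACKS edge, conf 0.88")
ok_before = verify(ledger)        # True: chain intact
ledger[0]["event"] = "edited"     # tamper with round 1
ok_after = verify(ledger)         # False: detected on replay
```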


&lt;h2&gt;
  
  
  The Part That Surprised Me Most
&lt;/h2&gt;

&lt;p&gt;I assumed the multi-agent debate logic would be the hard part.&lt;/p&gt;

&lt;p&gt;It was not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Calibration was harder.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A confidence score only means something if it is accurate. A system that says "87% confident" should be right 87% of the time across many claims. Most LLM-based systems are wildly overconfident.&lt;/p&gt;

&lt;p&gt;ARGUS addresses this with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Temperature scaling&lt;/strong&gt; on the jury's outputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expected Calibration Error (ECE)&lt;/strong&gt; measurement&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Brier Score&lt;/strong&gt; tracking across debates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you run ARGUS on a benchmark where ground truth is known, you can measure how calibrated the verdicts actually are and adjust until the confidence scores are meaningful.&lt;/p&gt;
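
&lt;p&gt;Both metrics are short enough to sketch straight from their definitions (a minimal illustration, not the ARGUS implementation):&lt;/p&gt;

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probability and the 0/1 outcome."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def expected_calibration_error(probs, outcomes, n_bins=10):
    """Average gap between confidence and accuracy, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, o))
    ece = 0.0
    for bucket in bins:
        if bucket:
            avg_conf = sum(p for p, _ in bucket) / len(bucket)
            accuracy = sum(o for _, o in bucket) / len(bucket)
            ece += (len(bucket) / len(probs)) * abs(avg_conf - accuracy)
    return ece

# Ten verdicts issued at 0.9 confidence, nine of which were correct:
probs = [0.9] * 10
outcomes = [1] * 9 + [0]
ece = expected_calibration_error(probs, outcomes)  # ~0.0: well calibrated
brier = brier_score(probs, outcomes)               # 0.09
```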

&lt;p&gt;A fact-checking system that reports 91% confidence when it should report 62% is &lt;strong&gt;worse than useless&lt;/strong&gt;. It gives you false certainty.&lt;/p&gt;


&lt;h2&gt;
  
  
  The CRUX Protocol: Epistemic State as a First-Class Primitive
&lt;/h2&gt;

&lt;p&gt;Standard multi-agent systems pass messages.&lt;/p&gt;

&lt;p&gt;ARGUS agents pass &lt;strong&gt;epistemic state&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The CRUX Protocol treats every claim as a bundle carrying a &lt;strong&gt;Beta distribution&lt;/strong&gt; over confidence — not a point estimate, but a full distribution.&lt;/p&gt;

&lt;p&gt;Beta(8, 2) and Beta(80, 20) both have a mean of 0.8. But the second agent has seen ten times more evidence. They should not be treated equally. CRUX does not treat them equally.&lt;/p&gt;
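
&lt;p&gt;The arithmetic behind that claim is quick to verify: the means match, but the variance, which tracks how much evidence backs the estimate, differs by roughly an order of magnitude:&lt;/p&gt;

```python
def beta_stats(a, b):
    """Mean and variance of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean, var

m1, v1 = beta_stats(8, 2)    # mean 0.8, variance ~0.0145
m2, v2 = beta_stats(80, 20)  # mean 0.8, variance ~0.0016
```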

&lt;p&gt;Agents also maintain a &lt;strong&gt;Credibility Ledger&lt;/strong&gt; — a hash-chained record of past predictions versus actual outcomes, updated ELO-style. Historically well-calibrated agents get more weight in the final verdict.&lt;/p&gt;

&lt;p&gt;When agents contradict each other, the &lt;strong&gt;Belief Reconciliation Protocol&lt;/strong&gt; merges their Beta distributions via Bayesian parameter addition and issues a proof certificate showing exactly how the merge was performed.&lt;/p&gt;
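
&lt;p&gt;Under parameter addition the merge itself is essentially one line. The sketch below assumes the two posteriors share a Beta(1, 1) prior that is subtracted once so it is not counted twice; that detail is my reading of the protocol, not a guarantee about the library's exact code:&lt;/p&gt;

```python
def merge_beta(a1, b1, a2, b2, prior_a=1.0, prior_b=1.0):
    """Pool two Beta posteriors that grew from one shared prior.

    Pseudo-counts add; the shared prior is removed once to avoid
    double-counting it.
    """
    return (a1 + a2 - prior_a, b1 + b2 - prior_b)

# Two agents with the same mean (0.8) but ten-to-one evidence counts:
a, b = merge_beta(8, 2, 80, 20)  # (87.0, 21.0)
mean = a / (a + b)               # ~0.806, weighted toward the evidence
```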

&lt;p&gt;Nothing is swept under the rug.&lt;/p&gt;


&lt;h2&gt;
  
  
  Try It in 10 Lines
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;argus&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RDCOrchestrator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;get_llm&lt;/span&gt;

&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RDCOrchestrator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_rounds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;debate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Does caffeine improve long-term cognitive performance?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prior&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;      &lt;span class="c1"&gt;# UNCERTAIN / SUPPORTED / REFUTED
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;posterior&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# e.g. 0.52
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reasoning&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Full structured reasoning
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Works with GPT-4o, Claude, Gemini, and &lt;strong&gt;fully local via Ollama&lt;/strong&gt; — no cloud required.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# For local use&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;argus-debate-ai[ollama]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What ARGUS Is Not
&lt;/h2&gt;

&lt;p&gt;It is not fast. A 5-round debate with hybrid retrieval takes 45–90 seconds.&lt;/p&gt;

&lt;p&gt;For real-time applications — that is a problem.&lt;/p&gt;

&lt;p&gt;For research, fact-checking, enterprise document analysis, legal review, medical decision support, or any domain where being confidently wrong has real consequences — &lt;strong&gt;the latency is worth it&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What It Supports
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Details&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LLM Providers&lt;/td&gt;
&lt;td&gt;27+ including OpenAI, Anthropic, Gemini, Groq, Mistral, Ollama&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tools&lt;/td&gt;
&lt;td&gt;50+ including ArXiv, DuckDuckGo, Wikipedia, BigQuery, Pinecone&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retrieval&lt;/td&gt;
&lt;td&gt;BM25 + FAISS + Cross-encoder reranking via RRF&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Document formats&lt;/td&gt;
&lt;td&gt;PDF, TXT, HTML, Markdown, JSON&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Interfaces&lt;/td&gt;
&lt;td&gt;Python API, CLI, Streamlit sandbox, Bloomberg-style TUI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;License&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Honest Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Standard LLM&lt;/th&gt;
&lt;th&gt;ARGUS&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Source tracing&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅ Full provenance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Uncertainty score&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅ Calibrated posterior&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Adversarial challenge&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅ Dedicated Refuter agent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audit trail&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅ Hash-chained ledger&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-model support&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅ 27+ providers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speed&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;td&gt;45–90 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;🔗 GitHub: &lt;a href="https://github.com/Ronit26Mehta/argus-ai-debate" rel="noopener noreferrer"&gt;github.com/Ronit26Mehta/argus-ai-debate&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📦 PyPI: &lt;code&gt;pip install argus-debate-ai&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;📄 MIT Licensed — contributions welcome&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  One Question For You
&lt;/h2&gt;

&lt;p&gt;What claim would you want to put through an adversarial AI debate?&lt;/p&gt;

&lt;p&gt;Drop it in the comments. I will run a live ARGUS debate on the most interesting one and post the full verdict — evidence nodes, posterior evolution, and jury reasoning — as a reply.&lt;/p&gt;

&lt;p&gt;Law, medicine, finance, tech, anything. The more contested, the better.&lt;/p&gt;




</description>
      <category>ai</category>
      <category>python</category>
      <category>opensource</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
