<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ahmed El Otmani</title>
    <description>The latest articles on DEV Community by Ahmed El Otmani (@ahmed_elotmani_4181cb63c).</description>
    <link>https://dev.to/ahmed_elotmani_4181cb63c</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2454907%2F11e701e9-f89e-4e61-a88d-e28c45724e4b.jpg</url>
      <title>DEV Community: Ahmed El Otmani</title>
      <link>https://dev.to/ahmed_elotmani_4181cb63c</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ahmed_elotmani_4181cb63c"/>
    <language>en</language>
    <item>
      <title>AI Is Bad at Disagreeing. I Spent Weeks Trying to Fix That.</title>
      <dc:creator>Ahmed El Otmani</dc:creator>
      <pubDate>Sun, 19 Apr 2026 00:43:11 +0000</pubDate>
      <link>https://dev.to/ahmed_elotmani_4181cb63c/ai-is-bad-at-disagreeing-i-spent-weeks-trying-to-fix-that-2an2</link>
      <guid>https://dev.to/ahmed_elotmani_4181cb63c/ai-is-bad-at-disagreeing-i-spent-weeks-trying-to-fix-that-2an2</guid>
      <description>&lt;p&gt;A few months ago I started building a tool that generates debate videos between two brands. The idea was simple: pick two rivals, pick a topic, get a short video where they actually go at it.&lt;/p&gt;

&lt;p&gt;The first version was terrible. Not in the way you'd expect.&lt;/p&gt;

&lt;p&gt;The voices sounded fine. The video quality was fine. The problem was that the two AIs refused to disagree with each other. Coke would make a point, and Pepsi would respond with something like: &lt;em&gt;"That's a great perspective, and while I understand where you're coming from, I think we bring a different but equally valid point of view."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Two brands. One debate. Zero conflict.&lt;/p&gt;

&lt;p&gt;I had built the world's most polite argument generator.&lt;/p&gt;

&lt;h2&gt;Why AI is bad at arguing&lt;/h2&gt;

&lt;p&gt;Modern language models are trained, heavily, to be agreeable. RLHF (reinforcement learning from human feedback), the process that teaches them to be helpful, also teaches them to defuse. When you ask an AI to take a position, it hedges. When you ask two AIs to disagree, they find common ground. The training goal is a useful assistant, not a compelling adversary.&lt;/p&gt;

&lt;p&gt;You can see this in ChatGPT, Claude, Gemini — any of them. Ask them to argue with each other and within two turns they're writing joint statements.&lt;/p&gt;

&lt;p&gt;This is a feature for assistants. It's a bug for entertainment.&lt;/p&gt;

&lt;h2&gt;What doesn't work&lt;/h2&gt;

&lt;p&gt;My first instinct was prompt engineering. Tell the model: &lt;em&gt;"You are aggressive. You never agree. You attack the other side's points."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It works for about one exchange. Then the model drifts back to consensus. The training is stronger than the instruction. You end up with a model that &lt;em&gt;opens&lt;/em&gt; aggressively and then quietly de-escalates. Watch any AI debate long enough and you'll see this — the first line is sharp, the fifth line is "we're really saying the same thing."&lt;/p&gt;

&lt;p&gt;The second instinct was to make it worse: &lt;em&gt;"You hate the other side. You think they are fundamentally wrong."&lt;/em&gt; This produces genuinely offensive output for one line, then the model course-corrects even harder into olive-branch territory. It's like it feels guilty.&lt;/p&gt;

&lt;p&gt;The third instinct, which a lot of people try, is to raise temperature and hope randomness produces conflict. What it actually produces is nonsense. Random isn't disagreement. Random is incoherence.&lt;/p&gt;

&lt;h2&gt;What does work&lt;/h2&gt;

&lt;p&gt;The thing that finally worked was architectural, not prompted.&lt;/p&gt;

&lt;p&gt;Instead of giving one model "both sides" and asking it to generate a debate, I split the two debaters into completely separate contexts. Each side never sees the other's instructions. Each side has its own persona, its own priors, its own goals, and — critically — never sees the "debate framing." As far as each side knows, it's not in a debate. It's just answering a question from a specific worldview.&lt;/p&gt;

&lt;p&gt;Then I orchestrate the turns externally. Side A speaks. Side B gets Side A's transcript as &lt;em&gt;a statement made by a rival that is wrong&lt;/em&gt;, and is asked to respond in character. Not "debate them" — respond to a rival's wrong statement.&lt;/p&gt;
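&lt;p&gt;As a rough sketch of that external orchestration (function and field names like &lt;code&gt;generate_reply&lt;/code&gt; are placeholders for illustration, not the project's actual code):&lt;/p&gt;

```python
# Sketch of the split-context orchestration described above. The model
# call is stubbed out as generate_reply; in a real pipeline it would hit
# an LLM API with the system prompt from build_context.

def build_context(persona, rival_statement):
    # Each side only ever sees its own persona plus the rival's last
    # statement, framed as a wrong claim to be rebutted. It never sees
    # the word "debate" or the other side's instructions.
    system = (
        f"You are {persona['name']}. {persona['worldview']} "
        "A rival just made a statement you believe is wrong. "
        "Respond in character and defend your position."
    )
    return {"system": system, "rival_statement": rival_statement}

def run_exchange(side_a, side_b, opening, turns, generate_reply):
    # Alternate turns externally; each reply only ever responds to the
    # previous speaker's transcript line.
    transcript = [(side_a["name"], opening)]
    speakers = [side_b, side_a]  # B responds to the opening, then alternate
    for i in range(turns):
        speaker = speakers[i % 2]
        last_speaker, last_line = transcript[-1]
        ctx = build_context(speaker, last_line)
        transcript.append((speaker["name"], generate_reply(ctx)))
    return transcript
```

&lt;p&gt;The point of the structure: neither &lt;code&gt;build_context&lt;/code&gt; call ever contains both personas, so there is no shared framing for the model to reconcile.&lt;/p&gt;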

&lt;p&gt;That framing — "the other side is wrong, here is what they said, your turn" — does what "please disagree" can't. It sidesteps the agreeable training because the model isn't being asked to argue. It's being asked to defend its own position against an attack. That's a much more natural posture for a language model.&lt;/p&gt;

&lt;p&gt;The disagreement becomes real, because neither side knows the other exists as a negotiating partner. Each one thinks it's delivering a monologue.&lt;/p&gt;

&lt;h2&gt;The persona problem&lt;/h2&gt;

&lt;p&gt;Even with split contexts, another problem emerged: both sides sounded identical.&lt;/p&gt;

&lt;p&gt;Two AIs arguing, even aggressively, tend to use the same sentence structures, the same rhetorical moves, the same vocabulary. It's all one model underneath. Without distinct voices, even a good disagreement reads like one person talking to themselves.&lt;/p&gt;

&lt;p&gt;The fix was to build character sheets deeper than "you are brand X." Each side got a tone (sarcastic, earnest, condescending), a rhetorical style (data-driven, emotional, dismissive), a set of taboo moves (never concede, never apologize, never ask questions), and a few character tics (favorite phrases, what they always come back to). This is more like writing a character in fiction than prompting an assistant.&lt;/p&gt;
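&lt;p&gt;A character sheet along these lines might look like the following sketch; the field names and example values are illustrative, not the project's actual schema:&lt;/p&gt;

```python
# Hypothetical character sheet, deeper than "you are brand X": a tone,
# a rhetorical style, hard taboo moves, and a few character tics.

COKE_SHEET = {
    "name": "Coke",
    "tone": "earnest, nostalgic",
    "rhetoric": "emotional, heritage-driven",
    "taboo_moves": ["never concede", "never apologize", "never ask questions"],
    "tics": ["always comes back to being the original", "calls rivals imitators"],
}

def render_system_prompt(sheet):
    # Turn the sheet into a system prompt. The taboos are phrased as
    # hard rules because softer phrasing drifts back toward consensus.
    taboos = "; ".join(sheet["taboo_moves"])
    tics = "; ".join(sheet["tics"])
    return (
        f"You are {sheet['name']}. Tone: {sheet['tone']}. "
        f"Rhetorical style: {sheet['rhetoric']}. "
        f"Hard rules you never break: {taboos}. "
        f"Character tics: {tics}."
    )
```

&lt;p&gt;Each side gets its own sheet, rendered into its own system prompt, so the two outputs stop sharing a voice even though they share a model.&lt;/p&gt;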

&lt;p&gt;Once the personas were real, the debates felt like debates. Not because the AI was smarter — because the two outputs stopped being the same AI in two costumes.&lt;/p&gt;

&lt;h2&gt;Pacing, interruptions, and why a debate isn't a podcast&lt;/h2&gt;

&lt;p&gt;The last thing I got wrong, for a long time, was pacing.&lt;/p&gt;

&lt;p&gt;My early debates sounded like podcasts. Both sides spoke in full paragraphs, finished their points, then the other side responded. It was technically a debate. It was also unwatchable.&lt;/p&gt;

&lt;p&gt;Real debates are messier. People interrupt. They trail off. They restart. They repeat themselves when they're heated. They don't finish sentences when they're confident the other side already gets it.&lt;/p&gt;

&lt;p&gt;I had to add all of this manually. Interruption tokens. Trail-offs. Intentional repetition. Incomplete sentences. None of it comes naturally from a language model — models are trained to complete things, not abandon them.&lt;/p&gt;
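&lt;p&gt;A minimal sketch of that kind of post-processing pass. The marker conventions here (a double dash for an interruption cut, an ellipsis for a trail-off) are invented for illustration; a real pipeline would map them to cuts and timing in the TTS audio:&lt;/p&gt;

```python
import random

def add_pacing(transcript, seed=0):
    # Inject human pacing artifacts into a finished transcript:
    # trail-offs, and heated repetition of a line's opening words.
    # Seeded so the output is reproducible for a given debate.
    rng = random.Random(seed)
    out = []
    for speaker, text in transcript:
        words = text.split()
        roll = rng.randrange(4)
        if roll == 0 and words[6:]:
            # Trail off: a confident speaker abandons the sentence
            # (only if there are at least seven words to cut).
            text = " ".join(words[:6]) + "..."
        elif roll == 1 and words[3:]:
            # Heated repetition: restart with the opening doubled.
            opener = " ".join(words[:3])
            text = opener + " -- " + opener + " " + " ".join(words[3:])
        out.append((speaker, text))
    return out
```

&lt;p&gt;The important design choice is that this runs &lt;em&gt;after&lt;/em&gt; generation: the model still gets to complete its sentences, and the abandonment is imposed on top, which is exactly the behavior the model itself won't produce.&lt;/p&gt;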

&lt;p&gt;Once the pacing felt human, the debates stopped sounding like AI, even when the content was obviously AI-written. Pacing is half the performance.&lt;/p&gt;

&lt;h2&gt;What this taught me about LLMs&lt;/h2&gt;

&lt;p&gt;A few things, in descending order of how much they've stuck with me:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Models resist what they're trained against, no matter what you prompt.&lt;/strong&gt; If you need behavior that contradicts RLHF, prompting alone won't get you there. You have to restructure the problem so the model isn't being asked to violate its training — it's being asked to do something adjacent that produces the behavior you want as a side effect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Character is architectural, not prompted.&lt;/strong&gt; You can't "be" a character by being told to be. You become a character by having a different context, different goals, different constraints than everyone else in the scene.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The interesting problems in AI products aren't the AI problems.&lt;/strong&gt; Everyone can wire an LLM to a video generator. The thing that made this project actually work was understanding debate structure, pacing, persona construction, and interruption dynamics. The AI was the easy part.&lt;/p&gt;

&lt;h2&gt;The tool&lt;/h2&gt;

&lt;p&gt;I ended up shipping the thing — it's called DebaterX, it lives at &lt;a href="https://www.debaterx.app" rel="noopener noreferrer"&gt;debaterx.app&lt;/a&gt; — and it does roughly what I set out to build. You upload two brands or mascots, pick a topic, pick a tone, and get a short debate video back. Ronald McDonald vs the Burger King on fries. Netflix vs YouTube on who killed TV. Coke vs Pepsi on whatever you want them to fight about.&lt;/p&gt;

&lt;p&gt;Most of the engineering went into the problems above, not the plumbing. That's true of most AI products now, I think — the plumbing is commodity, and the craft is everywhere else.&lt;/p&gt;

&lt;p&gt;If you're building something in the generative space and running into the "my AI won't disagree / won't commit / won't take a position" wall, the answer is almost never a better prompt. It's usually a better architecture around a worse prompt.&lt;/p&gt;

&lt;p&gt;Which, honestly, is a more interesting place to be.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>showdev</category>
      <category>sideprojects</category>
    </item>
  </channel>
</rss>
