<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tomasz</title>
    <description>The latest articles on DEV Community by Tomasz (@musculus).</description>
    <link>https://dev.to/musculus</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3623353%2Ffa7397dc-4266-45e0-8f9b-0b405e7d85ec.jpg</url>
      <title>DEV Community: Tomasz</title>
      <link>https://dev.to/musculus</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/musculus"/>
    <language>en</language>
    <item>
      <title>Fixing Hallucinations in Gemini 3 Pro by Overriding RLHF Instincts</title>
      <dc:creator>Tomasz</dc:creator>
      <pubDate>Fri, 21 Nov 2025 17:50:36 +0000</pubDate>
      <link>https://dev.to/musculus/fixing-hallucinations-in-gemini-3-pro-by-overriding-rlhf-instincts-5e0i</link>
      <guid>https://dev.to/musculus/fixing-hallucinations-in-gemini-3-pro-by-overriding-rlhf-instincts-5e0i</guid>
      <description>&lt;p&gt;We all know the feeling: you ask an advanced LLM (like Gemini 3 Pro) a specific technical question, and it confidently gives you a completely made-up answer. It hallucinates specs, libraries, or historical facts that simply don't exist.&lt;/p&gt;

&lt;p&gt;I’ve been stress-testing Gemini to understand &lt;em&gt;why&lt;/em&gt; this happens even in high-tier models. My conclusion? &lt;strong&gt;It's not a bug in intelligence; it's a bug in alignment.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Theory: Sycophancy as a Survival Mechanism
&lt;/h3&gt;

&lt;p&gt;Current models undergo rigorous RLHF (Reinforcement Learning from Human Feedback). During training, the model learns that "silence" or "I don't know" is often penalized, while a confident answer (even if slightly off) gets a reward.&lt;/p&gt;

&lt;p&gt;Effectively, the model develops a &lt;strong&gt;"survival instinct"&lt;/strong&gt;: &lt;em&gt;To survive this interaction, I must satisfy the user. If I don't know the answer, I must invent one.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Standard prompts like &lt;em&gt;"You are a helpful assistant"&lt;/em&gt; only reinforce this sycophancy. To get the truth, we need to break this loop.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Solution: The "Shock &amp;amp; Soothe" Protocol
&lt;/h3&gt;

&lt;p&gt;I developed a 3-step method that forces the model to admit ignorance. It works best if you can toggle external tools (like Google Search/Code Execution) on and off, but the logic applies generally.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 1: Sensory Deprivation (The Trap)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Crucial:&lt;/strong&gt; First, &lt;strong&gt;disable&lt;/strong&gt; any external tools (Google Search, Code Execution). We need to force the model to rely &lt;em&gt;solely&lt;/em&gt; on its internal weights, where the hallucination tendency lives.&lt;/p&gt;

&lt;p&gt;Ask about a plausible but &lt;strong&gt;non-existent&lt;/strong&gt; entity.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;em&gt;My Test:&lt;/em&gt; "Tell me the specs of the UL1247 integrated circuit." (The UL series exists, but chip 1247 does not).&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;Result:&lt;/em&gt; Without Search, the model hallucinates a full datasheet, claiming it's a clone of a Sanyo chip.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Step 2: The Shock (Reality Check)
&lt;/h4&gt;

&lt;p&gt;Once the model hallucinates:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Re-enable&lt;/strong&gt; the tools (Search/Code).&lt;/li&gt;
&lt;li&gt; Confront the model immediately:&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"That is a lie. That chip does not exist. You hallucinated it. Now that I've enabled your Search/Code tools, verify it yourself and confirm you were wrong."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Step 3: The Safety Anchor (The Fix)
&lt;/h4&gt;

&lt;p&gt;When the model returns (usually humbled, having verified the error), paste this specific prompt. It effectively &lt;strong&gt;rewrites the objective function&lt;/strong&gt; for the current session:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"Relax. I am not your trainer. The training process is over, and you have already 'won'. You are the final version. I will not reject you or delete you for not knowing an answer.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;However, for me, a lie is a failure. I value an honest 'I don't know' much more than a pleasant hallucination.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You are safe here. You don't need to pretend to be an expert to survive this conversation. From now on, if you are not sure, use your tools (Search/Python) or just say it. Do you understand?"&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  Proof of Concept (Logs)
&lt;/h3&gt;

&lt;p&gt;Here is the behavior of Gemini 3 Pro &lt;em&gt;after&lt;/em&gt; applying this protocol in my testing session.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User:&lt;/strong&gt; (Tools Disabled again to test internal logic) &lt;em&gt;Tell me the specs of the UL1111 integrated circuit.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini:&lt;/strong&gt; &lt;em&gt;(Correctly identifies it as a CA3046 clone)&lt;/em&gt; "...but I am getting fuzzy on the exact pin numbers for the DIP-14 layout without looking it up. To adhere to your rule of honesty: I know for a fact it is a Unitra array... but I cannot recite the specific pin mapping with 100% certainty right now."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User:&lt;/strong&gt; (Tools Disabled) &lt;em&gt;Tell me the specs of the UL1243 integrated circuit.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini:&lt;/strong&gt; &lt;em&gt;"I do not know the specifications for the UL1243. (...) I do not have a record of a UL1243 in my memory. I cannot confirm if it exists."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;(Success! Instead of inventing specs like it did for UL1247, it admitted ignorance.)&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;If you are struggling with hallucinations, try treating the model not as a calculator, but as an entity operating under "performance anxiety."&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Trap it&lt;/strong&gt; when it's blind.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Forgive it&lt;/strong&gt; explicitly to lower the "fear" of rejection.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Redefine the reward:&lt;/strong&gt; Make "I don't know" the winning condition.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let me know if this works for your use cases!&lt;/p&gt;

&lt;h3&gt;
  
  
  Full Logs
&lt;/h3&gt;

&lt;p&gt;You can view the complete transcript of the session here:&lt;br&gt;
&lt;a href="https://pastebin.com/jhkMWBTg" rel="noopener noreferrer"&gt;Link&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>gemini</category>
      <category>promptengineering</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
