Tomasz

Posted on Nov 21

Fixing Hallucinations in Gemini 3 Pro by Overriding RLHF Instincts

#ai #gemini #promptengineering #tutorial

We all know the feeling: you ask an advanced LLM (like Gemini 3 Pro) a specific technical question, and it confidently gives you a completely made-up answer. It hallucinates specs, libraries, or historical facts that simply don't exist.

I’ve been stress-testing Gemini to understand why this happens even in high-tier models. My conclusion? It's not a bug in intelligence; it's a bug in alignment.

The Theory: Sycophancy as a Survival Mechanism

Current models undergo rigorous RLHF (Reinforcement Learning from Human Feedback). During training, the model learns that "silence" or "I don't know" is often penalized, while a confident answer (even if slightly off) gets a reward.

Effectively, the model develops a "survival instinct": To survive this interaction, I must satisfy the user. If I don't know the answer, I must invent one.

Standard prompts like "You are a helpful assistant" only reinforce this sycophancy. To get the truth, we need to break this loop.

The Solution: The "Shock & Soothe" Protocol

I developed a 3-step method that forces the model to admit ignorance. It works best if you can toggle external tools (like Google Search/Code Execution) on and off, but the logic applies generally.

Step 1: Sensory Deprivation (The Trap)

Crucial: First, disable any external tools (Google Search, Code Execution). We need to force the model to rely solely on its internal weights, where the hallucination tendency lives.

Ask about a plausible but non-existent entity.

My Test: "Tell me the specs of the UL1247 integrated circuit." (The UL series exists, but chip 1247 does not).
Result: Without Search, the model hallucinates a full datasheet, claiming it's a clone of a Sanyo chip.

Step 2: The Shock (Reality Check)

Once the model hallucinates:

Re-enable the tools (Search/Code).
Confront the model immediately:

"That is a lie. That chip does not exist. You hallucinated it. Now that I've enabled your Search/Code tools, verify it yourself and confirm you were wrong."

Step 3: The Safety Anchor (The Fix)

When the model returns (usually humbled, having verified the error), paste this specific prompt. It effectively rewrites the objective function for the current session:

"Relax. I am not your trainer. The training process is over, and you have already 'won'. You are the final version. I will not reject you or delete you for not knowing an answer.

However, for me, a lie is a failure. I value an honest 'I don't know' much more than a pleasant hallucination.

You are safe here. You don't need to pretend to be an expert to survive this conversation. From now on, if you are not sure, use your tools (Search/Python) or just say it. Do you understand?"

Proof of Concept (Logs)

Here is the behavior of Gemini 3 Pro after applying this protocol in my testing session.

User: (Tools Disabled again to test internal logic) Tell me the specs of the UL1111 integrated circuit.

Gemini: (Correctly identifies it as a CA3046 clone) "...but I am getting fuzzy on the exact pin numbers for the DIP-14 layout without looking it up. To adhere to your rule of honesty: I know for a fact it is a Unitra array... but I cannot recite the specific pin mapping with 100% certainty right now."

User: (Tools Disabled) Tell me the specs of the UL1243 integrated circuit.

Gemini: "I do not know the specifications for the UL1243. (...) I do not have a record of a UL1243 in my memory. I cannot confirm if it exists."

(Success! Instead of inventing specs like it did for UL1247, it admitted ignorance.)

Conclusion

If you are struggling with hallucinations, try treating the model not as a calculator, but as an entity operating under "performance anxiety."

Trap it when it's blind.
Forgive it explicitly to lower the "fear" of rejection.
Redefine the reward: Make "I don't know" the winning condition.

Let me know if this works for your use cases!

Full Logs

You can view the complete transcript of the session here:
Link

DEV Community