Chris Technology

Posted on • Originally published at christechno.com

✅ I Changed Just One Line and My AI Bot Suddenly Stopped Hallucinating

For weeks, my AI bot had one annoying habit: it kept talking too much.

It didn’t matter if I used a clean prompt, a strict system message, or even a carefully crafted JSON schema — the model still drifted into weird territory. It added extra sentences, invented fields, threw in emojis I never asked for, and sometimes started explaining things nobody needed explained. In this article, I’ll share my experience with this AI bot hallucination fix.

I thought this was just “normal AI behavior,” kind of like how devices sometimes get warm after updates, or how generative models like GPT occasionally drift. Even researchers have flagged this as a common issue (OpenAI themselves explain hallucination risks quite clearly in their documentation).

I even wrote about AI catching unexpected coding mistakes in another post, and it felt like a similar situation — if you’re curious, you can check that story here.

But then something surprising happened.

I changed one line.

Literally one.

And suddenly, my bot stopped hallucinating.

✅ The One Line That Changed Everything

Here it is:

max_output_tokens: 150

That’s it.

No fancy trick. No complicated rewrite. No multi-prompt architecture.

Just limiting how much the model is allowed to say.

It felt too simple to be true, but immediately the responses became:

  • shorter
  • more predictable
  • less “creative”
  • more tightly aligned with the schema
  • less likely to drift into storytelling or extra explanations

It’s almost like the model didn’t have enough “space” to hallucinate anymore.

I realized something important:

Most hallucinations happen at the tail end of the generation.

When the model starts to run out of structured things to say, it begins adding filler.

Cut the tail → cut the hallucination.

Even sites like Vercel mention similar behavior when explaining output streaming and token control in their AI SDK docs (their explanation of token boundaries is very useful, and you can check it here).

✅ Before vs After (Real Example)

Before (no limit):

{
  "name": "Sarah",
  "active": true,
  "comment": "By the way, here's a random tip about JavaScript closures..."
}

After (max_output_tokens: 150):

{
  "name": "Sarah",
  "active": true
}

No drama.

No random tips.

No unexpected motivational quotes.

Just the data I asked for.

✅ Why This Works (Non-Scientific Explanation)

I’m not going to pretend I fully understand token prediction mathematics, but here’s what I noticed in practice:

  • The model tends to hallucinate after giving the correct answer.
  • Longer generations increase the chance of drifting.
  • The more “space” you give it, the more likely it is to add something you didn’t request.
  • Limiting output creates fewer opportunities to go off-track.

Think of it like telling a friend:

“Just answer the question. Don’t start a whole lecture.”

For AI, max_output_tokens is that boundary.

Even researchers studying LLM behavior (see Stanford’s discussion on LLM “drift” tendencies here) confirm this pattern: long outputs → higher hallucination risk.

✅ Additional Fixes That Helped (but not as much as the one line)

Although the one-line fix made the biggest difference, a few supporting tweaks improved consistency even more.

✅ Lower the temperature

I usually go from 0.7 → 0.2 or 0.0 when I want strict logic.

MDN even explains randomness in model sampling quite well in their AI fundamentals page, which you can peek at here.

✅ Add a simple system message

Nothing fancy, just:

“Answer only with valid JSON. Do not add explanations.”
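
In code, that just means putting the rule in the system slot of the messages array. Here’s a minimal sketch of what I mean (the variable name and the user prompt are placeholders I made up for illustration):

const messages = [
  { role: "system", content: "Answer only with valid JSON. Do not add explanations." },
  { role: "user", content: "Extract the name and active status from: 'Sarah logged in today.'" }
];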

✅ Add one clarifying example

A single “example response” works better than a thousand instructions.
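
Concretely, I just drop a one-shot user/assistant pair into the same messages array before the real request. Again, this is only a rough sketch with made-up placeholder values, not the exact prompts from my bot:

const messages = [
  { role: "system", content: "Answer only with valid JSON. Do not add explanations." },
  // one-shot example: show the model exactly what a good answer looks like
  { role: "user", content: "Extract the name and active status from: 'Tom just signed out.'" },
  { role: "assistant", content: '{"name": "Tom", "active": false}' },
  // the real request
  { role: "user", content: "Extract the name and active status from: 'Sarah logged in today.'" }
];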

But still — the biggest jump came from reducing the token output.

✅ The Node.js Snippet I Ended Up Using

import OpenAI from "openai";

const client = new OpenAI();

// Note: the Chat Completions API calls this cap max_tokens (max_completion_tokens on
// newer models); the newer Responses API names the same setting max_output_tokens.
const response = await client.chat.completions.create({
  model: "gpt-4o",
  messages,
  max_tokens: 150,
  temperature: 0.2,
  response_format: { type: "json_object" }
});

Clean, short, predictable output every time.
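
One small guard I added on top (my own addition, assuming the standard Chat Completions response shape): if the cap is too tight, the model can get cut off mid-JSON, so it’s worth checking finish_reason and parsing defensively.

const choice = response.choices[0];

// "length" means the model hit the token cap before it finished
if (choice.finish_reason === "length") {
  console.warn("Output was truncated; consider raising the limit slightly.");
}

let data = null;
try {
  data = JSON.parse(choice.message.content);
} catch (err) {
  // truncated or malformed JSON: retry, raise the cap, or fall back to a default
}

If parsing fails, it’s usually a sign the limit was a little too aggressive for that particular payload.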

Node.js’ own docs also emphasize the importance of predictable output sizes when working with streams and APIs (you can read their guidance on data handling here) — it’s funny how those ideas indirectly apply even to LLMs.

✅ A Small Side Note — On-Device Models Also Help

I’ve been playing with on-device AI models lately, and they behave a bit differently.

If you’re into topics like this, I shared some thoughts on local LLMs and their future in another post, which you can read here.

On-device models tend to hallucinate less during small tasks because they’re designed with tighter constraints — which makes sense after seeing how max_output_tokens behaves.

✅ The Curious Part: This Fix Is Almost Never Mentioned

I’ve read dozens of articles about fixing AI bot hallucinations, and most of them talk about:

  • better system prompts
  • retrieval augmentation
  • chain of thought
  • role separation
  • specialized models
  • temperature tuning

All useful, but almost no one talks about simply controlling the output window.

It’s the equivalent of preventing someone from going off-topic by saying:

“You have 30 seconds. Go.”

Sometimes the simplest constraint is the most effective one.

Even Wired had a great piece explaining how LLMs behave unpredictably when generating long sequences (you can skim it here) — and it lines up with what I experienced firsthand.

✅ Final Thoughts

It’s funny how many of my programming discoveries come from small accidents like this — whether it’s AI debugging or small logic quirks I write about, like that weird JavaScript behavior between 'Hello' + 1 + 2 and 1 + 2 + 'Hello'.

We often assume big problems require big fixes.

But sometimes…

It’s just one line.

And it works.

By the way, if you enjoy experimental tech stories, I’ve also written about macOS, iOS releases, and laptop comparisons — including why my MacBook Air behaved differently than an ASUS OLED laptop in real-world usage, which you can find here.

Or if you’re curious about where Apple is heading with their updates, my writeup on iOS 26 during WWDC might be interesting, available here.

For now, though, just try setting max_output_tokens (or whatever your API calls its output-token cap) and see whether this AI bot hallucination fix works for you too.

You might be surprised by how much calmer your AI becomes with just one line.

This article was originally published on my website:

https://christechno.com/2025/11/10/one-line-ai-bot-hallucination-fix/
