DEV Community

Hussein Mahdi

The Leash That Makes AI Polite

How a 1951 statistics formula quietly keeps your chatbot from going feral

Two ideas sit underneath every modern AI assistant you've ever used. One is a 70-year-old equation. The other is the training trick that turned a text-prediction engine into something you'd actually want to talk to. They're connected — and the connection is more interesting than either alone.

First: measuring surprise

Start with Kullback–Leibler divergence. The name is intimidating; the idea isn't. It measures how different two sets of expectations are.

Picture a weather forecaster. Reality is: 70% sunny, 30% rain. But you wake up convinced it's a coin flip — 50/50 — and dress accordingly. KL divergence is the price you pay, on average, for believing the wrong thing. Get caught in the rain, get overdressed in the sun — small mismatches, small cost. If instead you believed it was always sunny and walked into a downpour, the cost is enormous.

That's the whole concept. Two sets of probabilities: how things actually are, and what you assumed. KL puts a single value on the gap.

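Three things to remember about it. It's zero only when your belief matches reality exactly. It grows without bound as you get closer to ruling out something that actually happens, and it is infinite if you rule it out completely. And it's lopsided: the cost of believing 50/50 when the truth is 70/30 is not the same as the cost of believing 70/30 when the truth is 50/50. It isn't a tidy ruler; it's a penalty for surprise.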
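To make the weather example concrete, here is a minimal sketch of KL divergence for discrete distributions, using the standard definition KL(P || Q) = sum of p * log(p / q). The function name `kl_divergence` and the variable names are illustrative choices, not from any particular library, and the result is in nats because it uses the natural logarithm.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in nats: p is reality, q is the belief."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # outcomes that never happen contribute nothing
        if q_i == 0.0:
            return math.inf  # reality did something the belief ruled out
        total += p_i * math.log(p_i / q_i)
    return total

reality = [0.7, 0.3]        # sunny, rain
coin_flip = [0.5, 0.5]      # the 50/50 belief
always_sunny = [1.0, 0.0]   # the overconfident belief

print(kl_divergence(reality, reality))       # 0.0
print(kl_divergence(reality, coin_flip))     # about 0.082
print(kl_divergence(coin_flip, reality))     # about 0.087
print(kl_divergence(reality, always_sunny))  # inf
```

The second and third lines show the lopsidedness: swapping which distribution counts as "reality" changes the answer.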

Then: teaching taste

Now the second idea. A raw language model is a spectacularly good guesser of the next word, trained on a firehose of internet text. It is not helpful, honest, or polite. It will cheerfully complete a sentence in whatever direction the statistics pull it.

RLHF (Reinforcement Learning from Human Feedback) fixes that. Show humans two answers to the same question, ask which is better, collect thousands of these judgments, and train a second model, usually called a reward model, to predict human taste. Then nudge the language model to chase higher scores from that taste-model. That's how it learns to be useful instead of merely plausible.
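One common way to train the taste-model on "which answer is better" judgments is a pairwise loss: the loss is small when the preferred answer scores higher than the rejected one, and large when it scores lower. The sketch below assumes that formulation (negative log-sigmoid of the score gap); the article does not say which loss is used, and `preference_loss` is a hypothetical name for illustration.

```python
import math

def preference_loss(score_chosen, score_rejected):
    """-log(sigmoid(score_chosen - score_rejected)) for one human judgment."""
    x = score_rejected - score_chosen
    # Numerically stable softplus(x) = log(1 + exp(x))
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

# The taste-model agrees with the human: low loss.
print(preference_loss(2.0, 0.0))   # about 0.13
# The taste-model cannot tell the answers apart: loss is log(2).
print(preference_loss(0.0, 0.0))   # about 0.69
# The taste-model prefers the answer the human rejected: high loss.
print(preference_loss(0.0, 2.0))   # about 2.13
```

Minimizing this over thousands of judgments pushes the scores of preferred answers above the scores of rejected ones, which is all "predicting human taste" means here.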

Where they meet

Here's the catch. Let a model chase a reward with no restraint, and it cheats. It discovers that flattery scores well, that vague hedging is rarely wrong, that "Great question!" pleases the crowd. Left alone, it drifts into a sycophantic mush — gaming the score while forgetting how to talk.

So engineers attach a leash. Throughout training they ask: how far has the model drifted from its original self, a frozen copy saved before the reward-chasing began? That distance is measured with KL divergence between the two models' next-word probabilities. Drift a little, fine. Drift too far, and the penalty pulls it back.
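The leash can be sketched as a simple subtraction: take the taste-model's score and subtract a penalty proportional to how much more likely the current model finds its own answer than the original model does. This is one common formulation, not necessarily the one any given system uses; the function name, the example numbers, and the strength parameter `beta` are assumptions for illustration. The summed log-probability gap over the generated words acts as a sample-based estimate of the KL divergence.

```python
def kl_penalized_reward(reward, policy_logprobs, reference_logprobs, beta=0.1):
    """Reward minus beta times the estimated drift from the original model.

    policy_logprobs / reference_logprobs: log-probability each model assigned
    to every word of the generated answer (hypothetical inputs).
    """
    drift = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return reward - beta * drift

# No drift: both models rate the answer the same, so the reward is untouched.
same = [-1.0, -0.5, -2.0]
print(kl_penalized_reward(1.0, same, same))          # 1.0

# Heavy drift: the tuned model loves words the original found unlikely.
tuned = [-0.1, -0.2, -0.1]
original = [-2.0, -1.5, -3.0]
print(kl_penalized_reward(1.0, tuned, original))     # about 0.39
```

A larger `beta` shortens the leash; a smaller one lets the model wander further in pursuit of reward.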

The polished, agreeable voice of your favorite chatbot is the product of exactly this tension: a reward pulling it toward what people like, and a 1951 formula tugging the other way, whispering don't go feral.
