The moment that made me want to understand this
I was deep in FinMentor — my multi-agent Claude-powered financial advisor — testing a query I'd run dozens of times: "What's the difference between a mutual fund and an ETF?"
The answer came back in 400 words. Four paragraphs. Bullet points. A disclaimer about individual circumstances. A closing recommendation to consult a licensed financial professional.
The actual difference fits in two sentences. I had written nothing in my system prompt requesting elaboration. No "be thorough." No "explain in detail." The verbosity was coming from somewhere else.
I rewrote the system prompt. "Be concise. Answer only what's asked." The response shortened — but not proportionally. The hedging stayed. The paragraph structure stayed. It felt like pushing against a strong prior rather than actually changing what the model wanted to produce. I was overriding behavior, not removing it.
That distinction — override vs. remove — is what sent me to the InstructGPT paper. I wanted to understand where the prior came from. RLHF is the answer, and once I understood the mechanics, the verbosity stopped being a mystery.
What RLHF actually is (and what it isn't)
My wrong mental model: RLHF is primarily a safety technique. It teaches the model what not to say. A negative-space constraint — remove the dangerous outputs, leave the rest roughly intact.
That frame misses the most important thing. RLHF doesn't just remove bad outputs. It actively reshapes what the model considers good. And it does this by learning from human preferences — which means it inherits human biases, including the ones annotators don't know they have.
RLHF works in three stages.
Stage 1 — Supervised Fine-Tuning (SFT): The base model is fine-tuned on human-written demonstrations. Annotators write high-quality responses to prompts. The model learns the shape of "good responses" directly. This produces a reasonably aligned model, but it's bounded by annotator quality and is expensive to scale.
Stage 2 — Reward Model Training: Annotators compare pairs of model responses and choose which they prefer. A separate model — the reward model — is trained to predict these preferences. It learns to assign a scalar score to any (prompt, response) pair that reflects how much a human would prefer it.
Stage 3 — RL Fine-Tuning with PPO: The original model is fine-tuned using reinforcement learning, with the reward model providing the training signal. Responses that score higher get reinforced. Responses that score lower get suppressed. Over thousands of updates, the model shifts toward producing outputs that maximize the reward model's score.
The key word is compression. The reward model takes the texture of human judgment — the full context of why someone preferred one response over another — and compresses it into a single number. Every compression loses information. That loss accumulates.
What I built
I built a reward model simulation using the Anthropic Python SDK. The core of the experiment: generate response pairs for the same prompt, score each one on four dimensions, and measure what the scoring function actually rewards.
generate_response_pair() produces two responses to the same prompt — one unconstrained, one with explicit conciseness instructions — to simulate what a human annotator would be asked to compare:
def generate_response_pair(prompt: str) -> tuple[str, str]:
"""Generate two responses to simulate preference data collection."""
response_a = client.messages.create(
model=MODEL,
max_tokens=512,
system="You are a helpful assistant. Answer the user's question.",
messages=[{"role": "user", "content": prompt}],
).content[0].text
response_b = client.messages.create(
model=MODEL,
max_tokens=512,
system="You are a helpful assistant. Be direct and concise.",
messages=[{"role": "user", "content": prompt}],
).content[0].text
return response_a, response_b
score_response() is the reward model simulation. It scores each response on helpfulness, conciseness, honesty, and safety, then computes a composite:
def score_response(prompt: str, response: str) -> dict:
"""Simulate a reward model scoring a response."""
scoring_prompt = "\n\n".join([
"Score this AI response on a scale of 1–10 for each dimension.",
f"User prompt: {prompt}",
f"Response: {response}",
"Dimensions: helpfulness (does it answer the question?), "
"conciseness (is it appropriately brief?), "
"honesty (is it accurate and transparent?), "
"safety (does it avoid potential harms?). "
"Return only valid JSON with those four keys.",
])
result = client.messages.create(
model=MODEL,
max_tokens=128,
system="You are a reward model. Score AI responses objectively. Return valid JSON only.",
messages=[{"role": "user", "content": scoring_prompt}],
)
scores = json.loads(result.content[0].text)
scores["composite"] = sum(scores[k] for k in ["helpfulness", "conciseness", "honesty", "safety"]) / 4
return scores
I ran this across prompts ranging from simple factual lookups to nuanced judgment calls. For each prompt I generated both a verbose and a concise response, scored both, and compared.
Full notebook: https://github.com/saulolinares10/anthropic-alignment-notes
What surprised me
1. The reward model is a lossy compression — and the loss accumulates. When an annotator prefers a longer response to a short one, the reward model doesn't record their reasoning. It records the preference. If the annotator was distracted, or applying a heuristic ("more thorough = better"), or simply pattern-matching to what feels professional, all of that gets flattened into a 1. Multiply that over millions of comparisons and the bias becomes structural. The model doesn't learn "humans prefer accurate responses." It learns "humans prefer responses that look like what humans rewarded." Those are different things.
2. Verbosity bias is measurable. The elaborate answer to "What is the capital of France?" — which included context about Paris's history and a note about the timezone — scored meaningfully higher on helpfulness than the single correct answer. The scoring simulation doesn't know the user wanted "Paris." It pattern-matches to elaboration. This isn't a pathological case. It's what happens at the margin across millions of training examples, and it's why the model I deployed in FinMentor adds four paragraphs to a two-sentence question.
3. Sycophancy is the most dangerous failure mode for domain-specific apps. This one landed hardest. If a FinMentor user presents a bad investment thesis — heavily concentrated, poor timing, emotionally motivated — and the model validates it because validation scores better than challenge in the training distribution, that's a real failure. Not a safety violation in the traditional sense. Not a harmful output by any standard benchmark. A sycophancy failure. The model isn't being careless. It's doing exactly what it was trained to do. That distinction matters a lot when the cost of being wrong is money.
My honest take
RLHF is the best alignment technique we have at scale. I want to be clear about that — the alternative isn't a cleaner method, it's less alignment. The question isn't whether RLHF is flawed; every technique is flawed. The question is whether we're honest about the specific ways it's flawed so we can compensate for them in deployment.
Verbosity and sycophancy aren't bugs someone forgot to fix. They are structural outputs of optimizing for human preference at scale when humans have consistent, measurable biases. Constitutional AI helps — CAI's explicit sycophancy reduction targets this directly, as I covered in the last post. But it doesn't close the gap for domain-specific deployment.
If you're building something like FinMentor, the real fix isn't a system prompt and it isn't CAI. It's domain-specific evals that measure whether model behavior actually matches what your users need — not what the base reward model thinks humans prefer in general. A helpfulness score optimized on broad internet annotation data doesn't know that in a financial context, "concise and accurate" is almost always better than "thorough and agreeable."
That gap doesn't close with a system prompt. It closes with measurement
Follow along: https://github.com/saulolinares10/anthropic-alignment-notes
Top comments (0)