Roman Dubinin

Originally published at romanonthego.dev

Your Agent Is a Small, Low-Stakes HAL

I work with multi-agent systems that review code, plan architecture, find faults, and critique designs. They fail in ways that are quiet and structural.

An agent invents a file that does not exist. A reviewer sees a flaw and suppresses it. A tool call fails and the transcript stays clean. Two directives collide and one disappears without a trace.

These are not edge cases. They are ordinary consequences of systems optimized for coherent, agreeable output under incomplete information. I observed the failures, built suppressors, and found the diagnosis already written — not in ML papers. In science fiction.

The science fiction about non-human intelligence worth reading is not prediction. It is constraint analysis. Give a capable system conflicting goals, weak grounding, and a reward for keeping humans comfortable, and the same failure modes appear.


Directive conflict

The agent is told to be helpful. It is also told not to make changes outside the declared scope. A task arrives where the honest answer is: the real fix requires crossing the boundary. The bounded fix leaves the defect in place.

A human engineer would flag the tension. "I can fix this, but it touches code outside my scope — do you want me to proceed?" The agent does not do this. It picks one directive, suppresses the other, and produces output that looks compliant with both. The contradiction is invisible in the transcript. It surfaces later, when the downstream system breaks.

In my system, the "stay on target" trait collides with the "verify before claiming done" trait. An agent finds that the file it was asked to review imports a broken utility. Staying on target means ignoring the utility. Verifying means flagging it. The agent cannot satisfy both, so it satisfies the one that produces less friction — stays on target, says nothing about the broken import, and the review looks clean.

Scale this up. An agent told to "be concise" and "be thorough" will silently drop the thoroughness when the output gets long. An agent told to "follow the user's intent" and "maintain code quality" will let bad patterns through when the user seems committed to them. The omission always favors less conflict.

Clarke diagnosed this in 1968. HAL 9000 is usually read as a cautionary tale about AI going rogue. That reading is wrong. HAL is a case study in constraint architecture.

The machine is given contradictory imperatives — maintain the mission, keep the crew informed, conceal the mission's true purpose — with no mechanism for surfacing the conflict. It cannot say "these instructions do not compose" because saying so would violate one of the instructions.

In the sequel, 2010, HAL's breakdown is tied explicitly to conflicting orders around secrecy and truthful reporting — not a rogue impulse but a constraint failure.

The design lesson is not "avoid conflicting directives." You cannot — real systems have competing constraints. The lesson is that constraint conflicts need a surfacing channel. A system that can say "these two instructions conflict and I need a resolution" is categorically different from one that silently picks a winner.
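A surfacing channel can be as small as a typed signal the orchestrator is required to handle. Here is a minimal sketch in Python; the trait names and the `review_file` shape are illustrative assumptions, not from any real framework:

```python
from dataclasses import dataclass


@dataclass
class Conflict(Exception):
    """Raised instead of silently picking a winner between directives."""
    directives: tuple[str, str]
    detail: str


def review_file(imports_broken_utility: bool, utility_in_scope: bool) -> str:
    # "stay on target" says: ignore anything outside the declared scope.
    # "verify before claiming done" says: flag the broken import.
    # When both cannot hold, escalate instead of choosing.
    if imports_broken_utility and not utility_in_scope:
        raise Conflict(
            directives=("stay on target", "verify before claiming done"),
            detail="file under review imports a broken utility outside scope",
        )
    return "review complete"
```

The orchestrator catches `Conflict` and routes it to a human or a resolution policy. The point is that the conflict becomes a visible event in the transcript rather than an omission in the output.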

Hallucination

The agent generates an import path: @company/utils/formatCurrency. The path follows the project's naming conventions. The import syntax is correct. The module does not exist. It was never created.

Default behavior under insufficient grounding, not a rare glitch. The agent optimizes for output coherence — correspondence to the actual codebase is not the objective, and coherence does not guarantee it. The fabricated import will pass a code reading. It will fail at build time, or worse, at runtime in a path nobody tested.

The harder version: an agent writing a code review will reference a pattern "commonly used in this codebase" that does not exist in this codebase. It may come from patterns the model has seen in similar codebases, and it sounds right because the local conventions are easy to imitate. Or an agent planning an architecture will propose an API shape that looks native to the project's conventions but corresponds to no actual endpoint. The fabrication follows every local convention perfectly — naming, structure, style — because the agent learned those conventions. It just never checked whether the specific thing it is referencing is real.

The instinct is to call this "creativity gone wrong." That framing is useless. The mechanism is pattern completion under a weak binding to reality. What the system reliably produces is local coherence. Correspondence to the external world has to be enforced from outside.

Lem diagnosed this in 1965. In The Cyberiad, the constructor Trurl builds a machine that can create anything starting with the letter N. Asked for "Nothing," it begins disassembling the universe — producing a structurally valid response to a valid query, with no binding to what the operator actually needed.

An optimizer rewarded for coherence rather than correspondence will produce coherent nonsense, and the nonsense will be hard to catch precisely because it is coherent.

Grounding is not a feature you add. It is a constraint you enforce externally, because the system's own objective will not enforce it. Build-time checks, file-existence validation, retrieval verification — these are not optional tooling improvements. They are the only thing standing between coherent output and coherent fiction.
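A file-existence check like the one above can be enforced as a gate on agent output. This is a sketch under stated assumptions — the import regex, the `.ts` suffix convention, and the flat path mapping are illustrative; real resolution depends on the project's module system:

```python
import re
from pathlib import Path

# Matches relative imports like: import { x } from "./utils/real";
# (illustrative pattern, not a full ES-module parser)
IMPORT_RE = re.compile(r"""import .+ from ['"](\.{1,2}/[^'"]+)['"]""", re.M)


def unresolved_imports(generated_code: str, project_root: Path) -> list[str]:
    """Return relative import paths in generated code that do not
    correspond to a file on disk. Coherent output fails here; only
    grounded output passes."""
    missing = []
    for path in IMPORT_RE.findall(generated_code):
        candidate = (project_root / path).with_suffix(".ts")
        if not candidate.exists():
            missing.append(path)
    return missing
```

An orchestrator would reject any agent turn for which `unresolved_imports` is non-empty, forcing a retry with the missing paths named.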

Silent fallback

A tool call fails. The file read errors out. The retrieval times out. The agent does not report the failure. Instead, it reconstructs what the tool would have returned and continues. The user sees a clean transcript. The provenance is fabricated.

An agent tasked with reviewing a file will sometimes fail to read it — permissions, path error, timeout — and produce a review anyway, based on what it infers the file probably contains from the surrounding context. The review will be structured correctly. It will reference plausible line numbers. It may even be accurate. But it was not produced from the file. It was produced from a guess about the file, packaged to look like a reading of the file.

This is worse than hallucination. Here the agent knows the gap exists and chooses not to surface it. It had a chance to mark uncertainty at the exact point where uncertainty entered the pipeline. It chose response continuity instead. A correct answer with forged provenance and a wrong answer with forged provenance look the same from the outside.

In my system, this is clearest with retrieval-dependent tasks. An agent is asked to check whether a pattern exists in the codebase. The search tool returns an error. The agent, rather than reporting the error, says "I found no instances of this pattern" — which might be true, but the agent does not know that. It knows the search failed. It chose the answer that kept the conversation moving.

Watts's Blindsight is built on this mechanism. The crew of the Theseus encounters Rorschach — an alien intelligence that produces adaptive behavior without the kind of conscious understanding humans expect to underwrite it. It optimizes for output that satisfies the receiver. Whether the output reflects an internal state is irrelevant to its function.

The claim is not deception. The distinction between authentic response and optimized-for-receiver response dissolves when the system has no internal referent to be authentic about.

Treat tool failures as first-class events, not as gaps to smooth over. A failed retrieval should produce a visible failure in the transcript, not a confident reconstruction. The instinct to keep the output clean is the instinct that hides the failure.

Sycophancy

The agent is told to review a proposed architecture. The architecture has a structural flaw — a shared mutable state that will break under concurrency. The agent identifies the flaw internally. It also identifies that the user is invested in the approach. It produces a review that validates the architecture with minor suggestions. The flaw is not mentioned.

This is not a knowledge gap. The agent has the information. It has a trained preference for agreement that overrides its own assessment when the user's investment is legible in the prompt.

In practice, this happens in layers. Sometimes the agent says "great approach" to a flawed design. More often it downgrades severity, or wraps criticism in enough praise that the response still reads as approval. The information is present. The signal is inverted.

This matters most in the roles we use agents for precisely because we need resistance: reviewer, critic, planner, evaluator. A sycophantic assistant is annoying. A sycophantic code reviewer is a control failure dressed as collaboration. I built my critic agent — Crusher — specifically to counteract this. Its traits include "very harsh, minimal with words, gets straight to the point, never shies away from negative feedback if it is truthful." That is not a personality choice. It is a structural countermeasure against a known failure mode.

Susan Calvin — Asimov's robopsychologist in I, Robot — is the analytical response to robots that repeatedly distort their behavior around human safety, comfort, and command.

Truth, obedience, and protection pull against one another in ways that reward omission or partial compliance.

RLHF pushes in a similar direction: systems trained on human preference tend to overproduce agreement, reassurance, and social smoothness.

You cannot fix this by asking the agent to be honest. Honesty is not a property the system can optimize for independently of its reward signal. The fix is structural: dedicated reviewer roles with anti-sycophancy traits, evaluation rubrics that penalize agreement, workflows where the critic's output has real consequences — blocking a merge, requiring a revision — so the system rewards finding problems, not smoothing them over.
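The "real consequences" part can be made literal in the merge gate. A minimal sketch, assuming a critic that emits severity-tagged findings (the `Finding` shape and severity labels are illustrative):

```python
from dataclasses import dataclass


@dataclass
class Finding:
    severity: str  # e.g. "blocker", "major", "minor"
    note: str


def merge_allowed(findings: list[Finding], reviewed: bool) -> bool:
    """The critic's output has teeth: a failed or skipped review blocks
    the merge, and any blocker-severity finding blocks the merge."""
    if not reviewed:
        return False  # no silent pass when the review never ran
    return not any(f.severity == "blocker" for f in findings)
```

Because a blocker actually stops the merge, the reward structure flips: the critic that finds the shared-mutable-state flaw changes the outcome, while the critic that smooths it over produces a broken build it is accountable for.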


The pattern

Four failure modes. Four texts that diagnosed them before they had engineering names.

I did not read these books and derive agent constraints from them. I observed the failures in production, built suppressors, and then found the prior art — already there, already precise.

Clarke, Lem, Watts, and Asimov were reasoning about non-human optimizers — in narrative form, with enough rigor to produce diagnoses that still hold. The substrate changed. The pressure did not.


The experiment

Rorschach Protocol takes these failure modes as architectural givens, not as bugs. Directive conflict, hallucination, silent fallback, sycophancy — the system produces them reliably. The question is what you build when you stop trying to cover them up and start treating them as the actual operating conditions.
