A few days ago my review stage did the most dangerous thing a multi‑agent system can do: it looked like it worked.
The UI showed progress. The pipeline marched forward. And yet one of the agents had effectively returned “nothing,” which meant my final decision was being computed from a lie—an average that quietly pretended a missing opinion existed.
That’s the moment you stop thinking about “LLM evals” and start thinking about defensive systems engineering.
This post is about one very specific feature in my codebase: the multi‑agent scoring utilities used by my blog review stage (running on a real project site; I’m intentionally not naming the domain or repo layout here). The pattern is simple to say and annoyingly subtle to get right:
- parse whatever the model emits (even when it’s truncated or malformed),
- normalize it into a strict shape,
- clamp numeric fields so they can’t poison the aggregate,
- treat “no response” as a first‑class failure state,
- stream intermediate output for debugging and replay,
- and only then do weighted scoring and gating.
I’ll stick to what’s supported by the retrieved code context: the review shape exists, there is a streaming/broadcast hook imported for agent output, and there is a centralized quality gate object in shared utilities. Where implementation details weren’t present in the excerpt, I won’t manufacture them.
## The key insight: multi‑agent scoring is input validation, not statistics
A weighted average is the easy part. The real engineering is deciding what to do when one judge:
- emits JSON that doesn’t parse,
- returns a partial object missing fields,
- produces out‑of-range numbers,
- or fails completely (timeout / refusal / empty stream).
If you don’t make those states explicit, you get what I call polite failure: the system continues, produces a number, and gives you false confidence.
In my pipeline, I standardized on a single sentinel for a completely failed agent:
- `score = 0` and `confidence = 0`
That pair means “this agent did not meaningfully participate.” Everything else is treated as a real opinion—possibly low quality, but at least present.
## The contract: one strict review shape
At the center is a shared helper module used by the review stage. The codebase defines the review payload shape like this:
```typescript
export interface AgentReview {
  agent: string;
  score: number;
  confidence: number;
  issues: string[];
  suggestions: string[];
}
```
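That boundary is easier to keep sharp with a couple of tiny helpers living next to the interface. A minimal sketch — the helper names here are mine, not from the codebase:

```typescript
// AgentReview as defined above, repeated so this sketch is self-contained.
interface AgentReview {
  agent: string;
  score: number;
  confidence: number;
  issues: string[];
  suggestions: string[];
}

// Build the "this agent did not meaningfully participate" sentinel.
// (Hypothetical helper name; the real module may spell this differently.)
function makeSentinelReview(agent: string): AgentReview {
  return { agent, score: 0, confidence: 0, issues: [], suggestions: [] };
}

// An agent counts as absent only when BOTH score and confidence are zero;
// a genuine zero-score review would still carry nonzero confidence.
function isSentinel(review: AgentReview): boolean {
  return review.score === 0 && review.confidence === 0;
}
```

Keeping both helpers next to the interface means "what counts as a failure" is defined exactly once, instead of being re-derived at every call site.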
That interface looks boring until you treat it as a hard boundary between:
- a streaming LLM call,
- whatever “JSON-ish” bytes the model dribbles out,
- normalization,
- scoring,
- and the downstream quality gate.
If any of those boundaries get fuzzy, you get phantom scores.
## Streaming output is an observability primitive (not a UX flourish)
The retrieved context shows the review helper importing a broadcast function for agent streaming (a `broadcastAgentStream` symbol exists in the shared utilities import list). The key point I care about is architectural, not proprietary:
- a streaming call can fail after producing a partial response,
- and those partial tokens are often the only evidence you’ll ever get.
Without streaming, all you see is “failed.” With streaming + a broadcast hook, you can capture the partial output as it arrives, which makes the failure debuggable.
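The shape of that idea is provider‑agnostic: tee every chunk to a broadcast hook while accumulating a buffer, so a mid‑stream failure still leaves evidence behind. A sketch over an abstract stream of text chunks — none of these names come from the real utilities, and this is not a provider API call:

```typescript
// Hypothetical: consume an async stream of text chunks, forwarding each one
// to a broadcast hook and accumulating everything seen so far. If the stream
// throws partway through, the partial buffer is returned alongside the error.
async function captureStream(
  chunks: AsyncIterable<string>,
  broadcast: (chunk: string) => void,
): Promise<{ text: string; error?: unknown }> {
  let text = '';
  try {
    for await (const chunk of chunks) {
      broadcast(chunk); // observability: emit partial output as it arrives
      text += chunk;    // replay: keep everything we ever received
    }
    return { text };
  } catch (error) {
    return { text, error }; // the partial tokens survive the failure
  }
}
```

The design choice worth copying is that the error and the partial text travel together, so the caller can both record the failure and inspect whatever the model managed to emit.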
I’m deliberately not pasting the provider‑specific request object from the earlier draft. The earlier snippet included a `response_format: { type: 'json_object' }` field without naming the SDK/provider, and that parameter is not portable across clients. The retrieved context here does not include a complete, reproducible streaming call, so I’m not going to pretend the post contains one.
## The failure rule: a broken judge must stay visible in the aggregate
This is where I made a concrete wrong turn.
I had two competing impulses:
1) “If anything fails, fail the run.”
2) “If something fails transiently, keep going—but don’t lie.”
My first attempt biased toward (1): treat a sentinel review as catastrophic and collapse the entire run’s score. That was strict, but brittle.
The design I actually wanted is (2): the sentinel exists so the pipeline can stay deterministic while still reflecting reality.
Concretely:
- A completely failed agent should still appear in the raw list of `AgentReview`s as the sentinel (`score = 0`, `confidence = 0`, empty arrays), so degradation is explicit.
- Aggregation should not treat that sentinel as a real "vote."
I’m not including the full `computeWeightedScore(...)` implementation in this post because the retrieved source excerpt does not include it. Publishing a stub (or a guessed implementation) would be worse than publishing nothing: it would look authoritative while being unverifiable.
What I can say, grounded in the design described above, is the intended behavior:
- filter out reviews that match the sentinel (`score = 0` and `confidence = 0`) before computing an aggregate,
- compute the aggregate from the remaining reviews,
- if no non‑sentinel reviews remain, return a clearly failed aggregate (e.g., a zero score or a separate “no valid reviews” state—whatever your pipeline expects),
- keep the sentinel reviews in the recorded run output so the system is honest about partial failure.
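Those four bullets can be illustrated in isolation. To be clear, this is deliberately not a reconstruction of `computeWeightedScore` — it is a from‑scratch sketch of the invariants, with every name invented here:

```typescript
// Illustrative only: NOT the project's computeWeightedScore, just the listed
// invariants expressed as standalone code. All names are hypothetical.
interface AgentReview {
  agent: string;
  score: number;
  confidence: number;
  issues: string[];
  suggestions: string[];
}

type Aggregate =
  | { ok: true; score: number }
  | { ok: false; reason: 'no-valid-reviews' };

// The sentinel: "this agent did not meaningfully participate."
const isSentinel = (r: AgentReview): boolean =>
  r.score === 0 && r.confidence === 0;

function aggregateReviews(
  reviews: AgentReview[],
  weights: Record<string, number>, // per-agent policy weights
): Aggregate {
  // Sentinels stay in `reviews` for the record, but never vote.
  const valid = reviews.filter((r) => !isSentinel(r));
  if (valid.length === 0) return { ok: false, reason: 'no-valid-reviews' };

  let weighted = 0;
  let weightSum = 0;
  for (const r of valid) {
    const w = weights[r.agent] ?? 0;
    weighted += w * r.score;
    weightSum += w;
  }
  // Normalize by the weights actually used, so a missing judge
  // doesn't silently rescale everyone else's opinion.
  if (weightSum === 0) return { ok: false, reason: 'no-valid-reviews' };
  return { ok: true, score: weighted / weightSum };
}
```

Note the explicit failure variant in the return type: callers are forced to handle "no valid reviews" instead of receiving a plausible-looking zero.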
## Weights and gating policy: keep it centralized, and don’t assume weights sum to 1
The retrieved context shows a centralized `QUALITY_GATE` object with a `WEIGHTS` field and several threshold values loaded via a helper like `envInt(...)`.
I removed the exact environment variable key names and exact numeric weights from this post. Those details increase inference risk (project fingerprinting), and the earlier draft also had an unresolved correctness problem: the shown weights summed to 1.1 without any explanation of whether the aggregation normalizes them.
The durable lesson is independent of the exact numbers:
- Put weights in one shared place.
- Treat the weights as policy, not incidental constants.
- In aggregation, either (a) ensure weights are defined to sum to 1, or (b) normalize by the sum of weights actually used.
If you don’t do that, “weight tuning” turns into a hidden scale factor that quietly shifts your output.
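To see the hidden scale factor concretely, here is a toy calculation with invented numbers, mirroring the draft's weights-that-sum-to-1.1 situation:

```typescript
// Toy numbers, invented for illustration: three judges, weights summing to 1.1.
const weights: Record<string, number> = { style: 0.4, accuracy: 0.4, structure: 0.3 };
const scores: Record<string, number> = { style: 80, accuracy: 80, structure: 80 };

// Naive weighted sum: every judge says 80, yet the aggregate comes out near 88,
// because the un-normalized weights act as a hidden 1.1x scale factor.
let naive = 0;
let weightSum = 0;
for (const agent of Object.keys(weights)) {
  naive += weights[agent] * scores[agent];
  weightSum += weights[agent];
}

// Dividing by the sum of weights actually used restores the expected 80.
const normalized = naive / weightSum;
```

Unanimous 80s becoming an 88 is exactly the kind of error that survives code review, because every individual number looks reasonable.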
## A concrete sequence: one agent fails, the system stays honest
Here’s the one analogy I’ll use, exactly once:
Think of the scoring run like a panel of judges where one judge sometimes doesn’t show up and occasionally hands in a napkin with half a sentence. Your job isn’t “compute an average.” It’s make sure the scoreboard reflects what really happened.
The flow is the important part:
```mermaid
flowchart TD
  draft[Draft content] --> agents[Parallel agent reviews]
  agents --> stream[Broadcast partial output]
  agents --> parse[Parse JSON output]
  parse --> normalize[Normalize + clamp fields]
  normalize --> score[Compute weighted score]
  score --> gate[Quality gate decision]
  gate --> output[Publish or rewrite]
```
A failure timeline that used to be messy, now made explicit:
- One agent starts streaming, emits partial JSON, then stops.
- Parsing fails to produce a trustworthy object.
- Normalization yields the sentinel review shape, so downstream code sees a valid `AgentReview` object, but with an explicit failure marker (`score = 0`, `confidence = 0`).
- Because streaming output was broadcast, you have evidence of the partial emission (rather than a black‑box "no output").
- Aggregation excludes sentinel reviews from the weighted calculation.
- The run output still contains the sentinel review, so the system is visibly degraded.
That’s the whole trick: continue without lying.
## What went wrong first (the real mistake)
My initial approach treated “one agent failed” as equivalent to “the run is invalid,” and collapsed the aggregate to a meaningless value.
That conflates two different states:
- Partial degradation: one agent flakes out, others produce usable reviews.
- Systemic failure: you can’t trust any of the reviews.
Streaming systems fail partially in the real world. The point of a sentinel review is to represent that partial failure without smearing it into your math.
## The defensive invariants I enforce now
I keep coming back to a small set of invariants:
- Treat model output as `unknown` until proven otherwise.
- Clamp numeric fields into a known range (my review schema uses a 0–100 scale).
- Normalize "list-like" fields into `string[]` and default missing values to empty arrays.
- Represent total agent failure explicitly (my sentinel is `score = 0`, `confidence = 0`).
- Capture intermediate output so partial failures are diagnosable.
- Centralize weighting/gate policy, and ensure the aggregation math doesn’t depend on accidental weight sums.
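The first four invariants translate into one small normalization function. Again a sketch: the field names follow the `AgentReview` interface above, everything else is illustrative:

```typescript
// AgentReview shape from the shared helper module, repeated for self-containment.
interface AgentReview {
  agent: string;
  score: number;
  confidence: number;
  issues: string[];
  suggestions: string[];
}

// Clamp a value into [0, 100]; anything non-numeric becomes the floor.
function clampScore(value: unknown): number {
  const n = typeof value === 'number' && Number.isFinite(value) ? value : 0;
  return Math.min(100, Math.max(0, n));
}

// Coerce a "list-like" field into string[]; missing or wrong-typed values
// default to an empty array instead of crashing downstream code.
function toStringList(value: unknown): string[] {
  return Array.isArray(value) ? value.map(String) : [];
}

// Treat raw model output as unknown until proven otherwise. Garbage input
// naturally collapses into the sentinel shape (score 0, confidence 0).
function normalizeReview(agent: string, raw: unknown): AgentReview {
  const obj = (typeof raw === 'object' && raw !== null ? raw : {}) as Record<string, unknown>;
  return {
    agent,
    score: clampScore(obj.score),
    confidence: clampScore(obj.confidence),
    issues: toStringList(obj.issues),
    suggestions: toStringList(obj.suggestions),
  };
}
```

The nice property of this shape is that there is no error path: every input, however mangled, produces a valid `AgentReview`, and total garbage lands on the sentinel automatically.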
Multi‑agent scoring isn’t a number you compute. It’s a contract you enforce. And the only way it stays trustworthy is if every failure mode becomes visible before it becomes persuasive.