I didn’t add per‑category OOD thresholds because it was academically elegant.
I added them because my baseline runs were telling me the same story over and over: some prompt categories were systematically getting mis-gated by a single global uncertainty threshold. When that happens, you don’t just waste compute—you route the wrong jobs into the wrong generation strategy, and your downstream scoring starts making “confident” decisions on top of the wrong substrate.
Phase 2 calibration in this codebase is two linked moves:
1) Per‑category OOD thresholds in lib/ood-detector.ts, derived from a threshold sweep analysis and recorded with provenance (threshold_source).
2) Reward normalization fixes in lib/reward-mixer.ts, where candidate scoring is stabilized with GRPO‑style group-relative normalization and per-head z‑score normalization (added explicitly as a Phase 2 calibration fix).
The non-obvious part is that these two changes are coupled: if your gate shifts the distribution of candidates reaching your scorer, then any reward fusion that isn’t scale-stable will swing wildly. So Phase 2 treats gating and scoring as one calibration surface.
Key insight (the thing that made Phase 2 click)
A single global OOD threshold assumes one prompt distribution.
But my own comments in lib/ood-detector.ts make the real constraint explicit:
- The detector is built to measure “how far an incoming prompt is from the known in-distribution corpus.”
- “High epistemic uncertainty → bypass surrogate think-frames and go straight to full render.”
- Phase 2: “Per-category thresholds derived from threshold sweep analysis.”
- And the punchline: “SCENIC and ACTION have lower uncertainty but higher FN rates, so they get tighter (lower) thresholds than the global default.”
That last line is the calibration reality check. If a category naturally produces lower uncertainty values but still has a high false-negative rate under the gate, then a global threshold is the wrong instrument. This practical consequence — that thresholds often need to be chosen per slice rather than globally — is consistent with prior work showing OOD behavior and needed detection thresholds can depend strongly on the in-distribution subpopulation and label/usage slice (see e.g. foundational OOD detection literature).
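The failure mode is easy to show with invented numbers. The sketch below (synthetic data, not rows from the baseline CSV; `fnRateBySlice` is a hypothetical helper, not repo code) computes the false-negative rate per category under one global threshold, where "false negative" means the gate kept the surrogate path but the run actually failed:

```typescript
// Synthetic illustration: two categories with different uncertainty
// distributions, gated by a single global threshold.

interface Run {
  category: string
  epistemicUncertainty: number
  failed: boolean // ground truth from baseline scoring
}

// Among runs the gate did NOT bypass (uncertainty <= threshold),
// what fraction actually failed, per category?
function fnRateBySlice(runs: Run[], threshold: number): Map<string, number> {
  const kept = new Map<string, { fn: number; total: number }>()
  for (const r of runs) {
    if (r.epistemicUncertainty > threshold) continue // bypassed, not a gate miss
    const s = kept.get(r.category) ?? { fn: 0, total: 0 }
    s.total += 1
    if (r.failed) s.fn += 1
    kept.set(r.category, s)
  }
  const out = new Map<string, number>()
  for (const [cat, s] of kept) out.set(cat, s.fn / s.total)
  return out
}

// A SCENIC-like slice: low uncertainty values, but failures still slip through.
const runs: Run[] = [
  { category: "SCENIC", epistemicUncertainty: 0.2, failed: true },
  { category: "SCENIC", epistemicUncertainty: 0.25, failed: false },
  { category: "ABSTRACT", epistemicUncertainty: 0.7, failed: true }, // bypassed
  { category: "ABSTRACT", epistemicUncertainty: 0.3, failed: false },
]
console.log(fnRateBySlice(runs, 0.5)) // SCENIC has a 50% FN rate despite "low" uncertainty
```

The global threshold of 0.5 never fires for SCENIC, so its failures sail through the gate; that is exactly the situation a tighter per-category threshold addresses.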
At the same time, reward fusion was already multi-signal and MR‑GRPO inspired:
“Replaces single-dimensional CLIP-only candidate scoring with a weighted multi-signal scoring system. Candidates are evaluated across 5 independent reward signals and scored using GRPO-style group-relative normalization.” (`lib/reward-mixer.ts`)
Phase 2 extends that idea: normalization isn’t just a nice-to-have; it’s how I keep multi-signal scoring from becoming a hostage to whichever signal’s scale drifts after routing changes. This mirrors the calibration/normalization concerns in classification systems where per-class or per-slice calibration is needed to make downstream decisions robust and comparable across groups.
How categories are chosen: taxonomy, not clustering
In this repo, category selection is documented as rule-derived, not discovered.
In lib/scene-compiler/router.ts, the Phase 1 baseline-derived routing rules are introduced with a clear framing:
- “Prompt Category Classification (Phase 1 baseline-derived routing rules)”
- “Categories derived from phase1-baseline category analysis (2026-02-21).”
That’s taxonomy: a set of named categories derived from baseline analysis and then used for routing and calibration. There’s no evidence in the retrieved context of embedding clustering, k-means, or any unsupervised grouping step. So in this system, categories are a deliberate label set that sits alongside routing.
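To make "taxonomy, not clustering" concrete: the actual Phase 1 baseline-derived rules in lib/scene-compiler/router.ts are not shown in the retrieved context, so the rule set below is entirely hypothetical. The point is only the shape: a fixed, named label set applied by deterministic rules, with a deliberate fallback, rather than clusters discovered from embeddings.

```typescript
// Hypothetical rule-based classifier. Category names mirror those
// mentioned in the repo comments; the patterns are invented.

type PromptCategory = "SCENIC" | "ACTION" | "ABSTRACT" | "GENERAL"

const RULES: Array<{ category: PromptCategory; pattern: RegExp }> = [
  { category: "SCENIC", pattern: /\b(landscape|vista|skyline|horizon)\b/i },
  { category: "ACTION", pattern: /\b(running|explosion|chase|fight)\b/i },
  { category: "ABSTRACT", pattern: /\b(abstract|surreal|geometric)\b/i },
]

function classifyPrompt(prompt: string): PromptCategory {
  for (const rule of RULES) {
    if (rule.pattern.test(prompt)) return rule.category
  }
  return "GENERAL" // a named fallback label, not an "unknown cluster"
}

console.log(classifyPrompt("a foggy mountain vista at dawn")) // SCENIC
```

Because the categories are a closed, named set, every downstream table (thresholds, telemetry, baseline stats) can be keyed by the same labels without a clustering model in the loop.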
The OOD evaluator is also wired to accept routing context explicitly. A Phase 2 change added:
```typescript
// lib/ood-detector.ts (signature excerpt from diff)
export async function evaluateOOD(
  compiledPrompt: string,
  contractId: string,
  sceneIndex?: number,
  routingContext?: { prompt_category?: string; routed_model?: string; phase?: number },
): Promise<OODResult> {
  // (function body exists in repo; not fully shown in retrieved context)
}
```
What surprised me here is how much leverage you get just by threading prompt_category through the OOD evaluation call path; it turns a black-box “uncertainty number” into something you can calibrate and audit per slice.
What statistics are collected per-category (and what I can actually prove)
The retrieved context is explicit about what gets logged for each run in the Phase 1 baseline output, and it’s more than enough to support Phase 2 calibration.
The file outputs/whitepaper/phase1-baseline/all_contracts_scored.csv is introduced with a header row that includes:
- run fields: `epistemic_uncertainty`, `bypassed`, `cost_incurred`, `prompt_category`, `routed_model`, `phase`, `hitl_trigger`, `composite_score`
- reward components: `r_narrative`, `r_motion`, `r_visual_drift`, `r_color`, `r_composition`
- false-negative tracking: `is_false_negative`, `fn_trigger`, `fn_score`, `fn_cost`
That’s the calibration substrate.
From the retrieved context, I can’t truthfully claim the code computes specific per-category moments (mean/std) or quantiles, because the actual threshold-sweep implementation isn’t shown. What I can ground is:
- Phase 2 thresholds are “derived from threshold sweep analysis.” (`lib/ood-detector.ts` comment)
- Thresholds are now tracked with provenance: `threshold_source: 'global' | category name` is described in `OODResult`.
- Phase 2 added `effective_threshold` and `threshold_source` to the OOD result structure (these fields exist in the interface excerpt).
So the system is set up to support per-category summary statistics, but the retrieved context only proves the existence of the baseline dataset and the per-category thresholding behavior, not the exact estimator.
The estimator used for thresholds: what’s stated vs. what’s not
Here’s the hard boundary:
- The repo states: “Per-category thresholds derived from threshold sweep analysis.”
- The repo also states: some categories “get tighter (lower) thresholds than the global default.”
But the retrieved context does not show the sweep algorithm, whether it’s percentile-based, Gaussian-tail modeled, or something else.
So I’m going to describe the estimator only at the level the repo supports: a sweep that chooses thresholds per category based on baseline analysis, with explicit handling for false negatives (because is_false_negative and fn_* columns exist).
If you want the exact math, it needs to come from additional retrieved code or docs beyond what’s included here.
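That said, the general shape of such a sweep is worth sketching. The version below is one plausible estimator under stated assumptions, not the repo's algorithm: grid-search candidate thresholds for one category over baseline rows, and pick the highest threshold whose false-negative rate stays within a budget, so the gate bypasses to full renders as rarely as possible while respecting the FN constraint. `sweepThreshold` and the budget parameter are hypothetical names.

```typescript
// Hypothetical threshold sweep for a single category's baseline rows.
// Field names mirror the baseline CSV columns; the selection rule
// (max threshold subject to an FN-rate budget) is an assumption.

interface BaselineRow {
  epistemic_uncertainty: number
  is_false_negative: boolean
}

function sweepThreshold(
  rows: BaselineRow[],
  candidates: number[],
  maxFnRate: number,
): number | null {
  let best: number | null = null
  for (const t of [...candidates].sort((a, b) => a - b)) {
    const kept = rows.filter((r) => r.epistemic_uncertainty <= t)
    if (kept.length === 0) continue // threshold keeps nothing; no evidence
    const fnRate = kept.filter((r) => r.is_false_negative).length / kept.length
    if (fnRate <= maxFnRate) best = t // highest passing threshold wins
  }
  return best
}

const rows: BaselineRow[] = [
  { epistemic_uncertainty: 0.1, is_false_negative: false },
  { epistemic_uncertainty: 0.2, is_false_negative: false },
  { epistemic_uncertainty: 0.3, is_false_negative: true },
  { epistemic_uncertainty: 0.4, is_false_negative: true },
]
// With a 10% FN budget, the sweep tightens the threshold below 0.3.
console.log(sweepThreshold(rows, [0.1, 0.2, 0.3, 0.4, 0.5], 0.1))
```

This is also where the tighter-threshold behavior for SCENIC and ACTION falls out naturally: a category whose failures hide at low uncertainty forces the budget to bind earlier.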
Runtime gating: effective threshold + provenance + GPU error segmentation
Phase 2 didn’t just change the thresholding; it tightened operational accounting.
In lib/ood-detector.ts, OODResult includes:
- `effective_threshold`
- `threshold_source` (`'global'` | category name)
- and later, an added `ood_event_id` “used for gpu_error marking.”
That last addition is important because the calibration loop is only as good as the integrity of its measurement. If GPU failures pollute the dataset, you’ll calibrate to noise.
The telemetry harness was extended accordingly. In lib/telemetry-harness.ts, the logOODEvent function now accepts:
- `prompt_category?: string`
- `routed_model?: string`
- `phase?: number`
- `gpu_error?: boolean`
- `superseded?: boolean`
- plus a note: “Which threshold was used: 'global' or a categ…” (truncated in retrieved context, but clearly intended)
And the dashboard report interface gained:
gpuErrors?: { count: number; supersededCount: number }
Those fields are the scaffolding for keeping calibration runs honest: you can segment out GPU failures, and you can supersede old events when rerunning.
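The segmentation those fields enable is simple to express. The field names below mirror the ones described for `logOODEvent`; the filtering policy itself (`calibrationRows`) is my assumption about how you would use them, not code from the repo:

```typescript
// Sketch: keep only events fit to feed a calibration sweep — no known
// GPU failures, and no events a rerun has superseded.

interface OODEvent {
  ood_event_id: string
  prompt_category?: string
  gpu_error?: boolean
  superseded?: boolean
}

function calibrationRows(events: OODEvent[]): OODEvent[] {
  return events.filter((e) => !e.gpu_error && !e.superseded)
}

const events: OODEvent[] = [
  { ood_event_id: "a", prompt_category: "SCENIC" },
  { ood_event_id: "b", gpu_error: true },      // infra failure, not signal
  { ood_event_id: "c", superseded: true },     // replaced by a rerun
]
console.log(calibrationRows(events).length) // 1
```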
Reward normalization: the exact pieces I can show
The reward mixer is explicitly described as:
- “Multi-Signal Reward Mixer (MR-GRPO inspired)”
- “weighted multi-signal scoring system”
- “GRPO-style group-relative normalization”
And the test suite imports the exact functions involved:
- `normalizeGroupRelative`
- `computeCompositeScore`
- `rankCandidates`
- plus Phase 2 additions: `identifyWeakSignalsPerHead`, `normalizeHeadScore`, `mapSignalToTriggerHead`, `computeSubReason`
That import list is unusually revealing: it tells you the normalization is a first-class primitive (not buried inside composite scoring), and it tells you Phase 2 added per-head normalization and HITL-related reasoning helpers.
Here’s a runnable excerpt that mirrors what the tests demonstrate exists (imports and types), without inventing internal logic.
```typescript
// scripts/reward-normalization-demo.ts
// This file is runnable TypeScript, but it intentionally does not reimplement
// repo internals that are not present in the retrieved context.
import {
  normalizeGroupRelative,
  computeCompositeScore,
  rankCandidates,
  DEFAULT_REWARD_WEIGHTS,
  type RewardSignals,
} from "../lib/reward-mixer"

function demo() {
  // Minimal signals shaped exactly like the repo type name suggests.
  // (The full RewardSignals interface is defined in lib/reward-mixer.ts.)
  const candidates: RewardSignals[] = [
    {
      visualDrift: 0.6,
      colorHarmony: 0.7,
      // Other signals exist in the repo but are not fully shown in retrieved context.
      // TypeScript will enforce completeness when run against the actual repo.
    } as RewardSignals,
    {
      visualDrift: 0.55,
      colorHarmony: 0.8,
    } as RewardSignals,
  ]

  // normalizeGroupRelative: exists and is tested in the repo.
  const normalized = normalizeGroupRelative(candidates)

  // computeCompositeScore + rankCandidates: exist and are tested in the repo.
  const scored = normalized.map((signals) =>
    computeCompositeScore(signals, DEFAULT_REWARD_WEIGHTS),
  )
  const ranked = rankCandidates(scored)
  console.log({ ranked })
}

demo()
```
The thing I like about this design is that normalization is explicit and testable as a standalone step; when it’s buried inside a scorer, you can’t easily prove you’re not double-normalizing or skipping it on some paths.
Phase 2: per-head z-score normalization is explicitly added
The Phase 2 diff in lib/reward-mixer.ts adds a section header:
- “Per-head z-score normalization (Phase 2 calibration fix)”
And introduces:
```typescript
export interface HeadStats { mean: number; std: number }
```
That’s enough to ground the existence of z-score normalization per head, but the retrieved context doesn’t include the exact formula implementation. I’m not going to fabricate it.
Instead, here’s a runnable snippet that shows how the repo exposes the types, while keeping the implementation boundary honest.
```typescript
// scripts/head-stats-shape-demo.ts
import { type HeadStats } from "../lib/reward-mixer"

// This script exists to document the *shape* of per-head stats used for
// z-score normalization in Phase 2. The actual normalization function
// is implemented in lib/reward-mixer.ts (not shown in retrieved context).
const example: HeadStats = { mean: 0, std: 1 }
console.log(example)
```
What surprised me is how often “normalization bugs” aren’t bugs at all—they’re missing interfaces. Once HeadStats exists, the rest of the pipeline has a place to hang calibration outputs.
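For completeness, here is what a z-score over the `HeadStats` shape would conventionally look like. This is the textbook formula, not the repo's `normalizeHeadScore` (which isn't shown); the zero-std guard is my assumption.

```typescript
// Minimal z-score sketch over the HeadStats shape from lib/reward-mixer.ts.
// The guard for std === 0 is an assumption, not confirmed repo behavior.

interface HeadStats {
  mean: number
  std: number
}

function zScore(raw: number, stats: HeadStats): number {
  if (stats.std === 0) return 0 // degenerate head: no spread, no signal
  return (raw - stats.mean) / stats.std
}

// A raw 0.8 from a head whose population sits at mean 0.6, std 0.1
// becomes roughly +2 standard deviations, comparable across heads.
console.log(zScore(0.8, { mean: 0.6, std: 0.1 }))
```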
Group-relative normalization: why the naive approach fails
The naive approach to multi-signal scoring is to compute a weighted sum of raw signal values.
That fails when:
- One signal’s scale drifts (say, because routing changes which candidates reach the scorer).
- Another signal saturates near 0 or 1.
- Or one head produces `null` for a subset of candidates.
The reward mixer’s header comment tells you exactly what it’s replacing:
- “single-dimensional CLIP-only candidate scoring”
And it tells you the fix:
- “weighted multi-signal scoring system”
- “GRPO-style group-relative normalization”
In other words, I’m not trying to pick “the best absolute score.” I’m trying to pick “the best candidate relative to the group I just sampled.” That’s the stabilizer.
One analogy, once: it’s like judging a diving competition where the pool temperature changes between rounds. If you score divers purely on raw splash size, you’ll punish the round where the water is choppier. Group-relative normalization is me saying: score within the round, then compare.
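The standard form of that stabilizer can be sketched directly. This assumes the conventional GRPO definition (subtract the group mean, divide by the group standard deviation); the repo's `normalizeGroupRelative` may differ, since only its name and tests appear in the retrieved context, so I've used a different name here.

```typescript
// Group-relative normalization sketch (standard GRPO form, assumed).
// Each candidate is scored relative to the group it was sampled with.

function groupRelative(scores: number[]): number[] {
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length
  const variance =
    scores.reduce((a, b) => a + (b - mean) ** 2, 0) / scores.length
  const std = Math.sqrt(variance)
  if (std === 0) return scores.map(() => 0) // identical group: no ranking signal
  return scores.map((s) => (s - mean) / std)
}

// Two "rounds" with different absolute scales produce the same relative
// ordering and magnitudes after normalization: the scale drift cancels.
console.log(groupRelative([0.2, 0.4, 0.6]))
console.log(groupRelative([2, 4, 6]))
```

Both calls print the same normalized values, which is the whole point: the choppy-pool round and the calm-pool round become comparable.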
Calibration dataflow (Phase 2)
The calibration loop is a pipeline: baseline data becomes per-category stats, which becomes a threshold table, which is then used at runtime to choose an effective threshold and log its source.
```mermaid
flowchart TD
  subgraph offline
    baselineCsv[all_contracts_scored.csv] --> statsJob[Threshold sweep analysis]
    statsJob --> thresholdTable[Per-category threshold table]
  end
  subgraph runtime
    compiledPrompt[Compiled prompt] --> oodEval[evaluateOOD]
    routingContext[Routing context] --> oodEval
    thresholdTable --> oodEval
    oodEval --> oodEvent[logOODEvent]
    oodEval --> gateDecision[Bypass surrogate or not]
  end
```
The important detail is that runtime gating is not just “uncertainty > threshold.” It’s “uncertainty compared to effective_threshold” with threshold_source recorded, and an ood_event_id that lets me mark GPU errors later.
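That gating step, with its provenance, can be sketched as follows. The lookup-with-fallback policy is an assumption consistent with the `effective_threshold` and `threshold_source` fields; the threshold values and the `gate` helper are invented for illustration.

```typescript
// Sketch of a gate decision carrying the provenance the OODResult
// fields describe. Values are illustrative, not from the repo.

interface GateDecision {
  bypass: boolean
  effective_threshold: number
  threshold_source: string // 'global' or a category name
}

const GLOBAL_THRESHOLD = 0.5
const CATEGORY_THRESHOLDS: Record<string, number> = {
  SCENIC: 0.35, // tighter (lower) than global, per the repo's comment
  ACTION: 0.4,
}

function gate(uncertainty: number, category?: string): GateDecision {
  const perCategory = category ? CATEGORY_THRESHOLDS[category] : undefined
  const effective_threshold = perCategory ?? GLOBAL_THRESHOLD
  const threshold_source = perCategory !== undefined ? category! : "global"
  return {
    bypass: uncertainty > effective_threshold, // high uncertainty → full render
    effective_threshold,
    threshold_source,
  }
}

// An uncertainty of 0.4 clears the global gate but trips SCENIC's
// tighter threshold — and the result records which one was applied.
console.log(gate(0.4, "SCENIC"))
console.log(gate(0.4))
```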
Operational concerns grounded in the repo
1) Cold-start and pre-flight checks
The retrieved context shows explicit work on pre-flight checks and model health:
- `lib/runpod-client.ts` adds `checkModelHealth(model: RunPodModel): Promise<boolean>`, described as:
  - “pings the model's /health endpoint with a 5s timeout”
  - “Returns true if the pod is reachable, false otherwise.”
And scripts/run-phase1-baseline.ts adds a flag:
- `--require-all-models`: “Abort if any GPU model is down (use for whitepaper runs)”
This matters for calibration because baseline runs are only meaningful if the infrastructure is stable. If one model is flapping, you’ll get category skews that look like distribution shift.
2) GPU error segmentation and superseding
The Phase 2 changes explicitly mention:
- “GPU error segmentation” in the commit title.
- `gpu_error` and `superseded` fields in telemetry.
- `ood_event_id` added to `OODResult`, “used for gpu_error marking.”
And the baseline runner docs mention:
- `--rerun-contracts=...`: “marks old gpu_error events as superseded”
That’s exactly the kind of operational hygiene calibration needs: you can rerun the same contracts, keep lineage, and avoid contaminating the sweep with known-bad events.
3) Streaming updates vs periodic recalibration
The retrieved context does not include a mechanism for streaming threshold updates or a scheduled recalibration job. What it does include is the existence of:
- a baseline CSV output (`outputs/whitepaper/phase1-baseline/all_contracts_scored.csv`)
- Phase 2 notes that thresholds were derived from a sweep
So the grounded operational story here is periodic recalibration from baseline runs, not streaming adaptation.
4) How reward normalization interacts with weights (without inventing numbers)
The test suite imports DEFAULT_REWARD_WEIGHTS and the reward mixer comment describes “weighted” fusion. But the retrieved context does not show the actual weight values.
So I can’t tell you “visual drift is weighted X” or “motion is weighted Y.”
What I can say, grounded in the Phase 2 change log, is the intent:
- Phase 2 added “reward normalization” changes.
- Phase 2 added “per-head z-score normalization.”
That combination is how I avoid catastrophic acceptance/rejection when the gate changes which candidates are scored: normalize first, then apply weights.
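The order of operations is the part worth pinning down in code. The sketch below shows normalize-then-weight over the five reward heads named in the baseline CSV; the weight values are entirely invented, since `DEFAULT_REWARD_WEIGHTS` is imported by the tests but its values aren't shown in the retrieved context.

```typescript
// Order-of-operations sketch: per-head normalization happens BEFORE the
// weighted sum. Weight values are hypothetical placeholders.

type Head = "r_narrative" | "r_motion" | "r_visual_drift" | "r_color" | "r_composition"

const HYPOTHETICAL_WEIGHTS: Record<Head, number> = {
  r_narrative: 0.25,
  r_motion: 0.2,
  r_visual_drift: 0.25,
  r_color: 0.15,
  r_composition: 0.15,
}

function composite(
  normalized: Record<Head, number>, // already z-scored per head
  weights: Record<Head, number>,
): number {
  return (Object.keys(weights) as Head[]).reduce(
    (sum, head) => sum + weights[head] * normalized[head],
    0,
  )
}

// Every head sitting exactly at its population mean → composite 0,
// regardless of what the weights are. That invariance is the point.
console.log(
  composite(
    { r_narrative: 0, r_motion: 0, r_visual_drift: 0, r_color: 0, r_composition: 0 },
    HYPOTHETICAL_WEIGHTS,
  ),
)
```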
What went wrong (and why Phase 2 exists)
Phase 2’s design doc (docs/plans/2026-02-21-hunyuan-handler-overhaul-design.md) states the trigger:
- “Phase 2 baseline — 20/20 hunyuan (ABSTRACT) contracts failed with `BaseModelOutputWithPastAndCrossAttentions` attribute error”
And it gives a root cause:
- “`transformers >4.47.1` changed text encoder output types.”
That’s not an OOD math bug, but it’s a calibration killer: if an entire category/model slice is failing, your baseline dataset becomes biased. That’s why the same Phase 2 window includes GPU error tracking and handler fixes—calibration is only as good as the data you can actually collect.
Nuances I care about now (because Phase 2 forced me to)
One constraint I baked into the OOD interface is provenance. threshold_source is not a cute extra—it’s how I debug calibration drift without guessing.
If I see that a run used threshold_source: 'global' when prompt_category was present, that’s a wiring bug. If I see threshold_source set to a category name but the category has too few baseline samples, that’s a calibration coverage problem. The retrieved context doesn’t show the low-sample handling policy (e.g., shrinkage toward global), so I won’t claim it exists—but the interface is already designed to make such a policy implementable without changing every call site.
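To be concrete about what such a policy could look like without claiming the repo has one: a common choice is to shrink a low-sample category's threshold toward the global default in proportion to how much baseline evidence it has. Everything in this sketch (`shrunkThreshold`, the pseudo-count) is hypothetical.

```typescript
// Hypothetical shrinkage policy for low-sample categories: pull the
// per-category threshold toward the global default when evidence is thin.

function shrunkThreshold(
  categoryThreshold: number,
  globalThreshold: number,
  sampleCount: number,
  pseudoCount = 20, // strength of the pull toward global; invented value
): number {
  const w = sampleCount / (sampleCount + pseudoCount)
  return w * categoryThreshold + (1 - w) * globalThreshold
}

// With only 5 baseline samples, the category estimate barely moves the
// needle; with 200 samples it dominates.
console.log(shrunkThreshold(0.35, 0.5, 5))
console.log(shrunkThreshold(0.35, 0.5, 200))
```

Because `threshold_source` already exists, a policy like this could even report a blended provenance without touching any call site.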
On the reward side, Phase 2’s per-head normalization types are the same kind of “future-proofing.” Once HeadStats exists, I can store population statistics for each head and normalize consistently. The repo proves the type exists and the feature was added as a calibration fix; it doesn’t show the persistence or computation path for those stats in the retrieved context.
Closing
Phase 2 calibration wasn’t me adding knobs—it was me adding accounting. Once I can say which threshold was used, why it was used, and normalize rewards in a way that survives routing shifts, the whole pipeline stops feeling like a haunted house of heuristics and starts behaving like an instrumented system that can be tuned without superstition.
Further reading
- For background on out‑of‑distribution detection behavior and why detection thresholds can vary by slice, see the OOD detection literature (e.g., Hendrycks & Gimpel's work on OOD signals and thresholding approaches): https://arxiv.org/abs/1610.02136
- For why calibration and normalization matter for learned scores and downstream decisions, see work on modern neural network calibration (e.g., Guo et al., 2017): https://arxiv.org/abs/1706.04599