Daniel Romitelli

Posted on • Originally published at romiteld.com

MR‑GRPO in Practice: The Reward Mixer That Stops CLIP From Lying to Your Scene Compiler

CLIP is a single thermometer taped to the outside of a house. It tells you something about the temperature, but it can't tell you whether the furnace is on, whether a window is open, or whether the house is on fire in the kitchen. In my pipeline, CLIP-only ranking produced the same kind of failure: candidates that "looked similar" in embedding space but violated constraints that mattered for continuity. (See Radford et al. for the original CLIP formulation and caveats about what an embedding captures.)

I found this out the hard way. I built Scenematic to compile scenes, not just prompts—and that means I needed a scoring system that could survive reality. Reality looks like this: I generate multiple candidate prompts for a scene, and I need to pick one that will actually preserve continuity constraints—character identity, camera style, color grade—across a directed scene graph. My first scoring approach was the obvious one: CLIP similarity to a source frame. It worked right up until it didn't.

So I replaced it with what the code calls a Multi‑Signal Reward Mixer (MR‑GRPO inspired)—a weighted, multi-head scoring system with GRPO-style group-relative normalization.

That's the feature in this post: the mixer itself. Not the downstream generation, not the prompt compiler, not post-processing—just the algorithmic core that turns messy, partially-missing signal outputs into a single decision.

## Key insight: normalize within the candidate group, not against a fantasy global scale

The non-obvious part of this system isn't "use more signals." Everyone says that.

The trick is: don't trust raw signal magnitudes across candidates.

Even when a signal is documented as "0–1", its effective distribution can drift depending on scene category, prompt structure, or just the quirks of the underlying analyzers. If I combine raw values naively, one signal can dominate simply because it has higher variance—or because it saturates near 1.0 most of the time.

So the mixer does something deliberately local. It evaluates each candidate across multiple independent reward signals, then for each signal head computes population statistics—mean and standard deviation—across the candidate group. Each candidate's head score gets converted into a normalized value using z-score normalization (the code explicitly models this as "Per-head z-score normalization (Phase 2 calibration fix)"). Only then does it compose a final score using weights.
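To make that concrete, here is a minimal, self-contained sketch of group-relative z-scoring for a single head. Everything in it (the epsilon value, the function names) is my illustration, not the code in `lib/reward-mixer.ts`:

```typescript
// Illustrative sketch only; not the implementation in lib/reward-mixer.ts.
// Shows group-relative z-score normalization for one signal head.

type HeadStatsSketch = { mean: number; std: number }

// Hypothetical guard against zero-variance groups; the real epsilon
// constant is not shown in the retrieved context.
const EPSILON = 1e-6

// Population mean/std for one head across the candidate group,
// ignoring candidates where the signal is missing (null).
// Assumes at least one value is present.
function computeHeadStats(values: Array<number | null>): HeadStatsSketch {
  const present = values.filter((v): v is number => v !== null)
  const mean = present.reduce((s, v) => s + v, 0) / present.length
  const variance = present.reduce((s, v) => s + (v - mean) ** 2, 0) / present.length
  return { mean, std: Math.sqrt(variance) }
}

// Z-score one candidate's head value against the group statistics.
function zScore(value: number, stats: HeadStatsSketch): number {
  return (value - stats.mean) / (stats.std + EPSILON)
}

// A head that saturates near 1.0 still produces a usable spread:
const stats = computeHeadStats([0.97, 0.98, 0.99, null])
console.log(zScore(0.99, stats) > 0, zScore(0.97, stats) < 0) // true true
```

The epsilon is purely defensive: if every candidate in the group reports the same value for a head, the z-score collapses to zero instead of dividing by zero, and that head simply stops influencing the ranking.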

And because production pipelines are messy, it also handles two things that matter more than they sound. If a candidate lacks a signal—value is null—that head is ignored entirely for that candidate. And when some heads are missing, weights are re-normalized over the remaining heads so the score remains comparable. That combination is what makes the mixer feel "stable" in practice: it doesn't hallucinate certainty, and it doesn't punish candidates for missing data.

## How it works under the hood

The implementation lives in `lib/reward-mixer.ts`.

The file header is blunt about what changed: it replaces single-dimensional CLIP-only candidate scoring, evaluates candidates across five independent reward signals (only three of which appear in the snippet below), and scores using GRPO-style group-relative normalization.

The type that anchors everything is `RewardSignals`:

```typescript
/** Individual reward signal scores for a single candidate */
export interface RewardSignals {
  /** CLIP embedding cosine similarity to source frame (0-1) */
  visualDrift: number | null
  /** Color palette consistency score (0-1) */
  colorHarmony: number | null
  /** Motion direction alignment score (0-1) */
  motionContinuity: number | null
}
```

What surprised me when I wired this up is how quickly `null` became the "normal" case. In production, analyzers fail, time out, or just can't produce a score for a given candidate; treating that as a first-class input made the whole selection loop dramatically less brittle.

## Per-head normalization (mean/std with epsilon)

Phase 2 introduced a specific fix: per-head population statistics for z-score normalization.

Those stats are explicitly modeled:

```typescript
/** Per-head population statistics for z-score normalization */
export interface HeadStats {
  mean: number
  std: number
}
```

The retrieved context shows the intent ("Per-head z-score normalization (Phase 2 calibration fix)") and the stats structure, but it does not include the exact epsilon constant or the full implementation of the normalization function. In my codebase, this exists as functions exported from `lib/reward-mixer.ts` (and tested in `lib/__tests__/reward-mixer.test.ts`), but the internals are not present in the provided snippet.

So instead of inventing the math, here is the actual callable surface that the tests import—grounded names, grounded exports—and a runnable example that documents the input/output shapes without fabricating internals.

```typescript
// example/rewardMixerSurface.ts
// This file is intentionally limited to what is visible in retrieved context.
// The real implementations live in `lib/reward-mixer.ts`.

import {
  computeCompositeScore,
  rankCandidates,
  identifyWeakSignals,
  identifyWeakSignalsPerHead,
  normalizeHeadScore,
  mapSignalToTriggerHead,
  computeSubReason,
  DEFAULT_REWARD_WEIGHTS,
  type RewardSignals,
  type PerHeadHITLConfig,
} from "../lib/reward-mixer"

export type Candidate = {
  id: string
  signals: RewardSignals
}

// A tiny harness that demonstrates how the mixer API is used.
// Note: we cannot show internal math here because it is not included in retrieved context.
export function scoreAndRank(candidates: Candidate[], hitl?: PerHeadHITLConfig) {
  const scored = candidates.map((c) => {
    // normalizeHeadScore(...) exists and is imported by tests, but its internals are not shown.
    // computeCompositeScore(...) exists and is imported by tests, but its internals are not shown.
    const composite = computeCompositeScore(c.signals, DEFAULT_REWARD_WEIGHTS)
    const weak = identifyWeakSignals(c.signals)

    return { ...c, composite, weak }
  })

  // rankCandidates(...) exists and is imported by tests, but its internals are not shown.
  const ranked = rankCandidates(scored.map(({ id, signals }) => ({ id, signals })))

  return { scored, ranked }
}
```

The non-obvious detail here is that I force myself (and future me) to interact with the mixer through exported functions that are tested. That's how I keep "reward logic" from turning into a pile of ad-hoc if-statements scattered across the pipeline.

## Mapping signals to heads (and why it matters)

The tests also import `mapSignalToTriggerHead`.

That name is doing a lot of work: it implies the mixer isn't just producing a scalar score—it's also producing structured "why" metadata that can be used to trigger HITL (human-in-the-loop) gates or diagnostics.

Again, the retrieved context does not include the function body. What I can ground is that it exists, it's part of the public surface, and it's used in the test suite alongside `identifyWeakSignalsPerHead` and `computeSubReason`.

Here's a runnable example that shows how I thread that mapping through without guessing the mapping table:

```typescript
// example/rewardReasons.ts
import {
  mapSignalToTriggerHead,
  computeSubReason,
  type RewardSignals,
} from "../lib/reward-mixer"

export function explainWeakness(signals: RewardSignals) {
  // identifyWeakSignalsPerHead(...) exists but its internals are not present in retrieved context.
  // Here we demonstrate how the exported mapping helpers would be used.

  const entries = Object.entries(signals) as Array<[keyof RewardSignals, number | null]>

  return entries
    .filter(([, v]) => v !== null)
    .map(([k, v]) => {
      const head = mapSignalToTriggerHead(k)
      const sub = computeSubReason(k, v as number)
      return { signal: k, value: v, head, subReason: sub }
    })
}
```

What broke for me early on was trying to bolt "explanations" onto the score after the fact. By baking the mapping into the mixer layer, I can keep the scoring decision and the debugging story consistent.
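For readers who want to see the shape of such a mapping, here is a purely hypothetical version. None of these head names or table entries come from the codebase; they only demonstrate the kind of structured output the exported helpers imply:

```typescript
// Hypothetical sketch only; the real mapSignalToTriggerHead table is not
// shown in the retrieved context. This demonstrates the shape, not the values.

type SignalKey = "visualDrift" | "colorHarmony" | "motionContinuity"
type TriggerHead = "identity" | "grade" | "motion"

// Invented table: which HITL trigger head a weak signal should fire.
const TRIGGER_HEADS: Record<SignalKey, TriggerHead> = {
  visualDrift: "identity",    // identity drift -> human check on character/frame
  colorHarmony: "grade",      // palette break -> human check on color grade
  motionContinuity: "motion", // motion mismatch -> human check on camera move
}

function mapSignalToTriggerHeadSketch(signal: SignalKey): TriggerHead {
  return TRIGGER_HEADS[signal]
}

console.log(mapSignalToTriggerHeadSketch("colorHarmony")) // "grade"
```

The design point survives even with invented values: returning a structured head instead of a bare score is what lets downstream HITL gates act on why a candidate is weak, not just how weak it is.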

## The candidate scoring pipeline: sources → normalizer → mixer → reranker

The mixer is only one box in the pipeline, but it's the box that decides what survives.

Here's the architecture at the level that matters for this post:

```mermaid
flowchart TD
  subgraph signalSources
    visualDrift[visualDrift]
    colorHarmony[colorHarmony]
    motionContinuity[motionContinuity]
  end

  visualDrift --> normalizer[perHeadNormalizer]
  colorHarmony --> normalizer
  motionContinuity --> normalizer
  normalizer --> mixer[rewardMixer]
  mixer --> reranker[rankCandidates]
  reranker --> output[selectedCandidate]
```



I think of this like mixing audio tracks: you don't want the loudest microphone to dominate the whole song just because it's loud. You normalize each track, then mix with intent.
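To make the mixing metaphor concrete, here is a self-contained sketch of the whole flow: per-head group stats, z-score, weighted mix with null-skip, then rank. The weights, epsilon, and function names are my assumptions for illustration; the real `lib/reward-mixer.ts` internals are not shown in the retrieved context:

```typescript
// Illustrative pipeline sketch: sources -> per-head normalizer -> mixer -> reranker.
// Weights, epsilon, and function names are assumptions, not the codebase's.

type SignalsSketch = Record<string, number | null>
type CandidateSketch = { id: string; signals: SignalsSketch }

const WEIGHTS: Record<string, number> = {
  visualDrift: 0.5,      // assumed weight
  colorHarmony: 0.3,     // assumed weight
  motionContinuity: 0.2, // assumed weight
}
const EPS = 1e-6

function rankSketch(candidates: CandidateSketch[]): string[] {
  const heads = Object.keys(WEIGHTS)

  // 1. Per-head population stats across the candidate group (nulls excluded).
  const stats = new Map<string, { mean: number; std: number }>()
  for (const h of heads) {
    const vals = candidates
      .map((c) => c.signals[h])
      .filter((v): v is number => v != null)
    const n = Math.max(vals.length, 1)
    const mean = vals.reduce((s, v) => s + v, 0) / n
    const std = Math.sqrt(vals.reduce((s, v) => s + (v - mean) ** 2, 0) / n)
    stats.set(h, { mean, std })
  }

  // 2. Mix z-scored heads; skip nulls and re-normalize weights over present heads.
  const scored = candidates.map((c) => {
    let score = 0
    let weightSum = 0
    for (const h of heads) {
      const v = c.signals[h]
      if (v == null) continue
      const { mean, std } = stats.get(h)!
      score += WEIGHTS[h] * ((v - mean) / (std + EPS))
      weightSum += WEIGHTS[h]
    }
    return { id: c.id, score: weightSum > 0 ? score / weightSum : 0 }
  })

  // 3. Rerank: highest composite first.
  return scored.sort((a, b) => b.score - a.score).map((s) => s.id)
}
```

Because every head is z-scored within the group, a head that saturates near 1.0 for all candidates contributes nothing, while a head with a genuine spread drives the ranking. That is the audio-mixing move in code.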

## Null-signal skip + weight re-normalization (the "production reality" feature)

In the `RewardSignals` type, every head is `number | null`.

That's not a TypeScript nicety. It's a design decision: the mixer must be able to score candidates even when one or more analyzers fail.

The file header explicitly says candidates are evaluated across independent signals, and the Phase 2 work added normalization and per-head handling. The test suite imports functions that strongly imply the following behaviors exist: `normalizeHeadScore` for per-head normalization, `computeCompositeScore` for final mixing, and `identifyWeakSignals` alongside `identifyWeakSignalsPerHead` for diagnostics.

The retrieved context does not include the exact weight re-normalization loop, so I won't fabricate it. What I *can* do is show how I structure inputs so that the mixer can do that correctly:



```typescript
// example/nullSignals.ts
import { computeCompositeScore, DEFAULT_REWARD_WEIGHTS, type RewardSignals } from "../lib/reward-mixer"

const a: RewardSignals = {
  visualDrift: 0.82,
  colorHarmony: null, // analyzer failed or not applicable
  motionContinuity: 0.41,
}

const b: RewardSignals = {
  visualDrift: 0.77,
  colorHarmony: 0.66,
  motionContinuity: null,
}

// computeCompositeScore(...) is responsible for handling nulls.
// The internal logic (skip + re-normalize) is not shown in retrieved context.
console.log({ a: computeCompositeScore(a, DEFAULT_REWARD_WEIGHTS) })
console.log({ b: computeCompositeScore(b, DEFAULT_REWARD_WEIGHTS) })
```

The key operational point: once you allow null, you stop writing brittle "if missing then score=0" hacks. That single change eliminated a whole class of silent ranking bugs for me.
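To be explicit that this is not the shipped code: here is one plausible shape of a skip-and-renormalize mixer, written from scratch as an illustration of the pattern, not a reconstruction of `computeCompositeScore`:

```typescript
// Illustrative only; not the real computeCompositeScore from lib/reward-mixer.ts.
// Shows the skip-null + re-normalize-weights pattern over raw head values.

type RewardSignalsSketch = {
  visualDrift: number | null
  colorHarmony: number | null
  motionContinuity: number | null
}

// Assumed weights for illustration; the real DEFAULT_REWARD_WEIGHTS are not shown.
const WEIGHTS_SKETCH = { visualDrift: 0.5, colorHarmony: 0.3, motionContinuity: 0.2 }

function compositeSketch(signals: RewardSignalsSketch): number | null {
  let sum = 0
  let weightSum = 0
  for (const key of Object.keys(WEIGHTS_SKETCH) as Array<keyof RewardSignalsSketch>) {
    const v = signals[key]
    if (v === null) continue // skip missing heads entirely
    sum += WEIGHTS_SKETCH[key] * v
    weightSum += WEIGHTS_SKETCH[key]
  }
  // Re-normalize over the heads that were actually present,
  // so partially-scored candidates stay comparable.
  return weightSum > 0 ? sum / weightSum : null
}

// With colorHarmony missing, its 0.3 weight is redistributed, not treated as zero:
console.log(compositeSketch({ visualDrift: 0.8, colorHarmony: null, motionContinuity: 0.6 }))
```

The all-null case returning `null` (rather than `0`) is the same honesty principle again: a candidate with no evidence is "unscorable", not "worst".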

## Practical extensions (grounded to what exists in this codebase)

There are natural extensions worth considering here: per-category reward normalizers, reward clipping, and lightweight calibration.

From the retrieved context, I can ground one of these clearly: per-category calibration exists in the broader system. In `lib/ood-detector.ts`, Phase 2 introduced "Per-category thresholds derived from threshold sweep analysis." That's OOD, not rewards—but it proves the system already supports category-conditioned calibration logic. Similarly, the reward mixer explicitly contains "Phase 2 calibration fix" for per-head normalization.

What I cannot ground from the provided context: any reward clipping implementation (no constants, no function names, no tests shown), any weight-fitting/calibration routine "from a small human-labeled set" (no file, function, or artifact shown), or any reference to "CinematicLexiconSchema" (not present in retrieved context). So I'll keep the extensions section honest and limited to what's evidenced.

### Extension 1: category-conditioned normalizers (pattern exists)

The OOD detector already does category-conditioned thresholding ("Phase 2: Per-category thresholds…"). The same pattern can be applied to reward normalization: maintain `HeadStats` per category instead of globally. I'm not including code for this because no per-category reward stats store or lookup function is present in the retrieved context.

### Extension 2: clamp/diagnose weak heads (evidenced by exported diagnostics)

The presence of `identifyWeakSignals` and `identifyWeakSignalsPerHead` (both imported by the tests) tells you the mixer supports head-level diagnostics. That's the foundation you need for any "don't let one head dominate" policy: you can detect which head is weak or strong and react. Again, no clipping function is shown in retrieved context, so I'm not going to invent one.

### Extension 3: HITL triggers per head (evidenced by mapping helpers)

The baseline CSV includes a `hitl_trigger` column, and the reward mixer exports `mapSignalToTriggerHead`. That's a concrete, shipped pattern: the mixer doesn't just score—it can emit structured triggers that the rest of the pipeline can act on.

## Closing

CLIP-only ranking felt like steering a car by watching a single wheel. The MR‑GRPO reward mixer is what finally made the whole vehicle track straight, because it scores candidates the way production behaves: multi-factor, partially missing, and always relative to the alternatives you actually have in hand.

## Research sources

- Radford, A., Kim, J. W., Hallacy, C., et al. "Learning Transferable Visual Models From Natural Language Supervision" (CLIP). https://arxiv.org/abs/2103.00020 — reference for CLIP and for the earlier discussion of why CLIP similarity is an incomplete single signal.
