Tram Victor

Calibrated LLM-as-judge: how I made my LLM give honest 4/10 scores instead of always-an-8

TL;DR

Built a UGC ad-script generator (5 scripts per request). Each script's hook is self-scored 1-10 by the same LLM. Naive prompt = every hook scores 8-9, useless. Fixed by writing a calibration rubric in the system prompt, anchoring with 3 worked examples, and forcing structured output with a strict JSON schema. Now scores spread 4-9 and correlate with which one I'd actually film. Code + prompts inside.

If you've been burned by "LLM-as-judge always says 9/10," this is one way to fix it without RLHF, fine-tuning, or a second model.


The problem

I built ScriptHook, which generates 5 UGC ad scripts at a time for TikTok / Reels / Shorts. Each script comes back with a hook (the first 1.5s of the video), value beats, CTA, B-roll, on-screen text, caption, and hashtags.

5 hooks per generation. Buyer needs to pick which one to film first. So I asked the model to self-score each hook 1-10.

First version, every hook came back at 8 or 9. Sometimes a 7 if I begged. Never below.

That's not a score, that's a participation trophy.

Why naive LLM-as-judge fails

Two reasons:

  1. Models are trained to be helpful. "Helpful" overlaps a lot with "encouraging." Telling a user their hook is a 4/10 feels mean, so the model rounds up.
  2. No anchor for what 4 means. If the prompt says "score 1-10," the model has no shared scale with you. It defaults to the same "polite middle" distribution you see in Yelp reviews and Uber driver ratings.

The fix isn't a smarter model. It's giving the model a calibrated rubric with examples so "4/10" has a concrete meaning the model can match against.

The fix — 3 layers

Layer 1: Anchored rubric in the system prompt

You score hooks 1-10 on a calibrated scale. The scale is NOT a politeness scale.
You MUST give scores below 6 when warranted. Most generic AI-written hooks
should score 4-6, not 7-8.

10: Stops scroll instantly. Specific number, contrarian, or pattern-break.
    Example: "Why I quit Athletic Greens after 90 days."
 9: Strong specific hook with curiosity gap. "3 reasons your skincare isn't working."
 8: Solid named-pain or named-promise. "If your hair frizzes by noon, watch this."
 7: Workable but unspecific. "Tired of feeling tired?"
 6: Generic curiosity hook. "Have you ever wondered why..."
 5: Cliche or AI-sounding. "In today's fast-paced world..."
 4: Lazy promise with no specificity. "You won't believe this."
 3: Word soup. "Get ready for the ultimate experience."
 2: Off-brand or off-product.
 1: Incoherent.

Default expectation for AI-generated hooks: 5-7. Push above 7 only when the hook
has a specific number, named pain, contrarian frame, or pattern-break opener.

The key sentence: "Most generic AI-written hooks should score 4-6, not 7-8." This is the prior the model needs to overcome its politeness bias.

Layer 2: Worked examples in the rubric

Showing the model 3 hooks with their scores teaches more than 100 lines of rubric.

EXAMPLES:

Hook: "Want better skin? Try this." → score 4
Reason: Generic, no specificity, no number, no named-pain.

Hook: "I tested 12 retinol serums for 30 days. One actually worked." → score 9
Reason: Specific number (12, 30), implied scarcity (one actually worked), credible framing.

Hook: "Stop wasting money on supplements that don't absorb." → score 7
Reason: Named-pain (wasting money), category-clear, but not yet specific to a product.

These three examples shift the model from a politeness distribution to a calibrated one.
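
A minimal sketch of how the rubric (Layer 1) and the worked examples (Layer 2) could be assembled into one system prompt. The names here (RubricLevel, WorkedExample, buildSystemPrompt, sketchPrompt) are hypothetical and not taken from the real repo; only a few rubric levels are included to keep it short.

```typescript
// Hypothetical helper: the names below are illustrative, not from the real repo.
interface RubricLevel {
  score: number;       // 1-10
  description: string; // what this score means, ideally with a sample hook
}

interface WorkedExample {
  hook: string;
  score: number;
  reason: string;
}

function buildSystemPrompt(
  preamble: string,
  levels: RubricLevel[],
  examples: WorkedExample[],
): string {
  // Sort a copy so the scale always reads from 10 down to 1.
  const rubric = levels
    .slice()
    .sort((a, b) => b.score - a.score)
    .map((l) => (l.score < 10 ? " " : "") + l.score + ": " + l.description)
    .join("\n");

  // Same "Hook / Reason" layout as the worked examples in the post.
  const shots = examples
    .map((e) => 'Hook: "' + e.hook + '" → score ' + e.score + "\nReason: " + e.reason)
    .join("\n\n");

  return [preamble, rubric, "EXAMPLES:", shots].join("\n\n");
}

const sketchPrompt = buildSystemPrompt(
  "You score hooks 1-10 on a calibrated scale. The scale is NOT a politeness scale.",
  [
    { score: 4, description: 'Lazy promise with no specificity. "You won\'t believe this."' },
    { score: 10, description: "Stops scroll instantly. Specific number, contrarian, or pattern-break." },
    { score: 7, description: 'Workable but unspecific. "Tired of feeling tired?"' },
  ],
  [
    {
      hook: "Want better skin? Try this.",
      score: 4,
      reason: "Generic, no specificity, no number, no named-pain.",
    },
  ],
);

console.log(sketchPrompt);
```

Keeping the levels and examples as data makes it easier to swap anchors when the domain changes, which matters given how tightly the rubric is tied to UGC ad hooks.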

Layer 3: Structured output with a strict JSON schema

Gemini 2.5 Flash Lite (the model I use) supports responseMimeType: "application/json" plus a responseSchema parameter. The OpenAI equivalent is response_format: { type: "json_schema", json_schema: { name, strict: true, schema } }.

const schema = {
  type: "object",
  properties: {
    scripts: {
      type: "array",
      minItems: 5,
      maxItems: 5,
      items: {
        type: "object",
        properties: {
          hook:        { type: "string" },
          hook_score:  { type: "integer", minimum: 1, maximum: 10 },
          hook_reason: { type: "string" },
          beats:       { type: "array", items: { type: "string" } },
          cta:         { type: "string" },
          b_roll:      { type: "array", items: { type: "string" } },
          on_screen:   { type: "array", items: { type: "string" } },
          caption:     { type: "string" },
          hashtags:    { type: "array", items: { type: "string" } },
        },
        required: [
          "hook", "hook_score", "hook_reason", "beats",
          "cta", "b_roll", "on_screen", "caption", "hashtags",
        ],
      },
    },
  },
  required: ["scripts"],
};

hook_reason is the magic field. Forcing the model to justify the score in 1-2 sentences before / alongside the number dramatically reduces "everything is 8" drift. The model now has to commit to a reason, and a reason like "generic curiosity hook" is hard to pair with a 9.

This is structurally similar to chain-of-thought, but enforced via schema, not via "let's think step by step." The reason has to exist as a field, so the model produces it.
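
The schema guarantees that hook_reason exists, but not that it says anything. Here is a rough sketch of a client-side check that drops entries with an empty reason or an out-of-range score and then orders the rest by score, so the buyer sees what to film first. ScoredHook, isScoredHook, and rankHooks are hypothetical names, and the sample data is made up for illustration.

```typescript
// Hypothetical sketch; only the three hook fields from the schema are checked here.
interface ScoredHook {
  hook: string;
  hook_score: number;
  hook_reason: string;
}

// Type guard: true only for objects with a non-empty reason and an integer score in 1-10.
function isScoredHook(value: unknown): value is ScoredHook {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const hook = v["hook"];
  const score = v["hook_score"];
  const reason = v["hook_reason"];
  return (
    typeof hook === "string" &&
    typeof reason === "string" &&
    reason.trim().length > 0 &&
    typeof score === "number" &&
    Math.floor(score) === score &&
    score >= 1 &&
    score <= 10
  );
}

// Keep valid entries and sort highest score first.
// filter() returns a new array, so the input is not mutated by sort().
function rankHooks(raw: unknown[]): ScoredHook[] {
  return raw.filter(isScoredHook).sort((a, b) => b.hook_score - a.hook_score);
}

// Made-up sample data, not real model output.
const sampleHooks: unknown[] = [
  { hook: "Want better skin? Try this.", hook_score: 4, hook_reason: "Generic, no number." },
  { hook: "I tested 12 retinol serums for 30 days.", hook_score: 9, hook_reason: "Specific numbers." },
  { hook: "You won't believe this.", hook_score: 8, hook_reason: "   " }, // blank reason: dropped
];

for (const h of rankHooks(sampleHooks)) {
  console.log(h.hook_score + "/10  " + h.hook + "  (" + h.hook_reason + ")");
}
```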

The Next.js / Vercel side

Stack:

  • Next.js 14 App Router
  • TypeScript
  • Tailwind
  • @google/generative-ai SDK
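  • Vercel Edge runtime for the generation endpoint

// app/api/generate/route.ts
// `schema` is the object shown above. SYSTEM_PROMPT holds the rubric and examples
// from Layers 1-2; buildUserPrompt is not shown in this post.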
import { GoogleGenerativeAI } from "@google/generative-ai";

export const runtime = "edge";

const genAI = new GoogleGenerativeAI(process.env.GOOGLE_AI_API_KEY!);

export async function POST(req: Request) {
  const { product, platform, tone, hook_style, length } = await req.json();

  const model = genAI.getGenerativeModel({
    model: "gemini-2.5-flash-lite",
    generationConfig: {
      responseMimeType: "application/json",
      responseSchema: schema,
      temperature: 0.9,
    },
    systemInstruction: SYSTEM_PROMPT,
  });

  const userPrompt = buildUserPrompt(product, platform, tone, hook_style, length);
  const result = await model.generateContent(userPrompt);
  const parsed = JSON.parse(result.response.text());

  return Response.json(parsed);
}

A few things I tried and rejected:

  • Separate scoring call. Generate 5 hooks → second call to score them. Worked, but doubled latency and cost, and the second call still scored politely without the same rubric. So inline scoring won.
  • Temperature = 0.3 for scoring. Caused mode collapse: every hook got the same score. 0.9 + a strict rubric gave better spread.
  • GPT-4o-mini. Worked well but was ~5× the cost per generation. Switched to Gemini 2.5 Flash Lite when its structured-output mode shipped.

Results

Across ~120 generations I logged manually:

Score   Naive prompt   Calibrated prompt
9-10    12%            18%
7-8     78%            41%
5-6     9%             29%
3-4     1%             11%
1-2     0%             1%

Mean shifted from 7.9 → 6.7. Standard deviation roughly doubled. And the part that actually matters: the score now lines up with which one I'd film. When I asked 3 friends to rank the 5 hooks blind, the model's top pick matched theirs 64% of the time, vs 33% for the naive prompt, which is not far above the 20% a random pick among 5 hooks would get.
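
For anyone who wants to track the same numbers on their own logs, this is one way the summary stats could be computed. It is a sketch under assumptions: meanOf, stdDevOf, bandShares, and SCORE_BANDS are hypothetical names, the standard deviation is the population form, and the sample scores are made up rather than taken from my logs.

```typescript
// Arithmetic mean; assumes a non-empty array.
function meanOf(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Population standard deviation (divides by n, not n - 1).
function stdDevOf(xs: number[]): number {
  const m = meanOf(xs);
  return Math.sqrt(meanOf(xs.map((x) => (x - m) * (x - m))));
}

// Same bands as the results table: [label, low, high], inclusive.
const SCORE_BANDS: Array<[string, number, number]> = [
  ["9-10", 9, 10],
  ["7-8", 7, 8],
  ["5-6", 5, 6],
  ["3-4", 3, 4],
  ["1-2", 1, 2],
];

// Percentage of scores that fall in each band; assumes a non-empty array.
function bandShares(scores: number[]): Record<string, number> {
  const shares: Record<string, number> = {};
  for (const [label, lo, hi] of SCORE_BANDS) {
    const count = scores.filter((s) => s >= lo && s <= hi).length;
    shares[label] = (count * 100) / scores.length;
  }
  return shares;
}

// Made-up scores for one generation of 5 hooks.
const illustrativeScores = [9, 8, 7, 5, 4];
console.log("mean:", meanOf(illustrativeScores).toFixed(2));
console.log("std dev:", stdDevOf(illustrativeScores).toFixed(2));
console.log("bands:", bandShares(illustrativeScores));
```

A widening standard deviation is the quickest signal that the rubric is working; if it collapses back toward zero, the judge has gone polite again.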

Not great. Not bad. Workable.

Where this breaks

  • Domain drift. Calibrated for UGC ad hooks. If I used the same rubric for, say, B2B sales emails, the anchors don't fit and it'd revert to politeness scores.
  • Same-model bias. The judge is the same model that wrote the hook. It has a blind spot for its own failure modes. A real fix would be a different model as judge (or a small fine-tuned scorer).
  • Score inflation under repetition. If you regenerate the same product 10 times in one session, scores creep up. Probably context-window contamination. I haven't fully chased this down; the current workaround is to not reuse the session.

Things I'd love feedback on

  1. Has anyone tried this with a separate-model judge (e.g. Gemini hooks scored by Claude Haiku)? Curious if cross-model judging beats same-model rubric calibration.
  2. Better way to enforce diversity across the 5 hooks than just temperature? I tried contrastive prompting (telling the model "make hook #5 maximally different from hook #1") — mixed results.
  3. Anyone using DSPy for prompt optimization on this kind of rubric task? My rubric is hand-tuned and I suspect DSPy would beat me.

If you want to poke at the actual product, it's free for 3 generations, no signup, at scripthook.vercel.app. $19 lifetime for 50, $39/mo unlimited if you build a lot of UGC.

Code samples in this post are simplified for clarity. The real repo has retries, a schema-validation fallback (Gemini occasionally returns malformed JSON, roughly 1 in 200 requests), and a backend-pluggable adapter so the same code runs against Claude, OpenAI, or Gemini.
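
To give a rough idea of what the retry and validation fallback might look like, here is a simplified sketch. It is not the repo's code: safeParse, hasScripts, and generateWithRetry are hypothetical names, and the generate callback stands in for whatever call returns the raw response text.

```typescript
// Hypothetical sketch; none of these names come from the real repo.

// Returns the parsed value, or undefined when the text is not valid JSON.
function safeParse(text: string): unknown {
  try {
    return JSON.parse(text);
  } catch {
    return undefined;
  }
}

// Minimal shape check: an object with a `scripts` array.
function hasScripts(value: unknown): value is { scripts: unknown[] } {
  return (
    typeof value === "object" &&
    value !== null &&
    Array.isArray((value as { scripts?: unknown }).scripts)
  );
}

// Calls `generate` until the output parses and passes `isValid`, up to maxAttempts.
// Errors thrown by `generate` itself (for example network failures) are not caught here.
async function generateWithRetry<T>(
  generate: () => Promise<string>,
  isValid: (value: unknown) => value is T,
  maxAttempts: number = 3,
): Promise<T> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const parsed = safeParse(await generate());
    if (isValid(parsed)) return parsed;
  }
  throw new Error("no valid response after " + maxAttempts + " attempts");
}
```

In the route handler, the generate callback could wrap the existing call, for example async () => (await model.generateContent(userPrompt)).response.text(), with hasScripts as the validator. Each retry is a full extra request, so a low maxAttempts keeps worst-case latency and cost bounded.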

Happy to answer questions in comments.
