TL;DR
Built a UGC ad-script generator (5 scripts per request). Each script's hook is self-scored 1-10 by the same LLM. Naive prompt = every hook scores 8-9, useless. Fixed by writing a calibration rubric in the system prompt, anchoring with 3 worked examples, and forcing structured output with a strict JSON schema. Now scores spread 4-9 and correlate with which one I'd actually film. Code + prompts inside.
If you've been burned by "LLM-as-judge always says 9/10," this is one way to fix it without RLHF, fine-tuning, or a second model.
The problem
I built ScriptHook, which generates 5 UGC ad scripts at a time for TikTok / Reels / Shorts. Each script comes back with a hook (the first 1.5s of the video), value beats, CTA, B-roll, on-screen text, caption, hashtags.
5 hooks per generation. Buyer needs to pick which one to film first. So I asked the model to self-score each hook 1-10.
First version, every hook came back at 8 or 9. Sometimes a 7 if I begged. Never below.
That's not a score, that's a participation trophy.
Why naive LLM-as-judge fails
Two reasons:
- Models are trained to be helpful. "Helpful" overlaps a lot with "encouraging." Telling a user their hook is a 4/10 feels mean, so the model rounds up.
- No anchor for what 4 means. If the prompt says "score 1-10," the model has no shared scale with you. It defaults to the same "polite middle" distribution you see in Yelp reviews and Uber driver ratings.
The fix isn't a smarter model. It's giving the model a calibrated rubric with examples so "4/10" has a concrete meaning the model can match against.
The fix: 3 layers
Layer 1: Anchored rubric in the system prompt
You score hooks 1-10 on a calibrated scale. The scale is NOT a politeness scale.
You MUST give scores below 6 when warranted. Most generic AI-written hooks
should score 4-6, not 7-8.
10: Stops scroll instantly. Specific number, contrarian, or pattern-break.
Example: "Why I quit Athletic Greens after 90 days."
9: Strong specific hook with curiosity gap. "3 reasons your skincare isn't working."
8: Solid named-pain or named-promise. "If your hair frizzes by noon, watch this."
7: Workable but unspecific. "Tired of feeling tired?"
6: Generic curiosity hook. "Have you ever wondered why..."
5: Cliche or AI-sounding. "In today's fast-paced world..."
4: Lazy promise with no specificity. "You won't believe this."
3: Word soup. "Get ready for the ultimate experience."
2: Off-brand or off-product.
1: Incoherent.
Default expectation for AI-generated hooks: 5-7. Push above 7 only when the hook
has a specific number, named pain, contrarian frame, or pattern-break opener.
The key sentence: "Most generic AI-written hooks should score 4-6, not 7-8." This is the prior the model needs to overcome its politeness bias.
Layer 2: Worked examples in the rubric
Showing the model 3 hooks with their scores teaches more than 100 lines of rubric.
EXAMPLES:
Hook: "Want better skin? Try this." → score 4
Reason: Generic, no specificity, no number, no named-pain.
Hook: "I tested 12 retinol serums for 30 days. One actually worked." → score 9
Reason: Specific number (12, 30), implied scarcity (one actually worked), credible framing.
Hook: "Stop wasting money on supplements that don't absorb." → score 7
Reason: Named-pain (wasting money), category-clear, but not yet specific to a product.
These three examples shift the model from a politeness distribution to a calibrated one.
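One way to keep Layers 1 and 2 maintainable is to store the anchors and worked examples as data and assemble the system prompt from them. The sketch below is a minimal illustration, not the prompt builder ScriptHook actually uses: the names `Anchor`, `WorkedExample`, `ANCHORS`, `EXAMPLES`, and `buildSystemPrompt` are all hypothetical, and the rubric text is copied from this post.

```typescript
// Hypothetical data model for the rubric; every name here is illustrative.
interface Anchor {
  score: number;
  description: string;
  example?: string;
}

interface WorkedExample {
  hook: string;
  score: number;
  reason: string;
}

// Anchor text copied from the rubric in this post.
const ANCHORS: Anchor[] = [
  { score: 10, description: "Stops scroll instantly. Specific number, contrarian, or pattern-break.", example: "Why I quit Athletic Greens after 90 days." },
  { score: 9, description: "Strong specific hook with curiosity gap.", example: "3 reasons your skincare isn't working." },
  { score: 8, description: "Solid named-pain or named-promise.", example: "If your hair frizzes by noon, watch this." },
  { score: 7, description: "Workable but unspecific.", example: "Tired of feeling tired?" },
  { score: 6, description: "Generic curiosity hook.", example: "Have you ever wondered why..." },
  { score: 5, description: "Cliche or AI-sounding.", example: "In today's fast-paced world..." },
  { score: 4, description: "Lazy promise with no specificity.", example: "You won't believe this." },
  { score: 3, description: "Word soup.", example: "Get ready for the ultimate experience." },
  { score: 2, description: "Off-brand or off-product." },
  { score: 1, description: "Incoherent." },
];

const EXAMPLES: WorkedExample[] = [
  { hook: "Want better skin? Try this.", score: 4, reason: "Generic, no specificity, no number, no named-pain." },
  { hook: "I tested 12 retinol serums for 30 days. One actually worked.", score: 9, reason: "Specific number (12, 30), implied scarcity (one actually worked), credible framing." },
  { hook: "Stop wasting money on supplements that don't absorb.", score: 7, reason: "Named-pain (wasting money), category-clear, but not yet specific to a product." },
];

// Joins the fixed instructions, the anchors (highest score first), and the worked examples.
function buildSystemPrompt(anchors: Anchor[], examples: WorkedExample[]): string {
  const rubric = anchors
    .slice() // copy so the caller's array is not reordered
    .sort((a, b) => b.score - a.score)
    .map((a) => `${a.score}: ${a.description}` + (a.example ? ` Example: "${a.example}"` : ""))
    .join("\n");

  const worked = examples
    .map((e) => `Hook: "${e.hook}" → score ${e.score}\nReason: ${e.reason}`)
    .join("\n");

  return [
    "You score hooks 1-10 on a calibrated scale. The scale is NOT a politeness scale.",
    "You MUST give scores below 6 when warranted. Most generic AI-written hooks should score 4-6, not 7-8.",
    rubric,
    "Default expectation for AI-generated hooks: 5-7. Push above 7 only when the hook has a specific number, named pain, contrarian frame, or pattern-break opener.",
    "EXAMPLES:",
    worked,
  ].join("\n");
}

const SYSTEM_PROMPT = buildSystemPrompt(ANCHORS, EXAMPLES);
```

Keeping the anchors as data makes it easier to swap in a different set for another domain without touching the fixed instructions.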
Layer 3: Structured output with a strict JSON schema
Gemini 2.5 Flash Lite (the model I use) supports responseMimeType: "application/json" plus a responseSchema parameter. The OpenAI equivalent is response_format: { type: "json_schema", json_schema: { name, schema, strict: true } }.
const schema = {
  type: "object",
  properties: {
    scripts: {
      type: "array",
      minItems: 5,
      maxItems: 5,
      items: {
        type: "object",
        properties: {
          hook: { type: "string" },
          hook_score: { type: "integer", minimum: 1, maximum: 10 },
          hook_reason: { type: "string" },
          beats: { type: "array", items: { type: "string" } },
          cta: { type: "string" },
          b_roll: { type: "array", items: { type: "string" } },
          on_screen: { type: "array", items: { type: "string" } },
          caption: { type: "string" },
          hashtags: { type: "array", items: { type: "string" } },
        },
        required: [
          "hook", "hook_score", "hook_reason", "beats",
          "cta", "b_roll", "on_screen", "caption", "hashtags",
        ],
      },
    },
  },
  required: ["scripts"],
};
hook_reason is the magic field. Forcing the model to justify the score in 1-2 sentences before / alongside the number dramatically reduces "everything is 8" drift. The model now has to commit to a reason, and a reason like "generic curiosity hook" is hard to pair with a 9.
This is structurally similar to chain-of-thought, but enforced via schema, not via "let's think step by step." The reason has to exist as a field, so the model produces it.
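One detail worth checking if you copy this: the chain-of-thought effect is strongest when the reason is generated before the number, and in the schema above hook_score is listed ahead of hook_reason. If the model emits fields in schema order, the score would be committed first. The fragment below is a sketch of an item schema that lists the reason first. The `propertyOrdering` field is a Gemini-specific schema extension as I understand the Gemini API schema reference; whether your SDK version accepts and honors it is an assumption to verify, and `hookItemSchema` is a hypothetical name.

```typescript
// Sketch only: a trimmed item schema with the justification placed before the score.
const hookItemSchema = {
  type: "object",
  properties: {
    hook: { type: "string" },
    hook_reason: { type: "string" },
    hook_score: { type: "integer", minimum: 1, maximum: 10 },
  },
  required: ["hook", "hook_reason", "hook_score"],
  // Assumption: Gemini-specific, non-standard field that requests this output order.
  // Confirm against the current API docs before relying on it.
  propertyOrdering: ["hook", "hook_reason", "hook_score"],
};
```

Logging a few raw responses is a quick way to confirm which field actually comes first.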
The Next.js / Vercel side
Stack:
- Next.js 14 App Router
- TypeScript
- Tailwind
- @google/generative-ai SDK
- Vercel Edge runtime for the generation endpoint
// app/api/generate/route.ts
import { GoogleGenerativeAI } from "@google/generative-ai";

export const runtime = "edge";

const genAI = new GoogleGenerativeAI(process.env.GOOGLE_AI_API_KEY!);

// `schema`, `SYSTEM_PROMPT`, and `buildUserPrompt` are defined elsewhere (omitted here).
export async function POST(req: Request) {
  const { product, platform, tone, hook_style, length } = await req.json();

  const model = genAI.getGenerativeModel({
    model: "gemini-2.5-flash-lite",
    generationConfig: {
      responseMimeType: "application/json",
      responseSchema: schema,
      temperature: 0.9,
    },
    systemInstruction: SYSTEM_PROMPT,
  });

  const userPrompt = buildUserPrompt(product, platform, tone, hook_style, length);
  const result = await model.generateContent(userPrompt);

  // Simplified: JSON.parse throws if the model returns malformed JSON.
  const parsed = JSON.parse(result.response.text());
  return Response.json(parsed);
}
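The bare JSON.parse above trusts the model completely. A minimal sketch of a safer parse step follows, assuming hypothetical names (`Script`, `ParseResult`, `parseScripts`); it mirrors the schema fields from this post but is not the validation code from the real repo.

```typescript
// Hypothetical shape mirroring the response schema in this post.
interface Script {
  hook: string;
  hook_score: number;
  hook_reason: string;
  beats: string[];
  cta: string;
  b_roll: string[];
  on_screen: string[];
  caption: string;
  hashtags: string[];
}

type ParseResult =
  | { ok: true; scripts: Script[] }
  | { ok: false; error: string };

const isString = (v: unknown): v is string => typeof v === "string";
const isStringArray = (v: unknown): v is string[] =>
  Array.isArray(v) && v.every(isString);

// Checks one script object field by field, including the 1-10 integer range.
function isScript(v: unknown): v is Script {
  if (typeof v !== "object" || v === null) return false;
  const s = v as Record<string, unknown>;
  const score = s.hook_score;
  return (
    isString(s.hook) &&
    typeof score === "number" &&
    Math.floor(score) === score &&
    score >= 1 &&
    score <= 10 &&
    isString(s.hook_reason) &&
    isStringArray(s.beats) &&
    isString(s.cta) &&
    isStringArray(s.b_roll) &&
    isStringArray(s.on_screen) &&
    isString(s.caption) &&
    isStringArray(s.hashtags)
  );
}

// Returns a tagged result instead of throwing, so the caller can decide to retry.
function parseScripts(raw: string): ParseResult {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return { ok: false, error: "malformed JSON" };
  }
  const scripts = (data as { scripts?: unknown } | null)?.scripts;
  if (!Array.isArray(scripts) || scripts.length !== 5) {
    return { ok: false, error: "expected exactly 5 scripts" };
  }
  if (!scripts.every(isScript)) {
    return { ok: false, error: "script failed shape check" };
  }
  return { ok: true, scripts };
}
```

In the route handler, a failed result could trigger one retry of generateContent before returning an error response, which keeps the occasional malformed reply away from the user.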
A few things I tried and rejected:
- Separate scoring call. Generate 5 hooks → second call to score them. Worked, but doubled latency and cost, and the second call still scored politely without the same rubric. So inline scoring won.
- Temperature = 0.3 for scoring. Caused mode collapse: every hook got the same score. 0.9 + a strict rubric gave better spread.
- GPT-4o-mini. Worked well but was ~5× the cost per generation. Switched to Gemini 2.5 Flash Lite when its structured-output mode shipped.
Results
Across ~120 generations I logged manually:
| Score | Naive prompt | Calibrated prompt |
|---|---|---|
| 9-10 | 12% | 18% |
| 7-8 | 78% | 41% |
| 5-6 | 9% | 29% |
| 3-4 | 1% | 11% |
| 1-2 | 0% | 1% |
Mean shifted from 7.9 → 6.7. Standard deviation roughly doubled. And the part that actually matters: the score now lines up with which one I'd film. When I asked 3 friends to rank the 5 hooks blind, the model's top pick matched theirs 64% of the time, vs 33% (chance) for the naive prompt.
Not great. Not bad. Workable.
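For anyone who wants to run the same comparison on their own logs, here is a minimal sketch of the three numbers used above: mean, standard deviation, and top-pick agreement. The `Trial` shape and all function names are hypothetical, and the values in the usage note are made-up inputs, not my logged data.

```typescript
// Arithmetic mean; returns NaN for an empty array.
function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Population standard deviation (divides by n, not n - 1).
function stdDev(xs: number[]): number {
  const m = mean(xs);
  return Math.sqrt(mean(xs.map((x) => (x - m) * (x - m))));
}

// Index of the largest value; ties resolve to the first occurrence.
function argmax(xs: number[]): number {
  let best = 0;
  for (let i = 1; i < xs.length; i++) {
    if (xs[i] > xs[best]) best = i;
  }
  return best;
}

// Hypothetical log entry: model scores for one batch of hooks plus the human top pick.
interface Trial {
  modelScores: number[];
  humanTopIndex: number;
}

// Fraction of trials where the model's highest-scored hook matches the human pick.
function topPickAgreement(trials: Trial[]): number {
  const hits = trials.filter((t) => argmax(t.modelScores) === t.humanTopIndex).length;
  return hits / trials.length;
}
```

For example, `stdDev([2, 4, 4, 4, 5, 5, 7, 9])` returns 2. Note that when most scores are tied, as with the naive prompt, `argmax` simply returns the first tied hook, so agreement there depends heavily on how ties are broken.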
Where this breaks
- Domain drift. Calibrated for UGC ad hooks. If I used the same rubric for, say, B2B sales emails, the anchors don't fit and it'd revert to politeness scores.
- Same-model bias. The judge is the same model that wrote the hook. It has a blind spot for its own failure modes. A real fix would be a different model as judge (or a small fine-tuned scorer).
- Score inflation under repetition. If you regenerate the same product 10 times in one session, scores creep up. Probably context-window contamination. I haven't fully chased this; the current workaround is to not reuse the session.
Things I'd love feedback on
- Has anyone tried this with a separate-model judge (e.g. Gemini hooks scored by Claude Haiku)? Curious if cross-model judging beats same-model rubric calibration.
- Better way to enforce diversity across the 5 hooks than just temperature? I tried contrastive prompting (telling the model "make hook #5 maximally different from hook #1"), with mixed results.
- Anyone using DSPy for prompt optimization on this kind of rubric task? My rubric is hand-tuned and I suspect DSPy would beat me.
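On the diversity question, one cheap thing to try before changing the prompt is measuring how similar the 5 hooks actually are. The sketch below flags near-duplicate pairs using Jaccard similarity over word sets. It is an illustration I have not run in production: `tokenize`, `jaccard`, `nearDuplicatePairs`, and the 0.5 threshold are all hypothetical choices.

```typescript
// Lowercased unique word tokens of a hook.
function tokenize(hook: string): string[] {
  const words: string[] = hook.toLowerCase().match(/[a-z0-9']+/g) ?? [];
  return words.filter((w, i) => words.indexOf(w) === i);
}

// Jaccard similarity: shared tokens divided by total distinct tokens.
function jaccard(a: string[], b: string[]): number {
  if (a.length === 0 && b.length === 0) return 1;
  const shared = a.filter((t) => b.indexOf(t) !== -1).length;
  return shared / (a.length + b.length - shared);
}

// Index pairs (i, j) with i < j whose similarity meets the threshold.
function nearDuplicatePairs(hooks: string[], threshold = 0.5): Array<[number, number]> {
  const tokens = hooks.map(tokenize);
  const pairs: Array<[number, number]> = [];
  for (let i = 0; i < tokens.length; i++) {
    for (let j = i + 1; j < tokens.length; j++) {
      if (jaccard(tokens[i], tokens[j]) >= threshold) pairs.push([i, j]);
    }
  }
  return pairs;
}
```

A batch with any flagged pair could be regenerated or have the duplicate hook replaced. Word overlap misses paraphrases, so this only catches the most obvious repeats.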
If you want to poke at the actual product, it's free for 3 generations, no signup, at scripthook.vercel.app. $19 lifetime for 50, $39/mo unlimited if you build a lot of UGC.
Code samples in this post are simplified for clarity. The real repo has retries, a schema-validation fallback (Gemini occasionally returns malformed JSON, ~1 in 200 requests), and a backend-pluggable adapter so the same code runs against Claude / OpenAI / Gemini.
Happy to answer questions in comments.