I built a tiny UGC script generator because hooks are the hardest part

Tram Victor — Mon, 25 May 2026 07:49:19 +0000

I kept seeing the same problem with tiny products and creator videos: the product is fine, but the first line is too soft.

People do not stop scrolling for:

"Here is our new app..."

They stop when the first line names a pain they already feel.

So I built a small tool called ScriptHook. You paste a product, audience, platform, tone, and length. It gives back 5 short UGC-style scripts with different hooks, each scored with a short reason why the hook should work.

The point is not to make a perfect ad in one click. The useful part is getting 5 angles fast, then choosing one to test.

Example hooks I keep seeing work better:

"If your launch post got likes but no buyers, the hook was probably too vague."
"This is for founders who can build the product but freeze when they have to sell it."
"Your landing page is clear. The first sentence is just not painful enough yet."

Stack is simple: Next.js frontend, server route for scoring/generation, Gumroad for the $19 lifetime license. I also made a 600-hook swipe file from the same workflow, but Gumroad is currently blocking publish on that until payment setup is connected, so I am pushing the live tool first.

Try the free demo here:

https://saas-tool-one.vercel.app/?ref=devto

If you want, drop a one-sentence product description in the comments and I will reply with 3 hooks I would test.

Update: I also made a practical Hook Vault PDF for creators who want ready-to-shoot hooks instead of generating from scratch: https://nvat2510.gumroad.com/l/ryqqsk

It has 250 TikTok/UGC hooks across 10 niches, with b-roll ideas and why each hook works. The free demo tool is still here: https://saas-tool-one.vercel.app/?ref=devto

Calibrated LLM-as-judge: how I made my LLM give honest 4/10 scores instead of always-an-8

Tram Victor — Sun, 24 May 2026 19:30:02 +0000

TL;DR

Built a UGC ad-script generator (5 scripts per request). Each script's hook is self-scored 1-10 by the same LLM. Naive prompt = every hook scores 8-9, useless. Fixed by writing a calibration rubric in the system prompt, anchoring with 3 worked examples, and forcing structured output with a strict JSON schema. Now scores spread 4-9 and correlate with which one I'd actually film. Code + prompts inside.

If you've been burned by "LLM-as-judge always says 9/10," this is one way to fix it without RLHF, fine-tuning, or a second model.

The problem

I built ScriptHook, which generates 5 UGC ad scripts at a time for TikTok / Reels / Shorts. Each script comes back with a hook (the first 1.5s of the video), value beats, CTA, B-roll, on-screen text, caption, hashtags.

5 hooks per generation. Buyer needs to pick which one to film first. So I asked the model to self-score each hook 1-10.

First version, every hook came back at 8 or 9. Sometimes a 7 if I begged. Never below.

That's not a score, that's a participation trophy.

Why naive LLM-as-judge fails

Two reasons:

Models are trained to be helpful. "Helpful" overlaps a lot with "encouraging." Telling a user their hook is a 4/10 feels mean, so the model rounds up.
No anchor for what 4 means. If the prompt says "score 1-10," the model has no shared scale with you. It defaults to the same "polite middle" distribution you see in Yelp reviews and Uber driver ratings.

The fix isn't a smarter model. It's giving the model a calibrated rubric with examples so "4/10" has a concrete meaning the model can match against.

The fix: 3 layers

Layer 1: Anchored rubric in the system prompt

You score hooks 1-10 on a calibrated scale. The scale is NOT a politeness scale.
You MUST give scores below 6 when warranted. Most generic AI-written hooks
should score 4-6, not 7-8.

10: Stops scroll instantly. Specific number, contrarian, or pattern-break.
    Example: "Why I quit Athletic Greens after 90 days."
 9: Strong specific hook with curiosity gap. "3 reasons your skincare isn't working."
 8: Solid named-pain or named-promise. "If your hair frizzes by noon, watch this."
 7: Workable but unspecific. "Tired of feeling tired?"
 6: Generic curiosity hook. "Have you ever wondered why..."
 5: Cliche or AI-sounding. "In today's fast-paced world..."
 4: Lazy promise with no specificity. "You won't believe this."
 3: Word soup. "Get ready for the ultimate experience."
 2: Off-brand or off-product.
 1: Incoherent.

Default expectation for AI-generated hooks: 5-7. Push above 7 only when the hook
has a specific number, named pain, contrarian frame, or pattern-break opener.

The key sentence: "Most generic AI-written hooks should score 4-6, not 7-8." This is the prior the model needs to overcome its politeness bias.

Layer 2: Worked examples in the rubric

Showing the model 3 hooks with their scores teaches more than 100 lines of rubric.

EXAMPLES:

Hook: "Want better skin? Try this." → score 4
Reason: Generic, no specificity, no number, no named-pain.

Hook: "I tested 12 retinol serums for 30 days. One actually worked." → score 9
Reason: Specific number (12, 30), implied scarcity (one actually worked), credible framing.

Hook: "Stop wasting money on supplements that don't absorb." → score 7
Reason: Named-pain (wasting money), category-clear, but not yet specific to a product.

These three examples shift the model from a politeness distribution to a calibrated one.
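One way to keep the rubric and the worked examples maintainable is to store the examples as data and render them into the system prompt. This is a minimal sketch, not the tool's actual code: the post never shows how SYSTEM_PROMPT is assembled, and the names WorkedExample, renderExamples, and buildSystemPrompt are hypothetical.

```typescript
// Hypothetical shape; the post only shows the rendered prompt text.
interface WorkedExample {
  hook: string;
  score: number;
  reason: string;
}

// The three worked examples from the post, kept as data.
const WORKED_EXAMPLES: WorkedExample[] = [
  {
    hook: "Want better skin? Try this.",
    score: 4,
    reason: "Generic, no specificity, no number, no named-pain.",
  },
  {
    hook: "I tested 12 retinol serums for 30 days. One actually worked.",
    score: 9,
    reason: "Specific number (12, 30), implied scarcity (one actually worked), credible framing.",
  },
  {
    hook: "Stop wasting money on supplements that don't absorb.",
    score: 7,
    reason: "Named-pain (wasting money), category-clear, but not yet specific to a product.",
  },
];

// Placeholder: paste the full Layer 1 rubric text here.
const RUBRIC_TEXT = `
You score hooks 1-10 on a calibrated scale. The scale is NOT a politeness scale.
`;

// Render examples in the same Hook / Reason layout used in the prompt.
function renderExamples(examples: WorkedExample[]): string {
  const blocks = examples.map(
    (ex) => `Hook: "${ex.hook}" → score ${ex.score}\nReason: ${ex.reason}`
  );
  return ["EXAMPLES:", ...blocks].join("\n\n");
}

// Join the rubric text and the examples into one system prompt.
function buildSystemPrompt(rubric: string, examples: WorkedExample[]): string {
  return `${rubric.trim()}\n\n${renderExamples(examples)}`;
}

const SYSTEM_PROMPT = buildSystemPrompt(RUBRIC_TEXT, WORKED_EXAMPLES);
```

Keeping the examples as data makes it cheap to swap an anchor and re-check the score spread without hand-editing a long string.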

Layer 3: Structured output with a strict JSON schema

Gemini 2.5 Flash Lite (the model I use) supports responseMimeType: "application/json" + a responseSchema parameter. The rough OpenAI equivalent is response_format: { type: "json_schema", json_schema: { name, strict: true, schema } }.

const schema = {
  type: "object",
  properties: {
    scripts: {
      type: "array",
      minItems: 5,
      maxItems: 5,
      items: {
        type: "object",
        properties: {
          hook:        { type: "string" },
          hook_score:  { type: "integer", minimum: 1, maximum: 10 },
          hook_reason: { type: "string" },
          beats:       { type: "array", items: { type: "string" } },
          cta:         { type: "string" },
          b_roll:      { type: "array", items: { type: "string" } },
          on_screen:   { type: "array", items: { type: "string" } },
          caption:     { type: "string" },
          hashtags:    { type: "array", items: { type: "string" } },
        },
        required: [
          "hook", "hook_score", "hook_reason", "beats",
          "cta", "b_roll", "on_screen", "caption", "hashtags",
        ],
      },
    },
  },
  required: ["scripts"],
};

hook_reason is the magic field. Forcing the model to justify the score in 1-2 sentences before / alongside the number dramatically reduces "everything is 8" drift. The model now has to commit to a reason, and a reason like "generic curiosity hook" is hard to pair with a 9.

This is structurally similar to chain-of-thought, but enforced via schema, not via "let's think step by step." The reason has to exist as a field, so the model produces it.
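Once a response matches the schema, choosing which hook to film first comes down to a sort on hook_score. A minimal sketch, assuming hypothetical names Script and rankByHookScore; the tie-breaking rule (keep the model's original order) is an assumption here, not something the post specifies.

```typescript
// Mirrors the fields of the JSON schema above.
interface Script {
  hook: string;
  hook_score: number; // integer 1-10 per the schema
  hook_reason: string;
  beats: string[];
  cta: string;
  b_roll: string[];
  on_screen: string[];
  caption: string;
  hashtags: string[];
}

// Return a new array sorted by hook_score, highest first.
// Array.prototype.sort is stable in modern runtimes, so tied scores
// keep the order the model returned them in. The input is not mutated.
function rankByHookScore(scripts: Script[]): Script[] {
  return [...scripts].sort((a, b) => b.hook_score - a.hook_score);
}
```

Showing hook_reason next to each ranked hook lets the buyer sanity-check the number instead of trusting it blindly.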

The Next.js / Vercel side

Stack:

Next.js 14 App Router
TypeScript
Tailwind
@google/generative-ai SDK
Vercel Edge runtime for the generation endpoint

// app/api/generate/route.ts
import { GoogleGenerativeAI } from "@google/generative-ai";

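export const runtime = "edge";

// `schema` is the Layer 3 JSON schema. SYSTEM_PROMPT (rubric + worked examples)
// and buildUserPrompt are defined elsewhere and omitted here for brevity.
const genAI = new GoogleGenerativeAI(process.env.GOOGLE_AI_API_KEY!);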

export async function POST(req: Request) {
  const { product, platform, tone, hook_style, length } = await req.json();

  const model = genAI.getGenerativeModel({
    model: "gemini-2.5-flash-lite",
    generationConfig: {
      responseMimeType: "application/json",
      responseSchema: schema,
      temperature: 0.9,
    },
    systemInstruction: SYSTEM_PROMPT,
  });

  const userPrompt = buildUserPrompt(product, platform, tone, hook_style, length);
  const result = await model.generateContent(userPrompt);
  const parsed = JSON.parse(result.response.text());

  return Response.json(parsed);
}
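The route above calls buildUserPrompt and trusts the request body as-is. Here is a hedged sketch of what those two pieces could look like. The helper parseGenerateInput, the prompt wording, and treating length as a string such as "15s" are all assumptions; the post does not show the real implementation.

```typescript
// Field names follow the route's destructuring; treating all of them as
// non-empty strings is an assumption for this sketch.
const FIELDS = ["product", "platform", "tone", "hook_style", "length"] as const;
type Field = (typeof FIELDS)[number];
type GenerateInput = Record<Field, string>;

// Validate the untyped JSON body before it reaches the prompt.
function parseGenerateInput(body: unknown): GenerateInput {
  if (typeof body !== "object" || body === null) {
    throw new Error("Request body must be a JSON object");
  }
  const record = body as Record<string, unknown>;
  const input = {} as GenerateInput;
  for (const field of FIELDS) {
    const value = record[field];
    if (typeof value !== "string" || value.trim() === "") {
      throw new Error(`Missing or empty field: ${field}`);
    }
    input[field] = value.trim();
  }
  return input;
}

// Hypothetical prompt layout: one labeled line per input, then the task.
function buildUserPrompt(
  product: string,
  platform: string,
  tone: string,
  hook_style: string,
  length: string
): string {
  return [
    `Product: ${product}`,
    `Platform: ${platform}`,
    `Tone: ${tone}`,
    `Hook style: ${hook_style}`,
    `Length: ${length}`,
    "Write 5 UGC ad scripts with different hooks and score each hook using the rubric.",
  ].join("\n");
}
```

In the route, a failed parseGenerateInput would map naturally to a 400 response instead of sending a half-empty prompt to the model.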

A few things I tried and rejected:

Separate scoring call. Generate 5 hooks → second call to score them. Worked, but doubled latency and cost, and the second call still scored politely without the same rubric. So inline scoring won.
Temperature = 0.3 for scoring. Caused mode collapse: every hook got the same score. 0.9 + a strict rubric gave better spread.
GPT-4o-mini. Worked well but was ~5× the cost per generation. Switched to Gemini 2.5 Flash Lite when its structured-output mode shipped.

Results

Across ~120 generations I logged manually:

Score	Naive prompt	Calibrated prompt
9-10	12%	18%
7-8	78%	41%
5-6	9%	29%
3-4	1%	11%
1-2	0%	1%

Mean shifted from 7.9 to 6.7. Standard deviation roughly doubled. And the part that actually matters: the score now lines up with which one I'd film. When I asked 3 friends to rank the 5 hooks blind, the model's top pick matched theirs 64% of the time, vs 33% (chance) for the naive prompt.

Not great. Not bad. Workable.
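For anyone who wants to reproduce this kind of table from their own logs, the bucket shares, mean, standard deviation, and top-pick agreement can be computed with a few small helpers. This is a sketch under assumptions: the function names are hypothetical, the sample scores are made up for illustration (they are not the logged data), and the standard deviation is the population form, since the post does not say which one was used.

```typescript
// Score buckets in the same order as the results table.
const BUCKETS: Array<[number, number]> = [[9, 10], [7, 8], [5, 6], [3, 4], [1, 2]];

// Percentage of scores falling in each bucket, keyed like "7-8".
function bucketShares(scores: number[]): Record<string, number> {
  const shares: Record<string, number> = {};
  for (const [lo, hi] of BUCKETS) {
    const count = scores.filter((s) => s >= lo && s <= hi).length;
    shares[`${lo}-${hi}`] = scores.length === 0 ? 0 : (100 * count) / scores.length;
  }
  return shares;
}

function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Population standard deviation (divides by n). This choice is an assumption.
function stdDev(xs: number[]): number {
  const m = mean(xs);
  return Math.sqrt(mean(xs.map((x) => (x - m) ** 2)));
}

// Fraction of generations where the model's top hook index equals the human pick.
function topPickAgreement(modelPicks: number[], humanPicks: number[]): number {
  if (modelPicks.length !== humanPicks.length || modelPicks.length === 0) {
    throw new Error("Pick arrays must be non-empty and the same length");
  }
  const matches = modelPicks.filter((pick, i) => pick === humanPicks[i]).length;
  return matches / modelPicks.length;
}

// Hypothetical logged scores, for illustration only.
const sampleScores = [4, 5, 6, 7, 7, 8, 9, 3];
console.log(bucketShares(sampleScores), mean(sampleScores), stdDev(sampleScores));
```

Running the same helpers on naive-prompt and calibrated-prompt logs side by side gives the two columns of the table.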

Where this breaks

Domain drift. Calibrated for UGC ad hooks. If I used the same rubric for, say, B2B sales emails, the anchors wouldn't fit and it'd revert to politeness scores.
Same-model bias. The judge is the same model that wrote the hook. It has a blind spot for its own failure modes. A real fix would be a different model as judge (or a small fine-tuned scorer).
Score inflation under repetition. If you regenerate the same product 10 times in one session, scores creep up. Probably context-window contamination. I haven't fully chased this; the current workaround is to not reuse the session.

Things I'd love feedback on

Has anyone tried this with a separate-model judge (e.g. Gemini hooks scored by Claude Haiku)? Curious if cross-model judging beats same-model rubric calibration.
Better way to enforce diversity across the 5 hooks than just temperature? I tried contrastive prompting (telling the model "make hook #5 maximally different from hook #1"), with mixed results.
Anyone using DSPy for prompt optimization on this kind of rubric task? My rubric is hand-tuned and I suspect DSPy would beat me.

If you want to poke at the actual product, it's free for 3 generations, no signup, at scripthook.vercel.app. $19 lifetime for 50, $39/mo unlimited if you build a lot of UGC.

Code samples in this post are simplified for clarity. The real repo has retries, a schema validation fallback (Gemini occasionally returns malformed JSON, roughly 1 in 200 requests), and a backend-pluggable adapter so the same code runs against Claude / OpenAI / Gemini.
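To give a rough idea of what a validation fallback with retries might look like, here is a minimal sketch. It is not the repo's code: parseScripts, generateWithRetry, the default of 3 attempts, and checking only the three hook fields are all assumptions made for illustration.

```typescript
// Only the hook fields are checked here; a fuller validator would cover
// every field in the schema.
interface ScoredHook {
  hook: string;
  hook_score: number;
  hook_reason: string;
}

// Return the scripts if the text is valid JSON with exactly 5 well-formed
// items, otherwise null.
function parseScripts(text: string): ScoredHook[] | null {
  let data: unknown;
  try {
    data = JSON.parse(text);
  } catch {
    return null; // malformed JSON
  }
  if (typeof data !== "object" || data === null) return null;
  const scripts = (data as { scripts?: unknown }).scripts;
  if (!Array.isArray(scripts) || scripts.length !== 5) return null;
  for (const item of scripts) {
    if (typeof item !== "object" || item === null) return null;
    const { hook, hook_score, hook_reason } = item as Record<string, unknown>;
    if (typeof hook !== "string" || typeof hook_reason !== "string") return null;
    if (
      typeof hook_score !== "number" ||
      !Number.isInteger(hook_score) ||
      hook_score < 1 ||
      hook_score > 10
    ) {
      return null;
    }
  }
  return scripts as ScoredHook[];
}

// Call the model until a response validates or attempts run out.
// callModel is any function that returns the raw response text.
async function generateWithRetry(
  callModel: () => Promise<string>,
  maxAttempts = 3
): Promise<ScoredHook[]> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const parsed = parseScripts(await callModel());
    if (parsed !== null) return parsed;
  }
  throw new Error(`No valid response after ${maxAttempts} attempts`);
}
```

In the route, callModel could wrap model.generateContent(userPrompt) and return result.response.text(), so the bare JSON.parse call is replaced by a validated result.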

Happy to answer questions in comments.

DEV Community: Tram Victor