<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tram Victor</title>
    <description>The latest articles on DEV Community by Tram Victor (@tram_victor_c7ad31e2add9e).</description>
    <link>https://dev.to/tram_victor_c7ad31e2add9e</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3949482%2F3fbc8684-96d2-4e40-9799-41b785d69b40.png</url>
      <title>DEV Community: Tram Victor</title>
      <link>https://dev.to/tram_victor_c7ad31e2add9e</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tram_victor_c7ad31e2add9e"/>
    <language>en</language>
    <item>
      <title>Calibrated LLM-as-judge: how I made my LLM give honest 4/10 scores instead of always-an-8</title>
      <dc:creator>Tram Victor</dc:creator>
      <pubDate>Sun, 24 May 2026 19:30:02 +0000</pubDate>
      <link>https://dev.to/tram_victor_c7ad31e2add9e/calibrated-llm-as-judge-how-i-made-my-llm-give-honest-410-scores-instead-of-always-an-8-2hjf</link>
      <guid>https://dev.to/tram_victor_c7ad31e2add9e/calibrated-llm-as-judge-how-i-made-my-llm-give-honest-410-scores-instead-of-always-an-8-2hjf</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Built a UGC ad-script generator (5 scripts per request). Each script's hook is self-scored 1-10 by the same LLM. Naive prompt = every hook scores 8-9, useless. Fixed by writing a &lt;strong&gt;calibration rubric in the system prompt&lt;/strong&gt;, anchoring with &lt;strong&gt;3 worked examples&lt;/strong&gt;, and forcing &lt;strong&gt;structured output with a strict JSON schema&lt;/strong&gt;. Now scores spread 4-9 and correlate with which one I'd actually film. Code + prompts inside.&lt;/p&gt;

&lt;p&gt;If you've been burned by "LLM-as-judge always says 9/10," this is one way to fix it without RLHF, fine-tuning, or a second model.&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;I built &lt;a href="https://saas-tool-one.vercel.app" rel="noopener noreferrer"&gt;ScriptHook&lt;/a&gt; â€” generates 5 UGC ad scripts at a time for TikTok / Reels / Shorts. Each script comes back with a hook (the first 1.5s of the video), value beats, CTA, B-roll, on-screen text, caption, hashtags.&lt;/p&gt;

&lt;p&gt;5 hooks per generation. Buyer needs to pick which one to film first. So I asked the model to self-score each hook 1-10.&lt;/p&gt;

&lt;p&gt;First version, every hook came back at 8 or 9. Sometimes a 7 if I begged. Never below.&lt;/p&gt;

&lt;p&gt;That's not a score, that's a participation trophy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why naive LLM-as-judge fails
&lt;/h2&gt;

&lt;p&gt;Two reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Models are trained to be helpful.&lt;/strong&gt; "Helpful" overlaps a lot with "encouraging." Telling a user their hook is a 4/10 feels mean, so the model rounds up.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No anchor for what 4 means.&lt;/strong&gt; If the prompt says "score 1-10," the model has no shared scale with you. It defaults to the same "polite middle" distribution you see in Yelp reviews and Uber driver ratings.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The fix isn't a smarter model. It's giving the model a &lt;strong&gt;calibrated rubric with examples&lt;/strong&gt; so "4/10" has a concrete meaning the model can match against.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix â€” 3 layers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Layer 1: Anchored rubric in the system prompt
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You score hooks 1-10 on a calibrated scale. The scale is NOT a politeness scale.
You MUST give scores below 6 when warranted. Most generic AI-written hooks
should score 4-6, not 7-8.

10: Stops scroll instantly. Specific number, contrarian, or pattern-break.
    Example: "Why I quit Athletic Greens after 90 days."
 9: Strong specific hook with curiosity gap. "3 reasons your skincare isn't working."
 8: Solid named-pain or named-promise. "If your hair frizzes by noon, watch this."
 7: Workable but unspecific. "Tired of feeling tired?"
 6: Generic curiosity hook. "Have you ever wondered why..."
 5: Cliche or AI-sounding. "In today's fast-paced world..."
 4: Lazy promise with no specificity. "You won't believe this."
 3: Word soup. "Get ready for the ultimate experience."
 2: Off-brand or off-product.
 1: Incoherent.

Default expectation for AI-generated hooks: 5-7. Push above 7 only when the hook
has a specific number, named pain, contrarian frame, or pattern-break opener.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key sentence: &lt;strong&gt;"Most generic AI-written hooks should score 4-6, not 7-8."&lt;/strong&gt; This is the &lt;strong&gt;prior&lt;/strong&gt; the model needs to overcome its politeness bias.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 2: Worked examples in the rubric
&lt;/h3&gt;

&lt;p&gt;Showing the model 3 hooks with their scores teaches more than 100 lines of rubric.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EXAMPLES:

Hook: "Want better skin? Try this." â†’ score 4
Reason: Generic, no specificity, no number, no named-pain.

Hook: "I tested 12 retinol serums for 30 days. One actually worked." â†’ score 9
Reason: Specific number (12, 30), implied scarcity (one actually worked), credible framing.

Hook: "Stop wasting money on supplements that don't absorb." â†’ score 7
Reason: Named-pain (wasting money), category-clear, but not yet specific to a product.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These three examples shift the model from a politeness distribution to a calibrated one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 3: Structured output with a strict JSON schema
&lt;/h3&gt;

&lt;p&gt;Gemini 2.5 Flash Lite (the model I use) supports &lt;code&gt;responseMimeType: "application/json"&lt;/code&gt; + a &lt;code&gt;responseSchema&lt;/code&gt; parameter. Equivalent for OpenAI is &lt;code&gt;response_format: { type: "json_schema", strict: true }&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;scripts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;array&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;minItems&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;maxItems&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;hook&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;        &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
          &lt;span class="na"&gt;hook_score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;integer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;minimum&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;maximum&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
          &lt;span class="na"&gt;hook_reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
          &lt;span class="na"&gt;beats&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;       &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;array&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
          &lt;span class="na"&gt;cta&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;         &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
          &lt;span class="na"&gt;b_roll&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;array&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
          &lt;span class="na"&gt;on_screen&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;array&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
          &lt;span class="na"&gt;caption&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
          &lt;span class="na"&gt;hashtags&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;array&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
          &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;hook&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;hook_score&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;hook_reason&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;beats&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;cta&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;b_roll&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;on_screen&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;caption&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;hashtags&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;scripts&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;hook_reason&lt;/code&gt; is the magic field. Forcing the model to &lt;strong&gt;justify the score in 1-2 sentences before / alongside the number&lt;/strong&gt; dramatically reduces "everything is 8" drift. The model now has to commit to a reason, and a reason like "generic curiosity hook" is hard to pair with a 9.&lt;/p&gt;

&lt;p&gt;This is structurally similar to chain-of-thought, but enforced via schema, not via "let's think step by step." The reason has to exist as a field, so the model produces it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Next.js / Vercel side
&lt;/h2&gt;

&lt;p&gt;Stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Next.js 14 App Router&lt;/li&gt;
&lt;li&gt;TypeScript&lt;/li&gt;
&lt;li&gt;Tailwind&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;@google/generative-ai&lt;/code&gt; SDK&lt;/li&gt;
&lt;li&gt;Vercel Edge runtime for the generation endpoint
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// app/api/generate/route.ts&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;GoogleGenerativeAI&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@google/generative-ai&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;runtime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;edge&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;genAI&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;GoogleGenerativeAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;GOOGLE_AI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;POST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;product&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;platform&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;tone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;hook_style&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;genAI&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getGenerativeModel&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gemini-2.5-flash-lite&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;generationConfig&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;responseMimeType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;responseSchema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;systemInstruction&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;userPrompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;buildUserPrompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;product&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;platform&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;tone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;hook_style&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generateContent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userPrompt&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things I tried and rejected:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Separate scoring call.&lt;/strong&gt; Generate 5 hooks â†’ second call to score them. Worked, but doubled latency and cost, and the second call still scored politely without the same rubric. So inline scoring won.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Temperature = 0.3 for scoring.&lt;/strong&gt; Caused mode collapse: every hook got the same score. 0.9 + a strict rubric gave better spread.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT-4o-mini.&lt;/strong&gt; Worked well but was ~5Ã— the cost per generation. Switched to Gemini 2.5 Flash Lite when its structured-output mode shipped.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;Across ~120 generations I logged manually:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Naive prompt&lt;/th&gt;
&lt;th&gt;Calibrated prompt&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;9-10&lt;/td&gt;
&lt;td&gt;12%&lt;/td&gt;
&lt;td&gt;18%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7-8&lt;/td&gt;
&lt;td&gt;78%&lt;/td&gt;
&lt;td&gt;41%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5-6&lt;/td&gt;
&lt;td&gt;9%&lt;/td&gt;
&lt;td&gt;29%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3-4&lt;/td&gt;
&lt;td&gt;1%&lt;/td&gt;
&lt;td&gt;11%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1-2&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;1%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Mean shifted from 7.9 â†’ 6.7. Standard deviation roughly doubled. And â€” the part that actually matters â€” the score now lines up with which one I'd film. When I asked 3 friends to rank the 5 hooks blind, the model's top pick matched theirs 64% of the time, vs 33% (chance) for the naive prompt.&lt;/p&gt;

&lt;p&gt;Not great. Not bad. Workable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this breaks
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Domain drift.&lt;/strong&gt; Calibrated for UGC ad hooks. If I used the same rubric for, say, B2B sales emails, the anchors don't fit and it'd revert to politeness scores.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Same-model bias.&lt;/strong&gt; The judge is the same model that wrote the hook. It has a blind spot for its own failure modes. A real fix would be a different model as judge (or a small fine-tuned scorer).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Score inflation under repetition.&lt;/strong&gt; If you regenerate the same product 10 times in one session, scores creep up. Probably context-window contamination. I haven't fully chased this â€” current workaround is don't reuse the session.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Things I'd love feedback on
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Has anyone tried this with a separate-model judge (e.g. Gemini hooks scored by Claude Haiku)? Curious if cross-model judging beats same-model rubric calibration.&lt;/li&gt;
&lt;li&gt;Better way to enforce diversity across the 5 hooks than just temperature? I tried contrastive prompting (telling the model "make hook #5 maximally different from hook #1") â€” mixed results.&lt;/li&gt;
&lt;li&gt;Anyone using DSPy for prompt optimization on this kind of rubric task? My rubric is hand-tuned and I suspect DSPy would beat me.&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;If you want to poke at the actual product, it's free for 3 generations no signup at &lt;a href="https://saas-tool-one.vercel.app" rel="noopener noreferrer"&gt;scripthook.vercel.app&lt;/a&gt;. $19 lifetime for 50, $39/mo unlimited if you build a lot of UGC.&lt;/p&gt;

&lt;p&gt;Code samples in this post are simplified for clarity. Real repo has retries, schema validation fallback (gemini occasionally returns malformed JSON ~1/200 reqs), and a backend-pluggable adapter so the same code runs against Claude / OpenAI / Gemini.&lt;/p&gt;

&lt;p&gt;Happy to answer questions in comments.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>typescript</category>
      <category>nextjs</category>
    </item>
  </channel>
</rss>
