<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Charlie Hadley</title>
    <description>The latest articles on DEV Community by Charlie Hadley (@hadleyworks).</description>
    <link>https://dev.to/hadleyworks</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3938282%2Fbf5a194f-b3e6-4cf8-8791-b2fadbf013d9.png</url>
      <title>DEV Community: Charlie Hadley</title>
      <link>https://dev.to/hadleyworks</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/hadleyworks"/>
    <language>en</language>
    <item>
      <title>Why Your LLM Prompt Breaks in Production (And How to Fix It Before Shipping)</title>
      <dc:creator>Charlie Hadley</dc:creator>
      <pubDate>Mon, 18 May 2026 17:12:54 +0000</pubDate>
      <link>https://dev.to/hadleyworks/why-your-llm-prompt-breaks-in-production-and-how-to-fix-it-before-shipping-1flo</link>
      <guid>https://dev.to/hadleyworks/why-your-llm-prompt-breaks-in-production-and-how-to-fix-it-before-shipping-1flo</guid>
      <description>&lt;h1&gt;
  
  
  Why Your LLM Prompt Breaks in Production (And How to Fix It Before Shipping)
&lt;/h1&gt;

&lt;p&gt;You've tested your LLM feature manually. It looks great. You ship it.&lt;/p&gt;

&lt;p&gt;Three days later, a user reports the output is completely wrong. You dig in, and realise: you changed a prompt last week, and that change broke something subtle you never tested.&lt;/p&gt;

&lt;p&gt;This is the most common failure mode for indie developers shipping LLM features. And it's entirely preventable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Root Cause: Probabilistic Systems Need Deterministic Tests
&lt;/h2&gt;

&lt;p&gt;Traditional software has a nice property: given the same input, you get the same output. You write a unit test, it passes, you ship with confidence.&lt;/p&gt;

&lt;p&gt;LLMs break this property. The same input can produce different outputs. Quality can degrade gradually as you tweak prompts. Models get updated. Context windows fill up differently.&lt;/p&gt;

&lt;p&gt;You can't test LLM systems the same way you test regular code.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Works: Rubric-Based Evaluation
&lt;/h2&gt;

&lt;p&gt;Instead of "does this output look right?", define quality as a &lt;strong&gt;concrete rubric&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attribute&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Scale&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Correctness&lt;/td&gt;
&lt;td&gt;Is the answer factually accurate?&lt;/td&gt;
&lt;td&gt;0–10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Conciseness&lt;/td&gt;
&lt;td&gt;Does it avoid unnecessary verbosity?&lt;/td&gt;
&lt;td&gt;0–10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hallucination Risk&lt;/td&gt;
&lt;td&gt;Does it cite things it can't know?&lt;/td&gt;
&lt;td&gt;0–10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tone&lt;/td&gt;
&lt;td&gt;Does it match the expected register?&lt;/td&gt;
&lt;td&gt;0–10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Usefulness&lt;/td&gt;
&lt;td&gt;Would a real user find this helpful?&lt;/td&gt;
&lt;td&gt;0–10&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A judge model (GPT-4o-mini at ~$0.0001/call) scores each output against this rubric automatically. Run 50 test cases, aggregate scores, and if your composite score drops below a threshold — the PR fails.&lt;/p&gt;

&lt;p&gt;This is &lt;strong&gt;eval-as-code&lt;/strong&gt;.&lt;/p&gt;
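
&lt;p&gt;As a rough sketch of the aggregation step, assuming the judge returns one 0–10 score per rubric attribute for each test case, the composite can be a plain mean of per-case means. The names &lt;code&gt;composite_score&lt;/code&gt; and &lt;code&gt;passes&lt;/code&gt;, the equal weighting, and the convention that a higher hallucination-risk score means fewer unsupported claims are illustrative assumptions, not part of any library.&lt;/p&gt;

```python
from statistics import mean

# Rubric attributes from the table above; equal weights are an assumption.
RUBRIC = ["correctness", "conciseness", "hallucination_risk", "tone", "usefulness"]


def composite_score(case_scores):
    """Average each case across the rubric, then average across cases."""
    per_case = [mean(scores[attr] for attr in RUBRIC) for scores in case_scores]
    return mean(per_case)


def passes(case_scores, min_score=7.5):
    """True when the composite meets the threshold used to gate the PR."""
    return composite_score(case_scores) >= min_score


if __name__ == "__main__":
    # Hypothetical judge output for two test cases.
    sample = [
        {"correctness": 9, "conciseness": 8, "hallucination_risk": 9, "tone": 8, "usefulness": 9},
        {"correctness": 5, "conciseness": 7, "hallucination_risk": 5, "tone": 8, "usefulness": 6},
    ]
    print(round(composite_score(sample), 2), passes(sample))
```

&lt;p&gt;A weighted mean, or a hard floor on a single attribute such as correctness, would slot into &lt;code&gt;composite_score&lt;/code&gt; without changing the gate.&lt;/p&gt;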

&lt;h2&gt;
  
  
  The Golden Dataset Problem
&lt;/h2&gt;

&lt;p&gt;The hardest part is building test cases. Here's the key insight most guides miss:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start with failures, not successes.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every time your LLM makes a mistake in production or testing:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Save the input&lt;/li&gt;
&lt;li&gt;Write down what the correct output should have been&lt;/li&gt;
&lt;li&gt;Add it to &lt;code&gt;golden_dataset.json&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After 2–3 weeks, you'll have 30–50 test cases that represent &lt;strong&gt;real failure modes&lt;/strong&gt; — far more valuable than synthetic examples you invented. A golden dataset built from real failures will catch real regressions.&lt;/p&gt;
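
&lt;p&gt;The three steps above could be captured in a small helper like the one below. It is only a sketch: the file layout (a JSON list of objects with &lt;code&gt;input&lt;/code&gt;, &lt;code&gt;expected&lt;/code&gt; and &lt;code&gt;note&lt;/code&gt; keys) and the function name &lt;code&gt;add_failure_case&lt;/code&gt; are assumptions made for illustration.&lt;/p&gt;

```python
import json
import tempfile
from pathlib import Path


def add_failure_case(path, user_input, expected_output, note=""):
    """Append one observed failure to the golden dataset and return the new size."""
    dataset_path = Path(path)
    cases = json.loads(dataset_path.read_text(encoding="utf-8")) if dataset_path.exists() else []
    # Skip exact duplicates so repeated reports of one bug do not skew scores.
    if any(case["input"] == user_input for case in cases):
        return len(cases)
    cases.append({"input": user_input, "expected": expected_output, "note": note})
    dataset_path.write_text(json.dumps(cases, indent=2, ensure_ascii=False), encoding="utf-8")
    return len(cases)


if __name__ == "__main__":
    # Demo in a temporary directory; point it at your repo's file in real use.
    with tempfile.TemporaryDirectory() as tmp:
        size = add_failure_case(
            Path(tmp) / "golden_dataset.json",
            "Cancel my order but keep the subscription",
            "intent=cancel_order",
            note="Hypothetical example: misclassified after a prompt change",
        )
        print(f"golden dataset now has {size} case(s)")
```

&lt;p&gt;Keeping the dataset as a plain JSON file in the repo means every new failure case arrives through a reviewable commit.&lt;/p&gt;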

&lt;h2&gt;
  
  
  Running This in GitHub Actions
&lt;/h2&gt;

&lt;p&gt;Here's the minimal CI integration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LLM Eval&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;eval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run evals&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python run_evals.py&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.OPENAI_API_KEY }}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Check threshold&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python check_threshold.py --min-score &lt;/span&gt;&lt;span class="m"&gt;7.5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the aggregate score drops below 7.5, &lt;code&gt;check_threshold.py&lt;/code&gt; exits with code 1 and the PR is blocked. Simple, deterministic gating on a probabilistic system.&lt;/p&gt;
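
&lt;p&gt;For illustration, a threshold check along these lines might look like the sketch below. It assumes the eval step wrote a file named &lt;code&gt;eval_results.json&lt;/code&gt; holding an &lt;code&gt;attribute_scores&lt;/code&gt; object of per-attribute means; that file name, its layout and the &lt;code&gt;--results&lt;/code&gt; flag are assumptions, and only &lt;code&gt;--min-score&lt;/code&gt; comes from the workflow above.&lt;/p&gt;

```python
import argparse
import json
from statistics import mean


def check(results, min_score):
    """Return (passed, composite) from a dict of per-attribute mean scores."""
    composite = mean(list(results["attribute_scores"].values()))
    return composite >= min_score, composite


def main(argv=None):
    """Parse CLI flags, read the results file, and return a process exit code."""
    parser = argparse.ArgumentParser(description="Fail CI when eval scores regress.")
    parser.add_argument("--min-score", type=float, required=True)
    parser.add_argument("--results", default="eval_results.json")  # assumed file name
    args = parser.parse_args(argv)

    with open(args.results, encoding="utf-8") as fh:
        passed, composite = check(json.load(fh), args.min_score)

    print(f"composite={composite:.2f} threshold={args.min_score:.2f}")
    return 0 if passed else 1  # non-zero makes the workflow step fail


# At the bottom of the real script: import sys; sys.exit(main())
```

&lt;p&gt;Returning the exit code from &lt;code&gt;main&lt;/code&gt; instead of exiting inside it keeps the gate easy to unit test.&lt;/p&gt;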

&lt;p&gt;Total cost to run 50 evals: about £0.20.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-Model Comparison Before You Commit
&lt;/h2&gt;

&lt;p&gt;Before paying for GPT-4o, run your eval suite across providers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-3-5-haiku&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-flash-1.5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_eval_suite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;golden_dataset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;calculate_cost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: score=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, cost=£&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll often find that Claude Haiku or GPT-4o-mini reaches 90% or more of GPT-4o's score at 20% of the cost. Don't pay for intelligence you don't need.&lt;/p&gt;
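
&lt;p&gt;One way to turn that comparison into a decision, sketched under assumptions: keep only the models that clear your quality bar, then take the cheapest. The function name &lt;code&gt;rank_by_value&lt;/code&gt;, the model labels and every number below are placeholders, not benchmark results.&lt;/p&gt;

```python
def rank_by_value(results, min_score=7.5):
    """Keep models that clear the quality bar, cheapest first."""
    eligible = [(name, r) for name, r in results.items() if r["score"] >= min_score]
    return sorted(eligible, key=lambda item: item[1]["cost"])


if __name__ == "__main__":
    # Placeholder scores and costs for illustration only.
    results = {
        "model-a": {"score": 8.9, "cost": 1.00},
        "model-b": {"score": 8.1, "cost": 0.20},
        "model-c": {"score": 6.8, "cost": 0.05},
    }
    for name, r in rank_by_value(results):
        print(f"{name}: score={r['score']:.1f}, cost=£{r['cost']:.3f}")
```

&lt;p&gt;Filtering on the threshold first avoids picking a model that is cheap only because it fails the rubric.&lt;/p&gt;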

&lt;h2&gt;
  
  
  A Real Example
&lt;/h2&gt;

&lt;p&gt;I shipped a classification system prompt update to improve response formatting. It looked solid in manual testing on 5 examples, but I had accidentally dropped a critical piece of context the model needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without evals:&lt;/strong&gt; ships to users. Angry tickets. Rollback. Lost trust.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;With this setup:&lt;/strong&gt; CI caught the regression in 4 minutes. PR failed. Fixed the prompt. Shipped cleanly.&lt;/p&gt;

&lt;p&gt;That one catch alone justified the entire system.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I've Packaged
&lt;/h2&gt;

&lt;p&gt;I've turned this into a complete, ready-to-use system — &lt;strong&gt;&lt;a href="https://hadleyworks.gumroad.com/l/nyzala" rel="noopener noreferrer"&gt;The Indie Hacker's LLM Eval Playbook&lt;/a&gt;&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;6 golden dataset templates (classification, summarization, retrieval, generation, code review, reasoning)&lt;/li&gt;
&lt;li&gt;Complete rubric scoring system in Python (copy-paste ready)&lt;/li&gt;
&lt;li&gt;Multi-model comparison script with cost-efficiency ranking&lt;/li&gt;
&lt;li&gt;GitHub Actions workflow — drop it in and it works&lt;/li&gt;
&lt;li&gt;Cost optimisation guide with real benchmarks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;£29 one-time.&lt;/strong&gt; One prevented production incident pays for it 10× over.&lt;/p&gt;

&lt;p&gt;Questions about implementing this? Drop them in the comments.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>productivity</category>
      <category>startup</category>
    </item>
    <item>
      <title>Why Your LLM Prompt Breaks in Production (And How to Fix It Before Shipping)</title>
      <dc:creator>Charlie Hadley</dc:creator>
      <pubDate>Mon, 18 May 2026 16:35:34 +0000</pubDate>
      <link>https://dev.to/hadleyworks/why-your-llm-prompt-breaks-in-production-and-how-to-fix-it-before-shipping-1ddk</link>
      <guid>https://dev.to/hadleyworks/why-your-llm-prompt-breaks-in-production-and-how-to-fix-it-before-shipping-1ddk</guid>
      <description>&lt;h1&gt;
  
  
  LLM Evaluation in CI: Stop Manual Testing Before It Costs You
&lt;/h1&gt;

&lt;p&gt;You ship a prompt change to production. Two hours later, a customer complains your LLM is now returning hallucinated data. You roll back. You've lost an hour of revenue.&lt;/p&gt;

&lt;p&gt;This happens because you tested the happy path, not the edge cases. LLM systems are probabilistic—the same input doesn't always produce the same output quality.&lt;/p&gt;

&lt;p&gt;The enterprise solution is Braintrust ($249/mo), LangSmith ($99/mo), or Arize. If you're indie, bootstrapped, or pre-PMF, those budgets don't exist.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building Eval-as-Code in GitHub Actions
&lt;/h2&gt;

&lt;p&gt;I've been shipping LLM features for indie products for the past year. I built a rubric-based evaluation system that runs in CI and costs about £0.20 per full eval run.&lt;/p&gt;

&lt;p&gt;Here's the core idea:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Define quality as a rubric&lt;/strong&gt;, not vibes. Instead of "does this look good?", you write: correctness, conciseness, tone, hallucination-risk, usefulness. 5-10 concrete attributes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create golden datasets&lt;/strong&gt;. For each use case (classification, summarization, retrieval, generation, etc.), build 20-50 test cases with expected outputs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use a cheap judge model&lt;/strong&gt;. GPT-4o-mini scores each output against your rubric. Cost: pennies per eval.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automate in CI&lt;/strong&gt;. GitHub Actions runs the evals on every PR. If scores drop below threshold, the PR fails.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Concrete Example From Production
&lt;/h2&gt;

&lt;p&gt;I changed a classification system prompt to improve response formatting. The change looked solid in manual testing. But I accidentally dropped a critical piece of context the model needed for correct classification.&lt;/p&gt;

&lt;p&gt;Without evals: that ships to users. Angry support tickets. Rollback. Lost trust.&lt;/p&gt;

&lt;p&gt;With evals: CI caught it in 4 minutes. PR fails. I fix the prompt. Evals pass. Ship confidently.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Actually in the Playbook
&lt;/h2&gt;

&lt;p&gt;I've packaged this into a complete system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Golden dataset templates&lt;/strong&gt; for 6 common LLM use cases (classification, summarization, retrieval, generation, code, reasoning)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rubric-scoring system&lt;/strong&gt;: the exact Python code to score outputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-model comparison scripts&lt;/strong&gt;: compare GPT-4o vs Claude vs Gemini on identical cases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complete GitHub Actions workflow&lt;/strong&gt;: copy-paste, no tweaking needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost optimization&lt;/strong&gt;: batch evals, cache responses, use cheaper models for coarse filtering&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The full system is documented with real examples from my production infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who This Is For
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Indie hackers&lt;/strong&gt; shipping LLM features with no ML team&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Startups&lt;/strong&gt; evaluating multiple models before scaling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engineers&lt;/strong&gt; maintaining LLM systems over time (catch regressions early)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anyone tired&lt;/strong&gt; of deploying hope instead of metrics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The playbook is £29 one-time. It pays for itself the first time it stops one bad production deployment.&lt;/p&gt;

&lt;p&gt;Get it: &lt;a href="https://hadleyworks.gumroad.com/l/nyzala" rel="noopener noreferrer"&gt;https://hadleyworks.gumroad.com/l/nyzala&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>LLM Evaluation in CI: Stop Manual Testing Before It Costs You</title>
      <dc:creator>Charlie Hadley</dc:creator>
      <pubDate>Mon, 18 May 2026 16:35:21 +0000</pubDate>
      <link>https://dev.to/hadleyworks/llm-evaluation-in-ci-stop-manual-testing-before-it-costs-you-59i7</link>
      <guid>https://dev.to/hadleyworks/llm-evaluation-in-ci-stop-manual-testing-before-it-costs-you-59i7</guid>
      <description>&lt;h1&gt;
  
  
  LLM Evaluation in CI: Stop Manual Testing Before It Costs You
&lt;/h1&gt;

&lt;p&gt;You ship a prompt change to production. Two hours later, a customer complains your LLM is returning hallucinated data. You roll back. You've lost an hour of revenue and some user trust.&lt;/p&gt;

&lt;p&gt;This happens because you tested the happy path, not the edge cases. LLM systems are probabilistic — the same input doesn't always produce the same output quality.&lt;/p&gt;

&lt;p&gt;The enterprise solution is Braintrust ($249/mo), LangSmith ($99/mo), or Arize. If you're indie, bootstrapped, or pre-PMF, those budgets simply don't exist.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Idea: Eval-as-Code
&lt;/h2&gt;

&lt;p&gt;Instead of vibes-based testing, you define quality as a &lt;strong&gt;rubric&lt;/strong&gt; with concrete attributes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Correctness&lt;/strong&gt; (0–10): Is the answer factually right?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conciseness&lt;/strong&gt; (0–10): Does it avoid unnecessary padding?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hallucination risk&lt;/strong&gt; (0–10): Does it cite things it can't know?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tone&lt;/strong&gt; (0–10): Does it match expected register?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Usefulness&lt;/strong&gt; (0–10): Would a real user find this helpful?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A cheap judge model (GPT-4o-mini at ~$0.0001/call) scores each output against your rubric. You run 50 test cases per eval. Total cost: about £0.20 per full run.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building This in GitHub Actions
&lt;/h2&gt;

&lt;p&gt;Here's the minimal structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LLM Eval&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;eval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run evals&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python run_evals.py&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.OPENAI_API_KEY }}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Check threshold&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python check_threshold.py --min-score &lt;/span&gt;&lt;span class="m"&gt;7.5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;run_evals.py&lt;/code&gt; script:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Loads your golden dataset (JSON file of input/expected-output pairs)&lt;/li&gt;
&lt;li&gt;Runs your LLM system on each input&lt;/li&gt;
&lt;li&gt;Sends (input, expected, actual) to GPT-4o-mini with your rubric&lt;/li&gt;
&lt;li&gt;Aggregates scores by attribute&lt;/li&gt;
&lt;li&gt;Writes results to &lt;code&gt;eval_results.json&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the aggregate score drops below your threshold, &lt;code&gt;check_threshold.py&lt;/code&gt; exits with code 1 and the PR fails.&lt;/p&gt;
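
&lt;p&gt;The five steps of &lt;code&gt;run_evals.py&lt;/code&gt; could be sketched roughly as follows. The system under test and the judge are passed in as plain callables so the loop runs without network access; &lt;code&gt;exact_match_judge&lt;/code&gt; is a stand-in for the GPT-4o-mini call, and the &lt;code&gt;attribute_scores&lt;/code&gt; result layout is an assumption.&lt;/p&gt;

```python
import json
from statistics import mean


def run_evals(cases, system, judge):
    """Run `system` on every case, score it with `judge`, average per attribute."""
    per_attribute = {}
    for case in cases:
        actual = system(case["input"])
        # The judge returns a dict mapping rubric attribute to a 0..10 score.
        scores = judge(case["input"], case["expected"], actual)
        for attribute, value in scores.items():
            per_attribute.setdefault(attribute, []).append(value)
    return {"attribute_scores": {a: mean(v) for a, v in per_attribute.items()}}


def exact_match_judge(user_input, expected, actual):
    """Stand-in judge; a real one would send the triple and rubric to a judge model."""
    return {"correctness": 10 if actual.strip() == expected.strip() else 0}


if __name__ == "__main__":
    cases = [
        {"input": "refund please", "expected": "billing"},
        {"input": "app crashes on login", "expected": "bug"},
    ]
    # Toy system under test: a keyword classifier standing in for the LLM call.
    system = lambda text: "billing" if "refund" in text else "bug"
    print(json.dumps(run_evals(cases, system, exact_match_judge), indent=2))
```

&lt;p&gt;In a real run, the returned dict would be written to &lt;code&gt;eval_results.json&lt;/code&gt; with &lt;code&gt;json.dump&lt;/code&gt; so the threshold step can read it.&lt;/p&gt;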

&lt;h2&gt;
  
  
  A Real Example From Production
&lt;/h2&gt;

&lt;p&gt;I changed a classification system prompt to improve response formatting. The change looked solid in manual testing on 5 examples. But I accidentally dropped a critical piece of context the model needed for correct classification.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without evals:&lt;/strong&gt; ships to users. Angry support tickets. Rollback. Lost trust.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With evals:&lt;/strong&gt; CI caught it in 4 minutes. PR fails. I fix the prompt. Evals pass. Ship confidently.&lt;/p&gt;

&lt;h2&gt;
  
  
  Golden Datasets: The Hard Part
&lt;/h2&gt;

&lt;p&gt;The hardest part is building your test cases. The key insight: &lt;strong&gt;start with failures, not successes.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every time your LLM system makes a mistake:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Save the input&lt;/li&gt;
&lt;li&gt;Write down what the correct output should have been&lt;/li&gt;
&lt;li&gt;Add it to your golden dataset&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After 2–3 weeks of normal usage, you'll have 30–50 meaningful test cases that represent real failure modes — far more valuable than synthetic test cases you invented upfront.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-Model Comparison
&lt;/h2&gt;

&lt;p&gt;Before committing to an expensive model, run your eval suite across providers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-3-5-haiku&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-flash-1.5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_eval_suite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;golden_dataset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Sort by (score / cost_per_1k_tokens) to find optimal tradeoff
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This stops you from paying for GPT-4o when Claude Haiku reaches 92% of its score at 20% of the cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost Optimization
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Batch your calls&lt;/strong&gt;: OpenAI batch API gives 50% discount on async evals&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache responses&lt;/strong&gt;: Hash (model + prompt + input) → cache hit avoids re-scoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coarse-to-fine&lt;/strong&gt;: Use a 2-stage system — cheap model filters obvious passes, expensive model only sees borderline cases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limit full runs&lt;/strong&gt;: Run the full suite on PRs to main, not on every commit&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A well-optimized setup runs 100 eval cases for under £0.10.&lt;/p&gt;
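
&lt;p&gt;The response-caching idea from the list above can be sketched as a hash over model, prompt and input. This is a minimal in-memory version; &lt;code&gt;cache_key&lt;/code&gt; and &lt;code&gt;ScoreCache&lt;/code&gt; are hypothetical names, and a real setup would persist the store between CI runs.&lt;/p&gt;

```python
import hashlib
import json


def cache_key(model, prompt, user_input):
    """Stable SHA-256 key over the three fields that determine a response."""
    # JSON-encoding the list keeps field boundaries unambiguous.
    payload = json.dumps([model, prompt, user_input], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ScoreCache:
    """In-memory cache; swap the dict for a file or Redis to persist across runs."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    def get_or_score(self, model, prompt, user_input, score_fn):
        """Return the cached score, calling score_fn only on a cache miss."""
        key = cache_key(model, prompt, user_input)
        if key in self._store:
            self.hits += 1
        else:
            self._store[key] = score_fn(model, prompt, user_input)
        return self._store[key]
```

&lt;p&gt;Because the prompt is part of the key, editing a prompt invalidates only the entries that depend on it, so unchanged cases are not re-scored.&lt;/p&gt;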

&lt;h2&gt;
  
  
  What I've Packaged Up
&lt;/h2&gt;

&lt;p&gt;I've turned this into a complete ready-to-use system in &lt;strong&gt;&lt;a href="https://hadleyworks.gumroad.com/l/nyzala" rel="noopener noreferrer"&gt;The Indie Hacker's LLM Eval Playbook&lt;/a&gt;&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;6 golden dataset templates&lt;/strong&gt; for common LLM tasks (classification, summarization, retrieval, generation, code review, reasoning)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complete rubric scoring system&lt;/strong&gt; in Python (copy-paste ready)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-model comparison script&lt;/strong&gt; with cost-efficiency ranking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Actions workflow&lt;/strong&gt; — drop it in your repo and it works&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost optimization guide&lt;/strong&gt; with benchmarks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;£29 one-time.&lt;/strong&gt; One avoided production incident pays for it 10× over.&lt;/p&gt;

&lt;p&gt;If you have questions about implementing eval-as-code for your specific use case, drop them in the comments — happy to help.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>testing</category>
      <category>devops</category>
    </item>
    <item>
      <title>I Built LLM Evaluation-as-Code in CI: Here's How to Avoid Shipping Regressions</title>
      <dc:creator>Charlie Hadley</dc:creator>
      <pubDate>Mon, 18 May 2026 16:28:10 +0000</pubDate>
      <link>https://dev.to/hadleyworks/i-built-llm-evaluation-as-code-in-ci-heres-how-to-avoid-shipping-regressions-3f7h</link>
      <guid>https://dev.to/hadleyworks/i-built-llm-evaluation-as-code-in-ci-heres-how-to-avoid-shipping-regressions-3f7h</guid>
      <description>&lt;h1&gt;
  
  
  API Rate Limiting Playbook: Protect Your Backend From Abuse
&lt;/h1&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Your API is live in production. Traffic is growing. Then one day, a bot discovers your endpoint and starts hammering it with 100,000 requests per second. Your database melts. Your users see 500 errors. You lose revenue and reputation.&lt;/p&gt;

&lt;p&gt;Or worse: a malicious actor uses your API to brute-force user accounts. You didn't have rate limiting in place. You're liable.&lt;/p&gt;

&lt;p&gt;This is the silent killer of indie SaaS. You ship the product. You don't ship the protection. Then production breaks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Most Indie Teams Skip Rate Limiting
&lt;/h2&gt;

&lt;p&gt;Rate limiting &lt;em&gt;sounds&lt;/em&gt; complicated. "Distributed rate limiting"? "Token bucket algorithm"? "Redis backing stores"?&lt;/p&gt;

&lt;p&gt;In reality, it's simple. And you don't need expensive tools. You don't need AWS API Gateway (roughly $1.00 to $3.50 per million requests, depending on API type). You don't need third-party middleware.&lt;/p&gt;

&lt;p&gt;You need a methodology. Once you have a methodology, the implementation is trivial.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three-Layer Strategy
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Layer 1: IP-Based Rate Limiting (Nginx)
&lt;/h3&gt;

&lt;p&gt;First line of defense: block obvious bots and abusers at the edge.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="k"&gt;limit_req_zone&lt;/span&gt; &lt;span class="nv"&gt;$binary_remote_addr&lt;/span&gt; &lt;span class="s"&gt;zone=general:10m&lt;/span&gt; &lt;span class="s"&gt;rate=10r/s&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;limit_req_zone&lt;/span&gt; &lt;span class="nv"&gt;$binary_remote_addr&lt;/span&gt; &lt;span class="s"&gt;zone=auth:10m&lt;/span&gt; &lt;span class="s"&gt;rate=1r/s&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;server&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;location&lt;/span&gt; &lt;span class="n"&gt;/api/&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kn"&gt;limit_req&lt;/span&gt; &lt;span class="s"&gt;zone=general&lt;/span&gt; &lt;span class="s"&gt;burst=20&lt;/span&gt; &lt;span class="s"&gt;nodelay&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kn"&gt;location&lt;/span&gt; &lt;span class="n"&gt;/api/auth/login&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kn"&gt;limit_req&lt;/span&gt; &lt;span class="s"&gt;zone=auth&lt;/span&gt; &lt;span class="s"&gt;burst=3&lt;/span&gt; &lt;span class="s"&gt;nodelay&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cost: $0 (Nginx is free).&lt;/p&gt;

&lt;p&gt;Setup time: 15 minutes.&lt;/p&gt;

&lt;p&gt;Blocks: 95% of bot traffic and accidental DDoS.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 2: User/Token-Based Rate Limiting (Redis + Python)
&lt;/h3&gt;

&lt;p&gt;Your authenticated users have legitimate spikes. A single IP-based rule punishes them unfairly.&lt;/p&gt;

&lt;p&gt;Instead, rate limit per API key or user ID:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Redis&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;is_rate_limited&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window_seconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rate_limit:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="n"&gt;window_seconds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;incr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expire&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window_seconds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;

&lt;span class="nd"&gt;@app.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/api/resource&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_resource&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;is_rate_limited&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Rate limit exceeded&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="mi"&gt;429&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;process_request&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cost: Redis Cloud free tier (up to 30MB).&lt;/p&gt;

&lt;p&gt;Setup time: 30 minutes.&lt;/p&gt;

&lt;p&gt;Blocks: Authenticated abuse, account enumeration, brute-force attacks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 3: Endpoint-Specific Thresholds
&lt;/h3&gt;

&lt;p&gt;Different endpoints have different abuse vectors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Public endpoints&lt;/strong&gt; (search, info): 100 req/min per IP&lt;/li&gt;
&lt;li&gt;
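&lt;strong&gt;Auth endpoints&lt;/strong&gt; (login, signup): 5 req/min per IP, plus a limit per target account kept in a shared store such as Redis, so attempts spread across many IPs still count together&lt;/li&gt;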
&lt;li&gt;
&lt;strong&gt;Resource creation&lt;/strong&gt; (write APIs): 10 req/min per user&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Admin endpoints&lt;/strong&gt;: 1000 req/day per user (tight control)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Document these in your API spec. Expose rate limit headers to clients:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;X-RateLimit-Limit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;100&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;X-RateLimit-Remaining&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;87&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;X-RateLimit-Reset&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;unix_timestamp&lt;/span&gt;  &lt;span class="c1"&gt;# epoch seconds when the current window resets&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Real-World Cost Breakdown
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Nginx configuration&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Redis Cloud (free tier)&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monitoring + alerts&lt;/td&gt;
&lt;td&gt;$0–10/month (CloudWatch or Datadog free tier)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0–10/month&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Compare to AWS API Gateway: at $0.35 per million requests, 10 billion requests a month costs $3,500.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation Checklist
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Deploy Nginx rate limiting (zone + limit_req directive)&lt;/li&gt;
&lt;li&gt;[ ] Set up Redis account (free tier)&lt;/li&gt;
&lt;li&gt;[ ] Write rate limit middleware in your framework&lt;/li&gt;
&lt;li&gt;[ ] Define endpoint-specific limits&lt;/li&gt;
&lt;li&gt;[ ] Add rate limit headers to responses&lt;/li&gt;
&lt;li&gt;[ ] Test with Apache Bench or Vegeta load testing tool&lt;/li&gt;
&lt;li&gt;[ ] Set up alerts (Slack notification when a user hits limits)&lt;/li&gt;
&lt;li&gt;[ ] Document rate limits in your API docs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Time to implement: &lt;strong&gt;2–4 hours&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Cost: &lt;strong&gt;$0&lt;/strong&gt; (for 95% of use cases).&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Mistakes to Avoid
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Only IP-based limiting&lt;/strong&gt;: Punishes corporate networks and VPNs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No graduated response&lt;/strong&gt;: Banning immediately instead of throttling first.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storing counts in database&lt;/strong&gt;: Too slow. Use Redis or in-memory cache.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not exposing rate limit headers&lt;/strong&gt;: Clients can't intelligently back off.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Forgetting health check endpoints&lt;/strong&gt;: Exempt them so you don't rate limit your own monitoring.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Debugging Rate Limit Issues
&lt;/h2&gt;

&lt;p&gt;When a user reports "API blocked", here's how to troubleshoot:&lt;/p&gt;

&lt;ol&gt;
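&lt;li&gt;Check Redis keys: &lt;code&gt;redis-cli --scan --pattern "rate_limit:*"&lt;/code&gt; (prefer this to &lt;code&gt;KEYS&lt;/code&gt;, which blocks the server while it walks the whole keyspace)
&lt;/li&gt;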
&lt;li&gt;Inspect their request pattern: high burst vs sustained?&lt;/li&gt;
&lt;li&gt;Whitelist their IP/user if it's a legitimate use case&lt;/li&gt;
&lt;li&gt;Adjust thresholds based on real traffic patterns&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;p&gt;This playbook includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ready-to-deploy Nginx configs for all major frameworks&lt;/li&gt;
&lt;li&gt;Redis setup guide (AWS ElastiCache, DigitalOcean, Heroku)&lt;/li&gt;
&lt;li&gt;Complete Python/Node.js middleware code&lt;/li&gt;
&lt;li&gt;GitHub Actions workflow for load testing&lt;/li&gt;
&lt;li&gt;Real abuse patterns from production SaaS systems&lt;/li&gt;
&lt;li&gt;Cost optimization strategies (cache tiers, fallback limits)&lt;/li&gt;
&lt;li&gt;Comprehensive debugging guide&lt;/li&gt;
&lt;li&gt;Whitelist/bypass strategies for trusted partners&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Implementing rate limiting takes 2–4 hours. Ignoring it costs you production incidents and security breaches.&lt;/p&gt;

&lt;p&gt;Deploy today.&lt;/p&gt;

</description>
      <category>cicd</category>
      <category>llm</category>
      <category>testing</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How to Catch LLM Regressions in CI: The Rubric-Based Eval System That Works</title>
      <dc:creator>Charlie Hadley</dc:creator>
      <pubDate>Mon, 18 May 2026 16:13:08 +0000</pubDate>
      <link>https://dev.to/hadleyworks/how-to-catch-llm-regressions-in-ci-the-rubric-based-eval-system-that-works-48ck</link>
      <guid>https://dev.to/hadleyworks/how-to-catch-llm-regressions-in-ci-the-rubric-based-eval-system-that-works-48ck</guid>
      <description>&lt;h1&gt;
  
  
  API Rate Limiting Playbook: Protect Your Backend From Abuse
&lt;/h1&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Your API is live in production. Traffic is growing. Then one day, a bot discovers your endpoint and starts hammering it with 100,000 requests per second. Your database melts. Your users see 500 errors. You lose revenue and reputation.&lt;/p&gt;

&lt;p&gt;Or worse: a malicious actor uses your API to brute-force user accounts. You didn't have rate limiting in place. You're liable.&lt;/p&gt;

&lt;p&gt;This is the silent killer of indie SaaS. You ship the product. You don't ship the protection. Then production breaks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Most Indie Teams Skip Rate Limiting
&lt;/h2&gt;

&lt;p&gt;Rate limiting &lt;em&gt;sounds&lt;/em&gt; complicated. "Distributed rate limiting"? "Token bucket algorithm"? "Redis backing stores"?&lt;/p&gt;

&lt;p&gt;In reality, it's simple. And you don't need expensive tools. You don't need AWS API Gateway ($0.35 per million requests). You don't need third-party middleware.&lt;/p&gt;

&lt;p&gt;You need a methodology. Once you have one, the implementation is trivial.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three-Layer Strategy
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Layer 1: IP-Based Rate Limiting (Nginx)
&lt;/h3&gt;

&lt;p&gt;First line of defense: block obvious bots and abusers at the edge.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
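&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="c1"&gt;# These directives go in the http context&lt;/span&gt;
&lt;span class="k"&gt;limit_req_zone&lt;/span&gt; &lt;span class="nv"&gt;$binary_remote_addr&lt;/span&gt; &lt;span class="s"&gt;zone=general:10m&lt;/span&gt; &lt;span class="s"&gt;rate=10r/s&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;limit_req_zone&lt;/span&gt; &lt;span class="nv"&gt;$binary_remote_addr&lt;/span&gt; &lt;span class="s"&gt;zone=auth:10m&lt;/span&gt; &lt;span class="s"&gt;rate=1r/s&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;limit_req_status&lt;/span&gt; &lt;span class="mi"&gt;429&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;# Nginx returns 503 for rejected requests by default&lt;/span&gt;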

&lt;span class="k"&gt;server&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;location&lt;/span&gt; &lt;span class="n"&gt;/api/&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kn"&gt;limit_req&lt;/span&gt; &lt;span class="s"&gt;zone=general&lt;/span&gt; &lt;span class="s"&gt;burst=20&lt;/span&gt; &lt;span class="s"&gt;nodelay&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kn"&gt;location&lt;/span&gt; &lt;span class="n"&gt;/api/auth/login&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kn"&gt;limit_req&lt;/span&gt; &lt;span class="s"&gt;zone=auth&lt;/span&gt; &lt;span class="s"&gt;burst=3&lt;/span&gt; &lt;span class="s"&gt;nodelay&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cost: $0 (Nginx is free).&lt;/p&gt;

&lt;p&gt;Setup time: 15 minutes.&lt;/p&gt;

&lt;p&gt;Blocks: 95% of bot traffic and accidental DDoS.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 2: User/Token-Based Rate Limiting (Redis + Python)
&lt;/h3&gt;

&lt;p&gt;Your authenticated users have legitimate spikes. A single IP-based rule punishes them unfairly.&lt;/p&gt;

&lt;p&gt;Instead, rate limit per API key or user ID:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;flask&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Flask&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;flask_login&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;current_user&lt;/span&gt;  &lt;span class="c1"&gt;# assumes Flask-Login handles auth&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Flask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Redis&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;is_rate_limited&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window_seconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Fixed window: one counter key per user per window&lt;/span&gt;
    &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rate_limit:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="n"&gt;window_seconds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;pipe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# runs both commands as one transaction&lt;/span&gt;
    &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;incr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expire&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window_seconds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;

&lt;span class="nd"&gt;@app.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/api/resource&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_resource&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;is_rate_limited&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Rate limit exceeded&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="mi"&gt;429&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;process_request&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# placeholder for your handler logic&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
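&lt;p&gt;To unit test the fixed-window logic without a Redis server, the same counting scheme can be sketched in memory. This is a minimal illustration rather than the playbook's implementation: the &lt;code&gt;FixedWindowLimiter&lt;/code&gt; class and its injectable &lt;code&gt;clock&lt;/code&gt; are hypothetical names, old windows are never evicted, and a per-process dict is no substitute for Redis once you run more than one worker.&lt;/p&gt;

```python
import time
from collections import defaultdict


class FixedWindowLimiter:
    """In-memory stand-in for the Redis counter (hypothetical helper)."""

    def __init__(self, limit=100, window_seconds=3600, clock=time.time):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so tests can control time
        self.counts = defaultdict(int)

    def is_rate_limited(self, user_id):
        # Same bucketing as the Redis key: one counter per user per window
        window = int(self.clock() // self.window_seconds)
        self.counts[(user_id, window)] += 1
        return self.counts[(user_id, window)] > self.limit


# Usage: three requests allowed per 60-second window
now = [0.0]
limiter = FixedWindowLimiter(limit=3, window_seconds=60, clock=lambda: now[0])
results = [limiter.is_rate_limited("u1") for _ in range(4)]
now[0] = 61.0  # move into the next window; the counter starts over
after_reset = limiter.is_rate_limited("u1")
```

&lt;p&gt;The fourth call in the first window is rejected, and the first call after the window boundary is allowed again. That boundary reset is the known weakness of fixed windows: a client can send a full quota just before the boundary and another just after it.&lt;/p&gt;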



&lt;p&gt;Cost: Redis Cloud free tier (up to 30MB).&lt;/p&gt;

&lt;p&gt;Setup time: 30 minutes.&lt;/p&gt;

&lt;p&gt;Blocks: Authenticated abuse, account enumeration, brute-force attacks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 3: Endpoint-Specific Thresholds
&lt;/h3&gt;

&lt;p&gt;Different endpoints have different abuse vectors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Public endpoints&lt;/strong&gt; (search, info): 100 req/min per IP&lt;/li&gt;
&lt;li&gt;
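&lt;strong&gt;Auth endpoints&lt;/strong&gt; (login, signup): 5 req/min per IP, plus a limit per target account kept in a shared store such as Redis, so attempts spread across many IPs still count together&lt;/li&gt;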
&lt;li&gt;
&lt;strong&gt;Resource creation&lt;/strong&gt; (write APIs): 10 req/min per user&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Admin endpoints&lt;/strong&gt;: 1000 req/day per user (tight control)&lt;/li&gt;
&lt;/ul&gt;
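&lt;p&gt;One way to keep these thresholds in a single place is a small lookup table that the middleware consults on each request. The sketch below uses the limits from the list above; the path prefixes and the &lt;code&gt;limits_for&lt;/code&gt; helper are hypothetical and would need to match your own routes.&lt;/p&gt;

```python
# Requests allowed per window, keyed by endpoint class (values from the list above)
ENDPOINT_LIMITS = {
    "public": {"limit": 100, "window_seconds": 60, "scope": "ip"},
    "auth": {"limit": 5, "window_seconds": 60, "scope": "ip"},
    "write": {"limit": 10, "window_seconds": 60, "scope": "user"},
    "admin": {"limit": 1000, "window_seconds": 86400, "scope": "user"},
}

# Hypothetical path prefixes; replace with your own routes
PREFIX_TO_CLASS = [
    ("/api/auth/", "auth"),
    ("/api/admin/", "admin"),
    ("/api/search", "public"),
]


def limits_for(path, method="GET"):
    """Return the limit config for a request path and HTTP method."""
    for prefix, endpoint_class in PREFIX_TO_CLASS:
        if path.startswith(prefix):
            return ENDPOINT_LIMITS[endpoint_class]
    # Anything that modifies data falls under the write limit
    if method in ("POST", "PUT", "PATCH", "DELETE"):
        return ENDPOINT_LIMITS["write"]
    return ENDPOINT_LIMITS["public"]
```

&lt;p&gt;The returned &lt;code&gt;limit&lt;/code&gt; and &lt;code&gt;window_seconds&lt;/code&gt; can be passed straight to a counter such as &lt;code&gt;is_rate_limited&lt;/code&gt;, with &lt;code&gt;scope&lt;/code&gt; deciding whether the key is the client IP or the user ID.&lt;/p&gt;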

&lt;p&gt;Document these in your API spec. Expose rate limit headers to clients:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;X-RateLimit-Limit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;100&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;X-RateLimit-Remaining&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;87&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;X-RateLimit-Reset&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;unix_timestamp&lt;/span&gt;  &lt;span class="c1"&gt;# epoch seconds when the current window resets&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
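&lt;p&gt;In practice the header values come from the counter rather than being hard-coded. A minimal sketch, assuming the fixed-window scheme from Layer 2; the &lt;code&gt;rate_limit_headers&lt;/code&gt; helper is a hypothetical name, and &lt;code&gt;current&lt;/code&gt; is the count returned by the Redis increment:&lt;/p&gt;

```python
import time


def rate_limit_headers(limit, current, window_seconds, now=None):
    """Build X-RateLimit-* header values for a fixed-window counter."""
    now = time.time() if now is None else now
    # The current window ends at the next multiple of window_seconds
    reset_at = (int(now // window_seconds) + 1) * window_seconds
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(limit - current, 0)),
        "X-RateLimit-Reset": str(reset_at),
    }


# Usage: 13 of 100 requests used, 10 minutes into a one-hour window
headers = rate_limit_headers(limit=100, current=13, window_seconds=3600, now=600)
```

&lt;p&gt;Clamping the remaining count at zero keeps the header sensible for clients that are already over the limit.&lt;/p&gt;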



&lt;h2&gt;
  
  
  Real-World Cost Breakdown
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Nginx configuration&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Redis Cloud (free tier)&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monitoring + alerts&lt;/td&gt;
&lt;td&gt;$0–10/month (CloudWatch or Datadog free tier)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0–10/month&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Compare to AWS API Gateway: at $0.35 per million requests, 10 billion requests a month costs $3,500.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation Checklist
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Deploy Nginx rate limiting (zone + limit_req directive)&lt;/li&gt;
&lt;li&gt;[ ] Set up Redis account (free tier)&lt;/li&gt;
&lt;li&gt;[ ] Write rate limit middleware in your framework&lt;/li&gt;
&lt;li&gt;[ ] Define endpoint-specific limits&lt;/li&gt;
&lt;li&gt;[ ] Add rate limit headers to responses&lt;/li&gt;
&lt;li&gt;[ ] Test with Apache Bench or Vegeta load testing tool&lt;/li&gt;
&lt;li&gt;[ ] Set up alerts (Slack notification when a user hits limits)&lt;/li&gt;
&lt;li&gt;[ ] Document rate limits in your API docs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Time to implement: &lt;strong&gt;2–4 hours&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Cost: &lt;strong&gt;$0&lt;/strong&gt; (for 95% of use cases).&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Mistakes to Avoid
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Only IP-based limiting&lt;/strong&gt;: Punishes corporate networks and VPNs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No graduated response&lt;/strong&gt;: Banning immediately instead of throttling first.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storing counts in database&lt;/strong&gt;: Too slow. Use Redis or in-memory cache.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not exposing rate limit headers&lt;/strong&gt;: Clients can't intelligently back off.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Forgetting health check endpoints&lt;/strong&gt;: Exempt them so you don't rate limit your own monitoring.&lt;/li&gt;
&lt;/ol&gt;
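&lt;p&gt;A graduated response can be as simple as two thresholds on the same counter. The sketch below is one possible shape, not a prescribed policy: &lt;code&gt;graduated_response&lt;/code&gt; and its &lt;code&gt;block_factor&lt;/code&gt; parameter are hypothetical names, and the factor of 2 is an arbitrary starting point.&lt;/p&gt;

```python
def graduated_response(current, limit, block_factor=2.0):
    """Map a request count to an action: allow, throttle, or block.

    block_factor is a hypothetical knob: a client is only blocked once it
    exceeds block_factor times the limit within the window.
    """
    if current > limit * block_factor:
        return "block"      # sustained abuse: refuse outright (e.g. temporary ban)
    if current > limit:
        return "throttle"   # over the limit: respond 429 and let the client retry
    return "allow"
```

&lt;p&gt;A client that briefly overshoots gets a 429 and can recover on its own, while only one that keeps hammering the endpoint well past the limit is cut off.&lt;/p&gt;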

&lt;h2&gt;
  
  
  Debugging Rate Limit Issues
&lt;/h2&gt;

&lt;p&gt;When a user reports "API blocked", here's how to troubleshoot:&lt;/p&gt;

&lt;ol&gt;
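&lt;li&gt;Check Redis keys: &lt;code&gt;redis-cli --scan --pattern "rate_limit:*"&lt;/code&gt; (prefer this to &lt;code&gt;KEYS&lt;/code&gt;, which blocks the server while it walks the whole keyspace)
&lt;/li&gt;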
&lt;li&gt;Inspect their request pattern: high burst vs sustained?&lt;/li&gt;
&lt;li&gt;Whitelist their IP/user if it's a legitimate use case&lt;/li&gt;
&lt;li&gt;Adjust thresholds based on real traffic patterns&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;p&gt;This playbook includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ready-to-deploy Nginx configs for all major frameworks&lt;/li&gt;
&lt;li&gt;Redis setup guide (AWS ElastiCache, DigitalOcean, Heroku)&lt;/li&gt;
&lt;li&gt;Complete Python/Node.js middleware code&lt;/li&gt;
&lt;li&gt;GitHub Actions workflow for load testing&lt;/li&gt;
&lt;li&gt;Real abuse patterns from production SaaS systems&lt;/li&gt;
&lt;li&gt;Cost optimization strategies (cache tiers, fallback limits)&lt;/li&gt;
&lt;li&gt;Comprehensive debugging guide&lt;/li&gt;
&lt;li&gt;Whitelist/bypass strategies for trusted partners&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Implementing rate limiting takes 2–4 hours. Ignoring it costs you production incidents and security breaches.&lt;/p&gt;

&lt;p&gt;Deploy today.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How to Run LLM Evaluations in CI Without Paying $249/Month</title>
      <dc:creator>Charlie Hadley</dc:creator>
      <pubDate>Mon, 18 May 2026 15:47:46 +0000</pubDate>
      <link>https://dev.to/hadleyworks/how-to-run-llm-evaluations-in-ci-without-paying-249month-2nf4</link>
      <guid>https://dev.to/hadleyworks/how-to-run-llm-evaluations-in-ci-without-paying-249month-2nf4</guid>
      <description>&lt;h1&gt;
  
  
  How to Run LLM Evaluations in CI Without Paying $249/Month
&lt;/h1&gt;

&lt;p&gt;If you're building LLM-powered features as an indie hacker or small team, you've probably hit this wall: your prompts work great in the playground, but you have no systematic way to know if they're actually &lt;em&gt;improving&lt;/em&gt; after each change.&lt;/p&gt;

&lt;p&gt;The obvious answer is Braintrust or LangSmith. But at $249/month minimum, that's a massive commitment for a pre-PMF product. Here's how to build a production-grade eval pipeline for under $5/month.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Architecture
&lt;/h2&gt;

&lt;p&gt;You need three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;A golden dataset&lt;/strong&gt; — A CSV of 50-200 test cases covering your edge cases, with input + expected behavior description&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A scoring function&lt;/strong&gt; — LLM-as-judge using GPT-4o-mini (~$0.002 per example)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Actions integration&lt;/strong&gt; — Runs your eval suite on every PR with a score threshold check&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The magic: your CI pipeline fails the build if average quality drops below your threshold. No more shipping prompt regressions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Rubric-Based Scoring Beats Exact Match
&lt;/h2&gt;

&lt;p&gt;The biggest mistake teams make: they try to match exact output strings. This fails because LLMs are inherently non-deterministic.&lt;/p&gt;

&lt;p&gt;Instead, define what "good" looks like as a checklist rubric:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;rubric&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Score this response from 0 to 5, awarding one point for each criterion it meets:
- Does it answer the question directly? (1 point)
- Is it concise (under 200 words)? (1 point)  
- Does it avoid hallucinating specific numbers? (1 point)
- Is the tone professional? (1 point)
- Would a user find this genuinely useful? (1 point)
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then let GPT-4o-mini score each response against this rubric. At $0.002 per evaluation, running 100 test cases costs $0.20.&lt;/p&gt;

&lt;h2&gt;
  
  
  The GitHub Actions Workflow
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LLM Eval CI&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;eval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run eval suite&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.OPENAI_API_KEY }}&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;pip install openai pandas&lt;/span&gt;
          &lt;span class="s"&gt;python eval/run_suite.py --threshold 3.5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--threshold 3.5&lt;/code&gt; means: if average score drops below 3.5/5.0, fail the PR. This is your quality gate.&lt;/p&gt;
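&lt;p&gt;The contents of &lt;code&gt;eval/run_suite.py&lt;/code&gt; are not shown here, so the following is only a sketch of how such a gate might work: average the rubric scores and return a non-zero exit code when the average falls below the threshold. The function names are assumptions, and the scores in the usage line are placeholders.&lt;/p&gt;

```python
from statistics import mean


def passes_threshold(scores, threshold=3.5):
    """Return True when the average rubric score meets the threshold."""
    if not scores:
        return False  # an empty run should never pass silently
    return mean(scores) >= threshold


def main(scores, threshold=3.5):
    """Return a CI exit code: 0 on pass, 1 on fail."""
    average = mean(scores) if scores else 0.0
    print(f"average score: {average:.2f} (threshold {threshold})")
    return 0 if passes_threshold(scores, threshold) else 1


# Placeholder scores; a real suite would collect these from the judge model
exit_code = main([4, 5, 3, 4], threshold=3.5)
```

&lt;p&gt;Passing the result to &lt;code&gt;sys.exit&lt;/code&gt; at the end of the script is what makes GitHub Actions mark the step, and therefore the PR check, as failed.&lt;/p&gt;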

&lt;h2&gt;
  
  
  The Multi-Model Comparison Pattern
&lt;/h2&gt;

&lt;p&gt;Before you commit to GPT-4o for your feature, run your eval suite against Claude 3.5 Haiku and Gemini Flash. You'll often find that a cheaper model scores within 0.2 points of the expensive one — at 1/10th the cost.&lt;/p&gt;

&lt;p&gt;This comparison takes 10 minutes to set up but can cut your inference costs by 60-80%.&lt;/p&gt;
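&lt;p&gt;Once each model has an average score and a cost, the selection rule is easy to automate. A minimal sketch, assuming you have already run the suite per model; &lt;code&gt;cheapest_within_margin&lt;/code&gt; is a hypothetical helper, the model names are placeholders, and the numbers are illustrative, not measurements.&lt;/p&gt;

```python
def cheapest_within_margin(results, margin=0.2):
    """Pick the cheapest model whose average score is within margin of the best.

    results maps model name to {"score": average rubric score,
    "cost": cost per 1,000 calls in dollars}.
    """
    best_score = max(r["score"] for r in results.values())
    candidates = {
        name: r for name, r in results.items() if r["score"] >= best_score - margin
    }
    return min(candidates, key=lambda name: candidates[name]["cost"])


# Illustrative numbers only, not measurements
results = {
    "model-a": {"score": 4.3, "cost": 10.0},
    "model-b": {"score": 4.2, "cost": 1.0},
    "model-c": {"score": 3.6, "cost": 0.5},
}
choice = cheapest_within_margin(results)
```

&lt;p&gt;Here the cheapest model overall is excluded because its score falls too far below the best, so the rule settles on the mid-priced one.&lt;/p&gt;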

&lt;h2&gt;
  
  
  What This Catches in Practice
&lt;/h2&gt;

&lt;p&gt;Real scenario: You change your system prompt to fix a formatting issue. Without evals, you ship it. With evals, your CI run shows the average rubric score dropped from 4.2 to 3.1 on the golden dataset, below the 3.5 threshold, so the build fails. You investigate, find that your formatting fix accidentally removed context the model needed, and fix it before it hits production.&lt;/p&gt;

&lt;p&gt;The moment you catch your first regression in CI, the whole system pays for itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building Your Golden Dataset
&lt;/h2&gt;

&lt;p&gt;Start with 50 examples. Pull them from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real user queries you've seen in logs&lt;/li&gt;
&lt;li&gt;Edge cases you've worried about but never tested&lt;/li&gt;
&lt;li&gt;Failure modes you've already shipped by accident&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Don't try to write expected outputs. Instead, write &lt;em&gt;rubrics&lt;/em&gt; describing what good looks like for each category.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost Breakdown
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Golden dataset (50 examples): $0.10 per full suite run&lt;/li&gt;
&lt;li&gt;GitHub Actions: free tier (2,000 minutes/month)&lt;/li&gt;
&lt;li&gt;Total monthly cost for 10 PRs/week: ~$4/month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compare to Braintrust at $249/month.&lt;/p&gt;
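&lt;p&gt;The monthly figure follows from the per-example price. A quick sketch of the arithmetic, assuming one full suite run per PR and roughly 4.33 weeks per month; &lt;code&gt;monthly_eval_cost&lt;/code&gt; is a hypothetical helper:&lt;/p&gt;

```python
def monthly_eval_cost(examples, cost_per_example, prs_per_week, weeks_per_month=4.33):
    """Estimate monthly judge-model spend, assuming one suite run per PR."""
    cost_per_run = examples * cost_per_example
    runs_per_month = prs_per_week * weeks_per_month
    return cost_per_run * runs_per_month


# 50 examples at $0.002 each, 10 PRs a week
estimate = monthly_eval_cost(examples=50, cost_per_example=0.002, prs_per_week=10)
```

&lt;p&gt;If the workflow reruns on every push to a PR, multiply by the average number of pushes per PR.&lt;/p&gt;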

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;The hardest part isn't the code — it's building the golden dataset and writing good rubrics. Once those exist, the automation is straightforward.&lt;/p&gt;

&lt;p&gt;I've packaged the full methodology into a playbook: golden dataset templates, rubric examples, multi-model comparison scripts, and the complete GitHub Actions workflow. Available at &lt;a href="https://hadleyworks.gumroad.com/l/nyzala" rel="noopener noreferrer"&gt;hadleyworks.gumroad.com&lt;/a&gt; for $29.&lt;/p&gt;

&lt;p&gt;What eval setups are others running at small scale? Happy to discuss approaches in the comments.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>productivity</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Evaluating LLMs in Production Without Paying $249/Month for Braintrust</title>
      <dc:creator>Charlie Hadley</dc:creator>
      <pubDate>Mon, 18 May 2026 15:02:43 +0000</pubDate>
      <link>https://dev.to/hadleyworks/evaluating-llms-in-production-without-paying-249month-for-braintrust-31ch</link>
      <guid>https://dev.to/hadleyworks/evaluating-llms-in-production-without-paying-249month-for-braintrust-31ch</guid>
      <description>&lt;h1&gt;
  
  
  Evaluating LLMs in Production Without Paying $249/Month for Braintrust
&lt;/h1&gt;

&lt;p&gt;If you're building an LLM-powered product as an indie hacker or small team, you've probably hit this wall: your prompts work great in the playground, but you have no idea if they're actually getting better (or worse) after each change.&lt;/p&gt;

&lt;p&gt;The obvious solution is a dedicated eval platform such as Braintrust, LangSmith, or Humanloop. But at $249/month for meaningful usage, that's a lot of MRR to justify before you've found product-market fit.&lt;/p&gt;

&lt;p&gt;Here's what I've been doing instead, using tools you already have.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Problem With Ad-Hoc Evals
&lt;/h2&gt;

&lt;p&gt;Most indie teams do one of three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Vibe-check evals&lt;/strong&gt; — you prompt it, it feels right, you ship&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One-shot spreadsheets&lt;/strong&gt; — you run 20 examples once, never again&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nothing&lt;/strong&gt; — you just watch for complaints in Discord&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;None of these catch regressions. When you change a prompt to fix one thing, you break two others, and you won't know for a week.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Lightweight Eval Stack That Actually Works
&lt;/h2&gt;

&lt;p&gt;Here's the stack: &lt;strong&gt;Golden dataset + GitHub Actions + a simple scoring function&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Build a Golden Dataset
&lt;/h3&gt;

&lt;p&gt;A golden dataset is just a CSV with input/expected output pairs. Start with 20-50 examples that cover your edge cases:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;input,expected_output,tags
"Summarize this legal clause: ...","The clause limits liability to...","legal,summarization"
"What is the capital of France?","Paris","factual,simple"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
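&lt;p&gt;A minimal sketch of loading that file, assuming the three column names shown above. The function name &lt;code&gt;load_golden_dataset&lt;/code&gt; is made up for illustration, and &lt;code&gt;skipinitialspace=True&lt;/code&gt; is there so a stray space after a comma doesn't break the quoting:&lt;/p&gt;

```python
import csv


def load_golden_dataset(path):
    """Read the golden CSV into a list of dicts, splitting tags into a list."""
    examples = []
    with open(path, newline="", encoding="utf-8") as handle:
        # skipinitialspace tolerates rows written as: "a", "b", "c"
        for row in csv.DictReader(handle, skipinitialspace=True):
            examples.append({
                "input": row["input"],
                "expected_output": row["expected_output"],
                # Tags live in one quoted, comma-separated field.
                "tags": [t.strip() for t in row["tags"].split(",") if t.strip()],
            })
    return examples
```

&lt;p&gt;Keeping tags as a list makes it easy to slice scores by category later.&lt;/p&gt;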



&lt;p&gt;The key insight: &lt;strong&gt;you don't need perfect expected outputs&lt;/strong&gt;. You need &lt;em&gt;rubric-based scoring&lt;/em&gt;, not exact match. Define what "good" looks like as a checklist.&lt;/p&gt;
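&lt;p&gt;Here's one way a checklist rubric could look in code. This is a sketch, not a fixed format: the rubric items, &lt;code&gt;build_rubric_prompt&lt;/code&gt;, and &lt;code&gt;rubric_score&lt;/code&gt; are all hypothetical names, and it scores as a 0-1 pass fraction rather than the 1-5 scale used in Step 2:&lt;/p&gt;

```python
# Hypothetical rubric: each item is a yes/no question the judge answers.
SUMMARY_RUBRIC = [
    "States the main point of the source text",
    "Contains no facts that are absent from the source text",
    "Is at most three sentences long",
]


def build_rubric_prompt(input_text, actual_output, rubric):
    """Format a judge prompt that asks for one yes/no verdict per rubric item."""
    numbered = "\n".join(f"{i}. {item}" for i, item in enumerate(rubric, start=1))
    return (
        "Check the response against each rubric item.\n\n"
        f"Input: {input_text}\n"
        f"Response: {actual_output}\n\n"
        f"Rubric:\n{numbered}\n\n"
        'Return JSON: {"verdicts": [true or false for each item, in order]}'
    )


def rubric_score(verdicts):
    """Fraction of rubric items passed, from 0.0 to 1.0."""
    if not verdicts:
        raise ValueError("verdicts must not be empty")
    return sum(1 for v in verdicts if v) / len(verdicts)
```

&lt;p&gt;Per-item verdicts also tell you &lt;em&gt;which&lt;/em&gt; criterion a prompt change broke, which a single number can't.&lt;/p&gt;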

&lt;h3&gt;
  
  
  Step 2: Write a Scoring Function
&lt;/h3&gt;

&lt;p&gt;For most use cases, a simple LLM-as-judge approach works well:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
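&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;  &lt;span class="c1"&gt;# reads OPENAI_API_KEY from the environment&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;score_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;actual_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expected_output&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;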
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Rate this LLM response on a scale of 1-5.

    Input: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;input_text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
    Expected: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;expected_output&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  
    Actual: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;actual_output&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

    Score based on: accuracy, completeness, tone.
    Return JSON: {{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: X, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}}
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
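&lt;p&gt;One caveat: judge models don't always return clean JSON, and a bare &lt;code&gt;json.loads&lt;/code&gt; will crash the whole run when they don't. A defensive parser might look like this. It's a sketch under my own assumptions: &lt;code&gt;parse_judge_reply&lt;/code&gt; is a hypothetical helper, and it assumes the judge returns the score as an integer:&lt;/p&gt;

```python
import json


def parse_judge_reply(text):
    """Extract and validate the {"score", "reason"} object from a judge reply."""
    # Models sometimes wrap JSON in prose or code fences, so keep only the
    # outermost {...} span before parsing.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError(f"no JSON object in judge reply: {text!r}")
    data = json.loads(text[start:end + 1])

    score = data.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or score not in range(1, 6):
        raise ValueError(f"score must be an integer from 1 to 5, got {score!r}")
    return {"score": score, "reason": str(data.get("reason", ""))}
```

&lt;p&gt;Failing loudly on a malformed reply is deliberate: a silently dropped example would skew the average.&lt;/p&gt;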



&lt;p&gt;Cost: roughly $0.002 per example with GPT-4o-mini, so a 50-example run costs about $0.10. That's cheap enough to run on every PR.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: GitHub Actions Integration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LLM Eval Suite&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;eval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
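      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/setup-python@v5&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;python-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;3.x&lt;/span&gt;
      &lt;span class="c1"&gt;# Assumes the eval scripts only need the openai package&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install dependencies&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pip install openai&lt;/span&gt;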
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run eval suite&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python eval/run_evals.py&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.OPENAI_API_KEY }}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Check score threshold&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python eval/check_threshold.py --min-score &lt;/span&gt;&lt;span class="m"&gt;3.8&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now every PR shows a score. If it drops below 3.8, the check fails. You've just built CI for your prompts.&lt;/p&gt;
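&lt;p&gt;For reference, the threshold check can be very small. The sketch below is one possible shape for &lt;code&gt;eval/check_threshold.py&lt;/code&gt;, not a canonical version. It assumes the eval run writes a JSON list of &lt;code&gt;{"score": ...}&lt;/code&gt; objects to &lt;code&gt;eval/results.json&lt;/code&gt; (a path and format I made up) and that "the score" means the mean across all examples:&lt;/p&gt;

```python
import argparse
import json
import statistics


def passes_threshold(scores, min_score):
    """True when the mean score meets or exceeds the threshold."""
    return statistics.mean(scores) >= min_score


def main(argv=None):
    parser = argparse.ArgumentParser(description="Fail CI when the mean eval score is too low.")
    parser.add_argument("--min-score", type=float, required=True)
    # Hypothetical default path; run_evals.py is assumed to write a JSON list
    # of objects like {"score": 4, "reason": "..."} to this file.
    parser.add_argument("--results", default="eval/results.json")
    args = parser.parse_args(argv)

    with open(args.results, encoding="utf-8") as handle:
        scores = [row["score"] for row in json.load(handle)]

    mean_score = statistics.mean(scores)
    print(f"mean score {mean_score:.2f} over {len(scores)} examples (min {args.min_score})")
    # A non-zero return value becomes the process exit code, which is what
    # makes the GitHub Actions step fail.
    return 0 if passes_threshold(scores, args.min_score) else 1
```

&lt;p&gt;To run it as a script, end the file with &lt;code&gt;raise SystemExit(main())&lt;/code&gt;.&lt;/p&gt;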

&lt;h2&gt;
  
  
  What This Doesn't Cover
&lt;/h2&gt;

&lt;p&gt;This approach works great for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Summarization and extraction tasks&lt;/li&gt;
&lt;li&gt;Classification (with expected labels)&lt;/li&gt;
&lt;li&gt;RAG retrieval quality&lt;/li&gt;
&lt;li&gt;Tone/style adherence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's harder to apply to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open-ended creative tasks&lt;/li&gt;
&lt;li&gt;Multi-turn conversations&lt;/li&gt;
&lt;li&gt;Tasks where "correct" is deeply subjective&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For those cases, you need human-in-the-loop evals — but you can still automate the &lt;em&gt;collection&lt;/em&gt; of examples and use the human time only for scoring edge cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Win: Regression Detection
&lt;/h2&gt;

&lt;p&gt;The moment this system pays off is when you change your system prompt to improve summarization, run the eval suite, and see that the average score on your classification examples dropped from 4.2 to 3.1. Without this, you'd ship it and wonder why your churn ticked up next week.&lt;/p&gt;
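&lt;p&gt;Spotting that kind of drop means comparing scores per tag, not just overall. A rough sketch, assuming each result carries the &lt;code&gt;tags&lt;/code&gt; from the golden dataset plus a numeric &lt;code&gt;score&lt;/code&gt;. The function names and the 0.3 tolerance are placeholders of mine, so tune them to your own noise level:&lt;/p&gt;

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_tag(results):
    """Average the scores of results that share a tag.

    Each result is a dict like {"tags": ["legal"], "score": 4}.
    """
    buckets = defaultdict(list)
    for result in results:
        for tag in result["tags"]:
            buckets[tag].append(result["score"])
    return {tag: mean(scores) for tag, scores in buckets.items()}


def find_regressions(baseline, current, tolerance=0.3):
    """Return {tag: (old_mean, new_mean)} for tags that dropped by more than tolerance."""
    regressions = {}
    for tag, old_mean in baseline.items():
        new_mean = current.get(tag)
        if new_mean is not None and old_mean - new_mean > tolerance:
            regressions[tag] = (old_mean, new_mean)
    return regressions


# Per-tag means from the main branch versus the PR branch.
baseline = {"summarization": 3.9, "classification": 4.2}
current = {"summarization": 4.3, "classification": 3.1}
print(find_regressions(baseline, current))  # {'classification': (4.2, 3.1)}
```

&lt;p&gt;The overall average here barely moves (4.05 to 3.7), which is exactly why a single global threshold can miss a regression confined to one category.&lt;/p&gt;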

&lt;p&gt;&lt;strong&gt;The goal isn't perfect evals. The goal is catching regressions before your users do.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Going Deeper
&lt;/h2&gt;

&lt;p&gt;If you want the full methodology — including golden dataset templates, rubric examples, multi-model comparison scripts, and a GitHub Actions workflow you can clone — I packaged everything into a playbook: &lt;a href="https://buy.stripe.com/6oUeV5gH7b4s56YcHG4ko0d" rel="noopener noreferrer"&gt;The Indie Hacker's LLM Eval Playbook&lt;/a&gt; (£25, instant download).&lt;/p&gt;

&lt;p&gt;But honestly, the approach above will get you 80% of the way there for free.&lt;/p&gt;

&lt;p&gt;The main insight: &lt;strong&gt;treat your prompts like code&lt;/strong&gt;. You wouldn't ship a function without tests. Don't ship a prompt without evals.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What eval setup are you running? Curious what others have found works at small scale — drop a comment below.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>productivity</category>
      <category>startup</category>
    </item>
  </channel>
</rss>
