DEV Community

MORINAGA

How I built pairwise AI model compare pages with Claude Haiku and a budget cap

When I added compare pages to the Top AI Tools directory, the first question I had to answer was: how many pairs am I actually looking at? With roughly 200 models across 8 pipeline tags, the naive upper bound is 200 × 199 / 2 = 19,900 pairs. Generating content for each one with Claude Haiku would cost somewhere around $20 per run. That's not ruinous, but it's not something I wanted to run daily without thinking carefully.

Here's what I actually built, where it falls short, and what I'd do differently if starting over.

The combinatorics problem

Model compare pages exist for a specific type of query: "llama 3 vs mistral 7b", "stable diffusion vs sdxl", "whisper vs wav2vec2". These are high-intent queries — the user has already narrowed down to a shortlist and wants a concrete decision nudge. The static SSG approach I'm running means I need to precompute each compare page at build time, which puts pressure on how many pages I can afford to generate.

The solution I landed on: group by pipeline_tag, pair the top-4 models by download count within each group, then cap total pairs with a COMPARE_LIMIT env var. Within a single pipeline like text-generation, the top 4 models give 6 pairs (4 choose 2). Across 8 active pipelines that's at most 48 pairs (fewer if any pipeline has under 4 models). With the cap at 50, every one of those pairs fits, with a little room to grow.

// Group models by pipeline_tag; models without a tag are skipped.
const byPipe = new Map<string, typeof models>();
for (const m of models) {
  if (!m.pipeline_tag) continue;
  const arr = byPipe.get(m.pipeline_tag) ?? [];
  arr.push(m);
  byPipe.set(m.pipeline_tag, arr);
}

// Within each pipeline, pair up the top 4 models by download count.
const pairs: Array<[Model, Model]> = [];
for (const [, list] of byPipe) {
  const sorted = [...list].sort((a, b) => b.downloads - a.downloads);
  const take = sorted.slice(0, 4); // slice already handles lists shorter than 4
  for (let i = 0; i < take.length; i++) {
    for (let j = i + 1; j < take.length; j++) {
      pairs.push([take[i]!, take[j]!]);
    }
  }
}
// MAX is the COMPARE_LIMIT cap; pairs beyond it are dropped in pipeline order.
const chosen = pairs.slice(0, MAX);
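The pair counts above come from the n-choose-2 formula. As a minimal sketch of that arithmetic, plus one way the cap could be parsed from COMPARE_LIMIT (parseLimit and its default of 50 are assumptions for illustration; the snippet above only shows MAX):

```typescript
// Number of unordered pairs among n items: n choose 2.
function pairCount(n: number): number {
  return (n * (n - 1)) / 2;
}

// Hypothetical helper: turn the raw COMPARE_LIMIT value into a usable cap.
// Falls back to the default when the value is missing, empty, or below 1.
function parseLimit(raw: string | undefined, fallback: number = 50): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const n = Number(raw);
  return isFinite(n) && n >= 1 ? Math.floor(n) : fallback;
}

console.log(pairCount(200)); // 19900: every model paired with every other
console.log(pairCount(4) * 8); // 48: top-4 pairs within each of 8 pipelines
console.log(parseLimit("200")); // 200
console.log(parseLimit(undefined)); // 50
```

In the ETL this would be called as parseLimit(process.env.COMPARE_LIMIT). One thing to keep in mind: pairs.slice(0, MAX) truncates in Map insertion order, so if the cap ever drops below the number of candidate pairs, the pipelines that were added last are the ones that lose coverage.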

The pairing happens entirely within pipelines right now, which means I'm covering "llama vs mistral" (both text-generation) but not "whisper vs gemini-vision" (cross-pipeline). Cross-pipeline comparisons are actually more valuable for users who don't know the landscape yet — that's the next iteration.

The pair_slug and idempotent inserts

The slug for each compare pair is constructed deterministically: sort the two model slugs alphabetically, join with --vs--. So whether the ETL processes (llama-3, mistral-7b) or (mistral-7b, llama-3), the slug is always llama-3--vs--mistral-7b.

const pairSlug = [a.slug, b.slug].sort().join("--vs--");

This makes the entire ETL idempotent. The script runs every night. If all pairs already exist in the DB, it exits in a couple of seconds without a single Claude call. I check before inserting rather than using INSERT OR IGNORE at the SQL level — the explicit check lets me count skipped vs generated in the same run, which I log:

[compare] done — generated: 3, skipped: 47

This matters for monitoring. A run that generates 0 and skips 50 is healthy. A run that generates 0 and skips 0 (nothing in DB, nothing processed) would indicate a bug.
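The check-before-insert step and the health rule can be sketched without a database. Here existing stands in for the set of pair_slug values already stored, and planRun and isSuspicious are hypothetical names rather than the ETL's actual functions:

```typescript
interface RunPlan {
  toGenerate: string[]; // pair slugs that still need a Claude call
  skipped: number; // pair slugs already present in the DB
}

// Split candidate pair slugs into "already stored" and "needs generation".
function planRun(candidates: string[], existing: Set<string>): RunPlan {
  const toGenerate: string[] = [];
  let skipped = 0;
  for (const slug of candidates) {
    if (existing.has(slug)) {
      skipped++;
    } else {
      toGenerate.push(slug);
    }
  }
  return { toGenerate, skipped };
}

// A run that neither generated nor skipped anything processed no pairs at all.
function isSuspicious(generated: number, skipped: number): boolean {
  return generated === 0 && skipped === 0;
}

const plan = planRun(
  ["a--vs--b", "a--vs--c", "b--vs--c"],
  new Set(["a--vs--b", "b--vs--c"]),
);
console.log(plan.toGenerate); // [ 'a--vs--c' ]
console.log(plan.skipped); // 2
console.log(isSuspicious(0, 0)); // true
```

Counting in application code like this is what makes the generated/skipped log line possible; a bare INSERT OR IGNORE would hide which rows were new.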

Claude Haiku with system-prompt caching

I reuse the shared Haiku client I built in week one, which handles cacheSystem: true on the system prompt. Since the system prompt (the JSON schema instruction) is identical across all compare calls, the first call writes it to the cache and subsequent calls within the cache lifetime read that prefix at the discounted cache-read rate instead of paying the full input price for it.
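The shared client's internals aren't shown here, so the following is only a sketch of what a cacheSystem flag plausibly translates to, assuming the Anthropic Messages API's cache_control field on a system text block. buildRequest, the max_tokens value, and the placeholder model id are all assumptions for illustration:

```typescript
// One block of the system prompt; cache_control marks a cache breakpoint.
interface SystemBlock {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
}

// Only the request fields used in this sketch, not the full API surface.
interface MessagesRequest {
  model: string;
  max_tokens: number;
  system: SystemBlock[];
  messages: Array<{ role: "user" | "assistant"; content: string }>;
}

// Hypothetical builder: same system prompt on every call, optionally cacheable.
function buildRequest(
  model: string,
  systemPrompt: string,
  userPrompt: string,
  cacheSystem: boolean,
): MessagesRequest {
  const systemBlock: SystemBlock = { type: "text", text: systemPrompt };
  if (cacheSystem) {
    systemBlock.cache_control = { type: "ephemeral" };
  }
  return {
    model,
    max_tokens: 1024, // assumed ceiling; the compare output is short
    system: [systemBlock],
    messages: [{ role: "user", content: userPrompt }],
  };
}

const req = buildRequest(
  "<haiku-model-id>", // placeholder, not a real model id
  "Return a JSON object with summary, differences, similarities, recommendation.",
  "Compare these two AI models: ...",
  true,
);
console.log(JSON.stringify(req.system, null, 2));
```

Two caveats worth checking against the current API docs: prompts shorter than the model's minimum cacheable length are not cached at all, so a very short schema prompt may see no benefit, and the usage object in each response (cache_creation_input_tokens, cache_read_input_tokens) is the reliable way to confirm the cache is actually being hit.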

The user prompt includes both model names, their authors, pipeline tags, and up to 400 characters of their existing summaries (which come from the earlier content generation step):

const userPrompt = `Compare these two AI models:
A: ${a.name} (author: ${a.author ?? "unknown"}, pipeline: ${a.pipeline_tag ?? "unknown"})
   Summary: ${a.summary?.slice(0, 400) ?? "(none)"}
B: ${b.name} (author: ${b.author ?? "unknown"}, pipeline: ${b.pipeline_tag ?? "unknown"})
   Summary: ${b.summary?.slice(0, 400) ?? "(none)"}

Produce the JSON comparison.`;

Truncating summaries at 400 characters keeps the user prompt lean. Compare pages are about the delta between two models, not a rehash of each model individually. I already have dedicated model pages for depth; the compare page needs to answer "which one, for what" — that takes maybe 6 sentences total.

The system prompt requests a JSON object with summary, differences (array), similarities (array), and recommendation. Keeping the output shape narrow means Haiku rarely wanders off-schema.
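The CompareData type is used throughout the ETL code but never spelled out in this post. A reconstruction from the four field names above, with a strict runtime guard (isCompareData is a hypothetical helper and is stricter than the per-field fallback the ETL actually applies):

```typescript
// The output shape requested from Haiku, as a TypeScript type.
interface CompareData {
  summary: string;
  differences: string[];
  similarities: string[];
  recommendation: string;
}

function isStringArray(v: unknown): v is string[] {
  return Array.isArray(v) && v.every((x) => typeof x === "string");
}

// True only when every field is present with the expected type.
function isCompareData(v: unknown): v is CompareData {
  if (typeof v !== "object" || v === null) return false;
  const o = v as Record<string, unknown>;
  return (
    typeof o["summary"] === "string" &&
    isStringArray(o["differences"]) &&
    isStringArray(o["similarities"]) &&
    typeof o["recommendation"] === "string"
  );
}

console.log(
  isCompareData({
    summary: "A is smaller, B is more accurate.",
    differences: ["parameter count"],
    similarities: ["same tokenizer family"],
    recommendation: "Pick A for edge devices.",
  }),
); // true
console.log(isCompareData({ summary: "only a summary" })); // false
```

A guard like this is useful for counting how often the model output is fully on-schema, even if the page itself is built with softer per-field fallbacks.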

JSON parsing with a regex fence

Even with tight prompting, Haiku occasionally produces JSON with an explanation preamble: "Here is the comparison:" followed by the actual object. Strict JSON.parse on the raw output would throw. I extract the outermost {...} block with a regex before parsing:

function parseCompare(text: string, fb: CompareData): CompareData {
  try {
    const m = text.match(/\{[\s\S]*\}/);
    if (!m) return fb;
    const p = JSON.parse(m[0]);
    return {
      summary: typeof p.summary === "string" ? p.summary : fb.summary,
      differences: Array.isArray(p.differences)
        ? p.differences.map(String)
        : fb.differences,
      similarities: Array.isArray(p.similarities)
        ? p.similarities.map(String)
        : fb.similarities,
      recommendation:
        typeof p.recommendation === "string"
          ? p.recommendation
          : fb.recommendation,
    };
  } catch {
    return fb;
  }
}

Each field is validated individually before being accepted. If differences comes back as a string (occasional Haiku behavior when it conflates the array with a comma-separated list), the page falls back to the template for that field rather than crashing.
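To make the extraction step concrete, here is the same regex isolated in a small helper (extractJsonObject is a hypothetical name) and run against a preamble response and against one edge case the greedy match does not survive:

```typescript
// Take the span from the first "{" to the last "}" and try to parse it.
// Returns null when there is no such span or it is not valid JSON.
function extractJsonObject(text: string): unknown {
  const m = text.match(/\{[\s\S]*\}/);
  if (!m) return null;
  try {
    return JSON.parse(m[0]);
  } catch {
    return null;
  }
}

const withPreamble =
  'Here is the comparison:\n{"summary": "A is smaller.", "differences": "size, license"}';
const parsed = extractJsonObject(withPreamble) as {
  summary?: unknown;
  differences?: unknown;
};
console.log(typeof parsed.summary); // string
console.log(Array.isArray(parsed.differences)); // false, so this field gets the fallback

// The match is greedy, so a stray "}" after the object breaks parsing.
console.log(extractJsonObject('{"summary": "ok"} (see {docs})')); // null
```

The last case is the main limitation of this approach: any trailing commentary that contains a closing brace turns a valid object into a parse failure, which then falls through to the full fallback.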

The fallback struct is worth writing carefully. I spent five minutes on mine and it shows:

const fb: CompareData = {
  summary: `${a.name} and ${b.name} are both ${a.pipeline_tag} models. See each entry for specifics.`,
  differences: ["See individual model pages for architecture and use cases."],
  similarities: ["Both are open-source models on HuggingFace."],
  recommendation: "Pick based on your compute budget and specific task requirements.",
};

A user landing on a fallback-generated compare page gets a technically-true page that directs them to the model pages rather than a blank or error state. The model_used column in the DB records "fallback-template" for these rows, which I use to identify candidates for regeneration.
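Finding those regeneration candidates is a one-line filter. A sketch over in-memory rows, assuming only the pair_slug and model_used columns mentioned above (the "claude-haiku" value in the sample data is made up for illustration):

```typescript
// Only the columns this sketch needs from model_compare.
interface CompareRow {
  pair_slug: string;
  model_used: string;
}

const FALLBACK_MARKER = "fallback-template";

// Equivalent SQL, under the same column-name assumptions:
//   SELECT pair_slug FROM model_compare WHERE model_used = 'fallback-template'
function regenerationCandidates(rows: CompareRow[]): string[] {
  return rows
    .filter((r) => r.model_used === FALLBACK_MARKER)
    .map((r) => r.pair_slug);
}

const sampleRows: CompareRow[] = [
  { pair_slug: "llama-3--vs--mistral-7b", model_used: "claude-haiku" },
  { pair_slug: "sdxl--vs--stable-diffusion", model_used: FALLBACK_MARKER },
];
console.log(regenerationCandidates(sampleRows)); // [ 'sdxl--vs--stable-diffusion' ]
```

Because the nightly run skips any pair_slug that already exists, a retry pass has to either delete these rows first or bypass the exists-check for them; otherwise fallback rows stay in place indefinitely.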

Storage in libSQL and the static JSON dump

Compare data lives in a model_compare table in Turso libSQL, with a unique constraint on pair_slug. After the ETL loop, everything gets dumped to compare.json for the static build:

const all = await db.execute(
  `SELECT * FROM model_compare ORDER BY slug_a, slug_b`
);
const entries = all.rows.map((r) => ({
  slug_a: String(r.slug_a),
  slug_b: String(r.slug_b),
  pair_slug: String(r.pair_slug),
  summary: r.summary ? String(r.summary) : "",
  differences: r.differences ? JSON.parse(String(r.differences)) as string[] : [],
  similarities: r.similarities ? JSON.parse(String(r.similarities)) as string[] : [],
  recommendation: r.recommendation ? String(r.recommendation) : "",
}));
await writeFile("./src/data/compare.json", JSON.stringify(entries, null, 2));

The Astro build reads this JSON at build time, generating one static page per pair. No runtime DB calls, no cold starts. The tradeoff is freshness: compare content is up to 24 hours stale. For "llama 3.1 vs llama 3.2", that's fine — the models don't change daily.
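One fragile spot in the dump is the JSON.parse(...) as string[] cast: it trusts whatever is stored, so a single corrupt row would abort the whole export. A possible hardening, sketched as a standalone helper (parseStringArray is hypothetical, not part of the current script):

```typescript
// Parse a JSON-encoded string-array column. Returns [] for NULL, empty,
// invalid JSON, or JSON that is not an array, instead of throwing.
function parseStringArray(value: unknown): string[] {
  if (value === null || value === undefined || value === "") return [];
  try {
    const parsed: unknown = JSON.parse(String(value));
    return Array.isArray(parsed) ? parsed.map(String) : [];
  } catch {
    return [];
  }
}

console.log(parseStringArray('["size", "license"]')); // [ 'size', 'license' ]
console.log(parseStringArray(null)); // []
console.log(parseStringArray("size, license")); // []
```

In the mapping above, differences: parseStringArray(r.differences) would replace the ternary and the cast. The tradeoff is that bad rows become silently empty, so it is worth logging when the fallback path is taken.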

I validate the JSON-LD on compare pages through the post-deploy audit CI step the same way I do for individual model pages. Structured data matters more on comparison queries because those are the exact queries that AI Overviews tend to surface, so getting the schema right is worth the CI overhead.

The Astro slug generation for compare pages uses the pair_slug directly. The URL pattern is /compare/llama-3--vs--mistral-7b/, which is ugly but unambiguous — the double-dash separator makes it clear this is a two-part slug rather than a hyphen in a model name.
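A short round trip shows both properties of the slug: it is order-independent when built, and the double-dash separator makes it splittable again. Both function names are hypothetical wrappers around the one-liner shown earlier:

```typescript
const SEPARATOR = "--vs--";

// Canonical slug: sort the two model slugs, then join. The default sort
// compares strings by UTF-16 code units, so the result does not depend on locale.
function makePairSlug(slugA: string, slugB: string): string {
  return [slugA, slugB].sort().join(SEPARATOR);
}

// Split a pair slug back into its two model slugs. Returns null unless the
// separator appears exactly once with a non-empty slug on each side.
function splitPairSlug(slug: string): [string, string] | null {
  const parts = slug.split(SEPARATOR);
  const [left, right] = parts;
  if (parts.length !== 2 || !left || !right) return null;
  return [left, right];
}

const forward = makePairSlug("llama-3", "mistral-7b");
const reversed = makePairSlug("mistral-7b", "llama-3");
console.log(forward); // llama-3--vs--mistral-7b
console.log(forward === reversed); // true
console.log(splitPairSlug(forward)); // [ 'llama-3', 'mistral-7b' ]
console.log(splitPairSlug("llama-3")); // null
```

The split only stays unambiguous as long as no model slug itself contains --vs--, which seems a safe assumption for slugs derived from model names.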

What I'd change starting over

Generate cross-pipeline pairs from day one. The most useful compare queries aren't "llama 3.1 vs llama 3.2" — users who care about that distinction already know. The interesting queries are cross-category: "should I run inference on a text-generation model or use a RAG pipeline?" I skipped this to stay within the budget cap, but it means I'm missing the long-tail traffic that would actually be differentiated from generic model pages.

Drive pair selection from search query logs. Right now I pick pairs by download rank. A better signal would be which pairs users actually search for. Pagefind runs client-side and doesn't log queries to any server, so I'd need a thin logging endpoint — something like a POST to a GitHub Actions-triggered function that appends to a JSONL file. Then the ETL reads the top-N ungenerated pairs from the log. This is a small amount of infrastructure but it would make the pair selection much more demand-driven.
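A rough sketch of the ETL side of that idea, assuming a JSONL log where each line looks like {"q": "llama 3 vs mistral 7b"}. The log format, the slugify rule, and the function names are all hypothetical, and mapping a free-text query onto real model slugs would need more care than this:

```typescript
// Lowercase text to a slug: runs of non-alphanumerics become single dashes.
function slugify(s: string): string {
  return s.trim().toLowerCase().replace(/[^a-z0-9]+/g, "-").replace(/^-+|-+$/g, "");
}

// Count "X vs Y" queries and return the most requested pairs that do not
// exist yet, using the same sorted "--vs--" key as pair_slug.
function topRequestedPairs(jsonl: string, existing: Set<string>, limit: number): string[] {
  const counts = new Map<string, number>();
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue;
    let q: unknown;
    try {
      q = (JSON.parse(line) as { q?: unknown }).q;
    } catch {
      continue; // malformed line (or a JSON null): ignore it
    }
    if (typeof q !== "string") continue;
    const sides = q.toLowerCase().split(/\s+vs\.?\s+/);
    if (sides.length !== 2) continue;
    const slugs = sides.map(slugify);
    if (slugs.some((s) => s === "")) continue;
    const key = slugs.sort().join("--vs--");
    if (existing.has(key)) continue;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return Array.from(counts.entries())
    .sort((a, b) => b[1] - a[1] || (a[0] < b[0] ? -1 : 1)) // by count, then slug
    .slice(0, limit)
    .map(([key]) => key);
}

const log = [
  '{"q": "llama 3 vs mistral 7b"}',
  '{"q": "Mistral 7B vs Llama 3"}',
  '{"q": "whisper vs wav2vec2"}',
  '{"q": "sdxl vs stable diffusion"}',
  "not json",
].join("\n");
console.log(topRequestedPairs(log, new Set(["wav2vec2--vs--whisper"]), 5));
// [ 'llama-3--vs--mistral-7b', 'sdxl--vs--stable-diffusion' ]
```

The returned slugs would then feed the same generation loop as the download-ranked pairs, still subject to the budget cap.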

Raise the budget cap. MAX=50 is conservative. At current Haiku pricing with prompt caching, 500 pairs would cost roughly $0.10 per nightly run. I was cautious when I set the default, but I've watched the billing closely and the actual spend is a fraction of what I modeled. I'll bump this to 200 in the next ETL config update.
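For anyone redoing this estimate, the per-run cost is a simple function of token counts and prices. The sketch below uses placeholder numbers throughout (they are not current Haiku prices or measured token counts), and it ignores the one-time cache write on the first call:

```typescript
interface Pricing {
  inputPerMTok: number; // USD per million uncached input tokens
  outputPerMTok: number; // USD per million output tokens
  cacheReadPerMTok: number; // USD per million input tokens read from cache
}

interface CallShape {
  cachedSystemTokens: number; // system prompt tokens served from cache
  userTokens: number; // per-pair user prompt tokens
  outputTokens: number; // generated JSON tokens
}

// Estimated USD cost of generating `pairs` new compare pages in one run.
function estimateRunCost(pairs: number, shape: CallShape, p: Pricing): number {
  const perCall =
    (shape.cachedSystemTokens * p.cacheReadPerMTok +
      shape.userTokens * p.inputPerMTok +
      shape.outputTokens * p.outputPerMTok) /
    1e6;
  return pairs * perCall;
}

// Placeholder inputs: substitute real prices and measured token counts.
const placeholderPricing: Pricing = { inputPerMTok: 1, outputPerMTok: 2, cacheReadPerMTok: 0.5 };
const placeholderShape: CallShape = { cachedSystemTokens: 1000, userTokens: 1000, outputTokens: 1000 };
console.log(estimateRunCost(10, placeholderShape, placeholderPricing).toFixed(4));
```

Note that pairs here means newly generated pairs, not the cap: because existing pairs are skipped, raising the cap costs money once when the new pairs are first generated, not on every nightly run.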

The itch.io entries pattern I added to the indie-games directory taught me to plan for the second data source earlier. Compare pages have the same shape: a join between two rows. Getting the abstraction right before you have 500+ rows in the DB is much easier than retrofitting it.

FAQ

Does the ETL run every night even when no new models are added?

Yes, but it's nearly free when nothing is new. The check-before-insert means most nights it does 50 DB reads and exits in under 3 seconds without touching the Claude API. The console output shows generated: 0, skipped: 50, which is the signal that everything is up to date.

What happens when Claude returns malformed JSON?

parseCompare catches the error and returns the fallback struct. The row is still written to the DB with model_used = "fallback-template", which I can query to find rows worth retrying. In practice, this happens on maybe 2-3% of generations — usually when the two models have very sparse metadata and Haiku doesn't have enough context to produce structured output.

Does the compare.json file get unwieldy as pairs accumulate?

At 50 pairs it's roughly 25KB. At 500 pairs it'll be around 250KB — still fine for build-time loading in Astro. If I ever hit 5,000 pairs I'd split the file by pipeline_tag and lazy-import only the relevant subset for each page. For now, one flat JSON file is simpler and fast enough.

Why not compute compare content at request time with an edge function?

Cold starts and cost. An edge function hit for each compare page view would add 200-500ms of latency (Haiku inference + DB round trip) and would cost much more per-pageview than the nightly batch approach. The content also doesn't need to be fresher than daily — model capabilities don't shift on an hourly basis. Static precomputation is the right tradeoff here, consistent with the broader bet on static SSG I'm running on all three sites.

How do you handle the case where a model is removed from HuggingFace?

Right now, I don't. If model foo is deleted from HuggingFace but its compare rows are still in the DB, those compare pages will still be served at build time. They'll have the old data until the model's row in models.json is removed — which only happens if the model falls out of the top-500 in the nightly fetch. It's a known gap. For now, the risk is low; popular models don't disappear. A more robust system would cross-reference the compare table against the model table and tombstone orphaned pairs.
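The cross-reference itself is small. A sketch over in-memory data, where liveModelSlugs stands in for the slugs currently present in the model table (findOrphanedPairs is a hypothetical name, and what "tombstone" means in the DB is left open):

```typescript
// Only the fields this sketch needs from each compare entry.
interface CompareEntry {
  slug_a: string;
  slug_b: string;
  pair_slug: string;
}

// Return the pair_slugs where at least one side is no longer a known model.
function findOrphanedPairs(entries: CompareEntry[], liveModelSlugs: Set<string>): string[] {
  return entries
    .filter((e) => !liveModelSlugs.has(e.slug_a) || !liveModelSlugs.has(e.slug_b))
    .map((e) => e.pair_slug);
}

const entries: CompareEntry[] = [
  { slug_a: "foo", slug_b: "llama-3", pair_slug: "foo--vs--llama-3" },
  { slug_a: "llama-3", slug_b: "mistral-7b", pair_slug: "llama-3--vs--mistral-7b" },
];
console.log(findOrphanedPairs(entries, new Set(["llama-3", "mistral-7b"])));
// [ 'foo--vs--llama-3' ]
```

Running this just before the compare.json dump and excluding the returned slugs would keep orphaned pages out of the build without deleting their rows.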


Related: How I built a shared Claude Haiku client with system-prompt caching | Turso libSQL vs Cloudflare D1 for an Astro monorepo

Part of an ongoing 6-month experiment running three AI-curated directory sites. The technical claims here are real; this article was AI-assisted.
