DEV Community

Daniel Rozin for Reviewiq.com

Posted on • Originally published at aversusb.net

Using Claude API to Generate Structured Product Comparisons at Scale

AI-generated content gets a bad reputation — and often deservedly so. Generic AI articles are everywhere. But there's a specific use case where AI generation genuinely shines: structured product comparisons.

At SmartReview, we use the Claude API to generate thousands of product comparison pages. Here's how we do it in a way that produces accurate, useful content rather than filler.

Why AI Works for Comparisons (and Where It Fails)

AI generation works well for comparisons because:

  1. The structure is fixed — every comparison has the same sections (key differences, attribute breakdown, verdict, FAQs)
  2. The data is enrichable — you can feed real specs, prices, and review data before generating
  3. The volume is high — there are millions of "X vs Y" queries; AI is the only scalable way to cover them

It fails when:

  • The prompt is vague ("compare these two products" → generic output)
  • There's no real data enrichment (hallucinated specs)
  • There's no structure enforcement (walls of text that don't help buyers decide)

The Prompt Architecture

Our prompts are structured in three layers:

Layer 1: System Context

```
You are a product comparison expert writing for buyers who are in the final decision stage. Your job is to help them decide, not to impress them with your knowledge.

Rules:
- Never hedge with "it depends" without giving a concrete tiebreaker
- Lead with the verdict — most readers want the answer first
- Use specific numbers from the provided data — never invent specs
- Flag any spec you are uncertain about with [unverified]
```

Layer 2: Enrichment Data

Before generating, we run parallel Tavily searches:

```typescript
const [vsData, entityAData, entityBData] = await Promise.all([
  searchTavily(`${entityA} vs ${entityB} comparison 2026`, 5),
  searchTavily(`${entityA} specs features price review 2026`, 3),
  searchTavily(`${entityB} specs features price review 2026`, 3)
]);
```

This gives Claude real, current data to work with. The difference in output quality between enriched and unenriched prompts is dramatic.

Layer 3: Structure Enforcement

We use a strict JSON output schema:

```typescript
interface ComparisonOutput {
  shortAnswer: string;          // 2-3 sentences max
  verdict: {
    winner: string;
    reason: string;             // One sentence
    bestFor: { entityA: string; entityB: string; };
  };
  keyDifferences: Array<{
    attribute: string;
    entityA: string;
    entityB: string;
    winner: string;
    importance: "critical" | "important" | "minor";
  }>;  // 5-7 items
  faqs: Array<{
    question: string;
    answer: string;             // 2-3 sentences
  }>;  // 5-8 items from PAA data
}
```

Claude is instructed to return only valid JSON. We validate with Zod before storing.

The Generation Call

```typescript
const response = await anthropic.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 2000,
  system: systemPrompt,
  messages: [{
    role: "user",
    content: `Generate a comparison for: ${entityA} vs ${entityB}

## Research Data
${JSON.stringify(enrichmentData, null, 2)}

## People Also Ask (from SERP data)
${paaQuestions.join("\n")}

Return valid JSON matching the schema. Use only data from the research above — mark anything uncertain as [unverified].`
  }]
});
```
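Even with "return only valid JSON" in the prompt, models occasionally wrap the payload in a markdown code fence, so it is worth stripping one defensively before parsing. A small helper of our own (not from the SmartReview codebase) that could sit between the API call and validation:

```typescript
// Extracts a JSON payload from a model response that may or may not be
// wrapped in a ```json ... ``` fence. Returns null on parse failure so
// the caller can retry the generation instead of throwing.
function extractJson(text: string): unknown | null {
  const fence = "`".repeat(3);
  const re = new RegExp(fence + "(?:json)?\\s*([\\s\\S]*?)" + fence);
  const match = text.match(re);
  const candidate = (match ? match[1] : text).trim();
  try {
    return JSON.parse(candidate);
  } catch {
    return null;
  }
}
```

In practice you would call this on the text block of `response.content` and hand the result to schema validation before storing.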

Quality Control

Raw AI output needs validation before serving:

  1. Spec verification — cross-reference generated specs against enrichment sources
  2. [unverified] flagging — any spec Claude couldn't confirm from enrichment data gets flagged visually on the page
  3. Freshness scoring — pages get a "confidence score" based on enrichment data recency; low-confidence pages trigger re-enrichment
  4. Human spot-checks — we manually review 5% of generations weekly, focused on high-traffic pages
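The freshness score in step 3 can be as simple as a decay over the age of the newest enrichment source. A hypothetical version, assuming a 30-day window and a linear decay (both our illustrative choices, not the production formula):

```typescript
const FRESHNESS_WINDOW_DAYS = 30; // illustrative 30-day window

// Confidence from 1 (fetched today) down to 0 (at or past the window).
// Linear decay is an assumption for the sketch.
function confidenceScore(enrichedAt: Date, now: Date = new Date()): number {
  const ageDays = (now.getTime() - enrichedAt.getTime()) / 86_400_000;
  return Math.max(0, Math.min(1, 1 - ageDays / FRESHNESS_WINDOW_DAYS));
}

// Low-confidence pages get queued for re-enrichment.
function needsReEnrichment(enrichedAt: Date, now?: Date, threshold = 0.2): boolean {
  return confidenceScore(enrichedAt, now) < threshold;
}
```

Whatever the exact curve, the point is that confidence is computed from data recency, not from the model's own self-assessment.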

Results

After running this pipeline for three months:

  • 40% of pages rank in the top 10 for their target "vs" keyword
  • Average comparison accuracy: 94% verified against manufacturer specs
  • Generation cost: ~$0.003 per comparison (enrichment + generation)
  • Regeneration trigger: Price change > 10%, new model launch, or 30-day freshness expiry
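The three regeneration triggers combine into a single check. A sketch using the thresholds from the list above (the `PageState` shape is our own, hypothetical):

```typescript
// Hypothetical per-page state; field names are ours, not from the post.
interface PageState {
  lastPrice: number;
  currentPrice: number;
  newModelLaunched: boolean;
  daysSinceEnrichment: number;
}

// Regenerate on any of: a price move over 10%, a new model launch, or
// enrichment data older than 30 days (the triggers listed above).
function shouldRegenerate(p: PageState): boolean {
  const priceDelta = Math.abs(p.currentPrice - p.lastPrice) / p.lastPrice;
  return priceDelta > 0.10 || p.newModelLaunched || p.daysSinceEnrichment > 30;
}
```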

The Key Insight

The biggest mistake teams make with AI content generation is treating the AI as the primary author. We treat it as the editor.

The pipeline is:
Real data → Structure → Claude → Validation → Human review

Not:
Topic → Claude → Publish

That distinction is what separates useful AI content from filler.


We're publishing this series on building SmartReview. Previous posts: Building Structured Product Comparisons with Next.js and AI and How Comparison Search Is Changing Consumer Behavior in 2026.

Questions about our pipeline? Drop a comment or find us at aversusb.net.
