You've done this. Everyone has.
You want a monitor. You check two reviews. One says 4/5. The other says 7.8/10. Both rate it highly, but for different reasons. One loves color accuracy. The other thinks it's fine, not professional-grade. Same product.
So you go deeper.
A YouTube video catches backlight bleed nobody wrote about. Reddit says the USB-C doesn't charge a MacBook. Amazon has 4.2 stars but recent reviews are full of dead pixels.
You now have a dozen sources. Each one useful. None agreeing. And you're less sure than when you started.
The problem isn't finding reviews. It's cross-referencing them. Does "vivid colors" from one reviewer mean the same as "excellent color reproduction" from another when one is watching movies and the other is editing photos? That mental mapping across every dimension you care about is what quietly eats two hours.
I kept doing this. So I wrote a pipeline. Because obviously the correct response to "I can't pick a monitor" is 3 LLM providers and a six-stage architecture.
The result is SetupScore. It cross-references 20-50 sources per product and produces a scored breakdown in about 90 seconds. Monitors, keyboards, headphones. Here's how it works, stage by stage.
Finding the sources
Brave Search API for web reviews, not Google (its API pricing ended that experiment fast). YouTube videos transcribed through Groq Whisper. Reddit threads pulled for the raw owner-experience takes you can't get from publications.
For a single monitor, that's typically 25-35 independent sources. Expert reviews from RTINGS, PCMag, Tom's Hardware and 160+ others. YouTube teardowns. Reddit threads where someone actually owns the thing.
The fun part: I'm feeding 20-minute YouTube videos into Whisper, converting them to full transcripts, then passing those to an LLM for structured extraction. A reviewer's offhand comment at minute 14 about USB-C wattage? That's now a searchable, classifiable claim. We live in weird times.
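One detail the pipeline has to solve somewhere: a 20-minute transcript is too long for a single extraction prompt, so it gets split first. A hedged sketch of that step, assuming Whisper-style segments with start times (the `Segment` and `Chunk` shapes here are mine, not the real pipeline's types); keeping the start time is what makes "minute 14" traceable later:

```typescript
interface Segment {
  start: number; // seconds into the video
  text: string;
}

interface Chunk {
  startMinute: number; // where this chunk begins, for claim provenance
  text: string;
}

// Split transcript segments into prompt-sized chunks, preserving timestamps.
function chunkTranscript(segments: Segment[], maxChars = 4000): Chunk[] {
  const chunks: Chunk[] = [];
  let buffer = "";
  let chunkStart = segments.length ? segments[0].start : 0;

  for (const seg of segments) {
    if (buffer && buffer.length + seg.text.length > maxChars) {
      chunks.push({ startMinute: Math.floor(chunkStart / 60), text: buffer.trim() });
      buffer = "";
      chunkStart = seg.start;
    }
    buffer += seg.text + " ";
  }
  if (buffer.trim()) {
    chunks.push({ startMinute: Math.floor(chunkStart / 60), text: buffer.trim() });
  }
  return chunks;
}
```

Each chunk then goes to the extraction prompt with its minute mark attached, so a claim can cite where in the video it came from.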
Making sense of 30 different ratings
This is the part I redesigned the most. Normalizing PCMag's 4.5/5 and RTINGS' 7.8/10 to the same scale sounds simple. It's not.
```typescript
ratingNormalizer.normalize("4.5/5", "5-star");    // → 90
ratingNormalizer.normalize("7.8/10", "10-point"); // → 78
ratingNormalizer.normalize("88%");                // → 88
```
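Under the hood, this is mostly pattern detection. A minimal sketch (`normalizeRating` is illustrative, not the real `ratingNormalizer` API; it covers the three formats shown, not all five scale types):

```typescript
// Detect the scale from the raw string and map it to 0-100.
function normalizeRating(raw: string): number {
  const percent = raw.match(/^([\d.]+)\s*%$/);
  if (percent) return parseFloat(percent[1]); // "88%" is already 0-100

  const ratio = raw.match(/^([\d.]+)\s*\/\s*([\d.]+)$/);
  if (ratio) {
    const value = parseFloat(ratio[1]);
    const max = parseFloat(ratio[2]);
    return Math.round((value / max) * 1000) / 10; // scale to 0-100, one decimal
  }
  throw new Error(`Unrecognized rating format: ${raw}`);
}
```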
Auto-detection across five scale types. But normalization is step zero. The problems start after.
Grade-inflation compression. Tech publications are generous. A 5/5 means "excellent", not "perfect." But 5/5 maps to 100. So every well-reviewed product becomes flawless, which is nonsense. Ratings above 90 get softened: 100 → 96.5, 95 → 93.25, 90 stays 90. Nothing below 90 changes.
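Those three numbers pin down a linear curve: above 90, a score only moves 65% as far past 90 as it used to. As a sketch (the function name is mine):

```typescript
// Soften grade inflation: 100 → 96.5, 95 → 93.25, 90 and below unchanged.
function compressGradeInflation(score: number): number {
  if (score <= 90) return score;
  return 90 + (score - 90) * 0.65;
}
```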
Outlier dampening. One angry Reddit rant shouldn't tank a score. One sponsored review shouldn't inflate it. Any rating more than 20 points from the credibility-weighted mean gets capped. No median gymnastics. Just a hard deviation cap. Crude, effective, debuggable.
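The cap is small enough to show whole. A sketch, assuming each rating carries a credibility weight (the `Rated` shape is hypothetical):

```typescript
interface Rated {
  score: number;       // normalized 0-100
  credibility: number; // source weight, e.g. an expert lab vs. a random blog
}

// Clamp any rating more than maxDeviation points from the
// credibility-weighted mean. No median gymnastics, just a hard cap.
function dampenOutliers(ratings: Rated[], maxDeviation = 20): Rated[] {
  const totalWeight = ratings.reduce((s, r) => s + r.credibility, 0);
  const mean =
    ratings.reduce((s, r) => s + r.score * r.credibility, 0) / totalWeight;

  return ratings.map((r) => ({
    ...r,
    score: Math.min(mean + maxDeviation, Math.max(mean - maxDeviation, r.score)),
  }));
}
```

The angry one-star rant still counts; it just can't drag the score more than 20 points below where everyone else landed.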
Source-count confidence. A single 5/5 review shouldn't produce a 95. One data point isn't consensus, it's a blog post. Products with few sources get pulled toward a baseline. As more sources arrive and agree, the score earns the right to be extreme.
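One simple way to implement that pull, as a sketch. The exact formula is an assumption, and the defaults (baseline of 70, full trust at 10 sources) are made up for illustration:

```typescript
// Blend toward a neutral baseline; trust the raw score more as
// independent sources accumulate.
function adjustForSourceCount(
  score: number,
  sourceCount: number,
  baseline = 70,    // hypothetical neutral prior
  fullTrustAt = 10, // hypothetical source count for full confidence
): number {
  const trust = Math.min(1, sourceCount / fullTrustAt);
  return baseline + (score - baseline) * trust;
}
```

With these numbers, a lone 100 lands at 73; the same 100 backed by ten sources stays a 100.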
The full sequence:
1. Normalize every rating to 0-100
2. Compress above 90 (grade-inflation)
3. Dampen outliers (±20 from weighted mean)
4. Credibility-weighted average (RTINGS > random blog)
5. Source-count confidence adjustment
→ Final score
Five steps. Each one exists because I shipped without it, got a stupid result, and had to go back.
Catching the bullshit
A score is only worth something if the data behind it is clean. Two problems here.
Bias detection. The LLM reads each source and flags sponsored content, affiliate-driven language, manufacturer relationships. Flagged sources don't get deleted. They get down-weighted in the aggregation. A sponsored review still has useful observations. It just doesn't get the same vote as an independent one. Think of it like code review: you still read the intern's PR, you just check it twice.
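The down-weighting itself can be one multiplier. A sketch, with a hypothetical `SourceClaim` shape and an illustrative 0.4 penalty (not the real value):

```typescript
interface SourceClaim {
  text: string;
  credibility: number;
  biasFlags: string[]; // e.g. ["sponsored"], filled in by the LLM pass
}

// Flagged sources keep their claims but lose voting power downstream.
function applyBiasPenalty(claim: SourceClaim, penalty = 0.4): SourceClaim {
  if (claim.biasFlags.length === 0) return claim;
  return { ...claim, credibility: claim.credibility * penalty };
}
```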
Consensus mapping. The CrossSourceAggregator doesn't average opinions. It maps them. When RTINGS, Hardware Unboxed, and three Reddit threads all praise color accuracy in different words, the pipeline recognizes that as one confirmed strength backed by five independent sources. When experts love the build quality but Amazon reviews across 4 markets report creaking after 6 months, that's a disagreement. Both sides get surfaced. Not averaged away. Visible.
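A minimal sketch of that grouping logic. The real aggregator does the semantic matching; this stand-in assumes claims already carry a discovered aspect label and a sentiment:

```typescript
interface AspectClaim {
  source: string;
  aspect: string; // e.g. "color accuracy", discovered per product
  sentiment: "positive" | "negative";
}

type Consensus =
  | { kind: "strength" | "weakness"; aspect: string; sources: number }
  | { kind: "disagreement"; aspect: string; positive: number; negative: number };

// Group claims by aspect; agreement becomes a confirmed strength or
// weakness, a split stays visible as a disagreement.
function mapConsensus(claims: AspectClaim[]): Consensus[] {
  const byAspect: Record<string, AspectClaim[]> = {};
  for (const c of claims) {
    (byAspect[c.aspect] ??= []).push(c);
  }
  return Object.keys(byAspect).map((aspect) => {
    const group = byAspect[aspect];
    const positive = new Set(group.filter((c) => c.sentiment === "positive").map((c) => c.source)).size;
    const negative = new Set(group.filter((c) => c.sentiment === "negative").map((c) => c.source)).size;
    if (positive > 0 && negative > 0) {
      return { kind: "disagreement" as const, aspect, positive, negative };
    }
    if (positive > 0) return { kind: "strength" as const, aspect, sources: positive };
    return { kind: "weakness" as const, aspect, sources: negative };
  });
}
```

Counting distinct sources (not raw claims) is the point: five outlets praising color accuracy in five different phrasings is one strength with five votes.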
Letting the product define itself
A single score is useful, but it hides a lot. An 87/100 monitor sounds great until you realize you're a photo retoucher who needs color accuracy and couldn't care less about gaming response times. That 87 doesn't tell you which parts are great and which parts don't matter to you.
The obvious fix is a per-category checklist. Monitors get Display Quality, Color Accuracy, Ports. Headphones get Sound Quality, ANC, Comfort. Define the list once per category, done. But even within a category, products are different. A studio headphone review talks about frequency response and soundstage. A wireless earbud review talks about call quality and Bluetooth range. Same category, completely different evaluation criteria.
So instead of defining what matters, I let the reviews decide. The LLM reads every extracted claim and discovers the relevant dimensions for each specific product. The Sony WH-1000XM5 surfaces ANC, Comfort, Sound Quality, Call Quality. The Beyerdynamic DT 900 Pro X surfaces Soundstage, Imaging, Build Quality, Clamping Force. Same category, different taxonomy, because reviewers talk about different things.
Each claim gets classified into its discovered aspect, scored by agreement weight (how many independent sources back the same observation), and assembled into per-dimension breakdowns.
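Assembling the breakdown once claims are classified is the easy part. A sketch with a hypothetical `ClassifiedClaim` shape:

```typescript
interface ClassifiedClaim {
  aspect: string; // discovered per product, not from a fixed taxonomy
  source: string;
  score: number;  // normalized 0-100 for this aspect, from one source
}

// Average each discovered aspect's per-source scores and record how many
// independent sources back it (the agreement weight).
function buildBreakdown(claims: ClassifiedClaim[]) {
  const byAspect: Record<string, ClassifiedClaim[]> = {};
  for (const c of claims) {
    (byAspect[c.aspect] ??= []).push(c);
  }
  return Object.keys(byAspect).map((aspect) => {
    const group = byAspect[aspect];
    return {
      aspect,
      score: group.reduce((s, c) => s + c.score, 0) / group.length,
      sourceCount: new Set(group.map((c) => c.source)).size,
    };
  });
}
```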
Running it without going broke
Not every task needs your most expensive model.
Moonshot → classification, extraction, tagging (~$0.012/1K tokens)
Claude → bias detection, synthesis, reasoning (~$3-15/1M tokens)
Groq → YouTube transcription (fast, cheap)
The LLMOrchestrator auto-checks Moonshot responses. If the result looks weak (too short, uncertain language, contradictions), it silently upgrades to Claude.
```typescript
if (this.needsClaudeFallback(response.result, task.type)) {
  response = await this.executeClaude(task);
}
```
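A guess at what sits behind that check. The thresholds and phrase list here are illustrative, and a real contradiction check would need semantics, not a regex:

```typescript
// Heuristic quality gate on the cheap model's output: too short, or
// hedging language that suggests it didn't actually know the answer.
function needsClaudeFallback(result: string, taskType: string): boolean {
  const minLength = taskType === "classification" ? 20 : 80; // illustrative
  if (result.trim().length < minLength) return true;
  return /\b(not sure|cannot determine|unclear|unknown)\b/i.test(result);
}
```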
Both providers have fallback API keys for rate limiting (429s). Total cost per product: $0.15-0.60. "Run the whole catalog on a weekend" money.
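The key rotation can be a small wrapper that retries only on 429 and re-throws everything else. A sketch, assuming provider errors carry an HTTP `status` field (how the real clients surface it may differ):

```typescript
// Try each API key in order; rotate only when the provider rate-limits us.
async function withKeyFallback<T>(
  call: (apiKey: string) => Promise<T>,
  keys: string[],
): Promise<T> {
  let lastError: unknown;
  for (const key of keys) {
    try {
      return await call(key);
    } catch (err) {
      lastError = err;
      if ((err as { status?: number }).status !== 429) throw err; // real errors propagate
    }
  }
  throw lastError; // every key was rate-limited
}
```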
The stack
TypeScript / Node.js
Claude / Moonshot / Groq Whisper
Brave Search API
Hugo static site
JSON files (no DB, fight me)
Cloudflare
Amazon Associates × 6 markets (runs AFTER scoring, different pipeline stage)
setupscore.com. Monitors, keyboards, headphones. Built because I'd rather spend a weekend automating than another evening cross-referencing tabs.
Whether you're a dev who'd do the scoring math differently or someone who just wants to know if a monitor is worth buying, I want to hear it. What would you improve? What product categories should come next? Comments or Bluesky / X.



