Riskverdict

Posted on May 29 • Edited on May 30

I Built a SaaS Risk Scanner That Collects 35+ Signals Per Vendor. Here's What I Learned About Scraping, LLMs, and Solo Engineering.

#llm #saas #showdev #webscraping

I got into lifetime SaaS deals (LTDs) the way most people do - I bought a few on AppSumo and got burned. Not catastrophically, but enough to notice: there's zero objective information about whether a $149 "lifetime" deal will still exist a few years from now. You get the sales page, some reviews, and a gut feeling. The risks are real - vendors ghost, products stagnate, and lifetime suddenly means "until we decide otherwise."

So I built RiskVerdict - a platform that automatically collects 35+ risk signals per SaaS vendor, runs them through weighted scoring, and gives buyers a clear answer: is this deal risky or relatively safe?

This is the story of how it works under the hood, what broke along the way, and what I'd do differently.

The Origin

The original concept was simple: check GitHub commit activity, social media sentiment, and SaaS age as proxies for stability. Three signals. How hard could it be?

Hard. Those three signals turned into 35+ because reality is messy. A vendor with active GitHub commits can still be a terrible deal. A vendor with no public GitHub might be perfectly healthy. You need to look at the whole picture: legal documents, pricing behavior, team size, community sentiment, infrastructure health, founder presence.

Architecture Overview

The system has three phases:

Scrapers (data collection) -> Extractors (LLM analysis) -> Signals (scoring)

Then a composite layer aggregates all signals into weighted categories:

Leadership, Operations, Engineering, Organizational, Infrastructure, Legal

Each vendor goes through this pipeline automatically via Celery tasks.

The stack:

Backend: Python, FastAPI, Celery, Redis, PostgreSQL
Frontend: Next.js, Tailwind CSS
Scraping: httpx, curl_cffi, headless browser
LLM: OpenRouter gateway for extraction
Infrastructure: Docker Compose on a server

Challenge 1: Scraping at Scale (Or: Why Your httpx Request Gets Blocked)

The first version used plain httpx for everything. It worked... for about 20 vendors. Then Cloudflare, DataDome, and various WAFs started blocking requests. Python's httpx has a distinctive TLS fingerprint that anti-bot systems detect immediately.

The Escalation Pattern

I solved this with a three-tier escalation strategy that tries the cheapest approach first and only escalates when needed:

Simple HTTP (httpx) - fastest, cheapest
TLS-spoofed HTTP (curl_cffi impersonating Chrome) - still fast, bypasses basic TLS fingerprinting
Full browser as last resort - heaviest but most capable

Each tier validates the response. If validation fails (empty content, CAPTCHA page, redirect loop), it escalates to the next tier. The return value includes which stage succeeded, so you can track costs.
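A minimal sketch of that escalation loop, not the production code. The fetchers are stubs standing in for httpx, curl_cffi and the browser pool, and the names `fetch_with_escalation`, `looks_valid` and `CAPTCHA_MARKERS` are made up for illustration. Validation here only covers empty bodies and obvious challenge pages; redirect-loop detection is left out.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

# A fetcher takes a URL and returns the response body, or None on failure.
Fetcher = Callable[[str], Optional[str]]

CAPTCHA_MARKERS = ("captcha", "verify you are human")  # illustrative markers only


@dataclass
class FetchResult:
    body: Optional[str]
    stage: Optional[str]  # name of the tier that succeeded, None if all failed


def looks_valid(body: Optional[str]) -> bool:
    """Reject empty responses and obvious challenge pages."""
    if not body or not body.strip():
        return False
    lowered = body.lower()
    return not any(marker in lowered for marker in CAPTCHA_MARKERS)


def fetch_with_escalation(url: str, tiers: Sequence[Tuple[str, Fetcher]]) -> FetchResult:
    """Try each tier in order, cheapest first; stop at the first valid response."""
    for name, fetch in tiers:
        try:
            body = fetch(url)
        except Exception:
            body = None  # treat transport errors as a failed tier
        if looks_valid(body):
            return FetchResult(body=body, stage=name)
    return FetchResult(body=None, stage=None)


# Stub fetchers so the sketch runs without network access.
def plain_http(url: str) -> Optional[str]:
    return "<html>Please complete the CAPTCHA</html>"


def tls_spoofed(url: str) -> Optional[str]:
    return "<html><h1>Pricing</h1></html>"


def browser(url: str) -> Optional[str]:
    return "<html><h1>Pricing (rendered)</h1></html>"


result = fetch_with_escalation(
    "https://vendor.example/pricing",
    [("httpx", plain_http), ("curl_cffi", tls_spoofed), ("browser", browser)],
)
print(result.stage)  # the plain tier hits a CAPTCHA page, so this prints: curl_cffi
```

Returning the stage name alongside the body is what makes per-tier cost tracking possible: counting stages over a run shows how often each tier was actually needed.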

The key insight: most sites don't need a browser. Out of ~275 vendors, roughly 70% respond to a plain or TLS-spoofed HTTP request. Only the remaining 30% need a full browser. Escalation saves enormous resources.

Domain Protection Caching

Once a domain is known to use Cloudflare or CAPTCHAs, there's no point trying a static fetch on future requests. I cache this in Redis - on subsequent runs, protected domains skip straight to the browser tier.

This small optimization cut scraping time by ~40% on subsequent runs because we skip the guaranteed-to-fail static attempt for protected domains.
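Here is a rough sketch of the idea, with an in-memory dict standing in for Redis so it runs anywhere. The class name, the `starting_tier` helper and the one-week expiry are all assumptions for illustration; the expiry is there so a domain gets re-probed eventually in case it drops its protection.

```python
import time
from typing import Dict, Optional, Tuple
from urllib.parse import urlparse


class ProtectionCache:
    """In-memory stand-in for a Redis cache of protected domains."""

    def __init__(self, ttl_seconds: float = 7 * 24 * 3600) -> None:
        self._ttl = ttl_seconds  # hypothetical expiry, not a value from the article
        self._entries: Dict[str, Tuple[str, float]] = {}  # domain -> (reason, expiry)

    def mark_protected(self, url: str, reason: str) -> None:
        domain = urlparse(url).netloc.lower()
        self._entries[domain] = (reason, time.monotonic() + self._ttl)

    def protection(self, url: str) -> Optional[str]:
        """Return the cached protection reason, or None if unknown or expired."""
        domain = urlparse(url).netloc.lower()
        entry = self._entries.get(domain)
        if entry is None:
            return None
        reason, expiry = entry
        if time.monotonic() >= expiry:
            del self._entries[domain]  # expired: let the domain be re-probed
            return None
        return reason


def starting_tier(cache: ProtectionCache, url: str) -> str:
    """Protected domains skip the cheap tiers and start at the browser."""
    return "browser" if cache.protection(url) else "httpx"
```

The cache is keyed by domain rather than by URL, so one blocked page is enough to route every later request for that vendor to the browser tier.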

Browser Pool: The Expensive Resource

A headless browser is effective but expensive - roughly 100MB RAM per instance. I can't spin up a browser per request. So I built a pool with a fixed number of browsers, each with its own proxy and geo-matched fingerprint (timezone, locale, WebRTC all match the proxy's exit IP).

Browsers rotate after hitting limits: N requests, M unique domains, or X minutes of age. When a browser rotates, it gets killed and replaced with a fresh one using a different proxy and fingerprint.

Critical lesson: proxy rotation must happen at the browser level, not the request level. If you rotate the proxy per request while keeping the same browser fingerprint, anti-bot systems detect it instantly - same fingerprint, different IP is impossible for a real user. Each browser gets one proxy for its entire lifetime. When you need a new proxy, you kill the browser and spawn a new one with a fresh fingerprint.
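The rotation rule can be sketched like this. The limit values are placeholders for N, M and X, and `PooledBrowser`, `RotationLimits` and `rotate_if_needed` are hypothetical names; a real pool would also have to shut down the old browser process.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional, Set


@dataclass(frozen=True)
class RotationLimits:
    max_requests: int = 50         # "N requests" (placeholder value)
    max_domains: int = 10          # "M unique domains" (placeholder value)
    max_age_seconds: float = 900   # "X minutes of age" (placeholder value)


@dataclass
class PooledBrowser:
    proxy: str        # fixed for the browser's whole lifetime
    fingerprint: str  # generated to match the proxy's exit location
    created_at: float = field(default_factory=time.monotonic)
    requests: int = 0
    domains: Set[str] = field(default_factory=set)

    def record(self, domain: str) -> None:
        self.requests += 1
        self.domains.add(domain)

    def should_rotate(self, limits: RotationLimits, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        return (
            self.requests >= limits.max_requests
            or len(self.domains) >= limits.max_domains
            or now - self.created_at >= limits.max_age_seconds
        )


def rotate_if_needed(
    browser: PooledBrowser,
    limits: RotationLimits,
    spawn: Callable[[], PooledBrowser],
) -> PooledBrowser:
    """Replace an exhausted browser wholesale: new proxy and fingerprint together."""
    if browser.should_rotate(limits):
        # A real pool would close the old browser process here.
        return spawn()
    return browser
```

Note that nothing in this sketch can change `proxy` without also replacing `fingerprint`: the only way to get a new proxy is to get a whole new `PooledBrowser`.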

Challenge 2: Making LLM Extraction Actually Reliable

This was the hardest part, and it took the most iteration. The plan was simple: scrape a page, send the HTML to an LLM, get structured JSON back. Reality was not simple.

Problem 1: LLMs Hallucinate Confidently

Early versions would report things like "founder: John Smith" when the About page said nothing of the sort. The model was filling in plausible-sounding information.

Fix: Require verbatim evidence for every finding. Every extracted field must include the exact source text that justifies it. This serves double duty: it's a hallucination guard (a quote that doesn't appear in the source can be rejected automatically) and it's auditable (you can verify the reasoning against the source).

In production, this catches misattributions like confusing the company Twitter account with the founder's personal account - the evidence quote reveals the mismatch.
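A small sketch of the evidence check, assuming each finding is a dict with an `evidence` key (that shape is my assumption, not the real schema). Matching is whitespace- and case-insensitive so that line wrapping in the scraped text doesn't cause false rejections.

```python
import re
from typing import Dict, List


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so formatting differences don't matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def verify_evidence(findings: List[Dict[str, str]], source_text: str) -> List[Dict[str, str]]:
    """Keep only findings whose evidence quote actually appears in the source."""
    haystack = normalize(source_text)
    kept = []
    for finding in findings:
        quote = normalize(finding.get("evidence", ""))
        if quote and quote in haystack:
            kept.append(finding)
    return kept


page = "Acme was started in 2019.\nOur team of five is based in Lisbon."
findings = [
    {"field": "founded", "value": "2019", "evidence": "Acme was started in 2019."},
    {"field": "founder", "value": "John Smith", "evidence": "Founded by John Smith"},
]
print([f["field"] for f in verify_evidence(findings, page)])  # ['founded']
```

This only proves the quote exists in the source, not that it supports the extracted value, so the quotes are still worth reading during audits.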

Problem 2: Vertical Prompts Beat Monolithic Prompts

My first approach was one big prompt: "Extract everything from this pricing page - prices, features, billing terms, refund policy, trial info, add-ons, usage limits." On cheaper models (which I need for cost reasons at 275 vendors), this produces shallow results across the board.

Fix: One focused prompt per extraction type. A pricing prompt extracts pricing. A billing prompt extracts billing terms. A legal prompt extracts legal clauses. Each prompt is narrow enough that even a non-SOTA model handles it well.

This matters for cost: processing 275 vendors with a cheap model + vertical prompts produces better results than an expensive model + a monolithic prompt. The narrow scope compensates for the model's limitations.

Problem 3: Pre-Process Before the LLM

A typical vendor page is 50-100KB of HTML. Most of it is navigation, footer links, scripts, cookie banners. If you send raw HTML to the LLM, you waste tokens on noise and the model gets distracted.

The math: 50KB of HTML reduces to 5-15KB of actual content after CSS-based extraction. At roughly 4 chars/token, that's the difference between ~12.5K tokens and roughly 1-4K tokens per page. Across 275 vendors, this saves real money.

I strip everything that isn't semantic content (headings, paragraphs, tables, lists) before it reaches the LLM.
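To make the stripping step concrete, here is a sketch that uses only the standard library's `html.parser` instead of CSS selectors, so it is a simplified swap-in rather than the approach described above. The tag lists and the `extract_content` and `estimate_tokens` helpers are illustrative assumptions, and the parser expects reasonably well-formed HTML (explicit closing tags).

```python
from html.parser import HTMLParser
from typing import List

KEEP = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "li", "td", "th"}
SKIP = {"script", "style", "nav", "footer", "header", "noscript", "svg"}


class ContentExtractor(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.blocks: List[str] = []
        self._skip_depth = 0  # > 0 while inside a SKIP element
        self._keep_depth = 0  # > 0 while inside a KEEP element
        self._buffer: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self._skip_depth += 1
        elif tag in KEEP and self._skip_depth == 0:
            self._keep_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in KEEP and self._keep_depth > 0 and self._skip_depth == 0:
            self._keep_depth -= 1
            if self._keep_depth == 0:
                # Outermost content element closed: flush its text as one block.
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.blocks.append(text)
                self._buffer = []

    def handle_data(self, data):
        if self._skip_depth == 0 and self._keep_depth > 0:
            self._buffer.append(data)


def extract_content(raw_html: str) -> str:
    parser = ContentExtractor()
    parser.feed(raw_html)
    parser.close()
    return "\n".join(parser.blocks)


def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Rough token estimate using the ~4 chars/token rule of thumb."""
    return len(text) // chars_per_token


raw_html = """
<html><head><script>var x = 1;</script></head><body>
<nav><ul><li>Home</li><li>Blog</li></ul></nav>
<h1>Pricing</h1>
<p>Lifetime deal: <b>$149</b> one-time.</p>
<footer><p>Cookie settings</p></footer>
</body></html>
"""
print(extract_content(raw_html))
```

On this toy page the navigation, script and footer disappear and only the heading and the pricing paragraph survive, which is the same effect as the 50KB to 5-15KB reduction, just at a smaller scale.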

Problem 4: Consistency Matters More Than Accuracy

This was counterintuitive. I'd run an extraction, get a great result, and think "this works!" Then I'd run the same extraction again and get a different answer. And again, different.

Fix: Run the same extraction 3-5 times, measure agreement. If a field has less than 80% agreement across runs, that prompt section needs rework. An LLM that's consistently wrong on one field is fixable (adjust the prompt). An LLM that's randomly wrong is not fixable.

The tuning loop:

Write prompt with explicit checklist items
Run eval on 50+ real examples (not toy examples)
Check for four failure modes: hallucinated findings, missing findings, wrong format, inconsistent results
Fix ONE issue at a time
Measure consistency at 3x, then 5x
Loop until >= 80% stability
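The consistency measurement in that loop can be sketched as follows. I'm assuming each run returns a flat dict of fields and defining agreement as the share of runs that returned the most common value; the function names and sample fields are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, List


def field_agreement(runs: List[Dict[str, Any]]) -> Dict[str, float]:
    """For each field, the share of runs that returned the most common value."""
    fields = {name for run in runs for name in run}
    agreement = {}
    for name in sorted(fields):
        # repr() makes unhashable values countable; a missing field counts as None.
        values = [repr(run.get(name)) for run in runs]
        top_count = Counter(values).most_common(1)[0][1]
        agreement[name] = top_count / len(runs)
    return agreement


def unstable_fields(runs: List[Dict[str, Any]], threshold: float = 0.8) -> List[str]:
    """Fields whose agreement falls below the threshold and need prompt rework."""
    return [name for name, share in field_agreement(runs).items() if share < threshold]


# Five runs of the same extraction on the same page.
runs = [
    {"refund_days": 60, "billing": "one-time"},
    {"refund_days": 60, "billing": "one-time"},
    {"refund_days": 30, "billing": "one-time"},
    {"refund_days": 60, "billing": "one-time"},
    {"refund_days": 14, "billing": "one-time"},
]
print(unstable_fields(runs))  # ['refund_days']
```

Here `billing` agrees in 5 of 5 runs while `refund_days` agrees in only 3 of 5 (60%), so only the refund part of the prompt gets flagged.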

The Extraction Pipeline

Putting it all together, each LLM extraction goes through a 3-phase fallback:

Structured output with JSON Schema enforcement (the model is forced to produce valid JSON)
Correction prompt if Phase 1 fails - shows the model its previous error so it can fix it
Regex fallback - extract JSON from raw text if the model refuses to produce structured output

Each phase validates against the schema. If all phases fail, the extraction returns null and callers fall back to heuristic logic. No single LLM failure crashes the pipeline.
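A stripped-down sketch of the three phases. `call_model` is a hypothetical wrapper around whatever LLM client is in use, and a required-keys check stands in for real JSON Schema validation; phase 1 would normally also pass the schema to the provider, which is not shown here.

```python
import json
import re
from typing import Any, Callable, Dict, Optional

ModelCall = Callable[[str], str]       # prompt -> raw model text (hypothetical wrapper)
REQUIRED_KEYS = {"plan", "price_usd"}  # toy stand-in for a real JSON Schema


def validate(candidate: Any) -> bool:
    return isinstance(candidate, dict) and REQUIRED_KEYS <= candidate.keys()


def parse_json(raw: str) -> Optional[Dict[str, Any]]:
    """Parse and validate; None if the text is not valid JSON or fails the schema."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return parsed if validate(parsed) else None


def regex_fallback(raw: str) -> Optional[Dict[str, Any]]:
    """Pull the outermost {...} span out of free text and try to parse it."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    return parse_json(match.group(0)) if match else None


def extract(prompt: str, call_model: ModelCall) -> Optional[Dict[str, Any]]:
    # Phase 1: ask for structured output and validate it.
    raw = call_model(prompt)
    result = parse_json(raw)
    if result is not None:
        return result

    # Phase 2: show the model its invalid reply and ask again.
    correction = (
        f"{prompt}\n\nYour previous reply was invalid:\n{raw}\n"
        "Return only valid JSON."
    )
    raw_retry = call_model(correction)
    result = parse_json(raw_retry)
    if result is not None:
        return result

    # Phase 3: regex over the raw text; None means the caller uses heuristics.
    return regex_fallback(raw_retry) or regex_fallback(raw)
```

Because `extract` returns `None` instead of raising, a caller can write `extract(prompt, call_model) or heuristic_result` and keep the pipeline moving.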

Challenge 3: Scoring Without Bias

Early scoring was a mess. I had signal scores, confidence scores, category weights, and bonus points flying around. Two specific problems:

The Double-Counting Trap

If a signal contributed points in multiple categories simultaneously, the system produced scores above 100 on a 100-point scale. This actually happened.

The fix isn't capping at 100. Capping hides the problem. The fix is auditing every input to every score component and removing duplicates.
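One way to run that audit automatically is a check that fails when the same signal feeds more than one category. The category-to-signals mapping and the signal names below are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def find_double_counted(category_signals: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return every signal that appears in more than one category."""
    seen: Dict[str, List[str]] = defaultdict(list)
    for category, signals in category_signals.items():
        for signal in signals:
            seen[signal].append(category)
    return {signal: cats for signal, cats in seen.items() if len(cats) > 1}


# Hypothetical wiring with one signal counted twice.
wiring = {
    "Engineering": ["commit_activity", "release_cadence"],
    "Operations": ["support_response", "release_cadence"],
    "Legal": ["refund_policy"],
}
print(find_double_counted(wiring))  # {'release_cadence': ['Engineering', 'Operations']}
```

Running a check like this at startup or in CI surfaces the duplicate itself, where a cap at 100 would only hide its symptom.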

Confidence Is Not Score

Score = "how risky is this signal" (0-1). Confidence = "how sure am I about this score" (0-1). These are different:

Low confidence + high score = "I think this is bad but I don't have enough data" - needs more scraping
High confidence + low score = "I'm sure this is fine" - reliable signal
Low confidence + low score = "not enough data, treat as neutral" - NOT the same as "safe"

Confusing the two means treating "no data" as "good signal", which is wrong for buyers counting on this analysis.

The Composite System

I settled on 6 weighted categories. Each category has its own set of signals with intra-category weights. A signal's contribution is its score multiplied by its confidence multiplied by its weight within the category. Signals with zero confidence are skipped entirely.

Weights are calibrated from real data across 55 projects. The important thing: I keep complex scoring internal and show simple results external. Buyers see "Low Risk" or "High Risk" with supporting evidence, not matrix math.
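A sketch of how such a composite could be computed. Each signal contributes score x confidence x weight and zero-confidence signals are skipped, as described above. The normalization is my assumption: dividing by the sum of confidence x weight keeps a low-confidence signal from dragging the category toward "safe". The signal names and category weights are made up.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Signal:
    name: str
    category: str
    score: float       # 0-1, how risky
    confidence: float  # 0-1, how sure
    weight: float      # weight within its category


def category_score(signals: List[Signal]) -> Optional[float]:
    """Confidence-weighted risk for one category; None if nothing is usable."""
    usable = [s for s in signals if s.confidence > 0]  # skip zero-confidence signals
    denominator = sum(s.confidence * s.weight for s in usable)
    if denominator == 0:
        return None  # no data is reported as "unknown", never as "safe"
    return sum(s.score * s.confidence * s.weight for s in usable) / denominator


def composite(signals: List[Signal], category_weights: Dict[str, float]) -> Optional[float]:
    """Weighted average over the categories that have usable data."""
    by_category: Dict[str, List[Signal]] = {}
    for s in signals:
        by_category.setdefault(s.category, []).append(s)
    total, weight_sum = 0.0, 0.0
    for category, weight in category_weights.items():
        score = category_score(by_category.get(category, []))
        if score is None:
            continue
        total += weight * score
        weight_sum += weight
    return total / weight_sum if weight_sum else None


signals = [
    Signal("refund_policy", "Legal", score=0.8, confidence=0.9, weight=2.0),
    Signal("tos_changes", "Legal", score=0.2, confidence=0.0, weight=1.0),  # skipped
    Signal("commit_activity", "Engineering", score=0.4, confidence=0.5, weight=1.0),
]
overall = composite(signals, {"Legal": 0.5, "Engineering": 0.5, "Leadership": 0.3})
```

In this example Legal comes out at about 0.8 and Engineering at 0.4, so `overall` is about 0.6. Leadership has no signals and is left out of the average rather than counted as zero risk.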

Challenge 4: Solo Founder Infrastructure

I run this on a home server with Docker Compose. Every dollar matters, so I optimize for cost.

The Docker Stack

8 services: backend, frontend, postgres, redis, celery worker, celery beat, nginx proxy, and a Cloudflare tunnel container. Total resource usage: ~3-4GB RAM with 2 browser instances in the pool.

The backend container runs with a read-only filesystem, all capabilities dropped, and no-new-privileges. There's no reason a Python API needs write access to the filesystem.

Celery Workflows

The scraping pipeline uses Celery chords - fan out all scrapers in parallel, then run extractors when all finish, then run signals when extractors finish:

Scraper tasks (parallel) - each source scraped once per project
Extractor tasks (parallel) - LLM extractors run once per project
Signal tasks (parallel) - read from store, compute scores
Composite callback - aggregate into final scores

Dependencies are resolved automatically from signal metadata - each signal declares which scrapers it needs, and the workflow builder computes the minimal task set. No scraper runs twice for the same project.
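The dependency resolution step can be sketched without Celery at all. The metadata table and the `plan_workflow` name are hypothetical; in the real pipeline each stage of the resulting plan would be turned into parallel Celery tasks joined by chord callbacks, which is omitted here.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical metadata: which scrapers each signal declares it needs.
SIGNAL_REQUIREMENTS: Dict[str, Set[str]] = {
    "commit_activity": {"github"},
    "refund_policy": {"pricing_page", "terms_page"},
    "price_changes": {"pricing_page"},
}


def plan_workflow(signals: Iterable[str]) -> Dict[str, List[str]]:
    """Compute the minimal set of scraper tasks for the requested signals."""
    requested = sorted(set(signals))
    scrapers: Set[str] = set()
    for name in requested:
        scrapers |= SIGNAL_REQUIREMENTS[name]  # set union dedupes shared scrapers
    return {"scrapers": sorted(scrapers), "signals": requested}


plan = plan_workflow(["refund_policy", "price_changes"])
print(plan["scrapers"])  # ['pricing_page', 'terms_page']
```

Both signals need `pricing_page`, but it appears once in the plan, and `github` is not scheduled at all because no requested signal depends on it.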

Cost Structure

There are no fixed monthly costs except the domain name. Everything runs on self-hosted Docker. Data comes from public scraping. LLM extraction costs roughly $0.05 per vendor via OpenRouter. At 275 vendors, that's ~$14 per full analysis run (275 x $0.05 = $13.75).

Every $7 report is essentially pure profit. No burn rate, no runway countdown.

What I'd Do Differently

Start with the scoring model earlier. I spent months building scrapers and extractors before figuring out how to combine the data. The scoring model should have been designed first - it would have told me which signals matter most and which I could skip.

Test against real anti-bot from day one. Plain httpx worked in development. It fell apart in production against real sites with real WAFs. I should have tested against Cloudflare-protected sites from the start.

Don't let the LLM do math. Early versions asked the LLM to compute risk scores. The same input produced different numbers across runs. Now the LLM classifies (low/medium/high) and code computes the score. Categorical outputs from the LLM, deterministic math in Python.

Measure consistency, not just accuracy. One good extraction result means nothing. Run the same input 5 times and check agreement. Below 80% consistency, the extraction is unreliable for production.
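The "don't let the LLM do math" point boils down to a lookup table on the Python side. The numeric values below are placeholders, not the calibrated ones, and `score_from_label` is a hypothetical name.

```python
from typing import Dict, Optional

# Hypothetical mapping from the LLM's categorical answer to a 0-1 risk score.
RISK_POINTS: Dict[str, float] = {"low": 0.1, "medium": 0.5, "high": 0.9}


def score_from_label(label: str) -> Optional[float]:
    """Deterministic score for a label; unknown labels yield None, not a guess."""
    return RISK_POINTS.get(label.strip().lower())


print(score_from_label("High"))    # 0.9
print(score_from_label("severe"))  # None
```

The same label always maps to the same number, so any run-to-run variation is confined to the classification step, where the consistency check can catch it.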

The Registry Pattern

One architectural decision that paid off: the registry pattern for signals. Each signal self-registers via a decorator when its module is imported. Adding a new signal means creating a new file and adding one import line. No central list to maintain, no manual wiring. The workflow builder reads the registry and computes dependencies automatically.

This made it easy to go from 3 signals to 35+ without the codebase turning into spaghetti.
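A minimal version of that decorator-based registry, with made-up names (`register_signal`, `SIGNAL_REGISTRY`, `domain_age`) and an illustrative threshold inside the example signal:

```python
from typing import Callable, Dict, FrozenSet, NamedTuple, Tuple


class SignalSpec(NamedTuple):
    func: Callable[[dict], float]
    requires: FrozenSet[str]  # scrapers this signal depends on


SIGNAL_REGISTRY: Dict[str, SignalSpec] = {}


def register_signal(name: str, requires: Tuple[str, ...] = ()):
    """Decorator that adds a signal to the registry when its module is imported."""
    def decorator(func: Callable[[dict], float]) -> Callable[[dict], float]:
        if name in SIGNAL_REGISTRY:
            raise ValueError(f"duplicate signal name: {name}")
        SIGNAL_REGISTRY[name] = SignalSpec(func, frozenset(requires))
        return func  # the function itself is left unchanged
    return decorator


# In a real layout this would live in its own file, e.g. one module per signal.
@register_signal("domain_age", requires=("whois",))
def domain_age(data: dict) -> float:
    years = data["whois"]["age_years"]
    return 0.9 if years < 1 else 0.2  # illustrative threshold only


print(sorted(SIGNAL_REGISTRY))  # ['domain_age']
```

Because each entry carries its `requires` set, a workflow builder can read the registry to work out which scrapers to run without any hand-maintained list.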

What's Next

RiskVerdict is live and processing vendors. The current focus is SEO-driven growth - programmatic pages for every vendor targeting "alternative to [SaaS]" keywords, comparison pages for side-by-side vendor analysis, and a weekly editorial digest that synthesizes market trends. If you want to see the buyer-facing side, the AppSumo alternatives guide is a good starting point.

If you're building something similar, the key lessons:

Escalation-based fetching saves enormous resources vs. always using a browser
Vertical LLM prompts on cheap models beat monolithic prompts on expensive models
Evidence requirements are the primary hallucination guard
Consistency measurement is as important as accuracy
Keep complex scoring internal, show simple results external

The code isn't open source (it's a product), but I'm happy to answer questions about any specific part of the architecture.
