
Alexander Leonhard


Why building a job scraper for $0.39/1,000 jobs is not about the money.

I needed thousands of job postings in OJP v0.2 schema. Not a handful for a demo — enough volume that cost per posting had to disappear as a line item.

The existing options didn't work for me. Commercial scrapers price per-posting at numbers that assume you're an ATS vendor passing the cost to customers. Open-source ones want you to write a custom adapter per career page, which is its own slow failure mode. Neither fits what a protocol layer actually needs: cheap, fresh, and structured the same way across every board.

So I built my own in a single session. One production run: 887 postings across 4 boards.

| Metric | Value |
| --- | --- |
| Cost | $0.39 / 1,000 postings |
| Throughput | 3.9s / job |
| Success rate | 77% raw, 95%+ after retry |

Most of the "hard" parts weren't the LLM. The LLM call is the cheap part. Everything around it is where the cost and the success rate actually live.


The architecture

```
scrape-jobs.json (queue + status)
        │
        ▼
┌─────────────────────────┐
│ Playwright browser      │  ← stealth context, one per worker
└─────────┬───────────────┘
          │
   fetch + strip HTML → ~6K tokens of clean text
          │
          ▼
┌─────────────────────────┐
│ Gemini Flash-Lite       │  ← ~$0.0004/call
│ (OJP v0.2 extraction)   │
└─────────┬───────────────┘
          │
   sanitize + validate (JSON Schema)
          │
          ▼
     results.json
```

BFS queue. Listing pages discover individual posting URLs, add them as pending, and the loop runs until empty. Status lives in the input file itself so runs resume cleanly after a crash.
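A minimal sketch of that loop, assuming a queue file of `{"url", "kind", "status"}` entries; the actual scraper's queue format may differ:

```python
import json
from pathlib import Path

QUEUE_FILE = Path("scrape-jobs.json")  # hypothetical entry shape: {"url", "kind", "status"}

def save_queue(queue, path=QUEUE_FILE):
    path.write_text(json.dumps(queue, indent=2))

def run(queue, scrape_one, discover_links, save=save_queue):
    """BFS loop: listing pages enqueue posting URLs; status is written back
    after every item, so a crashed run resumes from where it stopped."""
    while any(item["status"] == "pending" for item in queue):
        item = next(i for i in queue if i["status"] == "pending")
        try:
            if item.get("kind") == "listing":
                known = {i["url"] for i in queue}
                for url in discover_links(item["url"]):
                    if url not in known:
                        queue.append({"url": url, "kind": "posting", "status": "pending"})
            else:
                scrape_one(item["url"])
            item["status"] = "done"
        except Exception:
            item["status"] = "failed"
        save(queue)  # persist status in the input file itself
    return queue
```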

Five fallback layers, each of which kicks in when the previous one fails: DOM link scraping → heuristic job-container detection → visual navigator (screenshot + vision model picks clickable selectors) → schema sanitizer → full vision retry on extraction failures.
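The chain itself can be as simple as an ordered list of callables. A hedged sketch, not the actual implementation:

```python
def run_with_fallbacks(url, layers):
    """Try each extraction layer in order; the first one that returns a
    truthy result wins. `layers` mirrors the chain described above
    (DOM links → heuristic containers → visual navigator → ... → vision retry)."""
    for name, layer in layers:
        try:
            result = layer(url)
            if result:
                return name, result
        except Exception:
            continue  # a crashing layer just falls through to the next one
    return None, None
```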


The five things that moved the numbers

1. Content hashing so reruns are free

Every fetched page gets a SHA-256 of its stripped text. If the hash matches the last scrape, skip the LLM call entirely — no tokens, no cost.

On a weekly re-crawl, 95% of URLs hash-skip. Only actual job edits re-extract. This is what makes the whole thing viable as a recurring pipeline rather than a one-shot.
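The hash-skip check is only a few lines. This sketch keeps the cache as an in-memory dict; the real pipeline presumably persists it between runs:

```python
import hashlib

def should_extract(url, stripped_text, cache):
    """Return True only if the page content changed since the last scrape,
    so unchanged postings never reach the LLM (no tokens, no cost)."""
    digest = hashlib.sha256(stripped_text.encode("utf-8")).hexdigest()
    if cache.get(url) == digest:
        return False  # hash-skip
    cache[url] = digest
    return True
```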

2. Stealth Playwright to beat bot detection

Realistic user-agent, viewport, and timezone; --disable-blink-features=AutomationControlled; and an init script that hides the navigator.webdriver flag that most scrapers forget about. This gets past the common bot-detection layers on 4 of 5 boards.

One ATS in my test set still blocks with a full CAPTCHA challenge. That one's on the list.
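A sketch of that stealth setup with Playwright's sync API. The user-agent, viewport, and timezone values here are illustrative, not the ones the scraper actually ships:

```python
STEALTH_CONTEXT_OPTS = {
    "user_agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0.0.0 Safari/537.36"),
    "viewport": {"width": 1366, "height": 768},
    "timezone_id": "Europe/Berlin",
    "locale": "en-US",
}

# Headless Chromium exposes navigator.webdriver = true by default.
HIDE_WEBDRIVER = "Object.defineProperty(navigator, 'webdriver', {get: () => undefined});"

def fetch_stealth(url):
    """Fetch visible page text through a stealth context.
    Requires `pip install playwright` and `playwright install chromium`."""
    from playwright.sync_api import sync_playwright  # lazy import
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            args=["--disable-blink-features=AutomationControlled"],
        )
        context = browser.new_context(**STEALTH_CONTEXT_OPTS)
        context.add_init_script(HIDE_WEBDRIVER)
        page = context.new_page()
        page.goto(url, wait_until="domcontentloaded")
        text = page.inner_text("body")
        browser.close()
        return text
```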

3. Per-worker parallelism partitioned by domain

Partition pending URLs by board domain so workers don't step on each other. If one domain dominates the queue (say, 400 URLs on one board vs. 20 on another), split the big one across shards and interleave the URLs so early items spread across all workers. One Chromium instance per thread, no shared state to debug at 2am.

This matters more than the raw worker count. Naive round-robin over a mixed-domain queue fights itself: you end up with every worker holding a connection to the same board.
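One way to sketch the partition-shard-interleave step; the shard size and helper names are my own, not the scraper's:

```python
from collections import defaultdict
from itertools import zip_longest
from urllib.parse import urlparse

def partition_by_domain(urls, workers, max_shard=100):
    """Group pending URLs by board domain, split oversized domains into
    shards, then interleave shards so early queue positions span all boards
    before striping the result across workers."""
    by_domain = defaultdict(list)
    for url in urls:
        by_domain[urlparse(url).netloc].append(url)
    shards = []
    for domain_urls in by_domain.values():
        for i in range(0, len(domain_urls), max_shard):
            shards.append(domain_urls[i:i + max_shard])
    # round-robin across shards: early items come from different shards
    interleaved = [u for row in zip_longest(*shards) for u in row if u is not None]
    return [interleaved[k::workers] for k in range(workers)]
```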

4. A sanitizer that absorbs LLM schema drift

Gemini Flash-Lite is cheap but will happily return "manager" for a seniority enum that only accepts "lead", or "other" for a language code that must be ISO 639-1. The reframe: stop trying to prompt-engineer the model into perfect schema compliance. Assume it will drift. Catch the drift deterministically before validation.

What the sanitizer actually does:

  • Maps enum synonyms to canonical values (manager → lead, intermediate → mid, graduate → junior, chief → c_level)
  • Normalizes language names to ISO 639-1 (english → en, deutsch → de)
  • Moves misplaced fields into the right nesting (LLMs love putting skills at the root instead of under must_have)
  • Strips nulls because the schema has additionalProperties: false

This took the success rate from 77% to 90%+ on the same input, without changing the prompt or the model.
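An illustrative subset of such a sanitizer. The field names, and any mappings beyond those listed above, are assumptions rather than the actual OJP code:

```python
# Synonym tables from the post; entries beyond these would grow over time.
ENUM_SYNONYMS = {"manager": "lead", "intermediate": "mid",
                 "graduate": "junior", "chief": "c_level"}
LANG_TO_ISO = {"english": "en", "deutsch": "de", "german": "de"}

def sanitize(record):
    """Deterministically absorb LLM schema drift before JSON Schema validation.
    Field names ("seniority", "language", "requirements.must_have") are
    illustrative stand-ins for the OJP v0.2 schema."""
    out = {k: v for k, v in record.items() if v is not None}  # strip nulls
    if "seniority" in out:
        out["seniority"] = ENUM_SYNONYMS.get(out["seniority"], out["seniority"])
    if "language" in out:
        out["language"] = LANG_TO_ISO.get(out["language"].lower(), out["language"])
    # move misplaced root-level skills under the must_have nesting
    if "skills" in out:
        out.setdefault("requirements", {}).setdefault("must_have", []).extend(out.pop("skills"))
    return out
```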

5. Vision-retry for the last 10%

When text extraction fails — an SPA rendered nothing parseable, or Gemini Lite returned invalid JSON even after sanitization — re-run with a full-page screenshot through Gemini Flash vision.

Recovered 4/20 retries at $0.0037 per recovered posting. Boards where text stripping returned 0 chars become viable because Playwright still renders the page. Vision sees what the user sees.

One nuance worth noting: Flash-Lite has weaker vision grounding, so the retry path specifically uses gemini-flash-latest even though the primary extraction uses Lite.
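The retry decision can be sketched like this, with `extract` standing in for the google-genai call; the Lite model name is assumed, while the retry model is the one named above:

```python
PRIMARY_MODEL = "gemini-flash-lite-latest"  # assumed name for the cheap text path
VISION_MODEL = "gemini-flash-latest"        # stronger vision grounding for retries

def extract_with_retry(stripped_text, screenshot, extract):
    """Try cheap text extraction first; on empty text or a failed parse,
    retry once with a full-page screenshot through the larger model.
    `extract(model, text=None, image=None)` is a hypothetical wrapper
    that returns None when the model's output doesn't validate."""
    if stripped_text.strip():
        result = extract(PRIMARY_MODEL, text=stripped_text)
        if result is not None:
            return result, "text"
    result = extract(VISION_MODEL, image=screenshot)
    return result, "vision"
```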


The validated cost model

| Layer | Per call | Per 1K postings |
| --- | --- | --- |
| Fetch (Playwright) | free | ~3s compute |
| HTML stripping | free | local |
| Extraction (Flash-Lite text) | $0.0004 | $0.39 |
| Visual nav discovery | $0.0006 | $0.003 per hard board |
| Vision retry | $0.0007 | $0.015 per 20 retries |

End-to-end at scale, including retries: ~$0.42 / 1,000. That's about $4.20 per 10K and $42 per 100K.

Every run writes a frozen stats-{timestamp}.json with extraction and vision cost tracked separately, so I can diff regressions between runs. A cumulative stats.json merges at the end of each run — single-writer, no race conditions across parallel workers.
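A sketch of that single-writer merge; the stat keys are illustrative:

```python
import json
from pathlib import Path

def merge_stats(run_dir, out_path):
    """Each run freezes its own stats-{timestamp}.json; after all workers
    finish, ONE process sums the frozen files into a cumulative stats.json.
    No shared file is ever written from two workers at once."""
    totals = {}
    for frozen in sorted(Path(run_dir).glob("stats-*.json")):
        for key, value in json.loads(frozen.read_text()).items():
            totals[key] = totals.get(key, 0) + value
    Path(out_path).write_text(json.dumps(totals, indent=2))
    return totals
```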


Gotchas I paid for

  • Playwright base image version must match the playwright pip version exactly, or headless Chromium fails with Executable doesn't exist
  • Read/modify/write of shared stats from parallel workers creates race conditions — use per-run files and merge once at the end
  • Boards with locale prefixes (/en_US/jobs/..., /de_DE/jobs/...) create duplicate URLs that inflate extractions 10× unless you normalize during link discovery
  • Gemini Flash Lite's vision grounding isn't good enough for retry — hardcode the larger model for screenshots
  • Screenshot-based vision on heavy SPA boards works even when text stripping returns 0 chars, because Playwright still renders the page
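The locale deduplication from the third gotcha can be a small regex applied during link discovery. A sketch, assuming the `ll_CC` path-prefix pattern shown above:

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Matches a leading locale segment like /en_US or /de_DE before the real path.
LOCALE_PREFIX = re.compile(r"^/[a-z]{2}_[A-Z]{2}(?=/)")

def normalize_job_url(url):
    """Collapse locale-prefixed duplicates to one canonical URL so the
    same posting isn't discovered and extracted once per locale."""
    parts = urlsplit(url)
    path = LOCALE_PREFIX.sub("", parts.path)
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))
```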

What this is really about

The cost target wasn't the point for me. Sub-cent per posting is a useful headline, but it's a side effect.

The point is that once extraction gets this cheap and this structured, the scraping step stops being a moat. Anyone who wants job data at scale can get it. What remains valuable is the schema you extract into — OJP in my case — and the protocol layer that makes those extractions interoperable across every agent and every board.

I'm building ADNX because the next generation of hiring doesn't run on recruiter tools. It runs on agent-to-agent transactions against domain protocols. Job ingestion at $0.39/1K is just the plumbing. The interesting work is what structured hiring data enables once it's cheap enough to be ambient.

If you're working on anything adjacent — agent protocols, domain-specific extraction, A2A patterns — ping me. Always looking to compare notes.


Stack: Python 3.12, Playwright 1.58, google-genai SDK, jsonschema, Docker.

Top comments (1)

Mihai-Cristian Bâltac

This resonates a lot. I’ve been building something similar and ran into many of the same lessons.

The content hashing point is especially real. We solved it a bit differently with SQLite, a scraped_at timestamp, and a resume flag, but the principle is the same: if reruns aren’t basically free, you stop rerunning.

One place where I’d push back a bit is the sanitizer layer. We had better results moving enum normalization before the model call instead of after. Giving the model a fixed allowed list and telling it to return one value verbatim reduced drift a lot, so post-processing became pretty minimal. Heavier prompt, but worth it for structured extraction.

One nasty issue we hit that I didn’t see mentioned: DNS lookup timeouts on Windows bypassing normal request-level timeouts. socket.setdefaulttimeout() was the only thing that reliably caught it. We lost a few days to frozen workers before figuring that out.

Also fully agree with the bigger takeaway: once ingestion gets cheap enough, scraping stops being the moat. The real value shifts to the schema and to how portable the output is.

Curious how you’re thinking about OJP adoption long term. Do you expect ATS vendors to eventually consume it, or do agents just route around them?