We built a job aggregation scraper as a side project, and it turned into a much harder engineering problem than we expected. The goal is boring to state and brutal to solve: take a few hundred company career pages, none of which agree on structure, and turn them into a clean, deduplicated, continuously-refreshed feed of live job listings. The build now pulls 3,000+ jobs. Getting there meant being wrong, repeatedly and expensively, before we were right.
This is the honest version of that story — the approaches that didn't work, the one that did, and the metrics at each step. If you're about to wire an LLM into a scraper because the demos look magical, this is the post we wish we'd read first.

Act 1: The obvious answers, and why they failed
Hand-coding each site
By the time we had a couple dozen scrapers, we were spending more time repairing them than writing new ones.
The first version wasn't AI at all — one scraper per site, hand-written against that site's markup. Selenium for the JavaScript-heavy pages, plain requests for the ones that served data server-side. Each script knew exactly which container an ATS rendered its listings into, walked each listing to its job page, and pulled out the fields; some had to switch into an iframe just to reach the listings. The principle was always the same: encode this one site's structure by hand.
Each one worked — for exactly one site, until that site changed its HTML. And we're a two-person team. Hundreds of employers on different ATSs (Workday, Greenhouse, Lever, iCIMS, Taleo, and a long tail of bespoke pages), every one a separate scraper to write and, worse, to repair on every redesign. The maintenance burden, not the writing, is what made this a dead end. So we reached for AI.
Off-the-shelf AI scrapers
We tried the popular options, pointing them all at the same target — BlackBerry's Workday careers page — to keep the comparison fair.
Browser-use is the maximalist version of the dream: hand an autonomous agent a task ("extract every job, paginate through all of them") and it drives the browser itself, deciding each click. Impressive to watch, unusable in production. Our run went past 35 minutes and 134 steps before we killed it, with no usable result. We tried Gemini 2.5 Flash, then 2.5 Pro; Pro didn't help (35 minutes, 143 steps, still nothing). The autonomy you're paying for is the same autonomy that makes it wander and never converge.
crawl4ai half-worked, in two ways that each fell short. With hand-coded rules, it was fast and free (44 jobs in 43 seconds, zero tokens), but hand-coded rules are exactly the per-site trap we were escaping. Without rules, driving extraction with an LLM, it generalized but got expensive: 41 jobs took 253 seconds, 45 LLM calls, and 3.1 million tokens. It also can't paginate on its own (no memory of "I'm on page 3, here's what I've seen"), so we had to bolt an LLM on just to pick the next-page control. Stateless by design.
| Approach | Time | Jobs | Tokens | Rule-free? | Stateful? | Outcome |
|---|---|---|---|---|---|---|
| Browser-use (autonomous agent) | 35+ min, 134–143 steps | 0 | — | yes | yes | killed, never finished |
| crawl4ai + hand rules | 43s | 44 | 0 | no | n/a | works, but rules per site |
| crawl4ai + LLM (no rules) | 253s | 41 | 3.1M | yes | no | works, stateless, token-heavy |
The reframe that took us too long to see: this isn't "AI scrapers are bad." Each tool did something right — Browser-use needs no rules, crawl4ai-with-rules is blazing fast, crawl4ai-with-LLM scales without rules. What none gave us was all three at once: rule-free, stateful, and cheap. That gap is the entire reason we built our own.
Act 2: Building our own — from clever to boring
Here's the part nobody tells you: when we built our own, the first version was worse than the off-the-shelf tools we'd just rejected. Because we made the same mistake the autonomous agents did — we tried to be clever.
The clever version (that we deleted)
Our first custom build was agentic by design, with two ideas we were proud of and both wrong.
First, it asked the LLM a meta-question on every step: what should we do next? The model picked from a menu, and we acted on its choice. Literally, every page turn went through a round-trip like this:
```python
# Every step, an LLM call just to decide the next move:
while True:
    action = ai.determine_next_action(page_state, memory)
    if action == "continue_pagination":
        go_to_next_page()
    elif action == "check_iframe":
        switch_into_iframe()
    elif action == "retry_with_screenshot":   # the expensive escalation
        images = scroll_and_screenshot(page)  # multiple JPEGs → model
        extract_from_images(images)
    elif action == "complete":
        break
```
Second, look at that retry_with_screenshot branch — when text extraction looked weak, we'd escalate to screenshots for "better results": scroll the page, capture images, and ship them to the model to look at.
The results were humbling. An early Workday-only version of this took 20 to 30 minutes to pull 29 jobs, roughly 40 to 60 seconds per job. Read that against the table above: crawl4ai-with-LLM managed about 6 seconds per job (41 jobs in 253 seconds). Our clever custom scraper was slower than the off-the-shelf tool we'd dismissed. The bottleneck was the design itself: a constant stream of slow LLM calls, made worse by the screenshot payloads.
Both of our "clever" ideas were traps:
- An LLM call per decision is a latency-and-cost tax on every step. Deciding "should I click next?" doesn't need a round-trip to a language model — it's a question a five-line function answers from page state. We were paying model latency to re-derive control flow we already understood.
- Screenshots are the most expensive way to read a page. Gemini tokenizes images by tiling — 258 tokens per 768×768 tile — so a full-height career page runs several thousand tokens per screenshot, and a scrolling page is several screenshots. Text tokenizes at ~0.75 words/token, so the entire raw HTML of a ~2,000-word page is around 2,700 tokens before cleaning. The numbers:
| | Full-page screenshot | Page as text (HTML) |
|---|---|---|
| How it's counted | 258 tokens per 768×768 tile; a tall page = many tiles | ~1.33 tokens/word |
| Rough cost for one page | ~2,000–4,000+ tokens (×N for scrolling) | ~2,700 raw, far less cleaned |
| Before extraction | model must OCR the image (adds latency) | tokenized directly, no OCR |
| Structure accuracy | misses fine detail and exact field boundaries | reads structure exactly from the DOM |
We were paying more, waiting longer (OCR), and getting worse structure. There's no axis on which screenshots won.
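To make that comparison concrete, here is a back-of-the-envelope sketch of the arithmetic. It uses only the figures quoted above (258 tokens per 768×768 tile, ~0.75 words per token); the simple "count whole tiles" rule and the 1280×4000 px page size are assumptions for illustration, and real tokenizers may crop or scale differently.

```python
import math

TOKENS_PER_TILE = 258    # figure quoted above
TILE_SIDE_PX = 768       # figure quoted above
WORDS_PER_TOKEN = 0.75   # rule of thumb quoted above


def screenshot_tokens(width_px: int, height_px: int) -> int:
    """Estimate image tokens as whole 768x768 tiles covering the screenshot.

    Assumption: every partially covered tile is billed as a full tile.
    """
    tiles_across = math.ceil(width_px / TILE_SIDE_PX)
    tiles_down = math.ceil(height_px / TILE_SIDE_PX)
    return tiles_across * tiles_down * TOKENS_PER_TILE


def text_tokens(word_count: int) -> int:
    """Estimate text tokens from a word count."""
    return round(word_count / WORDS_PER_TOKEN)


# Hypothetical tall careers page captured at 1280x4000 px: 2 x 6 = 12 tiles.
print(screenshot_tokens(1280, 4000))  # 3096
# The ~2,000-word page from the text.
print(text_tokens(2000))              # 2667
```

Under those assumptions a single tall screenshot already costs more than the page's text, and a scrolling capture multiplies the image side by the number of shots.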
The symmetry worth noting: the instinct to escalate when the cheap path fails was right — we just escalated to the wrong thing. The old version escalated to screenshots (expensive, slow, less accurate); the shipped version escalates to a stronger text model (cheap, fast, more accurate). Same idea, opposite cost.
So we deleted the clever parts and got boring.
The boring version (that we shipped)
The production design rests on a few unglamorous decisions, each of which earned its place by being faster or more reliable than the clever alternative.
Text, not pixels. We strip the page down with BeautifulSoup — kill script, style, noscript, svg, meta, link, drop comments — and send the cleaned HTML to the model, truncated to a token budget. No screenshots. As the numbers above show, the model reads structure better as text than as images, at a fraction of the cost.
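The cleaning step can be sketched as follows. The production code uses BeautifulSoup; this stand-in uses only the standard library's `html.parser` so it runs anywhere, and the names `clean_html` and `max_chars` are hypothetical. The character cap is a crude proxy for a token budget, not the real budgeting logic.

```python
from html import escape
from html.parser import HTMLParser

DROP_WITH_CONTENT = {"script", "style", "noscript", "svg"}  # remove tag and contents
DROP_TAG_ONLY = {"meta", "link"}                            # void tags, just omit
DROP_ALL = DROP_WITH_CONTENT | DROP_TAG_ONLY


class _Cleaner(HTMLParser):
    """Re-emits the document minus noisy tags; comments are dropped by default."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # > 0 while inside a dropped subtree

    @staticmethod
    def _render(tag, attrs, closing=">"):
        rendered = "".join(
            f" {name}" if value is None else f' {name}="{escape(value)}"'
            for name, value in attrs
        )
        return f"<{tag}{rendered}{closing}"

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self._skip_depth += 1
        elif self._skip_depth == 0 and tag not in DROP_TAG_ONLY:
            self.parts.append(self._render(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        if self._skip_depth == 0 and tag not in DROP_ALL:
            self.parts.append(self._render(tag, attrs, closing="/>"))

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif self._skip_depth == 0 and tag not in DROP_TAG_ONLY:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(escape(data, quote=False))


def clean_html(raw_html: str, max_chars: int = 20_000) -> str:
    """Strip noise and truncate; max_chars approximates a token budget."""
    cleaner = _Cleaner()
    cleaner.feed(raw_html)
    cleaner.close()
    return "".join(cleaner.parts)[:max_chars]
```

What survives is the structure the model actually needs: containers, links, and visible text.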
Stealth by default. We don't drive vanilla Playwright — we run it with playwright-stealth, which patches the browser fingerprint (the navigator.webdriver tell, headless-Chrome quirks, and the rest) that career sites use to spot automation. It's not bulletproof, but it's the difference between getting clean HTML back and getting a challenge page on a meaningful share of sites.
Don't ask the model to drive. The LLM does exactly two narrow jobs: on a listing page, "here are the job links and the pagination action," and on a detail page, "here's the structured data." Control flow — paginate or not, are we done, did navigation fail — is plain Python reading plain state. The model annotates; the code decides. (This distinction matters so much it became the organizing principle of our current rewrite — Act 3.)
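"Plain Python reading plain state" can be as small as the sketch below. It is illustrative rather than the production logic: the function name, its parameters, and the `max_pages` safety cap are all assumptions. The point is that the model's annotations (job links, next-page target) come in as data, and the decision itself never touches a model.

```python
from typing import Iterable, Optional, Set


def next_step(
    seen_job_urls: Set[str],
    visited_pages: Set[str],
    page_job_urls: Iterable[str],
    next_page_url: Optional[str],
    max_pages: int = 50,
) -> str:
    """Return "paginate" or "finish" from state alone; reads only, mutates nothing."""
    if not set(page_job_urls) - seen_job_urls:
        return "finish"      # this page added no new jobs
    if next_page_url is None:
        return "finish"      # no pagination control was found
    if next_page_url in visited_pages:
        return "finish"      # cycle: the "next" link points somewhere we've been
    if len(visited_pages) >= max_pages:
        return "finish"      # hard safety cap
    return "paginate"


print(next_step(set(), {"p1"}, ["job-a", "job-b"], "p2"))  # paginate
print(next_step({"job-a"}, {"p1", "p2"}, ["job-a"], "p3"))  # finish
```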
Headless first, headed only on proof. We run Playwright headless by default because it's faster and lighter. The model does flag when it suspects a bot wall or JS-only rendering — but we treat that flag as advisory, never act on it directly. The agent always tries headless first, and only an observed failure (zero jobs, pagination errors, the cycle-breaker firing, or undershooting the expected count) triggers a headed retry. The philosophy: prove headless doesn't work rather than abandon it on a prediction. Whichever attempt finds more jobs wins.
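A sketch of that policy, under stated assumptions: `ScrapeResult`, `observed_failure`, and `scrape_with_fallback` are hypothetical names, and `run` stands in for whatever actually drives the browser. Note what is absent: the model's bot-wall flag is not an input to the decision.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ScrapeResult:
    jobs: List[str]
    pagination_errors: int = 0
    cycle_detected: bool = False


def observed_failure(result: ScrapeResult, expected_count: Optional[int]) -> bool:
    """True only for failures we actually saw, never for predictions."""
    if not result.jobs:
        return True
    if result.pagination_errors or result.cycle_detected:
        return True
    return expected_count is not None and len(result.jobs) < expected_count


def scrape_with_fallback(
    run: Callable[[bool], ScrapeResult],  # run(headless) -> ScrapeResult
    expected_count: Optional[int] = None,
) -> ScrapeResult:
    headless = run(True)
    if not observed_failure(headless, expected_count):
        return headless
    headed = run(False)
    # Whichever attempt found more jobs wins; ties keep the headless result.
    return headed if len(headed.jobs) > len(headless.jobs) else headless
```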
Log everything, per company. Each company we scrape writes its own console log and a dump of what we pulled that run. When a site silently changes and extraction quality drops, that per-company trail is the difference between catching it in an hour and finding out weeks later from a gap in the feed.
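One way to get that per-company trail with nothing but the standard library is sketched below. The file layout (`<company>.log` plus `<company>.jobs.json`) and the function names are assumptions for illustration, not a description of the production setup.

```python
import json
import logging
from pathlib import Path


def company_logger(company: str, log_dir: Path) -> logging.Logger:
    """Return a logger that writes to its own file, one per company."""
    log_dir.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(f"scraper.{company}")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.FileHandler(log_dir / f"{company}.log", encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


def dump_run(company: str, jobs: list, log_dir: Path) -> Path:
    """Write what this run extracted, so a quality drop can be diffed later."""
    log_dir.mkdir(parents=True, exist_ok=True)
    path = log_dir / f"{company}.jobs.json"
    path.write_text(json.dumps(jobs, indent=2), encoding="utf-8")
    return path
```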
Make every job write atomic. A half-written job row is worse than no row. The pattern that keeps the feed clean is two layers: validate and sanitize every job before touching the database (so one malformed row doesn't blow up the batch), then insert the clean set inside a single transaction that either fully commits or fully rolls back:
```python
# 1. Validate first: drop rows missing required fields, coerce the rest.
valid = [sanitize(j) for j in jobs if has_required_fields(j)]

# 2. Then one ACID transaction: all of them land, or none do.
async with conn.transaction():
    for job in valid:
        await conn.execute(insert_sql, *fields(job))
    await conn.execute(update_lastscrape_sql, career_url)
    # any exception here → automatic rollback, nothing partial persists
```
So anything in the database is a whole job, never a fragment — and a bad scrape can't leave the feed in a half-updated state.
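The same two-layer pattern can be tried end to end with the standard library. This sketch swaps in `sqlite3` so it runs without a database server; the table schema, the `REQUIRED` field list, and the bodies of `has_required_fields` and `sanitize` are assumptions, not the production versions.

```python
import sqlite3

REQUIRED = ("url", "title", "company")  # hypothetical required fields


def has_required_fields(job: dict) -> bool:
    return all(str(job.get(key) or "").strip() for key in REQUIRED)


def sanitize(job: dict) -> dict:
    return {key: str(job[key]).strip() for key in REQUIRED}


def persist(conn: sqlite3.Connection, jobs: list) -> int:
    """Validate first, then insert the clean set in one transaction."""
    valid = [sanitize(j) for j in jobs if has_required_fields(j)]
    with conn:  # commits on success, rolls back if the block raises
        conn.executemany(
            "INSERT INTO jobs (url, title, company) VALUES (:url, :title, :company)",
            valid,
        )
    return len(valid)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (url TEXT PRIMARY KEY, title TEXT NOT NULL, company TEXT NOT NULL)"
)
```

A batch containing a duplicate URL raises `sqlite3.IntegrityError` partway through, and the rollback removes the rows inserted before the failure, so the table never holds half a batch.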
Split listing from detail. This is the single most important structural decision, because scraping is two problems with two different shapes:
- Listing discovery is sequential and pagination-bound — load a page, find jobs, find the "next" control, repeat. One browser walking through pages.
- Detail extraction is embarrassingly parallel — given N job URLs, fetch and parse N pages; they don't depend on each other.
Treating these as one undifferentiated loop is part of why the clever version was slow. Splitting them is what let us go fast.
Parallelism — but carefully. Once detail extraction is its own phase, you can fan it out: many job URLs, fetched and parsed concurrently. We cap concurrency with an asyncio.Semaphore(5). The 5 is pragmatic, not tuned — fewer didn't saturate the work, and more risked rate-limiting or overloading the target site (and our own box; this runs on a modest AWS t3.medium). It's a sane ceiling, not a magic number. But how you fan out matters more than the cap, and we got it wrong the first time.
The parallelism trap
When we first added concurrency, it didn't help. It hurt. Two runs on the same workload, back to back:
Parallel: 0:08:09
Sequential: 0:07:27
The parallel version was slower — and the explanation is the whole lesson: we were parallelizing the wrong unit of work. Our first concurrent version cold-started a fresh browser for every job. Launching a headless browser from cold is the heaviest thing in the pipeline, so we were paying that startup tax N times and running the taxes in parallel — all we'd really parallelized was the overhead.
The fix: launch one browser, keep it alive for the whole batch, and give each concurrent job a lightweight child tab on it. Opening a tab is a fraction of the cost of booting a browser, so with startup paid once instead of N times, the workload that used to take half an hour came down to roughly 50 jobs in 3 minutes (~3.6s/job) — faster per job than crawl4ai-with-LLM, and stateful, and far cheaper on tokens. The combination none of the off-the-shelf tools offered.
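The shape of that fix, sketched with a stand-in so it runs without a real browser: `FakeBrowser` and `fetch_in_new_tab` are hypothetical and only simulate "expensive to launch, cheap to open a tab". The parts that carry over are the single shared instance and the `asyncio.Semaphore(5)` cap.

```python
import asyncio


class FakeBrowser:
    """Stand-in for a real browser: launched once, many cheap tabs."""

    def __init__(self):
        self.open_tabs = 0
        self.peak_tabs = 0

    async def fetch_in_new_tab(self, url: str) -> dict:
        self.open_tabs += 1
        self.peak_tabs = max(self.peak_tabs, self.open_tabs)
        try:
            await asyncio.sleep(0.01)  # pretend to load and parse the page
            return {"url": url}
        finally:
            self.open_tabs -= 1        # the tab always closes


async def extract_all(urls, max_concurrency: int = 5):
    browser = FakeBrowser()                     # startup cost paid once
    limit = asyncio.Semaphore(max_concurrency)  # at most N tabs in flight

    async def extract_one(url):
        async with limit:
            return await browser.fetch_in_new_tab(url)

    results = await asyncio.gather(*(extract_one(u) for u in urls))
    return results, browser
```

Running it over 20 URLs keeps the number of simultaneously open tabs at or below the cap, and `asyncio.gather` returns results in input order.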
The unglamorous reliability work
Speed is the headline; reliability is what keeps a feed live. Two pieces of plumbing did most of that work.
Model escalation. Every extraction defaults to a small, cheap model (gemini-2.5-flash-lite). We then check whether the required fields came back; for the large majority of pages they do, and we're done. Only the occasional page where the small model drops a field or can't produce valid output gets re-run on a stronger model (gemini-2.5-flash). Think of it as a fallback cascade for LLMs: cheap by default, expensive only on demonstrated failure, so the average cost stays near the cheap-model floor.
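In outline, the escalation is a completeness check between two calls. The sketch below passes the models in as plain callables so it stays runnable; `REQUIRED_FIELDS`, `is_complete`, and `extract_with_escalation` are hypothetical names, and treating a `ValueError` as "invalid output" is an assumption about how parse failures surface.

```python
from typing import Callable, Optional, Tuple

REQUIRED_FIELDS = ("title", "location", "description")  # hypothetical


def is_complete(data: Optional[dict]) -> bool:
    return bool(data) and all(data.get(name) for name in REQUIRED_FIELDS)


def extract_with_escalation(
    page_text: str,
    cheap: Callable[[str], Optional[dict]],
    strong: Callable[[str], Optional[dict]],
) -> Tuple[Optional[dict], str]:
    """Try the cheap model; call the strong one only on demonstrated failure."""
    try:
        data = cheap(page_text)
    except ValueError:  # assumed signal for output that could not be parsed
        data = None
    if is_complete(data):
        return data, "cheap"
    return strong(page_text), "strong"
```

Because the strong model is only reached through the failed check, its cost is paid per failing page rather than per page.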
JSON repair. LLMs returning "JSON" is a polite fiction — markdown fences, missing commas, unescaped quotes, stray control characters. Instead of retry-and-pray, we run a cascade of repairs and only give up if all of them fail; a surprising fraction of "failed" extractions are one regex away from valid.
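A repair cascade of this kind might look like the sketch below. The specific repairs and their order are assumptions chosen to match the failure modes listed above, not the production list, and the trailing-comma regex is deliberately naive (it does not check whether it is inside a string).

```python
import json
import re

FENCE = "`" * 3  # a markdown code fence, built this way to keep the example fence-safe


def strip_fences(text: str) -> str:
    text = text.strip()
    if text.startswith(FENCE):
        text = text[len(FENCE):]
        if text.lower().startswith("json"):
            text = text[4:]
    if text.endswith(FENCE):
        text = text[: -len(FENCE)]
    return text.strip()


def outermost_object(text: str) -> str:
    """Keep only the span from the first '{' to the last '}'."""
    start, end = text.find("{"), text.rfind("}")
    return text[start : end + 1] if 0 <= start < end else text


def drop_trailing_commas(text: str) -> str:
    return re.sub(r",\s*([}\]])", r"\1", text)


def drop_control_chars(text: str) -> str:
    return re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)


REPAIRS = (strip_fences, outermost_object, drop_trailing_commas, drop_control_chars)


def parse_lenient(raw: str):
    """Try strict parsing, then apply repairs cumulatively; None if all fail."""
    candidate = raw
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        pass
    for repair in REPAIRS:
        candidate = repair(candidate)
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return None
```

Each repair is cheap and local, so the cascade costs microseconds compared with the seconds a retry against the model would take.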
(There's also the matter of career pages that are empty shells loading an ATS inside an iframe. We handle it, but it's a corner case rather than a headline.)
Act 3: Rebuilding it declaratively with LangGraph
The boring production version works. But it has a smell anyone who's maintained a stateful scraper will recognize: control flow and state mutation are tangled together. State lives in a mutable session object poked from a dozen places, and the "are we done?" logic is spread across a long loop with early breaks. It runs, but reasoning about why it took a given path on a given site means reading the whole thing.
So we rebuilt it on LangGraph. The rewrite is done and in testing now — not yet the version hardened by months of production, but the architecture we're moving to.
The core idea is to make the implicit state machine explicit, with a discipline that holds across both graphs: tools are leaves (each wraps one capability — one navigation, one LLM call, one DB write — and never calls another tool), nodes compose tools and return a partial state dict that LangGraph merges via reducers, and routers are pure functions of state. The live Playwright and Gemini objects stay out of graph state entirely (they're not serializable) and live in config instead. Two graphs mirror the two-phase split: a listing graph (navigate → analyze → merge → paginate or finish) and a details graph that diffs against the database, fans out per-job extraction, and persists. Each job gets its own subgraph: navigate → extract with the cheap model → (good enough? finish : retry with the strong model) → clean up. The model-escalation logic that used to be an if buried in a method becomes an edge in a graph you can see.
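The "nodes return a partial state dict, reducers merge it" idea can be shown in miniature. To be clear about what this is: a hand-rolled stand-in in plain Python, not LangGraph's API. `REDUCERS`, `merge`, and `analyze_node` are hypothetical names used only to illustrate the merge semantics.

```python
import operator

# Per-key merge rules: list-valued keys accumulate, everything else is overwritten.
REDUCERS = {"job_urls": operator.add, "errors": operator.add}


def merge(state: dict, update: dict) -> dict:
    """Fold a node's partial update into the state without mutating either."""
    merged = dict(state)
    for key, value in update.items():
        reducer = REDUCERS.get(key)
        if reducer is not None and key in state:
            merged[key] = reducer(state[key], value)
        else:
            merged[key] = value
    return merged


def analyze_node(state: dict) -> dict:
    """A node returns only the keys it changed."""
    return {"job_urls": ["https://example.com/jobs/2"], "page": state["page"] + 1}


state = {"page": 1, "job_urls": ["https://example.com/jobs/1"], "errors": []}
new_state = merge(state, analyze_node(state))
print(new_state["page"], len(new_state["job_urls"]))  # 2 2
```

The node never sees or touches the full merge logic, which is what makes each piece testable on its own.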
The discipline the rewrite forces is the payoff. Routers are pure functions of state — they read, they never write. All mutation happens inside nodes. The model-escalation decision, for instance, is now just this — a function that looks at the extracted data and picks the next node, changing nothing:
```python
def _route_after_extract(state: JobState) -> str:
    if is_job_data_complete(state.get("job_data")):
        return "finalize_job"           # cheap model got everything
    return "extract_with_stronger"      # missing fields → escalate
```
That one rule (routers decide, nodes mutate) kills the entire class of bug where deciding what to do next accidentally changes the thing you're deciding about. And it names a through-line: the property crawl4ai's LLM mode lacked was state. It couldn't remember what it had seen or where it was in a pagination journey, and state is what we'd been investing in all along, from that first mutable session object to a typed graph state we can now inspect.
It isn't free, and the honest costs are real. A live browser tab isn't serializable, so it can't live in graph state; we keep those in a side map keyed by URL — exactly the out-of-band state the framework is meant to spare you. We make it safe by guaranteeing cleanup: the per-job subgraph's terminal node always runs and always closes its tab, even when the job fails, so tabs never leak. And a graph of conditional edges is harder to eyeball than a while loop; we're betting the clarity compounds as the system grows, but day one it's a cost, not a win.
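The cleanup guarantee comes down to a `finally` around the per-job work. In this sketch `FakeTab`, `open_tabs`, and `run_job` are hypothetical stand-ins: the side map is keyed by URL as described above, and the terminal step runs whether extraction succeeds or raises.

```python
import asyncio

open_tabs = {}  # side map keyed by URL; live tabs never enter graph state


class FakeTab:
    """Stand-in for a live browser tab."""

    def __init__(self, url: str):
        self.url = url
        self.closed = False

    async def close(self) -> None:
        self.closed = True


async def run_job(url: str, extract) -> dict:
    tab = FakeTab(url)
    open_tabs[url] = tab
    try:
        return await extract(tab)
    except Exception as exc:               # the job fails, the batch carries on
        return {"url": url, "error": str(exc)}
    finally:
        await open_tabs.pop(url).close()   # always runs, so tabs never leak


async def failing_extract(tab: FakeTab) -> dict:
    raise RuntimeError("navigation failed")
```

Calling `asyncio.run(run_job("https://example.com/jobs/1", failing_extract))` returns an error record and leaves `open_tabs` empty.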
The bet still looks right — declarative state, inspectable routing, per-job isolation are what you want as sites and edge cases multiply. We'll know for sure once it's taken the same production beating the current version has.
What we'd tell you if you're about to build this
There's no "rule-free + stateful + cheap" tool off the shelf — at least there wasn't for us. Each tool we tried nailed one or two of those and missed the third. Figure out which properties your problem actually needs before assuming a product gives you all of them.
Let the model annotate; don't let it drive. Use the model for the genuinely fuzzy task — reading an unfamiliar page — and keep it out of control flow, which you can write in plain code. That applies to its signals too: ours flags a suspected bot wall, but we never switch from headless to headed on that flag alone — only a real, observed failure escalates. Let the model suggest; let code and reality decide. (Autonomy is also the feature that makes agents wander for 35 minutes and never finish.)
Text beats screenshots for structured extraction. A single screenshot can cost more tokens than the whole page as text, takes longer (OCR), and reads structure worse. When the cheap path fails, escalate to a smarter model, not to pixels.
Parallelize the cheap unit of work, not the expensive one. Our first concurrent version cold-started a browser per job and got slower, because all it parallelized was startup overhead. Share one browser, fan out lightweight tabs. When parallelism makes things worse, you're almost always parallelizing the wrong thing.
The current version has run in production for months; the LangGraph rebuild is in testing behind it. But the shape of the thing is finally right, and the shape is the lesson: stop trying to be clever, split the problem along its natural seams, keep the state you need, and let the boring parts be boring.
The scraper runs in production behind lazyfruit.com, where it keeps 3,000+ job listings live. I built it with my friend Cute Agarwal — I'm Advait Khawase. Questions and war stories of your own are welcome in the comments.

