Damien Alleyne

Posted on • Originally published at blog.alleyne.dev

I Benchmarked 6 LLMs to Automate My Job Board for $0.35/Month

Background

I run a curated remote job board (GlobalRemote) focused on established remote-first companies with transparent salaries that hire globally — companies like GitLab, Automattic, Buffer, and Zapier. I started it last September to scratch my own itch, manually curating every listing — researching interview processes, verifying geographic restrictions, and cross-referencing salary data. That worked for a small board, but didn't scale.

So I built custom Apify scrapers with department-level filtering to pull only engineering, product, design, and data roles from Greenhouse and Ashby boards. That cut the noise by 80%, but I still needed to automatically:

  1. Classify whether a scraped job is a relevant tech role (engineer, designer, data scientist, PM) — department filtering catches the obvious non-tech roles, but borderline titles like "Integrations Consultant" or "Senior Sales Engineer" still slip through
  2. Extract details that aren't in the job board's structured fields — geographic eligibility, regional variants, and salary data when it's buried in the description text

Previously, this pipeline ran locally using Ollama with qwen3:8b. I wanted to move it entirely to the cloud (GitHub Actions) using cheap API models, so it runs automatically twice a week without my local machine.

The Question

Which cloud LLM models give the best accuracy for classification and extraction, at the lowest cost? Should we use the same model for both tasks, or different models for each?

Methodology

Ground Truth Dataset

I built a test set from my own production data:

  • Classification (50 tests): 25 jobs that ARE on my board (known relevant, with expected categories) + 25 jobs that are NOT relevant (sales, marketing, HR, finance, legal titles from the same companies)
  • Extraction (5 tests): Jobs with known geographic badges and salary ranges, covering open-globally, multi-region, us-canada-only, americas-only, and null salary cases
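To make that concrete, here's the rough shape of one test case per task. The field names and the specific salary numbers are illustrative guesses at a reasonable fixture schema, not my actual test files:

```javascript
// Illustrative ground-truth fixtures (hypothetical schema).
const classificationCase = {
  title: 'Senior Sales Engineer', // borderline title from the negative set
  company: 'ExampleCo',           // hypothetical company name
  expected: { isRelevant: false, category: 'other' },
};

const extractionCase = {
  title: 'Staff Backend Engineer',
  // The description text carries the salary range and eligibility wording.
  expected: {
    geoBadge: 'open-globally',
    regionalVariants: null,
    salaryMin: 150000,   // hypothetical values
    salaryMax: 190000,
    salaryCurrency: 'USD',
  },
};
```

A benchmark run scores each model's JSON output against the `expected` object.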

Models Tested

| Model | Provider | Type | Pricing (input/output per 1M tokens) |
|---|---|---|---|
| Claude Haiku 4.5 | Anthropic | Fast inference | $0.80 / $4.00 |
| GPT-5 Mini | OpenAI | Reasoning | ~$0.15 / ~$0.60 |
| GPT-5 Nano | OpenAI | Reasoning (smallest) | ~$0.05 / ~$0.20 |
| Gemini 2.5 Flash | Google | Fast inference | $0.15 / $0.60 |
| Gemini 3 Flash (preview) | Google | Fast inference (preview) | ~$0.15 / ~$0.60 |
| Qwen3 8B | Alibaba (via Ollama) | Open-source | Free |

Note: GPT-4o-mini and Gemini 2.0 Flash were also tested initially but replaced with their successors (GPT-5 Mini, Gemini 2.5 Flash) for the final benchmarks.

Prompts

I used the same prompts across all models — the exact prompts from my production pipeline:

Classification prompt:

```
Classify this job title for a tech job board. Respond with JSON only.

Title: {title}
Company: {company}

{"isRelevant": true/false, "category": "engineering|product|design|data|other", "reason": "1-3 words"}

Rules:
- engineering: Software Engineer, Platform Engineer, SRE, DevOps, QA, Solutions Engineer, Design Engineer, Developer Advocate, Security Engineer
- product: Product Manager ONLY (not Growth Manager, not Renewals Manager)
- design: Product Designer, UX Designer, Visual Designer, Brand Designer
- data: Data Scientist, Data Engineer, ML Engineer, AI Engineer, Research Scientist
- ALL other roles (sales, marketing, HR, support, finance, legal) → isRelevant: false, category: "other"
```

Extraction prompt:

```
Extract job details from this posting. Respond with JSON only.

Title: {title}
Company: {company}
Text: {description}

{
  "geoBadge": "open-globally|americas-only|emea-only|...|multi-region|...",
  "regionalVariants": ["Americas", "EMEA", "APAC"] or null,
  "salaryMin": number or null,
  "salaryMax": number or null,
  "salaryCurrency": "USD" or "EUR" or "GBP" or "CAD" or null,
  "reasoning": "brief explanation"
}
```

Results

Full Comparison Table

| Model | F1 (overall score) | Precision (% flagged that were correct) | Recall (% of relevant jobs caught) | Category | Geography | Salary | Cost/run |
|---|---|---|---|---|---|---|---|
| GPT-5 Mini | 94.1% | 92.3% | 96.0% | 96.0% | 80.0% | 100% | $0.008 |
| Claude Haiku 4.5 | 91.7% | 95.7% | 88.0% | 88.0% | 100% | 100% | $0.019 |
| Gemini 2.5 Flash | 89.8% | 91.7% | 88.0% | 88.0% | 80.0% | 80.0% | $0.003 |
| Gemini 3 Flash (preview) | 89.4% | 95.5% | 84.0% | 84.0% | 40.0% | 40.0% | $0.003 |
| Qwen3 8B | 85.7% | 77.4% | 96.0% | 92.0% | 60.0% | 100% | free |
| GPT-5 Nano | 23.3% | 27.8% | 20.0% | 20.0% | 0.0% | 0.0% | $0.006 |
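For readers who want to sanity-check the headline columns: precision, recall, and F1 all fall out of raw counts. GPT-5 Mini's row corresponds to 24 true positives, 2 false positives, and 1 false negative out of 25 relevant jobs:

```javascript
// Precision, recall, and F1 from true-positive / false-positive /
// false-negative counts.
function scores(tp, fp, fn) {
  const precision = tp / (tp + fp);
  const recall = tp / (tp + fn);
  const f1 = (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}

const gpt5Mini = scores(24, 2, 1);
// precision ≈ 0.923, recall = 0.96, f1 ≈ 0.941
```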

Key Findings

1. GPT-5 Mini is the best classifier

  • F1: 94.1% with 96% recall — it catches nearly every relevant job
  • Only 1 false negative: "Senior Marketing Data Analyst" (ambiguous title)
  • 2 false positives: "Integrations Consultant" and "Senior Sales Engineer" (borderline roles)
  • 96% category accuracy — correctly distinguishes engineering vs. design vs. data

2. Claude Haiku 4.5 is the best extractor

  • 100% geography badge accuracy — correctly identifies open-globally, multi-region, us-canada-only, americas-only
  • 100% salary accuracy — extracts exact numbers and currency, handles null correctly
  • Classification is good (91.7% F1) but misses some edge cases like "Growth Designer"

3. GPT-5 Nano is unusable

  • 23.3% F1, 0% extraction accuracy — frequent JSON parse errors
  • Classified most jobs as irrelevant and couldn't extract structured data
  • Despite the lowest per-token price, it costs MORE per run than Gemini 2.5 Flash (its hidden reasoning tokens are billed as output) while performing far worse
  • Verdict: Do not use GPT-5 Nano for structured extraction tasks

4. Gemini 3 Flash (preview) has JSON reliability issues

  • Classification is decent (89.4% F1) but extraction fails 60% of the time with parse errors
  • The preview model wraps JSON in markdown code blocks or adds commentary
  • Not production-ready yet — wait for GA

5. Gemini 2.5 Flash is the budget option

  • Cheapest cloud model at $0.003/run
  • Decent classification (89.8% F1) but weaker extraction (80% geography, 80% salary)
  • One parse error on a EUR salary extraction test

6. Qwen3 8B is surprisingly capable

  • Free and runs locally
  • 85.7% F1 classification — decent but too many false positives (7)
  • 100% salary extraction but only 60% geography badge accuracy
  • Misclassifies "multi-region" as "open-globally" consistently

Error Pattern Analysis

Common false negatives across models:

  • "Senior Marketing Data Analyst" — every model flagged this as marketing, not data. Ambiguous title.
  • "Data Analyst, Customer Intelligence" — "Customer" in title triggers exclusion
  • "Senior Growth Designer" — "Growth" confuses classification

Common false positives:

  • "Integrations Consultant - Americas" — every model said "engineering" (borderline role)
  • "Senior Sales Engineer" — GPT-5 Mini incorrectly treated as engineering

Extraction patterns:

  • "Americas" region consistently extracted correctly by Claude Haiku
  • "US and Europe" → "multi-region" was the hardest badge to get right
  • EUR salary with different format than USD tripped up Gemini models

Decision: Hybrid Model Strategy

Based on benchmarks, I implemented task-based model routing:

| Task | Primary Model | Fallback | Why |
|---|---|---|---|
| Classification | GPT-5 Mini | Claude Haiku → Ollama | Best F1 (94.1%), best recall (96%), best category accuracy (96%) |
| Extraction | Claude Haiku | GPT-5 Mini → Ollama | Perfect geography + salary (100%) |

Cost Estimate

Per twice-weekly ingestion run (~50-100 jobs to classify, ~10-20 to extract):

| Task | Model | Estimated tokens | Cost |
|---|---|---|---|
| Classification | GPT-5 Mini | ~30K input, ~5K output | ~$0.008 |
| Extraction | Claude Haiku | ~20K input, ~5K output | ~$0.036 |
| **Total per run** | | | ~$0.044 |
| **Monthly (8 runs)** | | | ~$0.35 |
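The arithmetic behind those numbers, using the list prices from the models table (the token counts are the estimates above, so treat the result as a ballpark):

```javascript
// Per-run cost at USD-per-1M-token list prices.
const cost = (inTok, outTok, inPrice, outPrice) =>
  (inTok * inPrice + outTok * outPrice) / 1_000_000;

const classify = cost(30_000, 5_000, 0.15, 0.60); // GPT-5 Mini
const extract = cost(20_000, 5_000, 0.80, 4.00);  // Claude Haiku 4.5
const perRun = classify + extract;

console.log(classify.toFixed(4));     // 0.0075 → rounds to ~$0.008
console.log(extract.toFixed(4));      // 0.0360
console.log((perRun * 8).toFixed(2)); // 0.35 per month at 8 runs
```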

Implementation

The routing is handled in a small LLM abstraction layer:

```javascript
const TASK_ROUTING = {
  classify: ['openai', 'claude', 'ollama'],   // GPT-5 Mini first
  extract:  ['claude', 'openai', 'ollama'],   // Claude Haiku first
  default:  ['claude', 'openai', 'ollama'],
};

// Pipeline calls specify the task:
await batchGenerateJSON(classificationPrompts, { task: 'classify' });
await batchGenerateJSON(extractionPrompts, { task: 'extract' });
```

Each task resolves to the best available model in priority order. If OpenAI is down, classification falls back to Claude Haiku. If Anthropic is down, extraction falls back to GPT-5 Mini. Ollama is available as a fallback for local development, but isn't used in the cloud pipeline.
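A minimal sketch of that resolution loop, restating the routing table for completeness. `generateWith` is a hypothetical stand-in for the per-provider SDK calls in my abstraction layer, not its real API:

```javascript
const TASK_ROUTING = {
  classify: ['openai', 'claude', 'ollama'],
  extract:  ['claude', 'openai', 'ollama'],
  default:  ['claude', 'openai', 'ollama'],
};

// Hypothetical provider dispatch: the real pipeline calls the OpenAI,
// Anthropic, or Ollama client here.
async function generateWith(provider, prompt) {
  throw new Error(`no client configured for ${provider}`);
}

async function generateJSON(prompt, { task = 'default' } = {}) {
  const providers = TASK_ROUTING[task] ?? TASK_ROUTING.default;
  let lastError;
  for (const provider of providers) {
    try {
      return await generateWith(provider, prompt);
    } catch (err) {
      lastError = err; // provider down or unparseable JSON: try the next one
    }
  }
  throw new Error(`all providers failed for task "${task}": ${lastError}`);
}
```

Failures (network errors, parse errors) simply advance to the next provider in the task's list, which is all the fallback behavior described above requires.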

GPT-5 API Gotchas

If you're migrating from GPT-4o-mini to GPT-5 models, watch out for:

  1. No temperature parameter — GPT-5 models are reasoning models (like o1/o3). They don't accept temperature. Remove it entirely.
  2. max_completion_tokens not max_tokens — The parameter name changed for reasoning models.
  3. response_format: { type: 'json_object' } still works — JSON mode is supported via Chat Completions.
  4. Chat Completions API still works — Despite OpenAI promoting the new Responses API, Chat Completions hasn't been deprecated. For simple single-turn JSON extraction, Chat Completions is fine.
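Putting those four points together, a request payload for a GPT-5 model looks like this. `buildClassifyRequest` is an illustrative helper, not my pipeline's code; you'd pass the result to `client.chat.completions.create(...)` from the official `openai` package:

```javascript
// Chat Completions payload for a GPT-5 reasoning model.
function buildClassifyRequest(prompt) {
  return {
    model: 'gpt-5-mini',
    messages: [{ role: 'user', content: prompt }],
    response_format: { type: 'json_object' }, // JSON mode still works
    max_completion_tokens: 1024,              // NOT max_tokens
    // No `temperature` key at all: reasoning models reject it.
  };
}

const request = buildClassifyRequest('Classify this job title…');
console.log('temperature' in request); // false
```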

Timeline

I ran a local pipeline with Ollama for a while, which worked but required my Mac to be on. Moving everything to the cloud took another weekend:

  1. Benchmarked 6 cloud models to find the best fit for classification and extraction
  2. Built the cloud pipeline on GitHub Actions with hybrid model routing
  3. Hardened data quality — HTML-based requirements extraction, fuzzy title dedup, paid-trial badge detection
  4. Result: Fully automated, runs twice a week, creates PRs for review, costs ~$0.35/month
