Background
I run a curated remote job board (GlobalRemote) focused on established remote-first companies with transparent salaries that hire globally — companies like GitLab, Automattic, Buffer, and Zapier. I started it last September to scratch my own itch, manually curating every listing — researching interview processes, verifying geographic restrictions, and cross-referencing salary data. That worked for a small board, but didn't scale.
So I built custom Apify scrapers with department-level filtering to pull only engineering, product, design, and data roles from Greenhouse and Ashby boards. That cut the noise by 80%, but I still needed to automatically:
- Classify whether a scraped job is a relevant tech role (engineer, designer, data scientist, PM) — department filtering catches the obvious non-tech roles, but borderline titles like "Integrations Consultant" or "Senior Sales Engineer" still slip through
- Extract details that aren't in the job board's structured fields — geographic eligibility, regional variants, and salary data when it's buried in the description text
Previously, this pipeline ran locally using Ollama with qwen3:8b. I wanted to move it entirely to the cloud (GitHub Actions) using cheap API models, so it runs automatically twice a week without my local machine.
The Question
Which cloud LLM models give the best accuracy for classification and extraction, at the lowest cost? Should we use the same model for both tasks, or different models for each?
Methodology
Ground Truth Dataset
I built a test set from my own production data:
- Classification (50 tests): 25 jobs that ARE on my board (known relevant, with expected categories) + 25 jobs that are NOT relevant (sales, marketing, HR, finance, legal titles from the same companies)
- Extraction (5 tests): Jobs with known geographic badges and salary ranges, covering open-globally, multi-region, us-canada-only, americas-only, and null-salary cases
Models Tested
| Model | Provider | Type | Pricing (input/output per 1M tokens) |
|---|---|---|---|
| Claude Haiku 4.5 | Anthropic | Fast inference | $0.80 / $4.00 |
| GPT-5 Mini | OpenAI | Reasoning | ~$0.15 / ~$0.60 |
| GPT-5 Nano | OpenAI | Reasoning (smallest) | ~$0.05 / ~$0.20 |
| Gemini 2.5 Flash | Google | Fast inference | $0.15 / $0.60 |
| Gemini 3 Flash (preview) | Google | Fast inference (preview) | ~$0.15 / ~$0.60 |
| Qwen3 8B | Alibaba (via Ollama) | Open-source | Free |
Note: GPT-4o-mini and Gemini 2.0 Flash were also tested initially but replaced with their successors (GPT-5 Mini, Gemini 2.5 Flash) for the final benchmarks.
Prompts
Same prompts used across all models — the exact prompts from my production pipeline:
Classification prompt:
```
Classify this job title for a tech job board. Respond with JSON only.

Title: {title}
Company: {company}

{"isRelevant": true/false, "category": "engineering|product|design|data|other", "reason": "1-3 words"}

Rules:
- engineering: Software Engineer, Platform Engineer, SRE, DevOps, QA, Solutions Engineer, Design Engineer, Developer Advocate, Security Engineer
- product: Product Manager ONLY (not Growth Manager, not Renewals Manager)
- design: Product Designer, UX Designer, Visual Designer, Brand Designer
- data: Data Scientist, Data Engineer, ML Engineer, AI Engineer, Research Scientist
- ALL other roles (sales, marketing, HR, support, finance, legal) → isRelevant: false, category: "other"
```
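In the pipeline, this template gets filled in per job before each API call. A minimal sketch of that step — the helper name and template literal are my illustration, not the production code:

```javascript
// Hypothetical helper: fills the classification prompt template for one job.
function buildClassificationPrompt(title, company) {
  return [
    'Classify this job title for a tech job board. Respond with JSON only.',
    '',
    `Title: ${title}`,
    `Company: ${company}`,
    '',
    '{"isRelevant": true/false, "category": "engineering|product|design|data|other", "reason": "1-3 words"}',
  ].join('\n');
}

const prompt = buildClassificationPrompt('Senior Sales Engineer', 'GitLab');
```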
Extraction prompt:
```
Extract job details from this posting. Respond with JSON only.

Title: {title}
Company: {company}
Text: {description}

{
  "geoBadge": "open-globally|americas-only|emea-only|...|multi-region|...",
  "regionalVariants": ["Americas", "EMEA", "APAC"] or null,
  "salaryMin": number or null,
  "salaryMax": number or null,
  "salaryCurrency": "USD" or "EUR" or "GBP" or "CAD" or null,
  "reasoning": "brief explanation"
}
```
Results
Full Comparison Table
| Model | F1 (overall score) | Precision (% flagged that were correct) | Recall (% of relevant jobs caught) | Category | Geography | Salary | Cost/run |
|---|---|---|---|---|---|---|---|
| GPT-5 Mini | 94.1% | 92.3% | 96.0% | 96.0% | 80.0% | 100% | $0.008 |
| Claude Haiku 4.5 | 91.7% | 95.7% | 88.0% | 88.0% | 100% | 100% | $0.019 |
| Gemini 2.5 Flash | 89.8% | 91.7% | 88.0% | 88.0% | 80.0% | 80.0% | $0.003 |
| Gemini 3 Flash (preview) | 89.4% | 95.5% | 84.0% | 84.0% | 40.0% | 40.0% | $0.003 |
| Qwen3 8B | 85.7% | 77.4% | 96.0% | 92.0% | 60.0% | 100% | free |
| GPT-5 Nano | 23.3% | 27.8% | 20.0% | 20.0% | 0.0% | 0.0% | $0.006 |
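For reference, the F1 column is the harmonic mean of the precision and recall columns; plugging in GPT-5 Mini's numbers reproduces its score:

```javascript
// F1 is the harmonic mean of precision and recall.
function f1(precision, recall) {
  return (2 * precision * recall) / (precision + recall);
}

// GPT-5 Mini: precision 92.3%, recall 96.0% → F1 ≈ 94.1%
const score = f1(0.923, 0.96);
```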
Key Findings
1. GPT-5 Mini is the best classifier
- F1: 94.1% with 96% recall — it catches nearly every relevant job
- Only 1 false negative: "Senior Marketing Data Analyst" (ambiguous title)
- 2 false positives: "Integrations Consultant" and "Senior Sales Engineer" (borderline roles)
- 96% category accuracy — correctly distinguishes engineering vs. design vs. data
2. Claude Haiku 4.5 is the best extractor
- 100% geography badge accuracy — correctly identifies open-globally, multi-region, us-canada-only, americas-only
- 100% salary accuracy — extracts exact numbers and currency, handles null correctly
- Classification is good (91.7% F1) but misses some edge cases like "Growth Designer"
3. GPT-5 Nano is unusable
- 23.3% F1, 0% extraction accuracy — pervasive JSON parse errors
- Classified most jobs as irrelevant and couldn't extract structured data
- Despite the lowest per-token pricing, its reasoning-token overhead makes it cost MORE per run than Gemini 2.5 Flash — while performing far worse
- Verdict: Do not use GPT-5 Nano for structured extraction tasks
4. Gemini 3 Flash (preview) has JSON reliability issues
- Classification is decent (89.4% F1) but extraction fails 60% of the time with parse errors
- The preview model wraps JSON in markdown code blocks or adds commentary
- Not production-ready yet — wait for GA
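A cheap defense against this failure mode is to strip markdown fences before parsing. A sketch of the kind of tolerant parser I mean (illustrative, not my exact production code):

```javascript
// Tolerant JSON parser: strips the ```json fences some models wrap around output.
function parseModelJSON(raw) {
  const cleaned = raw
    .trim()
    .replace(/^```(?:json)?\s*/i, '') // leading fence, with or without language tag
    .replace(/```\s*$/, '');          // trailing fence
  try {
    return JSON.parse(cleaned);
  } catch {
    return null; // caller treats null as a parse failure and can retry or fall back
  }
}
```

This doesn't rescue responses with freeform commentary mixed into the JSON, but it recovers the common "valid JSON inside a code block" case for free.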
5. Gemini 2.5 Flash is the budget option
- Cheapest cloud model at $0.003/run
- Decent classification (89.8% F1) but weaker extraction (80% geography, 80% salary)
- One parse error on a EUR salary extraction test
6. Qwen3 8B is surprisingly capable
- Free and runs locally
- 85.7% F1 classification — decent but too many false positives (7)
- 100% salary extraction but only 60% geography badge accuracy
- Misclassifies "multi-region" as "open-globally" consistently
Error Pattern Analysis
Common false negatives across models:
- "Senior Marketing Data Analyst" — every model flagged this as marketing, not data. Ambiguous title.
- "Data Analyst, Customer Intelligence" — "Customer" in title triggers exclusion
- "Senior Growth Designer" — "Growth" confuses classification
Common false positives:
- "Integrations Consultant - Americas" — every model said "engineering" (borderline role)
- "Senior Sales Engineer" — GPT-5 Mini incorrectly treated as engineering
Extraction patterns:
- "Americas" region consistently extracted correctly by Claude Haiku
- "US and Europe" → "multi-region" was the hardest badge to get right
- EUR salary with different format than USD tripped up Gemini models
Decision: Hybrid Model Strategy
Based on benchmarks, I implemented task-based model routing:
| Task | Primary Model | Fallback | Why |
|---|---|---|---|
| Classification | GPT-5 Mini | Claude Haiku → Ollama | Best F1 (94.1%), best recall (96%), best category accuracy (96%) |
| Extraction | Claude Haiku | GPT-5 Mini → Ollama | Perfect geography + salary (100%) |
Cost Estimate
Per twice-weekly ingestion run (~50-100 jobs to classify, ~10-20 to extract):
| Task | Model | Estimated tokens | Cost |
|---|---|---|---|
| Classification | GPT-5 Mini | ~30K input, ~5K output | ~$0.008 |
| Extraction | Claude Haiku | ~20K input, ~5K output | ~$0.036 |
| Total per run | | | ~$0.044 |
| Monthly (8 runs) | | | ~$0.35 |
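These numbers fall straight out of the pricing table. A sanity-check calculation (per-1M-token prices from the Models Tested table, upper-end token estimates):

```javascript
// Cost per run = input tokens × input price + output tokens × output price,
// with prices quoted per 1M tokens.
function runCost(inTokens, outTokens, inPricePerM, outPricePerM) {
  return (inTokens * inPricePerM + outTokens * outPricePerM) / 1e6;
}

const classify = runCost(30_000, 5_000, 0.15, 0.60); // GPT-5 Mini   → ~$0.0075
const extract  = runCost(20_000, 5_000, 0.80, 4.00); // Claude Haiku → $0.036
const monthly  = 8 * (classify + extract);           // ~$0.35
```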
Implementation
The routing is handled in a small LLM abstraction layer:
```js
const TASK_ROUTING = {
  classify: ['openai', 'claude', 'ollama'], // GPT-5 Mini first
  extract: ['claude', 'openai', 'ollama'],  // Claude Haiku first
  default: ['claude', 'openai', 'ollama'],
};

// Pipeline calls specify the task:
await batchGenerateJSON(classificationPrompts, { task: 'classify' });
await batchGenerateJSON(extractionPrompts, { task: 'extract' });
```
Each task resolves to the best available model in priority order. If OpenAI is down, classification falls back to Claude Haiku. If Anthropic is down, extraction falls back to GPT-5 Mini. Ollama is available as a fallback for local development, but isn't used in the cloud pipeline.
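Resolution itself can be as simple as taking the first healthy provider in the task's priority list. A self-contained sketch (`resolveProvider` and the availability predicate are illustrative, not the actual abstraction layer; `TASK_ROUTING` is repeated so the snippet runs on its own):

```javascript
const TASK_ROUTING = {
  classify: ['openai', 'claude', 'ollama'],
  extract: ['claude', 'openai', 'ollama'],
  default: ['claude', 'openai', 'ollama'],
};

// Illustrative: pick the first provider in priority order that is available.
function resolveProvider(task, isAvailable) {
  const order = TASK_ROUTING[task] ?? TASK_ROUTING.default;
  return order.find(isAvailable) ?? null;
}

// If OpenAI is down, classification falls back to Claude:
const fallback = resolveProvider('classify', (p) => p !== 'openai');
```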
GPT-5 API Gotchas
If you're migrating from GPT-4o-mini to GPT-5 models, watch out for:
- No `temperature` parameter — GPT-5 models are reasoning models (like o1/o3). They don't accept `temperature`. Remove it entirely.
- `max_completion_tokens`, not `max_tokens` — the parameter name changed for reasoning models.
- `response_format: { type: 'json_object' }` still works — JSON mode is supported via Chat Completions.
- Chat Completions API still works — despite OpenAI promoting the new Responses API, Chat Completions hasn't been deprecated. For simple single-turn JSON extraction, Chat Completions is fine.
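Put together, a Chat Completions request body for GPT-5 Mini looks like this (the model string, token limit, and prompt are illustrative placeholders; the key points are the parameters discussed above):

```javascript
// Placeholder prompt for illustration.
const classificationPrompt =
  'Classify this job title for a tech job board. Respond with JSON only.';

// Request body for a GPT-5 reasoning model via Chat Completions.
// Note: no `temperature`, and `max_completion_tokens` instead of `max_tokens`.
const body = {
  model: 'gpt-5-mini',
  messages: [{ role: 'user', content: classificationPrompt }],
  max_completion_tokens: 1024,              // renamed from max_tokens
  response_format: { type: 'json_object' }, // JSON mode still supported
};
```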
Timeline
I ran a local pipeline with Ollama for a while, which worked but required my Mac to be on. Moving everything to the cloud was another weekend:
- Benchmarked 6 cloud models to find the best fit for classification and extraction
- Built the cloud pipeline on GitHub Actions with hybrid model routing
- Hardened data quality — HTML-based requirements extraction, fuzzy title dedup, paid-trial badge detection
- Result: Fully automated, runs twice a week, creates PRs for review, costs ~$0.35/month