Damien Alleyne

Posted on • Originally published at blog.alleyne.dev

I Benchmarked 6 LLMs to Automate My Job Board for $0.35/Month

Background

I run a curated remote job board (GlobalRemote) focused on established remote-first companies with transparent salaries that hire globally — companies like GitLab, Automattic, Buffer, and Zapier. I started it last September to scratch my own itch, manually curating every listing — researching interview processes, verifying geographic restrictions, and cross-referencing salary data. That worked for a small board, but didn't scale.

So I built custom Apify scrapers with department-level filtering to pull only engineering, product, design, and data roles from Greenhouse and Ashby boards. That cut the noise by 80%, but I still needed to automatically:

  1. Classify whether a scraped job is a relevant tech role (engineer, designer, data scientist, PM) — department filtering catches the obvious non-tech roles, but borderline titles like "Integrations Consultant" or "Senior Sales Engineer" still slip through
  2. Extract details that aren't in the job board's structured fields — geographic eligibility, regional variants, and salary data when it's buried in the description text

Previously, this pipeline ran locally using Ollama with qwen3:8b. I wanted to move it entirely to the cloud (GitHub Actions) using cheap API models, so it runs automatically twice a week without my local machine.

The Question

Which cloud LLM models give the best accuracy for classification and extraction, at the lowest cost? Should we use the same model for both tasks, or different models for each?

Methodology

Ground Truth Dataset

I built a test set from my own production data:

  • Classification (50 tests): 25 jobs that ARE on my board (known relevant, with expected categories) + 25 jobs that are NOT relevant (sales, marketing, HR, finance, legal titles from the same companies)
  • Extraction (5 tests): Jobs with known geographic badges and salary ranges, covering open-globally, multi-region, us-canada-only, americas-only, and null salary cases
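To make that concrete, here's the rough shape of one test case per task. The field names and the specific salary numbers are illustrative guesses at a reasonable fixture schema, not my actual test files:

```javascript
// Illustrative ground-truth fixtures (hypothetical schema).
const classificationCase = {
  title: 'Senior Sales Engineer', // borderline title from the negative set
  company: 'ExampleCo',           // hypothetical company name
  expected: { isRelevant: false, category: 'other' },
};

const extractionCase = {
  title: 'Staff Backend Engineer',
  // The description text carries the salary range and eligibility wording.
  expected: {
    geoBadge: 'open-globally',
    regionalVariants: null,
    salaryMin: 150000,   // hypothetical values
    salaryMax: 190000,
    salaryCurrency: 'USD',
  },
};
```

A benchmark run scores each model's JSON output against the `expected` object.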

Models Tested

| Model | Provider | Type | Pricing (input/output per 1M tokens) |
|---|---|---|---|
| Claude Haiku 4.5 | Anthropic | Fast inference | $0.80 / $4.00 |
| GPT-5 Mini | OpenAI | Reasoning | ~$0.15 / ~$0.60 |
| GPT-5 Nano | OpenAI | Reasoning (smallest) | ~$0.05 / ~$0.20 |
| Gemini 2.5 Flash | Google | Fast inference | $0.15 / $0.60 |
| Gemini 3 Flash (preview) | Google | Fast inference (preview) | ~$0.15 / ~$0.60 |
| Qwen3 8B | Alibaba (via Ollama) | Open-source | Free |

Note: GPT-4o-mini and Gemini 2.0 Flash were also tested initially but replaced with their successors (GPT-5 Mini, Gemini 2.5 Flash) for the final benchmarks.

Prompts

I used the same prompts across all models — the exact prompts from my production pipeline:

Classification prompt:

```
Classify this job title for a tech job board. Respond with JSON only.

Title: {title}
Company: {company}

{"isRelevant": true/false, "category": "engineering|product|design|data|other", "reason": "1-3 words"}

Rules:
- engineering: Software Engineer, Platform Engineer, SRE, DevOps, QA, Solutions Engineer, Design Engineer, Developer Advocate, Security Engineer
- product: Product Manager ONLY (not Growth Manager, not Renewals Manager)
- design: Product Designer, UX Designer, Visual Designer, Brand Designer
- data: Data Scientist, Data Engineer, ML Engineer, AI Engineer, Research Scientist
- ALL other roles (sales, marketing, HR, support, finance, legal) → isRelevant: false, category: "other"
```

Extraction prompt:

```
Extract job details from this posting. Respond with JSON only.

Title: {title}
Company: {company}
Text: {description}

{
  "geoBadge": "open-globally|americas-only|emea-only|...|multi-region|...",
  "regionalVariants": ["Americas", "EMEA", "APAC"] or null,
  "salaryMin": number or null,
  "salaryMax": number or null,
  "salaryCurrency": "USD" or "EUR" or "GBP" or "CAD" or null,
  "reasoning": "brief explanation"
}
```

Results

Full Comparison Table

| Model | F1 (overall score) | Precision (% flagged that were correct) | Recall (% of relevant jobs caught) | Category | Geography | Salary | Cost/run |
|---|---|---|---|---|---|---|---|
| GPT-5 Mini | 94.1% | 92.3% | 96.0% | 96.0% | 80.0% | 100% | $0.008 |
| Claude Haiku 4.5 | 91.7% | 95.7% | 88.0% | 88.0% | 100% | 100% | $0.019 |
| Gemini 2.5 Flash | 89.8% | 91.7% | 88.0% | 88.0% | 80.0% | 80.0% | $0.003 |
| Gemini 3 Flash (preview) | 89.4% | 95.5% | 84.0% | 84.0% | 40.0% | 40.0% | $0.003 |
| Qwen3 8B | 85.7% | 77.4% | 96.0% | 92.0% | 60.0% | 100% | free |
| GPT-5 Nano | 23.3% | 27.8% | 20.0% | 20.0% | 0.0% | 0.0% | $0.006 |
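For readers who want to sanity-check the headline columns: precision, recall, and F1 all fall out of raw counts. GPT-5 Mini's row corresponds to 24 true positives, 2 false positives, and 1 false negative out of 25 relevant jobs:

```javascript
// Precision, recall, and F1 from true-positive / false-positive /
// false-negative counts.
function scores(tp, fp, fn) {
  const precision = tp / (tp + fp);
  const recall = tp / (tp + fn);
  const f1 = (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}

const gpt5Mini = scores(24, 2, 1);
// precision ≈ 0.923, recall = 0.96, f1 ≈ 0.941
```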

Key Findings

1. GPT-5 Mini is the best classifier

  • F1: 94.1% with 96% recall — it catches nearly every relevant job
  • Only 1 false negative: "Senior Marketing Data Analyst" (ambiguous title)
  • 2 false positives: "Integrations Consultant" and "Senior Sales Engineer" (borderline roles)
  • 96% category accuracy — correctly distinguishes engineering vs. design vs. data

2. Claude Haiku 4.5 is the best extractor

  • 100% geography badge accuracy — correctly identifies open-globally, multi-region, us-canada-only, americas-only
  • 100% salary accuracy — extracts exact numbers and currency, handles null correctly
  • Classification is good (91.7% F1) but misses some edge cases like "Growth Designer"

3. GPT-5 Nano is unusable

  • 23.3% F1, 0% extraction accuracy — frequent JSON parse errors
  • Classified most jobs as irrelevant and couldn't extract structured data
  • Despite the lowest per-token price, it costs MORE per run than Gemini 2.5 Flash (its hidden reasoning tokens are billed as output) while performing far worse
  • Verdict: Do not use GPT-5 Nano for structured extraction tasks

4. Gemini 3 Flash (preview) has JSON reliability issues

  • Classification is decent (89.4% F1) but extraction fails 60% of the time with parse errors
  • The preview model wraps JSON in markdown code blocks or adds commentary
  • Not production-ready yet — wait for GA

5. Gemini 2.5 Flash is the budget option

  • Cheapest cloud model at $0.003/run
  • Decent classification (89.8% F1) but weaker extraction (80% geography, 80% salary)
  • One parse error on a EUR salary extraction test

6. Qwen3 8B is surprisingly capable

  • Free and runs locally
  • 85.7% F1 classification — decent but too many false positives (7)
  • 100% salary extraction but only 60% geography badge accuracy
  • Misclassifies "multi-region" as "open-globally" consistently

Error Pattern Analysis

Common false negatives across models:

  • "Senior Marketing Data Analyst" — every model flagged this as marketing, not data. Ambiguous title.
  • "Data Analyst, Customer Intelligence" — "Customer" in title triggers exclusion
  • "Senior Growth Designer" — "Growth" confuses classification

Common false positives:

  • "Integrations Consultant - Americas" — every model said "engineering" (borderline role)
  • "Senior Sales Engineer" — GPT-5 Mini incorrectly treated as engineering

Extraction patterns:

  • "Americas" region consistently extracted correctly by Claude Haiku
  • "US and Europe" → "multi-region" was the hardest badge to get right
  • EUR salary with different format than USD tripped up Gemini models

Decision: Hybrid Model Strategy

Based on benchmarks, I implemented task-based model routing:

| Task | Primary Model | Fallback | Why |
|---|---|---|---|
| Classification | GPT-5 Mini | Claude Haiku → Ollama | Best F1 (94.1%), best recall (96%), best category accuracy (96%) |
| Extraction | Claude Haiku | GPT-5 Mini → Ollama | Perfect geography + salary (100%) |

Cost Estimate

Per twice-weekly ingestion run (~50-100 jobs to classify, ~10-20 to extract):

| Task | Model | Estimated tokens | Cost |
|---|---|---|---|
| Classification | GPT-5 Mini | ~30K input, ~5K output | ~$0.008 |
| Extraction | Claude Haiku | ~20K input, ~5K output | ~$0.036 |
| **Total per run** | | | ~$0.044 |
| **Monthly (8 runs)** | | | ~$0.35 |
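The arithmetic behind those numbers, using the list prices from the models table (the token counts are the estimates above, so treat the result as a ballpark):

```javascript
// Per-run cost at USD-per-1M-token list prices.
const cost = (inTok, outTok, inPrice, outPrice) =>
  (inTok * inPrice + outTok * outPrice) / 1_000_000;

const classify = cost(30_000, 5_000, 0.15, 0.60); // GPT-5 Mini
const extract = cost(20_000, 5_000, 0.80, 4.00);  // Claude Haiku 4.5
const perRun = classify + extract;

console.log(classify.toFixed(4));     // 0.0075 → rounds to ~$0.008
console.log(extract.toFixed(4));      // 0.0360
console.log((perRun * 8).toFixed(2)); // 0.35 per month at 8 runs
```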

Implementation

The routing is handled in a small LLM abstraction layer:

```javascript
const TASK_ROUTING = {
  classify: ['openai', 'claude', 'ollama'],   // GPT-5 Mini first
  extract:  ['claude', 'openai', 'ollama'],   // Claude Haiku first
  default:  ['claude', 'openai', 'ollama'],
};

// Pipeline calls specify the task:
await batchGenerateJSON(classificationPrompts, { task: 'classify' });
await batchGenerateJSON(extractionPrompts, { task: 'extract' });
```

Each task resolves to the best available model in priority order. If OpenAI is down, classification falls back to Claude Haiku. If Anthropic is down, extraction falls back to GPT-5 Mini. Ollama is available as a fallback for local development, but isn't used in the cloud pipeline.
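A minimal sketch of that resolution loop, restating the routing table for completeness. `generateWith` is a hypothetical stand-in for the per-provider SDK calls in my abstraction layer, not its real API:

```javascript
const TASK_ROUTING = {
  classify: ['openai', 'claude', 'ollama'],
  extract:  ['claude', 'openai', 'ollama'],
  default:  ['claude', 'openai', 'ollama'],
};

// Hypothetical provider dispatch: the real pipeline calls the OpenAI,
// Anthropic, or Ollama client here.
async function generateWith(provider, prompt) {
  throw new Error(`no client configured for ${provider}`);
}

async function generateJSON(prompt, { task = 'default' } = {}) {
  const providers = TASK_ROUTING[task] ?? TASK_ROUTING.default;
  let lastError;
  for (const provider of providers) {
    try {
      return await generateWith(provider, prompt);
    } catch (err) {
      lastError = err; // provider down or unparseable JSON: try the next one
    }
  }
  throw new Error(`all providers failed for task "${task}": ${lastError}`);
}
```

Failures (network errors, parse errors) simply advance to the next provider in the task's list, which is all the fallback behavior described above requires.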

GPT-5 API Gotchas

If you're migrating from GPT-4o-mini to GPT-5 models, watch out for:

  1. No temperature parameter — GPT-5 models are reasoning models (like o1/o3). They don't accept temperature. Remove it entirely.
  2. max_completion_tokens not max_tokens — The parameter name changed for reasoning models.
  3. response_format: { type: 'json_object' } still works — JSON mode is supported via Chat Completions.
  4. Chat Completions API still works — Despite OpenAI promoting the new Responses API, Chat Completions hasn't been deprecated. For simple single-turn JSON extraction, Chat Completions is fine.
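Putting those four points together, a request payload for a GPT-5 model looks like this. `buildClassifyRequest` is an illustrative helper, not my pipeline's code; you'd pass the result to `client.chat.completions.create(...)` from the official `openai` package:

```javascript
// Chat Completions payload for a GPT-5 reasoning model.
function buildClassifyRequest(prompt) {
  return {
    model: 'gpt-5-mini',
    messages: [{ role: 'user', content: prompt }],
    response_format: { type: 'json_object' }, // JSON mode still works
    max_completion_tokens: 1024,              // NOT max_tokens
    // No `temperature` key at all: reasoning models reject it.
  };
}

const request = buildClassifyRequest('Classify this job title…');
console.log('temperature' in request); // false
```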

Timeline

I ran a local pipeline with Ollama for a while, which worked but required my Mac to be on. Moving everything to the cloud took another weekend:

  1. Benchmarked 6 cloud models to find the best fit for classification and extraction
  2. Built the cloud pipeline on GitHub Actions with hybrid model routing
  3. Hardened data quality — HTML-based requirements extraction, fuzzy title dedup, paid-trial badge detection
  4. Result: Fully automated, runs twice a week, creates PRs for review, costs ~$0.35/month
