<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Damien Alleyne</title>
    <description>The latest articles on DEV Community by Damien Alleyne (@dalleyne).</description>
    <link>https://dev.to/dalleyne</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3748529%2F28d6f35f-9dc8-4abf-9573-3274a2f19a03.jpeg</url>
      <title>DEV Community: Damien Alleyne</title>
      <link>https://dev.to/dalleyne</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dalleyne"/>
    <language>en</language>
    <item>
      <title>The Job Isn't Writing Code. It's Knowing When the AI Is Wrong.</title>
      <dc:creator>Damien Alleyne</dc:creator>
      <pubDate>Fri, 20 Feb 2026 21:32:29 +0000</pubDate>
      <link>https://dev.to/dalleyne/the-job-isnt-writing-code-its-knowing-when-the-ai-is-wrong-4fek</link>
      <guid>https://dev.to/dalleyne/the-job-isnt-writing-code-its-knowing-when-the-ai-is-wrong-4fek</guid>
      <description>&lt;p&gt;I use an AI coding agent for almost everything on my job board &lt;a href="https://jobs.alleyne.dev/" rel="noopener noreferrer"&gt;GlobalRemote&lt;/a&gt;. It writes my scrapers, builds my CI pipelines, architects my database schemas. It's written the vast majority of the codebase.&lt;/p&gt;

&lt;p&gt;After a few months of building this way, I've noticed a pattern: the most valuable thing I do isn't writing code. It's catching where the AI gets it wrong — specifically the cases where the output looks correct but doesn't hold up once you think about it.&lt;/p&gt;

&lt;p&gt;Here are three recent examples.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. The Wrong Tool for the Job
&lt;/h2&gt;

&lt;p&gt;My pipeline extracts tech stack requirements from job postings using regex. A role showed up on the board with no tech stack listed. The AI investigated, found the regex wasn't matching that posting's format, and proposed expanding the regex pattern.&lt;/p&gt;

&lt;p&gt;Fair enough. But we already had LLMs classifying and extracting other fields from these same job descriptions. Why maintain a brittle regex when we could use the LLM we're already paying for?&lt;/p&gt;
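&lt;p&gt;The original pattern isn't shown here, but the failure mode is easy to sketch: a regex tuned to one heading style silently returns nothing for any other phrasing. The pattern and strings below are hypothetical, not my pipeline's actual code.&lt;/p&gt;

```javascript
// Hypothetical sketch -- not the actual pattern from my pipeline.
// A regex tuned to "Tech Stack: React, Node.js" style headings:
const TECH_STACK_RE = /Tech Stack:\s*(.+)/i;

function extractTechStack(description) {
  const match = description.match(TECH_STACK_RE);
  return match ? match[1].split(',').map(s => s.trim()) : [];
}

// Matches the format it was written for:
extractTechStack('Tech Stack: React, Node.js, Postgres');
// → ['React', 'Node.js', 'Postgres']

// Silently returns nothing for a posting that phrases it differently:
extractTechStack('Technologies we use include React and Node.js');
// → []
```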

&lt;p&gt;The agent agreed and built the LLM-based extraction instead. More resilient, handles edge cases the regex never would have caught.&lt;/p&gt;

&lt;p&gt;The AI optimized within the current approach. I questioned whether the approach itself was right. That's a pattern I keep seeing — AI agents are excellent at solving the problem you give them, but they don't question whether you're solving the right problem. That's still on you.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Technically Correct, Actually Misleading
&lt;/h2&gt;

&lt;p&gt;My pipeline extracted geographic data from a GitLab job posting — a role open in the US, Canada, France, Germany, Ireland, Netherlands, Spain, and the UK — and tagged it as &lt;code&gt;multi-region&lt;/code&gt; with regions &lt;code&gt;Americas&lt;/code&gt; and &lt;code&gt;Europe&lt;/code&gt;. I asked the agent to verify. It confirmed the data was accurate — the posting listed countries across both regions.&lt;/p&gt;

&lt;p&gt;The problem: if a user from Brazil sees "Americas", they'll assume they can apply. Someone in Hungary sees "Europe", same thing. But this job is only open in 8 specific countries.&lt;/p&gt;

&lt;p&gt;The agent hadn't considered this. It checked my existing data, found I already had a &lt;code&gt;select-countries&lt;/code&gt; badge for this situation, updated the job, and then updated the LLM extraction prompt so the system would get this distinction right on future runs.&lt;/p&gt;
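&lt;p&gt;The rule behind the fix can be sketched in a few lines. The badge names are real; the region table and the &lt;code&gt;openToEntireRegions&lt;/code&gt; flag are illustrative:&lt;/p&gt;

```javascript
// Sketch of the badge rule. Badge names ('multi-region', 'select-countries')
// come from my pipeline; the region table and flag are illustrative.
const REGION_OF = { US: 'Americas', CA: 'Americas', BR: 'Americas',
                    FR: 'Europe', DE: 'Europe', HU: 'Europe', GB: 'Europe' };

function crossRegionBadge(posting) {
  const regions = new Set(posting.countries.map(c => REGION_OF[c]));
  if (regions.size > 1) {
    // Spans regions only via an explicit country list: a reader in Brazil
    // or Hungary is NOT eligible, so don't advertise whole regions.
    return posting.openToEntireRegions ? 'multi-region' : 'select-countries';
  }
  return null; // single-region cases get other badges
}

// The GitLab posting: named countries across the Americas and Europe.
crossRegionBadge({ countries: ['US', 'CA', 'FR', 'DE', 'GB'], openToEntireRegions: false });
// → 'select-countries'
```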

&lt;p&gt;I caught this because I've been the person in a non-obvious country getting excluded from roles that say "Americas" or "Global Remote." I've had Zapier, Outliant, and others reject me on location after their postings implied I was eligible.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. The Silent Failure
&lt;/h2&gt;

&lt;p&gt;My pipeline ran on schedule. Scraped 39 jobs. Processed them. Reported: "No new entries to add." No errors, clean exit.&lt;/p&gt;

&lt;p&gt;Zero new jobs from 39 listings didn't seem right. I pulled the raw data and asked the agent to audit its own pipeline's decisions.&lt;/p&gt;

&lt;p&gt;It found two bugs. One was a dedup rule incorrectly matching a new job against a discontinued listing with a similar title — different posting, different job ID, valid salary data, silently dropped. The other was a salary field that the pipeline never parsed, so jobs with visible salary data were being dropped for "no salary transparency."&lt;/p&gt;

&lt;p&gt;The pipeline didn't error or warn. It reported success while quietly dropping valid jobs.&lt;/p&gt;
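&lt;p&gt;A cheap defense against this class of failure is an invariant check at the end of each run: every scraped job must be accounted for as added, deduped, or rejected, and "zero added" gets flagged explicitly. This is a hypothetical sketch, not my actual pipeline code:&lt;/p&gt;

```javascript
// Hypothetical guard: account for every scraped job so silent drops
// become loud. The counts below mirror the incident (39 in, 0 out).
function auditRun(stats) {
  const accounted = stats.added + stats.duplicates + stats.rejected;
  const warnings = [];
  if (accounted !== stats.scraped) {
    warnings.push(`${stats.scraped - accounted} jobs vanished without a recorded reason`);
  }
  if (stats.added === 0) {
    if (stats.scraped > 0) {
      warnings.push('scraped jobs but added none: check dedup and rejection rules');
    }
  }
  return warnings;
}

auditRun({ scraped: 39, added: 0, duplicates: 5, rejected: 10 });
// → two warnings (24 unaccounted jobs; zero additions from 39 scraped)
```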

&lt;p&gt;I didn't catch this by reading code. I caught it because the output didn't pass a gut check.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;Ben Shoemaker wrote a &lt;a href="https://www.benshoemaker.us/writing/in-defense-of-not-reading-the-code/" rel="noopener noreferrer"&gt;piece recently&lt;/a&gt; arguing that engineers should stop reading code line-by-line and invest in the "harness" — specs, tests, verification layers, trust boundaries. OpenAI calls this &lt;a href="https://openai.com/index/harness-engineering/" rel="noopener noreferrer"&gt;Harness Engineering&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Looking at these three examples through that lens, that's what I've been doing without realizing it. The AI handles production. I handle specification, trust boundaries, and the "does this actually make sense for my users?" layer.&lt;/p&gt;

&lt;p&gt;If you're an engineer building with AI tools right now, I'd suggest paying attention to the moments where you override the AI's suggestions. Those moments aren't interruptions to your workflow — they're the most valuable part of it. That's the skill set the market is shifting toward, and it's worth documenting for yourself even if you never publish it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm a Senior Software Engineer with over a decade of experience, including building internationalization systems serving 50M+ users. I write about building with AI at&lt;/em&gt; &lt;a href="http://blog.alleyne.dev" rel="noopener noreferrer"&gt;&lt;em&gt;blog.alleyne.dev&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>career</category>
      <category>coding</category>
    </item>
    <item>
      <title>I Benchmarked 6 LLMs to Automate My Job Board for $0.35/Month</title>
      <dc:creator>Damien Alleyne</dc:creator>
      <pubDate>Tue, 10 Feb 2026 14:00:26 +0000</pubDate>
      <link>https://dev.to/dalleyne/i-benchmarked-6-llms-to-automate-my-job-board-for-035month-3j3a</link>
      <guid>https://dev.to/dalleyne/i-benchmarked-6-llms-to-automate-my-job-board-for-035month-3j3a</guid>
      <description>&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;I run a curated remote job board (&lt;a href="https://jobs.alleyne.dev" rel="noopener noreferrer"&gt;GlobalRemote&lt;/a&gt;) focused on established remote-first companies with transparent salaries that hire globally — companies like GitLab, Automattic, Buffer, and Zapier. I &lt;a href="https://blog.alleyne.dev/from-interview-surprise-to-mvp-testing-developer-job-transparency" rel="noopener noreferrer"&gt;started it last September&lt;/a&gt; to scratch my own itch, manually curating every listing — researching interview processes, verifying geographic restrictions, and cross-referencing salary data. That worked for a small board, but didn't scale.&lt;/p&gt;

&lt;p&gt;So I &lt;a href="https://blog.alleyne.dev/i-built-2-job-scrapers-in-one-weekend-to-avoid-paying-for-data" rel="noopener noreferrer"&gt;built custom Apify scrapers&lt;/a&gt; with department-level filtering to pull only engineering, product, design, and data roles from Greenhouse and Ashby boards. That cut the noise by 80%, but I still needed to automatically:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Classify&lt;/strong&gt; whether a scraped job is a relevant tech role (engineer, designer, data scientist, PM) — department filtering catches the obvious non-tech roles, but borderline titles like "Integrations Consultant" or "Senior Sales Engineer" still slip through&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extract&lt;/strong&gt; details that aren't in the job board's structured fields — geographic eligibility, regional variants, and salary data when it's buried in the description text&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Previously, this pipeline ran locally using Ollama with qwen3:8b. I wanted to move it entirely to the cloud (GitHub Actions) using cheap API models, so it runs automatically twice a week without my local machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Question
&lt;/h2&gt;

&lt;p&gt;Which cloud LLM models give the best accuracy for classification and extraction, at the lowest cost? Should we use the same model for both tasks, or different models for each?&lt;/p&gt;

&lt;h2&gt;
  
  
  Methodology
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Ground Truth Dataset
&lt;/h3&gt;

&lt;p&gt;I built a test set from my own production data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Classification (50 tests):&lt;/strong&gt; 25 jobs that ARE on my board (known relevant, with expected categories) + 25 jobs that are NOT relevant (sales, marketing, HR, finance, legal titles from the same companies)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extraction (5 tests):&lt;/strong&gt; Jobs with known geographic badges and salary ranges, covering &lt;code&gt;open-globally&lt;/code&gt;, &lt;code&gt;multi-region&lt;/code&gt;, &lt;code&gt;us-canada-only&lt;/code&gt;, &lt;code&gt;americas-only&lt;/code&gt;, and &lt;code&gt;null&lt;/code&gt; salary cases&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Models Tested
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Pricing (input/output per 1M tokens)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Haiku 4.5&lt;/td&gt;
&lt;td&gt;Anthropic&lt;/td&gt;
&lt;td&gt;Fast inference&lt;/td&gt;
&lt;td&gt;$0.80 / $4.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5 Mini&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;Reasoning&lt;/td&gt;
&lt;td&gt;~$0.15 / ~$0.60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5 Nano&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;Reasoning (smallest)&lt;/td&gt;
&lt;td&gt;~$0.05 / ~$0.20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 2.5 Flash&lt;/td&gt;
&lt;td&gt;Google&lt;/td&gt;
&lt;td&gt;Fast inference&lt;/td&gt;
&lt;td&gt;$0.15 / $0.60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 3 Flash (preview)&lt;/td&gt;
&lt;td&gt;Google&lt;/td&gt;
&lt;td&gt;Fast inference (preview)&lt;/td&gt;
&lt;td&gt;~$0.15 / ~$0.60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3 8B&lt;/td&gt;
&lt;td&gt;Alibaba (via Ollama)&lt;/td&gt;
&lt;td&gt;Open-source&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; GPT-4o-mini and Gemini 2.0 Flash were also tested initially but replaced with their successors (GPT-5 Mini, Gemini 2.5 Flash) for the final benchmarks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompts
&lt;/h3&gt;

&lt;p&gt;Same prompts used across all models — the exact prompts from my production pipeline:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Classification prompt:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Classify this job title for a tech job board. Respond with JSON only.

Title: {title}
Company: {company}

{"isRelevant": true/false, "category": "engineering|product|design|data|other", "reason": "1-3 words"}

Rules:
- engineering: Software Engineer, Platform Engineer, SRE, DevOps, QA, Solutions Engineer, Design Engineer, Developer Advocate, Security Engineer
- product: Product Manager ONLY (not Growth Manager, not Renewals Manager)
- design: Product Designer, UX Designer, Visual Designer, Brand Designer
- data: Data Scientist, Data Engineer, ML Engineer, AI Engineer, Research Scientist
- ALL other roles (sales, marketing, HR, support, finance, legal) → isRelevant: false, category: "other"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Extraction prompt:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Extract job details from this posting. Respond with JSON only.

Title: {title}
Company: {company}
Text: {description}

{
  "geoBadge": "open-globally|americas-only|emea-only|...|multi-region|...",
  "regionalVariants": ["Americas", "EMEA", "APAC"] or null,
  "salaryMin": number or null,
  "salaryMax": number or null,
  "salaryCurrency": "USD" or "EUR" or "GBP" or "CAD" or null,
  "reasoning": "brief explanation"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Full Comparison Table
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;F1 (overall score)&lt;/th&gt;
&lt;th&gt;Precision (% flagged that were correct)&lt;/th&gt;
&lt;th&gt;Recall (% of relevant jobs caught)&lt;/th&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Geography&lt;/th&gt;
&lt;th&gt;Salary&lt;/th&gt;
&lt;th&gt;Cost/run&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPT-5 Mini&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;94.1%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;92.3%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;96.0%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;96.0%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;80.0%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.008&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Claude Haiku 4.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;91.7%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;95.7%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;88.0%&lt;/td&gt;
&lt;td&gt;88.0%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.019&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 2.5 Flash&lt;/td&gt;
&lt;td&gt;89.8%&lt;/td&gt;
&lt;td&gt;91.7%&lt;/td&gt;
&lt;td&gt;88.0%&lt;/td&gt;
&lt;td&gt;88.0%&lt;/td&gt;
&lt;td&gt;80.0%&lt;/td&gt;
&lt;td&gt;80.0%&lt;/td&gt;
&lt;td&gt;$0.003&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 3 Flash (preview)&lt;/td&gt;
&lt;td&gt;89.4%&lt;/td&gt;
&lt;td&gt;95.5%&lt;/td&gt;
&lt;td&gt;84.0%&lt;/td&gt;
&lt;td&gt;84.0%&lt;/td&gt;
&lt;td&gt;40.0%&lt;/td&gt;
&lt;td&gt;40.0%&lt;/td&gt;
&lt;td&gt;$0.003&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3 8B&lt;/td&gt;
&lt;td&gt;85.7%&lt;/td&gt;
&lt;td&gt;77.4%&lt;/td&gt;
&lt;td&gt;96.0%&lt;/td&gt;
&lt;td&gt;92.0%&lt;/td&gt;
&lt;td&gt;60.0%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5 Nano&lt;/td&gt;
&lt;td&gt;23.3%&lt;/td&gt;
&lt;td&gt;27.8%&lt;/td&gt;
&lt;td&gt;20.0%&lt;/td&gt;
&lt;td&gt;20.0%&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;td&gt;$0.006&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
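&lt;p&gt;F1 is the harmonic mean of precision and recall. GPT-5 Mini's row falls straight out of its confusion counts: 24 true positives, 2 false positives, and 1 false negative on the 25 relevant jobs.&lt;/p&gt;

```javascript
// Reproducing GPT-5 Mini's row from its confusion counts.
function f1Score(tp, fp, fn) {
  const precision = tp / (tp + fp);  // share of flagged jobs that were correct
  const recall = tp / (tp + fn);     // share of relevant jobs caught
  return { precision, recall, f1: (2 * precision * recall) / (precision + recall) };
}

const scores = f1Score(24, 2, 1);
// precision ~0.923, recall = 0.960, f1 ~0.941 -- matching the table
```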

&lt;h3&gt;
  
  
  Key Findings
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. GPT-5 Mini is the best classifier
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;F1: 94.1%&lt;/strong&gt; with 96% recall — it catches nearly every relevant job&lt;/li&gt;
&lt;li&gt;Only 1 false negative: "Senior Marketing Data Analyst" (ambiguous title)&lt;/li&gt;
&lt;li&gt;2 false positives: "Integrations Consultant" and "Senior Sales Engineer" (borderline roles)&lt;/li&gt;
&lt;li&gt;96% category accuracy — correctly distinguishes engineering vs. design vs. data&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  2. Claude Haiku 4.5 is the best extractor
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;100% geography badge accuracy&lt;/strong&gt; — correctly identifies open-globally, multi-region, us-canada-only, americas-only&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;100% salary accuracy&lt;/strong&gt; — extracts exact numbers and currency, handles null correctly&lt;/li&gt;
&lt;li&gt;Classification is good (91.7% F1) but misses some edge cases like "Growth Designer"&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  3. GPT-5 Nano is unusable
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;23.3% F1, 0% extraction accuracy — pervasive JSON parse errors&lt;/li&gt;
&lt;li&gt;Classified most jobs as irrelevant, couldn't extract structured data&lt;/li&gt;
&lt;li&gt;Despite having the lowest per-token price, it cost twice as much per run as Gemini 2.5 Flash ($0.006 vs. $0.003) while being terrible&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Verdict: Do not use GPT-5 Nano for structured extraction tasks&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  4. Gemini 3 Flash (preview) has JSON reliability issues
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Classification is decent (89.4% F1) but extraction fails 60% of the time with parse errors&lt;/li&gt;
&lt;li&gt;The preview model wraps JSON in markdown code blocks or adds commentary&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not production-ready yet&lt;/strong&gt; — wait for GA&lt;/li&gt;
&lt;/ul&gt;
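&lt;p&gt;A defensive parser can absorb the code-fence failure mode in the meantime; a minimal sketch (it doesn't handle added commentary, only markdown wrapping):&lt;/p&gt;

```javascript
// Minimal defensive parse: strip a markdown code fence, if present,
// before handing the text to JSON.parse.
function parseModelJSON(raw) {
  let text = raw.trim();
  const fence = text.match(/^```(?:json)?\s*([\s\S]*?)\s*```$/);
  if (fence) text = fence[1];
  return JSON.parse(text);
}

parseModelJSON('```json\n{"geoBadge": "open-globally"}\n```').geoBadge;
// → 'open-globally'
parseModelJSON('{"geoBadge": "emea-only"}').geoBadge; // plain JSON still works
```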

&lt;h4&gt;
  
  
  5. Gemini 2.5 Flash is the budget option
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Cheapest cloud model at $0.003/run&lt;/li&gt;
&lt;li&gt;Decent classification (89.8% F1) but weaker extraction (80% geography, 80% salary)&lt;/li&gt;
&lt;li&gt;One parse error on a EUR salary extraction test&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  6. Qwen3 8B is surprisingly capable
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Free and runs locally&lt;/li&gt;
&lt;li&gt;85.7% F1 classification — decent but too many false positives (7)&lt;/li&gt;
&lt;li&gt;100% salary extraction but only 60% geography badge accuracy&lt;/li&gt;
&lt;li&gt;Misclassifies "multi-region" as "open-globally" consistently&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Error Pattern Analysis
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Common false negatives across models:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Senior Marketing Data Analyst" — every model flagged this as marketing, not data. Ambiguous title.&lt;/li&gt;
&lt;li&gt;"Data Analyst, Customer Intelligence" — "Customer" in title triggers exclusion&lt;/li&gt;
&lt;li&gt;"Senior Growth Designer" — "Growth" confuses classification&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Common false positives:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Integrations Consultant - Americas" — every model said "engineering" (borderline role)&lt;/li&gt;
&lt;li&gt;"Senior Sales Engineer" — GPT-5 Mini incorrectly treated as engineering&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Extraction patterns:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Americas" region consistently extracted correctly by Claude Haiku&lt;/li&gt;
&lt;li&gt;"US and Europe" → "multi-region" was the hardest badge to get right&lt;/li&gt;
&lt;li&gt;EUR salary with different format than USD tripped up Gemini models&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Decision: Hybrid Model Strategy
&lt;/h2&gt;

&lt;p&gt;Based on benchmarks, I implemented &lt;strong&gt;task-based model routing&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Primary Model&lt;/th&gt;
&lt;th&gt;Fallback&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Classification&lt;/td&gt;
&lt;td&gt;GPT-5 Mini&lt;/td&gt;
&lt;td&gt;Claude Haiku → Ollama&lt;/td&gt;
&lt;td&gt;Best F1 (94.1%), best recall (96%), best category accuracy (96%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Extraction&lt;/td&gt;
&lt;td&gt;Claude Haiku&lt;/td&gt;
&lt;td&gt;GPT-5 Mini → Ollama&lt;/td&gt;
&lt;td&gt;Perfect geography + salary (100%)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Cost Estimate
&lt;/h3&gt;

&lt;p&gt;Per twice-weekly ingestion run (~50-100 jobs to classify, ~10-20 to extract):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Estimated tokens&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Classification&lt;/td&gt;
&lt;td&gt;GPT-5 Mini&lt;/td&gt;
&lt;td&gt;~30K input, ~5K output&lt;/td&gt;
&lt;td&gt;~$0.008&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Extraction&lt;/td&gt;
&lt;td&gt;Claude Haiku&lt;/td&gt;
&lt;td&gt;~20K input, ~5K output&lt;/td&gt;
&lt;td&gt;~$0.036&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total per run&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$0.044&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Monthly (8 runs)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$0.35&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
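&lt;p&gt;These numbers fall straight out of the per-token prices in the models table:&lt;/p&gt;

```javascript
// Sanity-checking the cost table: prices are USD per 1M tokens.
const costUSD = (inTok, outTok, inPrice, outPrice) =>
  (inTok * inPrice + outTok * outPrice) / 1e6;

const classification = costUSD(30000, 5000, 0.15, 0.60); // GPT-5 Mini: $0.0075
const extraction = costUSD(20000, 5000, 0.80, 4.00);     // Claude Haiku: $0.036
const monthly = 8 * (classification + extraction);       // 8 runs: ~$0.35
```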

&lt;h3&gt;
  
  
  Implementation
&lt;/h3&gt;

&lt;p&gt;The routing is handled in a small LLM abstraction layer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;TASK_ROUTING&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;classify&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;openai&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;claude&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ollama&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;   &lt;span class="c1"&gt;// GPT-5 Mini first&lt;/span&gt;
  &lt;span class="na"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;claude&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;openai&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ollama&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;   &lt;span class="c1"&gt;// Claude Haiku first&lt;/span&gt;
  &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;claude&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;openai&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ollama&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="c1"&gt;// Pipeline calls specify the task:&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;batchGenerateJSON&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;classificationPrompts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;classify&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;batchGenerateJSON&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;extractionPrompts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;extract&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each task resolves to the best available model in priority order. If OpenAI is down, classification falls back to Claude Haiku. If Anthropic is down, extraction falls back to GPT-5 Mini. Ollama is available as a fallback for local development, but isn't used in the cloud pipeline.&lt;/p&gt;
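&lt;p&gt;The fallback walk itself is only a few lines. Sketched here synchronously with stand-in clients; the real provider calls are async API requests:&lt;/p&gt;

```javascript
// Sketch of the priority-order fallback. Real provider clients are async
// API calls; the stand-ins here keep the example self-contained.
function generateWithFallback(providers, clients, prompt) {
  let lastError;
  for (const name of providers) {
    try {
      return clients[name](prompt); // first provider that succeeds wins
    } catch (err) {
      lastError = err; // outage or parse failure: try the next provider
    }
  }
  throw lastError; // every provider failed
}

const clients = {
  openai: () => { throw new Error('503 from OpenAI'); },
  claude: () => ({ isRelevant: true, category: 'engineering' }),
};
generateWithFallback(['openai', 'claude', 'ollama'], clients, 'Classify: ...');
// → { isRelevant: true, category: 'engineering' } -- Claude answered
```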

&lt;h2&gt;
  
  
  GPT-5 API Gotchas
&lt;/h2&gt;

&lt;p&gt;If you're migrating from GPT-4o-mini to GPT-5 models, watch out for:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;No &lt;code&gt;temperature&lt;/code&gt; parameter&lt;/strong&gt; — GPT-5 models are reasoning models (like o1/o3). They don't accept &lt;code&gt;temperature&lt;/code&gt;. Remove it entirely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;max_completion_tokens&lt;/code&gt; not &lt;code&gt;max_tokens&lt;/code&gt;&lt;/strong&gt; — The parameter name changed for reasoning models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;response_format: { type: 'json_object' }&lt;/code&gt; still works&lt;/strong&gt; — JSON mode is supported via Chat Completions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chat Completions API still works&lt;/strong&gt; — Despite OpenAI promoting the new Responses API, Chat Completions hasn't been deprecated. For simple single-turn JSON extraction, Chat Completions is fine.&lt;/li&gt;
&lt;/ol&gt;
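&lt;p&gt;Put together, a Chat Completions request body for GPT-5 Mini looks roughly like this. The parameter names are the documented ones; the model id string is an assumption, so check your provider dashboard:&lt;/p&gt;

```javascript
// Request body shape after the migration. Note: no `temperature` key at all,
// and `max_completion_tokens` rather than `max_tokens`.
const body = {
  model: 'gpt-5-mini',                       // model id assumed
  messages: [{ role: 'user', content: 'Classify this job title ...' }],
  max_completion_tokens: 1000,               // renamed for reasoning models
  response_format: { type: 'json_object' },  // JSON mode still works
};
// POST this as JSON to /v1/chat/completions with your usual auth header.
```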

&lt;h2&gt;
  
  
  Timeline
&lt;/h2&gt;

&lt;p&gt;I ran a local pipeline with Ollama for a while, which worked but required my Mac to be on. Moving everything to the cloud was another weekend:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Benchmarked 6 cloud models to find the best fit for classification and extraction&lt;/li&gt;
&lt;li&gt;Built the cloud pipeline on GitHub Actions with hybrid model routing&lt;/li&gt;
&lt;li&gt;Hardened data quality — HTML-based requirements extraction, fuzzy title dedup, paid-trial badge detection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result:&lt;/strong&gt; Fully automated, runs twice a week, creates PRs for review, costs ~$0.35/month&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>webscraping</category>
      <category>llm</category>
    </item>
    <item>
      <title>I Built 2 Job Scrapers in One Weekend to Avoid Paying for Data</title>
      <dc:creator>Damien Alleyne</dc:creator>
      <pubDate>Mon, 02 Feb 2026 17:26:51 +0000</pubDate>
      <link>https://dev.to/dalleyne/i-built-2-job-scrapers-in-one-weekend-to-avoid-paying-for-data-4njp</link>
      <guid>https://dev.to/dalleyne/i-built-2-job-scrapers-in-one-weekend-to-avoid-paying-for-data-4njp</guid>
      <description>&lt;p&gt;I run &lt;a href="https://jobs.alleyne.dev" rel="noopener noreferrer"&gt;GlobalRemote&lt;/a&gt;, a curated job board that shows interview processes and hiring transparency upfront. To keep it relevant, I needed to update it &lt;strong&gt;2x per week&lt;/strong&gt; with fresh jobs from Greenhouse and Ashby boards.&lt;/p&gt;

&lt;p&gt;The problem? The scraper I was using fetched &lt;em&gt;every&lt;/em&gt; job from each company — Sales, HR, Support, everything — and stored it all in my Apify dataset. With 6-8 companies, that's 300-400 jobs per scrape, but only 5-10 were actually relevant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I was burning through my Apify free tier ($5/month, ~2000 dataset operations) on irrelevant data.&lt;/strong&gt; Two scrapes per week would blow past my quota, and I wasn't ready to pay for a higher tier just to store data I'd immediately throw away.&lt;/p&gt;
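&lt;p&gt;The back-of-envelope math, using midpoint figures and ~4.33 weeks per month:&lt;/p&gt;

```javascript
// Why 2x/week blows the free tier, using midpoint figures.
const jobsPerScrape = 350;            // midpoint of 300-400
const scrapesPerMonth = 2 * 4.33;     // twice a week
const opsPerMonth = jobsPerScrape * scrapesPerMonth; // ~3,031 dataset writes
const quota = 2000;                   // approximate free-tier allowance
// opsPerMonth exceeds the quota by ~50%, almost entirely on roles
// (Sales, HR, Support) that never reach the board.
```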

&lt;p&gt;So my options were:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Update infrequently (once every 2-3 weeks) and let the board go stale&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Pay for a higher Apify tier to subsidize wasteful scraping&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Build my own scrapers with department filtering&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I chose #3.&lt;/p&gt;

&lt;p&gt;The scrapers are now &lt;a href="https://apify.com/dalleyne" rel="noopener noreferrer"&gt;live on Apify Store&lt;/a&gt;, open-source, and I'm dogfooding them on GlobalRemote right now.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: I Couldn't Update Frequently Enough
&lt;/h2&gt;

&lt;p&gt;The scraper I was using worked like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Fetch all jobs from a company's job board&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Store everything in an Apify dataset&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;I filter locally for the jobs I actually want&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This makes sense if you want &lt;em&gt;all&lt;/em&gt; the jobs. But for a curated board like GlobalRemote, I only wanted:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Engineering roles (not Sales, Marketing, HR)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;From specific departments (e.g., "Code Wrangling" at Automattic, "Engineering" at GitLab)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Recent postings (not 6-month-old listings)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With 300-400 jobs stored per scrape and only 5-10 relevant, I was wasting my dataset quota. &lt;strong&gt;Two scrapes per week would exceed my free tier limit.&lt;/strong&gt; The choice was: pay for a higher tier or update less frequently. Neither was ideal.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: Per-URL Department Filtering
&lt;/h2&gt;

&lt;p&gt;I built two Apify actors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://apify.com/dalleyne/greenhouse-job-scraper" rel="noopener noreferrer"&gt;&lt;strong&gt;Greenhouse Job Scraper&lt;/strong&gt;&lt;/a&gt; (Automattic, GitLab, Speechify, etc.)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://apify.com/dalleyne/ashby-job-scraper" rel="noopener noreferrer"&gt;&lt;strong&gt;Ashby Job Scraper&lt;/strong&gt;&lt;/a&gt; (Buffer, Zapier, RevenueCat, etc.)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both support &lt;strong&gt;per-URL configuration&lt;/strong&gt;, meaning each company can have different filters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"urls"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://job-boards.greenhouse.io/automatticcareers"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"departments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;307170&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"maxJobs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"daysBack"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://job-boards.greenhouse.io/gitlab"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"departments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4011044002&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"maxJobs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The scraper:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Fetches department metadata&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Filters jobs by department ID &lt;em&gt;before&lt;/em&gt; storing them&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Only stores jobs that match your criteria&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You only pay for the jobs you actually get (not the ones filtered out)&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
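&lt;p&gt;The filter-before-store steps above can be sketched as one pure function. This is an illustration, not the actor's actual code: the field names (&lt;code&gt;departmentIds&lt;/code&gt;, &lt;code&gt;updatedAt&lt;/code&gt;) are assumptions, while the config shape follows the input example earlier.&lt;/p&gt;

```javascript
// Sketch of the filter-before-store step. Jobs that fail the department,
// date, or count checks are dropped BEFORE anything is written to the
// dataset, which is what keeps the billable output small.
function filterJobs(jobs, config) {
  // Cutoff timestamp for the optional daysBack window.
  const cutoff = config.daysBack
    ? Date.now() - config.daysBack * 24 * 60 * 60 * 1000
    : 0;

  return jobs
    // Keep only jobs in the requested departments (or all, if unset).
    .filter((job) => !config.departments ||
      job.departmentIds.some((id) => config.departments.includes(id)))
    // Keep only jobs updated within the daysBack window (or all, if unset).
    .filter((job) => !config.daysBack ||
      new Date(job.updatedAt).getTime() >= cutoff)
    // Cap the number of stored jobs per board.
    .slice(0, config.maxJobs ?? Infinity);
}
```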

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; I went from storing 300-400 jobs per scrape to 30-50 jobs — an 80% reduction in dataset usage.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Built It
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Tech Stack
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Apify platform&lt;/strong&gt; — handles hosting, scheduling, dataset storage&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Greenhouse + Ashby APIs&lt;/strong&gt; — public APIs for job boards&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AI (Claude)&lt;/strong&gt; — for rapid development&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How the APIs Work
&lt;/h3&gt;

&lt;p&gt;Both platforms expose public APIs for their job boards. This meant I could:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Fetch departments/teams programmatically&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Filter by department/team ID before fetching job details&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Only pull full job data for matches&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Skip browser automation and HTML scraping entirely&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is key: I filter &lt;em&gt;before&lt;/em&gt; fetching details, not after. Most scrapers fetch everything and leave you to filter locally; mine filters first and only fetches what matches.&lt;/p&gt;
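&lt;p&gt;For Greenhouse, the two-step flow looks roughly like this. The endpoint paths match Greenhouse's public job-board API as documented at the time of writing (verify against the current docs), and the helper names are mine, not the actor's:&lt;/p&gt;

```javascript
// "Filter first, fetch later" against Greenhouse's public job-board API.
const GH = "https://boards-api.greenhouse.io/v1/boards";

// Step 1: the departments endpoint returns every department with job
// stubs nested under it, so no full postings are fetched yet.
const departmentsUrl = (board) => `${GH}/${board}/departments`;

// Step 3: full posting details are fetched only for matching job IDs.
const jobUrl = (board, jobId) => `${GH}/${board}/jobs/${jobId}`;

// Step 2: pick matching job IDs straight out of the departments payload.
function matchingJobIds(departmentsPayload, wantedDeptIds) {
  return departmentsPayload.departments
    .filter((dept) => wantedDeptIds.includes(dept.id))
    .flatMap((dept) => dept.jobs.map((job) => job.id));
}
```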

&lt;h3&gt;
  
  
  Development Process
&lt;/h3&gt;

&lt;p&gt;I built both scrapers over one weekend using AI (Claude).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Saturday (Jan 31):&lt;/strong&gt; Greenhouse scraper&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Prompt: "Build an Apify actor that scrapes Greenhouse job boards with department filtering"&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;AI figured out the API structure&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;I tested on Automattic and GitLab job boards&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Sunday (Feb 1):&lt;/strong&gt; Ashby scraper&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Prompt: "Build an Apify actor for Ashby job boards with department filtering (similar structure to the existing Greenhouse scraper)"&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;AI figured out Ashby's API&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tested on Buffer, Zapier, RevenueCat&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What AI handled:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Reading API documentation (Greenhouse, Ashby, Apify actor structure)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Writing the scraper logic and Apify boilerplate&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Handling edge cases (null departments, missing dates)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Generating input/output schemas&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
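&lt;p&gt;To give a feel for those edge cases: a normalization step along these lines keeps null departments and missing dates from crashing the pipeline. This is a hedged sketch of the idea, not the generated code; &lt;code&gt;published_at&lt;/code&gt; and the other raw field names are assumptions.&lt;/p&gt;

```javascript
// Defensive normalization for postings with null/missing fields.
function normalizeJob(raw) {
  return {
    id: raw.id,
    title: (raw.title ?? "").trim(),        // missing title -> empty string
    departments: raw.departments ?? [],     // null departments -> empty list
    publishedAt: raw.published_at ?? null,  // keep missing dates explicit
  };
}
```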

&lt;p&gt;&lt;strong&gt;What I did:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Product decisions (per-URL config vs global config)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Testing on real job boards&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Iterating when things didn't work&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Catching issues (e.g., updating the Dockerfile from Node 20 to Node 22)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
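&lt;p&gt;That Node bump is a one-line Dockerfile change. The snippet below is a minimal sketch of a typical Apify actor Dockerfile, assuming Apify's &lt;code&gt;actor-node&lt;/code&gt; base images; check their docs for the exact tag you need:&lt;/p&gt;

```dockerfile
# Before: FROM apify/actor-node:20
FROM apify/actor-node:22

# Install production dependencies, then copy the actor source.
COPY package*.json ./
RUN npm install --omit=dev
COPY . ./

CMD npm start
```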

&lt;p&gt;&lt;strong&gt;I never opened:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://developers.greenhouse.io/job-board.html" rel="noopener noreferrer"&gt;Greenhouse API documentation&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://developers.ashbyhq.com/docs/public-job-posting-api" rel="noopener noreferrer"&gt;Ashby API documentation&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.apify.com/platform/actors" rel="noopener noreferrer"&gt;Apify's actor documentation&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Total development time: One weekend.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AI is a co-pilot, not autopilot. It handled all the research and boilerplate so I could focus on testing and product decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dogfooding on GlobalRemote
&lt;/h2&gt;

&lt;p&gt;I'm using both scrapers to populate &lt;a href="https://jobs.alleyne.dev" rel="noopener noreferrer"&gt;GlobalRemote&lt;/a&gt; right now.&lt;/p&gt;

&lt;p&gt;When I need fresh data, I trigger both scrapers. They return 30-50 relevant jobs instead of 300-400, keeping me well within my Apify free tier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I've learned from dogfooding:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Department filtering reduced dataset usage by ~80%&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;I can now update regularly without exceeding my quota&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the scrapers break, GlobalRemote breaks. That's a strong incentive to keep them working.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Filter before storing, not after
&lt;/h3&gt;

&lt;p&gt;For curated job boards, filtering &lt;em&gt;before&lt;/em&gt; storage is way more cost-effective. The scraper I was using didn't do this.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Per-URL config beats global config
&lt;/h3&gt;

&lt;p&gt;My first version had global department filters (same filter for all companies). That was a mistake. Different companies organize departments differently. Per-URL config gives users way more flexibility.&lt;/p&gt;
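&lt;p&gt;Concretely, per-URL config just means resolving each board's settings with an optional global fallback. A minimal sketch (the field names follow the input example earlier; the defaults here, like &lt;code&gt;100&lt;/code&gt; jobs, are illustrative assumptions):&lt;/p&gt;

```javascript
// Resolve the effective settings for one board URL: per-URL values win,
// then global defaults, then hard-coded fallbacks.
function resolveConfig(urlEntry, globalDefaults = {}) {
  return {
    maxJobs: urlEntry.maxJobs ?? globalDefaults.maxJobs ?? 100,
    daysBack: urlEntry.daysBack ?? globalDefaults.daysBack ?? null,
    departments: urlEntry.departments ?? globalDefaults.departments ?? [],
  };
}
```

&lt;p&gt;With this shape, Automattic can filter to one department with a 7-day window while GitLab uses a different department and no window, all in a single input.&lt;/p&gt;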

&lt;h3&gt;
  
  
  3. Real examples &amp;gt; Fake examples
&lt;/h3&gt;

&lt;p&gt;In my README, I used &lt;em&gt;real&lt;/em&gt; companies (Automattic, GitLab) and &lt;em&gt;real&lt;/em&gt; department IDs (307170 = "Code Wrangling" at Automattic). Fake examples would've been useless for someone trying to replicate this.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. AI accelerates weekend projects into production tools
&lt;/h3&gt;

&lt;p&gt;I shipped two working scrapers in one weekend without reading a single API doc. AI handled research and implementation; I handled product decisions and testing. That's the real power of AI in 2026.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Open-sourcing on Apify was easy
&lt;/h3&gt;

&lt;p&gt;Publishing to Apify Store took ~10 minutes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Add README&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Set pricing&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Add input/output schemas&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Add banking information (they prefer PayPal)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Click "Publish"&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Both scrapers are live and stable. I'll be running them on GlobalRemote twice a week, well within my Apify free tier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Potential improvements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Add automated tests (right now it's just manual verification)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Add salary parsing to Ashby scraper (Greenhouse already extracts salary ranges)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Build a Lever scraper (if there's demand)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;But honestly?&lt;/strong&gt; I built these to solve my own problem. If other people find them useful, great. If not, I'm still updating GlobalRemote 2x/week without blowing my budget.&lt;/p&gt;




&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Greenhouse scraper:&lt;/strong&gt; &lt;a href="https://apify.com/dalleyne/greenhouse-job-scraper" rel="noopener noreferrer"&gt;apify.com/dalleyne/greenhouse-job-scraper&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ashby scraper:&lt;/strong&gt; &lt;a href="https://apify.com/dalleyne/ashby-job-scraper" rel="noopener noreferrer"&gt;apify.com/dalleyne/ashby-job-scraper&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;GlobalRemote:&lt;/strong&gt; &lt;a href="https://jobs.alleyne.dev" rel="noopener noreferrer"&gt;jobs.alleyne.dev&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; Both scrapers are MIT licensed: &lt;a href="https://github.com/d-alleyne/greenhouse-job-scraper" rel="noopener noreferrer"&gt;Greenhouse&lt;/a&gt; | &lt;a href="https://github.com/d-alleyne/ashby-job-scraper" rel="noopener noreferrer"&gt;Ashby&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're building a job board or need ATS data, feel free to use them. And if you have feedback or find bugs, I'm on &lt;a href="https://linkedin.com/in/damienalleyne" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or reachable via Apify.&lt;/p&gt;

</description>
      <category>automation</category>
      <category>showdev</category>
      <category>sideprojects</category>
      <category>webscraping</category>
    </item>
  </channel>
</rss>
