I wanted to evaluate model-based extraction in a way that would tell me more than benchmarks alone. The scenario is building an AI recruiting agent to help match candidates to job postings. To do this, we need to ingest job postings from career pages, aggregators, social media posts, and other messy sources. Every posting needs to be parsed into structured JSON: title, company, salary range, requirements, benefits.
I set up a comparison with a small dataset of 25 job postings across three model tiers to answer a practical question: does the quality difference between a more expensive model and a budget model justify the cost over time?
Setup
For this exploration, I used Baseten's Model APIs. You can use whatever model provider you like.
I picked three models across the cost spectrum (priced March 2026):
| Tier | Model | Active Params | ~Input $/1M tokens |
|---|---|---|---|
| Frontier | DeepSeek V3.1 | 671B / 37B active | $0.50 |
| Mid-tier | Nvidia Nemotron 3 Super | 120B / 12B active | $0.30 |
| Budget | OpenAI GPT-OSS-120B | 117B / 5.1B active | $0.10 |
I generated a dataset of 25 job postings with Claude, designed to reflect the kinds of messy variation you see in real job posting data: informal listings, non-English postings, missing fields, hourly rates vs. annual salaries, and multiple currencies. In production, this data would likely come from multiple sources and be much larger.
The extraction prompt asks for valid JSON with ten fields: title, company, location, work model, salary min/max/currency, requirements, nice-to-haves, and benefits. Temperature is set to 0. For the purpose of this exploration, the same system prompt was used for the entire evaluation.
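The exact prompt isn't reproduced here, but a sketch of the schema and a response validator might look like this. The field names match the list above; the prompt wording and the validator itself are illustrative assumptions, not the exact code from the experiment:

```python
# Hypothetical sketch: the ten-field schema and a validator run on each
# model response before scoring. Prompt wording is an assumption.
import json

FIELDS = [
    "title", "company", "location", "work_model",
    "salary_min", "salary_max", "salary_currency",
    "requirements", "nice_to_have", "benefits",
]
ARRAY_FIELDS = {"requirements", "nice_to_have", "benefits"}

SYSTEM_PROMPT = (
    "You are a job-posting extraction engine. Return ONLY valid JSON with "
    "exactly these fields (use null when a value is absent): "
    + ", ".join(FIELDS) + ". "
    "salary_min/salary_max are annual integers in salary_currency; "
    "work_model is one of Remote, Hybrid, On-site, or null; "
    "requirements, nice_to_have, and benefits are arrays of short strings."
)

def validate(raw: str) -> dict:
    """Parse a model response and check the schema before scoring it."""
    data = json.loads(raw)
    missing = [f for f in FIELDS if f not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    for f in ARRAY_FIELDS:
        if data[f] is not None and not isinstance(data[f], list):
            raise ValueError(f"{f} must be a JSON array or null")
    return data
```

Failing fast on schema violations here keeps the scoring step simple: by the time a response is scored, every field is guaranteed to exist.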
For scoring, scalar fields (title, company, location, and so on) are compared after normalization with exact match for strings, partial credit for substring containment, and a 5% tolerance band for numbers. Array fields (requirements, nice-to-haves, benefits) are scored using set overlap with a word-overlap threshold for fuzzy matching, then taking the minimum of recall and precision. The overall accuracy per posting is a weighted average across all fields, with title and requirements weighted highest because those matter most for this recruiting-agent use case.
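A minimal sketch of that scoring logic, assuming illustrative normalization rules, a 0.5 partial-credit value for substring containment, a 0.6 word-overlap threshold, and example weights (the exact constants in the repo may differ):

```python
# Sketch of the scoring described above. Partial-credit values, the fuzzy
# threshold, and the weights are illustrative assumptions.

def norm(s) -> str:
    return " ".join(str(s).lower().split())

def score_scalar(pred, gold) -> float:
    if pred is None or gold is None:
        return 1.0 if pred == gold else 0.0
    if isinstance(gold, (int, float)):            # 5% tolerance band for numbers
        try:
            return 1.0 if abs(float(pred) - gold) <= 0.05 * abs(gold) else 0.0
        except (TypeError, ValueError):
            return 0.0
    p, g = norm(pred), norm(gold)
    if p == g:
        return 1.0
    if p in g or g in p:                          # partial credit for containment
        return 0.5
    return 0.0

def fuzzy_eq(a: str, b: str, threshold: float = 0.6) -> bool:
    """Word-overlap ratio used to match array items loosely."""
    wa, wb = set(norm(a).split()), set(norm(b).split())
    return len(wa & wb) / max(len(wa | wb), 1) >= threshold

def score_array(pred, gold) -> float:
    pred, gold = pred or [], gold or []
    if not pred and not gold:
        return 1.0
    matched = sum(any(fuzzy_eq(p, g) for g in gold) for p in pred)
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(gold) if gold else 0.0
    return min(precision, recall)                 # penalize misses and padding alike

# title and requirements weighted highest for the recruiting-agent use case
WEIGHTS = {"title": 2.0, "requirements": 2.0}     # every other field defaults to 1.0
ARRAYS = {"requirements", "nice_to_have", "benefits"}

def score_posting(pred: dict, gold: dict) -> float:
    total = weight_sum = 0.0
    for field, gold_val in gold.items():
        w = WEIGHTS.get(field, 1.0)
        fn = score_array if field in ARRAYS else score_scalar
        total += w * fn(pred.get(field), gold_val)
        weight_sum += w
    return total / weight_sum
```

Taking the minimum of precision and recall means a model gets no credit for padding an array with extra guesses, and no credit for returning a single safe item out of five.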
I purposely included one reasoning model because reasoning models "think" before answering, and that output arrives wrapped in think tags. This is something to account for when building your parser.
A reasoning response might look like this:

```
<think>
The posting mentions "$150k - $180k" — I should normalize this to annual integers.
The location says "SF Bay Area" — should I interpret this as San Francisco?
The posting mentions "3 days in office" — this implies Hybrid, not On-site...
</think>
{"title": "Senior Engineer", "company": "Acme Corp", ...}
```
Reasoning models also affect cost because those thinking tokens count toward output. Nemotron averaged 702 output tokens per call compared to 142 for DeepSeek and 481 for OpenAI.
The results
| Metric | DeepSeek | Nemotron | OpenAI |
|---|---|---|---|
| Avg Accuracy | 83.5% | 80.8% | 82.2% |
| JSON Valid Rate | 25/25 | 25/25 | 25/25 |
| Avg Latency | 0.7s | 1.6s | 2.3s |
| Avg Cost/Posting | $0.0004 | $0.0007 | $0.0003 |
| Est. Cost/100K Posts | $42.24 | $66.08 | $28.86 |
All three models produce valid JSON 100% of the time. Accuracy is within a 3-point spread. The budget model retains ~98% of frontier quality at ~70% of the cost.
Where the models actually differ by field
The aggregate scores tell only part of the story. Here's the per-field breakdown:
| Field | DeepSeek | Nemotron | OpenAI |
|---|---|---|---|
| title | 88% | 87% | 87% |
| company | 80% | 76% | 80% |
| location | 65% | 69% | 66% |
| work_model | 80% | 76% | 72% |
| salary_min | 82% | 82% | 80% |
| salary_max | 84% | 82% | 80% |
| requirements | 94% | 92% | 93% |
| nice_to_have | 93% | 87% | 93% |
| benefits | 82% | 70% | 83% |
A few things stand out. Location was low for everyone, 65-69% across the board. These postings include things like "SF Bay Area," "remote (US only)," and locations in Portuguese, so that is not surprising. DeepSeek has a slight edge on work-model extraction and nice-to-have extraction.
Nemotron's weakest spot is benefits at 70%. The repo does not establish a single cause for that, but the result is a useful reminder that extra reasoning tokens do not automatically translate into better structured extraction.
Requirements extraction was the highest scoring area for all three models.
Human review
In general, automated scoring is not enough to confidently choose a model for your agent. How much you validate and against which fields will vary by use case. You may want to review all fields in a subset of data, or you may have one field that must be 100% correct and choose to audit that field across everything.
Human review might reveal that your automated scoring weights don't reflect what actually matters for your use case.
In my case, because this was a small exploratory dataset, I reviewed a subset of outputs outside the repo with extra attention on fields that scored lower, especially work_model and location. The repo is meant as a companion for readers to run themselves, not as a checked-in record of my manual review.
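One lightweight way to pick that review subset is to sort results by their worst score on the fields you care about. A sketch, assuming a hypothetical `results` row shape with per-field scores (not the repo's actual data structure):

```python
# Queue the postings that scored worst on the fields under audit.
# The row shape {"id": ..., "field_scores": {...}} is an assumption.

def review_queue(results, fields=("work_model", "location"), k=5):
    def worst_score(row):
        return min(row["field_scores"].get(f, 1.0) for f in fields)
    return sorted(results, key=worst_score)[:k]
```

Stratifying the queue this way spends reviewer time on the fields the automated scorer already flagged as weak, rather than on a uniform random sample.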
A few interesting findings:
When a posting did not name a real company in the main content, such as a recruiter email or something ambiguous like "stealth startup," all three models either left the company unresolved or returned placeholder-like values such as "Stealth Startup." That is probably the right behavior for a strict extraction pipeline, but it might not be the behavior we want.
In a posting with dual-currency salary bands, each model handled it differently: one took the first band, one mixed values across both bands, and one returned nothing. A different field design could handle this; I was only asking for a salary min and max, with no room for the dual-currency case.
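One hypothetical field design that would accommodate this: model salary as a list of bands rather than a single min/max pair. This is my own sketch, not a schema from the repo:

```python
# A salary schema that tolerates multiple bands, so dual-currency postings
# don't force the model to pick one band or mix values across them.
from dataclasses import dataclass

@dataclass
class SalaryBand:
    min_amount: int
    max_amount: int
    currency: str          # ISO 4217 code, e.g. "USD"
    period: str = "year"   # "year" or "hour", covering hourly-rate postings

# A posting quoting both a USD and a EUR band becomes two entries:
bands = [
    SalaryBand(150_000, 180_000, "USD"),
    SalaryBand(140_000, 165_000, "EUR"),
]
```

The `period` field would also absorb the hourly-vs-annual variation mentioned in the dataset description, instead of forcing everything into annual integers.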
In listings that named a specific city but did not state remote, in-office, or hybrid, all models tended to set work_model to null. This is another case where whether that behavior is acceptable is a product choice a human needs to make.
Cost at scale
At 100K postings per month:
- Frontier (DeepSeek V3.1): ~$42/month
- Mid-tier (Nemotron): ~$66/month
- Budget (GPT-OSS-120B): ~$29/month
The budget model saves you about $13/month over frontier for a 1.3-point accuracy drop (not including adjustments from my human review). Nemotron costs more than both while scoring lower; the thinking tokens make it the worst value for this particular task.
If we scale this to 1M postings, the spread becomes roughly $422 vs $661 vs $289 per month, which makes the cost penalty for the reasoning model much more visible.
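The scale-up is simple linear arithmetic from the per-100K figures in the results table:

```python
# Scaling the measured Est. Cost/100K Posts figures linearly to 1M postings.
# Dollar figures come straight from the results table above.
cost_per_100k = {"DeepSeek V3.1": 42.24, "Nemotron": 66.08, "GPT-OSS-120B": 28.86}

def monthly_cost(model: str, postings_per_month: int) -> float:
    return cost_per_100k[model] / 100_000 * postings_per_month

for m in cost_per_100k:
    print(m, round(monthly_cost(m, 1_000_000)))
```

Linear scaling is an assumption, of course: at 1M postings per month you'd also start weighing batch pricing, rate limits, and dedicated deployments.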
Making a final choice
For this use case of structured extraction from messy text at volume, I'd go with the budget model. Even with occasional inaccuracies or hallucinations, the budget model's extraction quality is good enough for the proposed build.
Now, you may be thinking about the latency of the budget model (2.3s vs 0.7s), which would matter more if this were user-facing and synchronous. In this case, there is no reason the end user needs to trigger extraction and wait on it directly, so batching is a reasonable fit.
I'd skip the reasoning model for this kind of extraction. Nemotron's chain-of-thought was sometimes useful when handling ambiguous formatting, but for structured output the extra reasoning cost was not justified by the measured quality here.
Final thoughts
This exploration is 25 postings. To evaluate for production, you'd want a larger sample, since differences this small could fall within noise.
I also used a single system prompt throughout; prompt changes could shift these results and are worth exploring.
What you evaluate will depend on your final product as well. Your structured extraction problem might have different failure modes and need different scoring weights than mine.
The main takeaway is that benchmarks won't tell you which model handles your messy data best. Build an evaluation around the data your model actually needs to handle well and see what comes back.
If you would like to run this analysis yourself, the project is hosted on Github. If you have questions or want to chat, please get in touch with me on LinkedIn or X.