
Amanda
Choosing a model means measuring cost vs quality on your data

I wanted to evaluate model-based extraction in a way that would tell me more than benchmarks alone. The scenario is building an AI recruiting agent to help match candidates to job postings. To do this, we need to ingest job postings from career pages, aggregators, social media posts, and other messy sources. Every posting needs to be parsed into structured JSON: title, company, salary range, requirements, benefits.

I set up a comparison with a small dataset of 25 job postings across three model tiers to answer a practical question: does the quality difference between a more expensive model and a budget model justify the cost over time?

Setup

For this exploration, I used Baseten's Model APIs. You can use whatever model provider you like.

I picked three models across the cost spectrum (priced March 2026):

| Tier | Model | Params (total / active) | ~Input $/1M tokens |
|---|---|---|---|
| Frontier | DeepSeek V3.1 | 671B / 37B | $0.50 |
| Mid-tier | Nvidia Nemotron 3 Super | 120B / 12B | $0.30 |
| Budget | OpenAI GPT-OSS-120B | 117B / 5.1B | $0.10 |

I generated a dataset of 25 job postings with Claude, designed to reflect the kinds of messy variation you see in real job posting data: informal listings, non-English postings, missing fields, hourly rates vs. annual salaries, multiple currencies. In production, this data would come from many sources and the sample would be far larger.

The extraction prompt asks for valid JSON with ten fields: title, company, location, work model, salary min/max/currency, requirements, nice-to-haves, and benefits. Temperature is set to 0. For the purpose of this exploration, the same system prompt was used for the entire evaluation.
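The call itself can be sketched as follows, assuming an OpenAI-compatible chat API. The prompt wording and helper names here are illustrative, not the exact ones from the repo:

```python
# Sketch of the extraction request. FIELDS matches the ten fields
# described above; SYSTEM_PROMPT wording is my own approximation.

FIELDS = [
    "title", "company", "location", "work_model",
    "salary_min", "salary_max", "salary_currency",
    "requirements", "nice_to_have", "benefits",
]

SYSTEM_PROMPT = (
    "Extract the job posting into valid JSON with exactly these fields: "
    + ", ".join(FIELDS)
    + ". Use null for missing values. Return JSON only, no commentary."
)

def build_messages(posting_text: str) -> list[dict]:
    """Assemble the chat messages for one posting (temperature 0 is
    passed separately at call time)."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": posting_text},
    ]
```

The same message list is sent to all three models so the comparison only varies the model, never the prompt.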

For scoring, scalar fields (title, company, location, and so on) are compared after normalization with exact match for strings, partial credit for substring containment, and a 5% tolerance band for numbers. Array fields (requirements, nice-to-haves, benefits) are scored using set overlap with a word-overlap threshold for fuzzy matching, then taking the minimum of recall and precision. The overall accuracy per posting is a weighted average across all fields, with title and requirements weighted highest because those matter most for this recruiting-agent use case.

I purposefully included one reasoning model because a reasoning model "thinks" before answering, and that reasoning output arrives wrapped in think tags. This is something to consider when building your parser.

A reasoning response might look like this:

```
<think>
The posting mentions "$150k - $180k". I should normalize this to annual integers.
The location says "SF Bay Area". Should I interpret this as San Francisco?
The posting mentions "3 days in office". This implies Hybrid, not On-site...
</think>
{"title": "Senior Engineer", "company": "Acme Corp", ...}
```
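A minimal way to handle this in the parser is to strip any think block before attempting `json.loads`. This is a sketch of the idea, not the repo's exact implementation:

```python
import json
import re

# Remove a <think>...</think> block (if present) before parsing.
# re.DOTALL lets the pattern span multiple lines of reasoning.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def parse_model_output(raw: str) -> dict:
    """Parse a model response into a dict, tolerating reasoning output."""
    cleaned = THINK_RE.sub("", raw).strip()
    return json.loads(cleaned)
```

Non-reasoning models pass through unchanged, so the same parser works for all three tiers.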

Reasoning models also affect cost because those thinking tokens count toward output. Nemotron averaged 702 output tokens per call compared to 142 for DeepSeek and 481 for OpenAI.
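The cost impact is easy to see in the billing formula, since thinking tokens are billed at the output rate. The output prices below are placeholders for illustration, not Baseten's actual rates:

```python
def call_cost(tokens_in: int, tokens_out: int,
              price_in_per_m: float, price_out_per_m: float) -> float:
    """Cost of one call: tokens times per-million prices."""
    return (tokens_in * price_in_per_m + tokens_out * price_out_per_m) / 1_000_000
```

At any shared output rate, Nemotron's 702 output tokens per call cost roughly 5x DeepSeek's 142 on the output side alone, which is how a cheaper-per-token model ends up more expensive per posting.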

The results

| Metric | DeepSeek | Nemotron | OpenAI |
|---|---|---|---|
| Avg accuracy | 83.5% | 80.8% | 82.2% |
| JSON valid rate | 25/25 | 25/25 | 25/25 |
| Avg latency | 0.7s | 1.6s | 2.3s |
| Avg cost/posting | $0.0004 | $0.0007 | $0.0003 |
| Est. cost/100K posts | $42.24 | $66.08 | $28.86 |

All three models produce valid JSON 100% of the time. Accuracy is within a 3-point spread. The budget model retains ~98% of frontier quality at ~70% of the cost.

Where the models actually differ by field

The aggregate scores tell only part of the story. Here's the per-field breakdown:

| Field | DeepSeek | Nemotron | OpenAI |
|---|---|---|---|
| title | 88% | 87% | 87% |
| company | 80% | 76% | 80% |
| location | 65% | 69% | 66% |
| work_model | 80% | 76% | 72% |
| salary_min | 82% | 82% | 80% |
| salary_max | 84% | 82% | 80% |
| requirements | 94% | 92% | 93% |
| nice_to_have | 93% | 87% | 93% |
| benefits | 82% | 70% | 83% |

A few things stand out. Location was low for everyone, 65-69% across the board. These postings include things like "SF Bay Area," "remote (US only)," and locations in Portuguese, so that is not surprising. DeepSeek has a slight edge on work-model extraction and nice-to-have extraction.

Nemotron's weakest spot is benefits at 70%. The repo does not establish a single cause for that, but the result is a useful reminder that extra reasoning tokens do not automatically translate into better structured extraction.

Requirements extraction was the highest scoring area for all three models.

Human review

In general, automated scoring is not enough to confidently choose a model for your agent. How much you validate and against which fields will vary by use case. You may want to review all fields in a subset of data, or you may have one field that must be 100% correct and choose to audit that field across everything.

Human review might reveal that your automated scoring weights don't reflect what actually matters for your use case.

In my case, because this was a small exploratory dataset, I reviewed a subset of outputs outside the repo with extra attention on fields that scored lower, especially work_model and location. The repo is meant as a companion for readers to run themselves, not as a checked-in record of my manual review.
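One lightweight way to build that review subset is to rank postings by their score on the fields you care about and review the worst performers first. The result shape and field names here are illustrative:

```python
def review_queue(results: list[dict], fields: list[str], n: int) -> list[str]:
    """Return the IDs of the n postings with the lowest average score
    on the given fields, i.e. the ones most worth a human look."""
    def avg_score(r: dict) -> float:
        return sum(r["scores"][f] for f in fields) / len(fields)
    return [r["id"] for r in sorted(results, key=avg_score)[:n]]
```

Pointing this at `["work_model", "location"]` surfaces the same postings I ended up reviewing by hand.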

A few interesting findings:

When a posting did not name a real company in the main content, such as a recruiter email or something ambiguous like "stealth startup," all three models either left the company unresolved or returned placeholder-like values such as "Stealth Startup." That is probably the right behavior for a strict extraction pipeline, but it might not be the behavior we want.

In a posting with dual currency salary bands, each model handled it differently. One took the first band, one mixed values across both bands, and one returned nothing. This could be handled with a different field design; I only asked for salary min and max, with no room for a dual-currency scenario.
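One field-design alternative is to replace the flat salary_min/salary_max/salary_currency trio with a list of bands, so each currency keeps its own range. This schema is a sketch of the idea, not what the repo uses:

```python
# Hypothetical richer salary schema: one entry per currency band.
SALARY_BANDS_SCHEMA = {
    "salary_bands": [
        {"min": 150_000, "max": 180_000, "currency": "USD", "period": "annual"},
        {"min": 140_000, "max": 165_000, "currency": "EUR", "period": "annual"},
    ]
}

def has_consistent_bands(posting: dict) -> bool:
    """Check that every band carries its own min, max, and currency,
    so mixed-currency postings never collapse into one ambiguous range."""
    bands = posting.get("salary_bands", [])
    return all({"min", "max", "currency"} <= band.keys() for band in bands)
```

The trade-off is a more complex schema for every posting to cover an edge case; whether that is worth it is a product decision.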

In listings with a specific city that did not state remote, in-office, or hybrid, all models tended to set work_model to null. Whether that is acceptable is, again, a product choice a human needs to make.

Cost at scale

At 100K postings per month:

  • Frontier (DeepSeek V3.1): ~$42/month
  • Mid-tier (Nemotron): ~$66/month
  • Budget (GPT-OSS-120B): ~$29/month

The budget model saves you about $13/month over frontier for a 1.3-point accuracy drop (not including adjustments from my human review). Nemotron costs more than both while scoring lower. The thinking tokens make it the worst value for this particular task.

If we scale this to 1M postings, the spread becomes roughly $422 vs $661 vs $289 per month, which makes the cost penalty for the reasoning model much more visible.
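The scale-up is just linear multiplication of the per-posting cost, which makes it easy to project any volume. The per-posting figure below is the unrounded DeepSeek value implied by the results table:

```python
def monthly_cost(cost_per_posting: float, postings_per_month: int) -> float:
    """Project monthly spend from a measured per-posting cost."""
    return cost_per_posting * postings_per_month
```

At ~$0.0004224 per posting, 100K postings is ~$42/month and 1M postings is ~$422/month.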

Making a final choice

For this use case of structured extraction from messy text at volume, I'd go with the budget model. Even with some small inaccuracies or hallucinations, the value from the budget extraction is still good enough for the proposed build.

Now, you may be thinking about the latency of the budget model (2.3s vs 0.7s), which would matter more if this were user-facing and synchronous. In this case, there is no reason the end user needs to trigger extraction and wait on it directly, so batching is a reasonable fit.

I'd skip the reasoning model for this kind of extraction. Nemotron's chain-of-thought was sometimes useful when handling ambiguous formatting, but for structured output the extra reasoning cost was not justified by the measured quality here.

Final thoughts

This exploration is 25 postings. To evaluate for production, you would want a much larger sample, since differences this small could fall within noise.

I also used a single system prompt throughout; prompt changes could shift these results and are worth exploring.

What you evaluate will depend on your final product as well. Your structured extraction problem might have different failure modes and need different scoring weights than mine.

The main takeaway here is that benchmarks won't tell you which model handles your messy data best. Build an evaluation around the data your model actually needs to perform well on and see what comes back.

If you would like to run this analysis yourself, the project is hosted on GitHub. If you have questions or want to chat, please get in touch with me on LinkedIn or X.
