Amanda
Choosing a model means measuring cost vs quality on your data

I wanted to evaluate model-based extraction in a way that would tell me more than benchmarks alone. The scenario is building an AI recruiting agent to help match candidates to job postings. To do this, we need to ingest job postings from career pages, aggregators, social media posts, and other messy sources. Every posting needs to be parsed into structured JSON: title, company, salary range, requirements, benefits.

I set up a comparison with a small dataset of 25 job postings across three model tiers to answer a practical question: does the quality difference between a more expensive model and a budget model justify the cost over time?

All three models perform competitively on standard benchmarks, which is exactly why I couldn't rely on them to make this call.

Setup

For this exploration, I used Baseten's Model APIs. You can use whatever model provider you like.

I picked three models across the cost spectrum, tiered by market positioning (prices as of March 2026):

| Tier | Model | Params (total / active) | ~Input $/1M tokens |
|------|-------|-------------------------|--------------------|
| High | DeepSeek V3.1 | 671B / 37B | $0.50 |
| Mid-tier | Nvidia Nemotron 3 Super | 120B / 12B | $0.30 |
| Budget | OpenAI GPT-OSS-120B | 120B / 5.1B | $0.10 |

I generated a dataset of 25 job postings with Claude, designed to reflect the kinds of messy variation you see in real job posting data: informal listings, non-English postings, missing fields, hourly rates vs. annual, multiple currencies. In production, this type of data would likely come from multiple sources and be much larger.

The extraction prompt asks for valid JSON with ten fields: title, company, location, work model, salary min/max/currency, requirements, nice-to-haves, and benefits. Temperature is set to 0. For the purpose of this exploration, the same system prompt was used for the entire evaluation.
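The exact prompt isn't reproduced here, but a minimal sketch of the schema and system prompt might look like this (the field names come from this evaluation; the prompt wording is my own assumption, not the repo's):

```python
# Hypothetical sketch of the extraction setup described above.
# Field names match the evaluation; the prompt text is an assumption.
FIELDS = [
    "title", "company", "location", "work_model",
    "salary_min", "salary_max", "salary_currency",
    "requirements", "nice_to_have", "benefits",
]

SYSTEM_PROMPT = (
    "You are a strict extraction engine. Given a raw job posting, "
    "return ONLY valid JSON with exactly these keys: "
    + ", ".join(FIELDS)
    + ". Use null for missing scalar fields and [] for missing lists. "
    "Normalize salaries to annual integers."
)
```

The same system prompt string would then be sent with every request at temperature 0.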

For scoring, scalar fields (title, company, location, etc.) are compared after normalization, with exact match for strings and a 5% tolerance band for numbers. Array fields (requirements, nice-to-haves, benefits) are scored with an F1 score over exact normalized item matches. The overall accuracy per posting is a weighted average across all fields, with title and requirements weighted highest because those matter most for this recruiting agent use case.
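A minimal sketch of that scoring scheme might look like the following (my own illustration of the rules described above, not the repo's implementation):

```python
# Sketch of the scoring rules: normalized exact match for strings,
# a 5% tolerance band for numbers, and F1 over normalized items for arrays.
def norm(s):
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(str(s).lower().split())

def score_scalar(expected, got, tol=0.05):
    """Exact match for strings, 5% tolerance band for numbers."""
    if expected is None or got is None:
        return 1.0 if expected == got else 0.0
    if isinstance(expected, (int, float)) and isinstance(got, (int, float)):
        return 1.0 if abs(got - expected) <= tol * abs(expected) else 0.0
    return 1.0 if norm(expected) == norm(got) else 0.0

def score_array(expected, got):
    """F1 score over exact normalized item matches."""
    exp, g = {norm(x) for x in expected}, {norm(x) for x in got}
    if not exp and not g:
        return 1.0
    tp = len(exp & g)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(g), tp / len(exp)
    return 2 * precision * recall / (precision + recall)
```

The per-posting score is then a weighted average of these field scores.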

This is a deliberately strict metric. For example, a model returning "senior engineer" instead of "senior software engineer" would not get full credit under this scorer, even if a recruiter or downstream system might treat those as the same role family. That strictness is a choice, though: exact-match extraction accuracy is not the same thing as business usefulness.

I included one reasoning model, Nemotron. When you send a prompt to a reasoning model, it "thinks" first, and that thinking output is wrapped in think tags, which is something to account for when building your parser. DeepSeek V3.1 is technically a hybrid model that supports both thinking and non-thinking modes; I didn't specify one, so it ran in its default mode (non-thinking).

Example reasoning output might look like this:

```
<think>
The posting mentions "$150k - $180k" — I should normalize this to annual integers.
The location says "SF Bay Area" — should I interpret this as San Francisco?
The posting mentions "3 days in office" — this implies Hybrid, not On-site...
</think>
{"title": "Senior Engineer", "company": "Acme Corp", ...}
```

Reasoning models also affect cost because those thinking tokens count toward output. Across three runs, Nemotron averaged roughly 735 output tokens per call compared to 141 for DeepSeek and 481 for OpenAI, which is a big part of why it ended up as the most expensive option in this comparison.
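A minimal parser that handles both kinds of output, assuming at most one think block appears before the JSON payload, might look like:

```python
import json
import re

def parse_model_output(raw: str) -> dict:
    """Strip an optional <think>...</think> block, then parse the JSON payload.
    Assumes the model emits at most one think block, before the JSON."""
    cleaned = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
    return json.loads(cleaned)
```

Non-reasoning outputs pass through unchanged, so the same parser works for all three models.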

The results

I ran the comparison three times on the same dataset to account for variation in model runs. One clear pattern across runs is that DeepSeek was always first while GPT-OSS-120B and Nemotron were close with no clear winner for second place.

| Metric | DeepSeek | Nemotron | OpenAI |
|--------|----------|----------|--------|
| Mean accuracy across 3 runs | 74.0% | 69.6% | 70.4% |
| Accuracy range across runs | 74.0-74.0% | 68.5-70.6% | 70.1-70.6% |
| JSON valid rate | 25/25 every run | 25/25 every run | 25/25 every run |
| Avg latency | 0.70s | 1.63s | 2.13s |
| Avg cost/posting | $0.00042 | $0.00068 | $0.00029 |
| Est. cost/100K posts | $42.00 | $68.33 | $28.67 |

All three models produce valid JSON 100% of the time.

Where the models actually differ by field

The aggregate scores tell a partial story. Here's the per-field breakdown averaged across all three runs:

| Field | DeepSeek | Nemotron | OpenAI |
|-------|----------|----------|--------|
| title | 88.0% | 84.0% | 81.3% |
| company | 80.0% | 76.0% | 80.0% |
| location | 44.0% | 32.0% | 38.7% |
| work_model | 77.3% | 74.7% | 72.0% |
| salary_min | 82.0% | 80.7% | 80.7% |
| salary_max | 84.0% | 80.7% | 80.7% |
| salary_currency | 84.0% | 80.0% | 84.0% |
| requirements | 54.6% | 53.0% | 50.1% |
| nice_to_have | 74.0% | 67.3% | 71.6% |
| benefits | 80.3% | 75.0% | 75.1% |

A few things stand out. Location was low for everyone, 32-44% across the board. These postings include things like "SF Bay Area," "remote (US only)," and locations in Portuguese, so that is not surprising. DeepSeek has a slight edge across many categories.

Nemotron's weakest spots were location and requirements. While I don't know the single cause for this, it's a useful reminder that extra reasoning tokens do not automatically translate into better structured extraction.

Requirements and location were difficult for all the models.

Human review

In general, automated scoring is not enough to confidently choose a model for your agent. How much you validate and against which fields will vary by use case. You may want to review all fields in a subset of data, or you may have one field that must be 100% correct and choose to audit that field across everything.

Human review might reveal that your automated scoring weights don't reflect what actually matters for your use case.

In my case, because this was a small exploratory dataset, I reviewed a subset of outputs outside the repo with extra attention on fields that scored lower, especially work_model and location. The repo is meant as a companion for readers to run themselves, not as a checked-in record of my manual review.

A few interesting findings:

When a posting did not name a real company in the main content, such as a recruiter email or something ambiguous like "stealth startup," all three models either left the company unresolved or returned placeholder-like values such as "Stealth Startup." That is probably the right behavior for a strict extraction pipeline, but it might not be the behavior we want.

In a posting with dual currency salary bands, each model handled it differently. One took the first band, one mixed values across both bands, and one returned nothing. This could potentially be handled with different field design as I was only looking for salary min and max with no flexibility for the dual currency scenario.

In listings that named a specific city but did not state remote, in-office, or hybrid, all models tended to set work_model to null. This is another case where whether that behavior is acceptable is a product choice a human needs to make.
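For the dual-currency case, one possible redesign is to extract a list of salary bands instead of a single flat min/max pair. This is a hypothetical alternative schema, not what this evaluation used:

```python
# Hypothetical alternative field design: a list of salary bands instead of
# flat salary_min / salary_max / salary_currency fields.
posting = {
    "title": "Senior Engineer",
    "salary_bands": [
        {"min": 150_000, "max": 180_000, "currency": "USD"},
        {"min": 140_000, "max": 165_000, "currency": "EUR"},
    ],
}
```

A downstream consumer could then pick the band matching the candidate's locale, rather than the model being forced to choose (or mix) bands at extraction time.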

Cost at scale

At 100K postings per month:

  • High (DeepSeek V3.1): ~$42/month
  • Mid-tier (Nemotron): ~$69/month
  • Budget (GPT-OSS-120B): ~$29/month

The budget model saves you $13/month over the higher-cost model for roughly a 3.6-point accuracy drop. Nemotron costs more than both while scoring lower on average; the thinking tokens make it the worst value for this particular task.

If we scale this to 1M postings, the spread becomes roughly $420 vs $683 vs $287 per month, which makes the cost penalty for the reasoning model more visible.
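Projections like these are simple enough to script, which also makes unit errors easier to catch. A sketch using the per-posting costs measured above (those figures are rounded, so totals can differ slightly from the table's):

```python
# Project monthly spend from measured per-posting cost.
# Per-posting figures come from the runs above (rounded to 5 decimal places).
COST_PER_POSTING = {
    "DeepSeek V3.1": 0.00042,
    "Nemotron 3 Super": 0.00068,
    "GPT-OSS-120B": 0.00029,
}

def monthly_cost(model: str, postings_per_month: int) -> float:
    """Estimated monthly spend in dollars for a given posting volume."""
    return COST_PER_POSTING[model] * postings_per_month

for model in COST_PER_POSTING:
    print(f"{model}: ${monthly_cost(model, 1_000_000):,.2f}/month at 1M postings")
```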

What's next

Ultimately this dataset is too small to support a firm conclusion, but it does offer some interesting areas to explore further against real data.

For this use case of structured extraction from messy text at volume, the numbers make the OpenAI budget model worth exploring further. Across three runs it stayed fairly close to DeepSeek on aggregate score while remaining the cheapest option, which is still an attractive tradeoff.

Now, you may be wondering about the budget model's latency, since it was consistently the slowest. That would matter more if this were user-facing and synchronous. In this case, there is no reason the end user needs to trigger extraction and wait on it directly, so batching is a reasonable fit.

I would be more cautious about claiming OpenAI is definitively better than Nemotron on quality alone. Under this strict scorer, those two were close enough that second place flipped once across the three runs. Ultimately, I would still skip the reasoning model for this kind of extraction. Nemotron's thinking was sometimes useful when extracting from ambiguous formatting, but for structured output the extra cost was not justified by the measured quality here.

Final thoughts

I was definitely surprised by these findings and expected a more definitive "winner".

Now, this exploration covers 25 postings, which is a small sample size. Given the strict scoring method and a dataset this small, each missed field swings the accuracy score more than it would in a larger dataset. Generated data also misses things you only encounter when pulling this kind of data from real-world sources, such as unexpected artifacts. With more data, more runs, and a more rigorous human-in-the-loop step, we would likely see something different.

I also used the same system prompt for all models and test runs. Prompt variations could impact results and are worth exploring.

What you evaluate will depend on your final product as well. Your structured extraction problem might have different failure modes and need different scoring weights than mine.

The main takeaway here is that benchmarks alone won't tell you which model handles your messy data best. Build an evaluation against the data your model actually needs to perform well on and see what comes back.

If you would like to run this analysis yourself, the project is hosted on GitHub. If you have questions or want to chat, please get in touch with me on LinkedIn or X.

Top comments (2)

Max Quimby

Really solid methodology here, especially the strict scoring approach. The dual-currency salary edge case is a great example of why synthetic benchmarks miss real-world messiness.

We ran into a similar cost-vs-quality tradeoff building an AI tutoring platform. Our initial instinct was to use Claude Opus for everything (best quality), but when we actually measured per-user costs, LLM calls were 97% of total COGS — infrastructure was noise. Switching the student-facing tutor to Gemini Flash while keeping Opus for the strategist/planning layer cut costs by ~40% with negligible quality impact on the user-facing side.

Two things I'd add from that experience:

  1. Task-specific routing matters more than model choice. The best model for extraction might be terrible for summarization. We ended up with 3 different models handling different tasks in the same pipeline.

  2. Always script your cost math. We and another team independently validated a cost estimate that turned out to have a 1000x unit error. Both teams missed it because the numbers "felt right." Now we have a Python script that's the single source of truth.

The Nemotron reasoning token overhead is a great catch — roughly 5x the output tokens for roughly the same accuracy is a real trap for teams that default to "smarter = better."

Would love to see this expanded to 500+ postings. Bet the accuracy gaps either widen or close in interesting ways at scale.

Amanda

Thank you so much for reading and the thoughtful comment! This was definitely a learning experience and exploration for me so I really appreciate the validation.

I'm convinced all developers building agentic products will need to start really understanding how to do these comparisons that used to feel like something only a researcher would do. It's a new world for all of us for sure.

Agreed on wanting a larger-scale test! Small datasets cannot be trusted in real life, and I will absolutely add your points into my next project.

Crazy about the cost estimate finding... it's giving Office Space "fractions of a penny error"