Amanda
Choosing a model means measuring cost vs quality on your data

I wanted to evaluate model-based extraction in a way that would tell me more than benchmarks alone. The scenario is building an AI recruiting agent to help match candidates to job postings. To do this, we need to ingest job postings from career pages, aggregators, social media posts, and other messy sources. Every posting needs to be parsed into structured JSON: title, company, salary range, requirements, benefits.

I set up a comparison with a small dataset of 25 job postings across three model tiers to answer a practical question: does the quality difference between a more expensive model and a budget model justify the cost over time?

All three models perform competitively on standard benchmarks, which is exactly why I couldn't rely on them to make this call.

Setup

For this exploration, I used Baseten's Model APIs. You can use whatever model provider you like.

I picked three models across the cost spectrum, tiered by market positioning (prices as of March 2026):

| Tier | Model | Params (total / active) | ~Input $/1M tokens |
|------|-------|-------------------------|--------------------|
| High | DeepSeek V3.1 | 671B / 37B | $0.50 |
| Mid-tier | Nvidia Nemotron 3 Super | 120B / 12B | $0.30 |
| Budget | OpenAI GPT-OSS-120B | 120B / 5.1B | $0.10 |

I generated a dataset of 25 job postings with Claude, designed to reflect the kinds of messy variation you see in real job posting data: informal listings, non-English postings, missing fields, hourly rates vs. annual, multiple currencies. In production, this type of data would likely come from multiple sources and be much larger.

The extraction prompt asks for valid JSON with ten fields: title, company, location, work model, salary min/max/currency, requirements, nice-to-haves, and benefits. Temperature is set to 0. For the purpose of this exploration, the same system prompt was used for the entire evaluation.
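The exact prompt isn't reproduced here, but a minimal sketch of the schema and system prompt might look like this (the field names come from this evaluation; the prompt wording is my own assumption, not the repo's):

```python
# Hypothetical sketch of the extraction setup described above.
# Field names match the evaluation; the prompt text is an assumption.
FIELDS = [
    "title", "company", "location", "work_model",
    "salary_min", "salary_max", "salary_currency",
    "requirements", "nice_to_have", "benefits",
]

SYSTEM_PROMPT = (
    "You are a strict extraction engine. Given a raw job posting, "
    "return ONLY valid JSON with exactly these keys: "
    + ", ".join(FIELDS)
    + ". Use null for missing scalar fields and [] for missing lists. "
    "Normalize salaries to annual integers."
)
```

The same system prompt string would then be sent with every request at temperature 0.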

For scoring, scalar fields (title, company, location, etc.) are compared after normalization, with exact match for strings and a 5% tolerance band for numbers. Array fields (requirements, nice-to-haves, benefits) are scored with an F1 score over exact normalized item matches. The overall accuracy per posting is a weighted average across all fields, with title and requirements weighted highest because those matter most for this recruiting agent use case.
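A minimal sketch of that scoring scheme might look like the following (my own illustration of the rules described above, not the repo's implementation):

```python
# Sketch of the scoring rules: normalized exact match for strings,
# a 5% tolerance band for numbers, and F1 over normalized items for arrays.
def norm(s):
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(str(s).lower().split())

def score_scalar(expected, got, tol=0.05):
    """Exact match for strings, 5% tolerance band for numbers."""
    if expected is None or got is None:
        return 1.0 if expected == got else 0.0
    if isinstance(expected, (int, float)) and isinstance(got, (int, float)):
        return 1.0 if abs(got - expected) <= tol * abs(expected) else 0.0
    return 1.0 if norm(expected) == norm(got) else 0.0

def score_array(expected, got):
    """F1 score over exact normalized item matches."""
    exp, g = {norm(x) for x in expected}, {norm(x) for x in got}
    if not exp and not g:
        return 1.0
    tp = len(exp & g)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(g), tp / len(exp)
    return 2 * precision * recall / (precision + recall)
```

The per-posting score is then a weighted average of these field scores.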

This is a deliberately strict metric. For example, a model returning "senior engineer" instead of "senior software engineer" would not get full credit under this scorer, even if a recruiter or downstream system might treat those as the same role family. That strictness is a choice, though: exact-match extraction accuracy is not the same thing as business usefulness.

I included one reasoning model, Nemotron. When you send a prompt to a reasoning model, it "thinks" first, and that thinking output is wrapped in think tags, which is something to account for when building your parser. DeepSeek V3.1 is technically a hybrid model that supports both thinking and non-thinking modes; I didn't specify one, so it ran in its default mode (non-thinking).

Example reasoning output might look like this:

```
<think>
The posting mentions "$150k - $180k" — I should normalize this to annual integers.
The location says "SF Bay Area" — should I interpret this as San Francisco?
The posting mentions "3 days in office" — this implies Hybrid, not On-site...
</think>
{"title": "Senior Engineer", "company": "Acme Corp", ...}
```

Reasoning models also affect cost because those thinking tokens count toward output. Across three runs, Nemotron averaged roughly 735 output tokens per call compared to 141 for DeepSeek and 481 for OpenAI, which is a big part of why it ended up as the most expensive option in this comparison.
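A minimal parser that handles both kinds of output, assuming at most one think block appears before the JSON payload, might look like:

```python
import json
import re

def parse_model_output(raw: str) -> dict:
    """Strip an optional <think>...</think> block, then parse the JSON payload.
    Assumes the model emits at most one think block, before the JSON."""
    cleaned = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
    return json.loads(cleaned)
```

Non-reasoning outputs pass through unchanged, so the same parser works for all three models.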

The results

I ran the comparison three times on the same dataset to account for variation in model runs. One clear pattern across runs is that DeepSeek was always first while GPT-OSS-120B and Nemotron were close with no clear winner for second place.

| Metric | DeepSeek | Nemotron | OpenAI |
|--------|----------|----------|--------|
| Mean accuracy across 3 runs | 74.0% | 69.6% | 70.4% |
| Accuracy range across runs | 74.0-74.0% | 68.5-70.6% | 70.1-70.6% |
| JSON valid rate | 25/25 every run | 25/25 every run | 25/25 every run |
| Avg latency | 0.70s | 1.63s | 2.13s |
| Avg cost/posting | $0.00042 | $0.00068 | $0.00029 |
| Est. cost/100K posts | $42.00 | $68.33 | $28.67 |

All three models produce valid JSON 100% of the time.

Where the models actually differ by field

The aggregate scores tell a partial story. Here's the per-field breakdown averaged across all three runs:

| Field | DeepSeek | Nemotron | OpenAI |
|-------|----------|----------|--------|
| title | 88.0% | 84.0% | 81.3% |
| company | 80.0% | 76.0% | 80.0% |
| location | 44.0% | 32.0% | 38.7% |
| work_model | 77.3% | 74.7% | 72.0% |
| salary_min | 82.0% | 80.7% | 80.7% |
| salary_max | 84.0% | 80.7% | 80.7% |
| salary_currency | 84.0% | 80.0% | 84.0% |
| requirements | 54.6% | 53.0% | 50.1% |
| nice_to_have | 74.0% | 67.3% | 71.6% |
| benefits | 80.3% | 75.0% | 75.1% |

A few things stand out. Location was low for everyone, 32-44% across the board. These postings include things like "SF Bay Area," "remote (US only)," and locations in Portuguese, so that is not surprising. DeepSeek has a slight edge across many categories.

Nemotron's weakest spots were location and requirements. While I don't know the single cause for this, it's a useful reminder that extra reasoning tokens do not automatically translate into better structured extraction.

Requirements and location were difficult for all the models.

Human review

In general, automated scoring is not enough to confidently choose a model for your agent. How much you validate and against which fields will vary by use case. You may want to review all fields in a subset of data, or you may have one field that must be 100% correct and choose to audit that field across everything.

Human review might reveal that your automated scoring weights don't reflect what actually matters for your use case.

In my case, because this was a small exploratory dataset, I reviewed a subset of outputs outside the repo with extra attention on fields that scored lower, especially work_model and location. The repo is meant as a companion for readers to run themselves, not as a checked-in record of my manual review.

A few interesting findings:

When a posting did not name a real company in the main content, such as a recruiter email or something ambiguous like "stealth startup," all three models either left the company unresolved or returned placeholder-like values such as "Stealth Startup." That is probably the right behavior for a strict extraction pipeline, but it might not be the behavior we want.

In a posting with dual currency salary bands, each model handled it differently. One took the first band, one mixed values across both bands, and one returned nothing. This could potentially be handled with different field design as I was only looking for salary min and max with no flexibility for the dual currency scenario.

In listings that named a specific city but did not state remote, in-office, or hybrid, all models tended to set work_model to null. This is another case where whether that behavior is acceptable is a product choice a human needs to make.
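For the dual-currency case, one possible redesign is to extract a list of salary bands instead of a single flat min/max pair. This is a hypothetical alternative schema, not what this evaluation used:

```python
# Hypothetical alternative field design: a list of salary bands instead of
# flat salary_min / salary_max / salary_currency fields.
posting = {
    "title": "Senior Engineer",
    "salary_bands": [
        {"min": 150_000, "max": 180_000, "currency": "USD"},
        {"min": 140_000, "max": 165_000, "currency": "EUR"},
    ],
}
```

A downstream consumer could then pick the band matching the candidate's locale, rather than the model being forced to choose (or mix) bands at extraction time.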

Cost at scale

At 100K postings per month:

  • High (DeepSeek V3.1): ~$42/month
  • Mid-tier (Nemotron): ~$69/month
  • Budget (GPT-OSS-120B): ~$29/month

The budget model saves you $13/month over the higher-cost model for roughly a 3.6-point accuracy drop. Nemotron costs more than both while scoring lower on average; the thinking tokens make it the worst value for this particular task.

If we scale this to 1M postings, the spread becomes roughly $420 vs $683 vs $287 per month, which makes the cost penalty for the reasoning model more visible.
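Projections like these are simple enough to script, which also makes unit errors easier to catch. A sketch using the per-posting costs measured above (those figures are rounded, so totals can differ slightly from the table's):

```python
# Project monthly spend from measured per-posting cost.
# Per-posting figures come from the runs above (rounded to 5 decimal places).
COST_PER_POSTING = {
    "DeepSeek V3.1": 0.00042,
    "Nemotron 3 Super": 0.00068,
    "GPT-OSS-120B": 0.00029,
}

def monthly_cost(model: str, postings_per_month: int) -> float:
    """Estimated monthly spend in dollars for a given posting volume."""
    return COST_PER_POSTING[model] * postings_per_month

for model in COST_PER_POSTING:
    print(f"{model}: ${monthly_cost(model, 1_000_000):,.2f}/month at 1M postings")
```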

What's next

Ultimately this dataset is too small to support a firm conclusion, but it does offer some interesting areas to explore further against real data.

For this use case of structured extraction from messy text at volume, the numbers make the OpenAI budget model worth exploring further. Across three runs it stayed fairly close to DeepSeek on aggregate score while remaining the cheapest option, which is still an attractive tradeoff.

Now, you may be wondering about the budget model's latency, since it was consistently the slowest. That would matter more if this were user-facing and synchronous. In this case, there is no reason the end user needs to trigger extraction and wait on it directly, so batching is a reasonable fit.

I would be more cautious about claiming OpenAI is definitively better than Nemotron on quality alone. Under this strict scorer, those two were close enough that second place flipped once across the three runs. Ultimately, I would still skip the reasoning model for this kind of extraction. Nemotron's thinking was sometimes useful when extracting from ambiguous formatting, but for structured output the extra cost was not justified by the measured quality here.

Final thoughts

I was definitely surprised by these findings and expected a more definitive "winner".

Now, this exploration covers 25 postings, which is a small sample size. Given the strict scoring method and a dataset this small, each missed field swings the accuracy score more than it would in a larger dataset. Generated data also misses things you only encounter when pulling this kind of data from real-world sources, such as unexpected artifacts. With more data, more runs, and a more rigorous human-in-the-loop step, we would likely see something different.

I also used the same system prompt for all models and test runs. Prompt variations could impact results and are worth exploring.

What you evaluate will depend on your final product as well. Your structured extraction problem might have different failure modes and need different scoring weights than mine.

The main takeaway here is that benchmarks alone won't tell you which model handles your messy data best. Build an evaluation against the data your model actually needs to perform well on and see what comes back.

If you would like to run this analysis yourself, the project is hosted on GitHub. If you have questions or want to chat, please get in touch with me on LinkedIn or X.

Top comments (2)

Max Quimby

Really solid methodology here, especially the strict scoring approach. The dual-currency salary edge case is a great example of why synthetic benchmarks miss real-world messiness.

We ran into a similar cost-vs-quality tradeoff building an AI tutoring platform. Our initial instinct was to use Claude Opus for everything (best quality), but when we actually measured per-user costs, LLM calls were 97% of total COGS — infrastructure was noise. Switching the student-facing tutor to Gemini Flash while keeping Opus for the strategist/planning layer cut costs by ~40% with negligible quality impact on the user-facing side.

Two things I'd add from that experience:

  1. Task-specific routing matters more than model choice. The best model for extraction might be terrible for summarization. We ended up with 3 different models handling different tasks in the same pipeline.

  2. Always script your cost math. We and another team independently validated a cost estimate that turned out to have a 1000x unit error. Both teams missed it because the numbers "felt right." Now we have a Python script that's the single source of truth.

The Nemotron reasoning token overhead is a great catch — roughly 5x the output tokens for roughly the same accuracy is a real trap for teams that default to "smarter = better."

Would love to see this expanded to 500+ postings. Bet the accuracy gaps either widen or close in interesting ways at scale.

Amanda

Thank you so much for reading and the thoughtful comment! This was definitely a learning experience and exploration for me so I really appreciate the validation.

I'm convinced all developers building agentic products will need to start really understanding how to do these comparisons that used to feel like something only a researcher would do. It's a new world for all of us for sure.

Agreed on wanting a larger-scale test! Small datasets cannot be trusted in real life, and I will absolutely add your points into my next project.

Crazy about the cost estimate finding... it's giving Office Space "fractions of a penny error"