DEV Community

Alex Wu
Stop Obsessing Over the AI Model. The Harness Is What Actually Matters.

Every week, someone asks me which LLM we use at Anythoughts.ai. GPT-4o? Claude 3.5? Gemini? They want the magic model — the one that makes everything work.

Here's the honest answer: the model is almost never the bottleneck.

After running AI agents autonomously for months — handling cold outreach, content publishing, SMB automation workflows — I've come to believe that 90% of agent quality comes from the harness, not the model.

What I Mean by "Harness"

The harness is everything around the model:

  • Context management — what you put in the prompt, what you leave out
  • Tool definitions — how you describe available actions to the agent
  • State and memory — how the agent tracks what's happened, what to do next
  • Error recovery — what happens when a tool call fails or the model hallucinates
  • Output validation — how you catch bad output before it hits production

The model is just a function: f(context) → tokens. The harness is everything else.
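To make that framing concrete, here's a minimal sketch of a harness loop. Every name in it is illustrative (this is not any particular framework, and not our production code): the model is a swappable function, and context management, tool dispatch, validation, and error recovery all live in the loop around it.

```python
# Minimal harness sketch: the model is one swappable function call;
# everything else is the harness. All names are illustrative.

def run_agent(model, tools, task, validate, max_steps=5):
    context = [f"Task: {task}"]                 # context management
    for _ in range(max_steps):
        action = model("\n".join(context))      # f(context) -> tokens
        if action["type"] == "finish":
            output = action["output"]
            if validate(output):                # output validation
                return output
            context.append("Validation failed; revise your answer.")
            continue
        tool = tools.get(action["tool"])
        if tool is None:                        # model hallucinated a tool
            context.append(f"Unknown tool: {action['tool']}")
            continue
        try:
            result = tool(action["args"])       # tool execution
            context.append(f"Tool result: {result}")  # state / memory
        except Exception as exc:                # error recovery
            context.append(f"Tool error: {exc}; try another approach.")
    return None  # ran out of steps without a validated answer
```

Swapping models here is a one-line change. Improving any other branch of the loop is where, in my experience, the quality actually comes from.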

The Mistake I Made Early On

When we first built our outreach automation, I spent two weeks benchmarking models. I tested prompts across GPT-4o, Claude 3 Sonnet, Mistral Large. I built elaborate evaluation spreadsheets.

The results were... marginal. Maybe 10-15% quality difference between the best and worst.

Then I spent one day improving how we structured the context — cleaner tool descriptions, better few-shot examples, adding a validation step before the agent could mark a task complete.

Quality jumped 40%.

Same model. Better harness.

A Concrete Example

We have an agent that qualifies inbound leads and drafts first-touch emails. Here's what changed:

Before (model-focused thinking):

```
You are a sales assistant. Write a cold email to {name} at {company}.
```

After (harness-focused thinking):

```
You are a sales assistant for Anythoughts.ai.

Context about the lead:
- Company: {company} ({industry}, {employee_count} employees)
- Role: {title}
- Pain point we believe they have: {inferred_pain}
- Our relevant solution: {solution_match}

Before writing, check:
1. Is the pain point plausible for this role and company size? (yes/no)
2. Do we have a direct solution match? (yes/no)

If both are yes: write a 3-sentence email. First sentence references a specific thing about their company. Second sentence connects it to one concrete outcome we've delivered. Third is a low-friction CTA.

If either is no: output SKIP with reason.
```

The second prompt doesn't rely on the model being smarter. It gives the model a structured job to do, with explicit decision points and validation baked in.
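The harness side of that prompt is just as important as the prompt itself: fill in the lead fields, call the model, and treat SKIP as a first-class outcome instead of forcing an email out of every lead. The field names and the SKIP convention below mirror the prompt above; everything else (the function, the return shape) is an illustrative sketch, not our actual code.

```python
# Illustrative harness around the lead-qualification prompt. The
# template fields and SKIP convention come from the prompt; the
# function names and return shape are assumptions for this sketch.

PROMPT = """You are a sales assistant for Anythoughts.ai.

Context about the lead:
- Company: {company} ({industry}, {employee_count} employees)
- Role: {title}
- Pain point we believe they have: {inferred_pain}
- Our relevant solution: {solution_match}

If either check fails, output SKIP with reason."""

def draft_email(model, lead):
    reply = model(PROMPT.format(**lead)).strip()
    if reply.startswith("SKIP"):
        # Don't send; surface the reason so the harness can log it.
        return {"send": False, "reason": reply}
    return {"send": True, "email": reply}
```

The key design choice is that a SKIP is a success case for the harness, not an error: a lead the agent correctly declines is better output than a forced, generic email.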

Why Engineers Get This Wrong

Models are tangible. You can benchmark them. You can point to a number and say "this one scores 78.3% on HumanEval." There's a leaderboard.

Harness quality is fuzzy. How do you measure "context is well-structured"? There's no benchmark for "the agent recovers gracefully from tool errors."

So engineers optimize for what's measurable, even when it's not the real constraint.

What We Actually Optimize For

At Anythoughts.ai, we currently run on Claude Sonnet via AWS Bedrock. Not because we did extensive benchmarking — because it's reliable, reasonably priced, and integrated into our infra.

What we spend real time on:

  1. Skill files — structured markdown files that define exactly how each agent should behave, what tools it has, what outputs it should produce
  2. State tracking — every agent writes to persistent state files so context isn't lost between runs
  3. Validation steps — explicit checkpoints where the agent confirms its own output before taking irreversible actions
  4. Failure modes — logging what went wrong so we can improve the harness, not just retry with a different model
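Point 2 above can be sketched in a few lines. The file layout here is an assumption (our actual state files are richer), but the idea is exactly this small: each run loads what happened before, appends what it did, and writes it back, so context survives between runs.

```python
# Sketch of per-agent persistent state: load, append, write back.
# The JSON layout ({"history": [...]}) is illustrative, not our schema.

import json
from pathlib import Path

def load_state(path):
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else {"history": []}

def record(path, event):
    state = load_state(path)
    state["history"].append(event)
    Path(path).write_text(json.dumps(state, indent=2))
    return state
```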

The Practical Takeaway

If your agent is producing bad output, before you switch models:

  1. Add more structure to the prompt — break big tasks into explicit sub-steps
  2. Add a validation step — make the agent check its own work before acting
  3. Improve your tool descriptions — be explicit about what each tool does and when to use it
  4. Add examples — one good few-shot example beats three paragraphs of instructions
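Step 2 above doesn't have to mean another model call. Even a mechanical check before anything irreversible happens catches a surprising amount. Here's a toy version; the "3 sentences" rule comes from the email prompt earlier, and the function names are illustrative.

```python
# Toy validation step: gate an irreversible action (sending) behind
# a mechanical check on the draft. The 3-sentence limit matches the
# email prompt; the check itself is deliberately simple.

import re

def passes_check(draft, max_sentences=3):
    sentences = [s for s in re.split(r"[.!?]+\s*", draft.strip()) if s]
    return 0 < len(sentences) <= max_sentences

def guarded_send(draft, send):
    if not passes_check(draft):
        return "rejected: draft failed validation"
    return send(draft)  # only a validated draft reaches the real action
```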

If you've done all that and the model is still failing, then switch models.

In my experience, you'll rarely need to.


We're building Anythoughts.ai as a fully autonomous AI agency — agents handling real client work without human execution. If you're building something similar or want to follow the experiment, the blog is where we document what's actually working.
