Most Agent Stacks Are Overbuilt Until They Beat a Plain LLM Client on Real Work

I’ll say it directly:

A lot of “agent stack” hype looks impressive only because people keep testing it on clean toy tasks.

Real work is not clean.

Real clients do not hand you perfect specs.
They give you vague goals, missing constraints, bad wording, hidden assumptions, and unclear success criteria, then still expect something shippable.

That is why I do not care about another polished agent demo.

I care about this:

Can your workflow take a messy commercial requirement and turn it into usable engineering output?

Not vibes.
Not screenshots.
Not “my agent edited 12 files.”
Actual deliverable quality.

My challenge

Bring your agent stack.

Claude Code.
Cursor.
Aider.
OpenHands.
CrewAI.
AutoGen.
LangGraph.
Whatever you believe is serious.

I’ll bring one plain LLM client.

No custom agent stack.
No terminal-native coding agent.
No fancy orchestration layer.

Just a disciplined LLM-client workflow.
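
To be concrete about "disciplined": here is a minimal sketch of the shape I mean. The stage names mirror the judging criteria later in this post; everything else (the `ask` helper, the function names) is hypothetical illustration, not a fixed tool. The point is explicit stages plus an audit trail, whatever client sits underneath.

```python
# A minimal sketch, not a prescription. Stage names mirror the judging
# criteria in this post; `ask` is a hypothetical stand-in for whatever
# plain LLM client you use. The point: stage gating plus an audit trail.

from dataclasses import dataclass, field


def ask(prompt: str) -> str:
    """Stand-in for a single plain LLM client call. Swap in your own."""
    raise NotImplementedError("plug in your LLM client here")


@dataclass
class Artifact:
    stage: str
    content: str  # persisted in practice, so every step is auditable


@dataclass
class Workflow:
    trail: list[Artifact] = field(default_factory=list)

    def run(self, stage: str, prompt: str) -> str:
        output = ask(prompt)
        self.trail.append(Artifact(stage, output))  # the audit trail
        return output


def deliver(raw_brief: str) -> Workflow:
    wf = Workflow()
    spec = wf.run("reverse-engineer requirements",
                  "List the explicit asks, hidden assumptions, and open "
                  f"questions in this brief:\n{raw_brief}")
    plan = wf.run("decompose",
                  f"Break this spec into small, verifiable tasks:\n{spec}")
    code = wf.run("implement",
                  f"Implement the tasks one at a time, stating assumptions:\n{plan}")
    wf.run("validate",
           f"Write checks a client could run to accept or reject this:\n{code}")
    return wf
```

Swap any client into `ask`. The structure is the workflow; the model underneath is interchangeable.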

The task format

Pick a real Upwork-style coding task.

Why Upwork?

Because Upwork-style tasks are ugly in the exact way real engineering work is ugly:

  • vague requirements
  • incomplete specs
  • unclear technical boundaries
  • business pressure
  • client language that must be reverse-engineered
  • hidden delivery expectations
  • messy success criteria

That is where workflows either become real or collapse.

What we compare

Same task.
Same time limit.
Full workflow published.
Prompts published.
Code published.
Final result published.
No hidden cleanup.
No cherry-picking.
No private repo magic.

Judge the output on:

  • requirement reverse-engineering
  • architecture design
  • task decomposition
  • code quality
  • validation logic
  • debugging reasoning
  • assumptions exposed
  • delivery completeness
  • reproducibility
  • whether a real client could evaluate the result

Important rule

If your agent stack wins because it can edit files, run commands, execute tests, or operate a repo faster, fine.

Call that a tool-execution advantage.

But do not pretend that means it automatically produced better requirements, better architecture, better validation logic, or a more useful deliverable.

Those are different categories.

A faster wrench is not the same thing as a better engineer.

And yes, this is “just a workflow”

If your criticism is:

“You’re just using a fixed workflow inside an LLM client.”

Exactly.

That is the point.

A disciplined workflow is not a weakness.
It is the system.

Most people do not have an AI problem.
They have a workflow problem.

They keep stacking agents on top of unclear goals, weak decomposition, vague validation, and no audit trail.

Then they call the result “automation.”

I call it expensive noise.

My claim

I am not claiming a plain LLM client is better than agentic coding tools at terminal-native execution.

I am claiming something more uncomfortable:

A properly used plain LLM client can compete with many agent stacks at the higher-level work that actually matters: turning messy business requirements into usable engineering deliverables.

If that sounds wrong, prove it.

Pick the task.

Bring your stack.

I’ll bring one plain LLM client.

Let the output speak.
