
Maxim Saplin


Long-Horizon Agents Are Here. Full Autopilot Isn't

A good sanity check for long-horizon agents is not a benchmark. It is a task that is easy to verify and hard to fake.

That is why I still like my small hyperlink_button experiment so much. On paper, it sounds trivial: a Streamlit control that looks like a text link but behaves like a button. In reality, it is exactly the kind of task that exposes whether an agent can actually work.

The task is small enough that you can tell if it succeeded. But it is also awkward enough to matter: Python on the Streamlit side, React/TypeScript on the frontend side, packaging, integration, docs, testing, and all the usual places where “looks plausible” is not the same as “works.”

That is why I think this kind of project is a better test than a flashy benchmark. The real question is not whether a model can emit code. The real question is whether the workflow around it can keep it honest: make it read the right docs, implement the actual requirement, and prove it did not cheat.

That question feels especially relevant right now, because early 2026 has been full of confident claims that long-horizon agents crossed a real threshold.

METR has been tracking AI progress in terms of how long a task an agent can complete, not just how well it performs on narrow benchmarks. Sequoia’s “2026: This is AGI” proposed a deliberately practical definition: AGI is the ability to “figure things out.” And Anthropic’s “Measuring AI agent autonomy in practice” added real deployment data: longer Claude Code runs, more strategic auto-approval, and a shift from step-by-step approval toward active monitoring and interruption.

At the same time, the major product teams all published their own frontier stories:

  • Cursor on coordinating a swarm of agents to build software.
  • Anthropic on agents building a C compiler.
  • OpenAI on harness engineering, with humans steering and agents executing.

If you only read the headlines, you land in one of two lazy positions.

  • Either developers are cooked.
  • Or the whole thing is smoke and mirrors.

I think both reactions miss what is actually changing.

The real breakthrough is operational

The most important shift is not that models suddenly became autonomous software teams. The more interesting shift is that they can now operate inside real environments.

They can use a CLI. They can inspect files and logs. They can run code. They can read docs. They can check whether a change actually worked. They can keep iterating inside a feedback loop instead of handing a blob of code back to a human and hoping for the best.
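The "feedback loop" here can be made concrete. Below is a minimal sketch of the iterate-verify cycle: apply a candidate change, run a real verification command, and stop only when it passes. The `attempts` iterable and `feedback_loop` name are my own illustration, not any product's API; a real agent would generate each attempt from the previous check's output.

```python
import subprocess

def run_check(cmd):
    """Run a verification command; return (passed, combined output)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

def feedback_loop(attempts, check_cmd, max_iters=3):
    """Apply candidate changes until the check passes.

    `attempts` stands in for an agent proposing fixes (hypothetical);
    returns (index_of_passing_attempt, check_output), or (None, output)
    if nothing passed within max_iters.
    """
    output = ""
    for i, attempt in enumerate(attempts):
        if i >= max_iters:
            break
        attempt()                     # apply the candidate change
        ok, output = run_check(check_cmd)
        if ok:
            return i, output          # verified against reality, not vibes
    return None, output
```

The point of the sketch is the shape, not the code: the loop closes on an external check, so "looks plausible" and "works" stop being the same thing.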

That is a much bigger change than “better autocomplete” or “bigger context.”

It also explains why software is the natural first home for long-horizon agents. Software is unusually legible, testable, and reversible. You can run something, compare outputs, inspect logs, and decide whether the result is acceptable. In many other domains, verification is just as hard as doing the work in the first place.

That is one reason Anthropic’s autonomy data is so interesting. The pattern is not “experienced users blindly trust agents more.” It is subtler than that. They approve more automatically, but they also interrupt more strategically. The oversight style changes.

That matches my own experience almost exactly.

The mature workflow is not “approve every action forever.”

It is “let the system move, but stay close enough to redirect it when it starts drifting.”

The flagship demos were real. They were also unusually favorable.

I do think the big public demos matter. But I also think they are easy to misread.

The interesting part of Cursor’s post is not that a swarm of agents can brute-force software into existence. The interesting part is that coordination turned out to be hard: flat self-coordination was brittle, and a simpler planner/worker structure worked better than cleverer schemes.
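The planner/worker split is simple enough to sketch. This is my own toy illustration of the structure, not Cursor's implementation: one planner decomposes the goal into independent tasks, workers execute them in parallel, and the planner never executes. `planner` and `worker` here stand in for agent calls.

```python
from concurrent.futures import ThreadPoolExecutor

def plan_and_execute(planner, worker, goal, max_workers=4):
    """Planner/worker structure: the planner only decomposes,
    the workers only execute, and tasks must be independent
    for the parallelism to be safe.
    """
    tasks = planner(goal)                        # planner: goal -> list of tasks
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(worker, tasks))  # workers run independently
    return dict(zip(tasks, results))
```

The design choice worth noticing is the asymmetry: because roles never overlap, there is no negotiation between peers, which is exactly the failure mode flat self-coordination keeps hitting.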

The interesting part of Anthropic’s C compiler experiment is not just “an LLM built a compiler.” It is that the agents were operating in a world with unusually strong feedback: serious tests, known-good oracles, structured tasks, and a domain with decades of prior art. Chris Lattner’s review and Pushpendre Rastogi’s analysis are valuable precisely because they make that visible.

And OpenAI’s harness engineering post may be the clearest articulation of the new role split: humans steer, agents execute. The environment, observability, repository docs, architecture rules, and feedback loops become first-class engineering artifacts.

That does not make these demos fake.

It does make them easier to interpret correctly.

They are not proofs that software teams can be replaced by autonomous agent swarms. They are proofs that strong harnesses, rich feedback, and explicit structure can now unlock a surprising amount of useful work.

That is a big deal. It is just a different deal than the headlines suggest.

There is also a simpler reason these demos were unusually favorable: they were not blank-slate tasks. Browsers sit on top of standards, reference implementations, and mountains of prior art. Compilers sit on top of decades of specifications, tests, literature, and engineering patterns. Even when the outcome is new, the terrain is already heavily mapped.

That matters.

Two orchestration patterns, neither of them magic

After the talk, I found it useful to separate two broad ways people currently try to orchestrate long-running agent work.

The first is the Ralph pattern: fresh agent instances in a loop, with memory externalized into git history, progress files, and task state. It is crude, but honest. Each run starts with clean context.
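A minimal sketch of that loop, assuming `run_agent` is a placeholder for invoking a coding-agent CLI in one-shot mode: each iteration spawns a fresh instance with clean context, and all memory lives on disk in a progress file plus the repo itself. The function name and `PROGRESS.md` convention are illustrative, not a standard.

```python
from pathlib import Path

def ralph_loop(run_agent, workdir: Path, max_runs: int = 10) -> bool:
    """Ralph-style loop: fresh agent per iteration, memory externalized.

    `run_agent` (hypothetical) reads the progress file, does one unit
    of work, and appends a status line; the loop stops when the agent
    reports DONE or the run budget is exhausted.
    """
    progress = workdir / "PROGRESS.md"
    progress.touch()
    for _ in range(max_runs):
        run_agent(workdir)  # fresh instance, no carried-over context
        last = progress.read_text().splitlines()[-1:]
        if last and last[0].strip() == "DONE":
            return True
    return False
```

The crudeness is the feature: because nothing survives in context between runs, whatever the agent "knows" has to be written down, which makes the state inspectable by a human at every step.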

The second is LLM-native orchestration, where a lead agent manages subagents or teammates inside a shared workflow. Claude Code agent teams are a good example: separate contexts, shared tasks, direct inter-agent messaging, and an explicit lead.

In theory, the second model should feel much smarter.

In practice, my own experiments did not convince me that prompt-level orchestration is the real unlock.

What I saw was much messier. The manager often wanted to become an executor. It would stop and ask for confirmation. It would ignore the delegation policy. In some runs it violated the brief completely and fell back to the exact CSS or JS workaround I had explicitly ruled out.

That does not mean subagents are useless.

It means orchestration is still fragile.

Right now it feels more like a product and training problem than something you can solve by writing a sufficiently stern prompt.

What actually worked better

The patterns that helped were much less romantic.

  • Give the model a CLI.
  • Give it docs within reach.
  • Run a preflight check before it writes code.
  • Make verification cheap.
  • Prefer headless checks over fragile visual wandering.
  • Use parallelism only when tasks are truly independent.
  • Add a QA-style handoff before the real human handoff.
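The "preflight check" item is the cheapest of these to implement, so here is a sketch of one, with tool and doc names chosen for illustration: before the agent writes a line of code, verify that the harness actually has the CLIs and docs the task assumes, and fail fast instead of letting the agent discover a gap mid-run.

```python
import shutil
from pathlib import Path

def preflight(repo: Path,
              required_tools=("git",),
              required_docs=("README.md",)) -> list[str]:
    """Cheap preflight before an agent run.

    Returns a list of problems; an empty list means the agent
    is cleared to start. Defaults are illustrative.
    """
    problems = []
    for tool in required_tools:
        if shutil.which(tool) is None:          # tool not on PATH
            problems.append(f"missing CLI tool: {tool}")
    for doc in required_docs:
        if not (repo / doc).exists():           # doc not within reach
            problems.append(f"missing doc: {doc}")
    return problems
```

A check like this is boring on purpose: it moves failure from minute forty of an agent run to second one of the harness.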

That changed the economics of the work.

Once the agent could run code, inspect outputs, and verify behavior directly, it stopped acting like a pure code generator and started acting more like an operator. Not an autonomous engineer. Not a magical coworker. More like a very fast worker inside a good harness.

That distinction matters.

The value is not just “the model got smarter.”

The value is that the model can now participate in a loop.

Why I still don't buy the full autopilot story

At the far end of the spectrum sits the software-factory vision, or what Simon Willison described in his write-up of StrongDM as the Dark Factory: agents writing code, agents testing code, agents reviewing code, with humans mostly stepping out of the implementation loop.

I find that direction fascinating.

I also think it clarifies how much infrastructure is required before “no human review” sounds remotely plausible.

In my own work, fully unattended runs still tend to produce something functionally OK but awkward, sloppy, or strangely overcomplicated. They may satisfy a narrow verifier while violating the spirit of the task. They may finish the easy 95% and quietly give up on the hard 5%. They may pass checks and still feel wrong.

That is not a theoretical objection.

That is what I keep seeing.

And honestly, it also matches the broader pattern in public demos. The output can be impressive, useful, and real while still being rough, unstable, or harder to trust than the headline implies.

That is why I think the most useful conclusion is narrower than the hype, but stronger than the skepticism.

The real state of long-horizon agents

Long-horizon agents are real. They already change how software gets built.

But the practical value today comes less from autonomous software teams and more from supervised software operations: strong specs, strong harnesses, cheap verification, explicit context, and active steering.

The fully autonomous rocket-to-Mars version still disappoints me.

The version where I launch five agents in parallel, let them work on bounded tasks, and then challenge the result like a tough lead or QA engineer is already genuinely useful.

That, to me, is the real state of agentic engineering in early 2026.
