
Maxim Saplin


Long-Horizon Agents Are Here. Full Autopilot Isn't

A good sanity check for long-horizon agents is not a benchmark. It is a task that is easy to verify and hard to fake.

That is why I still like my small hyperlink_button experiment so much. On paper, it sounds trivial: a Streamlit control that looks like a text link but behaves like a button. In reality, it is exactly the kind of task that exposes whether an agent can actually work.

The task is small enough that you can tell if it succeeded. But it is also awkward enough to matter: Python on the Streamlit side, React/TypeScript on the frontend side, packaging, integration, docs, testing, and all the usual places where “looks plausible” is not the same as “works.”

That is why I think this kind of project is a better test than a flashy benchmark. The real question is not whether a model can emit code. The real question is whether the workflow around it can keep it honest: make it read the right docs, implement the actual requirement, and prove it did not cheat.

That question feels especially relevant right now, because early 2026 has been full of confident claims that long-horizon agents crossed a real threshold.

METR has been tracking AI progress in terms of how long a task an agent can complete, not just how well it performs on narrow benchmarks. Sequoia’s “2026: This is AGI” proposed a deliberately practical definition: AGI is the ability to “figure things out.” And Anthropic’s “Measuring AI agent autonomy in practice” added real deployment data: longer Claude Code runs, more strategic auto-approval, and a shift from step-by-step approval toward active monitoring and interruption.

At the same time, the major product teams all published their own frontier stories:

  • Cursor on coordinating a swarm of agents to build software.
  • Anthropic on agents building a C compiler.
  • OpenAI on harness engineering, with humans steering and agents executing.

If you only read the headlines, you land in one of two lazy positions.

  • Either developers are cooked.
  • Or the whole thing is smoke and mirrors.

I think both reactions miss what is actually changing.

The real breakthrough is operational

The most important shift is not that models suddenly became autonomous software teams. The more interesting shift is that they can now operate inside real environments.

They can use a CLI. They can inspect files and logs. They can run code. They can read docs. They can check whether a change actually worked. They can keep iterating inside a feedback loop instead of handing a blob of code back to a human and hoping for the best.
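The "feedback loop" here can be made concrete. Below is a minimal sketch of the iterate-verify cycle: apply a candidate change, run a real verification command, and stop only when it passes. The `attempts` iterable and `feedback_loop` name are my own illustration, not any product's API; a real agent would generate each attempt from the previous check's output.

```python
import subprocess

def run_check(cmd):
    """Run a verification command; return (passed, combined output)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

def feedback_loop(attempts, check_cmd, max_iters=3):
    """Apply candidate changes until the check passes.

    `attempts` stands in for an agent proposing fixes (hypothetical);
    returns (index_of_passing_attempt, check_output), or (None, output)
    if nothing passed within max_iters.
    """
    output = ""
    for i, attempt in enumerate(attempts):
        if i >= max_iters:
            break
        attempt()                     # apply the candidate change
        ok, output = run_check(check_cmd)
        if ok:
            return i, output          # verified against reality, not vibes
    return None, output
```

The point of the sketch is the shape, not the code: the loop closes on an external check, so "looks plausible" and "works" stop being the same thing.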

That is a much bigger change than “better autocomplete” or “bigger context.”

It also explains why software is the natural first home for long-horizon agents. Software is unusually legible, testable, and reversible. You can run something, compare outputs, inspect logs, and decide whether the result is acceptable. In many other domains, verification is just as hard as doing the work in the first place.

That is one reason Anthropic’s autonomy data is so interesting. The pattern is not “experienced users blindly trust agents more.” It is subtler than that. They approve more automatically, but they also interrupt more strategically. The oversight style changes.

That matches my own experience almost exactly.

The mature workflow is not “approve every action forever.”

It is “let the system move, but stay close enough to redirect it when it starts drifting.”

The flagship demos were real. They were also unusually favorable.

I do think the big public demos matter. But I also think they are easy to misread.

The interesting part of Cursor’s post is not that a swarm of agents can brute-force software into existence. The interesting part is that coordination turned out to be hard: flat self-coordination was brittle, and a simpler planner/worker structure worked better than cleverer schemes.
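The planner/worker split is simple enough to sketch. This is my own toy illustration of the structure, not Cursor's implementation: one planner decomposes the goal into independent tasks, workers execute them in parallel, and the planner never executes. `planner` and `worker` here stand in for agent calls.

```python
from concurrent.futures import ThreadPoolExecutor

def plan_and_execute(planner, worker, goal, max_workers=4):
    """Planner/worker structure: the planner only decomposes,
    the workers only execute, and tasks must be independent
    for the parallelism to be safe.
    """
    tasks = planner(goal)                        # planner: goal -> list of tasks
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(worker, tasks))  # workers run independently
    return dict(zip(tasks, results))
```

The design choice worth noticing is the asymmetry: because roles never overlap, there is no negotiation between peers, which is exactly the failure mode flat self-coordination keeps hitting.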

The interesting part of Anthropic’s C compiler experiment is not just “an LLM built a compiler.” It is that the agents were operating in a world with unusually strong feedback: serious tests, known-good oracles, structured tasks, and a domain with decades of prior art. Chris Lattner’s review and Pushpendre Rastogi’s analysis are valuable precisely because they make that visible.

And OpenAI’s harness engineering post may be the clearest articulation of the new role split: humans steer, agents execute. The environment, observability, repository docs, architecture rules, and feedback loops become first-class engineering artifacts.

That does not make these demos fake.

It does make them easier to interpret correctly.

They are not proofs that software teams can be replaced by autonomous agent swarms. They are proofs that strong harnesses, rich feedback, and explicit structure can now unlock a surprising amount of useful work.

That is a big deal. It is just a different deal than the headlines suggest.

There is also a simpler reason these demos were unusually favorable: they were not blank-slate tasks. Browsers sit on top of standards, reference implementations, and mountains of prior art. Compilers sit on top of decades of specifications, tests, literature, and engineering patterns. Even when the outcome is new, the terrain is already heavily mapped.

That matters.

Two orchestration patterns, neither of them magic

After the talk, I found it useful to separate two broad ways people currently try to orchestrate long-running agent work.

The first is the Ralph pattern: fresh agent instances in a loop, with memory externalized into git history, progress files, and task state. It is crude, but honest. Each run starts with clean context.
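A minimal sketch of that loop, assuming `run_agent` is a placeholder for invoking a coding-agent CLI in one-shot mode: each iteration spawns a fresh instance with clean context, and all memory lives on disk in a progress file plus the repo itself. The function name and `PROGRESS.md` convention are illustrative, not a standard.

```python
from pathlib import Path

def ralph_loop(run_agent, workdir: Path, max_runs: int = 10) -> bool:
    """Ralph-style loop: fresh agent per iteration, memory externalized.

    `run_agent` (hypothetical) reads the progress file, does one unit
    of work, and appends a status line; the loop stops when the agent
    reports DONE or the run budget is exhausted.
    """
    progress = workdir / "PROGRESS.md"
    progress.touch()
    for _ in range(max_runs):
        run_agent(workdir)  # fresh instance, no carried-over context
        last = progress.read_text().splitlines()[-1:]
        if last and last[0].strip() == "DONE":
            return True
    return False
```

The crudeness is the feature: because nothing survives in context between runs, whatever the agent "knows" has to be written down, which makes the state inspectable by a human at every step.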

The second is LLM-native orchestration, where a lead agent manages subagents or teammates inside a shared workflow. Claude Code agent teams are a good example: separate contexts, shared tasks, direct inter-agent messaging, and an explicit lead.

In theory, the second model should feel much smarter.

In practice, my own experiments did not convince me that prompt-level orchestration is the real unlock.

What I saw was much messier. The manager often wanted to become an executor. It would stop and ask for confirmation. It would ignore the delegation policy. In some runs it violated the brief completely and fell back to the exact CSS or JS workaround I had explicitly ruled out.

That does not mean subagents are useless.

It means orchestration is still fragile.

Right now it feels more like a product and training problem than something you can solve by writing a sufficiently stern prompt.

What actually worked better

The patterns that helped were much less romantic.

  • Give the model a CLI.
  • Give it docs within reach.
  • Run a preflight check before it writes code.
  • Make verification cheap.
  • Prefer headless checks over fragile visual wandering.
  • Use parallelism only when tasks are truly independent.
  • Add a QA-style handoff before the real human handoff.
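The "preflight check" item is the cheapest of these to implement, so here is a sketch of one, with tool and doc names chosen for illustration: before the agent writes a line of code, verify that the harness actually has the CLIs and docs the task assumes, and fail fast instead of letting the agent discover a gap mid-run.

```python
import shutil
from pathlib import Path

def preflight(repo: Path,
              required_tools=("git",),
              required_docs=("README.md",)) -> list[str]:
    """Cheap preflight before an agent run.

    Returns a list of problems; an empty list means the agent
    is cleared to start. Defaults are illustrative.
    """
    problems = []
    for tool in required_tools:
        if shutil.which(tool) is None:          # tool not on PATH
            problems.append(f"missing CLI tool: {tool}")
    for doc in required_docs:
        if not (repo / doc).exists():           # doc not within reach
            problems.append(f"missing doc: {doc}")
    return problems
```

A check like this is boring on purpose: it moves failure from minute forty of an agent run to second one of the harness.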

That changed the economics of the work.

Once the agent could run code, inspect outputs, and verify behavior directly, it stopped acting like a pure code generator and started acting more like an operator. Not an autonomous engineer. Not a magical coworker. More like a very fast worker inside a good harness.

That distinction matters.

The value is not just “the model got smarter.”

The value is that the model can now participate in a loop.

Why I still don't buy the full autopilot story

At the far end of the spectrum sits the software-factory vision, or what Simon Willison described in his write-up of StrongDM as the Dark Factory: agents writing code, agents testing code, agents reviewing code, with humans mostly stepping out of the implementation loop.

I find that direction fascinating.

I also think it clarifies how much infrastructure is required before “no human review” sounds remotely plausible.

In my own work, fully unattended runs still tend to produce something functionally OK but awkward, sloppy, or strangely overcomplicated. They may satisfy a narrow verifier while violating the spirit of the task. They may finish the easy 95% and quietly give up on the hard 5%. They may pass checks and still feel wrong.

That is not a theoretical objection.

That is what I keep seeing.

And honestly, it also matches the broader pattern in public demos. The output can be impressive, useful, and real while still being rough, unstable, or harder to trust than the headline implies.

That is why I think the most useful conclusion is narrower than the hype, but stronger than the skepticism.

The real state of long-horizon agents

Long-horizon agents are real. They already change how software gets built.

But the practical value today comes less from autonomous software teams and more from supervised software operations: strong specs, strong harnesses, cheap verification, explicit context, and active steering.

The fully autonomous rocket-to-Mars version still disappoints me.

The version where I launch five agents in parallel, let them work on bounded tasks, and then challenge the result like a tough lead or QA engineer is already genuinely useful.

That, to me, is the real state of agentic engineering in early 2026.
