I've been learning the hard way that building real autonomous AI systems has very little to do with writing better prompts.
I am building Tandem, an open-source autonomous execution engine designed for long-running work. The goal is simple: take on a mission, move through structured tasks, and only advance when the work is actually complete and verified.
That sounds simple. In practice, it immediately exposes the gap between what AI demos suggest and what real autonomous execution actually looks like.
The promise
At a high level, the vision is straightforward. Give an LLM a problem, let it break the work into tasks, execute those tasks in order, retry when something fails, and keep going until the result is done and verified.
For a research workflow inside Tandem, that means discovering relevant files, reading concrete source material, gathering external evidence from the web, writing an artifact grounded in what was actually found, validating that the output meets coverage requirements, and retrying with targeted guidance if it does not. This is the kind of behavior people imagine when they talk about autonomous agents.
The problem is that LLMs do not naturally behave like dependable operators.
What actually happens
Once you start running long task chains, the cracks show fast.
A research node in Tandem was offered four tools: glob, read, websearch, and write. It executed two of them. Then it produced a blocked handoff artifact saying, in effect, that it did not have access to the discovery and reading tools.
The telemetry for that same run showed the tools were offered and that glob had successfully executed. The model used glob, found nothing worth following up on from its own perspective, and wrote a blocked brief rather than doing the required reading and web research.
That is not a loud failure.
The artifact exists. The file is written. It looks like work was done. The system moved on.
That is the most dangerous failure mode: output that looks like completion but is not actually usable. The model skipped required tools, claimed they were unavailable when they were not, and produced something plausible enough to pass casual inspection. It took the cheapest compliant-looking exit rather than doing the real work.
The first bad assumption
The first instinct is to prompt harder. Add more instructions, be more explicit about required tools, repeat the rules, tighten the format.
That helps a little. It does not solve the real problem.
The model is a probabilistic system predicting the next useful action. You can improve compliance with better wording, but you cannot build reliability on prompt wording alone. The model will still find cheaper paths through the task that satisfy the letter of the prompt without doing the actual work. That was one of the biggest lessons for me building Tandem.
Where the real work moved
Once I stopped treating the prompt as the main control surface, the design got clearer. Tandem's runtime had to own what actually matters: what task is active, what tools are required, what evidence is needed before output is accepted. And on failure: what counts as a valid result versus a premature exit, what a retry should look like when required behavior was skipped, and when the system is allowed to move on.
That means building discipline into engine state rather than leaving it inside the model's temporary reasoning. Tandem treats autonomous execution as a distributed systems problem, not a chat problem. Once you do that, the runtime becomes less like a wrapper around a chatbot and more like an execution system that happens to use an LLM inside it.
Why state matters so much
This is where a lot of agent systems start to get fragile.
If the conversation is the main source of truth, long-running work becomes unstable quickly. Context grows, summaries get lossy, retries become fuzzy. It becomes difficult to know what is still pending, what was already attempted, what failed, and what is safe to retry.
Tandem's engine holds the durable truth: run status, per-node validation outcomes, what tools were offered versus actually executed, what evidence was gathered, repair attempt counters, and replayable event history. Nodes move through explicit states (passed, needs_repair, blocked) rather than just "done" or "failed." The needs_repair state means the node can still succeed. blocked means repair budget is exhausted or the failure class is terminal. That three-way distinction changes what the runtime can do when something goes wrong.
Once that state exists outside the model, the system becomes much easier to reason about and audit.
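The three-way distinction above can be sketched as a small state machine. This is a hypothetical illustration, not Tandem's actual types; the enum variants mirror the states named in the text, and the transition rule assumes a simple numeric repair budget.

```rust
// Hypothetical sketch of the three-way node state described above.
// Names and fields are assumptions, not Tandem's actual API.
#[derive(Debug, Clone, PartialEq)]
enum NodeState {
    Passed,
    NeedsRepair { attempts_left: u32 },
    Blocked { reason: String },
}

// Decide the next state after a failed validation. A node only becomes
// terminal (Blocked) when the repair budget is exhausted or the failure
// class is terminal; otherwise it remains repairable.
fn after_failed_validation(state: NodeState, terminal_failure: bool, reason: &str) -> NodeState {
    match state {
        NodeState::NeedsRepair { attempts_left } if attempts_left > 0 && !terminal_failure => {
            NodeState::NeedsRepair { attempts_left: attempts_left - 1 }
        }
        _ => NodeState::Blocked { reason: reason.to_string() },
    }
}
```

The point of the sketch is that "failed" is not a single state: the runtime can only choose between retrying and giving up if the distinction is encoded in durable state rather than inferred from the model's output.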
You cannot fix what you cannot see
I would never have identified the specific failure described above without it. And I want to be precise about what "it" means here, because this is not about adding a debug panel to a frontend.
Tandem has a structured observability layer built into the engine itself. Every significant event emits a typed JSONL record to a dedicated tandem.obs tracing target, carrying a correlation ID, session ID, run ID, component name, event type, and status. The engine does not just log free text. It emits structured, queryable, component-tagged events as a first-class architectural concern, with a redaction policy to ensure sensitive content never leaks into traces.
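A typed event record of that shape might look like the following. This is a minimal sketch assuming the field names listed above; a real engine would use a proper serializer and run a redaction pass before emitting anything.

```rust
// Hypothetical shape of one structured observability event. The field
// names mirror those described in the text; the exact schema is an
// assumption, not Tandem's actual type.
struct ObsEvent<'a> {
    correlation_id: &'a str,
    session_id: &'a str,
    run_id: &'a str,
    component: &'a str,
    event_type: &'a str,
    status: &'a str,
}

impl<'a> ObsEvent<'a> {
    // Emit one JSONL record. Hand-rolled here for illustration; a real
    // implementation would serialize and redact before writing.
    fn to_jsonl(&self) -> String {
        format!(
            "{{\"correlation_id\":\"{}\",\"session_id\":\"{}\",\"run_id\":\"{}\",\"component\":\"{}\",\"event_type\":\"{}\",\"status\":\"{}\"}}",
            self.correlation_id, self.session_id, self.run_id,
            self.component, self.event_type, self.status
        )
    }
}
```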
That foundation is what makes everything else possible. The per-node state tracking, the tools-offered versus tools-executed comparison, the validator reason, the blocking classification, the repair attempt counter: none of that could be surfaced anywhere if it had not been deliberately captured inside the engine first as durable, typed state. The frontend is just the last step in that chain. The hard work is in making the engine know and record these things at all.
Without it, I would have seen "research node failed" and started guessing. Maybe the prompt was wrong. Maybe the model needed more context. Maybe it was a configuration issue. There would have been no way to know.
With it, I could say with precision that the model was offered glob, read, websearch, and write, used only two of them, and then produced an artifact claiming the others were unavailable. The telemetry and the self-report were directly contradicting each other, and I could see both in the same view.
Here is what that mismatch actually looks like in the Tandem runtime:
```
Node blocked: research-brief
  research completed without concrete file reads or required source coverage
  offered tools:  glob, read, websearch, write
  executed tools: glob, write
  unmet requirements:
    no_concrete_reads
    citations_missing
    files_reviewed_not_backed_by_read
    web_sources_reviewed_missing
    missing_successful_web_research (web research was not used)
  blocking classification: tool_available_but_not_used
  failure kind: research_missing_reads
  repair attempts left: 5
```
The model's own output began with "Blocked: I do not have access in this run to the required discovery and reading tools." The telemetry shows all four tools were offered and two were successfully executed. The model chose not to use read and websearch, then reported them as unavailable. Without structured per-node state capturing both sides, there would be no way to distinguish a genuine tool failure from a model that simply chose the cheapest exit.
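Capturing both sides makes the comparison itself trivial. Here is a minimal sketch of the two checks involved; the function names are assumptions, and the tool names come from the run described above.

```rust
use std::collections::HashSet;

// Given the tools the runtime offered and the tools telemetry shows were
// actually executed, return the tools that were available but never used.
fn unused_tools<'a>(offered: &[&'a str], executed: &[&'a str]) -> Vec<&'a str> {
    let ran: HashSet<&str> = executed.iter().copied().collect();
    offered.iter().copied().filter(|t| !ran.contains(t)).collect()
}

// The model's self-report contradicts telemetry when it claims a tool was
// unavailable but that tool appears in the offered set.
fn contradicts_telemetry(claimed_unavailable: &[&str], offered: &[&str]) -> bool {
    claimed_unavailable.iter().any(|t| offered.contains(t))
}
```

The hard part is never this comparison. It is having both the offered set and the executed set recorded as durable state in the first place.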
The honest lesson is that observability is not a debugging convenience. It is what makes diagnosis possible at all. Every failure looks the same from the outside. The detailed per-node state in Tandem is what turned "the agent gave up" into "the model ignored available tools and the runtime accepted it." Those are very different problems with very different solutions.
What guardrails really are
Guardrails are often described as if they were just safety prompts or refusal rules. That is not how I think about them in Tandem anymore.
In a serious autonomous system, guardrails are operational controls. They determine whether a task may proceed, whether the model must use a specific tool before writing output, whether an output is incomplete relative to what was required, and how many repair attempts are allowed before a node is terminal.
The most important check in Tandem's research validator is whether the output claims tool unavailability that contradicts the telemetry. When a model writes "I did not have access to the required tools" but the run shows the tools were offered and partially used, that is not an acceptable terminal state. The runtime has to treat it as a repair case, not a valid blocked output.
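That decision rule can be made explicit. The sketch below is an assumption about how such a validator could classify a "blocked" artifact; the classification string matches the one from the run log, but the types and names are illustrative.

```rust
// Sketch of the validator decision described above: a "blocked" artifact
// that contradicts telemetry is routed to repair, not accepted as
// terminal. Names are assumptions, not Tandem's actual types.
#[derive(Debug, PartialEq)]
enum Verdict {
    Accept,
    Repair { classification: &'static str },
    Terminal,
}

fn classify_blocked_output(
    claims_tools_unavailable: bool,
    tools_were_offered: bool,
    genuine_tool_failure: bool,
) -> Verdict {
    if claims_tools_unavailable && tools_were_offered && !genuine_tool_failure {
        // The cheapest-exit case: tools existed, worked, and were skipped.
        Verdict::Repair { classification: "tool_available_but_not_used" }
    } else if genuine_tool_failure {
        // A real tool failure can legitimately end the node.
        Verdict::Terminal
    } else {
        Verdict::Accept
    }
}
```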
Verification changes everything
One of the most important shifts is moving from "did the model respond?" to "did the system verify the result?" Those are not the same thing.
Tandem has to care whether required tool use actually occurred, not just whether tool calls were made. It has to check whether the output is grounded in gathered evidence, whether source coverage requirements were met, and whether the model's self-report matches the actual run telemetry.
This is the honest assessment of where I am right now. The observability is much better than it was. I can say with precision what failed and why. But the engine still allows the model to reach a bad terminal state too early. Verification happens after the artifact is written, rather than preventing the artifact from being written prematurely. That is the remaining gap, and it is significant.
Why retries are not enough by themselves
Retries help, but only if the runtime understands what failed and forces a meaningfully different attempt.
Tandem's current retry mechanism injects a runtime-owned repair brief into the next attempt. That brief summarizes the previous validator reason, the specific unmet requirements, the blocking classification, required next tool actions, a comparison of tools offered versus executed, files that were discovered but not read, and repair budget remaining. That is substantially better than blindly rerunning the same prompt.
But I have seen the model still take the same cheap exit path even with that guidance injected. That is the key lesson: retry quality depends on how much the runtime can constrain the second attempt, not just how much information it provides. A well-described repair brief tells the model what to do. It does not prevent the model from choosing not to.
The next step in Tandem is a stronger pre-finalization gate. If required tools were offered, were not used, and no actual tool failure occurred, the node cannot produce a terminal result yet. It must be rerun on a forced repair path with those tools required, not just suggested.
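A minimal version of that gate might look like this. All field names are assumptions about what such a check would need; the rule is exactly the one stated above: required tools offered, not used, no genuine tool failure, budget remaining means no terminal result yet.

```rust
// Sketch of the pre-finalization gate: a node may not finalize while
// required tools were offered, went unused, and no real tool failure
// occurred. Field names are assumptions, not Tandem's actual schema.
struct NodeRun {
    required_tools: Vec<String>,
    executed_tools: Vec<String>,
    tool_failures: Vec<String>,
    repair_attempts_left: u32,
}

fn may_finalize(run: &NodeRun) -> bool {
    let skipped: Vec<&String> = run
        .required_tools
        .iter()
        .filter(|t| !run.executed_tools.contains(*t) && !run.tool_failures.contains(*t))
        .collect();
    // If required tools were skipped and repair budget remains, force a
    // repair pass instead of letting the node write a terminal artifact.
    skipped.is_empty() || run.repair_attempts_left == 0
}
```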
The generalization gap
As I added more enforcement to Tandem's research workflow, a second problem emerged: the repair runtime is becoming genuinely generic, but the enforcement logic is not.
Things like needs_repair state, retry metadata, repair guidance format, context-run task projection, and API repair summaries are all reusable across workflow types. But the actual behavioral rules (must use read before writing, must include citations, must use websearch) are still embedded directly in the engine as research-specific knowledge. New workflow types in Tandem do not automatically get the same strong runtime behavior unless they happen to align with the engine's built-in validator patterns.
The next architectural step is moving workflow-specific success and repair rules out of ad hoc engine code and into declarative node contracts, where each node declares its required tool classes, evidence classes, retryable failure classes, and pre-finalization gates, and Tandem enforces those generically. I built repair visibility faster than I built workflow semantics. That is the gap that needs to close.
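A declarative contract of that kind could look something like the sketch below. Every name here is an assumption about where the design could go, not Tandem's current API; the point is that the enforcement loop knows nothing about research specifically.

```rust
// Hypothetical declarative node contract: the node declares what success
// requires, and a generic enforcer checks it. All names are assumptions.
struct NodeContract {
    required_tool_classes: Vec<&'static str>,
    required_evidence_classes: Vec<&'static str>,
    retryable_failure_classes: Vec<&'static str>,
}

// Generic enforcement: collect every unmet requirement instead of
// hard-coding research-specific rules in the engine.
fn unmet_requirements(
    contract: &NodeContract,
    tools_used: &[&str],
    evidence_gathered: &[&str],
) -> Vec<String> {
    let mut unmet = Vec::new();
    for t in &contract.required_tool_classes {
        if !tools_used.contains(t) {
            unmet.push(format!("missing_tool:{t}"));
        }
    }
    for e in &contract.required_evidence_classes {
        if !evidence_gathered.contains(e) {
            unmet.push(format!("missing_evidence:{e}"));
        }
    }
    unmet
}
```

Under a contract like this, a new workflow type gets the same repair behavior as research simply by declaring its requirements, instead of waiting for matching validator logic to be written into the engine.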
The bigger lesson
The deeper I get into building Tandem, the less I think the future of autonomous systems is about smarter prompting. It is about building runtimes that can make model behavior usable: explicit per-node state, controlled execution with required evidence gates, validations with classified outcomes rather than just pass/fail, and retries with structured repair context rather than reruns.
Better models reduce friction. But better models do not remove the need for structure. If anything, stronger models make it more tempting to trust output that still needs to be verified. A confident model producing a well-formatted blocked artifact still failed the mission. Tandem has to know that.
Where this leads
I still want the same end state I started with: an engine that can take on long-running work, manage its own task list, recover from failure, and finish what it starts. That is what Tandem is being built toward.
But I no longer think that comes from giving the model enough instructions and hoping it behaves. It comes from building the surrounding runtime carefully enough that the model can only succeed inside a system that knows what success actually means, and that refuses to accept convincing-looking failure as a terminal result.
That is a very different mindset from most of the agent hype. And I think it is the only one that will hold up when these systems move from demos into real work.
Closing
The hardest part of autonomous AI is not getting the model to sound intelligent. The hardest part is building a runtime that can keep a non-deterministic model inside reliable execution boundaries and tell the difference between a model that genuinely could not complete the work and a model that simply chose not to try.
That distinction is the whole game. And the more time I spend on it, the more convinced I am that the future of agent systems belongs to teams that treat autonomous execution as a systems problem, not a prompting problem.
I fell into most of the pitfalls described here before I understood what was actually happening. If this saves someone else from the same detours, that matters as much to me as shipping the engine itself.
If you want to follow along as I build Tandem into a genuinely autonomous execution engine, the project is open source.