Leena Malhotra

I Let an AI Agent Handle a Multi-Step Task. Here's Where It Broke

I gave an AI agent what seemed like a straightforward task: analyze our API usage logs, identify rate limit violations, generate a report summarizing the patterns, and draft an email to affected users explaining the changes we'd need to make.

Four steps. Clear inputs. Well-defined outputs. The kind of task that should be perfect for AI automation.

Three hours later, I had a partially complete report, two abandoned attempts, and a much clearer understanding of where current AI agents actually break down on complex workflows.

The problem wasn't the AI's intelligence. The problem was that multi-step tasks aren't actually linear—and AI agents don't handle branching, backtracking, and context preservation the way humans naturally do.

What I Expected vs. What Happened

My mental model of the task was simple:

  1. Parse logs →
  2. Analyze patterns →
  3. Generate report →
  4. Draft email

Linear. Sequential. Each step feeds the next. This is how we decompose complex tasks in our heads, and it's how most AI agent frameworks are designed.
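
In code, that mental model is just a chain of calls, each one trusting whatever the previous one produced. A minimal sketch in Python, with a hypothetical `call_model` helper standing in for whatever LLM API you actually use:

```python
def call_model(prompt: str) -> str:
    # Hypothetical stand-in for your LLM client; canned reply so the sketch runs.
    return f"[model output for: {prompt.splitlines()[0]}]"

def run_pipeline(raw_logs: str) -> str:
    """Naive linear pipeline: each step consumes the previous step's output as-is."""
    parsed = call_model(f"Parse these API logs and extract rate limit events:\n{raw_logs}")
    patterns = call_model(f"Analyze the violation patterns in:\n{parsed}")
    report = call_model(f"Summarize the patterns as a report:\n{patterns}")
    email = call_model(f"Draft an email to affected users based on:\n{report}")
    return email

print(run_pipeline("429 user=alice endpoint=/v1/search"))
```

Every step blindly consumes the one before it, which is exactly where the trouble starts.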

But here's what actually happened:

Step 1 started fine. The agent parsed the logs correctly, identified rate limit events, and extracted relevant fields. No issues.

Step 2 revealed ambiguity. When analyzing patterns, the agent found both IP-based violations and user-based violations. It needed to decide: should the report focus on users or IPs? I never specified. A human would ask. The agent guessed—and guessed wrong. It optimized for IP analysis when I needed user-level insights.

Step 3 hit a dead end. Because Step 2 went the wrong direction, the report it generated was technically correct but strategically useless. I needed to backtrack, reframe the analysis around users, and regenerate the report. But the agent had already moved forward. Its context was locked into the IP-based analysis path.

Step 4 never happened. By the time I manually corrected Step 2 and regenerated Step 3, the agent had lost enough context that the email draft was generic and disconnected from the specific patterns we'd found.

The task wasn't impossible. The AI had all the capabilities needed. But the workflow broke because multi-step tasks require human-like judgment about when to branch, when to backtrack, and when to ask for clarification—and current AI agents don't handle these meta-decisions well.

Where AI Agents Actually Break

After running dozens of similar experiments, I've identified the specific failure modes that plague multi-step AI workflows:

Ambiguity resolution fails silently. When an AI agent encounters ambiguity—multiple valid interpretations of a step—it doesn't stop to ask. It picks one interpretation (often the first or most common) and continues. By the time you realize it went the wrong direction, it's three steps downstream and the context is contaminated.

Context windows create artificial boundaries. Long multi-step workflows exceed token limits. The agent starts forgetting early decisions. Step 7 might contradict Step 2 because the original context has been truncated. Humans maintain a mental model of the entire task; agents lose coherence as context windows fill.

Error recovery is non-existent. When a step fails or produces unexpected output, agents don't backtrack gracefully. They either halt completely or continue with corrupted state. Humans naturally say "that didn't work, let me try a different approach." Agents lack this adaptive error handling.

Branching logic is implicit, not explicit. Multi-step tasks rarely follow a single path. Step 3 might need to be different based on what Step 2 discovered. Humans handle this branching naturally. Agents need it explicitly programmed—and if you could program it explicitly, you wouldn't need an AI agent.

Progress isn't resumable. If a multi-step workflow breaks at Step 5 of 8, you can't just fix Step 5 and continue. The entire context is lost. You're starting from scratch. There's no equivalent of "pick up where we left off" in most agent frameworks.

The Task Types That Work vs. Those That Don't

Through experimentation, I've developed a rough taxonomy of what current AI agents handle well versus where they consistently break:

Tasks that work:

  • Linear workflows with no branching ("parse this, transform that, output result")
  • Single-domain operations where context is narrow and well-defined
  • Tasks where every step has exactly one correct next action
  • Workflows where intermediate failures are obvious and early
  • Short sequences (3-4 steps max) that fit comfortably in context windows

Tasks that break:

  • Workflows requiring judgment calls about which path to take
  • Multi-domain tasks where context shifts between steps
  • Long sequences (7+ steps) that stress context window limits
  • Tasks where early errors don't become obvious until later steps
  • Workflows requiring backtracking or iterative refinement

The pattern is clear: AI agents excel at execution within constraints but fail at the meta-level decisions about how to adapt the workflow as information emerges.

What Actually Makes Multi-Step Workflows Complex

The real complexity in multi-step tasks isn't the individual steps—it's the coordination layer between them.

State management across steps. Each step produces outputs that become inputs for later steps. But those outputs might be partial, ambiguous, or reveal information that changes how later steps should execute. Managing this evolving state is trivial for humans, difficult for agents.

Adaptive planning. The plan you make at Step 1 often needs revision by Step 4 based on what you've learned. "I thought we'd need a detailed analysis, but the pattern is simpler than expected—skip the detailed analysis and jump to recommendations." Agents don't replan well.

Error detection at a distance. Sometimes you don't realize a step failed until three steps later when the output doesn't make sense. Humans backtrack naturally. Agents need explicit error handling at every step—and we're not good at predicting every way a step might fail.

Context compression. Humans naturally compress context as tasks progress, retaining what's relevant and discarding what's not. "We found 12 user segments but only 3 matter for this analysis—focus on those." Agents keep everything until context windows overflow.

The Workflow Patterns That Help

After many broken attempts, I've found patterns that make multi-step AI tasks more reliable:

Explicit checkpoints with human validation. Break workflows into phases with mandatory human review between them. Don't let the agent continue to Step 4 until you've verified Steps 1-3 produced what you needed. This catches wrong paths early.
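
A rough sketch of that checkpoint pattern, again with a hypothetical `call_model` stub and made-up file names; the only real idea is that nothing advances until a human signs off on the intermediate output:

```python
def call_model(prompt: str) -> str:
    # Hypothetical stand-in for your LLM client; canned reply so the sketch runs.
    return f"[model output for: {prompt.splitlines()[0]}]"

def human_checkpoint(label: str, output: str) -> str:
    """Show an intermediate output and block until a human approves it."""
    print(f"=== {label} ===\n{output}\n")
    if input("Approve and continue? [y/N] ").strip().lower() != "y":
        raise RuntimeError(f"Checkpoint '{label}' rejected; fix this step before moving on")
    return output

analysis = human_checkpoint("analysis", call_model("Analyze rate limit violations by user in logs.csv"))
report = human_checkpoint("report", call_model(f"Write a summary report from:\n{analysis}"))
```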

Stateless steps where possible. Design each step to be independently executable with explicit inputs. Instead of "analyze the data from the previous step," provide explicit data references: "analyze the data in users.csv, focusing on columns X, Y, Z." This makes steps resumable and debuggable.
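
One way to make a step stateless is to give it an explicit signature over named files rather than conversational context. A sketch, with hypothetical file and column names and the same `call_model` stand-in:

```python
from pathlib import Path

def call_model(prompt: str) -> str:
    # Hypothetical stand-in for your LLM client; canned reply so the sketch runs.
    return '{"violations": []}'

def analyze_rate_limits(input_csv: str, output_json: str, columns: list[str]) -> None:
    """Stateless step: every input it needs is named explicitly in its arguments."""
    data = Path(input_csv).read_text()
    result = call_model(
        f"Analyze this data, focusing on columns {', '.join(columns)}, "
        f"and return user-level violation patterns as JSON:\n{data}"
    )
    Path(output_json).write_text(result)

# Re-runnable and debuggable in isolation, with inspectable inputs and outputs:
# analyze_rate_limits("users.csv", "analyzed_patterns.json", ["user_id", "endpoint", "violations"])
```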

Narrow task scope ruthlessly. Instead of "analyze logs and generate report," break it into "analyze logs for pattern A" and separately "generate report from analyzed_patterns.json." Smaller, focused tasks reduce branching possibilities and context requirements.

Use tools for state persistence. Don't rely on the agent's context window to carry state. Use explicit artifacts—files, databases, structured outputs—that persist between steps and can be inspected/modified independently.
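
For example, a tiny JSON artifact can carry workflow state between steps instead of the context window. A sketch, with a hypothetical `workflow_state.json` file:

```python
import json
from pathlib import Path

STATE_FILE = Path("workflow_state.json")  # hypothetical artifact name

def save_step(step: str, output: dict) -> None:
    """Persist a step's output outside the agent's context window."""
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    state[step] = output
    STATE_FILE.write_text(json.dumps(state, indent=2))

def load_step(step: str) -> dict:
    """Later steps read explicit artifacts instead of trusting remembered context."""
    return json.loads(STATE_FILE.read_text())[step]

save_step("analysis", {"scope": "user-level", "segments": 3})
print(load_step("analysis"))
```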

Build error detection into workflow design. Include explicit validation steps: a check like "verify that the analysis contains user-level data, not IP-level data" catches wrong paths before they propagate. Make validation explicit, not implicit.
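
A validation step can be a few lines of ordinary code run between agent steps. A sketch, assuming the analysis artifact is the hypothetical `analyzed_patterns.json` from earlier:

```python
import json

def validate_analysis(path: str) -> None:
    """Fail fast if the analysis drifted to IP-level instead of user-level data."""
    with open(path) as f:
        records = json.load(f).get("records", [])
    if not records or "user_id" not in records[0]:
        raise ValueError(f"{path} is not user-level data; rerun the analysis step")

# Run between steps, before the report generator ever sees the data:
# validate_analysis("analyzed_patterns.json")
```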

The Tool Limitations That Matter

The breakdown points aren't just about AI capabilities—they're about tooling gaps:

No standard for workflow state. There's no equivalent of "save game" for AI agent workflows. If something breaks, you're restarting from scratch. We need standardized ways to snapshot workflow state and resume from checkpoints.

Context window constraints are hard limits. Token limits create artificial boundaries in multi-step workflows. When context exceeds limits, coherence collapses. Until context windows are effectively unlimited or tools handle context compression intelligently, long workflows will break.

Error modes are opaque. When an agent makes a wrong decision at Step 3, it's not obvious what went wrong until Step 6 fails. Better observability into agent decision-making would help catch errors earlier.

Branching isn't first-class. Most agent frameworks assume linear workflows. When tasks require conditional branching ("if pattern A, do X; if pattern B, do Y"), you're fighting the framework. Branching should be a native concept, not a hack.
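
Until frameworks treat branching natively, the pragmatic workaround is to keep it in plain code you own. A minimal sketch:

```python
def next_step(analysis: dict) -> str:
    """Branching as plain, inspectable code instead of an implicit agent decision."""
    if analysis.get("pattern") == "user_based":
        return "summarize_by_user"
    if analysis.get("pattern") == "ip_based":
        return "summarize_by_ip"
    return "ask_human"  # unknown pattern: stop and escalate rather than guess

print(next_step({"pattern": "user_based"}))  # -> summarize_by_user
print(next_step({"pattern": "mixed"}))       # -> ask_human
```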

Tools like Crompt AI help by providing access to multiple models that can cross-check each other's work—using Claude Sonnet 4.5 for analysis and GPT-5 for validation catches errors that single-model workflows miss. The Content Writer and Email Assistant can be used in sequence for multi-step content creation with explicit handoffs between stages.

But fundamentally, we're still missing workflow orchestration tools that handle the messy reality of multi-step tasks: branching, backtracking, context management, and error recovery.

What Works Today: The Pragmatic Approach

Given current limitations, here's what actually works for multi-step AI workflows:

Use AI agents for sub-tasks, not full workflows. Let AI handle "analyze this dataset" or "generate this report" as isolated tasks. You orchestrate the workflow, handling branching and error recovery yourself. Think of AI agents as tools you compose, not autonomous workers you delegate to.
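
In practice that looks like small functions wrapping single model calls, composed by ordinary control flow you write yourself. A sketch, with a canned `call_model` stub in place of a real client:

```python
def call_model(prompt: str) -> str:
    # Hypothetical stand-in for your LLM client; canned replies so the sketch runs.
    if prompt.startswith("Identify"):
        return "user_id=42 exceeded /v1/search limits 310 times this week"
    return "Subject: Upcoming changes to your API rate limits..."

def analyze_logs(logs: str) -> str:
    return call_model(f"Identify user-level rate limit violations in:\n{logs}")

def draft_email(analysis: str) -> str:
    return call_model(f"Draft an email to affected users based on:\n{analysis}")

def workflow(logs: str) -> str:
    """Human-written orchestration: the code owns branching and error handling,
    and the model only ever sees one narrow sub-task at a time."""
    analysis = analyze_logs(logs)
    if "user_id" not in analysis:  # cheap sanity check before moving on
        raise RuntimeError("Analysis is not user-level; fix it before drafting the email")
    return draft_email(analysis)

print(workflow("...raw API logs..."))
```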

Build in manual checkpoints. After each significant step, review outputs before proceeding. This catches wrong paths early and keeps context manageable. The overhead is worth it to avoid three-hour dead ends.

Make intermediate artifacts explicit. Force each step to produce a file or structured output. This makes workflow state inspectable and allows resuming from any point. It also helps AI agents by making inputs/outputs explicit rather than context-dependent.

Use multiple models strategically. Different models have different strengths. Claude Opus 4.1 might be better for complex analysis steps while GPT-5 mini is faster for simple transformations. Matching model to task type reduces failure rates.

Keep tasks scoped to 3-5 steps max. Longer workflows have exponentially higher failure rates. If you need more steps, break the workflow into multiple sessions with explicit handoffs.

The Future I'm Waiting For

The breakthroughs we need aren't about smarter models—they're about better workflow tooling:

Workflow state that persists and resumes. Like version control for workflows. Every step creates a checkpoint. If something breaks, resume from the last good state.
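
As a rough approximation of what that could look like, here's a sketch of a checkpointing runner that skips steps already completed in a previous attempt (the `checkpoints.json` file name and the stand-in steps are made up):

```python
import json
from pathlib import Path
from typing import Callable

CHECKPOINTS = Path("checkpoints.json")  # hypothetical checkpoint store

def run_with_checkpoints(steps: dict[str, Callable[[str], str]], initial: str) -> str:
    """Run named steps in order, saving each output. On a rerun, completed
    steps are skipped and the workflow resumes from the last good state."""
    done = json.loads(CHECKPOINTS.read_text()) if CHECKPOINTS.exists() else {}
    data = initial
    for name, step in steps.items():
        if name in done:
            data = done[name]  # already ran in a previous attempt: reuse it
            continue
        data = step(data)
        done[name] = data
        CHECKPOINTS.write_text(json.dumps(done, indent=2))
    return data

# Stand-in steps keep the sketch runnable; real steps would call a model.
print(run_with_checkpoints({"parse": str.strip, "analyze": str.upper}, "  raw logs  "))
```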

Native branching and backtracking. Tools that treat "if-then" logic and "that didn't work, try this instead" as first-class workflow primitives.

Context compression that preserves semantic meaning. As workflows get long, automatically compress earlier context while retaining information that matters for later steps.

Explicit uncertainty handling. When an agent is uncertain ("should I analyze by user or by IP?"), it should surface that uncertainty and ask, not guess.
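
You can approximate this today at the prompt level by asking the model to return either a result or a clarification question in structured form, and pausing when it asks. A sketch, with a canned `call_model` stub standing in for a real client:

```python
import json

def call_model(prompt: str) -> str:
    # Hypothetical stand-in for your LLM client. Canned reply so the sketch runs;
    # a real client would sometimes return {"result": ...} instead.
    return '{"question": "Should I analyze violations by user or by IP?"}'

def run_step(prompt: str):
    """Ask the model to surface ambiguity as a question rather than guessing."""
    instructions = (
        '\n\nIf anything is ambiguous, reply with {"question": "..."}; '
        'otherwise reply with {"result": ...}.'
    )
    reply = json.loads(call_model(prompt + instructions))
    if "question" in reply:
        answer = input(f"Agent needs clarification: {reply['question']}\n> ")
        reply = json.loads(call_model(prompt + f"\nClarification: {answer}" + instructions))
    return reply.get("result", reply)

print(run_step("Analyze the rate limit violations in these logs: ..."))
```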

Observable decision-making. I want to see why an agent chose path A over path B at each decision point. Transparency in agent reasoning makes debugging possible.

Some of these capabilities are emerging. Multi-model platforms like Crompt enable comparing different models' approaches to the same task, which catches errors through cross-validation.

But we're still years away from AI agents that handle complex, branching, multi-step workflows with the adaptability humans bring naturally.

The Honest Assessment

Current AI agents are incredibly useful for well-defined, narrow tasks. They're productivity multipliers when you know exactly what you want and can specify it precisely.

But for complex workflows that involve judgment, branching, error recovery, and context management across many steps? We're not there yet. The intelligence is sufficient. The workflow tooling isn't.

The failure mode isn't that AI is too dumb. It's that multi-step tasks are harder than they appear, and our tools for orchestrating AI work don't match the complexity of real-world workflows.

Understanding where agents break isn't pessimism—it's pragmatism. It helps you deploy AI effectively for what works today while avoiding the three-hour dead ends that come from expecting more than current tools can deliver.

The future is agents that handle complex workflows autonomously. The present is agents that excel at well-defined sub-tasks within human-orchestrated workflows.

Know the difference, and you'll get far more value from AI than those who expect magic and get frustrated when reality doesn't cooperate.

- Leena :)
