Jay
What Is Tool Chaining in LLMs? Why It Breaks and How to Think About Orchestration

Your agent chains three tool calls together. The first returns slightly malformed output. The second accepts it but misinterprets a field. By the third call, the entire chain has gone off the rails. No error was thrown. Your logs look clean. The user got confidently wrong answers.
If you've built anything with LLM agents beyond a demo, you've hit this. It's called the cascading failure problem, and research from Zhu et al. (2025) confirms it: errors from early steps propagating into later failures are the single biggest barrier to building dependable LLM agents.
I've spent a lot of time debugging these kinds of failures, and I want to break down why tool chaining is so fragile, what the actual failure modes look like, and what patterns hold up in production.

Tool Chaining, Quickly Defined

Tool chaining is when an LLM agent executes multiple tool calls in sequence, where each tool's output becomes input for the next. The agent gets a user query, calls an API, processes the result with a second tool, and builds a final response from the combined output.
A single tool call is simple. Chaining is where dependencies show up. The agent has to figure out execution order, track intermediate state, and handle partial failures while staying on task.
In multi-agent systems, this gets worse. One agent calls a tool, hands the result to a second agent, which runs its own tool chain before returning. The orchestration overhead stacks fast, and so do the failure points.
Here's a concrete example: a user asks an agent to pull earnings data, compare it against competitors, and generate a summary. The first call returns revenue in the wrong currency. The comparison runs fine but produces misleading figures. The summary confidently presents wrong data. Nothing errored out. That's the core danger when you chain tools without validation and observability.
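A stripped-down version of that chain, with hypothetical stand-in tools (`get_earnings`, `compare`, `summarize`) and canned data instead of real API calls, looks like this:

```python
# Minimal tool chain: each call's output feeds the next call's input.
# get_earnings, compare, and summarize are hypothetical stand-ins for
# real tool calls -- canned data replaces the actual API.

FAKE_EARNINGS = {
    "AAPL": {"revenue": 394.3, "currency": "USD"},
    "MSFT": {"revenue": 211.9, "currency": "USD"},
}

def get_earnings(ticker):
    return {"ticker": ticker, **FAKE_EARNINGS[ticker]}

def compare(a, b):
    # Silently assumes both figures are in the same currency -- exactly
    # the kind of unchecked assumption that lets a chain go off the
    # rails without ever raising an error.
    return a["revenue"] - b["revenue"]

def summarize(delta):
    return f"Revenue lead over competitor: {delta:.1f}B"

result = summarize(compare(get_earnings("AAPL"), get_earnings("MSFT")))
```

Note that nothing in this chain checks the `currency` field: if one tool returned EUR, the comparison would still run and the summary would still read confidently.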

Why Tool Chains Break in Production

Context Gets Lost Across Calls

LLMs work within a finite context window. Every tool call adds tokens: function parameters, response payloads, reasoning traces. In long chains, critical context from early steps gets pushed out of the window or buried under intermediate results.
This isn't theoretical. Research shows LLMs lose performance on information buried in the middle of long contexts, even with large windows. When your agent forgets a user constraint from step 1 by the time it hits step 5, the output might be structurally valid but factually wrong. The user asked for revenue in USD, but that detail got lost three calls ago.

What actually helps:

  • Pass structured state objects between calls, not raw text. Keeps payloads compact and parseable.
  • Summarize intermediate results before forwarding. Strip metadata the next tool doesn't need.
  • Use frameworks with explicit state management. LangGraph, for example, provides durable state across graph nodes so context stays inspectable and doesn't just float in the prompt.
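The first bullet can be sketched with a typed state object instead of raw text between steps. This is a minimal illustration (the field names are made up for the example), not LangGraph's actual API:

```python
from typing import Optional, TypedDict

class ChainState(TypedDict):
    # User constraints captured once, up front, so they can't get
    # "lost three calls ago" -- every step sees the same state object.
    ticker: str
    target_currency: str
    # Intermediate results: compact, parseable fields rather than the
    # raw text of each tool response.
    revenue: Optional[float]
    summary: Optional[str]

def fetch_step(state: ChainState) -> ChainState:
    # Hypothetical fetch; writes only the field downstream steps need.
    return {**state, "revenue": 394.3}

def summarize_step(state: ChainState) -> ChainState:
    text = f"{state['ticker']} revenue: {state['revenue']} {state['target_currency']}"
    return {**state, "summary": text}

state: ChainState = {"ticker": "AAPL", "target_currency": "USD",
                     "revenue": None, "summary": None}
state = summarize_step(fetch_step(state))
```

Because the currency constraint lives in the state object rather than in the prompt, step 5 sees it just as clearly as step 1 did.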

Cascading Failures Compound Silently

This is the biggest production risk. When one tool returns bad or partial data, the error flows downstream and compounds at every step. Unlike traditional software where bad data throws exceptions, LLM tool chains fail silently because the agent treats garbage output as valid input and keeps going.
A 2025 study on OpenReview that analyzed failed agent trajectories found error propagation was the most common failure pattern. Memory and reflection errors were the most frequent sources of cascades. Once they start, they're extremely hard to reverse mid-chain.
In multi-agent setups, it's amplified further. The Gradient Institute found that transitive trust chains between agents mean a single wrong output propagates through the entire system without verification. OWASP ASI08 specifically flags cascading failures as a top security risk in agentic AI.

Context Window Saturation

Every tool call eats tokens. A chain of five calls can burn through 40-60% of your available context before the agent even starts generating its final response. Even with models offering massive token limits, the "lost in the middle" problem means the agent's attention degrades on information that isn't near the beginning or end of the context.

Picking a Framework for Multi-Tool Orchestration

The framework you choose shapes how much of this you have to handle yourself. Here's how the main options compare for production use in 2026:
LangGraph is my go-to for anything stateful or branching. It models tool chains as explicit state machines: every node is a tool call or decision point, and edges define the transitions. You can plug in retry logic, fallback paths, and human-in-the-loop checkpoints at specific stages. Its durable execution feature means that if a chain breaks at step 4 of 7, you resume from step 4 instead of restarting. It also offers deep tracing through LangSmith, including state transition capture.
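The durable-execution idea is easy to sketch in plain Python. This is the concept, not LangGraph's actual API: persist a checkpoint after each step so a crashed chain restarts where it stopped, not from step 1.

```python
def run_chain(steps, state, checkpoint):
    """Run `steps` in order, persisting progress into `checkpoint`
    (a dict standing in for a real durable store) after each step."""
    start = checkpoint.get("next_step", 0)
    state = checkpoint.get("state", state)
    for i in range(start, len(steps)):
        state = steps[i](state)
        # Persist progress after every successful step, so a crash
        # here resumes at step i + 1 rather than step 0.
        checkpoint["next_step"] = i + 1
        checkpoint["state"] = state
    return state
```

Calling `run_chain` again with the same checkpoint dict skips every step that already completed, which is the property that makes long chains tolerable in production.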
LangChain is still the fastest way to get started. Its LCEL pipe syntax makes linear tool chains quick to compose. But for production workloads with branching or parallel calls, most teams I've seen migrate to LangGraph for finer control.
AutoGen works well for multi-agent conversation patterns. It uses message-passing with built-in function call semantics. Observability is moderate and usually needs external tooling for production-grade traces.
CrewAI takes a role-based approach to multi-agent task execution. Tool assignment happens per role, which is intuitive but can mean longer deliberation before tool calls. Basic logging out of the box.

Tracing and Observability Are Not Optional

You can't fix what you can't see. Tool chain failures are often silent, so a chain that returns wrong answers without errors looks perfectly healthy in your logs unless you have distributed tracing on every step.
What to capture in every tool chain execution:

  • Input and output of each tool call. Exact parameters and full responses so you can replay failures.
  • Latency per step. A slow tool can cascade into downstream timeouts.
  • Token consumption. Track context window usage to spot saturation before it degrades output quality.
  • Agent reasoning between calls. Chain-of-thought capture helps you find logic errors that data alone won't reveal.
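A thin wrapper can capture most of these signals without a full observability platform. This is a sketch; the ~4-characters-per-token estimate is a rough heuristic, not a real tokenizer:

```python
import json
import time

def traced(name, tool, trace_log):
    """Wrap a tool call so every execution records input, output,
    latency, and an approximate token count into trace_log."""
    def wrapper(payload):
        t0 = time.perf_counter()
        result = tool(payload)
        trace_log.append({
            "step": name,
            "input": payload,           # exact parameters, for replay
            "output": result,           # full response, for replay
            "latency_ms": round((time.perf_counter() - t0) * 1000, 2),
            # Rough proxy: ~4 characters per token. Swap in the model's
            # tokenizer for real context budgeting.
            "approx_tokens": len(json.dumps(result, default=str)) // 4,
        })
        return result
    return wrapper
```

Wrap each step before chaining, and replaying a silent failure becomes a matter of reading `trace_log` rather than guessing.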

Tools like LangSmith, Langfuse, and Future AGI provide native tracing for LangGraph and LangChain workflows. Future AGI's traceAI SDK integrates with OpenTelemetry and includes built-in evaluation metrics for completeness, groundedness, and function calling accuracy.

Evaluating Tool Chains Beyond "Did It Work?"

Tracing tells you what happened. Evaluation tells you whether it was correct. For tool chains, you need to cover multiple dimensions:

  • Tool selection accuracy: Did the agent pick the right tool at each step?
  • Parameter correctness: Were the arguments valid and complete?
  • Chain completion rate: What percentage of multi-step chains finish without errors, fallbacks, or manual correction?
  • Output faithfulness: Does the final response reflect the tool data accurately without hallucinations?
  • Error recovery rate: When a tool returns an error, how often does the agent actually recover?

Running these at scale means automation. Platforms like Future AGI attach evaluation metrics directly to traces, scoring every execution and creating a continuous feedback loop. The point is to make evaluation a part of the pipeline, not something you run manually after incidents.
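Given trace records, the chain-level metrics above reduce to simple aggregations. A sketch, assuming each run record carries the flags shown (the field names are illustrative, not any platform's schema):

```python
def chain_metrics(runs):
    """Compute completion and recovery rates over a list of run records.
    Each record is a dict like:
      {"completed": bool, "used_fallback": bool,
       "tool_errors": int, "recovered_errors": int}"""
    completed_clean = sum(
        1 for r in runs if r["completed"] and not r["used_fallback"]
    )
    total_errors = sum(r["tool_errors"] for r in runs)
    recovered = sum(r["recovered_errors"] for r in runs)
    return {
        # Chains that finished with no fallbacks or manual correction.
        "chain_completion_rate": completed_clean / len(runs),
        # Of the tool errors that occurred, how many the agent recovered from.
        "error_recovery_rate": recovered / total_errors if total_errors else 1.0,
    }
```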

Patterns That Hold Up in Production

These are the patterns I've seen consistently improve reliability across real deployments:

  1. Validate at every boundary. Put input and output validation between every tool call using Pydantic or JSON Schema. Don't rely on the LLM to notice malformed data. Explicit validation catches errors at the source before they propagate.
  2. Plan first, execute second. Research from Scale AI shows that having the LLM formulate a structured plan (as JSON or code) before executing it through a deterministic executor reduces tool chaining errors significantly. Separating reasoning from execution is a big win.
  3. Implement circuit breakers. If a tool fails or returns unexpected results more than N times, break the circuit and return a graceful failure. Don't let one broken tool take down the entire workflow.
  4. Keep chains short. Longer chains mean more failure surface and more context consumption. If you need more than 5-6 sequential calls, restructure into sub-chains or parallel branches.
  5. Test with adversarial inputs. Your happy-path tests will pass. Production traffic won't be happy-path. Test with empty tool responses, oversized payloads, unexpected types, and ambiguous queries.
  6. Trace everything from day one. Instrument your tool chains with distributed tracing on the first deployment. When something breaks in production, traces are the difference between hours of debugging and a 10-minute fix.
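Pattern 1 in practice: a minimal sketch using Pydantic to validate a tool's output at the chain boundary before the next step consumes it. The schema fields here are illustrative, matching the earnings example from earlier:

```python
from pydantic import BaseModel, ValidationError

class EarningsResult(BaseModel):
    # The contract the next tool in the chain depends on.
    ticker: str
    revenue: float
    currency: str

def validate_boundary(raw: dict) -> EarningsResult:
    """Validate a tool's output at the chain boundary. Raise loudly
    instead of letting the LLM 'interpret' malformed data downstream."""
    try:
        return EarningsResult(**raw)
    except ValidationError as e:
        raise ValueError(f"tool output failed schema check: {e}") from e
```

A missing or mistyped field now fails at the boundary where it happened, not three tools later in a confidently wrong summary.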

FAQ

Why don't LLM tool chain errors throw exceptions like normal software?

Because the LLM treats tool outputs as text, not typed data. If a tool returns malformed JSON or wrong values, the model doesn't crash. It interprets whatever it got and keeps going. That's why schema validation between every step matters so much. The LLM won't catch bad data for you.

Is a longer context window the fix for context loss in tool chains?

Not really. Even with million-token windows, research shows LLMs lose attention on information in the middle of the context. A bigger window gives you more room, but it doesn't solve the core problem. Structured state management and summarization between steps are more reliable than just hoping the model remembers everything.

When should I use LangGraph over LangChain for tool chaining?

If your chain is linear and simple, LangChain's LCEL syntax is faster to set up. Once you need conditional branching, retries at specific steps, or durable execution (resume from failure point), LangGraph gives you that control. Most teams I've talked to start with LangChain and move to LangGraph when their chains get complex enough to need explicit state machines.

How do I know if my tool chain is consuming too much of the context window?

Trace your token usage per step. If your chain of tool calls is eating 40-60% of available tokens before the agent generates its final response, you're in the danger zone. Summarize intermediate outputs aggressively and strip metadata the downstream tools don't need.
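A crude in-flight check for that danger zone, assuming the common ~4-characters-per-token heuristic (use the model's real tokenizer for anything precise):

```python
def context_usage(messages, context_limit=128_000):
    """Estimate what fraction of the context window the accumulated
    messages consume. ~4 chars/token is a rough English-text heuristic."""
    approx_tokens = sum(len(m) // 4 for m in messages)
    return approx_tokens / context_limit

def in_danger_zone(messages, context_limit=128_000, threshold=0.5):
    # Flag the chain when tool payloads push usage past the threshold.
    return context_usage(messages, context_limit) >= threshold
```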

What's the simplest thing I can do today to make my tool chains more reliable?

Add Pydantic or JSON Schema validation on the output of every single tool call. It takes maybe 30 minutes to set up and catches the majority of silent data corruption issues before they cascade. It's the highest-leverage change you can make.
