klement Gunndu
Your First AI Agent Will Fail. Here's How to Debug It.

Your AI agent worked perfectly in testing. Then it hit production and called the wrong tool 14 times in a loop, burned $40 of API credits, and returned gibberish to your user.

This is not a rare scenario. It's the default scenario.

The reason most developers don't catch this early is simple: they have no visibility into what the agent is actually doing. LLM calls look like black boxes. Tool invocations are invisible. When something goes wrong, you're left reading the final output and guessing backward.

This guide gives you four concrete debugging patterns—from zero-setup verbose mode to production-grade tracing with LangSmith. Each one works. Start with the first. Graduate to the fourth when you need it.


Why AI Agents Fail Differently Than Regular Code

Before the debugging patterns, understand what makes agents hard to debug.

In regular code, failures are deterministic: the same input produces the same bug. In AI agents, failures are probabilistic: the same input might work 9 times and fail on the 10th, and the failure might look completely different each time.

The four failure modes that catch first-time agent builders:

1. Tool call loops — The agent decides it needs more information, calls a tool, doesn't trust the result, calls it again, and repeats until you hit a max_iterations limit (or don't, and burn unlimited credits).

2. Hallucinated tool arguments — The agent calls a real tool with fabricated parameters. Your search_database(user_id=99999) runs successfully but returns nothing, because user 99999 doesn't exist. The agent then reasons from an empty result.

3. Context window overflow — Your agent passes all prior conversation history on every step. After 10 tool calls, the prompt is 80K tokens. The LLM starts forgetting earlier instructions.

4. Silent reasoning failures — The agent's chain-of-thought goes wrong, but the final answer looks plausible. No error is raised. The user gets a confident, wrong answer.

All four are invisible without observability. Here's how to see them.
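To make failure mode #1 concrete before we get to the tooling: once you have any log of tool calls (which the patterns below give you), loop detection is a few lines of plain Python. This is a hypothetical sketch — the `(tool_name, tool_input)` pair format and the threshold of 3 consecutive identical calls are assumptions, not anything LangChain provides.

```python
# Hypothetical sketch: detect a tool call loop from a logged list of
# (tool_name, tool_input) pairs. Threshold of 3 is an arbitrary choice.

def detect_tool_loop(calls, threshold=3):
    """Return True if the same tool was called with the same input
    `threshold` or more times in a row."""
    streak = 1
    for prev, curr in zip(calls, calls[1:]):
        streak = streak + 1 if curr == prev else 1
        if streak >= threshold:
            return True
    return False

# A run that loops: the agent retries the same search three times
looping_run = [
    ("search_docs", "langchain overview"),
    ("search_docs", "langchain overview"),
    ("search_docs", "langchain overview"),
]
print(detect_tool_loop(looping_run))  # True
```

Run this check after every agent invocation and you can alert on loops instead of discovering them on your API bill.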


Pattern 1: Verbose Mode (30 Seconds to Set Up)

The fastest way to see inside your agent. Zero additional packages required.

from langchain.agents import AgentExecutor, create_react_agent
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain_core.tools import tool

@tool
def search_docs(query: str) -> str:
    """Search the documentation for an answer."""
    return f"Result for: {query}"

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

prompt = PromptTemplate.from_template(
    "Answer the question using the tools available.\n"
    "Tools: {tools}\nTool names: {tool_names}\n"
    "Question: {input}\n{agent_scratchpad}"
)

agent = create_react_agent(llm=llm, tools=[search_docs], prompt=prompt)

# verbose=True prints every thought, tool call, and observation
agent_executor = AgentExecutor(
    agent=agent,
    tools=[search_docs],
    verbose=True,          # <-- this is all you need
    max_iterations=5,      # always bound your agents
    handle_parsing_errors=True
)

result = agent_executor.invoke({"input": "What is LangChain?"})

With verbose=True, you'll see the agent's internal monologue in your terminal:

> Entering new AgentExecutor chain...
Thought: I need to search for information about LangChain.
Action: search_docs
Action Input: LangChain framework overview
Observation: Result for: LangChain framework overview
Thought: I now have enough information to answer.
Final Answer: LangChain is...
> Finished chain.

This is enough to catch tool call loops (you'll see the same action repeating) and basic reasoning failures.

When verbose mode is not enough: When you have nested chains, multiple agents, or you need to persist debug history across runs.


Pattern 2: Global Debug Mode (More Detail, Still Free)

LangChain's debug mode prints structured JSON-like output for every step—prompts sent to the LLM, raw responses, tool inputs and outputs.

from langchain.globals import set_debug, set_verbose

# Debug mode: full raw I/O at every step
set_debug(True)

# Verbose mode: human-readable summary (less noise)
set_verbose(True)

# You can use both. Debug gives raw, verbose gives summary.
agent_executor = AgentExecutor(
    agent=agent,
    tools=[search_docs],
    max_iterations=5,
    handle_parsing_errors=True
)

result = agent_executor.invoke({"input": "What is LangChain?"})

Debug mode output includes the full prompt string passed to the LLM. This is where you'll catch context window bloat—you can see exactly how many tokens you're sending and whether old tool outputs are being included unnecessarily.
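If you want a quick number to watch while eyeballing debug output, a crude estimate is enough to spot bloat between runs. The ~4-characters-per-token ratio below is a rough heuristic for English text, not an exact count — use a real tokenizer (e.g. tiktoken) when precision matters.

```python
# Rough sketch: estimate prompt size to spot context bloat in debug output.
# The 4-characters-per-token ratio is a crude heuristic for English text.

def estimate_tokens(prompt: str) -> int:
    return len(prompt) // 4

# Simulate a scratchpad that re-appends every tool observation verbatim
observation = "Result for: LangChain framework overview\n" * 200
prompt = "Answer the question using the tools available.\n" + observation
print(f"~{estimate_tokens(prompt)} tokens")
```

If the estimate roughly doubles after each tool call, old observations are being re-sent in full — that's the bloat to fix.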

When debug mode is not enough: When you need to log failures in production without printing to stdout, or when you need to track patterns across many runs.


Pattern 3: Custom Callback Handler (Log What Matters)

LangChain's callback system lets you hook into every event in the agent's lifecycle. You inherit from BaseCallbackHandler and override only the events you care about.

from langchain_core.callbacks import BaseCallbackHandler
from langchain_core.agents import AgentAction, AgentFinish
from langchain_core.outputs import LLMResult
from typing import Any, Union
import json
import logging

# Configure your logger
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("agent_debug")


class AgentDebugCallback(BaseCallbackHandler):
    """Logs tool calls, errors, and final answers. Stores them for analysis."""

    def __init__(self):
        self.tool_calls = []
        self.errors = []
        self.step_count = 0

    def on_agent_action(self, action: AgentAction, **kwargs: Any) -> Any:
        """Fires before each tool invocation."""
        self.step_count += 1
        entry = {
            "step": self.step_count,
            "tool": action.tool,
            "input": action.tool_input,
        }
        self.tool_calls.append(entry)
        logger.info(f"Step {self.step_count}: calling {action.tool}({action.tool_input})")

    def on_tool_end(self, output: str, **kwargs: Any) -> Any:
        """Fires after a tool returns."""
        logger.info(f"Tool returned: {output[:200]}")  # truncate long outputs

    def on_tool_error(
        self, error: Union[Exception, KeyboardInterrupt], **kwargs: Any
    ) -> Any:
        """Fires when a tool raises an exception."""
        self.errors.append(str(error))
        logger.error(f"Tool error: {error}")

    def on_llm_error(
        self, error: Union[Exception, KeyboardInterrupt], **kwargs: Any
    ) -> Any:
        """Fires when the LLM call fails."""
        self.errors.append(str(error))
        logger.error(f"LLM error: {error}")

    def on_agent_finish(self, finish: AgentFinish, **kwargs: Any) -> Any:
        """Fires when the agent reaches a final answer."""
        logger.info(f"Agent finished in {self.step_count} steps")
        if self.errors:
            logger.warning(f"Errors encountered: {self.errors}")


# Usage
debug_callback = AgentDebugCallback()

agent_executor = AgentExecutor(
    agent=agent,
    tools=[search_docs],
    callbacks=[debug_callback],   # attach the callback
    max_iterations=5,
    handle_parsing_errors=True
)

result = agent_executor.invoke({"input": "What is LangChain?"})

# Inspect what happened
print(f"Total tool calls: {len(debug_callback.tool_calls)}")
print(f"Errors: {debug_callback.errors}")
print(f"Steps: {debug_callback.step_count}")

The callback gives you structured data you can write to a database, send to a monitoring system, or aggregate across sessions. This is how you detect "this agent makes 3x more tool calls on questions about X than on questions about Y."
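As a minimal sketch of that aggregation step: collect `debug_callback.tool_calls` from each run into a list, then count. The `all_runs` data below is hypothetical — in practice you'd load it from wherever your callback persisted it.

```python
# Sketch: aggregate the structured data from AgentDebugCallback across runs.
# `all_runs` is hypothetical sample data standing in for persisted logs.
from collections import Counter

all_runs = [
    [{"step": 1, "tool": "search_docs", "input": "q1"},
     {"step": 2, "tool": "search_docs", "input": "q1"}],
    [{"step": 1, "tool": "search_docs", "input": "q2"}],
]

calls_per_run = [len(run) for run in all_runs]
tool_frequency = Counter(c["tool"] for run in all_runs for c in run)

print(f"avg tool calls per run: {sum(calls_per_run) / len(calls_per_run)}")
print(tool_frequency)  # Counter({'search_docs': 3})
```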

Key events to hook into:

  • on_agent_action — before tool call (capture what the agent decided)
  • on_tool_end — after tool returns (capture what it got back)
  • on_tool_error — tool exception (capture what failed)
  • on_llm_error — LLM exception (rate limits, context overflow)
  • on_agent_finish — final answer (capture completion, step count)

When callbacks are not enough: When you need to share traces with your team, replay failures, or run evals over production traces.


Pattern 4: LangSmith Tracing (Production-Grade)

LangSmith is LangChain's observability platform. It captures a complete trace of every run—every LLM call, every tool invocation, every token—and stores them for inspection and analysis.

Setup (as of February 2026):

pip install langchain langchain-openai langchain-core langsmith

Set environment variables:

export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY=<your-key>   # get from smith.langchain.com
export OPENAI_API_KEY=<your-key>

No code changes required. Once the environment variables are set, every LangChain run is automatically traced.

from langchain.agents import AgentExecutor, create_react_agent
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain_core.tools import tool

@tool
def search_docs(query: str) -> str:
    """Search the documentation for an answer."""
    return f"Result for: {query}"

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

prompt = PromptTemplate.from_template(
    "Answer the question using the tools available.\n"
    "Tools: {tools}\nTool names: {tool_names}\n"
    "Question: {input}\n{agent_scratchpad}"
)

agent = create_react_agent(llm=llm, tools=[search_docs], prompt=prompt)

agent_executor = AgentExecutor(
    agent=agent,
    tools=[search_docs],
    max_iterations=5,
    handle_parsing_errors=True
)

# This run is automatically traced to LangSmith
result = agent_executor.invoke(
    {"input": "What is LangChain?"},
    config={"run_name": "debug-session-001"}  # optional: name your trace
)

In the LangSmith UI at smith.langchain.com, you'll see a timeline of every step. You can click into any LLM call to see the exact prompt that was sent and the exact completion that came back. You can filter traces by latency, error rate, and token count.

For selective tracing (useful when you only want to trace specific runs, not everything):

import langsmith as ls

with ls.tracing_context(enabled=True):
    result = agent_executor.invoke({"input": "What is LangChain?"})

# Runs outside the context manager are NOT traced
result = agent_executor.invoke({"input": "This run is not traced"})

LangSmith has a free tier. Paid plans add team sharing, evals, and automated alerts.


Which Pattern to Start With

Your situation → Use this:

  • First time building an agent → verbose=True
  • Debugging a specific input failure → set_debug(True)
  • Need persistent logs or metrics → custom BaseCallbackHandler
  • Team collaboration, production → LangSmith

The order matters. Start with verbose mode—it costs zero and reveals most issues. Graduate to custom callbacks when you need structured data. Add LangSmith when you're running in production and need trace history.


Three Bugs Verbose Mode Will Catch Immediately

Once you turn on verbose, look for these patterns:

Loop detection: The same tool being called more than twice in a row with nearly identical inputs. This is the hallmark of a confused agent. Fix: add explicit instructions in your system prompt about what to do when a tool returns no results.

Context bleed: You ask "What is X?" but the agent's first thought references information from a previous, unrelated question. This means your agent isn't isolating context between runs. Fix: clear the agent's memory between sessions.

Phantom reasoning: The agent says "Based on my search, X is true" but you can see in the verbose output that the search tool returned an empty result. The agent is filling in the gap from LLM training data. Fix: add an explicit check for empty tool results in your prompt ("If the tool returns nothing, say you don't know").
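That last fix can also live in code rather than in the prompt: wrap the tool so an empty result becomes an explicit sentinel the model can't quietly paper over. A minimal sketch, where `raw_search` is a hypothetical stand-in for your real tool body:

```python
# Sketch: guard against phantom reasoning by making empty tool results
# explicit. `raw_search` is a hypothetical stand-in for your real search.

def raw_search(query: str) -> str:
    return ""  # pretend the search found nothing

def search_docs(query: str) -> str:
    """Search the documentation; never return a silent empty string."""
    result = raw_search(query)
    if not result.strip():
        # An explicit sentinel that forces the agent to acknowledge the gap
        return "NO_RESULTS: the search returned nothing. Say you don't know."
    return result

print(search_docs("anything"))
```

Belt and suspenders: keep the prompt instruction too, since the sentinel only works if the model is told how to react to it.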


One Debugging Setup to Ship with Every Agent

Don't wait until production breaks. Add this to every agent you build:

from langchain.agents import AgentExecutor
from langchain.globals import set_verbose
import os

def build_agent_executor(agent, tools, debug: bool = False):
    """
    Wrap any agent with debug tooling.

    Args:
        debug: If True, enables verbose mode and LangSmith tracing
               (requires LANGSMITH_API_KEY env var).
    """
    if debug:
        set_verbose(True)
        # LangSmith traces automatically if env vars are set
        os.environ.setdefault("LANGSMITH_TRACING", "true")

    return AgentExecutor(
        agent=agent,
        tools=tools,
        verbose=debug,
        max_iterations=10,        # always set this
        max_execution_time=60,    # seconds — prevent runaway agents
        handle_parsing_errors=True,
        return_intermediate_steps=debug  # capture steps when debugging
    )


# Development
executor = build_agent_executor(agent, tools, debug=True)

# Production
executor = build_agent_executor(agent, tools, debug=False)

The return_intermediate_steps=True flag returns the full list of (AgentAction, observation) tuples alongside the final answer. Useful when you want to inspect what the agent did without enabling full verbose output.
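A quick sketch of what inspecting those tuples looks like. The `result` dict below is mocked sample data (with a namedtuple standing in for LangChain's AgentAction) so the shape is visible without running an agent:

```python
# Sketch: inspect intermediate steps without verbose output. With
# return_intermediate_steps=True, result["intermediate_steps"] is a list of
# (AgentAction, observation) tuples; mocked here with sample data.
from collections import namedtuple

AgentAction = namedtuple("AgentAction", ["tool", "tool_input", "log"])

result = {
    "output": "LangChain is a framework for building LLM apps.",
    "intermediate_steps": [
        (AgentAction("search_docs", "LangChain overview", "..."),
         "Result for: LangChain overview"),
    ],
}

for action, observation in result["intermediate_steps"]:
    print(f"{action.tool}({action.tool_input!r}) -> {observation[:80]}")
```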


What Comes Next

Debugging tells you an agent is broken. Evaluation tells you it's consistently right.

Once your agent is working under these four patterns, the next step is setting up an eval loop: define what "correct" looks like for your task, score every run against that definition, and track the score over time as you change prompts and models.

That's a different article. But start with visibility first. You can't improve what you can't see.


Follow @klement_gunndu for more AI engineering content. We're building in public.
