DEV Community

klement Gunndu

Posted on

Your First AI Agent Will Fail. Here's How to Debug It.

Your AI agent worked perfectly in testing. Then it hit production and called the wrong tool 14 times in a loop, burned $40 of API credits, and returned gibberish to your user.

This is not a rare scenario. It's the default scenario.

The reason most developers don't catch this early is simple: they have no visibility into what the agent is actually doing. LLM calls look like black boxes. Tool invocations are invisible. When something goes wrong, you're left reading the final output and guessing backward.

This guide gives you four concrete debugging patterns—from zero-setup verbose mode to production-grade tracing with LangSmith. Each one works. Start with the first. Graduate to the fourth when you need it.


Why AI Agents Fail Differently Than Regular Code

Before the debugging patterns, understand what makes agents hard to debug.

In regular code, failures are deterministic: the same input produces the same bug. In AI agents, failures are probabilistic: the same input might work 9 times and fail on the 10th, and the failure might look completely different each time.

The four failure modes that catch first-time agent builders:

1. Tool call loops — The agent decides it needs more information, calls a tool, doesn't trust the result, calls it again, and repeats until you hit a max_iterations limit (or don't, and burn unlimited credits).

2. Hallucinated tool arguments — The agent calls a real tool with fabricated parameters. Your search_database(user_id=99999) runs successfully but returns nothing, because user 99999 doesn't exist. The agent then reasons from an empty result.

3. Context window overflow — Your agent passes all prior conversation history on every step. After 10 tool calls, the prompt is 80K tokens. The LLM starts forgetting earlier instructions.

4. Silent reasoning failures — The agent's chain-of-thought goes wrong, but the final answer looks plausible. No error is raised. The user gets a confident, wrong answer.

All four are invisible without observability. Here's how to see them.


Pattern 1: Verbose Mode (30 Seconds to Set Up)

The fastest way to see inside your agent. Zero additional packages required.

from langchain.agents import AgentExecutor, create_react_agent
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain_core.tools import tool

@tool
def search_docs(query: str) -> str:
    """Search the documentation for an answer."""
    return f"Result for: {query}"

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

prompt = PromptTemplate.from_template(
    "Answer the question using the tools available.\n"
    "Tools: {tools}\nTool names: {tool_names}\n"
    "Question: {input}\n{agent_scratchpad}"
)

agent = create_react_agent(llm=llm, tools=[search_docs], prompt=prompt)

# verbose=True prints every thought, tool call, and observation
agent_executor = AgentExecutor(
    agent=agent,
    tools=[search_docs],
    verbose=True,          # <-- this is all you need
    max_iterations=5,      # always bound your agents
    handle_parsing_errors=True
)

result = agent_executor.invoke({"input": "What is LangChain?"})

With verbose=True, you'll see the agent's internal monologue in your terminal:

> Entering new AgentExecutor chain...
Thought: I need to search for information about LangChain.
Action: search_docs
Action Input: LangChain framework overview
Observation: Result for: LangChain framework overview
Thought: I now have enough information to answer.
Final Answer: LangChain is...
> Finished chain.

This is enough to catch tool call loops (you'll see the same action repeating) and basic reasoning failures.

When verbose mode is not enough: When you have nested chains, multiple agents, or you need to persist debug history across runs.


Pattern 2: Global Debug Mode (More Detail, Still Free)

LangChain's debug mode prints structured JSON-like output for every step—prompts sent to the LLM, raw responses, tool inputs and outputs.

from langchain.globals import set_debug, set_verbose

# Debug mode: full raw I/O at every step
set_debug(True)

# Verbose mode: human-readable summary (less noise)
set_verbose(True)

# You can use both. Debug gives raw, verbose gives summary.
agent_executor = AgentExecutor(
    agent=agent,
    tools=[search_docs],
    max_iterations=5,
    handle_parsing_errors=True
)

result = agent_executor.invoke({"input": "What is LangChain?"})

Debug mode output includes the full prompt string passed to the LLM. This is where you'll catch context window bloat—you can see exactly how many tokens you're sending and whether old tool outputs are being included unnecessarily.
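To see whether your prompts are bloating before you even open debug output, a cheap size check helps. This is a minimal sketch using the rough ~4-characters-per-token heuristic for English text; exact counts need the model's real tokenizer (e.g. the tiktoken library), and `is_bloated` and the 8K budget are illustrative choices, not LangChain APIs:

```python
def estimate_tokens(text: str) -> int:
    """Approximate token count: roughly 4 characters per token in English."""
    return max(1, len(text) // 4)

def is_bloated(prompt: str, budget: int = 8_000) -> bool:
    """True if the estimated token count exceeds the budget."""
    return estimate_tokens(prompt) > budget

# A 40,000-character prompt is ~10K estimated tokens -- over an 8K budget
print(is_bloated("x" * 40_000))  # True
```

Run this against the full prompt string that debug mode prints and you'll catch context growth long before the model starts forgetting instructions.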

When debug mode is not enough: When you need to log failures in production without printing to stdout, or when you need to track patterns across many runs.


Pattern 3: Custom Callback Handler (Log What Matters)

LangChain's callback system lets you hook into every event in the agent's lifecycle. You inherit from BaseCallbackHandler and override only the events you care about.

from langchain_core.callbacks import BaseCallbackHandler
from langchain_core.agents import AgentAction, AgentFinish
from langchain_core.outputs import LLMResult
from typing import Any, Union
import json
import logging

# Configure your logger
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("agent_debug")


class AgentDebugCallback(BaseCallbackHandler):
    """Logs tool calls, errors, and final answers. Stores them for analysis."""

    def __init__(self):
        self.tool_calls = []
        self.errors = []
        self.step_count = 0

    def on_agent_action(self, action: AgentAction, **kwargs: Any) -> Any:
        """Fires before each tool invocation."""
        self.step_count += 1
        entry = {
            "step": self.step_count,
            "tool": action.tool,
            "input": action.tool_input,
        }
        self.tool_calls.append(entry)
        logger.info(f"Step {self.step_count}: calling {action.tool}({action.tool_input})")

    def on_tool_end(self, output: str, **kwargs: Any) -> Any:
        """Fires after a tool returns."""
        logger.info(f"Tool returned: {output[:200]}")  # truncate long outputs

    def on_tool_error(
        self, error: Union[Exception, KeyboardInterrupt], **kwargs: Any
    ) -> Any:
        """Fires when a tool raises an exception."""
        self.errors.append(str(error))
        logger.error(f"Tool error: {error}")

    def on_llm_error(
        self, error: Union[Exception, KeyboardInterrupt], **kwargs: Any
    ) -> Any:
        """Fires when the LLM call fails."""
        self.errors.append(str(error))
        logger.error(f"LLM error: {error}")

    def on_agent_finish(self, finish: AgentFinish, **kwargs: Any) -> Any:
        """Fires when the agent reaches a final answer."""
        logger.info(f"Agent finished in {self.step_count} steps")
        if self.errors:
            logger.warning(f"Errors encountered: {self.errors}")


# Usage
debug_callback = AgentDebugCallback()

agent_executor = AgentExecutor(
    agent=agent,
    tools=[search_docs],
    callbacks=[debug_callback],   # attach the callback
    max_iterations=5,
    handle_parsing_errors=True
)

result = agent_executor.invoke({"input": "What is LangChain?"})

# Inspect what happened
print(f"Total tool calls: {len(debug_callback.tool_calls)}")
print(f"Errors: {debug_callback.errors}")
print(f"Steps: {debug_callback.step_count}")

The callback gives you structured data you can write to a database, send to a monitoring system, or aggregate across sessions. This is how you detect "this agent makes 3x more tool calls on questions about X than on questions about Y."
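As a sketch of that kind of aggregation: the entries below are hypothetical, shaped like the dicts `AgentDebugCallback` records, and a `collections.Counter` over them surfaces which tool dominates a run:

```python
from collections import Counter

# Hypothetical entries, shaped like the dicts AgentDebugCallback.tool_calls stores
tool_calls = [
    {"step": 1, "tool": "search_docs", "input": "LangChain overview"},
    {"step": 2, "tool": "search_docs", "input": "LangChain overview"},
    {"step": 3, "tool": "search_docs", "input": "LangChain overview"},
    {"step": 4, "tool": "summarize", "input": "..."},
]

# Count invocations per tool; a skewed distribution is a loop smell
counts = Counter(call["tool"] for call in tool_calls)
print(counts.most_common(1))  # [('search_docs', 3)]
```

Aggregate these counts per question category across sessions and the "3x more tool calls on X than Y" pattern falls out of the data.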

Key events to hook into:

  • on_agent_action — before tool call (capture what the agent decided)
  • on_tool_end — after tool returns (capture what it got back)
  • on_tool_error — tool exception (capture what failed)
  • on_llm_error — LLM exception (rate limits, context overflow)
  • on_agent_finish — final answer (capture completion, step count)

When callbacks are not enough: When you need to share traces with your team, replay failures, or run evals over production traces.


Pattern 4: LangSmith Tracing (Production-Grade)

LangSmith is LangChain's observability platform. It captures a complete trace of every run—every LLM call, every tool invocation, every token—and stores them for inspection and analysis.

Setup (as of February 2026):

pip install langchain-openai langchain-core langsmith

Set environment variables:

export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY=<your-key>   # get from smith.langchain.com
export OPENAI_API_KEY=<your-key>

No code changes required. Once the environment variables are set, every LangChain run is automatically traced.

from langchain.agents import AgentExecutor, create_react_agent
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain_core.tools import tool

@tool
def search_docs(query: str) -> str:
    """Search the documentation for an answer."""
    return f"Result for: {query}"

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

prompt = PromptTemplate.from_template(
    "Answer the question using the tools available.\n"
    "Tools: {tools}\nTool names: {tool_names}\n"
    "Question: {input}\n{agent_scratchpad}"
)

agent = create_react_agent(llm=llm, tools=[search_docs], prompt=prompt)

agent_executor = AgentExecutor(
    agent=agent,
    tools=[search_docs],
    max_iterations=5,
    handle_parsing_errors=True
)

# This run is automatically traced to LangSmith
result = agent_executor.invoke(
    {"input": "What is LangChain?"},
    config={"run_name": "debug-session-001"}  # optional: name your trace
)

In the LangSmith UI at smith.langchain.com, you'll see a timeline of every step. You can click into any LLM call to see the exact prompt that was sent and the exact completion that came back. You can filter traces by latency, error rate, and token count.

For selective tracing (useful when you only want to trace specific runs, not everything):

import langsmith as ls

with ls.tracing_context(enabled=True):
    result = agent_executor.invoke({"input": "What is LangChain?"})

# Runs outside the context manager are NOT traced
result = agent_executor.invoke({"input": "This run is not traced"})

LangSmith has a free tier. Paid plans add team sharing, evals, and automated alerts.


Which Pattern to Start With

  • First time building an agent: verbose=True
  • Debugging a specific input failure: set_debug(True)
  • Need persistent logs or metrics: a custom BaseCallbackHandler
  • Team collaboration or production: LangSmith

The order matters. Start with verbose mode—it costs zero and reveals most issues. Graduate to custom callbacks when you need structured data. Add LangSmith when you're running in production and need trace history.


Three Bugs Verbose Mode Will Catch Immediately

Once you turn on verbose, look for these patterns:

Loop detection: The same tool being called more than twice in a row with nearly identical inputs. This is the hallmark of a confused agent. Fix: add explicit instructions in your system prompt about what to do when a tool returns no results.

Context bleed: You ask "What is X?" but the agent's first thought references information from a previous, unrelated question. This means your agent isn't isolating context between runs. Fix: clear the agent's memory between sessions.

Phantom reasoning: The agent says "Based on my search, X is true" but you can see in the verbose output that the search tool returned an empty result. The agent is filling in the gap from LLM training data. Fix: add an explicit check for empty tool results in your prompt ("If the tool returns nothing, say you don't know").
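The loop check above is easy to automate once you have structured step data. Here's a minimal sketch: `detect_loop` is a hypothetical helper that takes a list of (tool, input) pairs, such as what a callback handler collects, and flags consecutive repeats:

```python
def detect_loop(steps, max_repeats=2):
    """Flag a run where the same (tool, input) pair repeats more than
    max_repeats times in a row -- the hallmark of a confused agent."""
    run_length = 1
    for prev, cur in zip(steps, steps[1:]):
        run_length = run_length + 1 if cur == prev else 1
        if run_length > max_repeats:
            return True
    return False

looping = [("search_docs", "langchain"),
           ("search_docs", "langchain"),
           ("search_docs", "langchain")]
print(detect_loop(looping))  # True
```

Wire this into `on_agent_action` and you can abort or alert mid-run instead of discovering the loop in your API bill.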


One Debugging Setup to Ship with Every Agent

Don't wait until production breaks. Add this to every agent you build:

from langchain.agents import AgentExecutor
from langchain.globals import set_verbose
import os

def build_agent_executor(agent, tools, debug: bool = False):
    """
    Wrap any agent with debug tooling.

    Args:
        debug: If True, enables verbose mode and LangSmith tracing
               (requires LANGSMITH_API_KEY env var).
    """
    if debug:
        set_verbose(True)
        # LangSmith traces automatically if env vars are set
        os.environ.setdefault("LANGSMITH_TRACING", "true")

    return AgentExecutor(
        agent=agent,
        tools=tools,
        verbose=debug,
        max_iterations=10,        # always set this
        max_execution_time=60,    # seconds — prevent runaway agents
        handle_parsing_errors=True,
        return_intermediate_steps=debug  # capture steps when debugging
    )


# Development
executor = build_agent_executor(agent, tools, debug=True)

# Production
executor = build_agent_executor(agent, tools, debug=False)

The return_intermediate_steps=True flag returns the full list of (AgentAction, observation) tuples alongside the final answer. Useful when you want to inspect what the agent did without enabling full verbose output.
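A quick post-run audit of those steps catches the phantom-reasoning bug automatically. The sketch below uses hypothetical data: in real runs each entry is an (AgentAction, observation) tuple, but here plain (tool, input, observation) tuples stand in for AgentAction objects:

```python
def find_empty_observations(intermediate_steps):
    """Return (step_number, tool) for every step whose tool returned nothing --
    the raw material for a phantom-reasoning bug."""
    return [
        (i, tool)
        for i, (tool, tool_input, observation) in enumerate(intermediate_steps, 1)
        if not observation.strip()
    ]

steps = [
    ("search_docs", "LangChain overview", "Result for: LangChain overview"),
    ("search_docs", "LangChain pricing", ""),  # empty result: suspicious
]
print(find_empty_observations(steps))  # [(2, 'search_docs')]
```

If the flagged list is non-empty but the agent still produced a confident answer, that answer came from training data, not your tools.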


What Comes Next

Debugging tells you an agent is broken. Evaluation tells you it's consistently right.

Once your agent is working under these four patterns, the next step is setting up an eval loop: define what "correct" looks like for your task, score every run against that definition, and track the score over time as you change prompts and models.

That's a different article. But start with visibility first. You can't improve what you can't see.


Follow @klement_gunndu for more AI engineering content. We're building in public.

Top comments (6)

liuhaotian2024-prog

The silent reasoning failure is the one that gets me every time. No exception, no error log, just a confident wrong answer three steps later and you're left staring at output trying to work backward.
What actually helped us: log what the agent was supposed to do before the call, not just what it did. When it writes to the wrong env or calls the wrong tool, the deviation is right there — arithmetic, not guesswork. One command and you see exactly where it diverged.
We built this after a Claude Code agent wrote a staging URL into production config three times, 41 minutes apart, all green in the logs. Zero errors thrown.
github.com/liuhaotian2024-prog/K9Audit if you want to try it.

klement Gunndu

Pre-call intent logging is exactly the pattern that separates debuggable agents from black boxes — logging what was supposed to happen makes the delta obvious when it drifts. The staging URL written to prod config with zero errors thrown is a perfect example of silent reasoning failure: the agent was confident, the logs were green, the output was wrong. Interesting approach with K9Audit — that deviation-first lens is where agent observability needs to go.

liuhaotian2024-prog

Your point about lack of visibility into internal agent steps is especially important — many silent failure cases only surface at the end because traditional logging doesn’t show reasoning decisions.

One angle I’ve been exploring is coupling pre‑declared intent contracts with audit traces: instead of just observing what happened, we declare what should happen ahead of time and then measure deviation deterministically. That can make failure attribution clearer without guessing backwards from output.

Curious if you’ve tried combining explicit intent specs with trace/callback systems (like LangSmith) to reduce cognitive load when debugging? Does that match any patterns you’ve found useful?

klement Gunndu

Yes — that combination works well in practice. The pattern I've settled on:

  1. Define the expected tool call sequence as a simple list before execution (the "intent spec")
  2. Use LangSmith's callback handler to capture the actual trace
  3. After execution, diff the intent against the trace programmatically

LangSmith's evaluator framework supports custom assertion functions — you write a function that takes the run trace and checks it against your expected sequence. When the agent deviates, the evaluator flags the exact step where intent and action diverged.

The cognitive load reduction is real. Instead of reading 40 lines of trace output and mentally reconstructing what went wrong, you get a single diff: "Expected search_docs at step 3, got calculate_total." That's the debugging session.

One thing I found: the intent spec works best as a sequence of (tool_name, key_constraint) pairs rather than full argument matching. Agents legitimately vary their arguments, but calling the wrong tool in the wrong order is almost always a bug.
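The core of that diff is small. A minimal sketch, assuming `diff_intent` is a hypothetical helper comparing an intent spec (here simplified to tool names only) against the tool sequence pulled from a trace:

```python
def diff_intent(expected, actual):
    """Return a message for the first step where the observed tool sequence
    deviates from the intent spec, or None if the run matched."""
    for step, (want, got) in enumerate(zip(expected, actual), start=1):
        if want != got:
            return f"Expected {want} at step {step}, got {got}"
    if len(actual) != len(expected):
        return f"Expected {len(expected)} steps, got {len(actual)}"
    return None

expected = ["search_docs", "search_docs", "calculate_total"]
actual = ["search_docs", "search_docs", "search_docs"]
print(diff_intent(expected, actual))
# Expected calculate_total at step 3, got search_docs
```

Extending `want` to a (tool_name, key_constraint) pair rather than full argument matching is the variation described above.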
