Stop Print Debugging Your AI Agents: A Deep Dive into Agent Observability

The Invisible Agent Problem

It's 2 AM. Your AI agent just went into an infinite loop consuming API credits. Again.

You've built what should be a simple customer service agent:

  1. Parse user question
  2. Search knowledge base
  3. Query database if needed
  4. Format response
  5. Maybe escalate to human support

Simple, right? Except somewhere in those 5 steps, your agent:

  • Called the same database query 15 times
  • Got stuck in a loop asking the LLM to "try again"
  • Hallucinated data that doesn't exist
  • Crashed with a cryptic error in step 4

And you have no idea which one until you start debugging.

The Print Statement Spiral

So you do what every developer does. You add logging:

import time
import openai
# `db` below is assumed to be an already-configured database client

def call_llm(prompt):
    print(f"[DEBUG] Calling LLM with: {prompt[:50]}...")
    start = time.time()
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    print(f"[DEBUG] LLM took {time.time() - start:.2f}s")
    result = response.choices[0].message.content
    print(f"[DEBUG] Got response: {result[:50]}...")
    return result

def search_database(query):
    print(f"[DEBUG] Searching DB: {query}")
    results = db.query(query)
    print(f"[DEBUG] Found {len(results)} results")
    return results

def get_customer_info(customer_id):
    print(f"[DEBUG] Getting customer {customer_id}")
    customer = db.get(customer_id)
    print(f"[DEBUG] Customer: {customer.get('name', 'Unknown')}")
    return customer

An hour later, your terminal looks like this:

[DEBUG] Calling LLM with: Find all orders for customer John Smith...
[DEBUG] LLM took 1.23s
[DEBUG] Got response: I'll search for that customer...
[DEBUG] Searching DB: customer_name=John Smith
[DEBUG] Found 2 results
[DEBUG] Getting customer 123
[DEBUG] Customer: John Smith
[DEBUG] Calling LLM with: Here are the customer details: {'id': 123...
[DEBUG] LLM took 0.87s
[DEBUG] Got response: Let me get their orders...
[DEBUG] Searching DB: orders WHERE customer_id=123
[DEBUG] Found 3 results
[DEBUG] Calling LLM with: Here are the orders: [{'id': 1001, 'to...
[DEBUG] LLM took 1.45s
[DEBUG] Got response: The customer has 3 orders...

You're staring at hundreds of lines of logs trying to answer basic questions:

  • How many times did we call the LLM?
  • What was the total execution time?
  • Which step failed?
  • What were the actual arguments passed to each function?
  • When did it start looping?

This is not sustainable.

The Real Cost of Poor Observability

Let me share some real numbers from my experience building AI agents:

Time Spent Debugging:

  • Print debugging: 2-4 hours per bug
  • Adding proper logging: 30 minutes per function
  • Actually finding the bug: 15 minutes
  • Total: 3-5 hours for issues that should take 15 minutes

Developer Frustration:

  • Losing context between debugging sessions
  • Unable to reproduce issues
  • No way to compare "working" vs "broken" runs
  • Every new team member asks: "How do I debug this?"

API Inefficiency:

  • Agents making 3x more API calls than necessary
  • Inefficient prompts using excessive tokens
  • Unable to identify performance bottlenecks

We've spent decades building amazing developer tools for web apps, mobile apps, and backend services. But for AI agents? We're back to print() statements like it's 1995.

Why Current Solutions Fall Short

Before building Agent Recorder, I tried everything:

1. Standard Logging Libraries

import logging

logger = logging.getLogger(__name__)

def call_llm(prompt):
    logger.info(f"Calling LLM with prompt: {prompt}")
    response = llm.invoke(prompt)
    logger.info(f"Got response: {response}")
    return response

Problems:

  • Still just text logs in a file
  • No structure, no visualization
  • Manual instrumentation everywhere
  • Hard to correlate across async calls
  • No timing information without extra code

2. Cloud Observability Tools (DataDog, New Relic, etc.)

Problems:

  • Expensive for small teams and individuals
  • Send your prompts/responses to third-party servers (security issue)
  • Heavy SDKs that bloat your dependencies
  • Designed for traditional apps, not agent workflows
  • Over-engineered for "just see what my agent did"

3. LLM Provider Dashboards (OpenAI, Anthropic)

Problems:

  • Only see LLM calls, not your tool calls
  • No local context (what led to this call?)
  • Delayed (not real-time)
  • Can't see your custom logic
  • Vendor lock-in

4. Framework-Specific Tools (LangSmith for LangChain)

Problems:

  • Only works with that framework
  • Requires rewriting code to use their patterns
  • Still cloud-based with subscription fees
  • What if you use raw APIs or multiple frameworks?

What I needed was simple:

  • See every LLM call and tool call
  • Local storage (my data, my machine)
  • Framework-agnostic (works with anything)
  • Minimal code changes
  • Beautiful visualization
  • Free and open source

That tool didn't exist. So I built it.

Introducing Agent Recorder

Agent Recorder is Redux DevTools for AI agents. If you've ever used Redux DevTools for React development, you know the power of seeing every action, every state change, with the ability to inspect, time-travel, and understand your application flow.

Now imagine that, but for your AI agent's execution.

The Two-Decorator Solution

Here's all you need to add to your code:

from agent_recorder import llm_call, tool_call

@llm_call(run_name="customer-service-agent")
def call_llm(prompt):
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

@tool_call(run_name="customer-service-agent")
def search_database(query):
    results = db.query(query)
    return results

@tool_call(run_name="customer-service-agent")
def get_customer_orders(customer_id):
    orders = db.query(f"SELECT * FROM orders WHERE customer_id = {customer_id}")
    return orders

That's it. No context managers, no complex setup, no configuration files.

The run_name parameter groups related calls together. All functions decorated with run_name="customer-service-agent" will be recorded in the same timeline.

What Gets Captured Automatically

Every decorated function automatically logs:

  1. Function name - What was called
  2. Arguments - All input parameters with their values
  3. Return value - Complete output from the function
  4. Duration - Execution time in milliseconds
  5. Timestamp - Exact time of invocation
  6. Errors - Full exception details if it failed
  7. Parent tracking - For nested function calls

No manual annotation needed. Just add the decorator.

Running Your Agent

Use your functions exactly as before:

# This is your agent logic - unchanged!
user_question = "Find all orders for customer John Smith"

# Step 1: Ask LLM to understand the query
intent = call_llm(f"User asks: {user_question}")

# Step 2: Search for the customer
customers = search_database("customer_name='John Smith'")

# Step 3: Get their orders
if customers:
    customer = customers[0]
    orders = get_customer_orders(customer['id'])

    # Step 4: Summarize results
    summary = call_llm(f"Summarize these orders: {orders}")
    print(summary)

Everything is being recorded in the background.

Viewing the Timeline

When your agent finishes (or crashes), run:

agent-recorder view latest

Your browser opens to a beautiful web-based timeline showing the complete execution flow.

How It Works: Technical Deep Dive

Let me walk you through the architecture and implementation details.

1. Decorator-Based Instrumentation

When you write:

@llm_call(run_name="my-agent")
def call_llm(prompt):
    return "response"

Here's what happens under the hood:

  1. Registry Lookup: Agent Recorder checks if a Recorder instance exists for "my-agent"
  2. Auto-Creation: If not, it creates one with a unique run ID (timestamp + UUID)
  3. Function Wrapping: Your function gets wrapped with timing and logging logic
  4. Execution: When called, it captures args, executes the function, captures the result
  5. Event Writing: Writes a structured event to a JSONL file immediately

The actual implementation:

def llm_call(run_name: str, name: Optional[str] = None,
             capture_args: bool = True, capture_result: bool = True):
    # Get or create a Recorder instance for this run_name
    recorder = _get_or_create_recorder(run_name)

    # Return the actual decorator that wraps your function
    return recorder.llm_call(name=name, capture_args=capture_args,
                            capture_result=capture_result)
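The wrapping logic itself isn't shown here, but conceptually it's the familiar timing-decorator pattern: capture the arguments, run the function, capture the result or exception, and append an event. Here's a minimal sketch of that idea, not the library's actual code; the _write_event helper is an assumption, while the event fields mirror the example shown in the next section:

import functools
import json
import time
import uuid
from datetime import datetime
from pathlib import Path

RUNS_DIR = Path.home() / ".agent-recorder" / "runs"

def _write_event(run_id, event):
    # Append one JSON object per line so every event up to a crash is preserved
    path = RUNS_DIR / f"{run_id}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")

def recorded_call(run_id, event_type="llm_call"):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            error = result = None
            try:
                result = func(*args, **kwargs)
                return result
            except Exception as exc:
                error = repr(exc)
                raise
            finally:
                _write_event(run_id, {
                    "run_id": run_id,
                    "event_id": str(uuid.uuid4()),
                    "timestamp": datetime.now().isoformat(),
                    "type": event_type,
                    "data": {
                        "function_name": func.__name__,
                        "args": {"args": repr(args), "kwargs": repr(kwargs)},
                        "duration_ms": int((time.time() - start) * 1000),
                        "error": error,
                        "result": repr(result),
                    },
                })
        return wrapper
    return decorator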

2. Event Storage Format

All events are stored as JSONL (JSON Lines) - one JSON object per line. This format is:

  • Streamable: Can write events as they happen
  • Parseable: Easy to read line-by-line
  • Crash-resistant: If your program crashes, all events up to that point are saved
  • Tooling-friendly: Standard format used by many data tools

Example event:

{
  "run_id": "20260103_192705_c2207bde",
  "event_id": "4f85a880-2ab7-45bf-a0ba-9c776581a5de",
  "timestamp": "2026-01-03T19:27:06.097562",
  "type": "llm_call",
  "parent_id": null,
  "data": {
    "function_name": "call_llm",
    "args": {
      "prompt": "User asks: Find all orders for customer John Smith"
    },
    "duration_ms": 760,
    "error": null,
    "result": "I'll help you find customer information. Let me search the database."
  }
}

Storage location: ~/.agent-recorder/runs/<run_id>.jsonl
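Because the format is plain JSONL, you can also poke at a run with nothing but the standard library. A quick sketch (substitute a real run ID from your own runs directory):

import json
from pathlib import Path

run_file = Path.home() / ".agent-recorder" / "runs" / "20260103_192705_c2207bde.jsonl"

with open(run_file) as f:
    for line in f:
        event = json.loads(line)
        data = event.get("data", {})
        print(event["type"], data.get("function_name", ""), data.get("duration_ms", ""))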

3. Event Types

Agent Recorder tracks 5 event types:

  1. run_start - Marks the beginning of a run
   {
     "type": "run_start",
     "data": {
       "name": "customer-service-agent",
       "run_id": "20260103_192705_c2207bde",
       "timestamp": "2026-01-03T19:27:05.337192"
     }
   }
  2. llm_call - LLM function execution
   {
     "type": "llm_call",
     "data": {
       "function_name": "call_llm",
       "args": {"prompt": "..."},
       "result": "...",
       "duration_ms": 1234
     }
   }
  3. tool_call - Tool function execution
   {
     "type": "tool_call",
     "data": {
       "function_name": "search_database",
       "args": {"query": "..."},
       "result": [...],
       "duration_ms": 340
     }
   }
  4. error - Exception that occurred
   {
     "type": "error",
     "data": {
       "error_type": "ValueError",
       "message": "Customer not found",
       "traceback": "..."
     }
   }
  5. run_end - Marks completion (optional in v0.1.1)

4. Async Support

The same decorators work seamlessly with async functions:

import asyncio
import httpx
# openai_async below stands in for an async OpenAI-style client configured elsewhere

@llm_call(run_name="async-agent")
async def call_llm_async(prompt):
    response = await openai_async.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

@tool_call(run_name="async-agent")
async def fetch_weather(city):
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://api.weather.com/{city}")
        return response.json()

# Use with asyncio
async def main():
    result = await call_llm_async("What's the weather in SF?")
    weather = await fetch_weather("San Francisco")

asyncio.run(main())

Agent Recorder detects if your function is a coroutine and handles it appropriately.
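The usual way to support both in a single decorator is a coroutine check, roughly like this (a generic sketch of the pattern, not Agent Recorder's exact code; it just prints the duration instead of writing events):

import functools
import inspect
import time

def timed(func):
    # Pick an async or sync wrapper depending on what we're decorating
    if inspect.iscoroutinefunction(func):
        @functools.wraps(func)
        async def async_wrapper(*args, **kwargs):
            start = time.time()
            try:
                return await func(*args, **kwargs)
            finally:
                print(f"{func.__name__} took {(time.time() - start) * 1000:.0f}ms")
        return async_wrapper

    @functools.wraps(func)
    def sync_wrapper(*args, **kwargs):
        start = time.time()
        try:
            return func(*args, **kwargs)
        finally:
            print(f"{func.__name__} took {(time.time() - start) * 1000:.0f}ms")
    return sync_wrapper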

5. Web Viewer Architecture

The viewer is a self-contained HTML file with:

  • No external dependencies (no CDN calls)
  • Vanilla JavaScript for parsing JSONL
  • CSS for the timeline UI
  • Syntax highlighting for JSON data
  • Collapsible event cards
  • Search and filter capabilities

When you run agent-recorder view latest, it:

  1. Finds the latest run in ~/.agent-recorder/runs/
  2. Starts a local HTTP server (default port 8765)
  3. Serves the HTML viewer + JSONL data
  4. Opens your browser to http://localhost:8765/runs/<run_id>.html

Everything stays local. No data leaves your machine.
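If you ever want to eyeball the raw data without the bundled viewer, that flow is easy to approximate by hand: serve the runs directory locally and open the newest file. A rough sketch (this only exposes the raw JSONL, not the HTML timeline UI):

import webbrowser
from http.server import HTTPServer, SimpleHTTPRequestHandler
from pathlib import Path

RUNS_DIR = Path.home() / ".agent-recorder" / "runs"
PORT = 8765

# Most recently modified run file stands in for "latest"
latest = max(RUNS_DIR.glob("*.jsonl"), key=lambda p: p.stat().st_mtime)

handler = lambda *args, **kwargs: SimpleHTTPRequestHandler(
    *args, directory=str(RUNS_DIR), **kwargs
)
server = HTTPServer(("127.0.0.1", PORT), handler)
webbrowser.open(f"http://localhost:{PORT}/{latest.name}")
server.serve_forever()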

Real-World Use Cases

Let me show you how Agent Recorder solves actual problems I've encountered.

Use Case 1: Debugging Infinite Loops

The Problem: Agent keeps calling the same tool over and over.

Without Agent Recorder:

[DEBUG] Calling search_database with query: customer_name='John'
[DEBUG] Got 0 results
[DEBUG] Calling LLM...
[DEBUG] LLM says: Let me search again
[DEBUG] Calling search_database with query: customer_name='John'
[DEBUG] Got 0 results
[DEBUG] Calling LLM...
[DEBUG] LLM says: Let me search again
... (500 more lines)

You have to manually count log lines and realize it's looping.

With Agent Recorder:

Open the timeline and immediately see:

1. llm_call - "Find customer John"
2. tool_call - search_database(query="customer_name='John'") → []
3. llm_call - "I got no results, let me try again"
4. tool_call - search_database(query="customer_name='John'") → []
5. llm_call - "I got no results, let me try again"
6. tool_call - search_database(query="customer_name='John'") → []
... (pattern visible immediately)

The fix: The database query is wrong (should be customer_name='John Smith'). Also, the LLM needs explicit instruction to stop after 1 failed attempt.

Time saved: 2 hours → 5 minutes
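If you want to catch loops without even opening the viewer, the structured events make it a few lines of Python. A sketch over an exported run (the export command is shown in Use Case 3) that flags repeated identical tool calls:

import json
from collections import Counter

with open("run.json") as f:
    events = json.load(f)["events"]

# Count (function, args) pairs for tool calls; repeats are a loop smell
calls = Counter(
    (e["data"]["function_name"], json.dumps(e["data"]["args"], sort_keys=True))
    for e in events
    if e["type"] == "tool_call"
)

for (func, args), count in calls.most_common(5):
    if count > 1:
        print(f"{func} called {count}x with identical args: {args}")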

Use Case 2: Performance Optimization

The Problem: Agent is slow but you don't know which part.

With Agent Recorder:

Look at the timeline durations:

1. llm_call - 1.2s ⚡ (acceptable)
2. tool_call - search_database - 3.8s 🐌 (SLOW!)
3. tool_call - get_orders - 0.4s ⚡
4. llm_call - 0.9s ⚡

The fix: Add a database index on customer_name. Duration drops to 0.2s.

Result: Total execution time: 6.3s → 2.7s (57% faster)
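The same duration data works for quick profiling. Once a run is exported to JSON (see the next use case), ranking the slowest functions is a short sketch:

import json
from collections import defaultdict

with open("run.json") as f:
    events = json.load(f)["events"]

# Total time per function across the whole run
totals = defaultdict(int)
for e in events:
    if e["type"] in ("llm_call", "tool_call"):
        totals[e["data"]["function_name"]] += e["data"]["duration_ms"]

for name, ms in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{name:25s} {ms:>8d} ms")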

Use Case 3: Token Usage Optimization

The Problem: High API usage, unclear why.

With Agent Recorder:

Export the run to JSON:

agent-recorder export <run_id> -o run.json

Write a quick script to analyze:

import json

total_prompt_length = 0
total_calls = 0

with open('run.json') as f:
    data = json.load(f)
    for event in data['events']:
        if event['type'] == 'llm_call':
            total_calls += 1
            prompt = event['data']['args'].get('prompt', '')
            total_prompt_length += len(prompt)

print(f"Total LLM calls: {total_calls}")
print(f"Average prompt length: {total_prompt_length / total_calls}")

Discovery: One LLM call had a 5000-character prompt that included the entire knowledge base unnecessarily.

The fix: Pass only relevant excerpts to the LLM. Token usage drops significantly.
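What that fix looks like in practice: select a handful of relevant chunks before building the prompt instead of dumping everything in. A naive sketch (the keyword-overlap scoring and sample knowledge base are purely illustrative; a real system would use embeddings or a retriever):

knowledge_base = [
    "Refund policy: orders can be refunded within 30 days.",
    "Shipping: standard delivery takes 3-5 business days.",
    "Order lookup: orders are keyed by customer_id in the orders table.",
]

def relevant_excerpts(question, docs, top_k=2):
    # Rank documents by how many question words they share
    q_words = set(question.lower().split())
    return sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)[:top_k]

question = "Find all orders for customer John Smith"
context = "\n".join(relevant_excerpts(question, knowledge_base))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"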

Use Case 4: Comparing Runs

The Problem: "It worked yesterday, now it's broken."

With Agent Recorder:

# List all runs
agent-recorder list

# Output:
# 20260102_143022_abc123  customer-agent  2026-01-02 14:30:22 (working)
# 20260103_192705_c2207b  customer-agent  2026-01-03 19:27:05 (broken)

# Export both
agent-recorder export 20260102_143022_abc123 -o working.json
agent-recorder export 20260103_192705_c2207b -o broken.json

# Compare with diff tool or custom script

Discovery: In the broken version, a new validation step was added that always returns empty results.

Time saved: 4 hours → 15 minutes
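A crude but effective way to compare the two exports is to diff just the sequence of calls. A sketch using the files exported above:

import difflib
import json

def call_sequence(path):
    with open(path) as f:
        events = json.load(f)["events"]
    return [
        f"{e['type']}:{e['data']['function_name']}"
        for e in events
        if e["type"] in ("llm_call", "tool_call")
    ]

diff = difflib.unified_diff(
    call_sequence("working.json"),
    call_sequence("broken.json"),
    fromfile="working", tofile="broken", lineterm="",
)
print("\n".join(diff))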

Use Case 5: Onboarding New Team Members

The Problem: "How does this agent work?"

With Agent Recorder:

Run a sample execution:

python examples/customer_service_agent.py
agent-recorder view latest

Show them the timeline. They instantly understand:

  1. Agent asks LLM to parse the query
  2. LLM decides which tools to call
  3. Agent executes tools (database, API calls)
  4. LLM synthesizes the response

No documentation needed. The timeline is living documentation.

Comparing Approaches

Let me compare different debugging approaches with a real scenario:

Scenario: Debug why customer order lookup fails for "John Smith"

Approach 1: Print Statements

def find_orders(customer_name):
    print(f"DEBUG: Looking for {customer_name}")
    customers = search_customers(customer_name)
    print(f"DEBUG: Found {len(customers)} customers")
    if not customers:
        print("DEBUG: No customers found, returning empty")
        return []
    print(f"DEBUG: Getting orders for {customers[0]['id']}")
    orders = get_orders(customers[0]['id'])
    print(f"DEBUG: Got {len(orders)} orders")
    return orders

Time to find bug: 30-60 minutes
Lines of debug code: 15-20
After fixing: Remove all print statements
If it breaks again: Add them all back

Approach 2: Logging Framework

import logging
logger = logging.getLogger(__name__)

def find_orders(customer_name):
    logger.info(f"Looking for customer: {customer_name}")
    customers = search_customers(customer_name)
    logger.info(f"Found {len(customers)} customers")
    if not customers:
        logger.warning("No customers found")
        return []
    logger.info(f"Getting orders for customer {customers[0]['id']}")
    orders = get_orders(customers[0]['id'])
    logger.info(f"Retrieved {len(orders)} orders")
    return orders

Time to find bug: 20-30 minutes
Lines of debug code: 20-25 (permanent overhead)
After fixing: Logs stay (clutter over time)
Visualization: Still just text in a file

Approach 3: Cloud Observability (e.g., DataDog)

from ddtrace import tracer

@tracer.wrap()
def find_orders(customer_name):
    with tracer.trace("search_customers"):
        customers = search_customers(customer_name)
    with tracer.trace("get_orders"):
        if customers:
            orders = get_orders(customers[0]['id'])
            return orders
    return []

Time to find bug: 10-15 minutes
Setup time: 2-3 hours (SDK, config, account)
Ongoing: Monthly subscription
Security: Data sent to third-party
Lines of instrumentation: 15-20

Approach 4: Agent Recorder

from agent_recorder import tool_call

@tool_call(run_name="order-lookup")
def find_orders(customer_name):
    customers = search_customers(customer_name)
    if not customers:
        return []
    orders = get_orders(customers[0]['id'])
    return orders

@tool_call(run_name="order-lookup")
def search_customers(name):
    return db.query(f"SELECT * FROM customers WHERE name = '{name}'")

@tool_call(run_name="order-lookup")
def get_orders(customer_id):
    return db.query(f"SELECT * FROM orders WHERE customer_id = {customer_id}")

Time to find bug: 5-10 minutes
Setup time: 30 seconds (pip install)
Ongoing: Free
Security: All data local
Lines of instrumentation: 3 decorators
After fixing: Decorators stay (useful for future debugging)

Winner: Agent Recorder provides the best balance of simplicity, effectiveness, and privacy.

Building Production-Ready Agents

Agent Recorder isn't just for debugging - it's essential for production agents.

1. Handling Sensitive Data

Don't log API keys or personal information:

@llm_call(run_name="secure-agent", capture_args=False)
def call_llm_with_key(api_key: str, prompt: str):
    # api_key won't be logged
    return openai.chat.completions.create(
        api_key=api_key,
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    ).choices[0].message.content

@tool_call(run_name="secure-agent", capture_result=False)
def fetch_user_pii(user_id: str):
    # Result won't be logged (but function call and args will)
    return db.get_user_sensitive_info(user_id)

2. Custom Storage Location

For production deployments:

@llm_call(
    run_name="prod-agent",
    storage_dir="/var/log/agent-recorder"
)
def call_llm(prompt):
    ...  # your existing LLM call, unchanged

3. Cleanup Old Runs

Keep disk usage under control:

# Delete runs older than 7 days
agent-recorder cleanup --older-than 7d

# Dry run to see what would be deleted
agent-recorder cleanup --older-than 7d --dry-run
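If you'd rather handle this from a cron job or your own script, deleting old run files by modification time does the same job. A sketch against the default storage path:

import time
from pathlib import Path

RUNS_DIR = Path.home() / ".agent-recorder" / "runs"
MAX_AGE_DAYS = 7

cutoff = time.time() - MAX_AGE_DAYS * 24 * 60 * 60
for run_file in RUNS_DIR.glob("*.jsonl"):
    if run_file.stat().st_mtime < cutoff:
        print(f"Deleting {run_file.name}")
        run_file.unlink()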

4. Automated Analysis

Export and analyze runs programmatically:

import json
import os
from pathlib import Path

def analyze_run(run_id):
    # Export to JSON
    export_path = Path(f"/tmp/{run_id}.json")
    os.system(f"agent-recorder export {run_id} -o {export_path}")

    # Load and analyze
    with open(export_path) as f:
        data = json.load(f)

    stats = {
        'total_llm_calls': 0,
        'total_tool_calls': 0,
        'total_duration': 0,
        'errors': []
    }

    for event in data['events']:
        if event['type'] == 'llm_call':
            stats['total_llm_calls'] += 1
            stats['total_duration'] += event['data']['duration_ms']
        elif event['type'] == 'tool_call':
            stats['total_tool_calls'] += 1
            stats['total_duration'] += event['data']['duration_ms']
        elif event['type'] == 'error':
            stats['errors'].append(event['data'])

    return stats

The Road Ahead

Agent Recorder v0.1.1 is just the beginning. Here's what's coming:

v0.2.0 - Enhanced Visualization (Planned)

  • Tree/Graph View: See nested calls as a visual tree
  • Token Counting: Automatic token counting for OpenAI/Anthropic
  • Cost Estimation: Calculate API costs for each run
  • Performance Metrics: Identify bottlenecks automatically
  • Export Formats: PDF, HTML, CSV for reports

v0.3.0 - Framework Integrations (Planned)

  • LangChain Adapter: Auto-instrument LangChain agents
  • LlamaIndex Adapter: Seamless integration with LlamaIndex
  • AutoGen Support: Track multi-agent conversations
  • CrewAI Integration: Monitor crew workflows

v0.4.0 - Advanced Features (Planned)

  • Real-time Streaming: Watch agent execution live
  • Multi-agent Support: Track multiple agents interacting
  • Diff View: Compare two runs side-by-side
  • Custom Events: Log your own event types
  • Plugin System: Extend with custom visualizations

v0.5.0 - Language Ports (Community Welcome!)

  • TypeScript/Node.js SDK: For JavaScript agents
  • Go SDK: For Go-based agents
  • Rust SDK: For high-performance agents

Want to contribute? Check out the GitHub repo for good first issues!

Conclusion: Observability Is Not Optional

As AI agents move from prototypes to production, observability isn't a nice-to-have - it's essential.

You can't optimize what you can't measure.
You can't debug what you can't see.
You can't trust what you can't verify.

Agent Recorder gives you that visibility with:

  • ✅ Two simple decorators
  • ✅ Zero configuration
  • ✅ Local-first architecture
  • ✅ Framework-agnostic design
  • ✅ Beautiful visualization
  • ✅ Free and open source

Get Started Today

# Clone and install
git clone https://github.com/yourusername/agent-recorder.git
cd agent-recorder
pip install -e .

# Try the example
python examples/simple_agent.py

# View the recording
agent-recorder view latest

GitHub: https://github.com/yourusername/agent-recorder
License: MIT
Docs: See README.md for full documentation

Join the Community

Agent Recorder is open source and built for the community. Whether you:

  • Found a bug → Open an issue
  • Have a feature idea → Start a discussion
  • Want to contribute → Submit a PR
  • Built something cool → Share your story

We're building the future of agent observability together.

Star the repo if you find it useful - it helps others discover the project!
