Stop Print Debugging Your AI Agents: A Deep Dive into Agent Observability

The Invisible Agent Problem

It's 2 AM. Your AI agent just went into an infinite loop consuming API credits. Again.

You've built what should be a simple customer service agent:

  1. Parse user question
  2. Search knowledge base
  3. Query database if needed
  4. Format response
  5. Maybe escalate to human support

Simple, right? Except somewhere in those 5 steps, your agent:

  • Called the same database query 15 times
  • Got stuck in a loop asking the LLM to "try again"
  • Hallucinated data that doesn't exist
  • Crashed with a cryptic error in step 4

And you have no idea which one until you start debugging.

The Print Statement Spiral

So you do what every developer does. You add logging:

import time
import openai
# `db` below is assumed to be an already-configured database client

def call_llm(prompt):
    print(f"[DEBUG] Calling LLM with: {prompt[:50]}...")
    start = time.time()
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    print(f"[DEBUG] LLM took {time.time() - start:.2f}s")
    result = response.choices[0].message.content
    print(f"[DEBUG] Got response: {result[:50]}...")
    return result

def search_database(query):
    print(f"[DEBUG] Searching DB: {query}")
    results = db.query(query)
    print(f"[DEBUG] Found {len(results)} results")
    return results

def get_customer_info(customer_id):
    print(f"[DEBUG] Getting customer {customer_id}")
    customer = db.get(customer_id)
    print(f"[DEBUG] Customer: {customer.get('name', 'Unknown')}")
    return customer

An hour later, your terminal looks like this:

[DEBUG] Calling LLM with: Find all orders for customer John Smith...
[DEBUG] LLM took 1.23s
[DEBUG] Got response: I'll search for that customer...
[DEBUG] Searching DB: customer_name=John Smith
[DEBUG] Found 2 results
[DEBUG] Getting customer 123
[DEBUG] Customer: John Smith
[DEBUG] Calling LLM with: Here are the customer details: {'id': 123...
[DEBUG] LLM took 0.87s
[DEBUG] Got response: Let me get their orders...
[DEBUG] Searching DB: orders WHERE customer_id=123
[DEBUG] Found 3 results
[DEBUG] Calling LLM with: Here are the orders: [{'id': 1001, 'to...
[DEBUG] LLM took 1.45s
[DEBUG] Got response: The customer has 3 orders...

You're staring at hundreds of lines of logs trying to answer basic questions:

  • How many times did we call the LLM?
  • What was the total execution time?
  • Which step failed?
  • What were the actual arguments passed to each function?
  • When did it start looping?

This is not sustainable.

The Real Cost of Poor Observability

Let me share some real numbers from my experience building AI agents:

Time Spent Debugging:

  • Print debugging: 2-4 hours per bug
  • Adding proper logging: 30 minutes per function
  • Actually finding the bug: 15 minutes
  • Total: 3-5 hours for issues that should take 15 minutes

Developer Frustration:

  • Losing context between debugging sessions
  • Unable to reproduce issues
  • No way to compare "working" vs "broken" runs
  • Every new team member asks: "How do I debug this?"

API Inefficiency:

  • Agents making 3x more API calls than necessary
  • Inefficient prompts using excessive tokens
  • Unable to identify performance bottlenecks

We've spent decades building amazing developer tools for web apps, mobile apps, and backend services. But for AI agents? We're back to print() statements like it's 1995.

Why Current Solutions Fall Short

Before building Agent Recorder, I tried everything:

1. Standard Logging Libraries

import logging

logger = logging.getLogger(__name__)

def call_llm(prompt):
    logger.info(f"Calling LLM with prompt: {prompt}")
    response = llm.invoke(prompt)
    logger.info(f"Got response: {response}")
    return response

Problems:

  • Still just text logs in a file
  • No structure, no visualization
  • Manual instrumentation everywhere
  • Hard to correlate across async calls
  • No timing information without extra code

2. Cloud Observability Tools (DataDog, New Relic, etc.)

Problems:

  • Expensive for small teams and individuals
  • Send your prompts/responses to third-party servers (security issue)
  • Heavy SDKs that bloat your dependencies
  • Designed for traditional apps, not agent workflows
  • Over-engineered for "just see what my agent did"

3. LLM Provider Dashboards (OpenAI, Anthropic)

Problems:

  • Only see LLM calls, not your tool calls
  • No local context (what led to this call?)
  • Delayed (not real-time)
  • Can't see your custom logic
  • Vendor lock-in

4. Framework-Specific Tools (LangSmith for LangChain)

Problems:

  • Only works with that framework
  • Requires rewriting code to use their patterns
  • Still cloud-based with subscription fees
  • What if you use raw APIs or multiple frameworks?

What I needed was simple:

  • See every LLM call and tool call
  • Local storage (my data, my machine)
  • Framework-agnostic (works with anything)
  • Minimal code changes
  • Beautiful visualization
  • Free and open source

That tool didn't exist. So I built it.

Introducing Agent Recorder

Agent Recorder is Redux DevTools for AI agents. If you've ever used Redux DevTools for React development, you know the power of seeing every action, every state change, with the ability to inspect, time-travel, and understand your application flow.

Now imagine that, but for your AI agent's execution.

The Two-Decorator Solution

Here's all you need to add to your code:

from agent_recorder import llm_call, tool_call

@llm_call(run_name="customer-service-agent")
def call_llm(prompt):
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

@tool_call(run_name="customer-service-agent")
def search_database(query):
    results = db.query(query)
    return results

@tool_call(run_name="customer-service-agent")
def get_customer_orders(customer_id):
    orders = db.query(f"SELECT * FROM orders WHERE customer_id = {customer_id}")
    return orders

That's it. No context managers, no complex setup, no configuration files.

The run_name parameter groups related calls together. All functions decorated with run_name="customer-service-agent" will be recorded in the same timeline.

What Gets Captured Automatically

Every decorated function automatically logs:

  1. Function name - What was called
  2. Arguments - All input parameters with their values
  3. Return value - Complete output from the function
  4. Duration - Execution time in milliseconds
  5. Timestamp - Exact time of invocation
  6. Errors - Full exception details if it failed
  7. Parent tracking - For nested function calls

No manual annotation needed. Just add the decorator.

Running Your Agent

Use your functions exactly as before:

# This is your agent logic - unchanged!
user_question = "Find all orders for customer John Smith"

# Step 1: Ask LLM to understand the query
intent = call_llm(f"User asks: {user_question}")

# Step 2: Search for the customer
customers = search_database("customer_name='John Smith'")

# Step 3: Get their orders
if customers:
    customer = customers[0]
    orders = get_customer_orders(customer['id'])

    # Step 4: Summarize results
    summary = call_llm(f"Summarize these orders: {orders}")
    print(summary)

Everything is being recorded in the background.

Viewing the Timeline

When your agent finishes (or crashes), run:

agent-recorder view latest

Your browser opens to a beautiful web-based timeline showing the complete execution flow.

How It Works: Technical Deep Dive

Let me walk you through the architecture and implementation details.

1. Decorator-Based Instrumentation

When you write:

@llm_call(run_name="my-agent")
def call_llm(prompt):
    return "response"

Here's what happens under the hood:

  1. Registry Lookup: Agent Recorder checks if a Recorder instance exists for "my-agent"
  2. Auto-Creation: If not, it creates one with a unique run ID (timestamp + UUID)
  3. Function Wrapping: Your function gets wrapped with timing and logging logic
  4. Execution: When called, it captures args, executes the function, captures the result
  5. Event Writing: Writes a structured event to a JSONL file immediately

The actual implementation:

def llm_call(run_name: str, name: Optional[str] = None,
             capture_args: bool = True, capture_result: bool = True):
    # Get or create a Recorder instance for this run_name
    recorder = _get_or_create_recorder(run_name)

    # Return the actual decorator that wraps your function
    return recorder.llm_call(name=name, capture_args=capture_args,
                            capture_result=capture_result)
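The wrapping logic itself isn't shown here, but conceptually it's the familiar timing-decorator pattern: capture the arguments, run the function, capture the result or exception, and append an event. Here's a minimal sketch of that idea, not the library's actual code; the _write_event helper is an assumption, while the event fields mirror the example shown in the next section:

import functools
import json
import time
import uuid
from datetime import datetime
from pathlib import Path

RUNS_DIR = Path.home() / ".agent-recorder" / "runs"

def _write_event(run_id, event):
    # Append one JSON object per line so every event up to a crash is preserved
    path = RUNS_DIR / f"{run_id}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")

def recorded_call(run_id, event_type="llm_call"):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            error = result = None
            try:
                result = func(*args, **kwargs)
                return result
            except Exception as exc:
                error = repr(exc)
                raise
            finally:
                _write_event(run_id, {
                    "run_id": run_id,
                    "event_id": str(uuid.uuid4()),
                    "timestamp": datetime.now().isoformat(),
                    "type": event_type,
                    "data": {
                        "function_name": func.__name__,
                        "args": {"args": repr(args), "kwargs": repr(kwargs)},
                        "duration_ms": int((time.time() - start) * 1000),
                        "error": error,
                        "result": repr(result),
                    },
                })
        return wrapper
    return decorator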

2. Event Storage Format

All events are stored as JSONL (JSON Lines) - one JSON object per line. This format is:

  • Streamable: Can write events as they happen
  • Parseable: Easy to read line-by-line
  • Crash-resistant: If your program crashes, all events up to that point are saved
  • Tooling-friendly: Standard format used by many data tools

Example event:

{
  "run_id": "20260103_192705_c2207bde",
  "event_id": "4f85a880-2ab7-45bf-a0ba-9c776581a5de",
  "timestamp": "2026-01-03T19:27:06.097562",
  "type": "llm_call",
  "parent_id": null,
  "data": {
    "function_name": "call_llm",
    "args": {
      "prompt": "User asks: Find all orders for customer John Smith"
    },
    "duration_ms": 760,
    "error": null,
    "result": "I'll help you find customer information. Let me search the database."
  }
}

Storage location: ~/.agent-recorder/runs/<run_id>.jsonl
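Because the format is plain JSONL, you can also poke at a run with nothing but the standard library. A quick sketch (substitute a real run ID from your own runs directory):

import json
from pathlib import Path

run_file = Path.home() / ".agent-recorder" / "runs" / "20260103_192705_c2207bde.jsonl"

with open(run_file) as f:
    for line in f:
        event = json.loads(line)
        data = event.get("data", {})
        print(event["type"], data.get("function_name", ""), data.get("duration_ms", ""))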

3. Event Types

Agent Recorder tracks 5 event types:

  1. run_start - Marks the beginning of a run
   {
     "type": "run_start",
     "data": {
       "name": "customer-service-agent",
       "run_id": "20260103_192705_c2207bde",
       "timestamp": "2026-01-03T19:27:05.337192"
     }
   }
  2. llm_call - LLM function execution
   {
     "type": "llm_call",
     "data": {
       "function_name": "call_llm",
       "args": {"prompt": "..."},
       "result": "...",
       "duration_ms": 1234
     }
   }
  3. tool_call - Tool function execution
   {
     "type": "tool_call",
     "data": {
       "function_name": "search_database",
       "args": {"query": "..."},
       "result": [...],
       "duration_ms": 340
     }
   }
  4. error - Exception that occurred
   {
     "type": "error",
     "data": {
       "error_type": "ValueError",
       "message": "Customer not found",
       "traceback": "..."
     }
   }
  5. run_end - Marks completion (optional in v0.1.1)

4. Async Support

The same decorators work seamlessly with async functions:

import asyncio
import httpx
# openai_async below stands in for an async OpenAI-style client configured elsewhere

@llm_call(run_name="async-agent")
async def call_llm_async(prompt):
    response = await openai_async.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

@tool_call(run_name="async-agent")
async def fetch_weather(city):
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://api.weather.com/{city}")
        return response.json()

# Use with asyncio
async def main():
    result = await call_llm_async("What's the weather in SF?")
    weather = await fetch_weather("San Francisco")

asyncio.run(main())

Agent Recorder detects if your function is a coroutine and handles it appropriately.
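The usual way to support both in a single decorator is a coroutine check, roughly like this (a generic sketch of the pattern, not Agent Recorder's exact code; it just prints the duration instead of writing events):

import functools
import inspect
import time

def timed(func):
    # Pick an async or sync wrapper depending on what we're decorating
    if inspect.iscoroutinefunction(func):
        @functools.wraps(func)
        async def async_wrapper(*args, **kwargs):
            start = time.time()
            try:
                return await func(*args, **kwargs)
            finally:
                print(f"{func.__name__} took {(time.time() - start) * 1000:.0f}ms")
        return async_wrapper

    @functools.wraps(func)
    def sync_wrapper(*args, **kwargs):
        start = time.time()
        try:
            return func(*args, **kwargs)
        finally:
            print(f"{func.__name__} took {(time.time() - start) * 1000:.0f}ms")
    return sync_wrapper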

5. Web Viewer Architecture

The viewer is a self-contained HTML file with:

  • No external dependencies (no CDN calls)
  • Vanilla JavaScript for parsing JSONL
  • CSS for the timeline UI
  • Syntax highlighting for JSON data
  • Collapsible event cards
  • Search and filter capabilities

When you run agent-recorder view latest, it:

  1. Finds the latest run in ~/.agent-recorder/runs/
  2. Starts a local HTTP server (default port 8765)
  3. Serves the HTML viewer + JSONL data
  4. Opens your browser to http://localhost:8765/runs/<run_id>.html

Everything stays local. No data leaves your machine.
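If you ever want to eyeball the raw data without the bundled viewer, that flow is easy to approximate by hand: serve the runs directory locally and open the newest file. A rough sketch (this only exposes the raw JSONL, not the HTML timeline UI):

import webbrowser
from http.server import HTTPServer, SimpleHTTPRequestHandler
from pathlib import Path

RUNS_DIR = Path.home() / ".agent-recorder" / "runs"
PORT = 8765

# Most recently modified run file stands in for "latest"
latest = max(RUNS_DIR.glob("*.jsonl"), key=lambda p: p.stat().st_mtime)

handler = lambda *args, **kwargs: SimpleHTTPRequestHandler(
    *args, directory=str(RUNS_DIR), **kwargs
)
server = HTTPServer(("127.0.0.1", PORT), handler)
webbrowser.open(f"http://localhost:{PORT}/{latest.name}")
server.serve_forever()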

Real-World Use Cases

Let me show you how Agent Recorder solves actual problems I've encountered.

Use Case 1: Debugging Infinite Loops

The Problem: Agent keeps calling the same tool over and over.

Without Agent Recorder:

[DEBUG] Calling search_database with query: customer_name='John'
[DEBUG] Got 0 results
[DEBUG] Calling LLM...
[DEBUG] LLM says: Let me search again
[DEBUG] Calling search_database with query: customer_name='John'
[DEBUG] Got 0 results
[DEBUG] Calling LLM...
[DEBUG] LLM says: Let me search again
... (500 more lines)

You have to manually count log lines and realize it's looping.

With Agent Recorder:

Open the timeline and immediately see:

1. llm_call - "Find customer John"
2. tool_call - search_database(query="customer_name='John'") → []
3. llm_call - "I got no results, let me try again"
4. tool_call - search_database(query="customer_name='John'") → []
5. llm_call - "I got no results, let me try again"
6. tool_call - search_database(query="customer_name='John'") → []
... (pattern visible immediately)

The fix: The database query is wrong (should be customer_name='John Smith'). Also, the LLM needs explicit instruction to stop after 1 failed attempt.

Time saved: 2 hours → 5 minutes
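If you want to catch loops without even opening the viewer, the structured events make it a few lines of Python. A sketch over an exported run (the export command is shown in Use Case 3) that flags repeated identical tool calls:

import json
from collections import Counter

with open("run.json") as f:
    events = json.load(f)["events"]

# Count (function, args) pairs for tool calls; repeats are a loop smell
calls = Counter(
    (e["data"]["function_name"], json.dumps(e["data"]["args"], sort_keys=True))
    for e in events
    if e["type"] == "tool_call"
)

for (func, args), count in calls.most_common(5):
    if count > 1:
        print(f"{func} called {count}x with identical args: {args}")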

Use Case 2: Performance Optimization

The Problem: Agent is slow but you don't know which part.

With Agent Recorder:

Look at the timeline durations:

1. llm_call - 1.2s ⚡ (acceptable)
2. tool_call - search_database - 3.8s 🐌 (SLOW!)
3. tool_call - get_orders - 0.4s ⚡
4. llm_call - 0.9s ⚡

The fix: Add a database index on customer_name. Duration drops to 0.2s.

Result: Total execution time: 6.3s → 2.7s (57% faster)
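The same duration data works for quick profiling. Once a run is exported to JSON (see the next use case), ranking the slowest functions is a short sketch:

import json
from collections import defaultdict

with open("run.json") as f:
    events = json.load(f)["events"]

# Total time per function across the whole run
totals = defaultdict(int)
for e in events:
    if e["type"] in ("llm_call", "tool_call"):
        totals[e["data"]["function_name"]] += e["data"]["duration_ms"]

for name, ms in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{name:25s} {ms:>8d} ms")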

Use Case 3: Token Usage Optimization

The Problem: High API usage, unclear why.

With Agent Recorder:

Export the run to JSON:

agent-recorder export <run_id> -o run.json

Write a quick script to analyze:

import json

total_prompt_length = 0
total_calls = 0

with open('run.json') as f:
    data = json.load(f)
    for event in data['events']:
        if event['type'] == 'llm_call':
            total_calls += 1
            prompt = event['data']['args'].get('prompt', '')
            total_prompt_length += len(prompt)

print(f"Total LLM calls: {total_calls}")
print(f"Average prompt length: {total_prompt_length / total_calls}")

Discovery: One LLM call had a 5000-character prompt that included the entire knowledge base unnecessarily.

The fix: Pass only relevant excerpts to the LLM. Token usage drops significantly.
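What that fix looks like in practice: select a handful of relevant chunks before building the prompt instead of dumping everything in. A naive sketch (the keyword-overlap scoring and sample knowledge base are purely illustrative; a real system would use embeddings or a retriever):

knowledge_base = [
    "Refund policy: orders can be refunded within 30 days.",
    "Shipping: standard delivery takes 3-5 business days.",
    "Order lookup: orders are keyed by customer_id in the orders table.",
]

def relevant_excerpts(question, docs, top_k=2):
    # Rank documents by how many question words they share
    q_words = set(question.lower().split())
    return sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)[:top_k]

question = "Find all orders for customer John Smith"
context = "\n".join(relevant_excerpts(question, knowledge_base))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"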

Use Case 4: Comparing Runs

The Problem: "It worked yesterday, now it's broken."

With Agent Recorder:

# List all runs
agent-recorder list

# Output:
# 20260102_143022_abc123  customer-agent  2026-01-02 14:30:22 (working)
# 20260103_192705_c2207b  customer-agent  2026-01-03 19:27:05 (broken)

# Export both
agent-recorder export 20260102_143022_abc123 -o working.json
agent-recorder export 20260103_192705_c2207b -o broken.json

# Compare with diff tool or custom script

Discovery: In the broken version, a new validation step was added that always returns empty results.

Time saved: 4 hours → 15 minutes
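A crude but effective way to compare the two exports is to diff just the sequence of calls. A sketch using the files exported above:

import difflib
import json

def call_sequence(path):
    with open(path) as f:
        events = json.load(f)["events"]
    return [
        f"{e['type']}:{e['data']['function_name']}"
        for e in events
        if e["type"] in ("llm_call", "tool_call")
    ]

diff = difflib.unified_diff(
    call_sequence("working.json"),
    call_sequence("broken.json"),
    fromfile="working", tofile="broken", lineterm="",
)
print("\n".join(diff))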

Use Case 5: Onboarding New Team Members

The Problem: "How does this agent work?"

With Agent Recorder:

Run a sample execution:

python examples/customer_service_agent.py
agent-recorder view latest

Show them the timeline. They instantly understand:

  1. Agent asks LLM to parse the query
  2. LLM decides which tools to call
  3. Agent executes tools (database, API calls)
  4. LLM synthesizes the response

No documentation needed. The timeline is living documentation.

Comparing Approaches

Let me compare different debugging approaches with a real scenario:

Scenario: Debug why customer order lookup fails for "John Smith"

Approach 1: Print Statements

def find_orders(customer_name):
    print(f"DEBUG: Looking for {customer_name}")
    customers = search_customers(customer_name)
    print(f"DEBUG: Found {len(customers)} customers")
    if not customers:
        print("DEBUG: No customers found, returning empty")
        return []
    print(f"DEBUG: Getting orders for {customers[0]['id']}")
    orders = get_orders(customers[0]['id'])
    print(f"DEBUG: Got {len(orders)} orders")
    return orders

Time to find bug: 30-60 minutes
Lines of debug code: 15-20
After fixing: Remove all print statements
If it breaks again: Add them all back

Approach 2: Logging Framework

import logging
logger = logging.getLogger(__name__)

def find_orders(customer_name):
    logger.info(f"Looking for customer: {customer_name}")
    customers = search_customers(customer_name)
    logger.info(f"Found {len(customers)} customers")
    if not customers:
        logger.warning("No customers found")
        return []
    logger.info(f"Getting orders for customer {customers[0]['id']}")
    orders = get_orders(customers[0]['id'])
    logger.info(f"Retrieved {len(orders)} orders")
    return orders

Time to find bug: 20-30 minutes
Lines of debug code: 20-25 (permanent overhead)
After fixing: Logs stay (clutter over time)
Visualization: Still just text in a file

Approach 3: Cloud Observability (e.g., DataDog)

from ddtrace import tracer

@tracer.wrap()
def find_orders(customer_name):
    with tracer.trace("search_customers"):
        customers = search_customers(customer_name)
    with tracer.trace("get_orders"):
        if customers:
            orders = get_orders(customers[0]['id'])
            return orders
    return []

Time to find bug: 10-15 minutes
Setup time: 2-3 hours (SDK, config, account)
Ongoing: Monthly subscription
Security: Data sent to third-party
Lines of instrumentation: 15-20

Approach 4: Agent Recorder

from agent_recorder import tool_call

@tool_call(run_name="order-lookup")
def find_orders(customer_name):
    customers = search_customers(customer_name)
    if not customers:
        return []
    orders = get_orders(customers[0]['id'])
    return orders

@tool_call(run_name="order-lookup")
def search_customers(name):
    return db.query(f"SELECT * FROM customers WHERE name = '{name}'")

@tool_call(run_name="order-lookup")
def get_orders(customer_id):
    return db.query(f"SELECT * FROM orders WHERE customer_id = {customer_id}")

Time to find bug: 5-10 minutes
Setup time: 30 seconds (pip install)
Ongoing: Free
Security: All data local
Lines of instrumentation: 3 decorators
After fixing: Decorators stay (useful for future debugging)

Winner: Agent Recorder provides the best balance of simplicity, effectiveness, and privacy.

Building Production-Ready Agents

Agent Recorder isn't just for debugging - it's essential for production agents.

1. Handling Sensitive Data

Don't log API keys or personal information:

@llm_call(run_name="secure-agent", capture_args=False)
def call_llm_with_key(api_key: str, prompt: str):
    # api_key won't be logged
    return openai.chat.completions.create(
        api_key=api_key,
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    ).choices[0].message.content

@tool_call(run_name="secure-agent", capture_result=False)
def fetch_user_pii(user_id: str):
    # Result won't be logged (but function call and args will)
    return db.get_user_sensitive_info(user_id)

2. Custom Storage Location

For production deployments:

@llm_call(
    run_name="prod-agent",
    storage_dir="/var/log/agent-recorder"
)
def call_llm(prompt):
    ...  # your existing LLM call, unchanged

3. Cleanup Old Runs

Keep disk usage under control:

# Delete runs older than 7 days
agent-recorder cleanup --older-than 7d

# Dry run to see what would be deleted
agent-recorder cleanup --older-than 7d --dry-run
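If you'd rather handle this from a cron job or your own script, deleting old run files by modification time does the same job. A sketch against the default storage path:

import time
from pathlib import Path

RUNS_DIR = Path.home() / ".agent-recorder" / "runs"
MAX_AGE_DAYS = 7

cutoff = time.time() - MAX_AGE_DAYS * 24 * 60 * 60
for run_file in RUNS_DIR.glob("*.jsonl"):
    if run_file.stat().st_mtime < cutoff:
        print(f"Deleting {run_file.name}")
        run_file.unlink()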

4. Automated Analysis

Export and analyze runs programmatically:

import json
import os
from pathlib import Path

def analyze_run(run_id):
    # Export to JSON
    export_path = Path(f"/tmp/{run_id}.json")
    os.system(f"agent-recorder export {run_id} -o {export_path}")

    # Load and analyze
    with open(export_path) as f:
        data = json.load(f)

    stats = {
        'total_llm_calls': 0,
        'total_tool_calls': 0,
        'total_duration': 0,
        'errors': []
    }

    for event in data['events']:
        if event['type'] == 'llm_call':
            stats['total_llm_calls'] += 1
            stats['total_duration'] += event['data']['duration_ms']
        elif event['type'] == 'tool_call':
            stats['total_tool_calls'] += 1
            stats['total_duration'] += event['data']['duration_ms']
        elif event['type'] == 'error':
            stats['errors'].append(event['data'])

    return stats

The Road Ahead

Agent Recorder v0.1.1 is just the beginning. Here's what's coming:

v0.2.0 - Enhanced Visualization (Planned)

  • Tree/Graph View: See nested calls as a visual tree
  • Token Counting: Automatic token counting for OpenAI/Anthropic
  • Cost Estimation: Calculate API costs for each run
  • Performance Metrics: Identify bottlenecks automatically
  • Export Formats: PDF, HTML, CSV for reports

v0.3.0 - Framework Integrations (Planned)

  • LangChain Adapter: Auto-instrument LangChain agents
  • LlamaIndex Adapter: Seamless integration with LlamaIndex
  • AutoGen Support: Track multi-agent conversations
  • CrewAI Integration: Monitor crew workflows

v0.4.0 - Advanced Features (Planned)

  • Real-time Streaming: Watch agent execution live
  • Multi-agent Support: Track multiple agents interacting
  • Diff View: Compare two runs side-by-side
  • Custom Events: Log your own event types
  • Plugin System: Extend with custom visualizations

v0.5.0 - Language Ports (Community Welcome!)

  • TypeScript/Node.js SDK: For JavaScript agents
  • Go SDK: For Go-based agents
  • Rust SDK: For high-performance agents

Want to contribute? Check out the GitHub repo for good first issues!

Conclusion: Observability Is Not Optional

As AI agents move from prototypes to production, observability isn't a nice-to-have - it's essential.

You can't optimize what you can't measure.
You can't debug what you can't see.
You can't trust what you can't verify.

Agent Recorder gives you that visibility with:

  • ✅ Two simple decorators
  • ✅ Zero configuration
  • ✅ Local-first architecture
  • ✅ Framework-agnostic design
  • ✅ Beautiful visualization
  • ✅ Free and open source

Get Started Today

# Clone and install
git clone https://github.com/yourusername/agent-recorder.git
cd agent-recorder
pip install -e .

# Try the example
python examples/simple_agent.py

# View the recording
agent-recorder view latest

GitHub: https://github.com/yourusername/agent-recorder
License: MIT
Docs: See README.md for full documentation

Join the Community

Agent Recorder is open source and built for the community. Whether you:

  • Found a bug → Open an issue
  • Have a feature idea → Start a discussion
  • Want to contribute → Submit a PR
  • Built something cool → Share your story

We're building the future of agent observability together.

Star the repo if you find it useful - it helps others discover the project!
