Table of Contents
- The Invisible Agent Problem
- Why Current Solutions Fall Short
- Introducing Agent Recorder
- How It Works: Technical Deep Dive
- Real-World Use Cases
- Comparing Approaches
- Building Production-Ready Agents
- The Road Ahead
The Invisible Agent Problem
It's 2 AM. Your AI agent just went into an infinite loop consuming API credits. Again.
You've built what should be a simple customer service agent:
- Parse user question
- Search knowledge base
- Query database if needed
- Format response
- Maybe escalate to human support
Simple, right? Except somewhere in those 5 steps, your agent:
- Called the same database query 15 times
- Got stuck in a loop asking the LLM to "try again"
- Hallucinated data that doesn't exist
- Crashed with a cryptic error in step 4
And you have no idea which one until you start debugging.
The Print Statement Spiral
So you do what every developer does. You add logging:
import time
import openai

def call_llm(prompt):
    print(f"[DEBUG] Calling LLM with: {prompt[:50]}...")
    start = time.time()
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    print(f"[DEBUG] LLM took {time.time() - start:.2f}s")
    result = response.choices[0].message.content
    print(f"[DEBUG] Got response: {result[:50]}...")
    return result

def search_database(query):
    print(f"[DEBUG] Searching DB: {query}")
    results = db.query(query)
    print(f"[DEBUG] Found {len(results)} results")
    return results

def get_customer_info(customer_id):
    print(f"[DEBUG] Getting customer {customer_id}")
    customer = db.get(customer_id)
    print(f"[DEBUG] Customer: {customer.get('name', 'Unknown')}")
    return customer
An hour later, your terminal looks like this:
[DEBUG] Calling LLM with: Find all orders for customer John Smith...
[DEBUG] LLM took 1.23s
[DEBUG] Got response: I'll search for that customer...
[DEBUG] Searching DB: customer_name=John Smith
[DEBUG] Found 2 results
[DEBUG] Getting customer 123
[DEBUG] Customer: John Smith
[DEBUG] Calling LLM with: Here are the customer details: {'id': 123...
[DEBUG] LLM took 0.87s
[DEBUG] Got response: Let me get their orders...
[DEBUG] Searching DB: orders WHERE customer_id=123
[DEBUG] Found 3 results
[DEBUG] Calling LLM with: Here are the orders: [{'id': 1001, 'to...
[DEBUG] LLM took 1.45s
[DEBUG] Got response: The customer has 3 orders...
You're staring at hundreds of lines of logs trying to answer basic questions:
- How many times did we call the LLM?
- What was the total execution time?
- Which step failed?
- What were the actual arguments passed to each function?
- When did it start looping?
This is not sustainable.
The Real Cost of Poor Observability
Let me share some real numbers from my experience building AI agents:
Time Spent Debugging:
- Print debugging: 2-4 hours per bug
- Adding proper logging: 30 minutes per function
- Actually finding the bug: 15 minutes
- Total: 3-5 hours for issues that should take 15 minutes
Developer Frustration:
- Losing context between debugging sessions
- Unable to reproduce issues
- No way to compare "working" vs "broken" runs
- Every new team member asks: "How do I debug this?"
API Inefficiency:
- Agents making 3x more API calls than necessary
- Inefficient prompts using excessive tokens
- Unable to identify performance bottlenecks
We've spent decades building amazing developer tools for web apps, mobile apps, and backend services. But for AI agents? We're back to print() statements like it's 1995.
Why Current Solutions Fall Short
Before building Agent Recorder, I tried everything:
1. Standard Logging Libraries
import logging

logger = logging.getLogger(__name__)

def call_llm(prompt):
    logger.info(f"Calling LLM with prompt: {prompt}")
    response = llm.invoke(prompt)
    logger.info(f"Got response: {response}")
    return response
Problems:
- Still just text logs in a file
- No structure, no visualization
- Manual instrumentation everywhere
- Hard to correlate across async calls
- No timing information without extra code
2. Cloud Observability Tools (DataDog, New Relic, etc.)
Problems:
- Expensive for small teams and individuals
- Send your prompts/responses to third-party servers (security issue)
- Heavy SDKs that bloat your dependencies
- Designed for traditional apps, not agent workflows
- Over-engineered for "just see what my agent did"
3. LLM Provider Dashboards (OpenAI, Anthropic)
Problems:
- Only see LLM calls, not your tool calls
- No local context (what led to this call?)
- Delayed (not real-time)
- Can't see your custom logic
- Vendor lock-in
4. Framework-Specific Tools (LangSmith for LangChain)
Problems:
- Only works with that framework
- Requires rewriting code to use their patterns
- Still cloud-based with subscription fees
- What if you use raw APIs or multiple frameworks?
What I needed was simple:
- See every LLM call and tool call
- Local storage (my data, my machine)
- Framework-agnostic (works with anything)
- Minimal code changes
- Beautiful visualization
- Free and open source
That tool didn't exist. So I built it.
Introducing Agent Recorder
Agent Recorder is Redux DevTools for AI agents. If you've ever used Redux DevTools for React development, you know the power of seeing every action, every state change, with the ability to inspect, time-travel, and understand your application flow.
Now imagine that, but for your AI agent's execution.
The Two-Decorator Solution
Here's all you need to add to your code:
import openai
from agent_recorder import llm_call, tool_call

@llm_call(run_name="customer-service-agent")
def call_llm(prompt):
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

@tool_call(run_name="customer-service-agent")
def search_database(query):
    results = db.query(query)  # db is your existing database client
    return results

@tool_call(run_name="customer-service-agent")
def get_customer_orders(customer_id):
    orders = db.query(f"SELECT * FROM orders WHERE customer_id = {customer_id}")
    return orders
That's it. No context managers, no complex setup, no configuration files.
The run_name parameter groups related calls together. All functions decorated with run_name="customer-service-agent" will be recorded in the same timeline.
What Gets Captured Automatically
Every decorated function automatically logs:
- Function name - What was called
- Arguments - All input parameters with their values
- Return value - Complete output from the function
- Duration - Execution time in milliseconds
- Timestamp - Exact time of invocation
- Errors - Full exception details if it failed
- Parent tracking - For nested function calls
No manual annotation needed. Just add the decorator.
Running Your Agent
Use your functions exactly as before:
# This is your agent logic - unchanged!
user_question = "Find all orders for customer John Smith"

# Step 1: Ask LLM to understand the query
intent = call_llm(f"User asks: {user_question}")

# Step 2: Search for the customer
customers = search_database("customer_name='John Smith'")

# Step 3: Get their orders
if customers:
    customer = customers[0]
    orders = get_customer_orders(customer['id'])

    # Step 4: Summarize results
    summary = call_llm(f"Summarize these orders: {orders}")
    print(summary)
Everything is being recorded in the background.
Viewing the Timeline
When your agent finishes (or crashes), run:
agent-recorder view latest
Your browser opens to a beautiful web-based timeline showing the complete execution flow.
How It Works: Technical Deep Dive
Let me walk you through the architecture and implementation details.
1. Decorator-Based Instrumentation
When you write:
@llm_call(run_name="my-agent")
def call_llm(prompt):
return "response"
Here's what happens under the hood:
- Registry Lookup: Agent Recorder checks if a Recorder instance exists for "my-agent"
- Auto-Creation: If not, it creates one with a unique run ID (timestamp + UUID)
- Function Wrapping: Your function gets wrapped with timing and logging logic
- Execution: When called, it captures args, executes the function, captures the result
- Event Writing: Writes a structured event to a JSONL file immediately
The actual implementation:
def llm_call(run_name: str, name: Optional[str] = None,
             capture_args: bool = True, capture_result: bool = True):
    # Get or create a Recorder instance for this run_name
    recorder = _get_or_create_recorder(run_name)
    # Return the actual decorator that wraps your function
    return recorder.llm_call(name=name, capture_args=capture_args,
                             capture_result=capture_result)
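For intuition, here is a heavily simplified sketch of what such a wrapper conceptually does (capture arguments, time the call, record the result or the exception). This is not the library's actual source; record_event below is a stand-in for the internal event writer:

import functools
import time

def _sketch_wrap(func, record_event):
    # record_event is a placeholder for Agent Recorder's internal event writer
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result, error = None, None
        try:
            result = func(*args, **kwargs)
            return result
        except Exception as exc:
            error = repr(exc)
            raise
        finally:
            record_event({
                "function_name": func.__name__,
                "args": {"args": args, "kwargs": kwargs},
                "result": result,
                "error": error,
                "duration_ms": int((time.perf_counter() - start) * 1000),
            })
    return wrapper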
2. Event Storage Format
All events are stored as JSONL (JSON Lines) - one JSON object per line. This format is:
- Streamable: Can write events as they happen
- Parseable: Easy to read line-by-line
- Crash-resistant: If your program crashes, all events up to that point are saved
- Tooling-friendly: Standard format used by many data tools
Example event:
{
  "run_id": "20260103_192705_c2207bde",
  "event_id": "4f85a880-2ab7-45bf-a0ba-9c776581a5de",
  "timestamp": "2026-01-03T19:27:06.097562",
  "type": "llm_call",
  "parent_id": null,
  "data": {
    "function_name": "call_llm",
    "args": {
      "prompt": "User asks: Find all orders for customer John Smith"
    },
    "duration_ms": 760,
    "error": null,
    "result": "I'll help you find customer information. Let me search the database."
  }
}
Storage location: ~/.agent-recorder/runs/<run_id>.jsonl
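Because each line is a self-contained JSON object, you can post-process a run with a few lines of Python. A minimal sketch that prints every event from the example run above (the run ID is just the one shown in the sample event):

import json
from pathlib import Path

run_file = Path.home() / ".agent-recorder" / "runs" / "20260103_192705_c2207bde.jsonl"

with open(run_file) as f:
    for line in f:
        event = json.loads(line)
        data = event.get("data", {})
        print(event["type"], data.get("function_name"), data.get("duration_ms"))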
3. Event Types
Agent Recorder tracks 5 event types:
- run_start - Marks the beginning of a run
{
  "type": "run_start",
  "data": {
    "name": "customer-service-agent",
    "run_id": "20260103_192705_c2207bde",
    "timestamp": "2026-01-03T19:27:05.337192"
  }
}
- llm_call - LLM function execution
{
  "type": "llm_call",
  "data": {
    "function_name": "call_llm",
    "args": {"prompt": "..."},
    "result": "...",
    "duration_ms": 1234
  }
}
- tool_call - Tool function execution
{
  "type": "tool_call",
  "data": {
    "function_name": "search_database",
    "args": {"query": "..."},
    "result": [...],
    "duration_ms": 340
  }
}
- error - Exception that occurred
{
  "type": "error",
  "data": {
    "error_type": "ValueError",
    "message": "Customer not found",
    "traceback": "..."
  }
}
- run_end - Marks completion (optional in v0.1.1)
4. Async Support
The same decorators work seamlessly with async functions:
import asyncio
import httpx

@llm_call(run_name="async-agent")
async def call_llm_async(prompt):
    response = await openai_async.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

@tool_call(run_name="async-agent")
async def fetch_weather(city):
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://api.weather.com/{city}")
        return response.json()

# Use with asyncio
async def main():
    result = await call_llm_async("What's the weather in SF?")
    weather = await fetch_weather("San Francisco")

asyncio.run(main())
Agent Recorder detects if your function is a coroutine and handles it appropriately.
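In Python the standard way to do that kind of detection is inspect.iscoroutinefunction, which returns True for async def functions. A minimal sketch of the pattern (not Agent Recorder's exact code), using a plain timing decorator for illustration:

import functools
import inspect
import time

def timed(func):
    # Return an async wrapper for coroutines, a sync wrapper otherwise
    if inspect.iscoroutinefunction(func):
        @functools.wraps(func)
        async def async_wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return await func(*args, **kwargs)
            finally:
                print(f"{func.__name__}: {(time.perf_counter() - start) * 1000:.0f} ms")
        return async_wrapper

    @functools.wraps(func)
    def sync_wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            print(f"{func.__name__}: {(time.perf_counter() - start) * 1000:.0f} ms")
    return sync_wrapper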
5. Web Viewer Architecture
The viewer is a self-contained HTML file with:
- No external dependencies (no CDN calls)
- Vanilla JavaScript for parsing JSONL
- CSS for the timeline UI
- Syntax highlighting for JSON data
- Collapsible event cards
- Search and filter capabilities
When you run agent-recorder view latest, it:
- Finds the latest run in ~/.agent-recorder/runs/
- Starts a local HTTP server (default port 8765)
- Serves the HTML viewer + JSONL data
- Opens your browser to http://localhost:8765/runs/<run_id>.html
Everything stays local. No data leaves your machine.
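If you want a feel for how little machinery that takes, here is a rough sketch of serving a local runs directory with Python's standard library. This is not the actual viewer code (it only serves the raw JSONL files, not the HTML UI), but it shows the idea of a local-only server on the default port:

import http.server
import socketserver
import webbrowser
from functools import partial
from pathlib import Path

RUNS_DIR = Path.home() / ".agent-recorder" / "runs"
PORT = 8765

# Serve the runs directory on localhost only; nothing leaves the machine
handler = partial(http.server.SimpleHTTPRequestHandler, directory=str(RUNS_DIR))
with socketserver.TCPServer(("", PORT), handler) as httpd:
    webbrowser.open(f"http://localhost:{PORT}/")
    httpd.serve_forever()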
Real-World Use Cases
Let me show you how Agent Recorder solves actual problems I've encountered.
Use Case 1: Debugging Infinite Loops
The Problem: Agent keeps calling the same tool over and over.
Without Agent Recorder:
[DEBUG] Calling search_database with query: customer_name='John'
[DEBUG] Got 0 results
[DEBUG] Calling LLM...
[DEBUG] LLM says: Let me search again
[DEBUG] Calling search_database with query: customer_name='John'
[DEBUG] Got 0 results
[DEBUG] Calling LLM...
[DEBUG] LLM says: Let me search again
... (500 more lines)
You have to manually count log lines and realize it's looping.
With Agent Recorder:
Open the timeline and immediately see:
1. llm_call - "Find customer John"
2. tool_call - search_database(query="customer_name='John'") → []
3. llm_call - "I got no results, let me try again"
4. tool_call - search_database(query="customer_name='John'") → []
5. llm_call - "I got no results, let me try again"
6. tool_call - search_database(query="customer_name='John'") → []
... (pattern visible immediately)
The fix: The database query is wrong (should be customer_name='John Smith'). Also, the LLM needs explicit instruction to stop after 1 failed attempt.
Time saved: 2 hours → 5 minutes
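The second half of that fix is worth spelling out. A simple pattern is to cap retries in your own orchestration code rather than trusting the LLM to give up; a minimal sketch, assuming the decorated search_database from earlier and a max_attempts value you choose:

def find_customer(name, max_attempts=1):
    # Stop after max_attempts failed searches instead of looping forever
    for _ in range(max_attempts):
        results = search_database(f"customer_name='{name}'")
        if results:
            return results
    # Give up explicitly and let the caller escalate to a human
    return []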
Use Case 2: Performance Optimization
The Problem: Agent is slow but you don't know which part.
With Agent Recorder:
Look at the timeline durations:
1. llm_call - 1.2s ⚡ (acceptable)
2. tool_call - search_database - 3.8s 🐌 (SLOW!)
3. tool_call - get_orders - 0.4s ⚡
4. llm_call - 0.9s ⚡
The fix: Add a database index on customer_name. Duration drops to 0.2s.
Result: Total execution time drops from 6.3s to 2.7s (a 57% reduction)
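If you prefer numbers to eyeballing the timeline, the same information is in the recorded events. A quick sketch that ranks the slowest steps from an exported run (run.json is a placeholder file name; the export command is shown in the next use case):

import json

with open("run.json") as f:
    events = json.load(f)["events"]

# Sort LLM and tool calls by duration, slowest first
calls = [e for e in events if e["type"] in ("llm_call", "tool_call")]
for e in sorted(calls, key=lambda e: e["data"]["duration_ms"], reverse=True)[:5]:
    print(e["data"]["function_name"], e["data"]["duration_ms"], "ms")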
Use Case 3: Token Usage Optimization
The Problem: High API usage, unclear why.
With Agent Recorder:
Export the run to JSON:
agent-recorder export <run_id> -o run.json
Write a quick script to analyze:
import json

total_prompt_length = 0
total_calls = 0

with open('run.json') as f:
    data = json.load(f)

for event in data['events']:
    if event['type'] == 'llm_call':
        total_calls += 1
        prompt = event['data']['args'].get('prompt', '')
        total_prompt_length += len(prompt)

print(f"Total LLM calls: {total_calls}")
print(f"Average prompt length: {total_prompt_length / total_calls}")
Discovery: One LLM call had a 5000-character prompt that included the entire knowledge base unnecessarily.
The fix: Pass only relevant excerpts to the LLM. Token usage drops significantly.
Use Case 4: Comparing Runs
The Problem: "It worked yesterday, now it's broken."
With Agent Recorder:
# List all runs
agent-recorder list
# Output:
# 20260102_143022_abc123 customer-agent 2026-01-02 14:30:22 (working)
# 20260103_192705_c2207b customer-agent 2026-01-03 19:27:05 (broken)
# Export both
agent-recorder export 20260102_143022_abc123 -o working.json
agent-recorder export 20260103_192705_c2207b -o broken.json
# Compare with diff tool or custom script
Discovery: In the broken version, a new validation step was added that always returns empty results.
Time saved: 4 hours → 15 minutes
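The "custom script" route can be as simple as comparing the sequence of calls in the two exports. A rough sketch, using the working.json and broken.json files produced above:

import json

def call_sequence(path):
    with open(path) as f:
        events = json.load(f)["events"]
    return [
        (e["type"], e["data"].get("function_name"))
        for e in events
        if e["type"] in ("llm_call", "tool_call")
    ]

working = call_sequence("working.json")
broken = call_sequence("broken.json")

# Report the first step where the two timelines diverge
for i, (w, b) in enumerate(zip(working, broken)):
    if w != b:
        print(f"Runs diverge at step {i}: {w} vs {b}")
        break
else:
    print("Same call sequence; compare args/results instead")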
Use Case 5: Onboarding New Team Members
The Problem: "How does this agent work?"
With Agent Recorder:
Run a sample execution:
python examples/customer_service_agent.py
agent-recorder view latest
Show them the timeline. They instantly understand:
- Agent asks LLM to parse the query
- LLM decides which tools to call
- Agent executes tools (database, API calls)
- LLM synthesizes the response
No documentation needed. The timeline is living documentation.
Comparing Approaches
Let me compare different debugging approaches with a real scenario:
Scenario: Debug why customer order lookup fails for "John Smith"
Approach 1: Print Statements
def find_orders(customer_name):
    print(f"DEBUG: Looking for {customer_name}")
    customers = search_customers(customer_name)
    print(f"DEBUG: Found {len(customers)} customers")
    if not customers:
        print("DEBUG: No customers found, returning empty")
        return []
    print(f"DEBUG: Getting orders for {customers[0]['id']}")
    orders = get_orders(customers[0]['id'])
    print(f"DEBUG: Got {len(orders)} orders")
    return orders
Time to find bug: 30-60 minutes
Lines of debug code: 15-20
After fixing: Remove all print statements
If it breaks again: Add them all back
Approach 2: Logging Framework
import logging

logger = logging.getLogger(__name__)

def find_orders(customer_name):
    logger.info(f"Looking for customer: {customer_name}")
    customers = search_customers(customer_name)
    logger.info(f"Found {len(customers)} customers")
    if not customers:
        logger.warning("No customers found")
        return []
    logger.info(f"Getting orders for customer {customers[0]['id']}")
    orders = get_orders(customers[0]['id'])
    logger.info(f"Retrieved {len(orders)} orders")
    return orders
Time to find bug: 20-30 minutes
Lines of debug code: 20-25 (permanent overhead)
After fixing: Logs stay (clutter over time)
Visualization: Still just text in a file
Approach 3: Cloud Observability (e.g., DataDog)
from ddtrace import tracer

@tracer.wrap()
def find_orders(customer_name):
    with tracer.trace("search_customers"):
        customers = search_customers(customer_name)
    with tracer.trace("get_orders"):
        if customers:
            orders = get_orders(customers[0]['id'])
            return orders
    return []
Time to find bug: 10-15 minutes
Setup time: 2-3 hours (SDK, config, account)
Ongoing: Monthly subscription
Security: Data sent to third-party
Lines of instrumentation: 15-20
Approach 4: Agent Recorder
from agent_recorder import tool_call

@tool_call(run_name="order-lookup")
def find_orders(customer_name):
    customers = search_customers(customer_name)
    if not customers:
        return []
    orders = get_orders(customers[0]['id'])
    return orders

@tool_call(run_name="order-lookup")
def search_customers(name):
    return db.query(f"SELECT * FROM customers WHERE name = '{name}'")

@tool_call(run_name="order-lookup")
def get_orders(customer_id):
    return db.query(f"SELECT * FROM orders WHERE customer_id = {customer_id}")
Time to find bug: 5-10 minutes
Setup time: 30 seconds (pip install)
Ongoing: Free
Security: All data local
Lines of instrumentation: 3 decorators
After fixing: Decorators stay (useful for future debugging)
Winner: Agent Recorder provides the best balance of simplicity, effectiveness, and privacy.
Building Production-Ready Agents
Agent Recorder isn't just for debugging - it's essential for production agents.
1. Handling Sensitive Data
Don't log API keys or personal information:
@llm_call(run_name="secure-agent", capture_args=False)
def call_llm_with_key(api_key: str, prompt: str):
    # api_key won't be logged
    client = openai.OpenAI(api_key=api_key)
    return client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    ).choices[0].message.content

@tool_call(run_name="secure-agent", capture_result=False)
def fetch_user_pii(user_id: str):
    # Result won't be logged (but function call and args will)
    return db.get_user_sensitive_info(user_id)
2. Custom Storage Location
For production deployments:
@llm_call(
    run_name="prod-agent",
    storage_dir="/var/log/agent-recorder"
)
def call_llm(prompt):
    response = ...  # call the LLM as before
    return response
3. Cleanup Old Runs
Keep disk usage under control:
# Delete runs older than 7 days
agent-recorder cleanup --older-than 7d
# Dry run to see what would be deleted
agent-recorder cleanup --older-than 7d --dry-run
4. Automated Analysis
Export and analyze runs programmatically:
import json
import os
from pathlib import Path

def analyze_run(run_id):
    # Export to JSON
    export_path = Path(f"/tmp/{run_id}.json")
    os.system(f"agent-recorder export {run_id} -o {export_path}")

    # Load and analyze
    with open(export_path) as f:
        data = json.load(f)

    stats = {
        'total_llm_calls': 0,
        'total_tool_calls': 0,
        'total_duration': 0,
        'errors': []
    }

    for event in data['events']:
        if event['type'] == 'llm_call':
            stats['total_llm_calls'] += 1
            stats['total_duration'] += event['data']['duration_ms']
        elif event['type'] == 'tool_call':
            stats['total_tool_calls'] += 1
            stats['total_duration'] += event['data']['duration_ms']
        elif event['type'] == 'error':
            stats['errors'].append(event['data'])

    return stats
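For example, a nightly job or CI step could call it and flag unhealthy runs (the run ID below is the sample one from earlier; the thresholds are arbitrary examples, not recommendations):

stats = analyze_run("20260103_192705_c2207bde")
print(stats)

# Fail loudly if the run looks unhealthy
assert not stats['errors'], f"Run had {len(stats['errors'])} error(s)"
assert stats['total_llm_calls'] <= 10, "Unexpectedly chatty agent"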
The Road Ahead
Agent Recorder v0.1.1 is just the beginning. Here's what's coming:
v0.2.0 - Enhanced Visualization (Planned)
- Tree/Graph View: See nested calls as a visual tree
- Token Counting: Automatic token counting for OpenAI/Anthropic
- Cost Estimation: Calculate API costs for each run
- Performance Metrics: Identify bottlenecks automatically
- Export Formats: PDF, HTML, CSV for reports
v0.3.0 - Framework Integrations (Planned)
- LangChain Adapter: Auto-instrument LangChain agents
- LlamaIndex Adapter: Seamless integration with LlamaIndex
- AutoGen Support: Track multi-agent conversations
- CrewAI Integration: Monitor crew workflows
v0.4.0 - Advanced Features (Planned)
- Real-time Streaming: Watch agent execution live
- Multi-agent Support: Track multiple agents interacting
- Diff View: Compare two runs side-by-side
- Custom Events: Log your own event types
- Plugin System: Extend with custom visualizations
v0.5.0 - Language Ports (Community Welcome!)
- TypeScript/Node.js SDK: For JavaScript agents
- Go SDK: For Go-based agents
- Rust SDK: For high-performance agents
Want to contribute? Check out the GitHub repo for good first issues!
Conclusion: Observability Is Not Optional
As AI agents move from prototypes to production, observability isn't a nice-to-have - it's essential.
You can't optimize what you can't measure.
You can't debug what you can't see.
You can't trust what you can't verify.
Agent Recorder gives you that visibility with:
- ✅ Two simple decorators
- ✅ Zero configuration
- ✅ Local-first architecture
- ✅ Framework-agnostic design
- ✅ Beautiful visualization
- ✅ Free and open source
Get Started Today
# Clone and install
git clone https://github.com/yourusername/agent-recorder.git
cd agent-recorder
pip install -e .
# Try the example
python examples/simple_agent.py
# View the recording
agent-recorder view latest
GitHub: https://github.com/yourusername/agent-recorder
License: MIT
Docs: See README.md for full documentation
Join the Community
Agent Recorder is open source and built for the community. Whether you:
- Found a bug → Open an issue
- Have a feature idea → Start a discussion
- Want to contribute → Submit a PR
- Built something cool → Share your story
We're building the future of agent observability together.
Star the repo if you find it useful - it helps others discover the project!