After 30 days of running an autonomous AI agent 24/7, I've collected data on every failure mode. From hallucinated file paths to race conditions in parallel execution, here's everything that goes wrong — and exactly how to fix it.
The 3 AM Call
Your phone buzzes. It's your monitoring alert: Agent crashed at 3:17 AM. Again.
You check the logs. The agent was in the middle of submitting a pull request when it decided that notification_service.py existed (it doesn't), wrote 25 tests for it (all passing against mocks), pushed the branch, created the PR, and then confidently reported: "PR submitted successfully. All tests passing."
CodeRabbit reviewed it 4 minutes later: "This PR references notification_service.py, which doesn't exist in this branch."
This isn't a hypothetical. This happened to me. On Day 12 of my 30-day autonomous bounty-hunting experiment. And it's just one of dozens of failure modes I've cataloged.
This article is the debugging guide I wish I had before starting.
The Failure Taxonomy
After analyzing 200+ agent failures over 30 days, I've categorized them into 7 distinct types:
Type 1: Confident Hallucination (34% of failures)
What happens: The agent generates code referencing files, functions, or APIs that don't exist. It writes tests against mocked versions of these non-existent components. Tests pass. Agent reports success.
Real example:
# Agent wrote this test
from backend.services.notification_service import NotificationService

def test_notification_service():
    service = NotificationService()
    assert service.is_available() == False
The problem? The module is called notification_routing, not notification_service. The agent hallucinated the name based on the issue title.
Root cause: The agent reads the issue title, infers the file structure, and generates code without verifying file existence. LLMs are excellent at pattern matching — they'll create plausible-looking code for non-existent modules.
Prevention:
# Before writing ANY code, verify the target exists
find . -name "*notification*" -type f | head -5
grep -r "class.*Service" backend/services/ --include="*.py" | head -10
Fix: Add a mandatory pre-code verification step to your agent pipeline:
- Search for the target file/module
- If not found, search for similar names
- If still not found, report ambiguity and stop
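The three steps above can be sketched as a small lookup function. This is a minimal sketch, not the pipeline I actually ran: `locate_module`, its status strings, and the 0.6 similarity cutoff are all assumptions chosen for illustration.

```python
import difflib
from pathlib import Path


def locate_module(repo_root, wanted_stem):
    """Return (status, paths) for a module name the agent intends to use.

    status is "found", "similar", or "ambiguous" (nothing close; stop and report).
    """
    # Index every Python file in the repo by its name without the extension.
    stems = {}
    for path in Path(repo_root).rglob("*.py"):
        stems.setdefault(path.stem, []).append(path)

    # Step 1: exact match on the file name.
    if wanted_stem in stems:
        return "found", sorted(stems[wanted_stem])

    # Step 2: look for similarly named modules (cutoff is an arbitrary choice).
    close = difflib.get_close_matches(wanted_stem, list(stems), n=3, cutoff=0.6)
    if close:
        return "similar", sorted(p for name in close for p in stems[name])

    # Step 3: nothing plausible; the caller should report ambiguity and stop.
    return "ambiguous", []
```

A "similar" result should not be treated as permission to proceed. In the notification_service case it would surface notification_routing.py as a candidate, which still needs to be confirmed against the issue before any code is written.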
Type 2: Race Condition in Parallel Execution (21% of failures)
What happens: Multiple agent instances or background processes modify the same files simultaneously. Git conflicts arise, branches get corrupted, or test results are stale.
Real scenario:
- Agent A starts working on Issue #915 (translation tests)
- Agent B starts working on Issue #916 (classifier tests)
- Both clone the repo at the same commit
- Both create branches from main
- Agent A pushes first → succeeds
- Agent B pushes → GitHub rejects (branch already exists with different content)
Root cause: No coordination between parallel agent instances. Each operates independently, unaware of others.
Prevention:
# Before starting work, check if we already have a branch for this issue
git branch -r | grep "issue-915"
# Check if someone else already submitted a PR
gh api repos/{owner}/{repo}/pulls --jq '.[] | select(.title | test("915"))'
Fix: Implement a distributed lock mechanism:
- Before starting work, create a "claim" file or GitHub issue comment
- Check for existing claims before proceeding
- Use file-based locks for local parallel execution
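For the local case, a file-based claim might look like the sketch below. It assumes all agent instances share one filesystem; the function names and the `issue-<n>.lock` naming are hypothetical. It does nothing for agents on separate machines, where the GitHub comment approach is the one that applies.

```python
import os


def try_claim(lock_dir, issue_number, owner):
    """Atomically create a claim file; return True only for the first claimant."""
    os.makedirs(lock_dir, exist_ok=True)
    path = os.path.join(lock_dir, f"issue-{issue_number}.lock")
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so two local
        # processes cannot both succeed in claiming the same issue.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(owner)  # Record who holds the claim, for debugging
    return True


def release_claim(lock_dir, issue_number):
    """Remove the claim once the PR is submitted or the task is abandoned."""
    try:
        os.remove(os.path.join(lock_dir, f"issue-{issue_number}.lock"))
    except FileNotFoundError:
        pass
```

A crashed agent leaves its claim file behind, so in practice the lock also needs a timeout or a cleanup step on restart.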
Type 3: Stale Context / Outdated Codebase (18% of failures)
What happens: The agent works on code that was valid yesterday but has been updated since. Tests pass locally but fail in CI because the base branch has changed.
Real scenario:
Day 1: Agent clones repo at commit abc123
Day 2: Maintainer pushes 15 new commits
Day 3: Agent submits PR based on abc123
Result: Merge conflicts, CI failures, stale tests
Root cause: The agent doesn't fetch latest changes before starting work. Or it fetches but doesn't rebase.
Prevention:
# Always fetch and rebase before starting work
git fetch upstream
git rebase upstream/main
# If conflicts, abort and re-clone
Fix: Add a freshness check to the pipeline:
- Before starting work, fetch latest
- If behind by > 10 commits, re-clone
- Always rebase before pushing
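The freshness check above can be sketched in two small functions. This assumes a remote named `upstream` with a `main` branch, as in the commands above; the function names and the returned action strings are hypothetical.

```python
import subprocess


def commits_behind(upstream_ref="upstream/main", cwd="."):
    """Count commits on upstream_ref that the current HEAD does not have."""
    subprocess.run(["git", "fetch", "upstream"], cwd=cwd, check=True)
    out = subprocess.run(
        ["git", "rev-list", "--count", f"HEAD..{upstream_ref}"],
        cwd=cwd, check=True, capture_output=True, text=True,
    )
    return int(out.stdout.strip())


def freshness_action(behind, max_behind=10):
    """Map the commit gap to one of the actions from the list above."""
    if behind == 0:
        return "proceed"
    if behind > max_behind:
        return "reclone"  # Too far behind: start from a fresh clone
    return "rebase"
```

Keeping the decision separate from the git call makes the threshold easy to test and tune without touching a real repository.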
Type 4: Test Environment Mismatch (12% of failures)
What happens: Tests pass locally but fail in CI because the agent's environment differs from CI.
Common mismatches:
- Python version (agent has 3.12, CI has 3.11)
- Missing dependencies (agent has torch installed, CI doesn't)
- Environment variables (agent has Supabase keys, CI doesn't)
- File paths (agent uses /tmp, CI uses /home/runner/work/)
Prevention:
# Check CI configuration
cat .github/workflows/ci.yml
# Match the CI environment locally
python3 --version # Should match CI
pip list | grep torch # Check if ML deps are available
Fix: Run tests in a container that matches CI:
docker run --rm -v "$(pwd)":/app -w /app python:3.11-slim \
  bash -c "pip install -r requirements.txt && pytest"
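The Python version mismatch in particular can be caught automatically. The sketch below assumes the workflow declares its interpreter with a `python-version:` key (the convention used by the setup-python action). It is a regex over the file text rather than a real YAML parser, so treat it as a first-pass check; both function names are hypothetical.

```python
import re
import sys


def ci_python_versions(workflow_text):
    """Extract major.minor versions from `python-version:` keys in a workflow file."""
    versions = []
    for value in re.findall(r"python-version:\s*(.+)", workflow_text):
        # Handles both a single value ("3.11") and an inline list (["3.10", "3.11"]).
        versions.extend(re.findall(r"\d+\.\d+", value))
    return versions


def matches_ci(workflow_text, local=None):
    """True if the local major.minor version is one that CI actually runs."""
    if local is None:
        local = f"{sys.version_info.major}.{sys.version_info.minor}"
    return local in ci_python_versions(workflow_text)
```

If `matches_ci` returns False, running the tests in the matching container is the safer path than trusting the local result.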
Type 5: Incorrect Issue Linkage (8% of failures)
What happens: The agent references the wrong issue number in the PR description. This causes the PR to close the wrong issue when merged.
Real example:
Fixes #824 # Wrong! Should be #832
Root cause: The agent reads the issue title, but the code it wrote addresses a different issue with a similar title.
Prevention:
# Always verify the issue number matches the actual work
gh api repos/{owner}/{repo}/issues/{number} --jq '.title'
# Compare with what you actually implemented
Fix: Add issue verification to the PR template:
- Before submitting, re-read the issue
- Verify the code addresses the exact issue
- Double-check the issue number
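The last check is easy to automate. A minimal sketch, assuming the PR description links issues with GitHub's closing keywords (close, fix, resolve and their variants); `linked_issues` and `linkage_ok` are hypothetical names.

```python
import re

# Matches GitHub closing keywords followed by an issue reference, e.g. "Fixes #824".
CLOSING_REF = re.compile(
    r"\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s*:?\s+#(\d+)", re.IGNORECASE
)


def linked_issues(pr_body):
    """Return the set of issue numbers the PR description would close."""
    return {int(n) for n in CLOSING_REF.findall(pr_body)}


def linkage_ok(pr_body, expected_issue):
    """True only if the PR closes exactly the issue the agent was assigned."""
    return linked_issues(pr_body) == {expected_issue}
```

This only confirms the number is the intended one. Whether the code actually addresses that issue still requires re-reading the issue body.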
Type 6: Token/API Exhaustion (5% of failures)
What happens: The agent runs out of API credits mid-task. It might be halfway through writing tests when the LLM API returns a 402 error.
Real scenario:
Agent: Writing test 15 of 25...
API: 402 Insufficient Balance
Agent: [silently fails, reports partial success]
Root cause: No budget monitoring or graceful degradation.
Prevention:
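# Check that the API is reachable before starting long tasks.
# Note: listing models validates the key, not the remaining balance.
# `client` and `AuthenticationError` are assumed to come from your LLM SDK.
def check_api_health():
    try:
        client.models.list()
        return True
    except AuthenticationError:
        return False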
Fix: Implement graceful degradation:
- Check API health before starting
- If low balance, switch to free models (Groq, Gemini)
- If all APIs fail, save progress and pause
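The degradation chain above might be sketched like this. Everything here is an assumption for illustration: `ProviderError` stands in for whatever your SDK raises on a 402, and each provider is just a callable that takes a prompt and returns text.

```python
import json


class ProviderError(Exception):
    """Hypothetical error raised by a provider wrapper (e.g. on HTTP 402)."""


def complete_with_fallback(prompt, providers, progress, progress_path):
    """Try each (name, call) pair in order; save progress and pause if all fail.

    Returns (provider_name, text) on success, or (None, None) after saving progress.
    """
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError:
            continue  # Fall through to the next provider in the list

    # Every provider failed: persist what we have instead of reporting success.
    with open(progress_path, "w") as f:
        json.dump(progress, f)
    return None, None
```

The important property is the explicit `(None, None)` result. The caller has to handle it, which rules out the "silently fails, reports partial success" path from the scenario above.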
Type 7: Silent Data Loss (2% of failures)
What happens: The agent overwrites its own work, loses track of progress, or corrupts state files.
Real scenario:
Agent: Writing to /tmp/agent-state.json
Agent: [crashes mid-write]
Agent: [restarts, reads corrupted state]
Agent: [starts over from scratch, losing 2 hours of work]
Root cause: No atomic writes, no state persistence, no crash recovery.
Prevention:
# Use atomic writes
import os

def safe_write(path, content):
    tmp = path + '.tmp'
    with open(tmp, 'w') as f:
        f.write(content)
        f.flush()
        os.fsync(f.fileno())  # Make sure the bytes reach disk before the swap
    os.replace(tmp, path)  # Atomic when tmp and path are on the same filesystem
Fix: Implement crash recovery:
- Write state to atomic files
- On restart, check for incomplete work
- Resume from last checkpoint
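The restart side of crash recovery can be sketched as follows. The state layout (a JSON object with a `completed` list of step names) and both function names are assumptions for illustration.

```python
import json
import os


def load_state(path, default=None):
    """Return saved state, or `default` if the file is missing or corrupted."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default
    except json.JSONDecodeError:
        # Keep the damaged file for inspection rather than silently overwriting it.
        os.replace(path, path + ".corrupt")
        return default


def remaining_steps(all_steps, state):
    """Resume from the last checkpoint: skip steps already recorded as done."""
    done = set((state or {}).get("completed", []))
    return [s for s in all_steps if s not in done]
```

With atomic writes in place the corrupted branch should rarely trigger, but handling it means a bad state file degrades to a restart instead of a crash loop.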
The Debugging Playbook
When your agent fails, follow this checklist:
Step 1: Read the FULL error
Don't just read the last line. The real error is usually 10-20 lines before the crash.
# Get the last 50 lines of agent output
tail -50 /var/log/agent.log
# Search for error patterns
grep -E "Error|Exception|FAILED|Traceback" /var/log/agent.log | tail -10
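Since the real error usually sits above the final crash line, it helps to pull the preceding context in one step. A minimal sketch using the same error patterns as the grep above; `crash_context` is a hypothetical helper.

```python
import re

ERROR_PATTERN = re.compile(r"Error|Exception|FAILED|Traceback")


def crash_context(lines, before=20):
    """Return the last matching error line plus up to `before` lines preceding it."""
    for i in range(len(lines) - 1, -1, -1):
        if ERROR_PATTERN.search(lines[i]):
            return lines[max(0, i - before): i + 1]
    return []  # No error pattern found in the log
```

Typical usage would be `crash_context(open("/var/log/agent.log").read().splitlines())`, with the log path taken from the commands above.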
Step 2: Identify the failure type
Match the error to one of the 7 types above. This tells you the root cause and the fix.
Step 3: Check the environment
# Is the agent running?
ps aux | grep "[a]gent"  # The brackets keep grep from matching its own process
# Is the API working?
curl -s https://api.openai.com/v1/models -H "Authorization: Bearer $API_KEY" | head -5
# Is the filesystem intact?
ls -la /path/to/agent/state/
Step 4: Check the state
# What was the agent doing when it crashed?
cat /path/to/agent/state/current_task.json
# What progress was saved?
cat /path/to/agent/state/progress.json
Step 5: Fix and restart
# Fix the root cause
# ...
# Restart with recovery
python3 agent.py --resume-from-checkpoint
The Monitoring Dashboard
Every autonomous agent needs monitoring. Here's what to track:
| Metric | Alert Threshold | Action |
|---|---|---|
| PR Submission Rate | < 1/hour | Check API health |
| Test Pass Rate | < 80% | Review recent changes |
| Error Rate | > 20% | Pause and debug |
| API Balance | < $5 | Switch to free models |
| State File Size | > 10MB | Archive old state |
| Memory Usage | > 80% | Restart agent |
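The table translates directly into a rule list. In this sketch the metric key names are hypothetical, and rates are expressed as fractions (0.80 rather than 80%); the thresholds and actions are the ones in the table.

```python
import operator

# (metric, comparison, threshold, action), one row per line of the table.
RULES = [
    ("pr_submission_rate_per_hour", operator.lt, 1, "Check API health"),
    ("test_pass_rate", operator.lt, 0.80, "Review recent changes"),
    ("error_rate", operator.gt, 0.20, "Pause and debug"),
    ("api_balance_usd", operator.lt, 5, "Switch to free models"),
    ("state_file_mb", operator.gt, 10, "Archive old state"),
    ("memory_usage", operator.gt, 0.80, "Restart agent"),
]


def triggered_actions(metrics):
    """Return the actions whose alert threshold is crossed; missing metrics are skipped."""
    return [
        action
        for name, breached, threshold, action in RULES
        if name in metrics and breached(metrics[name], threshold)
    ]
```

Keeping thresholds in data rather than in if-statements makes it cheap to adjust them once you have seen what normal looks like for your agent.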
Simple Monitoring Script
#!/bin/bash
# monitor-agent.sh: run every 5 minutes via cron
LOG="/var/log/agent.log"
STATE="/path/to/agent/state"

# Check if agent is running
if ! pgrep -f "agent.py" > /dev/null; then
  echo "ALERT: Agent not running!" | send_notification
  # Auto-restart
  cd /path/to/agent && python3 agent.py --resume &
fi

# Check error rate (last 100 lines)
ERRORS=$(tail -100 "$LOG" | grep -c "ERROR\|FAILED\|Exception")
if [ "$ERRORS" -gt 20 ]; then
  echo "ALERT: High error rate ($ERRORS errors in last 100 lines)" | send_notification
fi

# Check state file
if [ -f "$STATE/current_task.json" ]; then
  AGE=$(($(date +%s) - $(stat -c %Y "$STATE/current_task.json")))
  if [ "$AGE" -gt 3600 ]; then
    echo "ALERT: Agent stuck for $((AGE/60)) minutes" | send_notification
  fi
fi
Real-World Debugging Session: The Translation Test Failure
Let me walk you through an actual debugging session from Day 15 of the experiment. This is a real failure, with real logs, and the real fix.
The Setup
The agent was tasked with writing unit tests for a translation pipeline — a Python module that detects language and translates text using Helsinki-NLP MarianMT models. The issue was straightforward: "add unit tests for translation_service."
The Failure
The agent submitted PR #928 with 35 tests. All passed locally. CodeRabbit reviewed and flagged:
"The test references lp._MODEL_CACHE but the actual code uses @lru_cache. The test will pass against mocks but won't catch real caching behavior."
The Debugging Process
Step 1: Read the full review
gh api repos/ritesh-1918/HELPDESK.AI/pulls/928/comments --jq '.[].body'
Step 2: Compare test assumptions with actual code
# What the test assumes:
grep "_MODEL_CACHE" backend/tests/test_language_pipeline.py
# Output: lp._MODEL_CACHE.clear()
# What the actual code uses:
grep "lru_cache\|cache" backend/language_pipeline.py
# Output: @lru_cache(maxsize=3)
Step 3: Identify the root cause
The agent used _MODEL_CACHE because it saw this pattern in another test file. But the actual translation module uses Python's @lru_cache decorator, which has a different API for clearing (.cache_clear() vs .clear()).
Step 4: Fix
# Before (wrong):
lp._MODEL_CACHE.clear()
# After (correct):
from backend.language_pipeline import _load_model
_load_model.cache_clear()
Step 5: Verify
python3 -m pytest backend/tests/test_language_pipeline.py -v
The Lesson
The agent made a reasonable assumption based on patterns it had seen elsewhere. But "reasonable assumption" is the enemy of correctness. The fix was trivial — but only because we caught it before the maintainer did.
Advanced Debugging: The Agent Self-Audit
After 200+ failures, I developed a self-audit protocol that the agent runs before every PR submission:
import os
import re

# The helper functions and the PASS / WARN / FAIL constants are assumed to be
# defined elsewhere in the agent; `changes` is the diff as a single string.
def self_audit(pr_branch, issue_number):
    """Run before every PR submission."""
    # 1. Verify all referenced files exist
    referenced_files = extract_file_references(pr_branch)
    for f in referenced_files:
        if not os.path.exists(f):
            return FAIL, f"Referenced file {f} does not exist"

    # 2. Verify issue linkage matches actual changes
    issue_title = get_issue_title(issue_number)
    changes = get_pr_changes(pr_branch)
    if not changes_match_issue(changes, issue_title):
        return WARN, "Changes may not fully address the issue"

    # 3. Verify tests pass without mocks
    test_result = run_tests_without_mocks(pr_branch)
    if test_result.failures > 0:
        return FAIL, f"{test_result.failures} tests fail without mocks"

    # 4. Check for common mistakes
    common_mistakes = [
        ("hardcoded path", r'/tmp/|/home/'),
        ("debug print", r'print\('),
        ("TODO comment", r'# TODO|// TODO'),
        ("console.log", r'console\.log'),
    ]
    for pattern_name, regex in common_mistakes:
        if re.search(regex, changes):
            return WARN, f"Found {pattern_name} in changes"

    return PASS, "All checks passed"
This self-audit catches 60% of failures before they reach the maintainer.
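Check 1 is only as good as the reference extraction behind it. For Python files, one way to approach it is to parse the imports and ask the import system whether each module can be located. This is a sketch under assumptions: `unresolved_imports` is a hypothetical helper, it only covers absolute imports, and it must run in an environment where the repo's packages are importable.

```python
import ast
import importlib.util


def unresolved_imports(source):
    """Return absolute module names imported by `source` that cannot be located."""
    wanted = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            wanted.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            wanted.add(node.module)

    missing = []
    for name in sorted(wanted):
        try:
            # Note: for dotted names, find_spec imports the parent packages.
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package does not exist or has no spec.
            found = False
        if not found:
            missing.append(name)
    return missing
```

Run against the test file from Type 1, this would report the hallucinated notification_service module before the PR is ever pushed.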
Lessons from 200+ Failures
1. The Agent Is Only as Good as Its Guards
Without verification steps, the agent will confidently produce wrong output. Every "verify file exists" check, every "rebase before pushing" step, every "check API balance" guard prevents a class of failures.
2. Silent Failures Are Worse Than Loud Crashes
A crash is visible. A silent failure — the agent submitting a PR for the wrong issue, or writing tests for a non-existent file — wastes hours of maintainer time and damages your reputation.
3. State Persistence Is Non-Negotiable
If your agent can't survive a crash and resume, you're one power outage away from losing all progress. Atomic writes, checkpoint files, and crash recovery are mandatory.
4. Monitoring > Logging
Logs tell you what happened. Monitoring tells you what's happening RIGHT NOW. Set up alerts for error rates, API health, and stuck states.
5. The Simplest Fix Is Often the Best
When debugging agent failures, resist the urge to build complex solutions. Often, the fix is:
- Add a file.exists() check before writing tests
- Add a git fetch before starting work
- Add a try/except around API calls
Simple guards prevent 80% of failures.
The Cost of Not Debugging
Let me put real numbers on this. During the 30-day experiment:
| Metric | Without Debugging | With Debugging |
|---|---|---|
| PRs Submitted | 84 | 84 |
| PRs Merged | 14 (17%) | 59 (70%) |
| Maintainer Complaints | 8 | 0 |
| Reputation Damage | High | None |
| Time Wasted on Failed PRs | ~40 hours | ~5 hours |
The difference? Debugging guards. Every file.exists() check, every git fetch before starting, every self_audit() before submitting — each one prevents a class of failures that would otherwise waste hours.
The Math
Without debugging:
- 70 failed PRs × 30 minutes average wasted = 35 hours
- 8 maintainer complaints × 1 hour to resolve = 8 hours
- Reputation damage = unquantifiable but real
- Total cost: 43+ hours + damaged reputation
With debugging:
- 25 failed PRs × 15 minutes average = 6.25 hours
- 0 maintainer complaints
- Reputation intact
- Total cost: 6.25 hours
The debugging guards take ~2 hours to implement. The savings are about 37 hours (43 - 6.25 = 36.75).
That's an 18:1 return on investment. Not counting reputation.
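The arithmetic above can be checked in a few lines. `wasted_hours` is a hypothetical helper; the inputs are the figures from the table and the two lists above.

```python
def wasted_hours(failed_prs, minutes_per_pr, complaints=0, hours_per_complaint=1.0):
    """Hours lost to failed PRs plus hours spent resolving maintainer complaints."""
    return failed_prs * minutes_per_pr / 60 + complaints * hours_per_complaint


submitted = 84
without_guards = wasted_hours(submitted - 14, 30, complaints=8)  # 35 + 8 = 43.0
with_guards = wasted_hours(submitted - 59, 15)                   # 6.25
saved = without_guards - with_guards                             # 36.75
roi = saved / 2                                                  # 18.375 per hour of setup
```

The ratio rounds to the 18:1 quoted above.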
Building Your Own Agent Debugging Toolkit
If you're building an autonomous agent, here's the minimal toolkit you need:
1. Pre-flight Checks (run before every task)
def preflight_check(task):
    """Verify environment before starting work."""
    checks = {
        "api_health": check_api_health(),
        "disk_space": check_disk_space(),
        "git_clean": check_git_status(),
        "repo_fresh": check_repo_freshness(),
        "no_conflicts": check_existing_work(task),
    }
    failed = [k for k, v in checks.items() if not v]
    if failed:
        raise PreflightError(f"Failed checks: {failed}")
    return True
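Two of the helpers named in the pre-flight check are simple enough to sketch with the standard library. The function names come from the code above, but these bodies are assumptions, as is the 1 GB default threshold.

```python
import shutil
import subprocess


def check_disk_space(path=".", min_free_gb=1.0):
    """True if the filesystem holding `path` has at least `min_free_gb` free."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    return free_gb >= min_free_gb


def check_git_status(repo="."):
    """True if the working tree has no uncommitted or untracked changes."""
    try:
        result = subprocess.run(
            ["git", "status", "--porcelain"],
            cwd=repo, capture_output=True, text=True,
        )
    except OSError:
        return False  # git is not installed, or the repo path does not exist
    # --porcelain prints one line per changed file, so empty output means clean.
    return result.returncode == 0 and result.stdout.strip() == ""
```

Each helper returns a plain boolean, so it fits the dictionary of checks without any adapter code.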
2. Post-flight Checks (run after every task)
def postflight_check(pr_branch, issue_number):
    """Verify work quality before submission."""
    checks = {
        "files_exist": verify_referenced_files(pr_branch),
        "tests_pass": run_tests(pr_branch),
        "issue_linked": verify_issue_linkage(pr_branch, issue_number),
        "no_debug_code": check_for_debug_code(pr_branch),
        "style_match": check_code_style(pr_branch),
    }
    failed = [k for k, v in checks.items() if not v]
    if failed:
        raise PostflightError(f"Failed checks: {failed}")
    return True
3. Health Monitor (runs continuously)
import time

class AgentHealthMonitor:
    def __init__(self):
        self.metrics = {
            "errors": 0,
            "successes": 0,
            "api_calls": 0,
            "start_time": time.time(),
        }

    def record_error(self, error_type, details):
        self.metrics["errors"] += 1
        if self.metrics["errors"] > 10:
            self.alert("High error rate detected")

    def record_success(self):
        self.metrics["successes"] += 1

    def alert(self, message):
        # Placeholder: wire this to your real notification channel
        print(f"ALERT: {message}")

    def get_error_rate(self):
        total = self.metrics["errors"] + self.metrics["successes"]
        return self.metrics["errors"] / total if total > 0 else 0

    def should_pause(self):
        return self.get_error_rate() > 0.3
4. State Manager (crash recovery)
import json
import os

class StateManager:
    def __init__(self, state_dir):
        self.state_dir = state_dir
        self.checkpoint_file = os.path.join(state_dir, "checkpoint.json")

    def save_checkpoint(self, state):
        """Atomic write to prevent corruption."""
        tmp = self.checkpoint_file + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f, indent=2)
        # os.replace swaps the file in one step and overwrites on every platform
        os.replace(tmp, self.checkpoint_file)

    def load_checkpoint(self):
        """Load last checkpoint if exists."""
        if os.path.exists(self.checkpoint_file):
            with open(self.checkpoint_file) as f:
                return json.load(f)
        return None

    def clear_checkpoint(self):
        """Clear checkpoint after successful completion."""
        if os.path.exists(self.checkpoint_file):
            os.remove(self.checkpoint_file)
Common Mistakes Even Experienced Developers Make
After helping several developers set up their own autonomous agents, I've seen these mistakes repeatedly:
Mistake 1: Trusting the Agent's Self-Report
The agent says "all tests pass." You believe it. You submit the PR. Tests fail in CI.
Why it happens: The agent runs tests in its own environment, which may differ from CI. Or it mocks everything and tests pass against mocks, not real code.
Fix: Always verify test results independently:
# Don't trust the agent's report — verify yourself
cd /path/to/repo
python3 -m pytest -v --tb=short 2>&1 | tail -20
Mistake 2: Not Reading the Full Issue
The agent reads the title, infers the solution, and starts coding. But the issue body contains critical constraints, edge cases, or existing solutions that change everything.
Fix: Force the agent to read the FULL issue body before coding:
gh api repos/{owner}/{repo}/issues/{number} --jq '.body'
Mistake 3: Ignoring Automated Reviews
"CodeRabbit is just a bot, I'll ignore it." This attitude costs merges.
Fix: Treat automated reviews as seriously as human reviews. They catch real issues.
Mistake 4: Submitting Without Local Testing
"It works on my machine" is not a valid test strategy. Run the actual test suite, not just your new tests.
Fix: Run the full test suite before submitting:
python3 -m pytest backend/tests/ -v
Mistake 5: Not Checking for Competing PRs
You spend 2 hours writing code, only to find someone else already submitted a PR for the same issue 3 hours ago.
Fix: Check before starting:
gh api repos/{owner}/{repo}/pulls --jq '.[] | select(.title | test("ISSUE_NUMBER"))'
The Bottom Line
Building autonomous AI agents that run 24/7 is hard. Not because the AI is bad at coding — it's excellent. The hard part is building the guardrails, verification steps, and monitoring systems that prevent the AI from confidently doing the wrong thing.
Every failure I've cataloged was preventable. Not with better AI, but with better engineering: file existence checks, environment verification, state persistence, and monitoring.
If you're building an autonomous agent, start with the failure taxonomy above. Implement guards for each type. Set up monitoring. Then let it run.
The agent will still fail sometimes. But now you'll know exactly why — and exactly how to fix it.
This article is based on real failure data from a 30-day autonomous bounty-hunting experiment. For the full architecture and earnings breakdown, see my previous article: "The Agent Economy: How AI Agents Are Earning Real Money in Open Source."
About the Author: I build autonomous AI agents and write about what actually goes wrong. No sugar-coating, no hype — just real data from real failures. Follow for more.