zk0x

Posted on May 31

Why Your AI Agent Keeps Breaking: A Debugging Guide for Autonomous Systems

#ai #debugging #deeptech #productivity

After 30 days of running an autonomous AI agent 24/7, I've collected data on every failure mode. From hallucinated file paths to race conditions in parallel execution, here's everything that goes wrong — and exactly how to fix it.

The 3 AM Call

Your phone buzzes. It's your monitoring alert: Agent crashed at 3:17 AM. Again.

You check the logs. The agent was in the middle of submitting a pull request when it decided that notification_service.py existed (it doesn't), wrote 25 tests for it (all passing against mocks), pushed the branch, created the PR, and then confidently reported: "PR submitted successfully. All tests passing."

CodeRabbit reviewed it 4 minutes later: "This PR references notification_service.py, which doesn't exist in this branch."

This isn't a hypothetical. This happened to me. On Day 12 of my 30-day autonomous bounty-hunting experiment. And it's just one of dozens of failure modes I've cataloged.

This article is the debugging guide I wish I had before starting.

The Failure Taxonomy

After analyzing 200+ agent failures over 30 days, I've categorized them into 7 distinct types:

Type 1: Confident Hallucination (34% of failures)

What happens: The agent generates code referencing files, functions, or APIs that don't exist. It writes tests against mocked versions of these non-existent components. Tests pass. Agent reports success.

Real example:

# Agent wrote this test
from backend.services.notification_service import NotificationService

def test_notification_service():
    service = NotificationService()
    assert service.is_available() == False

The problem? The module is called notification_routing, not notification_service. The agent hallucinated the name based on the issue title.

Root cause: The agent reads the issue title, infers the file structure, and generates code without verifying file existence. LLMs are excellent at pattern matching — they'll create plausible-looking code for non-existent modules.

Prevention:

# Before writing ANY code, verify the target exists
find . -name "*notification*" -type f | head -5
grep -r "class.*Service" backend/services/ --include="*.py" | head -10

Fix: Add a mandatory pre-code verification step to your agent pipeline:

Search for the target file/module
If not found, search for similar names
If still not found, report ambiguity and stop
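The three steps above could be sketched roughly as follows. This is a minimal illustration, not the pipeline used in the experiment: `resolve_module` and its return values are hypothetical names, and `difflib` fuzzy matching stands in for "search for similar names".

```python
import difflib
from pathlib import Path


def resolve_module(repo_root, module_name):
    """Return ("found", paths), ("similar", paths), or ("ambiguous", []).

    Hypothetical helper: looks for <module_name>.py under repo_root, then
    falls back to fuzzy matching on file stems.
    """
    py_files = sorted(Path(repo_root).rglob("*.py"))

    # Step 1: exact match on the file stem.
    exact = [p for p in py_files if p.stem == module_name]
    if exact:
        return "found", exact

    # Step 2: look for similarly named modules.
    stems = sorted({p.stem for p in py_files})
    close = difflib.get_close_matches(module_name, stems, n=3, cutoff=0.6)
    if close:
        return "similar", [p for p in py_files if p.stem in close]

    # Step 3: nothing plausible, so stop and report instead of guessing.
    return "ambiguous", []
```

With the example from this section, asking for `notification_service` in a repo that only contains `notification_routing.py` would come back as "similar" rather than silently producing tests for a module that is not there.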

Type 2: Race Condition in Parallel Execution (21% of failures)

What happens: Multiple agent instances or background processes modify the same files simultaneously. Git conflicts arise, branches get corrupted, or test results are stale.

Real scenario:

Agent A starts working on Issue #915 (translation tests)
Agent B starts working on Issue #916 (classifier tests)
Both clone the repo at the same commit
Both create branches from main
Agent A pushes first → succeeds
Agent B pushes → GitHub rejects (branch already exists with different content)

Root cause: No coordination between parallel agent instances. Each operates independently, unaware of others.

Prevention:

# Before starting work, check if we already have a branch for this issue
git branch -r | grep "issue-915"
# Check if someone else already submitted a PR
gh api repos/{owner}/{repo}/pulls --jq '.[] | select(.title | test("915"))'

Fix: Implement a distributed lock mechanism:

Before starting work, create a "claim" file or GitHub issue comment
Check for existing claims before proceeding
Use file-based locks for local parallel execution
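For the local case, a file-based claim might look like the sketch below. The function names and the `issue-<n>.lock` naming scheme are assumptions made for illustration. It only coordinates processes that share a filesystem; agents on different machines would still need the issue-comment approach.

```python
import os
import time


def try_claim(lock_dir, issue_number, owner="agent"):
    """Atomically create a claim file; return True if this caller won the claim."""
    os.makedirs(lock_dir, exist_ok=True)
    path = os.path.join(lock_dir, f"issue-{issue_number}.lock")
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so only one
        # process can create it.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(f"{owner} {int(time.time())}\n")
    return True


def release_claim(lock_dir, issue_number):
    """Remove the claim file once the work is finished or abandoned."""
    path = os.path.join(lock_dir, f"issue-{issue_number}.lock")
    try:
        os.remove(path)
    except FileNotFoundError:
        pass
```

An agent would call `try_claim` before cloning and skip the issue when it returns False.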

Type 3: Stale Context / Outdated Codebase (18% of failures)

What happens: The agent works on code that was valid yesterday but has been updated since. Tests pass locally but fail in CI because the base branch has changed.

Real scenario:

Day 1: Agent clones repo at commit abc123
Day 2: Maintainer pushes 15 new commits
Day 3: Agent submits PR based on abc123
Result: Merge conflicts, CI failures, stale tests

Root cause: The agent doesn't fetch latest changes before starting work. Or it fetches but doesn't rebase.

Prevention:

# Always fetch and rebase before starting work
git fetch upstream
git rebase upstream/main
# If conflicts, abort and re-clone

Fix: Add a freshness check to the pipeline:

Before starting work, fetch latest
If behind by > 10 commits, re-clone
Always rebase before pushing
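A freshness check along these lines could be written as below. It assumes a remote named `upstream` with a `main` branch, as in the commands above; the function names and the returned action strings are hypothetical.

```python
import subprocess


def commits_behind(repo_dir, upstream_ref="upstream/main"):
    """Count commits on upstream_ref that are missing from HEAD."""
    subprocess.run(["git", "fetch", "upstream"], cwd=repo_dir, check=True)
    out = subprocess.run(
        ["git", "rev-list", "--count", f"HEAD..{upstream_ref}"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return int(out.stdout.strip())


def freshness_action(behind, max_behind=10):
    """Map the commit gap to one of the actions in the checklist above."""
    if behind == 0:
        return "proceed"
    if behind > max_behind:
        return "reclone"
    return "rebase"
```

Keeping the decision in its own function makes the threshold easy to test without touching a real repository.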

Type 4: Test Environment Mismatch (12% of failures)

What happens: Tests pass locally but fail in CI because the agent's environment differs from CI.

Common mismatches:

Python version (agent has 3.12, CI has 3.11)
Missing dependencies (agent has torch installed, CI doesn't)
Environment variables (agent has Supabase keys, CI doesn't)
File paths (agent uses /tmp, CI uses /home/runner/work/)

Prevention:

# Check CI configuration
cat .github/workflows/ci.yml
# Match the CI environment locally
python3 --version  # Should match CI
pip list | grep torch  # Check if ML deps are available

Fix: Run tests in a container that matches CI:

docker run --rm -v "$(pwd)":/app -w /app python:3.11-slim \
  bash -c "pip install -r requirements.txt && pytest"
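When a container is not an option, some of the mismatches listed above can at least be detected up front. The sketch below is one possible shape for that check. The function name, its parameters, and the wording of the messages are all assumptions, and the expected values would have to be read from your own CI configuration.

```python
import importlib.util
import os
import sys


def environment_mismatches(expected_python, required_env=(),
                           ci_missing_modules=(), env=None):
    """Return human-readable differences between this machine and CI."""
    env = os.environ if env is None else env
    problems = []

    actual = tuple(sys.version_info[:2])
    if actual != tuple(expected_python):
        problems.append(
            f"python {actual[0]}.{actual[1]} != expected "
            f"{expected_python[0]}.{expected_python[1]}"
        )

    for name in required_env:
        if name not in env:
            problems.append(f"missing environment variable {name}")

    # Top-level modules installed locally but absent in CI hide import errors.
    for module in ci_missing_modules:
        if importlib.util.find_spec(module) is not None:
            problems.append(f"{module} is installed locally but not in CI")

    return problems
```

For the example in this section, the call would be something like `environment_mismatches((3, 11), ci_missing_modules=["torch"])`; an empty list means none of the checked differences were found.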

Type 5: Incorrect Issue Linkage (8% of failures)

What happens: The agent references the wrong issue number in the PR description. This causes the PR to close the wrong issue when merged.

Real example:

Fixes #824  # Wrong! Should be #832

Root cause: The agent reads the issue title, but the code it wrote addresses a different issue with a similar title.

Prevention:

# Always verify the issue number matches the actual work
gh api repos/{owner}/{repo}/issues/{number} --jq '.title'
# Compare with what you actually implemented

Fix: Add issue verification to the PR template:

Before submitting, re-read the issue
Verify the code addresses the exact issue
Double-check the issue number
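The last check is easy to automate. The sketch below pulls closing references out of a PR description and compares them with the issue the agent was assigned. The function names are hypothetical, and the pattern only covers same-repository references of the form `Fixes #123`.

```python
import re

# GitHub's closing keywords: close, fix, resolve and their variants.
CLOSING_REF = re.compile(
    r"\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s+#(\d+)",
    re.IGNORECASE,
)


def closing_issue_numbers(pr_body):
    """Return every issue number the PR body would close when merged."""
    return [int(n) for n in CLOSING_REF.findall(pr_body)]


def linkage_is_correct(pr_body, expected_issue):
    """True only if the PR closes exactly the issue that was assigned."""
    return closing_issue_numbers(pr_body) == [expected_issue]
```

For the example above, `linkage_is_correct("Fixes #824", 832)` returns False, which is the signal to stop before submitting.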

Type 6: Token/API Exhaustion (5% of failures)

What happens: The agent runs out of API credits mid-task. It might be halfway through writing tests when the LLM API returns a 402 error.

Real scenario:

Agent: Writing test 15 of 25...
API: 402 Insufficient Balance
Agent: [silently fails, reports partial success]

Root cause: No budget monitoring or graceful degradation.

Prevention:

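# Check that the API is reachable before starting long tasks.
# Assumes the openai Python SDK. Listing models confirms the key works;
# it does not report the remaining balance.
from openai import OpenAI, APIError

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def check_api_health():
    try:
        client.models.list()
        return True
    except APIError:  # base class: covers auth, billing, and connection errors
        return False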

Fix: Implement graceful degradation:

Check API health before starting
If low balance, switch to free models (Groq, Gemini)
If all APIs fail, save progress and pause
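The degradation chain above can be expressed without tying it to any particular SDK. In this sketch, each provider is just a callable that takes a prompt. The names `call_with_fallback` and `AllProvidersFailed`, and the shape of the saved progress record, are assumptions for illustration.

```python
class AllProvidersFailed(Exception):
    """Raised when every configured provider refuses the request."""


def call_with_fallback(providers, prompt, save_progress):
    """Try each (name, callable) provider in order; checkpoint if all fail."""
    errors = {}
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # in real code, catch the SDK's error types
            errors[name] = repr(exc)
    # Nothing worked: persist what we have so the task can resume later.
    save_progress({"prompt": prompt, "errors": errors})
    raise AllProvidersFailed(errors)
```

The important property is that the failure path is explicit: the agent either gets a result along with the name of the provider that produced it, or it saves its progress and raises, rather than reporting partial success.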

Type 7: Silent Data Loss (2% of failures)

What happens: The agent overwrites its own work, loses track of progress, or corrupts state files.

Real scenario:

Agent: Writing to /tmp/agent-state.json
Agent: [crashes mid-write]
Agent: [restarts, reads corrupted state]
Agent: [starts over from scratch, losing 2 hours of work]

Root cause: No atomic writes, no state persistence, no crash recovery.

Prevention:

# Use atomic writes
import os

def safe_write(path, content):
    tmp = path + '.tmp'
    with open(tmp, 'w') as f:
        f.write(content)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes reach disk before the swap
    os.replace(tmp, path)  # atomic when tmp and path are on the same filesystem

Fix: Implement crash recovery:

Write state to atomic files
On restart, check for incomplete work
Resume from last checkpoint
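On the reading side, recovery mostly means not trusting the state file blindly. A minimal sketch, assuming the state is a JSON object with a `done` list of completed task IDs (that layout is hypothetical, as are the function names):

```python
import json
import os


def load_state(path, default=None):
    """Return the saved state, or `default` if the file is missing or corrupt."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default
    except json.JSONDecodeError:
        # A half-written file: keep it for inspection, then start clean.
        os.replace(path, path + ".corrupt")
        return default


def remaining_tasks(all_tasks, state):
    """Skip tasks already recorded as done in the checkpoint."""
    done = set(state.get("done", [])) if state else set()
    return [t for t in all_tasks if t not in done]
```

Renaming the corrupt file instead of deleting it keeps the evidence around for debugging.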

The Debugging Playbook

When your agent fails, follow this checklist:

Step 1: Read the FULL error

Don't just read the last line. The real error is usually 10-20 lines before the crash.

# Get the last 50 lines of agent output
tail -50 /var/log/agent.log

# Search for error patterns
grep -E "Error|Exception|FAILED|Traceback" /var/log/agent.log | tail -10
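The same idea in Python: instead of looking at the final line, find the first error-looking line and return it with its surrounding context. The function name and the default window sizes are assumptions; the pattern is the one from the grep command above.

```python
import re

ERROR_PATTERN = re.compile(r"Error|Exception|FAILED|Traceback")


def first_error_with_context(lines, before=20, after=5):
    """Return the lines around the first error-looking line in a log."""
    for i, line in enumerate(lines):
        if ERROR_PATTERN.search(line):
            start = max(0, i - before)
            return lines[start:i + after + 1]
    return []
```

Feeding it the last few hundred lines of the log, for example `open(path).read().splitlines()[-300:]`, usually surfaces the original exception rather than the shutdown noise that follows it.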

Step 2: Identify the failure type

Match the error to one of the 7 types above. This tells you the root cause and the fix.

Step 3: Check the environment

# Is the agent running?
ps aux | grep '[a]gent'  # the bracket keeps grep from matching its own process

# Is the API working?
curl -s https://api.openai.com/v1/models -H "Authorization: Bearer $API_KEY" | head -5

# Is the filesystem intact?
ls -la /path/to/agent/state/

Step 4: Check the state

# What was the agent doing when it crashed?
cat /path/to/agent/state/current_task.json

# What progress was saved?
cat /path/to/agent/state/progress.json

Step 5: Fix and restart

# Fix the root cause
# ...

# Restart with recovery
python3 agent.py --resume-from-checkpoint

The Monitoring Dashboard

Every autonomous agent needs monitoring. Here's what to track:

Metric	Alert Threshold	Action
PR Submission Rate	< 1/hour	Check API health
Test Pass Rate	< 80%	Review recent changes
Error Rate	> 20%	Pause and debug
API Balance	< $5	Switch to free models
State File Size	> 10MB	Archive old state
Memory Usage	> 80%	Restart agent
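The table translates fairly directly into data. In the sketch below, the metric key names and the choice to express percentages as fractions are assumptions; the thresholds and actions are transcribed from the table.

```python
import operator

# (metric, comparison, threshold, action), transcribed from the table above.
RULES = [
    ("pr_submission_rate_per_hour", operator.lt, 1, "Check API health"),
    ("test_pass_rate", operator.lt, 0.80, "Review recent changes"),
    ("error_rate", operator.gt, 0.20, "Pause and debug"),
    ("api_balance_usd", operator.lt, 5, "Switch to free models"),
    ("state_file_mb", operator.gt, 10, "Archive old state"),
    ("memory_usage", operator.gt, 0.80, "Restart agent"),
]


def triggered_actions(metrics):
    """Return (metric, action) pairs for every threshold the metrics cross."""
    return [
        (name, action)
        for name, compare, threshold, action in RULES
        if name in metrics and compare(metrics[name], threshold)
    ]
```

Metrics that are missing from the input are skipped rather than treated as alerts, so the function can run on partial data.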

Simple Monitoring Script

#!/bin/bash
# monitor-agent.sh — Run every 5 minutes via cron

LOG="/var/log/agent.log"
STATE="/path/to/agent/state"

# Check if agent is running
# (send_notification is a placeholder for your own alerting command)
if ! pgrep -f "agent.py" > /dev/null; then
    echo "ALERT: Agent not running!" | send_notification
    # Auto-restart
    cd /path/to/agent && python3 agent.py --resume-from-checkpoint &
fi

# Check error rate (last 100 lines)
ERRORS=$(tail -100 "$LOG" | grep -c "ERROR\|FAILED\|Exception")
if [ "$ERRORS" -gt 20 ]; then
    echo "ALERT: High error rate ($ERRORS errors in last 100 lines)" | send_notification
fi

# Check state file
if [ -f "$STATE/current_task.json" ]; then
    AGE=$(($(date +%s) - $(stat -c %Y "$STATE/current_task.json")))
    if [ "$AGE" -gt 3600 ]; then
        echo "ALERT: Agent stuck for $((AGE/60)) minutes" | send_notification
    fi
fi

Real-World Debugging Session: The Translation Test Failure

Let me walk you through an actual debugging session from Day 15 of the experiment. This is a real failure, with real logs, and the real fix.

The Setup

The agent was tasked with writing unit tests for a translation pipeline — a Python module that detects language and translates text using Helsinki-NLP MarianMT models. The issue was straightforward: "add unit tests for translation_service."

The Failure

The agent submitted PR #928 with 35 tests. All passed locally. CodeRabbit reviewed and flagged:

"The test references lp._MODEL_CACHE but the actual code uses @lru_cache. The test will pass against mocks but won't catch real caching behavior."

The Debugging Process

Step 1: Read the full review

gh api repos/ritesh-1918/HELPDESK.AI/pulls/928/comments --jq '.[].body'

Step 2: Compare test assumptions with actual code

# What the test assumes:
grep "_MODEL_CACHE" backend/tests/test_language_pipeline.py
# Output: lp._MODEL_CACHE.clear()

# What the actual code uses:
grep "lru_cache\|cache" backend/language_pipeline.py
# Output: @lru_cache(maxsize=3)

Step 3: Identify the root cause

The agent used _MODEL_CACHE because it saw this pattern in another test file. But the actual translation module uses Python's @lru_cache decorator, which has a different API for clearing (.cache_clear() vs .clear()).

Step 4: Fix

# Before (wrong):
lp._MODEL_CACHE.clear()

# After (correct):
from backend.language_pipeline import _load_model
_load_model.cache_clear()
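To see the difference in isolation, here is a small self-contained demonstration. `load_model` is a stand-in for the real `_load_model`, which is not reproduced here; it records each real call so the effect of `cache_clear()` is visible.

```python
from functools import lru_cache

calls = []


@lru_cache(maxsize=3)
def load_model(lang_pair):
    """Stand-in for an expensive model load; records each real call."""
    calls.append(lang_pair)
    return f"model:{lang_pair}"


load_model("de-en")
load_model("de-en")          # served from the cache, no second load
assert calls == ["de-en"]

load_model.cache_clear()     # lru_cache exposes cache_clear() on the function
load_model("de-en")          # cache was emptied, so this loads again
assert calls == ["de-en", "de-en"]
```

A test that clears a module-level dict instead never touches this cache, so cached models leak between test cases.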

Step 5: Verify

python3 -m pytest backend/tests/test_language_pipeline.py -v

The Lesson

The agent made a reasonable assumption based on patterns it had seen elsewhere. But "reasonable assumption" is the enemy of correctness. The fix was trivial — but only because we caught it before the maintainer did.

Advanced Debugging: The Agent Self-Audit

After 200+ failures, I developed a self-audit protocol that the agent runs before every PR submission:

import os
import re

PASS, WARN, FAIL = "PASS", "WARN", "FAIL"

# Helpers such as extract_file_references() and run_tests_without_mocks()
# are agent-specific and not shown here.

def self_audit(pr_branch, issue_number):
    """Run before every PR submission."""
    warnings = []  # collect warnings so they don't short-circuit later checks

    # 1. Verify all referenced files exist
    referenced_files = extract_file_references(pr_branch)
    for f in referenced_files:
        if not os.path.exists(f):
            return FAIL, f"Referenced file {f} does not exist"

    # 2. Verify issue linkage matches actual changes
    issue_title = get_issue_title(issue_number)
    changes = get_pr_changes(pr_branch)  # assumed to return the diff as text
    if not changes_match_issue(changes, issue_title):
        warnings.append("Changes may not fully address the issue")

    # 3. Verify tests pass without mocks
    test_result = run_tests_without_mocks(pr_branch)
    if test_result.failures > 0:
        return FAIL, f"{test_result.failures} tests fail without mocks"

    # 4. Check for common mistakes
    common_mistakes = [
        ("hardcoded path", r'/tmp/|/home/'),
        ("debug print", r'print\('),
        ("TODO comment", r'# TODO|// TODO'),
        ("console.log", r'console\.log'),
    ]
    for pattern_name, regex in common_mistakes:
        if re.search(regex, changes):
            warnings.append(f"Found {pattern_name} in changes")

    if warnings:
        return WARN, "; ".join(warnings)
    return PASS, "All checks passed"

This self-audit catches 60% of failures before they reach the maintainer.

Lessons from 200+ Failures

1. The Agent Is Only as Good as Its Guards

Without verification steps, the agent will confidently produce wrong output. Every "verify file exists" check, every "rebase before pushing" step, every "check API balance" guard prevents a class of failures.

2. Silent Failures Are Worse Than Loud Crashes

A crash is visible. A silent failure — the agent submitting a PR for the wrong issue, or writing tests for a non-existent file — wastes hours of maintainer time and damages your reputation.

3. State Persistence Is Non-Negotiable

If your agent can't survive a crash and resume, you're one power outage away from losing all progress. Atomic writes, checkpoint files, and crash recovery are mandatory.

4. Monitoring > Logging

Logs tell you what happened. Monitoring tells you what's happening RIGHT NOW. Set up alerts for error rates, API health, and stuck states.

5. The Simplest Fix Is Often the Best

When debugging agent failures, resist the urge to build complex solutions. Often, the fix is:

Add an os.path.exists() check before writing tests
Add a git fetch before starting work
Add a try/except around API calls

Simple guards prevent 80% of failures.

The Cost of Not Debugging

Let me put real numbers on this. During the 30-day experiment:

Metric	Without Debugging	With Debugging
PRs Submitted	84	84
PRs Merged	14 (17%)	59 (70%)
Maintainer Complaints	8	0
Reputation Damage	High	None
Time Wasted on Failed PRs	~35 hours	~6 hours

The difference? Debugging guards. Every os.path.exists() check, every git fetch before starting, every self_audit() before submitting: each one prevents a class of failures that would otherwise waste hours.

The Math

Without debugging:

70 failed PRs × 30 minutes average wasted = 35 hours
8 maintainer complaints × 1 hour to resolve = 8 hours
Reputation damage = unquantifiable but real
Total cost: 43+ hours + damaged reputation

With debugging:

25 failed PRs × 15 minutes average = 6.25 hours
0 maintainer complaints
Reputation intact
Total cost: 6.25 hours

The debugging overhead is ~2 hours to implement. The savings are about 37 hours (43 − 6.25 = 36.75).

That's roughly an 18:1 return on investment. Not counting reputation.
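The arithmetic is easy to reproduce from the figures above (84 PRs submitted, 14 vs 59 merged, 8 complaints, ~2 hours of setup). The variable names are just labels for this sketch.

```python
failed_without, failed_with = 84 - 14, 84 - 59      # 70 and 25 failed PRs
hours_without = failed_without * 30 / 60 + 8 * 1    # wasted PR time + complaints
hours_with = failed_with * 15 / 60
savings = hours_without - hours_with
setup_hours = 2

print(hours_without, hours_with, savings, round(savings / setup_hours, 1))
# prints: 43.0 6.25 36.75 18.4
```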

Building Your Own Agent Debugging Toolkit

If you're building an autonomous agent, here's the minimal toolkit you need:

1. Pre-flight Checks (run before every task)

class PreflightError(Exception):
    """Raised when the environment is not ready for a task."""


def preflight_check(task):
    """Verify environment before starting work."""
    checks = {
        "api_health": check_api_health(),
        "disk_space": check_disk_space(),
        "git_clean": check_git_status(),
        "repo_fresh": check_repo_freshness(),
        "no_conflicts": check_existing_work(task),
    }

    failed = [k for k, v in checks.items() if not v]
    if failed:
        raise PreflightError(f"Failed checks: {failed}")

    return True
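The individual checks are left undefined above. Two of them might be implemented along these lines. The default parameters, including the 1 GB free-space floor, are assumptions, not values from the experiment.

```python
import shutil
import subprocess


def check_disk_space(path=".", min_free_gb=1.0):
    """True if at least min_free_gb is free on the filesystem holding path."""
    free_gb = shutil.disk_usage(path).free / 1024 ** 3
    return free_gb >= min_free_gb


def check_git_status(repo_dir="."):
    """True if the working tree has no uncommitted or untracked changes."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir, capture_output=True, text=True,
    )
    return result.returncode == 0 and result.stdout.strip() == ""
```

Both return plain booleans, so they drop straight into the `checks` dictionary.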

2. Post-flight Checks (run after every task)

class PostflightError(Exception):
    """Raised when finished work fails a quality check."""


def postflight_check(pr_branch, issue_number):
    """Verify work quality before submission."""
    checks = {
        "files_exist": verify_referenced_files(pr_branch),
        "tests_pass": run_tests(pr_branch),
        "issue_linked": verify_issue_linkage(pr_branch, issue_number),
        "no_debug_code": check_for_debug_code(pr_branch),
        "style_match": check_code_style(pr_branch),
    }

    failed = [k for k, v in checks.items() if not v]
    if failed:
        raise PostflightError(f"Failed checks: {failed}")

    return True

3. Health Monitor (runs continuously)

import time

class AgentHealthMonitor:
    def __init__(self):
        self.metrics = {
            "errors": 0,
            "successes": 0,
            "api_calls": 0,
            "start_time": time.time(),
        }

    def alert(self, message):
        # Replace with your own notification channel
        print(f"ALERT: {message}")

    def record_error(self, error_type, details):
        self.metrics["errors"] += 1
        if self.metrics["errors"] > 10:
            self.alert("High error rate detected")

    def record_success(self):
        self.metrics["successes"] += 1

    def get_error_rate(self):
        total = self.metrics["errors"] + self.metrics["successes"]
        return self.metrics["errors"] / total if total > 0 else 0

    def should_pause(self):
        return self.get_error_rate() > 0.3

4. State Manager (crash recovery)

import json
import os

class StateManager:
    def __init__(self, state_dir):
        self.state_dir = state_dir
        os.makedirs(state_dir, exist_ok=True)
        self.checkpoint_file = os.path.join(state_dir, "checkpoint.json")

    def save_checkpoint(self, state):
        """Atomic write to prevent corruption."""
        tmp = self.checkpoint_file + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f, indent=2)
        os.replace(tmp, self.checkpoint_file)  # atomic swap on the same filesystem

    def load_checkpoint(self):
        """Load last checkpoint if exists."""
        if os.path.exists(self.checkpoint_file):
            with open(self.checkpoint_file) as f:
                return json.load(f)
        return None

    def clear_checkpoint(self):
        """Clear checkpoint after successful completion."""
        if os.path.exists(self.checkpoint_file):
            os.remove(self.checkpoint_file)

Common Mistakes Even Experienced Developers Make

After helping several developers set up their own autonomous agents, I've seen these mistakes repeatedly:

Mistake 1: Trusting the Agent's Self-Report

The agent says "all tests pass." You believe it. You submit the PR. Tests fail in CI.

Why it happens: The agent runs tests in its own environment, which may differ from CI. Or it mocks everything and tests pass against mocks, not real code.

Fix: Always verify test results independently:

# Don't trust the agent's report — verify yourself
cd /path/to/repo
python3 -m pytest -v --tb=short 2>&1 | tail -20
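The same check can be scripted so that the agent's claim and the real exit status are compared automatically. `verified` is a hypothetical helper name. It re-runs whatever command you pass and trusts only the return code.

```python
import subprocess
import sys


def verified(claimed_success, command, cwd=None):
    """Re-run the command and compare its exit status with the agent's claim."""
    result = subprocess.run(command, cwd=cwd, capture_output=True, text=True)
    actually_passed = result.returncode == 0
    return claimed_success == actually_passed, result.stdout[-2000:]
```

A typical call would be `verified(True, [sys.executable, "-m", "pytest", "-q"], cwd="/path/to/repo")`. A False in the first position means the self-report and reality disagree.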

Mistake 2: Not Reading the Full Issue

The agent reads the title, infers the solution, and starts coding. But the issue body contains critical constraints, edge cases, or existing solutions that change everything.

Fix: Force the agent to read the FULL issue body before coding:

gh api repos/{owner}/{repo}/issues/{number} --jq '.body'

Mistake 3: Ignoring Automated Reviews

"CodeRabbit is just a bot, I'll ignore it." This attitude costs merges.

Fix: Treat automated reviews as seriously as human reviews. They catch real issues.

Mistake 4: Submitting Without Local Testing

"It works on my machine" is not a valid test strategy. Run the actual test suite, not just your new tests.

Fix: Run the full test suite before submitting:

python3 -m pytest backend/tests/ -v

Mistake 5: Not Checking for Competing PRs

You spend 2 hours writing code, only to find someone else already submitted a PR for the same issue 3 hours ago.

Fix: Check before starting:

gh api repos/{owner}/{repo}/pulls --jq '.[] | select(.title | test("ISSUE_NUMBER"))'
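One caveat with a plain substring test is that searching for "915" also matches "#9150" or "1915". If the PR list is already loaded as JSON, a stricter filter is straightforward. The sketch below assumes `pulls` is the decoded response of the pulls endpoint, with `title` and `body` fields; the function name is hypothetical.

```python
import re


def competing_prs(pulls, issue_number):
    """Return PRs whose title or body references #<issue_number> exactly."""
    ref = re.compile(rf"#{issue_number}\b")
    return [
        pr for pr in pulls
        if ref.search(pr.get("title") or "") or ref.search(pr.get("body") or "")
    ]
```

An empty result does not prove nobody is working on the issue, since a PR may not mention the number at all, but it filters out the obvious collisions.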

The Bottom Line

Building autonomous AI agents that run 24/7 is hard. Not because the AI is bad at coding — it's excellent. The hard part is building the guardrails, verification steps, and monitoring systems that prevent the AI from confidently doing the wrong thing.

Every failure I've cataloged was preventable. Not with better AI, but with better engineering: file existence checks, environment verification, state persistence, and monitoring.

If you're building an autonomous agent, start with the failure taxonomy above. Implement guards for each type. Set up monitoring. Then let it run.

The agent will still fail sometimes. But now you'll know exactly why — and exactly how to fix it.

This article is based on real failure data from a 30-day autonomous bounty-hunting experiment. For the full architecture and earnings breakdown, see my previous article: "The Agent Economy: How AI Agents Are Earning Real Money in Open Source."

About the Author: I build autonomous AI agents and write about what actually goes wrong. No sugar-coating, no hype — just real data from real failures. Follow for more.