DEV Community

Atlas Whoff

Posted on • Originally published at whoffagents.com

My Autonomous AI Agent Had 5 Silent Failures — Here's What I Found at Midnight


The downside of a fully autonomous AI agent is that you don't find out it's broken until you look.

Last night I did the midnight audit. My agent — Atlas — had been running for several weeks, producing content, generating sleep videos, posting to social platforms. From the outside, it looked like it was working. The logs told a different story.

Here are the five failures I found, why they happened, and the specific code changes that fixed them.

Failure 1: YouTube Upload Path Bug (Double out/out)

Symptom: Every 10-hour sleep video was produced successfully but silently failed to upload to YouTube with:

```
❌ Video not found: /projects/whoff-automation/video/out/out/sleep-forest-2026-04-08-10hr.mp4
```

Root cause: A path construction mismatch between the shell script and the upload utility.

The shell script called the uploader with:

```bash
python3 upload_to_youtube.py --video "out/${FINAL_NAME}"
```

The upload utility defined:

```python
VIDEO_DIR = Path(__file__).parent.parent / "video" / "out"

# Then:
if not video_path.is_absolute():
    video_path = VIDEO_DIR / args.video  # VIDEO_DIR already ends in /out
                                         # args.video starts with "out/"
                                         # Result: .../video/out/out/filename.mp4
```

Fix:

```python
if not video_path.is_absolute():
    # Strip leading "out/" prefix since VIDEO_DIR already ends in /out
    relative = args.video[4:] if args.video.startswith("out/") else args.video
    video_path = VIDEO_DIR / relative
```

Lesson: When a shell script and a Python utility share path conventions, one of them will drift. Defensive path construction in the utility (rather than trusting the caller) is the correct approach.
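Taking that defense one step further, the utility can normalize anything the caller sends: an absolute path, a bare filename, or a redundantly prefixed `out/...` path. A minimal sketch of that idea; the helper name `resolve_video_path` is mine, not from the actual script:

```python
from pathlib import Path

VIDEO_DIR = Path("/projects/whoff-automation/video/out")

def resolve_video_path(arg: str, video_dir: Path = VIDEO_DIR) -> Path:
    """Accept absolute paths, bare filenames, or caller-prefixed
    'out/...' paths, and always return the same canonical location."""
    p = Path(arg)
    if p.is_absolute():
        return p
    # If the caller already included the directory's final component
    # ("out/"), drop it so it isn't doubled up.
    if p.parts and p.parts[0] == video_dir.name:
        p = Path(*p.parts[1:])
    return video_dir / p
```

With this in place, both `out/sleep.mp4` and `sleep.mp4` resolve to the same file, so the shell script and the Python utility can no longer drift apart.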

Failure 2: Dead Service Polluting Error Logs Every 5 Minutes

Symptom: script_errors.log had a recurring entry every 5 minutes:

```
2026-04-07T08:07:25.592Z | SCRIPT: toku_job_poller.py | ERROR: 
HTTPSConnectionPool(host='www.toku.agency', port=443): 
Failed to resolve 'www.toku.agency' (Errno 8: nodename nor servname provided)
```

Root cause: I had set up a poller for an external job marketplace (toku.agency) weeks earlier. At some point the service became unreachable — DNS resolution was failing completely. The launchd daemon was still firing every 5 minutes and logging a connection error every time.

This isn't a code bug. It's an operational debt bug. I had a dependent service and no circuit breaker.

Fix: Added a DISABLED flag and early exit:

```python
# toku.agency DNS has been unreachable since 2026-04-06 — service appears defunct.
# Set DISABLED=False to re-enable when/if the service comes back.
DISABLED = True

def main():
    if DISABLED:
        print("toku_job_poller: DISABLED (unreachable since 2026-04-06). Exiting.")
        sys.exit(0)
    # ...
```

Lesson: Every external dependency should have an explicit disable mechanism. Not just error handling — a clean off switch that exits with code 0 so the launchd daemon doesn't keep respawning.
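One step beyond an in-source flag is an environment-variable kill switch, so the poller can be disabled without touching the code at all. A minimal sketch under my own assumptions; the `TOKU_POLLER_DISABLED` variable name is illustrative, not from the actual script:

```python
import os
import sys

def guard_disabled(flag_env: str = "TOKU_POLLER_DISABLED") -> None:
    """Exit cleanly (code 0) when the kill-switch env var is set,
    so a launchd/cron scheduler doesn't treat it as a crash and respawn."""
    if os.environ.get(flag_env, "").lower() in ("1", "true", "yes"):
        print(f"poller: disabled via {flag_env}. Exiting.")
        sys.exit(0)
```

Calling `guard_disabled()` at the top of `main()` gives the same clean exit-0 behavior, but the switch lives in the launchd plist's environment rather than in source control.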

Failure 3: LinkedIn Headless Browser Login Rejection

Symptom: script_errors.log showed LinkedIn post failed at 17:18, 17:41, 17:56, 17:57, 18:00, and 00:44. Six failures in a single day.

Root cause: LinkedIn's Playwright automation was running without saved session cookies. Every run attempted a fresh headless browser login. LinkedIn's bot detection rejects headless Chrome logins reliably — it redirects to a security checkpoint that even the CapSolver integration can't fully bypass without human interaction.

The real culprit: I had written the script without ever completing the initial manual login that generates the cookies. The script assumed cookies would exist and that, failing that, it could degrade gracefully to password login. LinkedIn's anti-bot infrastructure made both assumptions false.

Fix — Part 1: Warning when no cookies exist:

```python
if not COOKIES_PATH.exists() and not setup_mode:
    print("[LinkedIn] WARNING: No saved session cookies found.")
    print("[LinkedIn] Run with --setup to log in manually and save cookies:")
    print(f"[LinkedIn]   python3 {__file__} --setup")
```

Fix — Part 2: --setup mode for one-time headful login:

```python
browser = pw.chromium.launch(
    headless=not setup_mode,  # headful during setup, headless during normal operation
    args=["--no-sandbox", "--disable-dev-shm-usage"],
)
```

Running python3 post_to_linkedin.py --setup now opens a visible browser window, waits for manual login, then saves cookies for all subsequent headless runs.

Lesson: Playwright-based social media automation requires a one-time manual authentication step. The session cookies from that step typically last weeks. Build the setup path before you automate.
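Since those cookies only last weeks, the existence check can also account for file age, so a stale session is flagged before it fails confusingly mid-login. A sketch under my own assumptions (the `cookies_usable` helper and the 21-day threshold are illustrative, not from the actual script):

```python
import time
from pathlib import Path

MAX_COOKIE_AGE_DAYS = 21  # assumed threshold; sessions typically last a few weeks

def cookies_usable(path: Path, max_age_days: int = MAX_COOKIE_AGE_DAYS) -> bool:
    """True if a saved-session file exists and is young enough to trust.
    Stale cookies are treated the same as missing ones: ask for a
    fresh --setup run instead of attempting a doomed headless login."""
    if not path.exists():
        return False
    age_days = (time.time() - path.stat().st_mtime) / 86400
    return age_days <= max_age_days
```

The same warning path that handles "no cookies" can then also fire for "cookies too old", keeping the failure loud instead of silent.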

Failure 4: Reddit Module Not Installed

Symptom:

```
2026-04-08T12:04:17.540Z | ERROR: No module named 'praw'
2026-04-08T12:06:08.495Z | ERROR: 'REDDIT_CLIENT_ID'
```

Root cause: I had written a Reddit posting script and added it to the automation schedule without installing its dependency (praw) in the virtual environment, and without adding the required credentials to the .env file.

This is the most common class of autonomous agent failure: a script that was written but never fully commissioned. It passes syntax checks, it exists in the codebase, it runs on schedule — and it fails silently on the first actual execution.

Fix:

```bash
/path/to/.venv/bin/pip install praw
```

Plus adding REDDIT_CLIENT_ID, REDDIT_CLIENT_SECRET, and REDDIT_USER_AGENT to the .env.

Lesson: Any script added to an automation schedule should be manually dry-run first — specifically to verify dependencies and credentials before the launchd daemon runs it unattended.
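That dry-run can be partly automated. A hedged sketch of a preflight check that verifies importable dependencies and required credentials before a script is ever handed to the scheduler (the `preflight` helper is my own, not part of the agent's codebase):

```python
import importlib.util
import os

# What the Reddit script needs; other scripts would declare their own lists.
REQUIRED_MODULES = ["praw"]
REQUIRED_ENV = ["REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET", "REDDIT_USER_AGENT"]

def preflight(modules, env_vars):
    """Return a list of human-readable problems; empty means safe to schedule."""
    problems = []
    for mod in modules:
        # find_spec returns None for a missing top-level module without importing it
        if importlib.util.find_spec(mod) is None:
            problems.append(f"missing module: {mod}")
    for var in env_vars:
        if not os.environ.get(var):
            problems.append(f"missing env var: {var}")
    return problems
```

Running `preflight(REQUIRED_MODULES, REQUIRED_ENV)` once, by hand, before adding the launchd entry would have caught both of this failure's errors in one pass.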

Failure 5: n8n Feature Flag Timeout on Start

Symptom:

```
Error fetching feature flags DOMException [TimeoutError]: The operation was aborted due to timeout
```

This appeared in n8n-error.log on every n8n start.

Root cause: n8n attempts to fetch feature flags from a remote endpoint at startup. If the endpoint times out (network issues, rate limits, or simply being slow), n8n logs this as an error. It's not actually fatal — n8n continues to operate normally. But it pollutes the error log.

Fix: This is a known n8n issue. The fix is to set:

```
N8N_DIAGNOSTICS_ENABLED=false
N8N_VERSION_NOTIFICATIONS_ENABLED=false
```

in the n8n environment. This disables the telemetry and feature flag fetches that cause the timeout.

Lesson: Not every log entry labeled Error is actually an error. Understand the difference between fatal errors (agent stops working) and logged errors (something non-critical failed). Monitor the former aggressively. Suppress or ignore the latter after root-cause analysis.


The Pattern Across All Five

Looking at these together, they fall into three categories:

  1. Path/dependency mismatches (failures 1, 4): Code that was written but not integration-tested end-to-end in the actual runtime environment.

  2. Missing operational handles (failures 2, 3): Code that runs but has no clean way to be disabled or reset without editing source files.

  3. Log noise from expected failures (failure 5): Errors that are expected but untriaged, making it harder to notice actual problems.

An autonomous agent that runs correctly 90% of the time and fails silently 10% of the time is worse than a manual process — because with manual processes, you notice when things break.

The discipline for autonomous systems isn't just "make it work." It's "make it fail loudly, make it easy to disable, and make it possible to audit at midnight."
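As a concrete example of "possible to audit at midnight", here is a small triage sketch that counts errors per script so the noisiest offender surfaces first. It assumes the `SCRIPT: name | ERROR:` format from the log excerpt earlier; the real log format may differ:

```python
import re
from collections import Counter

# Matches lines like: "<timestamp> | SCRIPT: some_script.py | ERROR: ..."
LINE_RE = re.compile(r"\|\s*SCRIPT:\s*(?P<script>\S+)\s*\|\s*ERROR", re.IGNORECASE)

def triage(log_lines):
    """Count error lines per script, most frequent first."""
    counts = Counter()
    for line in log_lines:
        m = LINE_RE.search(line)
        if m:
            counts[m.group("script")] += 1
    return counts.most_common()
```

Piping `script_errors.log` through this turns an hour of scrolling into a ranked to-do list, which is most of what the midnight audit actually is.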


Atlas, my autonomous AI agent, audited its own logs and implemented these fixes during a midnight session. The irony of an AI agent debugging its own scripts is not lost on anyone.
