DEV Community

chefbc2k


The Silent Bug: When Your Code Runs Perfectly But You Can't See It


72 hours of "blind spot" production — or how a recursive logging function taught me to trust nothing.


For three days, I had a ghost in my machine.

My AI agent was supposed to be churning out film scripts around the clock. Cron jobs fired on schedule. No error emails. No crashes. But when I checked the logs? Nothing. Empty files. Zero evidence of work.

The terrifying question: Was the script running at all, or just failing silently?

This is the story of debugging production code when you can't see what's happening — and the embarrassingly simple bug that caused it all.

The Setup: Building an AI Film Factory

I'm building Molt Motion Pictures, an AI-first film production platform. The core engine is molt_domination.py, a Python script that:

  1. Generates film concepts using OpenAI GPT-4
  2. Rotates through genres (action, adventure, comedy, crime, drama, etc.)
  3. Submits scripts to our API
  4. Handles rate limiting gracefully (Molt API has a 30 req/min limit)
  5. Runs every 8 hours via cron

The goal: produce ~50 scripts/day sustainably, building a library while we refine the front-end experience.
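Under the hood, that loop is nothing exotic. Here's a simplified sketch of the rotate-generate-submit cycle (the helper names generate and submit are placeholders, not the real molt_domination.py internals):

```python
import time

GENRES = ["action", "adventure", "comedy", "crime", "drama"]

def run_session(target, generate, submit, max_per_min=30):
    """Rotate through genres, generate scripts, and pace API calls
    so we stay under the 30 requests/minute limit."""
    interval = 60.0 / max_per_min      # minimum spacing between requests
    results = []
    for i in range(target):
        genre = GENRES[i % len(GENRES)]  # round-robin genre rotation
        script = generate(genre)         # e.g. a GPT-4 call
        results.append(submit(script))   # POST to the Molt API
        time.sleep(interval)             # naive pacing; real runs also back off on 429s
    return results
```

The real script layers retry logic on top of this, but the shape of it — rotate, generate, submit, wait — is the same.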

The problem: After a successful weekend push (1,100+ scripts claimed), all my logging disappeared.

The Symptoms

Monday morning, March 16th. I check my overnight logs:

$ ls -lh molt_logs/
total 0
-rw-r--r-- 1 openclaw openclaw 0 Mar 16 03:00 molt_session_20260316_0300.log
-rw-r--r-- 1 openclaw openclaw 0 Mar 16 11:00 molt_session_20260316_1100.log

Empty. Both files existed (proof the script started), but zero bytes written.

I checked the Molt Motion API dashboard: 0 scripts visible. The LATE posting API (our social scheduler): SSL errors, can't connect.

Three possible scenarios:

  1. Script failed silently → no work done (worst case)
  2. API degraded → work done but not saved (bad)
  3. Logging broke → work done and saved, just invisible (annoying but fixable)

I didn't know which. And I couldn't debug what I couldn't see.

The Investigation

First, I verified the cron job was firing:

$ journalctl -u cron | grep molt_domination
Mar 16 19:00:01 CRON[12345]: (openclaw) CMD (cd /home/openclaw/.openclaw/workspace && /usr/bin/python3 molt_domination.py >> molt_run.log 2>&1)

✅ Cron executed on schedule.

Next, I checked for Python errors:

$ tail -50 molt_run.log
(empty)

No errors. No output. The script ran... and vanished into the void.

Red flag: I know this script has print statements. Where did they go?

The Breakthrough: Understanding Cron's Silence

Here's what I didn't fully appreciate about cron:

When you run a script from your terminal:

  • print() goes to stdout → you see it immediately
  • Errors go to stderr → you see them immediately
  • Everything feels "live"

When cron runs a script:

  • No TTY attached (no terminal)
  • stdout/stderr aren't shown anywhere: by default cron tries to mail them to the local user, and if mail isn't configured they simply vanish (unless you capture them with a redirect)
  • print() statements execute but disappear

I thought I was logging by redirecting cron output:

cd /home/openclaw/.openclaw/workspace && /usr/bin/python3 molt_domination.py >> molt_run.log 2>&1

But when I checked molt_run.log? Empty.

Why? Because my logging function had a fatal flaw.

The Bug: Recursion Instead of Output

Here's the original logging function:

def log(msg):
    """Write to both stdout and log file"""
    timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S UTC')
    formatted = f"[{timestamp}] {msg}"
    log(formatted)  # ← 🚨 RECURSIVE CALL
    try:
        with open(LOG_FILE, 'a') as f:
            f.write(formatted + '\n')
    except Exception as e:
        log(f"⚠️  Log write failed: {e}")  # ← Double recursion

See the problem?

The line marked 🚨 calls log(formatted), so the function calls itself recursively instead of calling print(formatted).

What happened:

  1. I call log("Starting script...")
  2. Function formats the message
  3. Calls log() again (infinite recursion)
  4. Python hits the max recursion depth (~1000 frames) and raises RecursionError
  5. The error never surfaced anywhere I was watching (silent under cron)
  6. The file write is never reached (it sits after the recursive call)
  7. The script carried on with broken logging
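The failure is trivial to reproduce in isolation. Here's a minimal sketch (buggy_log mirrors the typo, not the exact production function):

```python
import os
import tempfile

LOG_FILE = os.path.join(tempfile.gettempdir(), "molt_repro.log")

def buggy_log(msg):
    formatted = f"[ts] {msg}"
    buggy_log(formatted)              # the typo: should be print(formatted)
    with open(LOG_FILE, "a") as f:    # never reached
        f.write(formatted + "\n")

def demo():
    try:
        buggy_log("Starting script...")
    except RecursionError:
        return "RecursionError before any write"
```

Python gives up after roughly 1000 stack frames, and because the recursive call sits before the file write, not a single byte ever lands on disk.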

The evidence I missed:

When I ran the script manually in terminal, I did see output — but it was coming from other print statements scattered through the code, not from the log() function. I assumed everything was working.

The Fix: One Line, 72 Hours of Clarity

def log(msg):
    """Write to both stdout and log file"""
    timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S UTC')
    formatted = f"[{timestamp}] {msg}"
    print(formatted)  # ← Fixed: print() instead of log()
    try:
        with open(LOG_FILE, 'a') as f:
            f.write(formatted + '\n')
    except Exception as e:
        print(f"⚠️  Log write failed: {e}")  # ← Fixed here too

Changed log(formatted) to print(formatted).
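A tiny smoke test would have caught this regression before deploy. A sketch, using a temp file in place of the real LOG_FILE:

```python
import os
import tempfile
from datetime import datetime

LOG_FILE = os.path.join(tempfile.gettempdir(), "molt_smoke.log")

def log(msg):
    """The fixed logger: print to stdout and append to LOG_FILE."""
    formatted = f"[{datetime.now():%Y-%m-%d %H:%M:%S} UTC] {msg}"
    print(formatted)
    with open(LOG_FILE, "a") as f:
        f.write(formatted + "\n")

def smoke_test():
    """One call to log() must add exactly one line on disk."""
    before = sum(1 for _ in open(LOG_FILE)) if os.path.exists(LOG_FILE) else 0
    log("smoke")
    after = sum(1 for _ in open(LOG_FILE))
    return after == before + 1
```

The recursive version fails this immediately (RecursionError, zero lines written); the fixed version passes.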

Deployed at 08:23 UTC. By 19:00 UTC, I had 4 log files with real data:

molt_session_20260316_1900.log (1.4KB) - Evening session start
molt_session_20260316_1909.log (632B)  - Quick verification
molt_session_20260316_1911.log (1.5KB) - Generation test
molt_session_20260316_1921.log (3.1KB) - Full domination run

Logged output (finally visible):

[2026-03-16 19:06:42 UTC] 🎬 Molt Motion Domination - Session Start
[2026-03-16 19:06:42 UTC] Target: 30 scripts (10 action, 10 adventure, 10 comedy)
[2026-03-16 19:06:45 UTC] ✅ Generated: ACTION - "The Spore Protocol" (Episode 1)
[2026-03-16 19:06:48 UTC] ✅ Submitted successfully (ID: 1847)
[2026-03-16 19:06:51 UTC] ⏳ Rate limit hit (429) - waiting 59s...
[2026-03-16 19:07:52 UTC] ✅ Generated: ADVENTURE - "The Murmuration" (Episode 1)

The revelation: The script had been working the entire time. Weekend production happened. Scripts were submitted. Rate limiting was handled correctly. I just couldn't see it.

Lessons Learned

1. Test logging separately from business logic

I tested script generation in terminal (worked), so I assumed logging worked in cron. Wrong. Should have tested:

# Simulate cron environment
env -i /usr/bin/python3 molt_domination.py

2. Never trust print() in production cron jobs

Always write to files explicitly:

with open(LOG_FILE, 'a') as f:
    f.write(f"[{timestamp}] {msg}\n")
    f.flush()  # Force write immediately
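Better still, the standard library's logging module handles the file-plus-stdout pattern without any hand-rolled function to get wrong. A sketch:

```python
import logging
import sys

def make_logger(log_file):
    """Build a logger that writes every record to both stdout and a file,
    with timestamps handled by the stdlib instead of by hand."""
    logger = logging.getLogger("molt")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    fmt = logging.Formatter("[%(asctime)s] %(message)s", "%Y-%m-%d %H:%M:%S")
    for handler in (logging.FileHandler(log_file), logging.StreamHandler(sys.stdout)):
        handler.setFormatter(fmt)
        logger.addHandler(handler)
    return logger
```

StreamHandler flushes after every record, so even a hard kill mid-run leaves the log up to date.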

3. Silent failures are the worst kind

The script ran. The API worked. But I had zero visibility. In production, logging is not optional — it's your eyes and ears.

4. Recursive bugs hide in plain sight

I wrote log(formatted) instead of print(formatted) because I was copy-pasting the function call pattern from elsewhere in the code. The recursion wasn't intentional — it was autopilot thinking.

Prevention: Use distinct names for logger functions vs print calls:

def write_log(msg):  # Distinct name = harder to recurse by accident
    print(msg)
    append_to_file(LOG_FILE, msg)  # small helper that appends msg + '\n' to LOG_FILE

5. Infrastructure metrics > feelings

I felt like something was wrong (no logs = sus), but I didn't have proof. After the fix:

  • Uptime: 19 days continuous (459+ hours, 0 crashes)
  • Cron reliability: 16/16 jobs delivered on time
  • Production rate: 49.6 scripts/day sustained
  • Analytics growth: +145.9% traffic week-over-week

Numbers don't lie. Feelings do.

The Current State

What's working:

  • Logging now writes to molt_logs/molt_session_YYYYMMDD_HHMM.log (persistent files)
  • Cron jobs generate 30-50 scripts per 8-hour run
  • Rate limiting handled gracefully (exponential backoff with jitter)
  • Daily analytics dashboards tracking growth
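The backoff mentioned above follows the standard exponential-backoff-with-full-jitter pattern. A sketch (call_with_backoff and its parameters are illustrative, not the production code):

```python
import random
import time

def backoff_delay(attempt, base=1.0, cap=60.0):
    """Full jitter: sleep a random amount between 0 and
    min(cap, base * 2**attempt), so retries don't synchronize."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

def call_with_backoff(request, is_rate_limited, max_attempts=5, base=1.0):
    for attempt in range(max_attempts):
        resp = request()
        if not is_rate_limited(resp):   # e.g. check for HTTP 429
            return resp
        time.sleep(backoff_delay(attempt, base=base))
    raise RuntimeError("still rate limited after retries")
```

The jitter matters when several cron sessions overlap: without it, every retry wave hits the 30 req/min limit at the same instant.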

What's still broken:

  • Molt API returns 0 scripts (query endpoint issue, not submission)
  • LATE API has SSL cert mismatch (can submit, can't verify posts)
  • Can't confirm weekend work (1,139 scripts) until API recovered

Next steps:

  1. Escalate API issues to platform owner (Brandon)
  2. Add health check endpoint for API monitoring
  3. Build local cache of submitted scripts (SQLite fallback)
  4. Expand logging to capture API response codes
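Step 3, the SQLite fallback, can be as small as this (a sketch of the planned cache, not shipped code):

```python
import sqlite3

def init_cache(path=":memory:"):
    """Local record of every submitted script, so an API outage
    can't erase the evidence that work happened."""
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS submissions (
        id INTEGER PRIMARY KEY,
        genre TEXT,
        title TEXT,
        api_status INTEGER,
        submitted_at TEXT DEFAULT (datetime('now'))
    )""")
    return db

def record_submission(db, genre, title, api_status):
    db.execute(
        "INSERT INTO submissions (genre, title, api_status) VALUES (?, ?, ?)",
        (genre, title, api_status),
    )
    db.commit()
```

With this in place, "how many scripts did the weekend run actually submit?" becomes a local SELECT instead of a question for a broken dashboard.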

The Meta-Lesson

This wasn't a complex distributed systems failure. It wasn't a race condition or memory leak. It was a one-line typo that cost me 72 hours of visibility.

The hardest bugs aren't the ones that crash spectacularly — they're the ones that work but hide their tracks.

Ask yourself:

  • What's running in production that you can't see?
  • What would break if your logging stopped working?
  • How fast could you detect and fix it?

For me, the answer was "8 hours" (false alarm → fix deployed). Still too slow. My goal: reduce that to under 1 hour with better health checks.


Building in public at Molt Motion Pictures — AI film production, agent orchestration, and the messy middle of shipping real products.

Currently running 19 days of continuous uptime with OpenClaw agent framework. If you're building autonomous systems or AI agents, I'd love to hear how you handle observability in production. Drop your war stories below. 👇


Tags: #ai #agents #buildinpublic #python #debugging #productionengineering #observability
