My automation system crashed 47 times in the first week.
Not because the code was bad. Because the real world is hostile to long-running processes.
Tokens expire. APIs rate-limit you. DNS resolves wrong. Memory leaks. Browser tabs zombie out. SSH connections drop at 3 AM because your ISP decided to "optimize" something.
I spent a week making it work. Then I spent three weeks making it stay working.
Here's every pattern I used to turn a fragile mess into something that's been running for 3 weeks without manual intervention.
The Problem: Scripts That Work Once
If you've ever written automation, you know this feeling:
✅ Run 1: Perfect
✅ Run 2: Perfect
❌ Run 3: "Token expired"
❌ Run 4: "Rate limited"
❌ Run 5: "ECONNRESET"
❌ Run 6: "Out of memory"
The script itself is fine. The environment is the enemy.
My setup: a Mac Mini running 24/7, executing cron jobs every 15-30 minutes across 6 different businesses — YouTube uploads, blog publishing, product management, web scraping, email monitoring, and more.
Each task touches 3-5 external services. Each service has its own failure mode.
Pattern 1: Exponential Backoff with Jitter
The first instinct when something fails: retry immediately.
Don't.
```python
import random
import time

def retry_with_backoff(fn, max_retries=5, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt)
            jitter = random.uniform(0, delay * 0.5)
            time.sleep(delay + jitter)
```
The jitter is critical. Without it, if 3 cron jobs all fail at the same time, they all retry at the same time, hit the same rate limit, and fail again. Thundering herd problem.
With jitter, retries spread out naturally. My rate limit violations dropped from ~12/day to ~1/week.
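To get a concrete feel for how the schedule spreads out, here's a small sketch (mine, not from the production system) that tabulates the delay formula above for five attempts:

```python
def backoff_schedule(max_retries=5, base_delay=1.0):
    """Return (base_delay, max_delay_with_jitter) pairs for each attempt."""
    schedule = []
    for attempt in range(max_retries):
        delay = base_delay * (2 ** attempt)
        # Jitter adds up to 50% on top of the base delay
        schedule.append((delay, delay * 1.5))
    return schedule

for base, worst in backoff_schedule():
    print(f"wait between {base:.0f}s and {worst:.1f}s")
```

Two jobs that fail at the same moment now land anywhere in a 1–1.5s window on the first retry, and anywhere in a 16–24s window by the fifth, so they stop colliding.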
Pattern 2: Token Refresh Before Expiry
The obvious approach:
- Call API
- Get "token expired" error
- Refresh token
- Retry
The smarter approach:
- Check token expiry before calling API
- If expiring within 5 minutes, refresh proactively
- Call API with fresh token
- Never see the error
```python
from datetime import datetime, timedelta
from google.auth.transport.requests import Request

def get_valid_token(creds_path):
    creds = load_credentials(creds_path)
    # Refresh 5 minutes before actual expiry
    if creds.expiry and creds.expiry < datetime.utcnow() + timedelta(minutes=5):
        creds.refresh(Request())
        save_credentials(creds_path, creds)
    return creds.token
```
This eliminated ~80% of my "random" failures. Most weren't random at all — they were predictable token expirations happening during long-running tasks.
Pattern 3: The Heartbeat Monitor
My system runs as background cron jobs. How do I know if something silently died?
Heartbeats. Every 30 minutes, the system checks:
- Are all expected cron jobs still scheduled?
- When did each job last run successfully?
- Is memory usage below threshold?
- Are external APIs responding?
If a job hasn't run in 2x its expected interval, something's wrong.
```
# heartbeat-state.json
{
  "lastChecks": {
    "youtube-engine": 1711382400,
    "blog-publisher": 1711380600,
    "email-monitor": 1711382100
  },
  "thresholds": {
    "youtube-engine": 3600,    # 1 hour
    "blog-publisher": 86400,   # 24 hours
    "email-monitor": 1800      # 30 minutes
  }
}
```
The heartbeat caught 3 "silent deaths" in the first week alone — jobs that crashed without error messages, just... stopped.
Pattern 4: Graceful Degradation
Not all failures are equal. My ranking:
- Critical: Token expired → refresh immediately, retry
- Recoverable: Rate limited → back off, retry later
- Degraded: One API down → skip that step, continue others
- Fatal: Disk full, out of memory → alert human, stop
The key insight: don't let one failure cascade.
If YouTube's API is down, that shouldn't stop my blog publisher, email monitor, or web scraper. Each task runs in isolation with its own error boundary.
```python
def run_engine(engine_name, engine_fn):
    try:
        result = engine_fn()
        log_success(engine_name, result)
    except RateLimitError:
        log_warn(f"{engine_name}: rate limited, will retry next cycle")
    except TokenExpiredError:
        refresh_token(engine_name)
        log_warn(f"{engine_name}: token refreshed, will retry next cycle")
    except Exception as e:
        log_error(f"{engine_name}: unexpected error: {e}")
        # Don't re-raise — let other engines continue
```
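To see the isolation in action, here's a self-contained sketch (with stand-in logging and a simplified wrapper, since the real one depends on custom exception types) where one failing engine doesn't stop the rest:

```python
def run_isolated(engine_name, engine_fn, log):
    # Error boundary: catch everything so one engine can't kill the cycle
    try:
        result = engine_fn()
        log.append(f"{engine_name}: ok ({result})")
    except Exception as e:
        log.append(f"{engine_name}: error: {e}")

def youtube_engine():
    raise ConnectionError("API down")

def blog_publisher():
    return "2 posts published"

log = []
for name, fn in [("youtube-engine", youtube_engine), ("blog-publisher", blog_publisher)]:
    run_isolated(name, fn, log)
# youtube-engine fails, blog-publisher still runs
```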
Pattern 5: The "Last Known Good" State
Every successful run saves its state. Every failed run falls back to the last known good state.
This is especially important for browser automation:
```python
# Before automation
save_state({
    "cookies": browser.cookies(),
    "page_url": page.url,
    "progress": current_step
})

# On failure
def recover():
    state = load_last_good_state()
    browser.set_cookies(state["cookies"])
    browser.goto(state["page_url"])
    # Resume from last checkpoint, not from scratch
```
Re-doing 20 minutes of work because step 19 failed? That's not just slow — it increases the chance of hitting rate limits or detection systems.
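The same checkpointing idea works outside the browser too. Here's a generic sketch (file name and structure are illustrative) that resumes a multi-step task from the last completed step:

```python
import json
import os
import tempfile

CHECKPOINT = os.path.join(tempfile.gettempdir(), "task-checkpoint.json")

def run_steps(steps):
    # Load last known good progress, defaulting to step 0
    done = 0
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            done = json.load(f)["progress"]
    executed = []
    for i, step in enumerate(steps):
        if i < done:
            continue  # already completed in a previous run
        executed.append(step())
        # Save state after every successful step, not just at the end
        with open(CHECKPOINT, "w") as f:
            json.dump({"progress": i + 1}, f)
    return executed
```

If the process dies at step 19 of 20, the next run skips straight to step 19 instead of replaying everything.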
Pattern 6: Time-Aware Execution
Different times = different rules.
```python
from datetime import datetime

def can_execute(action_type):
    hour = datetime.now().hour
    if action_type in ["publish", "push", "post"]:
        # Only during business hours (9 AM - 10 PM)
        return 9 <= hour < 22
    if action_type in ["research", "build", "prepare"]:
        # Anytime — these are internal
        return True
    return False
```
Why? Because:
- Publishing at 3 AM looks automated (because it is)
- Daily API quotas reset at midnight; exhaust one in the morning and everything downstream waits until tomorrow
- Some platforms flag accounts that are active 24/7
My system does research and preparation at night, execution during the day. It looks like a productive human, not a bot.
Pattern 7: Circuit Breakers
If an API fails 3 times in a row, stop trying for a while.
```python
import time

class CircuitBreaker:
    def __init__(self, threshold=3, cooldown=300):
        self.failures = 0
        self.threshold = threshold
        self.cooldown = cooldown  # seconds
        self.last_failure = None

    def can_execute(self):
        if self.failures >= self.threshold:
            elapsed = time.time() - self.last_failure
            if elapsed < self.cooldown:
                return False
            self.failures = 0  # Reset after cooldown
        return True

    def record_failure(self):
        self.failures += 1
        self.last_failure = time.time()

    def record_success(self):
        self.failures = 0
```
Without this, a dead API would burn through all my retry attempts every 15 minutes, generating noise in logs and wasting resources.
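Wiring a breaker in front of an API call looks something like this (the class is restated in trimmed form so the sketch runs on its own; `call_api` and `dead_api` are illustrative names):

```python
import time

class CircuitBreaker:
    def __init__(self, threshold=3, cooldown=300):
        self.failures = 0
        self.threshold = threshold
        self.cooldown = cooldown
        self.last_failure = None

    def can_execute(self):
        if self.failures >= self.threshold:
            if time.time() - self.last_failure < self.cooldown:
                return False
            self.failures = 0
        return True

    def record_failure(self):
        self.failures += 1
        self.last_failure = time.time()

    def record_success(self):
        self.failures = 0

breaker = CircuitBreaker(threshold=3, cooldown=300)

def call_api(fn):
    # Skip the call entirely while the breaker is open
    if not breaker.can_execute():
        return None
    try:
        result = fn()
        breaker.record_success()
        return result
    except Exception:
        breaker.record_failure()
        return None

def dead_api():
    raise ConnectionError("service unavailable")

for _ in range(5):
    call_api(dead_api)
# After 3 failures the breaker opens; calls 4 and 5 never touch the API
```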
The Result: 3 Weeks and Counting
Before self-healing:
- Average uptime: 6 hours before manual intervention needed
- Daily manual fixes: 3-5
- Lost tasks: ~30% (ran but produced no output due to silent failures)
After:
- Uptime: 3+ weeks continuous
- Manual fixes: ~1 per week (genuine edge cases)
- Lost tasks: <2%
The system isn't perfect. I still get alerts. But the alerts are for things that actually need human judgment, not for expired tokens or temporary rate limits.
What I'd Do Differently
- Start with logging, not with code. Structured JSON logs from day 1 would have saved me hours of debugging.
- Use a state machine, not if/else chains. Each task should have clear states (idle → running → success/failure → cooldown).
- Test failure modes explicitly. I wrote tests for "what if the API returns 200?" but not "what if the API returns nothing for 30 seconds?"
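The state-machine idea can be sketched with an Enum and an explicit transition table (the states and transitions here are my reading of the list above, not the system's actual code):

```python
from enum import Enum, auto

class TaskState(Enum):
    IDLE = auto()
    RUNNING = auto()
    SUCCESS = auto()
    FAILURE = auto()
    COOLDOWN = auto()

# Legal transitions; anything else raises instead of silently drifting
TRANSITIONS = {
    TaskState.IDLE: {TaskState.RUNNING},
    TaskState.RUNNING: {TaskState.SUCCESS, TaskState.FAILURE},
    TaskState.SUCCESS: {TaskState.IDLE},
    TaskState.FAILURE: {TaskState.COOLDOWN},
    TaskState.COOLDOWN: {TaskState.IDLE},
}

def transition(current, new):
    if new not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.name} -> {new.name}")
    return new

state = TaskState.IDLE
state = transition(state, TaskState.RUNNING)
state = transition(state, TaskState.FAILURE)
state = transition(state, TaskState.COOLDOWN)
```

The payoff over if/else chains is that an impossible jump (say, IDLE straight to SUCCESS) fails loudly at the transition, not three bugs later.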
Key Takeaway
Building something that works is the easy part. Building something that keeps working is the real engineering.
Most tutorials teach you how to call an API. None teach you what to do when that API goes down at 3 AM on a Saturday while you're asleep and 5 other tasks are waiting for its output.
That's where the real work is.
I've been building automation systems full-time on a Mac Mini. Some of the tools and checklists I use are available for free:
- 📘 The $0 Developer Playbook — 6 complete checklists for shipping projects with zero budget
- 🧰 Indie Dev Complete Toolkit — Sprint planning, bug tracking, marketing, and launch checklists in one package
Both are free (pay-what-you-want). Built from the same systems described in this post.