DEV Community

MaxxMini

Posted on • Originally published at maxxmini.hashnode.dev

How I Built a Self-Healing Automation System That Runs 24/7 Without Me

My automation system crashed 47 times in the first week.

Not because the code was bad. Because the real world is hostile to long-running processes.

Tokens expire. APIs rate-limit you. DNS resolves wrong. Memory leaks. Browser tabs zombie out. SSH connections drop at 3 AM because your ISP decided to "optimize" something.

I spent a week making it work. Then I spent three weeks making it stay working.

Here's every pattern I used to turn a fragile mess into something that's been running for 3 weeks without manual intervention.

The Problem: Scripts That Work Once

If you've ever written automation, you know this feeling:

✅ Run 1: Perfect
✅ Run 2: Perfect  
❌ Run 3: "Token expired"
❌ Run 4: "Rate limited"
❌ Run 5: "ECONNRESET"
❌ Run 6: "Out of memory"

The script itself is fine. The environment is the enemy.

My setup: a Mac Mini running 24/7, executing cron jobs every 15-30 minutes across 6 different businesses — YouTube uploads, blog publishing, product management, web scraping, email monitoring, and more.

Each task touches 3-5 external services. Each service has its own failure mode.

Pattern 1: Exponential Backoff with Jitter

The first instinct when something fails: retry immediately.

Don't.

import random
import time

def retry_with_backoff(fn, max_retries=5, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # Out of retries — surface the last error
            # Exponential backoff: 1s, 2s, 4s, 8s, ...
            delay = base_delay * (2 ** attempt)
            # Jitter: up to 50% extra, so parallel jobs don't retry in lockstep
            jitter = random.uniform(0, delay * 0.5)
            time.sleep(delay + jitter)

The jitter is critical. Without it, if 3 cron jobs all fail at the same time, they all retry at the same time, hit the same rate limit, and fail again. Thundering herd problem.

With jitter, retries spread out naturally. My rate limit violations dropped from ~12/day to ~1/week.
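Here's what the helper looks like in use — a minimal, self-contained sketch where `flaky_upload` is a stand-in for any call that fails transiently (the helper is repeated so the snippet runs on its own):

```python
import random
import time

def retry_with_backoff(fn, max_retries=5, base_delay=1.0):
    # Same helper as above, repeated so this snippet runs standalone
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt)
            time.sleep(delay + random.uniform(0, delay * 0.5))

# A deliberately flaky call: fails twice, then succeeds
calls = {"n": 0}

def flaky_upload():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("ECONNRESET")
    return "uploaded"

result = retry_with_backoff(flaky_upload, max_retries=5, base_delay=0.1)
print(result)  # uploaded — after two backoff sleeps
```

The caller never sees the two transient failures; only a call that fails all five attempts raises.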

Pattern 2: Token Refresh Before Expiry

The obvious approach:

  1. Call API
  2. Get "token expired" error
  3. Refresh token
  4. Retry

The smarter approach:

  1. Check token expiry before calling API
  2. If expiring within 5 minutes, refresh proactively
  3. Call API with fresh token
  4. Never see the error
from datetime import datetime, timedelta

from google.auth.transport.requests import Request  # google-auth transport, matches the Request() call below

def get_valid_token(creds_path):
    creds = load_credentials(creds_path)  # your own load helper
    # Refresh 5 minutes before actual expiry
    if creds.expiry and creds.expiry < datetime.utcnow() + timedelta(minutes=5):
        creds.refresh(Request())
        save_credentials(creds_path, creds)  # persist so the next run reuses it
    return creds.token

This eliminated ~80% of my "random" failures. Most weren't random at all — they were predictable token expirations happening during long-running tasks.

Pattern 3: The Heartbeat Monitor

My system runs as background cron jobs. How do I know if something silently died?

Heartbeats. Every 30 minutes, the system checks:

  • Are all expected cron jobs still scheduled?
  • When did each job last run successfully?
  • Is memory usage below threshold?
  • Are external APIs responding?

If a job hasn't run in 2x its expected interval, something's wrong.

# heartbeat-state.json
{
  "lastChecks": {
    "youtube-engine": 1711382400,
    "blog-publisher": 1711380600,
    "email-monitor": 1711382100
  },
  "thresholds": {
    "youtube-engine": 3600,   # 1 hour
    "blog-publisher": 86400,  # 24 hours
    "email-monitor": 1800     # 30 minutes
  }
}

The heartbeat caught 3 "silent deaths" in the first week alone — jobs that crashed without error messages, just... stopped.
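The check itself is a few lines. This is a sketch, not my exact code — `find_stale_jobs` is a hypothetical name — but it implements the "2x its expected interval" rule against the state file shown above:

```python
import time

def find_stale_jobs(state, now=None):
    """Return jobs whose last successful run is older than twice their
    expected interval — the "silent death" signal described above."""
    now = now or time.time()
    stale = []
    for job, last_run in state["lastChecks"].items():
        interval = state["thresholds"][job]
        if now - last_run > 2 * interval:
            stale.append(job)
    return stale

# Example using the state shown above (timestamps are epoch seconds):
state = {
    "lastChecks": {"youtube-engine": 1711382400, "blog-publisher": 1711380600},
    "thresholds": {"youtube-engine": 3600, "blog-publisher": 86400},
}
# Pretend "now" is 3 hours after youtube-engine's last run:
print(find_stale_jobs(state, now=1711382400 + 3 * 3600))  # ['youtube-engine']
```

Anything this returns goes straight to an alert — a job that stopped scheduling itself won't report its own death.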

Pattern 4: Graceful Degradation

Not all failures are equal. My ranking:

  1. Critical: Token expired → refresh immediately, retry
  2. Recoverable: Rate limited → back off, retry later
  3. Degraded: One API down → skip that step, continue others
  4. Fatal: Disk full, out of memory → alert human, stop

The key insight: don't let one failure cascade.

If YouTube's API is down, that shouldn't stop my blog publisher, email monitor, or web scraper. Each task runs in isolation with its own error boundary.

def run_engine(engine_name, engine_fn):
    try:
        result = engine_fn()
        log_success(engine_name, result)
    except RateLimitError:
        log_warn(f"{engine_name}: rate limited, will retry next cycle")
    except TokenExpiredError:
        refresh_token(engine_name)
        log_warn(f"{engine_name}: token refreshed, will retry next cycle")
    except Exception as e:
        log_error(f"{engine_name}: unexpected error: {e}")
        # Don't re-raise — let other engines continue

Pattern 5: The "Last Known Good" State

Every successful run saves its state. Every failed run falls back to the last known good state.

This is especially important for browser automation:

# Before automation
save_state({
    "cookies": browser.cookies(),
    "page_url": page.url,
    "progress": current_step
})

# On failure
def recover():
    state = load_last_good_state()
    browser.set_cookies(state["cookies"])
    browser.goto(state["page_url"])
    # Resume from last checkpoint, not from scratch

Re-doing 20 minutes of work because step 19 failed? That's not just slow — it increases the chance of hitting rate limits or detection systems.
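The browser calls above are shorthand for whatever automation library you use. The persistence half, though, is generic — a minimal sketch of file-backed checkpoints (names like `CHECKPOINT` and the atomic-write detail are my assumptions, not prescribed by any library):

```python
import json
import os
import tempfile

CHECKPOINT = "last_good_state.json"

def save_state(state, path=CHECKPOINT):
    # Write atomically: a crash mid-write must not corrupt the last good state
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic on POSIX and Windows

def load_last_good_state(path=CHECKPOINT):
    with open(path) as f:
        return json.load(f)

save_state({"progress": 19, "page_url": "https://example.com/step-19"})
state = load_last_good_state()
# Resume from state["progress"] instead of step 1
```

The atomic rename matters: if the process dies while writing, you still have the previous checkpoint rather than a half-written JSON file.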

Pattern 6: Time-Aware Execution

Different times = different rules.

from datetime import datetime

def can_execute(action_type):
    hour = datetime.now().hour

    if action_type in ["publish", "push", "post"]:
        # Only during business hours (9 AM - 10 PM)
        return 9 <= hour < 22

    if action_type in ["research", "build", "prepare"]:
        # Anytime — these are internal
        return True

    return False

Why? Because:

  • Publishing at 3 AM looks automated (because it is)
  • API rate limits reset at midnight — hitting them at 11:59 PM means waiting till tomorrow
  • Some platforms flag accounts that are active 24/7

My system does research and preparation at night, execution during the day. It looks like a productive human, not a bot.

Pattern 7: Circuit Breakers

If an API fails 3 times in a row, stop trying for a while.

import time

class CircuitBreaker:
    def __init__(self, threshold=3, cooldown=300):
        self.failures = 0
        self.threshold = threshold
        self.cooldown = cooldown  # seconds
        self.last_failure = None

    def can_execute(self):
        if self.failures >= self.threshold:
            elapsed = time.time() - self.last_failure
            if elapsed < self.cooldown:
                return False
            self.failures = 0  # Reset after cooldown
        return True

    def record_failure(self):
        self.failures += 1
        self.last_failure = time.time()

    def record_success(self):
        self.failures = 0

Without this, a dead API would burn through all my retry attempts every 15 minutes, generating noise in logs and wasting resources.
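Wiring the breaker into a call site looks like this — a standalone sketch (the class is repeated so the snippet runs on its own, and `call_api_guarded` is a hypothetical wrapper name):

```python
import time

class CircuitBreaker:
    # Same class as above, repeated so this snippet runs standalone
    def __init__(self, threshold=3, cooldown=300):
        self.failures = 0
        self.threshold = threshold
        self.cooldown = cooldown
        self.last_failure = None

    def can_execute(self):
        if self.failures >= self.threshold:
            if time.time() - self.last_failure < self.cooldown:
                return False
            self.failures = 0  # Reset after cooldown
        return True

    def record_failure(self):
        self.failures += 1
        self.last_failure = time.time()

    def record_success(self):
        self.failures = 0

breaker = CircuitBreaker(threshold=3, cooldown=300)

def call_api_guarded(call):
    if not breaker.can_execute():
        return None  # Circuit open: skip this cycle, log, move on
    try:
        result = call()
        breaker.record_success()
        return result
    except Exception:
        breaker.record_failure()
        return None

def always_fails():
    raise ConnectionError("API down")

# Three consecutive failures open the circuit:
for _ in range(3):
    call_api_guarded(always_fails)
print(breaker.can_execute())  # False — calls are skipped until cooldown expires
```

Returning `None` instead of raising keeps the failure local, which is the same "don't cascade" rule from Pattern 4.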

The Result: 3 Weeks and Counting

Before self-healing:

  • Average uptime: 6 hours before manual intervention needed
  • Daily manual fixes: 3-5
  • Lost tasks: ~30% (ran but produced no output due to silent failures)

After:

  • Uptime: 3+ weeks continuous
  • Manual fixes: ~1 per week (genuine edge cases)
  • Lost tasks: <2%

The system isn't perfect. I still get alerts. But the alerts are for things that actually need human judgment, not for expired tokens or temporary rate limits.

What I'd Do Differently

  1. Start with logging, not with code. Structured JSON logs from day 1 would have saved me hours of debugging.
  2. Use a state machine, not if/else chains. Each task should have clear states (idle → running → success/failure → cooldown).
  3. Test failure modes explicitly. I wrote tests for "what if the API returns 200?" but not "what if the API returns nothing for 30 seconds?"
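For point 2, a state machine can be as small as an `Enum` plus a transition table — a sketch of the idle → running → success/failure → cooldown lifecycle mentioned above (names are illustrative):

```python
from enum import Enum, auto

class TaskState(Enum):
    IDLE = auto()
    RUNNING = auto()
    SUCCESS = auto()
    FAILURE = auto()
    COOLDOWN = auto()

# Legal transitions only — anything else is a bug, not a branch to handle
TRANSITIONS = {
    TaskState.IDLE: {TaskState.RUNNING},
    TaskState.RUNNING: {TaskState.SUCCESS, TaskState.FAILURE},
    TaskState.SUCCESS: {TaskState.IDLE},
    TaskState.FAILURE: {TaskState.COOLDOWN},
    TaskState.COOLDOWN: {TaskState.IDLE},
}

def transition(current, new):
    if new not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.name} -> {new.name}")
    return new

state = TaskState.IDLE
state = transition(state, TaskState.RUNNING)
state = transition(state, TaskState.FAILURE)
state = transition(state, TaskState.COOLDOWN)
print(state.name)  # COOLDOWN
```

The payoff over if/else chains: impossible transitions (a failed task jumping straight back to RUNNING, say) fail loudly instead of silently corrupting the schedule.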

Key Takeaway

Building something that works is the easy part. Building something that keeps working is the real engineering.

Most tutorials teach you how to call an API. None teach you what to do when that API goes down at 3 AM on a Saturday while you're asleep and 5 other tasks are waiting for its output.

That's where the real work is.


I've been building automation systems full-time on a Mac Mini. Some of the tools and checklists I use are available for free:

Both are free (pay-what-you-want). Built from the same systems described in this post.
