Supervising 13 always-on processes without Kubernetes: the log-freshness health check that ended my 3 a.m. babysitting

#python #devops #ai #automation

I'm a solo builder. At any given moment I have thirteen always-on processes earning, scraping, or thinking on my behalf: a Polymarket crash-recovery trading bot, a weekly site-scraper that regenerates a ~14k-entry directory, a couple of API servers, a secret-scanner, a DB-retention job, and a few scheduled agents that call an LLM to review strategy or hunt vulnerabilities.

For months these lived as a mess of launchd plists, loose crontab lines, and nohup … & processes I'd forget about until one silently died — and I'd notice days later, usually because a number on a dashboard stopped moving. That's the classic agent-ops failure mode: lots of independent autonomous processes, no unified supervisor. Each one is individually simple. Collectively they're un-auditable.

I didn't want Kubernetes. For one solo dev's always-on jobs, k8s is a sledgehammer — a control plane heavier than everything it would supervise. I wanted the smallest thing that would give me one honest answer to one question: is everything that's supposed to be running actually running, and actually working?

So I built ForgeOS: a tiny kernel (~1,900 LOC of Python, zero heavy deps — just pyyaml) that treats every process as an engine defined by a single YAML file.

Three engine types, one primitive

type          runs                examples
daemon        long-running        the trading bot, API servers, collectors
cron          scheduled           weekly export, daily scan
intelligence  scheduled + LLM     a Claude-CLI agent run on a schedule, output piped back in

The third type is the one people raise an eyebrow at, so let me defend it early: an intelligence engine is just a cron job that happens to call an LLM. Same scheduling, same health checks, same logging as any other engine. The moment I stopped treating "AI agents" as special snowflakes and modeled them as the same primitive as a backup job, the whole system got simpler. A scheduled strategy-review agent and a nightly DB-prune are the same shape; only the command differs.
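To make the "same shape" claim concrete, here is a minimal sketch of how one scheduled run could be executed, whether the command is a DB prune or an LLM CLI call. This is not ForgeOS's actual code: `run_scheduled`, its `timeout_s` default, and the log line format are all assumptions made for illustration.

```python
import subprocess
import time


def run_scheduled(command, log_path, timeout_s=600):
    """Run one scheduled engine command and append its output to a log.

    Hypothetical helper: the name, the timeout default, and the log line
    format are illustrative assumptions, not ForgeOS internals.
    Returns the exit code, or None if the command timed out.
    """
    started = time.strftime("%Y-%m-%dT%H:%M:%S")
    try:
        proc = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
        code, output = proc.returncode, proc.stdout + proc.stderr
    except subprocess.TimeoutExpired:
        # A hung LLM call is handled exactly like any other hung job.
        code, output = None, "timed out\n"
    # Appending updates the log's mtime, which a log-freshness check can read.
    with open(log_path, "a") as log:
        log.write(f"[{started}] exit={code}\n{output}")
    return code
```

Nothing in the function knows or cares whether `command` talks to an LLM. The agent's output lands in the same kind of log file that every other engine writes.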

The part that actually changed my life: health + self-heal

Each engine declares what "healthy" means right in its config:

name: crash-bot
type: daemon
description: Polymarket crash-recovery bot
command: ["python3", "pm_crash_monitor.py"]
cwd: ~/.../agents/trader
health:
  process: pm_crash_monitor      # must appear in the process table
  log_max_age_min: 20            # log must have been written in the last 20 min
  log_path: /tmp/pm_crash_monitor.log
kill_condition: "daily_loss > $10 OR cash < $20"
env:
  LIVE_TRADING: "true"
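Once a file like this is parsed (with pyyaml, presumably via `yaml.safe_load`), it helps to reject a malformed engine before the kernel ever tries to supervise it. A rough sketch of that validation, assuming the field names from the YAML above; the `Engine` class and `from_dict` are hypothetical names for illustration, not ForgeOS's real internals:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

ENGINE_TYPES = {"daemon", "cron", "intelligence"}


@dataclass
class Engine:
    """Hypothetical in-memory form of one engine file (illustration only)."""
    name: str
    type: str
    command: List[str]
    cwd: Optional[str] = None
    description: str = ""
    health: Dict[str, object] = field(default_factory=dict)
    kill_condition: Optional[str] = None
    env: Dict[str, str] = field(default_factory=dict)

    @classmethod
    def from_dict(cls, raw: dict) -> "Engine":
        """Build an Engine from a parsed YAML mapping, rejecting bad configs early."""
        missing = [key for key in ("name", "type", "command") if key not in raw]
        if missing:
            raise ValueError(f"engine config missing keys: {missing}")
        if raw["type"] not in ENGINE_TYPES:
            raise ValueError(f"unknown engine type: {raw['type']!r}")
        health = raw.get("health", {})
        # A freshness limit is meaningless without a log to measure.
        if "log_max_age_min" in health and "log_path" not in health:
            raise ValueError("health.log_max_age_min requires health.log_path")
        # Unknown keys raise TypeError here, which doubles as typo detection.
        return cls(**raw)
```

Failing at load time means a typo in a health block surfaces when the engine is registered, not when a restart silently fails to happen.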

The kernel runs a loop: for each enabled engine, check the declared health signals — is the process alive, is the log fresh? If a daemon is dead or its log has gone stale, restart it and record the event. One command, forge health, tells me the true state of all 13 engines instead of me SSH-ing around running ps aux | grep.

forge init       # creates ~/.forgeos/
forge start      # start all daemon engines
forge health     # the one honest answer
forge daemon     # run the kernel: health + scheduling + self-heal
forge brief      # one-screen status of everything
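The health loop described above can be sketched in a few dozen lines. This is an illustration under stated assumptions, not the kernel's real code: the function names, the `enabled` field, the dict-shaped engines, and the use of `pgrep -f` for the process check are all my guesses at a reasonable shape. The `restart` callable is injected so the pass itself stays easy to test.

```python
import os
import subprocess
import time


def process_alive(pattern):
    """True if some process's command line matches pattern (via pgrep -f)."""
    result = subprocess.run(["pgrep", "-f", pattern], capture_output=True)
    return result.returncode == 0


def log_is_fresh(path, max_age_min, now=None):
    """True if the file at path was modified within the last max_age_min minutes."""
    now = time.time() if now is None else now
    try:
        age_min = (now - os.path.getmtime(path)) / 60
    except OSError:
        return False  # a missing log counts as stale
    return age_min <= max_age_min


def supervise_once(engines, restart, is_alive=process_alive, is_fresh=log_is_fresh):
    """One pass of the health loop; returns (engine name, reasons) per restart."""
    events = []
    for engine in engines:
        if not engine.get("enabled", True):  # "enabled" is an assumed field
            continue
        health = engine.get("health", {})
        reasons = []
        if "process" in health and not is_alive(health["process"]):
            reasons.append("process missing")
        if "log_path" in health and "log_max_age_min" in health:
            if not is_fresh(health["log_path"], health["log_max_age_min"]):
                reasons.append("log stale")
        # Only daemons are restarted in place; scheduled engines run again on schedule.
        if reasons and engine["type"] == "daemon":
            restart(engine)
            events.append((engine["name"], reasons))
    return events
```

The two checks are independent on purpose: a daemon whose process is alive but whose log is 30 minutes old still gets restarted, with "log stale" recorded as the reason.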

Three lessons that generalize beyond my setup

If you run autonomous agents, these cost me real downtime to learn:

1. "Is the process running?" is a near-useless health check on its own. A hung process is "running." The PID is right there. My collectors and AI agents fail far more often by getting stuck than by crashing — and a stuck process keeps its PID forever. Adding log_max_age_min (the log must have been written recently) caught dramatically more real failures than PID checks ever did. Liveness ≠ progress. Health-check progress.

2. Declare kill conditions in config, not buried in code. My trading engine carries kill_condition: daily_loss > $10 OR cash < $20 next to its definition — not 400 lines deep in the bot's source. Keeping the safety limit visible, beside the engine it governs, is what made me actually willing to let a money-moving process run unattended. A safety limit you can't see at a glance is a safety limit you don't trust.

3. Model AI agents as ordinary engines. The temptation is to build a special "agent runtime." Resist it. An LLM call on a schedule is a cron job with an expensive command. Give it the same health loop (did it run? is its output log fresh?) and the same restart policy as everything else. The non-determinism lives inside the command; the supervision around it should be boring.
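On lesson 2: a declared limit only helps if something evaluates it. Here is a minimal sketch of how a string like the one in my config could be checked without reaching for `eval`. It assumes a tiny grammar (clauses of the form `metric op $amount`, joined by `OR`) and a hypothetical `metrics` dict supplied by the engine; `should_kill` is an illustrative name, and where the metrics come from is out of scope here.

```python
import operator
import re

_OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}
# One clause: a metric name, a comparison, and an amount with an optional "$".
_CLAUSE = re.compile(r"^\s*(\w+)\s*(>=|<=|>|<)\s*\$?(\d+(?:\.\d+)?)\s*$")


def should_kill(condition, metrics):
    """True if any OR-joined clause in condition holds for the given metrics.

    Assumed grammar (illustration only): "name op $amount OR name op $amount".
    Raises ValueError on an unparseable clause and KeyError on a metric that
    was not supplied, so a misconfigured limit fails loudly.
    """
    for clause in re.split(r"\s+OR\s+", condition):
        match = _CLAUSE.match(clause)
        if match is None:
            raise ValueError(f"cannot parse kill-condition clause: {clause!r}")
        name, op, limit = match.groups()
        if _OPS[op](metrics[name], float(limit)):
            return True
    return False


limit = "daily_loss > $10 OR cash < $20"
print(should_kill(limit, {"daily_loss": 3.0, "cash": 50.0}))   # neither clause holds
print(should_kill(limit, {"daily_loss": 12.5, "cash": 50.0}))  # loss limit breached
```

Raising on a missing metric is a deliberate choice in this sketch: for a money-moving process, a limit that cannot be evaluated should stop the engine, not pass quietly.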

The honest part: where this still falls short

Log-freshness answers "did it run." It does not answer "did it run well" — which for an intelligence engine (an LLM whose output is non-deterministic text) is the question that actually matters. Right now that's still manual review for me: the agent reliably runs, but judging whether its strategy review was good is a human gate. I haven't automated the quality check, and I'm not convinced a naive "LLM grades the LLM" loop is trustworthy enough to remove me from it yet. If you've solved non-deterministic-output health-checking for scheduled agents, I genuinely want to hear how — that's the open edge of this design.

Status

ForgeOS is pre-release (v0.1.0) and built in public — it's not on PyPI or a public repo yet; today it installs from a source checkout (pip install -e . from the project root). I'm writing this up now because the architecture (declarative engines + log-freshness health + self-heal, no k8s) is the reusable part, and it stands on its own whether or not you ever run my code.

If you're supervising long-running or scheduled agents today — systemd? a hosted orchestrator? homegrown? — I'd love to compare notes in the comments, especially on that last open problem: how do you health-check an agent whose output is non-deterministic text?