I have 142 systemd-managed agents on a single VPS: trading bots, content publishers, lead nurture, web audit, customer journey trackers. For months I had no single source of truth about who's alive and who's silently failing. Manual systemctl status doesn't scale to 142.
So I built Brain — a bash + SQLite + systemd-timer orchestrator. ~500 lines total. No Kubernetes, no Prometheus, no Grafana, no microservices framework.
Here's why bash + SQLite was the right choice over "proper" tooling.
The problem
Agents emit logs to journalctl. Some have heartbeat files. Some have their own state DBs. Some are silent until they crash. When something breaks, you find out via Telegram spam: the same "agent X failed" message, 50 times over.
Three sins of the existing setup:
- No deduplication — every loop fires the same alert as new
- No life-cycle awareness — a disabled unit looks the same as a crashed one
- No business signal correlation — "webhook handler down" matters infinitely more than "blog cron skipped one tick"
The fix: single SQLite + bash orchestrator
CREATE TABLE agents (
  name                 TEXT PRIMARY KEY,
  kind                 TEXT,     -- systemd / cron / brain_wrapper
  expected_cadence_sec INTEGER,
  last_seen_ts         INTEGER,
  last_status          TEXT,     -- ok / warn / crit / dead
  last_note            TEXT,
  consec_fails         INTEGER
);

CREATE TABLE alerts (
  agent       TEXT,
  severity    TEXT,
  hash        TEXT,     -- normalized msg for dedup
  count       INTEGER,
  escalated   INTEGER,  -- 0=new, 1=TG sent, 2=acked
  resolved_at INTEGER,
  UNIQUE(agent, hash)
);
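The UNIQUE(agent, hash) constraint lets a single statement handle both "new alert" and "seen it already". Here is a minimal sketch of how that could look with the sqlite3 CLI, assuming SQLite 3.24 or newer for ON CONFLICT. The DB path, the function names (sql_quote, init_db, record_alert, resolve_alerts), and the sample agent name are made up for illustration and are not taken from Brain's source.

```shell
#!/usr/bin/env bash
# Sketch only: DB path and all function names are assumptions.
DB="${BRAIN_DB:-/tmp/brain_demo.db}"

# Double any single quotes so a value is safe inside a SQL string literal.
sql_quote() { printf '%s' "$1" | sed "s/'/''/g"; }

init_db() {
  sqlite3 "$DB" "CREATE TABLE IF NOT EXISTS alerts (
    agent TEXT, severity TEXT, hash TEXT, count INTEGER,
    escalated INTEGER, resolved_at INTEGER, UNIQUE(agent, hash));"
}

# Insert a new alert, or bump the counter if (agent, hash) already exists.
# Prints the resulting count: 1 means first sighting.
record_alert() {
  local agent severity hash
  agent=$(sql_quote "$1"); severity=$(sql_quote "$2"); hash=$(sql_quote "$3")
  sqlite3 "$DB" "
    INSERT INTO alerts (agent, severity, hash, count, escalated, resolved_at)
    VALUES ('$agent', '$severity', '$hash', 1, 0, NULL)
    ON CONFLICT(agent, hash) DO UPDATE SET count = count + 1;
    SELECT count FROM alerts WHERE agent = '$agent' AND hash = '$hash';"
}

# Close every open alert for an agent that is back to OK.
resolve_alerts() {
  local agent
  agent=$(sql_quote "$1")
  sqlite3 "$DB" "
    UPDATE alerts SET resolved_at = CAST(strftime('%s','now') AS INTEGER)
    WHERE agent = '$agent' AND resolved_at IS NULL;"
}

# Demo, skipped when the sqlite3 CLI is not installed.
if command -v sqlite3 >/dev/null 2>&1; then
  rm -f "$DB"
  init_db
  first=$(record_alert "lead-nurture" "crit" "abc123")
  second=$(record_alert "lead-nurture" "crit" "abc123")
  echo "first=$first second=$second"
fi
```

A caller would send a Telegram message only when record_alert prints 1. This sketch does not decide what happens when a resolved alert recurs; clearing resolved_at and resetting escalated in the DO UPDATE clause would be one option.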
The hash dedup is the key. Initially I hashed the raw message — but messages contain "no activity for 4071s" where the number changes every cycle. So same problem, new hash, new TG ping. Spam.
Fix:
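# Replace volatile numbers with placeholders so the text is stable across cycles
hash_msg=$(printf '%s\n' "$note" | sed -E 's/[0-9]+s/Xs/g; s/[0-9]+ errors/N errors/g')
# Hash agent name, status, and normalized message into the dedup key
hash=$(printf '%s\n' "$name|$status|$hash_msg" | md5sum | cut -d' ' -f1)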
One TG ping per new problem. Re-firings increment count silently. Dramatic noise reduction.
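To see the effect, here is a small sketch that wraps the same two sed rules in a function and hashes three messages. The function name alert_hash and the agent name blog-cron are illustrative and not from Brain's source.

```shell
#!/usr/bin/env bash
# Sketch: same normalization rules as in the post, wrapped in a hypothetical helper.
alert_hash() {
  local name=$1 status=$2 note=$3 norm
  # Strip the parts of the message that change every cycle.
  norm=$(printf '%s\n' "$note" | sed -E 's/[0-9]+s/Xs/g; s/[0-9]+ errors/N errors/g')
  printf '%s\n' "$name|$status|$norm" | md5sum | cut -d' ' -f1
}

h1=$(alert_hash "blog-cron" "warn" "no activity for 4071s")
h2=$(alert_hash "blog-cron" "warn" "no activity for 4371s")
h3=$(alert_hash "blog-cron" "warn" "disk full")

[ "$h1" = "$h2" ]  && echo "same problem, same hash"
[ "$h1" != "$h3" ] && echo "different problem, different hash"
```

The patterns are deliberately narrow, so they have gaps: "1 error" (singular) is not matched by the second rule and would hash differently from "2 errors". Each new message shape with an embedded number probably needs its own rule.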
Why bash over Python framework
- Bash + SQLite + jq is on every Linux box. Zero install footprint.
- systemd timers > python schedulers. Free restart-on-failure. Free log rotation.
- All state is .db and .json files I can cat to read.
- Total code: 6 files, ~500 lines. Python equivalent would be 2-3x.
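For readers who have not used systemd timers as a scheduler, here is a rough sketch of the unit pair that a 5-minute cycle could use. The unit names, the /usr/local/bin/brain.sh path, and the output directory are assumptions; the post does not show its actual units.

```shell
#!/usr/bin/env bash
# Sketch: writes a hypothetical service/timer pair into a local directory.
UNIT_DIR="${UNIT_DIR:-./brain-units}"
mkdir -p "$UNIT_DIR"

# The service runs one orchestrator cycle and exits.
cat > "$UNIT_DIR/brain.service" <<'EOF'
[Unit]
Description=Brain orchestrator cycle

[Service]
Type=oneshot
ExecStart=/usr/local/bin/brain.sh
EOF

# The timer fires the service at minutes 0, 5, 10, ... of every hour.
cat > "$UNIT_DIR/brain.timer" <<'EOF'
[Unit]
Description=Run Brain every 5 minutes

[Timer]
OnCalendar=*:0/5
Persistent=true

[Install]
WantedBy=timers.target
EOF

echo "Wrote units to $UNIT_DIR. To install, copy them to /etc/systemd/system and run:"
echo "  systemctl daemon-reload && systemctl enable --now brain.timer"
```

Output from the script lands in the journal, so journalctl -u brain.service shows each cycle without any extra logging code.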
The whole orchestrator runs in <10 seconds per 5-min cycle. Across 142 agents.
Lessons
- Normalize before hashing. Same problem, same hash.
- Disabled ≠ failed. Treat a unit that systemd reports as disabled as dead, not crit.
- Auto-resolve. When agent returns to OK → close old alerts.
- Daily digest > real-time spam. Real-time only for actual CRIT.
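The "disabled ≠ failed" rule can be sketched as a small pure function. This is one possible reading, not Brain's actual logic: classify_agent is a hypothetical helper, and the staleness rule (warn once the agent has been silent longer than expected_cadence_sec) is an assumption. The first two arguments are the strings printed by systemctl is-enabled and systemctl is-active.

```shell
#!/usr/bin/env bash
# Sketch: hypothetical classifier; thresholds and mapping are assumptions.
# Args: <is-enabled output> <is-active output> <last_seen_ts> <expected_cadence_sec> <now>
classify_agent() {
  local enabled=$1 active=$2 last_seen=$3 cadence=$4 now=$5
  # Switched off on purpose: report it, but never page for it.
  case "$enabled" in
    disabled|masked) echo "dead"; return ;;
  esac
  # systemd itself says the last run failed.
  if [ "$active" = "failed" ]; then
    echo "crit"
    return
  fi
  # Otherwise judge by silence: timer-driven units are "inactive" between runs.
  if [ $(( now - last_seen )) -gt "$cadence" ]; then
    echo "warn"
  else
    echo "ok"
  fi
}

classify_agent disabled failed   0   300 1000   # prints: dead
classify_agent enabled  failed   900 300 1000   # prints: crit
classify_agent enabled  inactive 900 300 1000   # prints: ok
classify_agent enabled  inactive 100 300 1000   # prints: warn
```

In a live loop the first two arguments would come from systemctl is-enabled "$unit" and systemctl is-active "$unit", and the rest from the agents table.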
Full source: posting as I clean it up. Reach out if you want to discuss systemd-as-orchestrator patterns.
— Stas, https://guardlabs.online