Managing a Docker ecosystem manually doesn't scale. Not because the tools are bad — Docker is fine, Compose is fine — but because the cognitive load of watching six services across multiple hosts adds up fast. You miss things. A container dies at 3am. A CVE sits unactioned for two weeks because you were heads-down on something else. An SSL cert expires because the renewal check was a cron job you forgot about.
I've been building a self-hosted Docker management platform for the past few months. Last week I added something I'd been putting off: autonomous agents. Not AI wrappers around Docker commands. Actual agents that monitor, decide, and act — with two distinct levels of autonomy depending on the situation.
Here's what I learned building them.
The problem with "smart" monitoring
Most monitoring tools solve the detection problem. They tell you something is wrong. Then you still have to act.
That's fine for a team with an on-call rotation. For a solo homelab operator or a small team without 24/7 coverage, detection without action means you're still waking up at 3am — just now with a notification in your hand.
What I wanted was a system that could handle the obvious cases automatically, and only escalate to me when the situation genuinely required human judgment.
That distinction — obvious cases vs judgment calls — is where the Level 1 / Level 2 split comes from.
Level 1 — Rule-based, no AI, always on
Level 1 agents operate on pure logic. No API calls, no model inference, no external dependencies. They run on a poll interval and apply deterministic rules.
Examples of what Level 1 handles:
- A container exits unexpectedly → restart it
- An image has a new digest available → flag it for review
- A monitor has failed N consecutive checks → fire an alert
- A notification channel has been disabled → warn that delivery is impossible
- An SSL cert expires in less than X days → escalate
These cases have clear right answers. A container that died should come back up. A cert expiring in 7 days needs attention. No judgment required.
Level 1 is the baseline. It works without any API key, without any configuration beyond enabling it. The idea is that anyone who installs the platform gets meaningful automation out of the box.
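To make the "pure logic" claim concrete, here's a minimal sketch of what a few of those rules could look like as a single deterministic pass. Every name here (`CheckState`, `level1Findings`, the thresholds) is illustrative, not the platform's actual API:

```typescript
// Illustrative Level 1 rule pass: pure function, no API calls, no inference.
interface CheckState {
  consecutiveFailures: number; // failed uptime checks in a row
  channelEnabled: boolean;     // is the notification channel active?
  sslDaysLeft: number;         // days until the cert expires
}

const FAIL_THRESHOLD = 3; // "N consecutive checks" from the rule list
const SSL_WARN_DAYS = 7;  // "expires in less than X days"

function level1Findings(s: CheckState): string[] {
  const findings: string[] = [];
  if (s.consecutiveFailures >= FAIL_THRESHOLD) {
    findings.push("alert: sustained check failure");
  }
  if (!s.channelEnabled) {
    findings.push("warn: notification delivery impossible");
  }
  if (s.sslDaysLeft < SSL_WARN_DAYS) {
    findings.push("escalate: cert expiring soon");
  }
  return findings;
}
```

The point of keeping it this boring is exactly the out-of-the-box guarantee: a pure function over polled state needs no key, no network, and no tuning to be useful.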
One thing I got wrong initially: I had Level 1 acting too fast. A container that exits and restarts within Docker's own backoff window doesn't need the agent to intervene — Docker already handles it. The fix was adding a minimum dead time before the agent acts. If the container has been down for less than 60 seconds, wait. Docker might already be handling it.
That minDeadTime parameter turned out to be one of the most important tuning knobs in the whole system.
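The guard itself is a one-liner. A sketch, with hypothetical names (the real agent keys this off its poll loop):

```typescript
// Minimum dead time before the agent intervenes, so it doesn't race
// Docker's own restart backoff.
const MIN_DEAD_TIME_MS = 60_000;

function shouldRestart(exitedAtMs: number, nowMs: number): boolean {
  // If the container died less than MIN_DEAD_TIME_MS ago, Docker's
  // restart policy may still be handling it -- wait for the next poll.
  return nowMs - exitedAtMs >= MIN_DEAD_TIME_MS;
}
```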
Level 2 — Claude Haiku decides when it's ambiguous
Level 2 activates when a situation falls outside the clear rules. The trigger is usually a pattern that looks bad but might have a legitimate explanation.
The clearest example: a container that keeps crashing.
Level 1 will restart a crashed container. But what if it crashes again? And again? At some point, restarting a container in a crash loop isn't helpful — you're just burning resources and masking a real problem. But stopping it entirely might break something that depends on it.
This is a judgment call. The right answer depends on context: what's in the logs, how many times it's restarted in the last hour, what the exit codes look like, whether this is a critical service or a background worker.
Level 2 passes that context to Claude Haiku and asks for a decision: restart, escalate, or force-stop.
The prompt includes:
- Container name and image
- Restart policy
- Exit code history
- Last 20 lines of logs (capped at 1500 chars)
- Current restart count within the window
Claude returns a JSON decision with an action and a reason. That reason gets surfaced in the UI so the operator can see why the agent acted the way it did.
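Here's a hedged sketch of that round-trip. `callModel` stands in for whatever client actually talks to Claude Haiku, and every name is illustrative; the real call would be async, but the shape is the same:

```typescript
// Illustrative context object mirroring the prompt contents listed above.
interface CrashContext {
  name: string;
  image: string;
  restartPolicy: string;
  exitCodes: number[];
  logs: string;         // last 20 lines, capped at 1500 chars
  restartCount: number; // restarts within the current window
}

type Decision = { action: "restart" | "escalate" | "force-stop"; reason: string };

function buildPrompt(ctx: CrashContext): string {
  return [
    `Container ${ctx.name} (${ctx.image}) appears to be crash-looping.`,
    `Restart policy: ${ctx.restartPolicy}`,
    `Recent exit codes: ${ctx.exitCodes.join(", ")}`,
    `Restarts this window: ${ctx.restartCount}`,
    `Logs:\n${ctx.logs.slice(-1500)}`,
    `Reply with JSON: {"action": "restart" | "escalate" | "force-stop", "reason": "..."}`,
  ].join("\n");
}

// Synchronous here for brevity; a real implementation would await the API.
function decide(ctx: CrashContext, callModel: (prompt: string) => string): Decision {
  try {
    const parsed = JSON.parse(callModel(buildPrompt(ctx))) as Decision;
    if (!["restart", "escalate", "force-stop"].includes(parsed.action)) {
      throw new Error(`unknown action: ${parsed.action}`);
    }
    return parsed;
  } catch {
    // Conservative failure mode: any API error or unparseable reply
    // becomes an escalation rather than a retry or a no-op.
    return { action: "escalate", reason: "model call failed or returned invalid JSON" };
  }
}
```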
A few things I got right here:
Fallback to escalate on any error. If the API call fails, times out, or returns something unparseable, the agent escalates rather than retrying or doing nothing. Conservative failure mode.
Guard against decision spam. The agent tracks its Claude decisions per container. If it already asked Claude about this container at restart count N, it won't ask again until the count changes. Without this, a stuck container would generate an API call every 30 seconds.
The AI badge in the feed. When Claude makes a decision, the action in the feed shows an AI badge and the model's reasoning inline. Operators can see "Claude decided to escalate this because the logs show a repeated OOM pattern" rather than just "escalated."
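The spam guard from above fits in a few lines. A sketch with hypothetical names, keyed exactly the way the text describes (container name, restart count at last consultation):

```typescript
// Per-container dedupe: remember the restart count at which we last
// consulted the model, and only ask again once that count changes.
const lastAskedAt = new Map<string, number>();

function shouldAskModel(container: string, restartCount: number): boolean {
  if (lastAskedAt.get(container) === restartCount) {
    return false; // already asked at this count -- a stuck container stays quiet
  }
  lastAskedAt.set(container, restartCount);
  return true;
}
```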
The demo that made it real
The moment the system clicked for me was this test:
`docker stop my-service`

Within 90 seconds:

- The container agent detected the exit (exit code 137, SIGKILL from `docker stop`)
- Waited for `minDeadTime`
- Restarted the container automatically
- Logged the action: `Auto-restarted · exitCode=137 · policy=unless-stopped · restart #1`
- The orchestrator detected the service coming back online
No intervention. No notification. It just handled it.
Exit code 137 is `128 + 9`, i.e. SIGKILL. The agent knows this isn't a crash; it's an external stop. In production you'd probably exclude manually-stopped containers from auto-restart. That's what the exclusion list is for.
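The signal arithmetic is easy to encode. A tiny helper, assuming the usual 128-plus-signal shell convention that Docker follows (the helper name is mine):

```typescript
// Exit codes above 128 mean "terminated by signal (code - 128)".
const SIGNAL_NAMES: Record<number, string> = { 9: "SIGKILL", 15: "SIGTERM" };

function describeExit(code: number): string {
  if (code > 128) {
    const sig = code - 128;
    return `killed by signal ${sig} (${SIGNAL_NAMES[sig] ?? "unknown"})`;
  }
  return code === 0 ? "clean exit" : `application error (code ${code})`;
}
```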
What I built across the ecosystem
Six agents in total, one per tool:
Container Agent — watches running containers, restarts crashed ones, escalates crash loops to Level 2.
Update Agent — monitors image digests for changes, evaluates whether updates are safe to apply automatically or should be deferred for review.
Monitor Agent — analyzes uptime check results, detects three patterns: sustained failure, flapping (up-down-up-down), and SSL expiry. Level 2 handles ambiguous sustained failures.
Scan Agent — reads CVE scan results and evaluates severity. Critical vulnerabilities go to Level 2 — the model assesses whether the CVE is likely exploitable in the specific configuration. High vulnerabilities alert directly without AI.
Notify Agent — watches the notification system itself. Catches misconfiguration (disabled channels, no active rules) and anomalies (delivery failure spikes, notification volume spikes).
Master Agent — sits at the orchestrator level with visibility across the whole ecosystem. Detects cross-service patterns: ecosystem-wide degradation, event storms, correlated critical events across multiple services.
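Of the Monitor Agent's three patterns, flapping is the one worth sketching, since it's pure counting. A minimal, illustrative detector over a recent window of up/down check results (the threshold and names are assumptions, not the agent's real parameters):

```typescript
// Flapping = too many up/down state transitions in the recent window.
// `history` is the window of check results, oldest first (true = up).
function isFlapping(history: boolean[], minTransitions = 4): boolean {
  let transitions = 0;
  for (let i = 1; i < history.length; i++) {
    if (history[i] !== history[i - 1]) transitions++;
  }
  return transitions >= minTransitions;
}
```

Sustained failure is the degenerate case (many downs, zero transitions), which is why the two patterns need separate rules.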
What Level 2 is not
It's not a replacement for alerting. The agents escalate to the notification system when they decide a human needs to know. Level 2 makes that escalation smarter — it filters out the cases that don't warrant waking someone up.
It's not always right. The model can make bad calls, especially with ambiguous logs. That's why every Level 2 decision is visible in the feed with the reasoning attached. You can always override, and you can always clear a decision and let the agent re-evaluate.
It's not required. The whole system works without an API key. Level 1 covers the clear cases. Level 2 is additive.
The thing I didn't expect
Building this forced me to think carefully about what "autonomous" actually means for infrastructure tooling.
Full autonomy — an agent that can do anything without asking — is the wrong target. The right target is calibrated autonomy: the agent acts confidently on clear cases, hesitates on ambiguous ones, and always leaves a paper trail explaining what it did and why.
The Level 1 / Level 2 split is really just a formalization of that. Some situations have obvious answers. Some don't. The system should know the difference.
I'm building this as an open-source, self-hosted platform. If you're interested in following the progress, drop a comment.