QuoLu

Posted on May 29

Delegating Full Server Management to AI

#claudecode #podman

Introduction

In my previous article, I talked about how easy it was to let an AI work on my server directly over SSH. I had it create deployment scripts, so that everything from building to updating containers runs in one go.

Since then, I have turned my private bot into a SaaS, which also runs in production on my home server.

Once you reach this point, there is only one thing left to think about: Why not leave the operations to the AI as well?

If there is a problem, just have the AI fix it

AI is good at coding. It has knowledge of security. It can read server configurations.

So, when a problem occurs on the server, I should just let the AI investigate, fix it, and even deploy the solution. There is no need for a human to be woken up in the middle of the night to read logs.

There is another important reason.

If I provide instructions based on my own codebase, the AI will only act within the limits of my own knowledge. How should security be configured? Are there issues with the container structure? Honestly, Opus knows more than I do. Therefore, instead of giving detailed instructions, I decided to leave it to the autonomous judgment of Opus.

The Mechanism

I built a monitoring app with Electron that lives in the task tray.

Daytime — Lightweight Monitoring

Running an AI continuously consumes my MAX plan quota. So, I don't use AI during the day.

Instead, a monitoring script checks the server status via SSH every 60 seconds. It checks whether the containers are running, whether HTTP requests get a response, and whether the database accepts connections. If it detects an anomaly, it first tries restarting the container, and only if that fails does it trigger the AI.
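The check, restart, escalate flow above can be sketched roughly as follows. This is an illustration only, not the actual script (which the AI generates). The names `run_cycle`, `Outcome`, `restart`, and `trigger_ai` are hypothetical, and the individual checks are passed in as plain functions so the sketch makes no assumptions about the SSH or Podman commands behind them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A check returns True when its target looks healthy (hypothetical type alias).
Check = Callable[[], bool]


@dataclass
class Outcome:
    failed: List[str]   # checks that failed on the first pass
    restarted: bool     # whether a container restart was attempted
    escalated: bool     # whether the AI was triggered


def run_cycle(
    checks: Dict[str, Check],
    restart: Callable[[], None],
    trigger_ai: Callable[[List[str]], None],
) -> Outcome:
    """One monitoring pass: check, restart on failure, escalate if still failing."""
    failed = [name for name, check in checks.items() if not check()]
    if not failed:
        return Outcome(failed=[], restarted=False, escalated=False)

    # First response: restart the container without involving the AI.
    restart()

    # Re-run only the checks that failed to see whether the restart helped.
    still_failing = [name for name in failed if not checks[name]()]
    if still_failing:
        # The restart did not help, so hand the failing checks to the AI.
        trigger_ai(still_failing)
        return Outcome(failed=failed, restarted=True, escalated=True)

    return Outcome(failed=failed, restarted=True, escalated=False)
```

In a real setup, `run_cycle` would be called in a loop with a 60-second sleep between passes, and each check would wrap whatever command is run over SSH. Keeping the AI behind the restart step is what keeps daytime quota usage at zero in the normal case.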

Late Night — Full AI Patrol

Every night at 4:00 AM, the AI patrols the entire server: security settings, resource usage, and container configuration. Looking at the server through the eyes of the AI uncovers issues that the daytime monitoring script cannot catch.

Why late at night? The MAX plan usage recovers over time. If I use it at night, it will be recovered by the time I wake up, and if I don't use it, that quota is just wasted. There's no reason not to make effective use of it.
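Scheduling the nightly patrol comes down to working out how long to wait until the next 4:00 AM. A minimal sketch, assuming local time and naive `datetime` values; `PATROL_AT`, `next_patrol`, and `seconds_until_patrol` are hypothetical names, and the real app may well schedule this differently inside Electron.

```python
from datetime import datetime, time, timedelta

# 4:00 AM local time, the patrol time mentioned in the article.
PATROL_AT = time(hour=4, minute=0)


def next_patrol(now: datetime) -> datetime:
    """Return the next 4:00 AM strictly after `now`."""
    candidate = datetime.combine(now.date(), PATROL_AT)
    if candidate <= now:
        # Today's slot has already passed (or is exactly now): use tomorrow's.
        candidate += timedelta(days=1)
    return candidate


def seconds_until_patrol(now: datetime) -> float:
    """How long a scheduler should wait before starting the patrol."""
    return (next_patrol(now) - now).total_seconds()
```

A scheduler would sleep for `seconds_until_patrol(datetime.now())`, run the patrol, and then recompute, so a patrol that runs long never causes a second run on the same night.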

The AI also writes the monitoring script

What is interesting is that I even have the AI write the monitoring script itself. During its late-night patrol, it looks at the server configuration and decides what needs to be monitored, then generates and updates the script itself.

Instead of me specifying "monitor this port," the AI decides "this container needs this check." As I wrote earlier, I don't want to limit it to my own knowledge.

3-Layer Agent Structure

When an anomaly is detected and the AI kicks in, it is not one single AI doing everything. I have divided the roles into three layers.

Parent Agent: Detects symptoms and identifies which app has the problem. What is important here is that the parent only passes on the symptoms, meaning plain facts like "the container went down" or "HTTP returned 500." The parent does not judge why it went down or how to fix it.

Child Agent: Starts up in the relevant app's project, investigates the cause, fixes the code, tests it, and deploys it. Identifying the cause from the symptoms is the child's job.

Grandchild Agent: Audits the fix strategy proposed by the child. It checks whether the fix will cause other problems before the fix is actually executed.

Why doesn't the parent diagnose it? Because the child is more familiar with that project. Since the child starts in the relevant app's project folder, it can read all the CLAUDE.md files and code. The parent only sees the server as a whole and does not know the internal structure of individual apps. If the parent tries to guess the cause, the child will be biased by that speculation. That is why I pass on only the symptoms and let the one on the ground make the judgment.
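The division of roles can be sketched like this. It is a simplified illustration, not the actual implementation: `Symptom`, `parent_prompt`, and `handle_incident` are hypothetical names, and the three agent steps are passed in as plain functions. In practice each one would presumably wrap an AI session started in the app's project folder.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Symptom:
    app: str    # which app is affected
    fact: str   # an observed fact, e.g. "HTTP returned 500"


def parent_prompt(symptoms: List[Symptom]) -> str:
    """Parent: build the child's brief from observed facts only.

    Deliberately contains no guess about the cause and no suggested fix.
    """
    facts = "\n".join(f"- {s.fact}" for s in symptoms)
    return (
        "Observed symptoms:\n"
        f"{facts}\n"
        "Investigate the cause in this project and propose a fix."
    )


def handle_incident(
    symptoms: List[Symptom],
    project_dir: str,
    child_plan: Callable[[str, str], str],
    grandchild_audit: Callable[[str, str], bool],
    child_apply: Callable[[str, str], None],
) -> bool:
    """Run one incident through the three layers; True if a fix was applied."""
    # Child: diagnoses from inside the app's project and proposes a fix strategy.
    plan = child_plan(parent_prompt(symptoms), project_dir)

    # Grandchild: audits the strategy before anything is executed.
    if not grandchild_audit(plan, project_dir):
        return False

    # Child: fixes, tests, and deploys according to the approved plan.
    child_apply(plan, project_dir)
    return True
```

The point of `parent_prompt` is what it leaves out: because the brief carries only facts, the child's diagnosis is not steered by anything the parent might have guessed.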

What actually happened

Recently, the monitoring script detected an error. The AI was dispatched and began investigating.

Result: It was a bug in the monitoring script itself.

The AI found an error in a monitoring script that it had written itself, and then fixed it on its own. I did nothing. I just got a notification on Discord saying "Fixed," and that was it.

It sounds like a joke, but it is also proof that the mechanism is working properly. There is no need to write a perfect script from the beginning. If a problem occurs, it gets fixed, and that loop runs automatically.

Things I can do because it is my own server

Some people might call this "irresponsible" when they hear a story like this. From the perspective of someone who works in security professionally, leaving everything to an AI's judgment might seem absurd.

But I am an individual developer. I am running my own services on my own server. Honestly, Opus knows more about security and server management than I do. I believe that relying on someone who is more knowledgeable than myself is not irresponsible but rational.

Of course, I have no intention of distributing this to others. Since it is my own server, I can do whatever I want, but it would be a disaster if I broke someone else's server. Because I consider it "for personal use," I can leave it entirely to the AI.

Conclusion

It started with letting the AI touch the server via SSH, then letting it handle deployment, then turning the bot into a SaaS, and now I am leaving operations to it as well.

I avoid binding it with my own knowledge as much as possible and leave it to the AI's judgment. By doing so, the AI finds problems I didn't know about and applies fixes I couldn't have written myself. If I had written the monitoring script myself, I wouldn't have noticed that bug.

It is tough for individual developers to manage servers. But now, the AI patrols while I am asleep in the middle of the night. When I wake up in the morning, a message arrives on Discord saying "No anomalies." That alone helps me sleep peacefully.

Things go better when I leave the work to the AI instead of writing detailed instructions. I am gradually settling into this kind of relationship with it.

DEV Community