DEV Community

QuoLu

What Happened in 3 Days of Letting AI Manage My Server

Introduction

In my previous article, I wrote about delegating the entire management of my home server to an AI. The system is designed so that the AI performs patrols late at night, and during the day, it springs into action if monitoring scripts detect any anomalies.

I explained how I built the system. Now, I will write about what actually happened after putting it into operation for three days.


The Monitoring Script Broke Itself

On the morning of the third day of operation, the monitoring script detected an anomaly: the license_api_prod container was not responding.

The AI went into action and began its investigation. It connected to the server via SSH and checked the status of the container. The result: the container was running perfectly fine.

It was a false positive. Furthermore, two minutes later, the same false positive occurred for ddnser.

The Cause: Too Many Open SSH Connections

The cause identified by the AI was as follows.

The monitoring script checks the server status every 60 seconds. It opens three connections for system resources (disk, memory, swap) and up to nine for health checks across the seven containers. Because all of them were launched in parallel, up to 12 SSH connections could be open simultaneously.

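OpenSSH has a setting called MaxStartups, which limits the number of concurrent unauthenticated connections, that is, connections that have been opened but have not finished logging in yet. The default is 10:30:100: once 10 such connections are pending, sshd starts refusing new ones at random with a 30% probability, and that probability rises to 100% at 100 pending connections. The monitoring script was crossing the threshold of 10, so some of its connections were rejected. In other words, the monitoring script itself was overloading the SSH daemon and causing its own connections to fail.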
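
To make the effect of MaxStartups concrete, here is a rough sketch of the "random early drop" rule as described in sshd_config(5), using the documented default of 10:30:100. The function name `dropProbability` is my own, and the real sshd uses integer arithmetic, so treat the numbers as an approximation rather than the exact implementation.

```typescript
// Approximates sshd's refusal probability for MaxStartups "start:rate:full".
// Defaults mirror the documented OpenSSH default of 10:30:100.
function dropProbability(
  unauthenticated: number,
  start: number = 10,
  rate: number = 30,
  full: number = 100,
): number {
  if (unauthenticated < start) return 0; // below the threshold: never refused
  if (unauthenticated >= full) return 1; // at or above "full": always refused
  // Between start and full, the refusal chance rises linearly from rate% to 100%.
  const progress = (unauthenticated - start) / (full - start);
  return (rate + (100 - rate) * progress) / 100;
}

// Print the refusal probability for a few pending-connection counts.
for (const pending of [5, 10, 12, 100]) {
  console.log(pending, dropProbability(pending).toFixed(3));
}
```

Under these assumptions, anything below 10 pending connections is always accepted, while a burst of 10 or more has roughly a one-in-three chance of losing each new connection, which matches the intermittent nature of the false positives.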

1st Fix: From Parallel to Sequential

The AI changed the execution method for health checks from full parallelization using Promise.allSettled() to sequential execution using for...of. Now, only one SSH connection is open at a time.
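
The difference between the two approaches can be sketched as follows. This is a simulation, not the code the AI actually wrote: `makeFakeCheck` is a hypothetical stand-in for an SSH health check that simply holds a "connection" open for a few milliseconds, and `ConnectionCounter` records how many are open at once.

```typescript
type Check = () => Promise<void>;

// Tracks how many simulated connections are open at the same moment.
class ConnectionCounter {
  open = 0;
  peak = 0;
  enter(): void {
    this.open += 1;
    this.peak = Math.max(this.peak, this.open);
  }
  leave(): void {
    this.open -= 1;
  }
}

// Hypothetical stand-in for one SSH health check: holds a "connection" for durationMs.
function makeFakeCheck(counter: ConnectionCounter, durationMs: number): Check {
  return async () => {
    counter.enter();
    try {
      await new Promise<void>((resolve) => setTimeout(resolve, durationMs));
    } finally {
      counter.leave();
    }
  };
}

// Before: every check starts immediately, so N checks mean N simultaneous connections.
async function runParallel(checks: Check[]): Promise<void> {
  await Promise.allSettled(checks.map((check) => check()));
}

// After: each check finishes before the next begins, so at most one connection is open.
async function runSequential(checks: Check[]): Promise<void> {
  for (const check of checks) {
    try {
      await check();
    } catch {
      // Like allSettled, one failed check should not stop the remaining ones.
    }
  }
}

// Runs `count` fake checks with the given strategy and reports the peak concurrency.
async function peakConnections(
  run: (checks: Check[]) => Promise<void>,
  count: number,
): Promise<number> {
  const counter = new ConnectionCounter();
  const checks = Array.from({ length: count }, () => makeFakeCheck(counter, 5));
  await run(checks);
  return counter.peak;
}

async function demo(): Promise<void> {
  console.log("parallel peak:", await peakConnections(runParallel, 12));
  console.log("sequential peak:", await peakConnections(runSequential, 12));
}
demo();
```

The trade-off is that a sequential pass takes longer to complete, which is acceptable here because the whole round only has to finish within the 60-second check interval.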

2nd Fix: Adding Retries

Even when run sequentially, temporary SSH disconnections can occur. A connection might drop once due to server load or a momentary network glitch.

Following the second false positive two minutes later, the AI added a helper function to identify "SSH transport errors." It detects patterns such as Connection closed, Connection refused, and ETIMEDOUT, waits for three seconds, and then retries once. It only reports an anomaly if the connection fails again after the retry.
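
A helper along those lines might look like the sketch below. The names `isSshTransportError` and `withSingleRetry` are illustrative, not taken from the actual script; the three patterns and the three-second wait are the ones described above.

```typescript
// Patterns named in this article; a real helper may need to match more.
const TRANSPORT_ERROR_PATTERNS: RegExp[] = [
  /Connection closed/i,
  /Connection refused/i,
  /ETIMEDOUT/,
];

// Returns true when the error looks like an SSH transport problem rather than a failed check.
function isSshTransportError(error: unknown): boolean {
  const message = error instanceof Error ? error.message : String(error);
  return TRANSPORT_ERROR_PATTERNS.some((pattern) => pattern.test(message));
}

function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

// Runs `check`; on a transport error, waits `delayMs` and retries exactly once.
async function withSingleRetry<T>(
  check: () => Promise<T>,
  delayMs: number = 3000,
): Promise<T> {
  try {
    return await check();
  } catch (error) {
    // Anything that is not a transport error is a real failure: surface it immediately.
    if (!isSshTransportError(error)) throw error;
    await sleep(delayMs);
    // A second failure propagates to the caller, which then reports the anomaly.
    return check();
  }
}
```

Limiting the retry to transport errors matters: if a container is genuinely unhealthy, the check still fails on the first attempt and the alert is not delayed.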

The fix was two-pronged: reducing the number of SSH connections by moving from parallel to sequential, and handling temporary disconnections with retries. Both were done without human intervention. I just received a notification on Discord saying "Fixed," and that was it.


What the Late-Night Patrols Found

Every day at 4:00 AM, the AI patrols the entire server. It checks security settings, resource usage, container configurations, and log contents. These are the things a human would not normally look at every day.

Nextcloud Logs at 21GB

During the patrol on the second day of operation, the AI noticed an anomaly in the Nextcloud log file.

/var/mnt/nextcloud_data/nextcloud.log: 21.3GB

The log file had ballooned to 21GB. Because it was on NFS, there was no disk space crisis, but log rotation was not functioning. I would never have noticed this on my own.

1,241 SELinux Denial Logs

There was another issue. SELinux was denying auction-bot's lock access to auction.db about once a minute: 1,241 denials in the past 24 hours.

SELinux is running in Permissive mode, so the access was logged but not actually blocked, and the application ran normally. However, every time a denial occurred, a daemon called setroubleshoot started up to analyze it, temporarily consuming 22.9% of CPU.

There was no actual damage, but it was wasting resources. I don't think I would have ever found this if the AI hadn't flagged it.

Daily Fixed-Point Observation

The patrol surveys the entire server every day. The AI decides for itself what to look at, and as the daily reports accumulate, trends emerge over time.

Swap Usage Trends

| Date | Swap Usage | Note |
|------|------------|------|
| 4/2 | 89% | |
| 4/3 | 92% | Slight upward trend |
| 4/4 | 62% | Reset by server reboot |

The AI tracked the trend of swap usage building up day by day. The server was rebooted and reset on 4/4, but swap might become critical during long periods of operation without a reboot. The AI writes "Continued monitoring required" in its report every time.
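
As a small illustration of this kind of trend tracking, the sketch below computes the day-over-day change from the three swap readings in the table above. The `needsAttention` rule and its 85% threshold are hypothetical values chosen for the example, not something the AI's reports actually use.

```typescript
interface SwapSample {
  date: string;
  usagePercent: number;
}

// Readings copied from the swap usage table in this article.
const samples: SwapSample[] = [
  { date: "4/2", usagePercent: 89 },
  { date: "4/3", usagePercent: 92 },
  { date: "4/4", usagePercent: 62 },
];

// Day-over-day change in percentage points.
function dailyDeltas(data: SwapSample[]): number[] {
  const deltas: number[] = [];
  let previous: number | undefined;
  for (const sample of data) {
    if (previous !== undefined) deltas.push(sample.usagePercent - previous);
    previous = sample.usagePercent;
  }
  return deltas;
}

// Hypothetical rule: flag when the latest reading is above the threshold
// and has not fallen since the previous day.
function needsAttention(data: SwapSample[], thresholdPercent: number = 85): boolean {
  const [previous, latest] = data.slice(-2);
  if (previous === undefined || latest === undefined) return false;
  return (
    latest.usagePercent >= thresholdPercent &&
    latest.usagePercent >= previous.usagePercent
  );
}

console.log(dailyDeltas(samples));
console.log(needsAttention(samples.slice(0, 2)), needsAttention(samples));
```

With these readings, the rule would have flagged the server on 4/3 and cleared after the reboot on 4/4.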

fail2ban BAN Trends

| Date | Currently Banned | Total Banned |
|------|------------------|--------------|
| 4/2 | 14 IPs | 235 |
| 4/3 | 10 IPs → 14 IPs | 242 → 293 |
| 4/4 | 8 IPs | |

Brute-force attacks against SSH occur daily, with connection attempts using common usernames such as admin, ubuntu, and mysql. fail2ban quietly bans the offending IPs. Password authentication is disabled and only key authentication is allowed, so these attempts cannot succeed by guessing passwords, but I still want to know that the attacks are happening. The AI includes these figures in its report every day.


Conclusion

Looking back at what happened over the three days, most of the issues the AI dealt with were things I could not have noticed myself.

Hitting the limit for SSH connections. The Nextcloud log ballooning to 21GB. SELinux denial logs piling up every minute. Swap becoming critical day by day. None of these would be visible unless a human specifically opened the logs to check.

While I am asleep, the AI watches the server, and when I wake up, the results are waiting on Discord. It fixes problems if it can, and reports them if it can't. After running it for three days, I feel this system works even better than I expected.
