I Paid the Bill for AI-Written Code Months Later

#ai #performance

I've seen many systems and made many mistakes in my career. But a bill I paid recently came, in my opinion, from trusting AI-written code too much. A seemingly minor optimization in a manufacturing ERP led to a hidden performance degradation that lasted for months.

Paying the bill for AI-written code was much more painful than I expected. Initially, everything seemed fine; in fact, the problem appeared to be solved. But the real cost hit the system where I least expected it. This story might not have shaken my belief in the power of AI, but it reminded me once again that I must always remain pragmatic.

The Initial Spark: AI-Written Code and the Illusion of a Quick Fix

While I was developing the AI-powered production planning module of a manufacturing ERP, certain background tasks needed optimization. Periodic cleanup of specific tables and index maintenance in the database were critical. To make these processes more robust, I decided to use a systemd timer. It was a simple task, but under time pressure I thought, "why not ask AI?"

AI suggested a systemd unit and timer combination that seemed quite logical to me. It ran a Python script with a simple ExecStart command, and the script, in turn, triggered pg_repack and VACUUM ANALYZE operations at regular intervals. The script ran its own waiting loop with a call like time.sleep(360), which seemed flexible and controllable to me at the time. "Why restart from scratch and consume resources every time?" I thought.

ℹ️ AI-Assisted systemd Timer Approach

The approach suggested by AI, which manages periodic tasks using time.sleep() within a Python script, might seem appealing initially but can lead to unexpected problems in the long run. systemd's own scheduling mechanisms are generally more reliable.

The Hidden Danger: Resource Leakage and OOM-Killed Processes

Everything worked fine for months, or so it seemed, until one Tuesday morning at 04:17, when I started seeing strange OOM-killed messages in the journald logs. They were sporadic at first, then more and more frequent. The database connection pool on the primary PostgreSQL instance showed momentary slowdowns, then timeouts. My first suspicion was PostgreSQL's own settings: was work_mem insufficient, or shared_buffers?

I checked the cgroup limits; there was no abnormality. The OOM-killed messages in journald always pointed to my "smart" AI-written script. The script stayed resident the whole time, blocking on heavy operations like pg_repack and VACUUM ANALYZE and then sleeping between runs, and its memory usage gradually increased along the way. Although the Python script looked simple, either the memory behavior of some libraries or the load created by external processes such as pg_repack caused the long-running script to leak resources.

# A simplified example suggested by AI
import time
import subprocess

def run_db_maintenance():
    # ... database maintenance commands ...
    subprocess.run(["pg_repack", "-d", "my_db"])
    # --analyze runs VACUUM with ANALYZE; --analyze-in-stages would only update statistics
    subprocess.run(["vacuumdb", "--all", "--analyze"])

if __name__ == "__main__":
    while True:
        run_db_maintenance()
        time.sleep(360) # Intended to run every 6 minutes

This sleep loop kept the script resident in memory, and it leaked a small amount of memory with each cycle. The slowly accumulating memory consumed system resources until the kernel's OOM killer eventually terminated the process. Each kill, in turn, caused the script to restart and begin the loop from the very beginning, so the time.sleep(360) served no purpose and the system kept triggering heavy operations like pg_repack.
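One way to make this kind of slow growth visible earlier is to log the process's peak memory at the end of every cycle. The sketch below is a minimal illustration, not something the original script did. The helper names peak_rss_kib and log_cycle are hypothetical, and it assumes Linux, where ru_maxrss is reported in kilobytes.

```python
import resource


def peak_rss_kib() -> tuple[int, int]:
    """Return (own, children) peak resident set size.

    Assumption: Linux, where ru_maxrss is expressed in kilobytes.
    RUSAGE_CHILDREN only covers child processes that have already been
    waited for, which is the case after subprocess.run() returns.
    """
    own = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    children = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return own, children


def log_cycle(cycle: int) -> str:
    """Print one line per maintenance cycle and return it."""
    own, children = peak_rss_kib()
    line = f"cycle={cycle} peak_rss_self_kib={own} peak_rss_children_kib={children}"
    print(line, flush=True)
    return line
```

Calling log_cycle(n) after each pass of the loop would show whether the peak of the Python process itself keeps climbing from one cycle to the next, or whether the heavy memory use sits in the child processes instead. When the script runs under systemd with default settings, the printed lines land in the journal.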

Paying the Bill: Performance Degradation and the Solution

It took me 3 days just to detect the OOM-killed processes. The real problem was that this script was constantly triggering CPU- and I/O-intensive operations like pg_repack. WAL volume on the PostgreSQL server ballooned, and even the read replicas had trouble staying in sync. Critical ERP reports, especially shipment and inventory reports, started taking 30% longer than usual to complete. Production operators also experienced occasional freezes on their screens.

The real bill, however, was the indirect losses caused by this performance degradation: delayed reports, slowed operations, and most importantly, the time my team spent trying to understand the problem. A week was spent monitoring overall system performance, scanning pg_stat_activity, analyzing iostat outputs, and meticulously sifting through journald. Finally, when I stopped that "smart" AI script with a simple systemctl stop command, the system began to breathe.
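Part of the journald sifting described above could be automated. The sketch below tallies OOM kills per process name from kernel log lines. It is a hypothetical helper, not a tool used during the incident, and it assumes the kernel's usual "Out of memory: Killed process <pid> (<name>)" wording, which can differ between kernel versions.

```python
import re
from collections import Counter

# Assumed kernel wording: "Out of memory: Killed process <pid> (<name>) ..."
# or "Memory cgroup out of memory: ..." when a cgroup limit was hit.
OOM_LINE = re.compile(
    r"(?:Memory cgroup out of memory|Out of memory): "
    r"Killed process (\d+) \(([^)]+)\)"
)


def count_oom_kills(lines) -> Counter:
    """Count OOM kills per process name in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = OOM_LINE.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts
```

Fed with the output of journalctl -k (for example by passing sys.stdin to count_oom_kills), a tally dominated by a single process name would point at the culprit much faster than reading the journal by eye.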

⚠️ Overlooked Details in AI-Generated Code

Even if the general structure of AI-generated code seems correct, critical details regarding operating system-level resource management, process lifecycles, and long-running scenarios can be overlooked. This can lead to issues like memory leaks and excessive CPU/I/O usage, especially in bare-metal or container environments.

The solution was simple: I abandoned the sleep loop suggested by AI. Instead, I let the systemd timer do the scheduling directly, with OnCalendar or with OnBootSec + OnUnitActiveSec. I restructured the script to perform a single task and then exit each time it ran. This ensured that the script started from scratch each time and released all of its resources when it finished.
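A minimal sketch of what such a one-shot script could look like is shown below. It is not the production script: the unit file names, paths, and schedule in the docstring are assumptions for illustration, my_db is a placeholder database name, and run_once is a hypothetical helper.

```python
"""One-shot database maintenance, meant to be started by a systemd timer.

Hypothetical unit files (names, paths and schedule are assumptions):

    # /etc/systemd/system/db-maintenance.service
    [Service]
    Type=oneshot
    ExecStart=/usr/bin/python3 /opt/erp/db_maintenance.py

    # /etc/systemd/system/db-maintenance.timer
    [Timer]
    OnCalendar=*-*-* 03:00:00
    Persistent=true
    # Alternative: OnBootSec=10min together with OnUnitActiveSec=6h

    [Install]
    WantedBy=timers.target
"""
import subprocess
import sys

# Placeholder commands; "my_db" is not a real database name.
MAINTENANCE_COMMANDS = [
    ["pg_repack", "-d", "my_db"],
    ["vacuumdb", "--all", "--analyze"],
]


def run_once(commands, runner=subprocess.run) -> int:
    """Run each command once, in order, and stop at the first failure.

    Returns the exit code to hand back to systemd: 0 on success,
    otherwise the return code of the command that failed.
    """
    for cmd in commands:
        result = runner(cmd)
        if result.returncode != 0:
            print(f"{cmd[0]} exited with {result.returncode}", file=sys.stderr)
            return result.returncode
    return 0


# As a script, the entry point would be:
#     if __name__ == "__main__":
#         sys.exit(run_once(MAINTENANCE_COMMANDS))
```

There is no loop and no sleep: the process exits after one pass, so any memory it accumulated is released, and a non-zero exit code lets systemd record the run as failed instead of silently retrying.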

Lessons Learned and a Look Towards the Future

This incident did not diminish my admiration for AI's code generation capabilities, but it clarified my stance on it: AI is just an assistant tool; the final control must always be mine. Especially for system-level tasks, using the operating system's native mechanisms (e.g., systemd's own scheduler) is much more reliable than setting up a sleep loop within a Python script.

I paid the bill for AI-written code, yes. But it taught me invaluable lessons. Although AI offers quick solutions, blindly trusting it without understanding the underlying mechanisms and long-term effects can lead to costly consequences. Seeing that I could fall into this trap even after 20 years of field experience shows that everyone needs to be careful.

So, have you had a similar experience with AI-written code? What was the most costly mistake you encountered, and what was your most important lesson from that process? I eagerly await your comments.