20 Lessons I Learned in Server Management

#devops #observability

Server management, in my opinion, is more like people management than software development. After twenty years, I realized that my biggest mistakes weren't triggered by technical shortcomings, but by my habit of saying 'yes' and the simple assumptions I overlooked. Every problem I encountered in this process turned into a lesson that moved me forward, sometimes at a high cost.

Instead of listing these 20 lessons one by one, I have grouped them under the main themes that emerged from my experience, because what matters is not just the technical knowledge but also the mindset and problem-solving approach behind it. Let's take a look at these lessons together.

The Danger of Assumptions and Overlooked Details

In the server world, saying "I think it works" is a polite way of saying "it will definitely break." In the early years of my career, while doing network segmentation for a client project, I assumed that VLAN tagging would work as expected at every point. The result? A critical system became inaccessible for two days because the default VLAN on a port of an old switch in between was different. This simple oversight cost me dearly.

⚠️ Don't Assume, Verify

When working on a complex part of a system, avoid assuming even the simplest details. Always verify every step and, if possible, control it with automated tests. Especially at the network layer, meticulously checking the configuration of each device prevents major problems later on.
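As an illustration of what "verify every step" could look like for the VLAN case, here is a minimal sketch. It assumes the port settings of every switch along the path have already been exported into plain records; the PortConfig type, its field names, and the sample switch names are hypothetical, not the output of any particular vendor tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortConfig:
    """Hypothetical snapshot of one switch port, e.g. parsed from a config export."""
    switch: str
    port: str
    default_vlan: int
    tagged_vlans: frozenset


def find_vlan_mismatches(ports, expected_default, required_tagged):
    """Return one readable problem line for every deviation from the plan."""
    problems = []
    for p in ports:
        if p.default_vlan != expected_default:
            problems.append(
                f"{p.switch}/{p.port}: default VLAN {p.default_vlan}, "
                f"expected {expected_default}"
            )
        missing = sorted(set(required_tagged) - set(p.tagged_vlans))
        if missing:
            problems.append(f"{p.switch}/{p.port}: missing tagged VLANs {missing}")
    return problems


if __name__ == "__main__":
    # Made-up sample data: the second port is the odd one out.
    path = [
        PortConfig("core-sw", "1/0/1", 1, frozenset({10, 20})),
        PortConfig("old-sw", "0/24", 99, frozenset({10, 20})),
    ]
    for line in find_vlan_mismatches(path, expected_default=1, required_tagged={10, 20}):
        print(line)
```

Running a check like this before and after every change turns "I think it works" into something that can fail loudly in a pipeline instead of in production.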

Once, I thought I had correctly tuned Redis's memory usage in the backend of one of my side products. However, because I chose the wrong eviction policy (Redis's maxmemory-policy setting), Redis would suddenly stop under a certain load. With nothing but "OOM-killed" in the journald logs to go on, I spent days examining cgroup limits and the Redis configuration to find the source of the problem. Such small details can deeply affect a system's stability.
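A small pre-flight check can catch this class of mistake before the kernel does. The sketch below reads the maxmemory and maxmemory-policy directives from the text of a redis.conf and compares them with the container's memory limit. The 80% headroom factor and the function names are assumptions chosen for illustration, not recommended values.

```python
import re

# Size suffixes as documented at the top of redis.conf (1k = 1000, 1kb = 1024, ...).
_UNITS = {"": 1, "k": 1000, "kb": 1024, "m": 1000**2, "mb": 1024**2,
          "g": 1000**3, "gb": 1024**3}


def parse_size(value):
    """Convert a redis.conf size such as '512mb' into bytes."""
    match = re.fullmatch(r"(\d+)([a-z]*)", value.strip().lower())
    if not match or match.group(2) not in _UNITS:
        raise ValueError(f"unrecognized size: {value!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def read_directives(conf_text):
    """Collect 'name value' pairs, ignoring blank lines and comments."""
    directives = {}
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            directives[parts[0].lower()] = parts[1].strip()
    return directives


def check_redis_memory(conf_text, cgroup_limit_bytes, headroom=0.8):
    """Return warnings about settings that tend to end in an OOM kill or failed writes."""
    d = read_directives(conf_text)
    maxmemory = parse_size(d.get("maxmemory", "0"))   # 0 means "no limit"
    policy = d.get("maxmemory-policy", "noeviction")  # Redis default
    warnings = []
    if maxmemory == 0:
        warnings.append("maxmemory is unset: only the kernel OOM killer enforces a limit")
    elif maxmemory > cgroup_limit_bytes * headroom:
        warnings.append("maxmemory leaves too little headroom under the cgroup limit")
    if policy == "noeviction":
        warnings.append("maxmemory-policy is noeviction: writes fail once maxmemory is reached")
    return warnings
```

For example, check_redis_memory("maxmemory 2gb\nmaxmemory-policy allkeys-lru\n", 4 * 1024**3) returns an empty list, while the same configuration under a 2 GiB cgroup limit produces the headroom warning.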

The Human Factor and the Power of Processes

No matter how perfect technical architectures are, if people and business processes are involved, everything changes. While working on a production ERP, I realized that software architecture is often not a technical architecture, but an organizational flow. Delayed shipment reports were always incomplete; the reason wasn't a SQL query or a performance problem, but operators in the field forgetting to update their screens at the right time. This showed me that software is not just code; it is inseparable from user experience and workflow.

In another lesson, I paid a heavy price for saying "yes." When a project manager asked for a quick delivery for a "small feature," I underestimated the potential side effects and testing time and said "yes." As a result, that "small feature" led to a month-long regression testing cycle and a two-week rollback operation. Sometimes the right answer is "no," or one needs to learn to say "only under these conditions."

Security is a Process, Not a One-Time Project

System security is not like building a wall; it's a living organism that requires constant vigilance. I saw how critical CVE tracking was on an internal banking platform. When a kernel vulnerability (such as CVE-2026-31431 related to algif_aead) emerged, we had to apply a kernel module blacklist within hours. This taught me that security updates are too vital to be put off as "something to do later."
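When a mitigation like this has to go out to many hosts within hours, generating the file beats typing it by hand. The sketch below renders a modprobe.d fragment using the standard blacklist and install directives. The function name and the reason text are hypothetical, and where the output is written (conventionally a .conf file under /etc/modprobe.d/) is left to whatever deployment tooling is in use.

```python
import re

_MODULE_NAME = re.compile(r"[A-Za-z0-9_-]+")


def render_modprobe_blacklist(modules, reason=""):
    """Render a modprobe.d fragment that blocks the given kernel modules."""
    lines = []
    if reason:
        lines.append(f"# {reason}")
    for name in modules:
        if not _MODULE_NAME.fullmatch(name):
            raise ValueError(f"suspicious module name: {name!r}")
        # 'blacklist' stops alias-based autoloading of the module.
        lines.append(f"blacklist {name}")
        # 'install ... /bin/false' makes explicit load attempts fail as well.
        lines.append(f"install {name} /bin/false")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_modprobe_blacklist(["algif_aead"], reason="temporary mitigation"), end="")
```

Keep in mind that a fragment like this only prevents future loads; a module that is already loaded stays in the kernel until it is removed or the host reboots.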

Even when writing fail2ban rules, I saw how creative attackers can be. Simple regexes are not enough on their own; properly configured JWT/OAuth2 patterns and rate limiting mechanisms are vital for thwarting SQL injection attempts and DDoS attacks. Security is an area that requires continuous learning and staying current; it's not something you can set up once and forget.
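To make "rate limiting mechanism" concrete, here is a minimal token bucket, one common way to cap request rates per client. This is a generic sketch, not a description of how fail2ban or any specific gateway works; the class name and parameters are assumptions for illustration.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock            # injectable so the limiter can be tested
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost=1.0):
        """Return True and spend tokens if the request fits, else False."""
        now = self.clock()
        elapsed = now - self.updated
        # Refill for the time that has passed, never beyond the capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate=5, capacity=10)   # 5 requests/s, bursts of 10
    accepted = sum(bucket.allow() for _ in range(25))
    print(f"accepted {accepted} of 25 back-to-back requests")
```

In practice one bucket would be kept per client key (IP address, API token, user ID), and rejected requests would typically get an HTTP 429 response.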

Operational Debt and the Importance of Visibility

Operational debt is as dangerous as technical debt and often progresses more insidiously. During a period when I neglected to monitor WAL bloat in PostgreSQL, I saw our disks fill up much faster than expected. Another time, because I took the reliability of systemd timer units for granted, I only realized during an audit that a critical backup job hadn't been running for months. If you don't monitor something, you'll never know whether it's working.
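One cheap guard against a silently dead backup job is a freshness check on something the job only produces when it succeeds. The sketch below assumes the job touches a marker file at the end of a successful run; the marker convention, the path in the example, and the 26-hour threshold are all hypothetical.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path


def backup_age(marker, now):
    """Age of the success marker, or None if the job has never succeeded."""
    try:
        mtime = Path(marker).stat().st_mtime
    except FileNotFoundError:
        return None
    return now - datetime.fromtimestamp(mtime, tz=timezone.utc)


def check_backup(marker, max_age, now=None):
    """Return 'OK' or a CRITICAL message suitable for an alerting hook."""
    now = now or datetime.now(timezone.utc)
    age = backup_age(marker, now)
    if age is None:
        return "CRITICAL: no successful backup recorded"
    if age > max_age:
        return f"CRITICAL: last successful backup was {age} ago"
    return "OK"


if __name__ == "__main__":
    # Hypothetical marker path; a daily job gets a little slack beyond 24 hours.
    print(check_backup("/var/lib/backup/last-success", timedelta(hours=26)))
```

The important design choice is that the check runs independently of the job it watches, so a timer that never fires still triggers an alert.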

Observability is not just about collecting metrics, logs, and traces; it's also about making this data meaningful and generating actionable alerts. Once, while running my applications with Docker Compose on a bare-metal server, I ran into OOM errors during builds because I hadn't set container memory limits correctly. Understanding journald's rate limits and correctly using the cgroup memory.high soft limit allowed me to detect such problems ahead of time.
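The cgroup v2 interface exposes enough to see memory pressure building before an OOM kill happens. The following sketch reads memory.current, memory.high, and memory.events from a cgroup directory and summarizes them. The file names are the standard cgroup v2 ones, while the 90% warning ratio, the function names, and the example path are assumptions.

```python
from pathlib import Path


def read_value(path):
    """cgroup v2 value files hold either a byte count or the literal 'max'."""
    raw = Path(path).read_text().strip()
    return None if raw == "max" else int(raw)


def read_events(path):
    """memory.events is a flat 'key value' file, e.g. 'high 12' or 'oom_kill 0'."""
    events = {}
    for line in Path(path).read_text().splitlines():
        key, _, value = line.partition(" ")
        if value:
            events[key] = int(value)
    return events


def memory_report(cgroup_dir, warn_ratio=0.9):
    """Summarize how close a cgroup is to its memory.high soft limit."""
    cg = Path(cgroup_dir)
    current = read_value(cg / "memory.current")
    high = read_value(cg / "memory.high")
    events = read_events(cg / "memory.events")
    near_high = high is not None and current >= high * warn_ratio
    return {
        "current": current,
        "high": high,                             # None means no soft limit is set
        "near_high": near_high,
        "throttled": events.get("high", 0),       # times reclaim was forced at memory.high
        "oom_kills": events.get("oom_kill", 0),   # processes killed by the OOM killer
    }


if __name__ == "__main__":
    # Hypothetical path; the real one depends on how the container runtime names its cgroups.
    print(memory_report("/sys/fs/cgroup/system.slice/docker-example.scope"))
```

A rising "throttled" counter is the early warning: the workload is being slowed down at the soft limit well before the hard limit and the OOM killer come into play.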

The Value of Simplicity and Robustness

One of the most important lessons I learned after twenty years is the power of simplicity. In discussions about monolith vs. microservice, my preference has always been "complexity as needed." While developing an ERP for a manufacturing company, we considered patterns like event-sourcing and CQRS, but after weighing whether the existing team and infrastructure could handle them, we started with a more pragmatic, modular monolith. This gave us fast delivery and reduced our operational burden.

When designing systems, I always keep concepts like idempotency and the transactional outbox in mind. Especially when working with eventual consistency in distributed systems, ensuring that an operation yields the same result even if it runs multiple times is critical. These simple yet powerful principles ensure that my systems are more robust and manageable in the long run.
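Both ideas fit in a few lines when the database does the heavy lifting. The sketch below uses SQLite from the standard library: an idempotency key makes repeated requests harmless, and the outbox row is written in the same transaction as the business row so the event cannot be lost or invented. The table names, columns, and function names are made up for illustration.

```python
import json
import sqlite3


def init_db(conn):
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS orders (
            idempotency_key TEXT PRIMARY KEY,
            payload TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            event TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
        """
    )


def place_order(conn, idempotency_key, payload):
    """Store the order and its event atomically; a replayed key is a no-op."""
    body = json.dumps(payload)
    with conn:  # commits on success, rolls back if an exception escapes
        cur = conn.execute(
            "INSERT OR IGNORE INTO orders (idempotency_key, payload) VALUES (?, ?)",
            (idempotency_key, body),
        )
        if cur.rowcount == 0:
            return False  # this key was already processed
        conn.execute("INSERT INTO outbox (event) VALUES (?)", (body,))
    return True


def drain_outbox(conn, publish):
    """Publish pending events in order, marking each one after it is sent."""
    rows = conn.execute(
        "SELECT id, event FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, event in rows:
        publish(json.loads(event))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    init_db(conn)
    print(place_order(conn, "req-123", {"sku": "A", "qty": 2}))  # first attempt
    print(place_order(conn, "req-123", {"sku": "A", "qty": 2}))  # client retry
    drain_outbox(conn, print)
```

Because an event is marked as published only after publish() returns, a crash in between leads to a second delivery rather than a lost event, which is exactly why the consumer on the other side needs to be idempotent too.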

Server management is a journey that requires continuous learning and adaptation. These are some of the most valuable lessons from my twenty years on that journey. What is the most valuable lesson you have learned on yours? Does it overlap with mine, or did you draw completely different conclusions? I eagerly await your comments.