Mustafa ERBAY

Posted on • Originally published at mustafaerbay.com.tr

System Architecture is a Bit About Paranoia

Recently, a series of OOM-killed errors in the AI generation pipeline running on my own VPS took me back to the old days. I once again saw how a sleep 360 command could wreak havoc on a system, and what a simple mistake can cost. This situation made me realize that system architecture is, in fact, a bit about "paranoia."

For me, this state of "paranoia" is like a way of life built on anticipating worst-case scenarios, accepting that anything can go wrong, and taking precautions accordingly. While it might sound a bit negative, my 20 years of experience in the field have repeatedly proven why this approach is indispensable.

Roots of Paranoia: Past Scars

This "paranoid" mindset isn't an empty delusion; it's a result of bitter experiences I've lived through and learned from. Over the years, I've seen many systems crash unexpectedly, slow down, or become completely inaccessible. These incidents formed the foundation of my architectural approach.

I remember once, at a major Turkish e-commerce site, the database server completely locked up in the middle of a critical campaign. I'll never forget the helplessness and panic of that moment. I've experienced similar, though smaller-scale, crises on my own VPS; for example, my disk filling up to 100% on April 28th. Such events taught me how crucial it is to ask, "what if?"

Incidents on My Own VPS

I manage over 13 Docker containers on my own server. Sometimes, when even one starts behaving unexpectedly, I see a domino effect: the others get swapped out as well. These scenarios show how the parts of a system interact with one another and how the weakest link can affect the entire system.

⚠️ VPS Overload and OOM

One of the most classic scenarios I've experienced on my own VPS is out-of-memory (OOM). Sometimes I've encountered situations like kcompactd using 92% CPU, or sshd being unable to accept new connections. This always reminds me how critical it is to monitor resources and know the limits.
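When I suspect an OOM event, the first thing I do is check whether the kernel has actually killed anything and how tight memory is right now. A minimal triage sketch, assuming a Linux host with systemd's journald:

```shell
#!/bin/sh
# Quick OOM triage: did the kernel kill a process recently, and how much
# memory is in use right now? journalctl is assumed (systemd hosts only).
if command -v journalctl >/dev/null; then
    journalctl -k --since "1 hour ago" --no-pager \
        | grep -i "killed process" || echo "no recent OOM kills"
fi
# Current memory pressure at a glance:
free -m | awk '/^Mem:/ { printf "used %d of %d MiB\n", $3, $2 }'
```

Sustained swap activity usually shows up here well before the OOM killer fires, which is exactly when symptoms like kcompactd spinning the CPU appear.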

Once, I noticed that Docker's build cache had reached 33 GB, and unused images were taking up 23 GB. My server disk was 100% full, and I couldn't even SSH in. This situation painfully taught me that even a simple full disk could cripple the entire operation. Since that day, I regularly run the docker system prune -a command.
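Rather than waiting for the disk to hit 100% again, I prefer pruning when usage crosses a threshold. A sketch of that guard; the 85% threshold is an assumption to tune for your own server:

```shell
#!/bin/sh
# Prune Docker caches once disk usage crosses a threshold, instead of
# waiting until SSH itself stops working. THRESHOLD is an assumption.
THRESHOLD=85

disk_used_pct() {
    # Percent-used of the given filesystem, without the trailing '%'.
    df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

if command -v docker >/dev/null && [ "$(disk_used_pct /)" -ge "$THRESHOLD" ]; then
    docker builder prune -f     # the build cache (33 GB in my case)
    docker image prune -a -f    # images not used by any container
fi
```

Running `docker system df` first shows where the space actually went before you prune anything.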

Knowing That Everything Can Break

As a system architect, my most fundamental principle is accepting that everything, absolutely everything, will break one day. It could be a hardware failure, a software bug, or a simple human error. What matters is knowing that these breakdowns will happen and building resilience mechanisms against them.

I once experienced state corruption in my GitHub Actions runner due to _work/_temp directories. The pain of deleting those directories and having to rebuild the entire pipeline showed me how fragile even automation systems can be. Such incidents explain why redundancy and fast recovery mechanisms are so valuable.
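The recovery I settled on is blunt but fast: stop the runner, clear the transient state, start it again. A sketch, where RUNNER_DIR and the use of the runner's svc.sh control script are assumptions based on a standard install:

```shell
#!/bin/sh
# Reset a self-hosted GitHub Actions runner whose _work/_temp state has
# gone stale. RUNNER_DIR is an assumption; point it at your install.
RUNNER_DIR="${RUNNER_DIR:-/opt/actions-runner}"

reset_runner_state() {
    # Stop the runner service first, if its control script is present.
    [ -x "$RUNNER_DIR/svc.sh" ] && sudo "$RUNNER_DIR/svc.sh" stop
    rm -rf "$RUNNER_DIR/_work/_temp"   # the per-job temp state that corrupted
    [ -x "$RUNNER_DIR/svc.sh" ] && sudo "$RUNNER_DIR/svc.sh" start
    return 0
}
# invoke manually, after confirming no job is running:
# reset_runner_state
```

Deleting only _temp keeps the checked-out workspaces, so the next job doesn't have to re-clone everything.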

Resilience and Fault Tolerance

This "paranoid" perspective drives me to focus more on the concepts of resilience and fault tolerance when designing systems. Planning how the system will remain operational if a component fails is one of the most critical steps in architecture.

For example, this blog's Astro build process sometimes consumes 2.5 GB of RAM, pushing the total system RAM to 7.6 GB and resulting in an OOM. In such a situation, I add a preflight resource guard to the pipeline to check resources first. If resources are insufficient, I defer the operation and switch to polling-wait mode. Last month, when I typed sleep 360 and got OOM-killed, I had to activate this polling-wait mechanism.
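The guard itself is simple: poll MemAvailable until there is enough headroom, instead of a blind sleep. A sketch; the 3 GiB requirement and the build command in the usage line are assumptions matching my Astro pipeline:

```shell
#!/bin/sh
# Preflight resource guard: poll until enough RAM is free, rather than
# a blind `sleep 360`. Threshold and build command are assumptions.

avail_mb() {
    # MemAvailable is the kernel's estimate of RAM usable without swapping.
    awk '/^MemAvailable:/ { print int($2 / 1024) }' /proc/meminfo
}

wait_for_ram() {
    need_mb=$1; poll_secs=$2; max_tries=$3
    tries=0
    while [ "$(avail_mb)" -lt "$need_mb" ]; do
        tries=$((tries + 1))
        if [ "$tries" -ge "$max_tries" ]; then
            echo "gave up waiting for ${need_mb} MiB free" >&2
            return 1
        fi
        sleep "$poll_secs"
    done
}
# usage: wait_for_ram 3072 60 30 && npm run build
```

The bounded retry count matters: a guard that waits forever just moves the failure somewhere harder to see.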

Cloudflare Cache Strategies

Even when using Cloudflare, this "paranoia" comes into play. Astro's default max-age=0 wasn't providing the performance I wanted for static content. Therefore, I implemented an override in Nginx to define longer cache durations for specific paths. This is a matter of making a trade-off between content freshness and performance, and consciously managing that trade-off.

location /_astro/ {
    expires 1y;
    add_header Cache-Control "public, max-age=31536000, immutable";
    proxy_pass http://localhost:4321;
}

In this example, I set a 1-year cache duration for static assets starting with _astro. This reduces unnecessary origin hits at the CDN layer, thereby improving performance and lightening the load on my server.
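It's worth verifying that the override is actually what clients receive, since a CDN or the origin app can overwrite headers. A small check, with an illustrative URL:

```shell
# Print the Cache-Control header a URL actually returns (empty if absent).
# The URL in the usage line is illustrative; use a real hashed asset path.
check_cache_header() {
    curl -sI "$1" | grep -i '^cache-control' || true
}
# usage: check_cache_header https://mustafaerbay.com.tr/_astro/index.abc123.css
```

For the config above, the expected value would be `public, max-age=31536000, immutable`.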

Security and Constant Vigilance

One of the most prominent areas where "paranoia" is evident in system architecture is security. I always assume that attackers will try to find the weakest point in the system. Therefore, I try to take precautions not only against known vulnerabilities but also against potential risks.

In recent years, I've seen many times how critical CVEs can be. Even on my own system, I've tracked potential risks like CVE-2026-31431 and tried to close a possible vulnerability by blacklisting kernel modules like algif_aead. Such proactive steps strengthen the system's overall security posture.

ℹ️ Proactive Security Measures

Steps like blacklisting kernel modules or disabling unnecessary services, while often seeming like minor details, can prevent major security incidents. Remember, the best security measure is to prevent an incident from happening.
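For reference, blacklisting a module is just a small drop-in file; the filename here is my own convention, not a requirement:

```
# /etc/modprobe.d/blacklist-algif_aead.conf
# `blacklist` stops the module from auto-loading on boot;
# the `install` line makes explicit load attempts run /bin/false instead.
blacklist algif_aead
install algif_aead /bin/false
```

If the module is already loaded, `sudo modprobe -r algif_aead` unloads it without waiting for a reboot.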

Runner Economics on My Own VPS

To avoid exceeding my GitHub Actions quota, I use a self-hosted runner on my own VPS. While this provides a cost advantage, it also requires me to constantly track updates and security patches. It lets me treat "paranoia" as a tool for both cost optimization and risk management.

Even when setting up my AI-powered content pipeline, I noticed errors occurring when slashes (/) were used in tags or when the publishDate field wasn't a quoted string. Even Turkish-specific details like the dotted/dotless i problem show that unexpected issues can arise at every layer of the system. These kinds of "quirks" are part of my constant vigilance and attention to every detail.
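Those two quirks are cheap to catch before the pipeline runs. A minimal frontmatter lint sketch; the field names (tags, publishDate) come from my own content schema, so treat them as assumptions:

```shell
#!/bin/sh
# Flag the two frontmatter quirks that bit me: a slash inside tags and
# an unquoted publishDate. Field names are assumptions from my schema.
lint_frontmatter() {
    file=$1
    status=0
    if grep -Eq '^tags:.*/' "$file"; then
        echo "$file: slash in tags" >&2
        status=1
    fi
    if grep -Eq "^publishDate:[[:space:]]*[^\"'[:space:]]" "$file"; then
        echo "$file: publishDate is not a quoted string" >&2
        status=1
    fi
    return $status
}
# usage: lint_frontmatter src/content/blog/my-post.md
```

Wired into the pipeline as a preflight step, this turns a confusing build failure into a one-line error message.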

Paranoia or Professionalism?

So, what exactly is this "paranoia" in system architecture? For me, it's not about being constantly anxious or expecting a disaster at any moment. Rather, it's about understanding the inherent complexity and fragility of systems and taking conscious steps to minimize these risks. We could call this risk-aware design.

This is a kind of engineering perspective. Just as an engineer building a bridge designs with scenarios like earthquakes, storms, or excessive loads in mind, we also strive to make our systems resilient against potential failures. I'm not ashamed of my own mistakes; on the contrary, last month when I typed sleep 360 and got OOM-killed, I learned from that error and developed the polling-wait mechanism. This is about asking, "what can I do to prevent this from happening again?" instead of just shrugging it off as "it happens."

For me, system architecture is a bit like this: being constantly vigilant and designing with the knowledge that everything can go wrong. Without this "paranoia," none of my side projects like hesapciyiz.com, spamkalkani.com, or islistesi.com would run so stably. Have you ever had such "paranoid" moments? I'd love to hear about them in the comments.
