Robert D. Stallworth

20 Brutally Honest Tips for Keeping Your Servers Alive

Look, whether you’re babysitting a single VPS or a massive cloud cluster, the difference between a restful night and a 3 AM hardware-induced panic attack usually comes down to your prep work. Server management isn't just about running updates—it's about building "future-proof" habits.

After managing tens of thousands of boxes at Bobcares, we've realised that the best admins aren't the ones who fix things the fastest; they’re the ones who make sure things don't break in the first place. Here’s how we do it.

I. Squeezing Out Performance

  1. Set your "Normal" early. You can’t tell if a server is acting up if you don’t know what a quiet Tuesday looks like. Log your CPU and RAM baselines now, or you’ll be guessing during a crisis.
  2. Stop ignoring the DB. Most "slow server" tickets are actually just bad database configs. If it’s a dedicated box, give InnoDB 75% of your RAM. Don't let it starve.
  3. Kill slow queries. Turn on the slow query log. If a query takes over a second, it's a bug, not a feature. Use EXPLAIN to find out why your indexes are failing you.
  4. Cache or die. Use Redis for object caching and OPcache for PHP. Rendering the same page from scratch 10,000 times is just a waste of electricity.
  5. Audit your "Right-Sizing." Cloud providers love it when you overpay for idle CPU. If you’re at 5% usage but hitting swap, stop buying more cores and just upgrade the RAM.
  6. Tweak Nginx/Apache. Enable Brotli for better compression and make sure your worker_processes actually matches your core count. Default configs are almost always garbage.
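Tip 1's baseline doesn't need an agent to get started. A minimal sketch using standard Linux tools (the log filename is illustrative) that you could drop into cron:

```shell
#!/bin/sh
# Append a point-in-time snapshot of load, memory, and disk to a daily
# log so "normal" is written down before a crisis hits.
BASELINE="baseline-$(date +%Y%m%d).log"
{
  echo "== $(date -u) =="
  uptime        # 1/5/15-minute load averages
  free -m       # RAM and swap usage, in MiB
  df -h /       # root filesystem usage
} >> "$BASELINE"
echo "Baseline appended to $BASELINE"
```

Run it hourly for a week and you'll know exactly what a quiet Tuesday looks like.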

II. Security Without the Fluff

  7. SSH Keys are non-negotiable. Password auth is a playground for botnets. Switch to Ed25519 keys and lock the door (disable password login) in your sshd_config.
  8. Obscurity still buys quiet. Move SSH off port 22. It won’t stop a real attacker, but it silences the "noise" of 10 million daily bot scans hitting your logs.
  9. Default Deny Firewalls. If a port doesn’t need to be open to the public (looking at you, MySQL 3306), close it. Use UFW or firewalld to whitelist only your office IP for management.
  10. The 72-Hour Patch Rule. Security holes like Heartbleed don’t wait for your scheduled maintenance. Automate security-only updates, or have a process to patch critical CVEs within three days.
  11. Deploy Fail2Ban. It’s the digital equivalent of a "No Trespassing" sign that actually bites. If a bot fails a login three times, ban its IP for 24 hours.
  12. Principle of Least Privilege. Why is your web app running as root? It shouldn’t be. Audit your SUID binaries and ensure each process has the minimum necessary permissions to function.
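Tips 7 and 8 together amount to a handful of lines in sshd_config. A minimal hardening sketch, assuming OpenSSH; the port number here is purely illustrative:

```
# /etc/ssh/sshd_config -- key-only auth on a non-default port
Port 2222                          # anything off 22 cuts the bot-scan noise
PasswordAuthentication no          # keys only
PubkeyAuthentication yes
PermitRootLogin prohibit-password  # root via key only, or "no" outright
```

Generate a key pair with `ssh-keygen -t ed25519`, install it with `ssh-copy-id`, and always run `sshd -t` to validate the config before restarting the daemon — locking yourself out of your own box is a rite of passage best skipped.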

III. Keeping an Eye on the Pulse

  13. Alert on symptoms, not stats. A CPU spike isn't a problem; a slow checkout page is. Build alerts that tell you when the user is hurting, not just when a meter hits red.
  14. Centralize the mess. SSH-ing into ten boxes to check logs is a nightmare. Ship everything to Loki or ELK. It makes finding the "root cause" take minutes instead of hours.
  15. Synthetic Probes. Don't wait for a customer to complain. Use a service to "fake" a user login every few minutes from different parts of the world to ensure the site actually works.
  16. Watch the Disk I/O. Sometimes the CPU is fine, but the disk is "waiting." High I/O wait is a classic sign of a failing drive or a noisy neighbour in a cloud environment.
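Tip 16's I/O wait can be eyeballed without any monitoring agent at all. A rough sketch reading /proc/stat directly (Linux only; this is a since-boot average, not a live rate):

```shell
#!/bin/sh
# First line of /proc/stat: "cpu user nice system idle iowait irq ..."
# All values are cumulative jiffies since boot.
read -r _ user nice system idle iowait _ < /proc/stat
total=$((user + nice + system + idle + iowait))
pct=$((100 * iowait / total))
echo "iowait since boot: ${pct}%"
```

A sustained double-digit iowait while the CPU sits idle is the classic "the disk is the bottleneck" signature. For a live view, reach for `iostat -x 1` or a node_exporter dashboard instead.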

IV. The Golden Rule of Backups

  17. The 3-2-1 Law. 3 copies of data, 2 different media types, 1 off-site. If your only backup is on the same rack as your server, you don't have a backup.
  18. Restore Drills. A backup is just a file until it's proven to work. Try to restore your entire site to a fresh box once a month. If you can’t do it in under an hour, your plan is broken.
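Tip 18 in miniature: back up, restore somewhere fresh, and prove the bytes match. A toy sketch with illustrative paths:

```shell
#!/bin/sh
set -e
# Fake a tiny "site" to protect
SRC=demo-site
mkdir -p "$SRC"
echo "hello" > "$SRC/index.html"

# The backup
tar -czf site-backup.tar.gz "$SRC"

# The "fresh box": restore into an empty temp directory
RESTORE=$(mktemp -d)
tar -xzf site-backup.tar.gz -C "$RESTORE"

# A backup is only proven once the restored copy matches the original
diff -r "$SRC" "$RESTORE/$SRC" && echo "restore verified"
```

The real drill is the same three steps at full scale: restore the database dump too, point a staging vhost at it, and time the whole run against your recovery objective.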

V. Working Like a Pro

  19. Infrastructure as Code (IaC). Stop hand-crafting servers like they’re artisanal pottery. Use Ansible or Terraform. If a box dies, you should be able to spin up an exact clone with one command.
  20. Blameless Post-Mortems. When a server crashes, don't hunt for a person to fire. Hunt for the flaw in the system that let it happen. Document it, fix it, move on.
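Tip 19's "exact clone with one command" looks like this in Ansible. An illustrative playbook sketch — the `web` host group and the nginx package are assumptions, not a prescription:

```yaml
# site.yml -- rebuild a web box from nothing: ansible-playbook site.yml
- hosts: web
  become: true
  tasks:
    - name: Install nginx
      ansible.builtin.apt:
        name: nginx
        state: present
        update_cache: true

    - name: Ensure nginx runs now and at boot
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: true
```

Because every task is declarative and idempotent, rerunning the playbook is always safe — losing the box costs you one `ansible-playbook` run, not an afternoon of archaeology.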

Summary

1-6 (Speed): Tune your DB, use Redis, and stop overpaying for idle cloud cores.
7-12 (Security): Kill passwords, move your SSH port, and patch within 72 hours.
13-16 (Visibility): Monitor the user experience, not just the hardware meters.
17-18 (Safety): Follow 3-2-1 and actually test your restores.
19-20 (Ops): Use Ansible and stop blaming people for system failures.

Tired of the 24/7 grind?

Keeping up with these 20 rules is a lot of work. At Bobcares, we handle the heavy lifting for over 52,000 servers worldwide. We handle the hardening, the patches, and the 3 AM alerts so you don't have to.
