Mohamed Idris
If You Were a Server: How to Detect Issues and Keep Things Running Smoothly

Here is a question that often pops up in senior web developer and backend engineering interviews:

"If you were a server, how would you detect that you're having issues, and what would you do next to make things run smoothly?"

At first, it sounds a little weird. Be a server? But that is actually the whole point. The interviewer wants to know if you can think like infrastructure — can you see problems before they become disasters, and do you know how to respond?

This post breaks it all down in simple English. Whether you are a junior developer hearing these concepts for the first time, or someone preparing for a senior interview, you will walk away with a solid mental model.


First: What Is the Question Really Asking?

The question is covering two things:

  1. Detection — How does a server (or you, the developer/SRE watching it) know something is wrong?
  2. Response — What actions do you take to fix it or at least keep things stable?

Think of it like being a doctor for your own body. You check your temperature, blood pressure, and pulse. If something is off, you take medicine, rest, or call a specialist. Servers work the same way.


Part 1: How Does a Server Detect It Has Issues?

The Four Core Vitals

Just like a doctor checks your basic vitals, a server has its own set of core health signals. These are the first things to look at.

1. CPU Usage

CPU (Central Processing Unit) is the brain of your server. It handles every calculation, request, and operation.

  • Healthy: Under 60-70% usage during normal load.
  • Warning: Consistently above 80%.
  • Critical: Sustained 90%+ means your server is struggling to breathe.

When CPU is maxed out, your server starts slowing down responses, queuing requests, or dropping them entirely.

How to check it: top, htop, vmstat, or any cloud monitoring dashboard like AWS CloudWatch or Datadog.
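As a quick illustration, the bands above can be encoded as a tiny classifier. The thresholds are this section's rules of thumb, not universal constants:

```python
def cpu_status(usage_percent: float) -> str:
    """Map a CPU reading onto rough health bands.
    Thresholds follow this article's rules of thumb."""
    if usage_percent >= 90:
        return "critical"   # sustained 90%+ → the server is struggling
    if usage_percent > 80:
        return "warning"    # consistently above 80% → investigate
    return "healthy"        # under ~70% during normal load

cpu_status(55)   # → 'healthy'
cpu_status(95)   # → 'critical'
```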

2. Memory (RAM) Usage

Memory is your server's short-term workspace. Every running process uses RAM.

  • Healthy: Enough free RAM to handle spikes.
  • Warning: When RAM fills up, your OS starts using swap — which is disk space acting as fake RAM. Disk is much slower than RAM.
  • Critical: If swap fills up too, processes start crashing.

A common mistake is ignoring memory leaks — situations where an application keeps grabbing more memory and never releasing it. Over time, this silently kills your server.

3. Disk Usage and I/O

Disk usage is how much of your storage is full. Disk I/O is how fast data is being read and written.

  • Best practice: Keep at least 20-30% disk space free at all times.
  • A full disk can crash your database, stop logging, and break deployments.
  • High disk I/O (lots of reads/writes happening at once) can make everything slow, even if CPU and RAM look fine.

4. Network Bandwidth

Your server talks to the outside world through the network. If the pipe gets clogged:

  • Requests take longer to arrive and respond.
  • Large file uploads or downloads can saturate the connection.
  • You may see packet loss — data literally gets dropped mid-transfer.

Beyond the Core Vitals: Application-Level Signals

The four core vitals tell you about the hardware. But you also need to watch what your application is doing.

5. Response Time / Latency

How long does your server take to respond to a request?

  • A healthy API might respond in 100ms.
  • If latency jumps to 2-3 seconds, something is wrong — maybe a slow database query, a blocked thread, or an overloaded service.

Why it matters: Users notice latency before they notice anything else. A slow page feels broken.

6. Error Rate

What percentage of your requests are returning errors?

  • 5xx errors (500, 502, 503, 504) mean your server is the problem.
  • If your error rate goes from 0.1% to 5%, something just broke.
  • Spikes in 4xx errors (404, 403) can also signal broken deployments or misconfigured routes.

7. Throughput / Request Rate

How many requests per second is your server handling?

  • A sudden drop in traffic can be as alarming as a spike — it might mean your server is down and clients are not even reaching it.
  • A sudden spike might mean a traffic surge, a bot attack, or a viral moment — all of which need different responses.

Health Checks: The Server Checking Itself

Modern servers do not wait for humans to notice something is wrong. They self-report through health check endpoints.

There are two main types used in systems like Kubernetes:

Liveness Probe

"Am I still alive?"

This is a simple check: is the process running at all? If a liveness probe fails, the orchestrator (like Kubernetes) will restart the container.

Example endpoint: GET /health/live → returns 200 OK if running.

Readiness Probe

"Am I ready to receive traffic?"

This is more nuanced: is the server running and fully ready to handle requests? Maybe it is warming up a cache, waiting for a database connection, or doing startup migrations.

If a readiness probe fails, the load balancer stops sending traffic to that instance — without killing it. Once it recovers, traffic resumes.

Example endpoint: GET /health/ready → checks DB connection, cache, etc.
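Here is a minimal sketch of the two probes in plain Python. The check names (db, cache) and the dict-of-callables shape are illustrative, not tied to any framework:

```python
def liveness():
    # Liveness only asks: is the process running? If this code executes
    # at all, the answer is yes — so always report healthy.
    return 200, "OK"

def readiness(checks):
    # Readiness asks: can this instance serve traffic right now?
    # Run every dependency check; any failure → tell the load balancer
    # to hold traffic back (503) without killing the process.
    results = {name: check() for name, check in checks.items()}
    status = 200 if all(results.values()) else 503
    return status, results

# Example: database is up, cache is still warming
status, detail = readiness({"db": lambda: True, "cache": lambda: False})
# status == 503 → instance stays alive but receives no traffic yet
```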


Logging: The Server's Diary

Logs are your best friend when something goes wrong. A server should write meaningful logs that tell a story.

Structured logging means logging in a consistent format (usually JSON) so machines can read it:

```json
{
  "timestamp": "2026-04-22T10:30:00Z",
  "level": "error",
  "message": "Database connection failed",
  "service": "api",
  "request_id": "abc-123",
  "duration_ms": 5000
}
```

Without good logs, debugging is like trying to solve a crime with no evidence. With good logs, you can trace exactly what happened, when, and why.

Log levels to know:

  • DEBUG — detailed developer info (not for production usually)
  • INFO — normal operations ("User logged in")
  • WARN — something is off but not broken yet
  • ERROR — something broke, needs attention
  • FATAL / CRITICAL — the server cannot continue
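With Python's standard logging module, a minimal JSON formatter producing records shaped like the example above looks something like this (the field names are illustrative):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit each log record as a single JSON object per line."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "service": "api",
        })

logger = logging.getLogger("api")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.error("Database connection failed")
# → {"timestamp": "...", "level": "error", "message": "Database connection failed", "service": "api"}
```

Real setups would add request IDs and durations via logging's `extra` mechanism or a library, but the principle is the same: one machine-readable object per event.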

Alerting: Getting Notified Before It's Too Late

Monitoring without alerting is useless. You cannot stare at dashboards 24/7.

Set up alerts with smart thresholds:

| Metric | Warning Threshold | Critical Threshold |
| --- | --- | --- |
| CPU Usage | > 80% for 5 min | > 90% for 2 min |
| Memory Usage | > 85% | > 95% |
| Disk Space | < 25% free | < 10% free |
| Error Rate | > 1% | > 5% |
| Response Time | > 500ms avg | > 2s avg |

Do not alert on every small blip. Use time windows (e.g., "CPU > 80% for 5 consecutive minutes") to avoid alert fatigue — a flood of noisy alerts that people start ignoring.

Tools: PagerDuty, OpsGenie, Slack alerts, AWS CloudWatch Alarms, Grafana Alerts.
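The "for 5 consecutive minutes" idea can be sketched as a small stateful check. The threshold and window size are examples — tune them per metric:

```python
from collections import deque

class WindowedAlert:
    """Fire only when every sample in the window breaches the threshold,
    so a single blip never pages anyone."""

    def __init__(self, threshold, window):
        self.threshold = threshold
        self.samples = deque(maxlen=window)

    def observe(self, value) -> bool:
        self.samples.append(value)
        full = len(self.samples) == self.samples.maxlen
        return full and all(v > self.threshold for v in self.samples)

cpu = WindowedAlert(threshold=80.0, window=5)   # "CPU > 80% for 5 samples"
[cpu.observe(v) for v in [85, 92, 88, 90, 95]]
# → [False, False, False, False, True]: only the fifth consecutive breach fires
```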


Part 2: What Do You Do Next to Keep Things Running Smoothly?

You detected the problem. Now what? Here are the key strategies — from automatic to manual.

1. Auto-Scaling: Get More Help

If your server is overloaded, the simplest answer is: add more servers.

Auto-scaling is when your infrastructure automatically spins up new server instances when load is high, and shuts them down when load drops.

```
Normal traffic:   [Server 1]
Traffic spike:    [Server 1] [Server 2] [Server 3]
Traffic drops:    [Server 1]
```

This works with a load balancer sitting in front — it distributes incoming requests across all available instances so no single server gets crushed.

Horizontal scaling = more instances (this is what auto-scaling does).
Vertical scaling = bigger instance (more CPU/RAM on the same machine). Harder to do automatically.

Cloud providers (AWS, GCP, Azure) all have auto-scaling built in.
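As a sketch of how an autoscaler decides, Kubernetes' Horizontal Pod Autoscaler uses essentially this rule: desired = ceil(current × currentMetric / targetMetric), clamped to configured bounds. The min/max values here are illustrative:

```python
import math

def desired_replicas(current, metric, target, min_r=1, max_r=10):
    """Same shape as the Kubernetes HPA scaling rule:
    desired = ceil(current * currentMetric / targetMetric),
    clamped to the configured min/max bounds."""
    want = math.ceil(current * metric / target)
    return max(min_r, min(max_r, want))

desired_replicas(3, metric=90, target=60)   # → 5 (scale out under load)
desired_replicas(5, metric=20, target=60)   # → 2 (scale back in)
```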


2. Circuit Breaker: Stop the Bleeding

Imagine your server calls another service (a payment API, a database, a third-party service). That service is slow or down. Your server keeps waiting... and waiting... and all your threads are now stuck waiting for something that will never respond. Your whole server grinds to a halt because of someone else's problem.

The Circuit Breaker pattern prevents this.

It works in three states:

```
CLOSED (Normal)
  → Requests pass through
  → Failures are counted

OPEN (Problem detected)
  → Requests immediately fail fast
  → No waiting, no timeout
  → Returns an error or fallback instantly

HALF-OPEN (Testing recovery)
  → A few test requests get through
  → If they succeed → back to CLOSED
  → If they fail → back to OPEN
```

It is called a circuit breaker because it works exactly like the electrical circuit breaker in your house — when something goes wrong, it cuts the connection to stop further damage.

Libraries: opossum (Node.js), resilience4j (Java), Polly (.NET).
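A toy version of the three states, to make the mechanics concrete — the thresholds, timeout, and names are illustrative, not any of those libraries' actual APIs:

```python
import time

class CircuitBreaker:
    """Toy three-state breaker (CLOSED → OPEN → HALF-OPEN)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = 0.0
        self.state = "CLOSED"

    def call(self, fn):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast: no waiting on a dependency we know is down.
                raise RuntimeError("circuit open: failing fast")
            self.state = "HALF-OPEN"  # timeout elapsed: let a test request through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "HALF-OPEN" or self.failures >= self.max_failures:
                self.state = "OPEN"
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.state = "CLOSED"  # success closes the circuit again
        return result
```

The `clock` parameter is injected only so the sketch can be tested without real waiting.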


3. Graceful Degradation: Do Less, Not Nothing

When part of your system is broken, the goal is to keep the core experience working, even if some features are temporarily unavailable.

Real-world examples:

  • YouTube goes down? Show the homepage with a "video unavailable" message instead of crashing entirely.
  • Recommendation engine fails? Show generic popular content instead of personalized picks.
  • Payment service is slow? Disable the "buy now" button and show a "try again shortly" message instead of hanging the page.

This is better than an error page that says "500 Internal Server Error" with nothing else.

The key idea: degrade gracefully, fail loudly only when you have no other choice.
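The recommendation-engine example can be sketched as a fallback wrapper — the function names below are made up for illustration:

```python
def with_fallback(primary, fallback):
    """Run primary; on any failure, serve the degraded fallback instead
    of surfacing a 500."""
    def wrapped(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            # In production you would also log/alert here — degrade
            # gracefully, but never silently.
            return fallback(*args, **kwargs)
    return wrapped

def personalized_picks(user_id):
    raise TimeoutError("recommendation service is down")

def popular_content(user_id):
    return ["top-10 playlist"]

recommendations = with_fallback(personalized_picks, popular_content)
recommendations("user-42")   # → ["top-10 playlist"] — degraded, not broken
```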


4. Caching: Avoid Repeating Work

Many server problems come from doing the same expensive work over and over. Caching stores the result of expensive operations so you can reuse them.

```
Without cache:
User → Server → Database (100ms) → User

With cache:
User → Server → Cache (1ms) → User  (if cached)
User → Server → Database (100ms) → Cache → User  (if not cached)
```

Types of caching:

  • In-memory cache: Redis, Memcached — extremely fast.
  • HTTP caching: Cache-Control headers tell browsers and CDNs to store responses.
  • Database query caching: Cache frequently run queries.

When your server is struggling, a well-configured cache can reduce database load by 80% or more.
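The cache-aside flow in the diagram above can be sketched as a small TTL decorator — a stand-in for the idea, not a Redis client:

```python
import time

def ttl_cache(ttl_seconds, clock=time.monotonic):
    """Tiny cache-aside decorator with expiry.
    Keys must be hashable positional args."""
    def decorator(fn):
        store = {}
        def wrapped(*args):
            now = clock()
            hit = store.get(args)
            if hit is not None and now - hit[1] < ttl_seconds:
                return hit[0]              # cache hit: skip the expensive call
            value = fn(*args)
            store[args] = (value, now)     # miss: do the work once, remember it
            return value
        return wrapped
    return decorator

calls = []

@ttl_cache(ttl_seconds=60)
def slow_query(user_id):
    calls.append(user_id)  # stands in for a ~100ms database round trip
    return {"user": user_id}

slow_query("a")
slow_query("a")   # served from cache — the underlying query ran only once
```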


5. Rate Limiting: Control Incoming Traffic

If traffic spikes and you cannot scale fast enough, you need to protect yourself by limiting how many requests any single client can make.

Example: Allow 100 requests per minute per IP. After that, return 429 Too Many Requests.

This protects against:

  • Accidental infinite loops in client code
  • DDoS (Distributed Denial of Service) attacks
  • Scrapers hammering your API

It is not just about bad actors — rate limiting is also good for your own internal services to prevent one slow consumer from starving everyone else.
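Here is a sketch of the "100 requests per minute per IP" rule as a fixed-window counter. Real deployments often use token buckets or a gateway/CDN feature instead:

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Allow `limit` requests per client per window;
    False means 'respond 429 Too Many Requests'."""

    def __init__(self, limit=100, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.counts = defaultdict(lambda: (0.0, 0))  # client → (window_start, count)

    def allow(self, client: str) -> bool:
        now = self.clock()
        start, count = self.counts[client]
        if now - start >= self.window:
            start, count = now, 0          # new window: reset the counter
        allowed = count < self.limit
        if allowed:
            count += 1                     # only allowed requests consume quota
        self.counts[client] = (start, count)
        return allowed
```

The `clock` parameter is injected so the sketch can be exercised without real waiting.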


6. Retry with Exponential Backoff: Be Patient, Not Aggressive

Sometimes a service is temporarily down and recovers in seconds. Clients should retry — but smart retries, not aggressive ones.

Bad retry:

```
Fail → Retry immediately → Fail → Retry immediately → Fail...
(This makes the overloaded server worse)
```

Good retry with exponential backoff:

```
Fail → Wait 1s → Retry
Fail → Wait 2s → Retry
Fail → Wait 4s → Retry
Fail → Wait 8s → Retry (give up after X attempts)
```

Each retry waits twice as long as the last. Add a bit of random delay (called jitter) so thousands of clients do not all retry at the exact same moment, which would cause another spike.
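The schedule above, with jitter, fits in a few lines — the base delay, factor, and jitter range are illustrative defaults:

```python
import random

def backoff_delays(base=1.0, factor=2.0, attempts=4, max_jitter=0.5):
    """Delay before each retry: base, 2×base, 4×base, ... plus a little
    random jitter so thousands of clients don't retry in lockstep."""
    return [base * factor ** i + random.uniform(0, max_jitter)
            for i in range(attempts)]

backoff_delays(max_jitter=0)   # → [1.0, 2.0, 4.0, 8.0]
```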


7. Rollbacks: Undo What Broke It

Sometimes the issue is a bad deployment. The fastest fix is often to roll back to the previous working version.

Blue-green deployment and canary releases are strategies to reduce this risk:

  • Blue-Green: You have two identical environments (blue = live, green = new). Deploy to green, test it, then switch traffic. If something breaks, switch back to blue instantly.
  • Canary Release: Roll out the new version to 5% of users first. If metrics look good, increase to 20%, then 50%, then 100%. If something breaks, only 5% of users were affected.

These patterns let you catch problems early without taking down everything.
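One common way to implement the 5% slice is deterministic hashing on a user ID, so the same user consistently sees the same version across requests. This is a sketch of the idea, not any specific feature-flag tool:

```python
import hashlib

def in_canary(user_id: str, percent: int) -> bool:
    """Deterministically place a stable `percent` slice of users on the
    canary. Hashing (rather than random choice) keeps each user pinned
    to one version."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100   # stable bucket in 0..99
    return bucket < percent

in_canary("user-42", 5) == in_canary("user-42", 5)   # → True (stable routing)
```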


8. Runbooks: The Human Response Plan

All the automation in the world cannot cover every scenario. You need a runbook — a documented set of steps that a human follows when a specific alert fires.

A good runbook answers:

  • What does this alert mean?
  • How urgent is it?
  • What are the first three things to check?
  • How do I escalate if I cannot fix it?
  • What commands should I run?

Example runbook entry:

Alert: High Memory Usage (> 90% for 10 minutes)

1. SSH into the affected server
2. Run: `ps aux --sort=-%mem | head -20` to find the top memory consumers
3. Check for memory leaks in the app logs: `grep "OutOfMemory" /var/log/app.log`
4. If a rogue process is found: restart the service
5. If memory does not recover: trigger a scale-out event
6. Escalate to on-call engineer if unresolved after 15 minutes

A runbook turns a stressful incident into a checklist. It is boring when things are calm and invaluable at 3am during an outage.


Putting It All Together: The Full Answer Framework

If an interviewer asks you this question, here is the structure of a great answer:

Detection (Observability):

  • I would monitor core metrics: CPU, memory, disk, network
  • I would track application metrics: latency, error rate, throughput
  • I would have health check endpoints (liveness and readiness)
  • I would use structured logging for traceability
  • I would set up smart alerts with meaningful thresholds

Response (Resilience):

  • Auto-scaling to handle load spikes
  • Circuit breakers to prevent cascading failures from downstream services
  • Graceful degradation to keep core features working
  • Caching to reduce repeated expensive work
  • Rate limiting to protect against traffic floods
  • Retry with exponential backoff for transient failures
  • Rollback strategies (blue-green, canary) for bad deployments
  • Runbooks so on-call engineers know exactly what to do

Real-World Tools Worth Knowing

| Category | Popular Tools |
| --- | --- |
| Metrics & Dashboards | Grafana, Datadog, Prometheus, AWS CloudWatch |
| Logging | ELK Stack (Elasticsearch, Logstash, Kibana), Loki, Splunk |
| Alerting | PagerDuty, OpsGenie, Grafana Alerts |
| APM (App Performance) | Datadog APM, New Relic, Sentry |
| Uptime Monitoring | UptimeRobot, Pingdom, Checkly |
| Circuit Breakers | Resilience4j, Opossum (Node.js), Polly (.NET) |
| Caching | Redis, Memcached, Varnish |
| Load Balancing / Scaling | AWS ALB + Auto Scaling, Kubernetes HPA, NGINX |

TL;DR

The interview question is asking you to think like a system that observes itself and heals itself.

  • Detect with metrics (CPU, memory, disk, network), application signals (latency, error rate), health checks, and structured logging.
  • Respond with auto-scaling, circuit breakers, graceful degradation, caching, rate limiting, smart retries, rollbacks, and runbooks.

The best systems do not just survive failures — they are designed to expect them and handle them without waking anyone up at 3am.


Found this useful? Drop a comment with the trickiest server question you have faced in an interview — I would love to hear it.

