DEV Community

c0d3l0v3r
c0d3l0v3r

Posted on • Originally published at Medium

I Doubled My Servers and Wrecked My Availability

Cover Image

“Reliability is the most important feature of any system.” — Bill Gates

Why does your bank never show the wrong balance, but your favorite app sometimes takes a few seconds to update? Why do some systems feel fast but crash under load, while others feel slower but never fail?

Performance answers: “How fast is the system right now?”
Scalability answers: “What happens when demand increases?”

These questions point to a fundamental tension in system design: performance vs scalability.

A system can improve performance metrics like throughput and latency, yet fail to scale under load. This article demonstrates exactly that.

In this article, we’ll explore this trade-off through a simple analogy, and then connect it to real-world systems using measurable metrics like latency, throughput, and error rate.

The Burger Shop Analogy

Imagine a small burger shop run by a single worker, let’s call them A.

Let’s assume the following things:

  • Burgers are prepared automatically by machines, so A only handles taking and serving orders.
  • Finished burgers are placed in a shared box, from which A picks them up and delivers them to customers.

We intentionally ignore preparation time and assume burgers are instantly ready. This allows us to focus purely on how requests are handled, rather than how they are produced.

The Single Worker System

In this setup, a single worker handles all incoming requests.

When you enter the shop, you join a single queue and wait your turn before placing an order. Every request is processed sequentially (one at a time).

A few key observations:

  • Single queue, sequential processing

    With only one line, every customer must wait for those ahead of them. This results in higher waiting time, which translates to high latency and low throughput.

  • No contention

    Since only one request is handled at a time, there is no competition for shared resources. The system operates without conflicts or synchronization issues.

  • Efficient resource usage

    Resources are used strictly on demand. Each burger is prepared only when needed, with no excess consumption or overhead.

What does this imply?

  • High latency → Each request experiences delay due to queuing
  • Low throughput → Only one request is processed at any given time
  • No contention → No conflicts between concurrent operations

Despite its limitations, this system is remarkably stable.

There are no race conditions, no lost work, and no unexpected failures. Every request is handled in a predictable and controlled manner.

Slow, but predictable.

Scaling the System

So far, the limitation has been clear: a single worker handling all requests.

To improve this, we introduce more machines. Instead of relying on a single machine, worker A can now distribute incoming orders across multiple machines.

Let’s examine what changes:

  1. Parallel processing

    With multiple machines available, several orders can be handled simultaneously. This increases the number of requests processed per unit time, leading to higher throughput and reduced latency.

  2. Decision overhead (routing)

    Worker A must now decide which machine should handle each request. This introduces a small coordination overhead — similar to how a load balancer (e.g., Nginx) routes requests in real systems.

    For simplicity, we will ignore this overhead in our analysis.

What happens to the metrics?

  • Multiple orders can be processed simultaneously.
  • Waiting time decreases
  • Throughput increases

This approach is known as horizontal scaling, increasing system capacity by adding more resources rather than making a single component more powerful.

The Hidden Problem

So far, scaling seems like a clear improvement.

But consider the shared burger box , the place where all machines deposit finished burgers.

What happens if multiple machines try to place burgers into the box at the same time?

This introduces two critical failure scenarios:

  1. Lost Work

    One burger makes it into the box, but another falls to the ground.

    That burger is wasted, and the customer never receives their order.

    In system terms, this is a failed request.

  2. System Blockage

    Both burgers get stuck at the entrance of the box, blocking it completely.

    Now no further burgers can be placed.

    In system terms, this resembles contention or resource locking, where the system slows down or stops entirely.

What Changed ?

By introducing multiple machines, we improved performance:

  1. Higher throughput, lower latency

    Multiple burgers can now be prepared simultaneously. This increases throughput and reduces waiting time for customers, improving overall performance.

  2. Increased coordination and state management

    With two machines handling requests, worker A must decide where each order goes. Both machines can process requests, so their state must be tracked and logs must be maintained. This introduces coordination overhead.

  3. Inefficiency due to lost work

    Some burgers may fall to the ground before reaching the box. These orders are wasted, and the customer never receives their burger. In system terms, this represents failed or lost requests.

  4. Unpredictable latency

    Consider a customer waiting for an order that never arrives. If they leave after 30 seconds, that request now has very high latency. Even if other customers are served quickly, overall latency becomes inconsistent and harder to predict.This is the core trade-off.

In system terms, we have introduced:

  • Contention for shared resources
  • Increased system complexity
  • Possibility of failure

The Core Trade-off

This is the fundamental trade-off in system design.

Some customers are served faster, but others may never get their order.

Performance

Performance refers to how effectively a system executes tasks within a given time frame. It is typically evaluated in terms of speed, responsiveness, and how efficiently system resources are utilized under load.

To understand and measure performance, we rely on a few key metrics:

Latency

Latency is the time it takes for a system to respond to a request.

In simpler terms, it measures how long a user has to wait after initiating an action.

Throughput

Throughput represents the number of requests or transactions a system can handle per unit of time.

It reflects the system’s capacity to process workload at scale.

Resource Utilization

Resource utilization measures how much of the system’s resources, such as CPU, memory, and network bandwidth are consumed while handling requests.

Efficient systems aim to maximize output while minimizing unnecessary resource consumption.

Efficiency

Efficiency describes how well a system converts resources into useful work.

A highly efficient system achieves high throughput and low latency while using minimal resources.

Scalability

Scalability refers to a system’s ability to handle increasing load by proportionally utilizing additional resources, while maintaining acceptable performance.

In other words, a scalable system does not degrade significantly as the number of users or requests grows, it adapts by efficiently distributing the workload.

To evaluate scalability, we observe how key performance metrics behave under increasing load, including:

  • Latency under load
  • Throughput under load
  • Resource utilization
  • Error rates (failures, timeouts)
  • Number of concurrent users supported

A system is considered scalable if these metrics remain stable or degrade gracefully as demand increases.

Availability

Availability is the measure of a system’s ability to respond to requests within an acceptable time frame.

A system is considered available if it is operational and capable of returning responses, even under load. High availability systems are designed to minimize downtime and ensure that users can reliably access the service.

“A system can be highly performant but not scalable. for example, a single powerful server may respond quickly, but fail when user load increases.”

The Monolith vs. The Load-Balanced Fleet (A vs 2A’s)

To move beyond theory, we need to observe how a system behaves when pushed to its absolute limits. To test the relationship between compute scaling, database bottlenecks, and overall availability, an A/B load test was constructed.

The code for the experiments can be found in attached github repository:


Assumptions

Before running the load test, the architecture was built on three core hypotheses:

  • Linear Scalability: Doubling the compute resources (adding a second application server) will linearly increase the system’s maximum throughput.
  • Testing Default Database Limits: We hypothesized that the default connection pool of a single MongoDB instance would be sufficient to handle the concurrent load of doubled compute resources.
  • The Load Balancer Penalty is Negligible: The network hop introduced by routing traffic through a reverse proxy will be outweighed by the massive gains in request processing speed.

Setup Methodology

The experiment was conducted locally using Docker to containerize and isolate the environments. The testing tool, k6, was configured to inject a stepped load simulating aggressive traffic spikes: starting at 100 concurrent virtual users and ramping up to 2,000 users over an 80-second window.

The workload itself was a mixed Read/Write operation. Every request forced the application to write a new document to the database and immediately read a document back, simulating a standard transactional endpoint.

We tested three distinct architectures:

  • Architecture A (The Baseline): A single Node.js/Express application connected directly to a standalone MongoDB instance. This represents a traditional, unscaled monolith.
  • Architecture 2A (The Distributed Compute): Two identical Node.js/Express applications sitting behind an Nginx load balancer. Nginx utilized a round-robin strategy to distribute incoming k6 traffic across both nodes. Crucially, both application nodes were wired to the exact same single MongoDB instance.
const express = require("express");
const mongoose = require("mongoose");
const os = require("os");

const app = express();


mongoose.connect("mongodb://mongo:27017/testdb", {
  maxPoolSize: 50,
});

const TestSchema = new mongoose.Schema({
  value: Number,
});

const Test = mongoose.model("Test", TestSchema);

app.get("/", async (req, res) => {
  const start = Date.now();

  try {
    await Test.create({ value: Math.random() });
    await Test.findOne();

    const latency = Date.now() - start;

    res.json({
      latency,
      server: os.hostname(), 
    });
  } catch (err) {
    res.status(500).json({ error: "DB error" });
  }
});

app.listen(3000, () => console.log(`Server running on ${os.hostname()}`));
Enter fullscreen mode Exit fullscreen mode

Results

The k6 load test yielded drastically different saturation points for the two architectures.

These reports are present as the .html files in the github respository.

Architecture A (Single Node Baseline):

The latency trade-off. With a single Node.js event loop acting as a natural throttle, the p95 latency sat at ~796ms. The system was relatively slow under heavy load, but perfectly stable.

The latency trade-off. With a single Node.js event loop acting as a natural throttle, the p95 latency sat at ~796ms. The system was relatively slow under heavy load, but perfectly stable.

Baseline throughput. The single-node architecture safely processed a total of 176,584 requests, hitting a hard bottleneck at roughly 2,193 requests per second.

Baseline throughput. The single-node architecture safely processed a total of 176,584 requests, hitting a hard bottleneck at roughly 2,193 requests per second.

_100% Availability. Because the compute layer could not overwhelm the database connection pool, every single read/write transaction was successfully completed without a single dropped request._

100% Availability. Because the compute layer could not overwhelm the database connection pool, every single read/write transaction was successfully completed without a single dropped request.

  • Total Requests Processed: 176,584
  • Throughput: 2,193 requests/second
  • Latency (p95): ~796 ms
  • Failure Rate: 0.00% (100% Success)

Architecture 2A (Distributed Compute):

The smoking gun. While the p95 latency for successful requests actually dropped to ~492ms, the maximum latency spiked to an unusable 36 seconds. The explicit HTTP 500 errors confirm the single MongoDB instance was pushed into connection exhaustion.

The smoking gun. While the p95 latency for successful requests actually dropped to ~492ms, the maximum latency spiked to an unusable 36 seconds. The explicit HTTP 500 errors confirm the single MongoDB instance was pushed into connection exhaustion.

The illusion of scale. Adding a load balancer and a second application node successfully pushed our throughput to over 3,600 requests per second, processing 350,261 total requests.

The illusion of scale. Adding a load balancer and a second application node successfully pushed our throughput to over 3,600 requests per second, processing 350,261 total requests.

The breaking point. Scaling the compute layer while leaving the data layer untouched resulted in a massive 26.5% failure rate, actively degrading the system’s availability.

The breaking point. Scaling the compute layer while leaving the data layer untouched resulted in a massive 26.5% failure rate, actively degrading the system’s availability.

  • Total Requests Processed: 350,261
  • Throughput: 3,638 requests/second
  • Latency (p95): ~492 ms (for successful requests)
  • Failure Rate: 26.57% (93,069 failed requests, including explicit HTTP 500 database errors and massive timeouts spiking up to 36 seconds).

Discussion

The experimental data explicitly demonstrates the systemic risks of isolated horizontal scaling. While adding compute resources improved specific performance metrics, it fundamentally compromised system reliability, proving that scaling a single tier does not equate to scaling the system as a whole.

1. The Shifting Bottleneck and Resource Exhaustion The baseline load test (Architecture A) established a clear system ceiling: a single Node.js instance capped throughput at 2,193 requests per second with a p95 latency of ~796ms. Crucially, the failure rate was 0.00%. In this configuration, the compute layer acted as a strict upstream bottleneck. Because a single server could not process requests fast enough to overwhelm the database’s connection pool (configured to a maximum of 50 connections), the database was naturally shielded from resource exhaustion. The system was relatively slow under peak load, but highly stable.

By introducing a load balancer and a second application node (Architecture 2A), the compute-layer bottleneck was removed. The system accepted more concurrent traffic, driving throughput up by roughly 65% (to 3,638 req/sec). However, because both nodes funneled this doubled traffic into the single, unscaled MongoDB instance, the bottleneck immediately shifted to the data layer. The sudden influx of concurrent requests exhausted the database connection pool, leading to connection lockups and explicit HTTP 500 errors.

2. The False Positive of “Improved” Performance A superficial reading of Architecture 2A’s metrics might suggest a performance win: the p95 latency for successful requests actually dropped by ~38% (from 796ms down to 492ms), and total processed requests nearly doubled.

However, a rigorous analysis reveals this “scale” is an illusion. The data shows a catastrophic failure rate of 26.57% (93,069 failed requests). The reduction in p95 latency for successful requests is effectively a survivorship bias metric; it only accounts for the requests that managed to secure a database connection. Meanwhile, the maximum latency for blocked requests spiked to an unusable 36 seconds. The experiment proves that forcing high throughput at the compute layer without matching capacity at the data layer destroys system availability.

3. The Inadequacy of Isolated Scaling The results strictly invalidate the hypothesis that doubling compute resources linearly increases maximum throughput safely. Instead, it proves that architectural scaling must be holistic.

When Architecture 2A pushed the database beyond its concurrency limits, the system did not degrade gracefully — it fractured, dropping more than a quarter of all user requests.

Conclusion

The data validates a core principle of distributed systems:

throughput and availability are often inversely correlated when shared resources are unscaled. Architecture A was unscalable but highly available (100% success). Architecture 2A was highly performant for a subset of users but critically unavailable for the rest (26.57% failure).

To successfully scale this system beyond the 2,193 req/sec ceiling of Architecture A without incurring the 26% failure penalty of Architecture 2A, the architecture must evolve to protect the database. Based strictly on these failure modes, future iterations must either synchronously scale the data layer (e.g., database sharding/replicas to handle the higher connection volume) or implement an asynchronous buffer (e.g., a message queue) to decouple the request ingestion rate from the database write rate.

Acknowledgements & References

The experiments and conclusions drawn in this article were directly informed by studying foundational distributed systems literature. Translating these concepts from theory into a live, containerized k6 experiment was made possible by the following resources:

Performance, Scalability & Throughput

The CAP Theorem & Consistency Patterns

High Availability, Failover & Database Replication

Top comments (0)