Dmytro Kostenko

Originally published at Medium

Resilient Backend Architecture: From Implementation to Chaos Testing

Many web developers who’ve been in the industry for a while have experienced the same scenario at least once: the server suddenly crashes for no obvious reason, client applications become slow or unusable, users get frustrated, and the team scrambles to bring everything back to normal as quickly as possible.

In this article, I’ll walk you through a backend architecture with an HTTP interface that’s actually prepared for these kinds of failures. The goal is to make it observable, predictable, and resilient, even when core dependencies like the database start failing or responding slowly.

The article is structured around several implementation steps, followed by an architecture overview and a set of controlled load and chaos tests. I use NestJS as the framework of choice, but the concepts and patterns are not tied to any specific technology. You can replicate them anywhere.

If you want to explore the actual implementation, you can find the full working example in this repository.

All sequence diagrams are written in PlantUML and rendered via PlantText. The exact PlantUML source files can be found here, so you can easily feed them to your AI assistant to generate code in another language or framework.

Feel free to reuse any part of this project. Fork it, clone it privately, or take individual modules for your own systems. If it saves you time or helps you avoid a late-night outage someday, that’s already a win.

Now, without further ado, let’s start building this architecture step by step.

Step 1. Baseline API

To have something concrete to build on, I started with a simple REST API for managing TODO tasks. It’s intentionally minimal, just enough CRUD to anchor the resilience work, observability, and chaos scenarios that come later.

Endpoints

  1. GET /tasks — get all tasks
  2. POST /tasks — create a task
  3. DELETE /tasks — delete all tasks
  4. GET /tasks/:id — get one task
  5. PUT /tasks/:id — update a task
  6. DELETE /tasks/:id — delete one task

Model

enum TaskStatus {
  PENDING = 'pending',
  IN_PROGRESS = 'in_progress',
  COMPLETED = 'completed',
}

interface Task {
  id: string
  name: string
  status: TaskStatus
  createdAt: Date
  updatedAt: Date
}

Only name and status are editable.

The source code is here.

Step 2. Input Validation

One of the simplest ways an API can fail is when the client sends unexpected data and the server responds with a 500 instead of a proper 400.

For this part, I simply followed the official NestJS validation documentation and added a few DTOs to validate request bodies and route parameters.
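
For illustration, the request-body DTOs can be as small as this (a sketch using class-validator; TaskStatus is the enum from Step 1, and the exact rules live in the repository):

import { IsEnum, IsNotEmpty, IsOptional, IsString, MaxLength } from 'class-validator'

// TaskStatus is the enum defined in Step 1
export class CreateTaskDto {
  @IsString()
  @IsNotEmpty()
  @MaxLength(255)
  name: string
}

export class UpdateTaskDto {
  @IsString()
  @IsNotEmpty()
  @IsOptional()
  name?: string

  @IsEnum(TaskStatus)
  @IsOptional()
  status?: TaskStatus
}

// Registered globally, as in the NestJS docs:
// app.useGlobalPipes(new ValidationPipe({ whitelist: true }))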

The source code is here.

Step 3. Safe Retries

Now let’s think about the client for a moment. Write requests can get interrupted for all kinds of networking reasons, and when that happens the client has no idea whether the server actually processed the request or not. To avoid creating duplicates or half-applied changes, our write operations need to be safe to retry.

To support this, I added a request interceptor that deduplicates operations by using the Idempotency-Key header.

The spec recommends using this header for non-idempotent POST and PATCH requests, but since any write can have side effects when retried, I simply enabled it for everything except GET. For simplicity, I also skipped payload fingerprinting.
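
A minimal sketch of what such an interceptor might look like (the cache port is a placeholder injected at composition time, and payload fingerprinting is omitted here; the real implementation is linked below):

import { CallHandler, ExecutionContext, Injectable, NestInterceptor } from '@nestjs/common'
import { from, Observable, of, switchMap, tap } from 'rxjs'

interface ResponseCache {
  get(key: string): Promise<unknown | null>
  set(key: string, value: unknown): Promise<void>
}

@Injectable()
export class IdempotencyInterceptor implements NestInterceptor {
  constructor(private readonly cache: ResponseCache) {} // injected via a token in the real module

  intercept(context: ExecutionContext, next: CallHandler): Observable<unknown> {
    const request = context.switchToHttp().getRequest()
    const key = request.headers['idempotency-key']

    // Only writes need deduplication, and only when the client sent a key
    if (request.method === 'GET' || !key) return next.handle()

    return from(this.cache.get(`idempotency:${key}`)).pipe(
      switchMap((cached) =>
        cached !== null
          ? of(cached) // replay the stored response instead of re-running the handler
          : next.handle().pipe(tap((response) => this.cache.set(`idempotency:${key}`, response))),
      ),
    )
  }
}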

Safe retries flow:

Safe retries flow

The source code is here.

Step 4. Fallback Cache for Read-Path Outages

With validation in place and safe retries handled, the next problem is: the client wants to read data, but the database is temporarily unavailable.

Normally, GET responses should include proper caching headers so browsers or shared caches can reuse data without hitting the backend. But when the cached copy has already expired — or when the client never cached the response in the first place — the request comes back to us, and if the DB is down, we have nothing to return.

To soften this failure mode, I added a small fallback cache for the hot read endpoints. Every time a client reads or writes data, we update that cache using a fire-and-forget strategy so it doesn’t slow down the main request path. The database stays the source of truth, but if it’s unavailable, we can at least try to serve a cached response instead of failing immediately.
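
Sketched as a use case, the read path looks roughly like this (the port names are illustrative and Task is the interface from Step 1; the real module is linked below):

interface TaskRepositoryPort {
  findById(id: string): Promise<Task>
}

interface FallbackCachePort {
  get(key: string): Promise<Task | null>
  set(key: string, value: Task): Promise<void>
}

export class GetTaskUseCase {
  constructor(
    private readonly tasks: TaskRepositoryPort,
    private readonly cache: FallbackCachePort,
  ) {}

  async execute(id: string): Promise<Task> {
    try {
      const task = await this.tasks.findById(id)
      // Fire-and-forget: refresh the fallback copy without blocking the response
      void this.cache.set(`task:${id}`, task).catch(() => undefined)
      return task
    } catch (error) {
      // DB unavailable: serve the last known copy if we have one
      const cached = await this.cache.get(`task:${id}`)
      if (cached) return cached
      throw error
    }
  }
}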

Fallback cache flow:

Fallback cache flow

The source code is here.

Step 5. Message Queue for Write-Path Outages

At this point, the API can still serve some reads during a database outage thanks to the fallback cache. But write requests are a different story — if the DB is down, we can either fail immediately or defer the operation.

In practice, there are two options:

  1. Return 503 with a Retry-After header.
  2. Return 202 and defer the write.

The first option is simple, but the second can be much more user-friendly, especially when clients send large payloads that aren’t ideal to re-upload repeatedly.

To support deferred writes, we need:

  • a message broker
  • a worker process consuming a queue
  • a temporary store (cache) to track message states and results

Enqueuing the write

If the DB is unavailable when a write request arrives, the API publishes a message to the broker, stores the message state in cache, and returns a 202 with a polling location.
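
Sketched in code, that branch of the write path looks roughly like this (port names are placeholders, and the real controller also sets the 202 status code; the message and response shapes match the examples below):

import { randomUUID } from 'node:crypto'

interface QueuePublisher {
  publish(message: { key: string; payload: { id: string; data: unknown } }): Promise<void>
}

interface MessageStateStore {
  set(id: string, state: { status: string }): Promise<void>
}

export class DeferredWriteHandler {
  constructor(
    private readonly publisher: QueuePublisher,
    private readonly states: MessageStateStore,
  ) {}

  // Called when the repository reports the database as unavailable
  async defer(routingKey: string, data: unknown) {
    const messageId = randomUUID()

    await this.publisher.publish({ key: routingKey, payload: { id: messageId, data } })
    await this.states.set(messageId, { status: 'pending' })

    // Returned to the client with HTTP 202
    return {
      id: messageId,
      status: 'pending',
      location: `/tasks/queued/${messageId}`,
      retryAfter: 30,
    }
  }
}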

A message for POST /tasks looks like this:

{
    "key": "create",
    "payload": {
        "id": "481f060f-7458-4ae6-ba32-73103e5e1d31",
        "data": {
            "name": "Read a book"
        }
    }
}
  • key — routing key deciding which queue the message goes to
  • payload.id — message ID used by both the client and the worker; not related to the task yet
  • payload.data — original request body

Retry and delay settings depend on your needs. Example:

{
    "delayMs": 5000,
    "retryPolicy": {
        "maxRetries": 3,
        "baseDelayMs": 30000,
        "maxDelayMs": 120000
    }
}

A typical 202 response looks like this:

{
    "id": "481f060f-7458-4ae6-ba32-73103e5e1d31",
    "status": "pending",
    "location": "/tasks/queued/481f060f-7458-4ae6-ba32-73103e5e1d31",
    "retryAfter": 30
}
  • location — where the client polls the message state
  • retryAfter — suggested polling interval in seconds

Synchronous enqueue flow:

Synchronous enqueue flow

Worker processing

Once the initial delay expires, the worker receives the message. Based on the routing key, it performs the equivalent of the original write request.

The first step is to lock the message by setting its state to "in_progress" so no other consumer processes it.

  • On success, it writes to the DB, updates the state to "completed", stores the original API response, and acknowledges the message.
  • On failure, it republishes the message with exponential backoff, until retries run out.
  • If all retries fail, the message is marked "failed" and sent to a dead-letter queue (see the sketch after this list).
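
Putting these steps together, the consumer side might look roughly like this (a sketch with placeholder ports; the retry and backoff numbers mirror the example policy above):

interface QueueMessage {
  key: string
  attempt: number
  payload: { id: string; data: unknown }
}

interface MessageStateStore {
  set(id: string, state: { status: string; result?: unknown }): Promise<void>
}

export class CreateTaskConsumer {
  constructor(
    private readonly states: MessageStateStore,
    private readonly createTask: (data: unknown) => Promise<unknown>, // same use case the API calls
    private readonly republish: (msg: QueueMessage, delayMs: number) => Promise<void>,
    private readonly deadLetter: (msg: QueueMessage) => Promise<void>,
  ) {}

  async handle(msg: QueueMessage): Promise<void> {
    // Lock the message so no other consumer processes it
    await this.states.set(msg.payload.id, { status: 'in_progress' })

    try {
      const result = await this.createTask(msg.payload.data)
      await this.states.set(msg.payload.id, { status: 'completed', result })
      // the subscriber acknowledges the message after this returns
    } catch {
      if (msg.attempt < 3) {
        // Exponential backoff capped at maxDelayMs
        const delayMs = Math.min(30000 * 2 ** msg.attempt, 120000)
        await this.republish({ ...msg, attempt: msg.attempt + 1 }, delayMs)
      } else {
        await this.states.set(msg.payload.id, { status: 'failed' })
        await this.deadLetter(msg)
      }
    }
  }
}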

Async worker flow:

Async worker flow

Client polling

Meanwhile, the client polls the provided location endpoint at the suggested retryAfter interval until the message status becomes "completed" or "failed".
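
The polling endpoint itself is tiny. A sketch (the state store is the same one the enqueue path and the worker write to; its shape here is illustrative):

import { Controller, Get, NotFoundException, Param, ParseUUIDPipe } from '@nestjs/common'

interface MessageStateStore {
  get(id: string): Promise<{ id: string; status: string; result?: unknown } | null>
}

@Controller('tasks/queued')
export class QueuedTaskController {
  constructor(private readonly states: MessageStateStore) {}

  @Get(':id')
  async getState(@Param('id', ParseUUIDPipe) id: string) {
    const state = await this.states.get(id)
    // Unknown or expired IDs simply return 404
    if (!state) throw new NotFoundException()
    return state
  }
}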

Client polling flow:

Client polling flow

A successful result looks like this:

{
    "id": "481f060f-7458-4ae6-ba32-73103e5e1d31",
    "status": "completed",
    "result": {
        "id": "7fbccabe-0635-450d-92db-d0fe4829759a",
        "name": "Read a book",
        "status": "pending",
        "createdAt": "2025-11-26T20:45:34.713Z",
        "updatedAt": "2025-11-26T20:45:34.713Z"
    }
}
  • result — the response the client would have received from a successful POST /tasks

The client is responsible for integrating this flow into the UX so the outage is invisible to the end user.

Since the cache is used to store message states, we can assign a TTL to automatically clean up old entries. The TTL should match your business expectations around how long messages may take to settle.

The source code is here.

Step 6. Circuit Breaker for Distributed Systems

In distributed systems, dependency failures are usually the root cause of outages. To make our API behave predictably under these conditions, we can introduce a circuit breaker between the caller and every external dependency. The breaker helps in two ways:

  1. It fails remote operations fast when a dependency is slow or unavailable.
  2. It prevents overloaded dependencies from getting hammered even more, giving them a chance to recover while the client receives a quick response.

A circuit breaker is essentially a proxy that decides whether a given operation should run or be rejected based on recent metrics.

Circuit breaker flow:

Circuit breaker flow

You can implement the breaker yourself or use an existing library like opossum, which already provides a wide range of options.

Breaker settings should be tuned per dependency (DB, cache, message broker, etc.) based on real metrics — more on that in the next section. Example configuration:

private readonly breakerOptions: CircuitBreaker.Options<[operation: () => Promise<unknown>]> = {
  timeout: 3000,
  errorThresholdPercentage: 50,
  resetTimeout: 30000,
  rollingCountTimeout: 30000,
  rollingCountBuckets: 10,
  volumeThreshold: 10,
  allowWarmUp: true,
}
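
The CircuitBreakerService used below is just a thin wrapper around one opossum instance. A possible sketch (logging and naming are illustrative; the real service also exports its state as a metric, see Step 7):

import CircuitBreaker from 'opossum'

// One instance per dependency (Postgres, Redis, RabbitMQ publisher, ...)
export class CircuitBreakerService {
  private readonly breaker: CircuitBreaker<[operation: () => Promise<unknown>], unknown>

  constructor(name: string, options: CircuitBreaker.Options<[operation: () => Promise<unknown>]>) {
    // The breaker's action simply runs whatever operation the caller passes in
    this.breaker = new CircuitBreaker((operation: () => Promise<unknown>) => operation(), {
      ...options,
      name,
    })

    this.breaker.on('open', () => console.warn(`[${name}] circuit opened`))
    this.breaker.on('halfOpen', () => console.warn(`[${name}] circuit half-open`))
    this.breaker.on('close', () => console.info(`[${name}] circuit closed`))
  }

  fire<T>(operation: () => Promise<T>): Promise<T> {
    return this.breaker.fire(operation) as Promise<T>
  }
}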

Wrapping dependencies

The breaker doesn’t change the API surface — callers should not know it exists. You simply wrap your dependency instance at injection time.

  1. Wrapper using a Proxy:
export function withCircuitBreaker<T extends object>(breaker: CircuitBreakerService, repository: T): T {
  return new Proxy(repository, {
    get(target, prop, receiver) {
      const value = Reflect.get(target, prop, receiver)
      if (typeof value !== 'function') return value

      return (...args: unknown[]) => breaker.fire(() => value.apply(target, args))
    },
  })
}
  2. NestJS provider that wires the breaker into a port:
{
  provide: TASK_REPOSITORY,
  useFactory: (breaker: CircuitBreakerService, repository: TaskRepository) =>
    withCircuitBreaker(breaker, repository),
  inject: [CircuitBreakerService, PostgresTaskRepository],
}

// CircuitBreakerService – one instance per dependency
  3. Caller remains unaware:
@Inject(TASK_REPOSITORY)
private readonly taskRepository: TaskRepository

I added a circuit breaker around the following dependencies:

  • PostgreSQL
  • Redis
  • RabbitMQ (publisher)

You can introduce the breaker at any point in your system. I’m adding it now so the architecture is fully assembled before we start collecting metrics and running tests. It doesn’t change any of the previous sequence diagrams — the only difference is that dependency calls may fail faster depending on the breaker state.

The source code is here.

Step 7. Telemetry

Now that the API is wrapped with several resilience layers, we can add the final piece: observability. Without proper telemetry, it’s impossible to understand how the system behaves under failures or load, and even harder to debug issues when something goes wrong.

For this project, I’m using OpenTelemetry. It has wide language support, solid auto-instrumentation, and integrates easily with the tools we need.

To complete the observability setup, I added:

  1. Prometheus — collects metrics
  2. Grafana — visualizes metrics
  3. Jaeger — collects distributed traces
  4. OpenTelemetry Collector — receives data and exports traces to Jaeger

The full docker-compose.yml with all services is here.

In the application, telemetry is initialized before the app boots, so all startup operations and dependencies are captured. The init script is here.
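
A minimal sketch of such an init file, assuming traces are shipped to the collector over OTLP and Prometheus scrapes the app directly (endpoints, ports, and the service name are assumptions; the actual script is linked above):

import { NodeSDK } from '@opentelemetry/sdk-node'
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node'
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc'
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus'

const sdk = new NodeSDK({
  serviceName: 'tasks-api',
  traceExporter: new OTLPTraceExporter({ url: 'http://otel-collector:4317' }),
  metricReader: new PrometheusExporter({ port: 9464 }),
  instrumentations: [getNodeAutoInstrumentations()],
})

// Imported at the very top of the entry point, before NestFactory.create()
sdk.start()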

Once everything is running, we get metrics such as:

  • service up/down state
  • event loop delay and utilization
  • garbage collection duration by kind
  • heap memory usage
  • HTTP request duration and count per endpoint
  • DB query duration, count, and connection usage
  • and many others

I also added custom metrics (see the sketch after the list) for:

  • circuit breaker state — open/closed/half-open
  • user-facing cache hits/misses
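
With the OpenTelemetry metrics API, both can be registered in a few lines (a sketch; the metric names and the getBreakerState helper are illustrative):

import { metrics } from '@opentelemetry/api'

const meter = metrics.getMeter('resilience')

// Incremented by the fallback-cache adapter on every read
export const cacheHits = meter.createCounter('fallback_cache_hits_total', {
  description: 'Reads served from the user-facing fallback cache',
})
export const cacheMisses = meter.createCounter('fallback_cache_misses_total', {
  description: 'Reads that missed the user-facing fallback cache',
})

// Observed periodically: 0 = closed, 1 = half-open, 2 = open
const breakerState = meter.createObservableGauge('circuit_breaker_state', {
  description: 'Current circuit breaker state per dependency',
})
breakerState.addCallback((result) => {
  result.observe(getBreakerState('postgres'), { dependency: 'postgres' }) // hypothetical helper
})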

Additionally, I added separate Prometheus exporters for:

RabbitMQ

  • up/down
  • connections, channels
  • delivered / acknowledged / rejected messages
  • per-queue metrics

Redis

  • up/down
  • client count, key count, ops/sec
  • command latency
  • global cache hits/misses

Prometheus config is here.

Grafana dashboards and alert rules are here — you can explore all of them by following the setup instructions in the repository. I’ll show one of the dashboards in the next section.

To improve debugging, I configured full end-to-end tracing with Jaeger. For example, here’s the message queue flow from Step 5:

Sync flow Jaeger spans

You can see the original request failing due to a DB outage, followed by two retries from the worker, and then the successful write. These traces come from two different processes (API and worker), but Jaeger automatically stitches them into a single flow — no custom code required.

And here’s the client polling the queued state:

Async flow Jaeger spans

Expanding a span shows more details:

Jaeger span

By default, the OTel Collector captures everything, but sending full traces slows down the system. To keep overhead minimal, I configured tail-based sampling to always keep 5xx errors and 1% of other traces:

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
    send_batch_max_size: 2048

  tail_sampling:
    decision_wait: 10s
    expected_new_traces_per_sec: 900 # expected RPS
    num_traces: 18000 # RPS * 2 * wait
    policies:
      - name: keep_5xx
        type: numeric_attribute
        numeric_attribute:
          key: http.status_code
          min_value: 500
          max_value: 599
      - name: keep_1_percent
        type: probabilistic
        probabilistic:
          sampling_percentage: 1

The full configuration is here.

Architecture Overview

A resilient system starts with a solid architecture. It reduces blast radius when things fail, makes reliability patterns reusable, and lets you swap infrastructure pieces (DB, broker, cache, telemetry) without rewriting core rules. Without structure, retries, fallbacks, and observability tend to leak everywhere and become impossible to maintain.

This project uses a variation of Clean Architecture (inspired by Robert C. Martin’s book), where all dependencies point inward. Each inner layer knows nothing about the outer ones, and outer layers depend on inner layers via explicit ports (interfaces). A single composition root wires everything together.

Clean Architecture

Domain (entities & rules)

Contains the core entities and the repository port that defines the operations available for each entity. This layer is framework-agnostic.
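
For example, the task repository port is nothing more than an interface over the Task entity from Step 1 (method names here are illustrative):

export interface TaskRepository {
  findAll(): Promise<Task[]>
  findById(id: string): Promise<Task | null>
  create(data: Pick<Task, 'name'>): Promise<Task>
  update(id: string, data: Partial<Pick<Task, 'name' | 'status'>>): Promise<Task>
  delete(id: string): Promise<void>
  deleteAll(): Promise<void>
}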

Application (use cases)

Contains orchestration logic: interacting with repositories and cache. This layer is framework-agnostic.

Interface (presenters/controllers)

Contains everything that exposes our use cases to the outside world. In this project, that includes HTTP controllers and message publishers & subscribers — both act as entry points that translate external inputs into application-level calls.

Because the inner layers don’t depend on any delivery mechanism, this interface can take many forms:

  • HTTP REST/GraphQL
  • WebSockets
  • Desktop/Web application
  • or any future transport

All of these are simply different ways of invoking the same business rules. Swapping or extending them does not require changing a single line in the Application or Domain layers.

Infrastructure (adapters for external dependencies)

This layer contains the concrete implementations of all ports defined in the inner layers. Anything that talks to the outside world lives here: PostgreSQL repositories, Redis cache adapter, RabbitMQ publisher/subscriber, circuit breaker integration, telemetry exporters, etc.

Because the Application and Domain layers depend only on ports, none of these implementations are permanent. We can replace PostgreSQL with MongoDB, Redis with Memcached, or RabbitMQ with Kafka without touching business logic. Even the circuit breaker and telemetry stack can be swapped or extended without affecting any use case or entity.

The Infrastructure layer knows how to communicate with external services, but it doesn’t know why. All decision-making stays inside the Application layer.

Composition Root

The only place that knows about everything. It chooses the concrete adapters, initializes telemetry, attaches circuit breakers, and wires dependencies.

The final component diagram looks like this:

Component diagram

Testing

With the architecture in place, it’s time to see how it behaves under real pressure.

When engineers talk about “tests”, they usually mean unit, integration, or end-to-end testing. Those are essential, but they don’t tell you anything about resilience. This article is about surviving outages, so we’ll skip straight to load tests and controlled chaos scenarios.

For load, we use Grafana k6.
For chaos, we either stop services or inject latency via Toxiproxy.

All tests run in Docker on a MacBook Pro M1 with 8 vCPUs and 8 GB RAM shared by all containers.

The load phase lasts 5 minutes with 100 virtual users generating ~1000 requests/sec. Each VU can create up to 25 entities, giving us a predictable DB size and a consistent read/write pattern.
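
The k6 setup looks roughly like the sketch below; the real script, with the full request mix and per-VU entity tracking, lives in the repository:

import http from 'k6/http'
import { check, sleep } from 'k6'

export const options = {
  scenarios: {
    read: { executor: 'constant-vus', vus: 50, duration: '5m', exec: 'read' },
    write: { executor: 'constant-vus', vus: 50, duration: '5m', exec: 'write' },
  },
}

const BASE_URL = __ENV.BASE_URL || 'http://localhost:3000'

export function write() {
  const res = http.post(`${BASE_URL}/tasks`, JSON.stringify({ name: `task ${__VU}-${__ITER}` }), {
    headers: {
      'Content-Type': 'application/json',
      'Idempotency-Key': `${__VU}-${__ITER}`, // unique per VU iteration
    },
  })
  check(res, { 'write accepted': (r) => r.status === 201 || r.status === 202 })
  sleep(0.1)
}

export function read() {
  // The real script reads IDs this VU created earlier; simplified here
  const res = http.get(`${BASE_URL}/tasks/${pickCreatedTaskId()}`) // hypothetical helper
  check(res, { 'read ok': (r) => r.status === 200 })
  sleep(0.1)
}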

Request distribution:

  • POST /tasks — 15%
  • GET /tasks/:id — 50%
  • PUT /tasks/:id — 20%
  • DELETE /tasks/:id — 15%

(GET /tasks is ignored because the API doesn’t support per-user lists.)

First, let’s look at the baseline when all dependencies are healthy.

Baseline

k6 load baseline

k6 output:

http_queued_req_duration.......: avg=0s      min=0s       med=0s       max=0s       p(90)=0s       p(95)=0s      
http_queued_req_failed.........: 0.00%  0 out of 0

http_req_duration..............: avg=16.81ms min=745.08µs med=12.21ms  max=838.54ms p(90)=35.1ms   p(95)=44.31ms 
  { scenario:read }............: avg=8.11ms  min=745.08µs med=6.96ms   max=830ms    p(90)=14.01ms  p(95)=17.39ms 
  { scenario:write }...........: avg=24.78ms min=1.82ms   med=21.9ms   max=838.54ms p(90)=43.33ms  p(95)=52.58ms 
http_req_failed................: 0.00%  0 out of 320641
  { scenario:read }............: 0.00%  0 out of 153346
  { scenario:write }...........: 0.00%  0 out of 167293
http_reqs......................: 320641 1068.318129/s

Everything succeeds. As expected, the fallback cache and message queue never activate.

Now let’s start breaking things.

Scenario 1: PostgreSQL down for 1 minute

k6 load with PostgreSQL down

k6 output:

http_queued_req_duration.......: avg=1m48s    min=1m7s     med=1m25s   max=3m2s  p(90)=2m45s   p(95)=2m51s   
http_queued_req_failed.........: 0.00%  0 out of 70
http_queued_reqs...............: 70     0.227933/s


http_req_duration..............: avg=11.94ms  min=36.87µs  med=6.79ms  max=1.17s p(90)=27.6ms  p(95)=36.72ms 
  { scenario:read }............: avg=5.7ms    min=642.75µs med=3.39ms  max=1.08s p(90)=10.85ms p(95)=13.98ms 
  { scenario:write }...........: avg=20.97ms  min=36.87µs  med=17.06ms max=1.17s p(90)=38.53ms p(95)=47.78ms 
http_req_failed................: 0.00%  0 out of 270681
  { scenario:read }............: 0.00%  0 out of 160154
  { scenario:write }...........: 0.00%  0 out of 110525
http_reqs......................: 270681 881.386655/s

What happens:

  • The PostgreSQL circuit opens.
  • Reads keep working because fallback cache serves them.
  • Writes are deferred — they get enqueued, and the worker processes them once Postgres recovers.
  • Virtual users respect the suggested 202 polling interval, so write RPS naturally drops.

In a real system, cache hit ratios would be lower — but with only 25 entities per VU, everyone hits cache reliably.

Scenario 2: Redis down for 1 minute

k6 load with Redis down

k6 output:

http_queued_req_duration.......: avg=0s       min=0s       med=0s      max=0s       p(90)=0s       p(95)=0s      
http_queued_req_failed.........: 0.00%  0 out of 0

http_req_duration..............: avg=13.85ms  min=362.41µs med=9.11ms  max=3.11s    p(90)=30.17ms  p(95)=39.24ms 
  { scenario:read }............: avg=6.28ms   min=726.66µs med=4.89ms  max=844.76ms p(90)=11.9ms   p(95)=15.04ms 
  { scenario:write }...........: avg=22.79ms  min=362.41µs med=19.11ms max=3.11s    p(90)=39.71ms  p(95)=48.94ms 
http_req_failed................: 0.70%  2045 out of 291874
  { scenario:read }............: 0.00%  0 out of 158010
  { scenario:write }...........: 1.52%  2045 out of 133862
http_reqs......................: 291874 972.431145/s

What happens:

  • The Redis circuit opens.
  • Reads still succeed because of the DB-first strategy.
  • Writes fail with 503 because idempotency cannot be guaranteed without Redis, and accepting write requests without it would risk corruption or duplication.
  • Virtual users respect the Retry-After header, so write RPS naturally drops.

Scenario 3: PostgreSQL and RabbitMQ down for 1 minute

k6 load with PostgreSQL and RabbitMQ down

k6 output:

http_queued_req_duration.......: avg=0s       min=0s       med=0s      max=0s       p(90)=0s       p(95)=0s      
http_queued_req_failed.........: 0.00%  0 out of 0

http_req_duration..............: avg=15.94ms  min=629.95µs med=10.4ms  max=2.21s p(90)=34.16ms  p(95)=43.32ms 
  { scenario:read }............: avg=7.39ms   min=648.62µs med=5.66ms  max=1.13s p(90)=13.19ms  p(95)=16.47ms 
  { scenario:write }...........: avg=26.09ms  min=629.95µs med=22.26ms max=2.21s p(90)=43.79ms  p(95)=53.06ms 
http_req_failed................: 0.72%  2082 out of 285945
  { scenario:read }............: 0.00%  0 out of 155091
  { scenario:write }...........: 1.59%  2082 out of 130852
http_reqs......................: 285945 952.711964/s

What happens:

  • Both Postgres and RabbitMQ circuits open.
  • You see a short latency spike caused by amqplib’s reconnect attempts, which are cut off by the circuit timeout.
  • Reads are still fine because of the fallback cache.
  • Writes fail because with both DB and message queue unavailable, the system has no safe place to store the operation.
  • After Postgres recovers, its breaker closes.
  • RabbitMQ stays in half-open mode because no messages need to be published when the DB is healthy.

Scenario 4: Everything down

All breakers open after hitting the error threshold, and the system fails fast, with occasional probes to check if any dependency has recovered.

This is the correct outcome: no retries, no thread starvation, no cascading failures — just clean fast-fail behavior.

All outages shown here were triggered by stopping containers to make Grafana’s up/down charts easier to read. You can also reproduce them by injecting latency: the circuit will gather enough failure data and open automatically, stabilizing the system in the same way you saw in the RabbitMQ case.

More chaos scenarios are available here.

Conclusion

With safe retries, a fallback cache, a message queue, circuit breakers, and solid telemetry, the system behaves much more predictably when things go wrong. The chaos tests simply confirmed that each piece pulls its weight under pressure.

Still, this is just one way to approach backend resilience. I’m sure there are angles I didn’t explore or places that could be improved. If you have ideas, different experiences, or suggestions, I’d genuinely love to hear them.

Thanks for reading!
