Many web developers who’ve been in the industry for a while have experienced the same scenario at least once: the server suddenly crashes for no obvious reason, client applications become slow or unusable, users get frustrated, and the team is working hard to bring everything back to normal as quickly as possible.
In this article, I’ll walk you through building a backend architecture with an HTTP interface that is actually prepared for these kinds of failures. The goal is to make it observable, predictable, and resilient, even when core dependencies like the database start failing or responding slowly.
The article is structured around several implementation steps, followed by an architecture overview and a set of controlled load and chaos tests. I use NestJS as the framework of choice, but the concepts and patterns are not tied to any specific technology. You can replicate them anywhere.
If you want to explore the actual implementation, you can find the full working example in this repository.
All sequence diagrams are written in PlantUML and rendered via PlantText. The exact PlantUML source files can be found here, so you can easily feed them to your AI assistant to generate code in another language or framework.
Feel free to reuse any part of this project. Fork it, clone it privately, or take individual modules for your own systems. If it saves you time or helps you avoid a late-night outage someday, that’s already a win.
Now, without further ado, let’s start building this architecture step by step.
Step 1. Baseline API
To have something concrete to build on, I started with a simple REST API for managing TODO tasks. It’s intentionally minimal, just enough CRUD to anchor the resilience work, observability, and chaos scenarios that come later.
Endpoints
- GET /tasks — get all tasks
- POST /tasks — create a task
- DELETE /tasks — delete all tasks
- GET /tasks/:id — get one task
- PUT /tasks/:id — update a task
- DELETE /tasks/:id — delete one task
Model
enum TaskStatus {
PENDING = 'pending',
IN_PROGRESS = 'in_progress',
COMPLETED = 'completed',
}
interface Task {
id: string
name: string
status: TaskStatus
createdAt: Date
updatedAt: Date
}
Only name and status are editable.
The source code is here.
Step 2. Input Validation
One of the simplest ways an API can fail is when the client sends unexpected data and the server responds with a 500 instead of a proper 400.
For this part, I simply followed the official NestJS validation documentation and added a few DTOs to validate request bodies and route parameters.
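To make this concrete, here is a rough sketch of what such DTOs can look like with class-validator, assuming a global ValidationPipe; the exact class and field names in the repository may differ:

import { IsEnum, IsNotEmpty, IsOptional, IsString, IsUUID } from 'class-validator'

import { TaskStatus } from './task.model' // illustrative path to the model from Step 1

export class CreateTaskDto {
  @IsString()
  @IsNotEmpty()
  name!: string
}

export class UpdateTaskDto {
  @IsOptional()
  @IsString()
  @IsNotEmpty()
  name?: string

  @IsOptional()
  @IsEnum(TaskStatus)
  status?: TaskStatus
}

export class TaskIdParamDto {
  @IsUUID()
  id!: string
}

With app.useGlobalPipes(new ValidationPipe({ whitelist: true, transform: true })) in the bootstrap code, malformed bodies and route parameters are rejected with a 400 before they ever reach a use case.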
The source code is here.
Step 3. Safe Retries
Now let’s think about the client for a moment. Write requests can get interrupted for all kinds of networking reasons, and when that happens the client has no idea whether the server actually processed the request or not. To avoid creating duplicates or half-applied changes, our write operations need to be safe to retry.
To support this, I added a request interceptor that deduplicates operations by using the Idempotency-Key header.
The spec recommends using this header for non-idempotent POST and PATCH requests, but since any write can have side effects when retried, I simply enabled it for everything except GET. For simplicity, I also skipped payload fingerprinting.
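Here is a simplified sketch of the idea, assuming a generic key-value store port; it ignores concurrent in-flight requests and response status codes, which a real interceptor has to handle:

import { CallHandler, ExecutionContext, Injectable, NestInterceptor } from '@nestjs/common'
import { Observable, from, of, switchMap, tap } from 'rxjs'

// Stand-in for whatever cache port the project uses (e.g. backed by Redis).
interface KeyValueStore {
  get(key: string): Promise<unknown | null>
  set(key: string, value: unknown, ttlMs: number): Promise<void>
}

@Injectable()
export class IdempotencyInterceptor implements NestInterceptor {
  constructor(private readonly store: KeyValueStore) {}

  intercept(context: ExecutionContext, next: CallHandler): Observable<unknown> {
    const request = context.switchToHttp().getRequest()
    const key = request.headers['idempotency-key']

    // Only non-GET requests are deduplicated; GET is naturally safe to retry.
    if (request.method === 'GET' || !key) return next.handle()

    const cacheKey = `idem:${request.method}:${request.url}:${key}`

    return from(this.store.get(cacheKey)).pipe(
      switchMap((cached) => {
        // Replay the stored response instead of re-executing the handler.
        if (cached !== null) return of(cached)
        // First time we see this key: run the handler and remember its result.
        return next.handle().pipe(
          tap((result) => void this.store.set(cacheKey, result, 24 * 60 * 60 * 1000)),
        )
      }),
    )
  }
}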
Safe retries flow:
The source code is here.
Step 4. Fallback Cache for Read-Path Outages
With validation in place and safe retries handled, the next problem is: the client wants to read data, but the database is temporarily unavailable.
Normally, GET responses should include proper caching headers so browsers or shared caches can reuse data without hitting the backend. But when the cached copy has already expired — or when the client never cached the response in the first place — the request comes back to us, and if the DB is down, we have nothing to return.
To soften this failure mode, I added a small fallback cache for the hot read endpoints. Every time a client reads or writes data, we update that cache using a fire-and-forget strategy so it doesn’t slow down the main request path. The database stays the source of truth, but if it’s unavailable, we can at least try to serve a cached response instead of failing immediately.
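A stripped-down read use case illustrating the idea might look like this (port names and TTLs are illustrative):

import { Task } from './task.model' // illustrative path to the model from Step 1

interface TaskRepository {
  findById(id: string): Promise<Task | null>
}

interface FallbackCache {
  get(key: string): Promise<Task | null>
  set(key: string, value: Task, ttlMs: number): Promise<void>
}

export class GetTaskUseCase {
  constructor(
    private readonly repository: TaskRepository,
    private readonly cache: FallbackCache,
  ) {}

  async execute(id: string): Promise<Task | null> {
    try {
      const task = await this.repository.findById(id)
      if (task) {
        // Fire-and-forget: refresh the fallback copy without blocking the response.
        void this.cache.set(`task:${id}`, task, 10 * 60 * 1000).catch(() => undefined)
      }
      return task
    } catch (error) {
      // The DB call failed or timed out: serve the (possibly stale) cached copy if we have one.
      const cached = await this.cache.get(`task:${id}`)
      if (cached) return cached
      throw error
    }
  }
}

The controller then turns null into a 404 and a rethrown error into a 503.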
Fallback cache flow:
The source code is here.
Step 5. Message Queue for Write-Path Outages
At this point, the API can still serve some reads during a database outage thanks to the fallback cache. But write requests are a different story — if the DB is down, we either fail immediately or provide a way to defer the operation.
In practice, there are two options:
- Return 503 with a Retry-After header.
- Return 202 and defer the write.
The first option is simple, but the second can be much more user-friendly, especially when clients send large payloads that aren’t ideal to re-upload repeatedly.
To support deferred writes, we need:
- a message broker
- a worker process consuming a queue
- a temporary store (cache) to track message states and results
Enqueuing the write
If the DB is unavailable when a write request arrives, the API publishes a message to the broker, stores the message state in cache, and returns a 202 with a polling location.
A message for POST /tasks looks like this:
{
"key": "create",
"payload": {
"id": "481f060f-7458-4ae6-ba32-73103e5e1d31",
"data": {
"name": "Read a book"
}
}
}
- key — routing key deciding which queue the message goes to
- payload.id — message ID used by both the client and the worker; not related to the task yet
- payload.data — original request body
Retry and delay settings depend on your needs. Example:
{
"delayMs": 5000,
"retryPolicy": {
"maxRetries": 3,
"baseDelayMs": 30000,
"maxDelayMs": 120000
}
}
A typical 202 response looks like this:
{
"id": "481f060f-7458-4ae6-ba32-73103e5e1d31",
"status": "pending",
"location": "/tasks/queued/481f060f-7458-4ae6-ba32-73103e5e1d31",
"retryAfter": 30
}
- location — where the client polls the message state
- retryAfter — suggested polling interval in seconds
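Sketched at the use-case level, the enqueue branch could look roughly like this; the port names, TTLs, and the retryAfter value are assumptions, not the exact code from the repository:

import { randomUUID } from 'node:crypto'

// Illustrative ports for the broker publisher and the message-state cache.
interface MessagePublisher {
  publish(message: { key: string; payload: { id: string; data: unknown } }): Promise<void>
}

interface MessageStateStore {
  set(id: string, state: { status: string; result?: unknown }, ttlMs: number): Promise<void>
}

const RETRY_AFTER_SECONDS = 30

export class CreateTaskUseCase {
  constructor(
    private readonly repository: { create(data: { name: string }): Promise<unknown> },
    private readonly publisher: MessagePublisher,
    private readonly states: MessageStateStore,
  ) {}

  async execute(data: { name: string }) {
    try {
      // Happy path: the database (behind its circuit breaker) is available.
      return { kind: 'created' as const, task: await this.repository.create(data) }
    } catch {
      // Degraded path: defer the write and let the worker apply it later.
      const id = randomUUID()
      await this.publisher.publish({ key: 'create', payload: { id, data } })
      await this.states.set(id, { status: 'pending' }, 24 * 60 * 60 * 1000)

      return {
        kind: 'queued' as const,
        id,
        status: 'pending',
        location: `/tasks/queued/${id}`,
        retryAfter: RETRY_AFTER_SECONDS,
      }
    }
  }
}

The controller maps 'created' to a 201 and 'queued' to the 202 body shown above.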
Synchronous enqueue flow:
Worker processing
Once the initial delay expires, the worker receives the message. Based on the routing key, it performs the equivalent of the original write request.
The first step is to lock the message by setting its state to "in_progress" so no other consumer processes it.
- On success, it writes to the DB, updates the state to "completed", stores the original API response, and acknowledges the message.
- On failure, it republishes the message with exponential backoff until retries run out.
- If all retries fail, the message is marked "failed" and sent to a dead-letter queue.
Async worker flow:
Client polling
Meanwhile, the client polls the provided location endpoint with the provided interval until the message status becomes "completed" or "failed".
Client polling flow:
A successful result looks like this:
{
"id": "481f060f-7458-4ae6-ba32-73103e5e1d31",
"status": "completed",
"result": {
"id": "7fbccabe-0635-450d-92db-d0fe4829759a",
"name": "Read a book",
"status": "pending",
"createdAt": "2025-11-26T20:45:34.713Z",
"updatedAt": "2025-11-26T20:45:34.713Z"
}
}
- result — response the client could have received from POST /tasks
The client is responsible for integrating this flow into the UX so the outage is invisible to the end user.
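On the client side, a minimal polling helper built around the location and retryAfter fields could be as small as this sketch; a real UI would add timeouts, cancellation, and progress feedback:

interface QueuedState {
  id: string
  status: 'pending' | 'in_progress' | 'completed' | 'failed'
  result?: unknown
}

export async function waitForQueuedWrite(
  baseUrl: string,
  location: string,
  retryAfterSeconds: number,
  maxAttempts = 20,
): Promise<QueuedState> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    // Respect the polling interval suggested by the server.
    await new Promise((resolve) => setTimeout(resolve, retryAfterSeconds * 1000))

    const response = await fetch(`${baseUrl}${location}`)
    const state = (await response.json()) as QueuedState

    if (state.status === 'completed' || state.status === 'failed') return state
  }
  throw new Error('Queued write did not settle in time')
}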
Since the cache is used to store message states, we can assign a TTL to automatically clean up old entries. The TTL should match your business expectations around how long messages may take to settle.
The source code is here.
Step 6. Circuit Breaker for Distributed Systems
In distributed systems, dependency failures are usually the root cause of outages. To make our API behave predictably under these conditions, we can introduce a circuit breaker between the caller and every external dependency. The breaker helps in two ways:
- It fails remote operations fast when a dependency is slow or unavailable.
- It prevents overloaded dependencies from getting hammered even more, giving them a chance to recover while the client receives a quick response.
A circuit breaker is essentially a proxy that decides whether a given operation should run or be rejected based on recent metrics.
Circuit breaker flow:
You can implement the breaker yourself or use an existing library like opossum, which already provides a wide range of options.
Breaker settings should be tuned per dependency (DB, cache, message broker, etc.) based on real metrics — more on that in the next section. Example configuration:
private readonly breakerOptions: CircuitBreaker.Options<[operation: () => Promise<unknown>]> = {
timeout: 3000,
errorThresholdPercentage: 50,
resetTimeout: 30000,
rollingCountTimeout: 30000,
rollingCountBuckets: 10,
volumeThreshold: 10,
allowWarmUp: true,
}
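For reference, a minimal CircuitBreakerService built on opossum could look like the sketch below; the one in the repository additionally reports state changes as metrics (see Step 7):

import CircuitBreaker from 'opossum'

type Operation = () => Promise<unknown>

export class CircuitBreakerService {
  private readonly breaker: CircuitBreaker<[Operation], unknown>

  constructor(name: string, options: CircuitBreaker.Options<[Operation]>) {
    // The wrapped "action" just runs whatever operation it is given, so one
    // breaker instance can guard every method of a single dependency.
    this.breaker = new CircuitBreaker((operation: Operation) => operation(), options)

    // State transitions are useful both for logs and for the custom metrics in Step 7.
    this.breaker.on('open', () => console.warn(`[breaker:${name}] open`))
    this.breaker.on('halfOpen', () => console.warn(`[breaker:${name}] half-open`))
    this.breaker.on('close', () => console.info(`[breaker:${name}] closed`))
  }

  fire<T>(operation: () => Promise<T>): Promise<T> {
    return this.breaker.fire(operation) as Promise<T>
  }
}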
Wrapping dependencies
The breaker doesn’t change the API surface — callers should not know it exists. You simply wrap your dependency instance at injection time.
- Wrapper using a Proxy:
export function withCircuitBreaker<T extends object>(breaker: CircuitBreakerService, repository: T): T {
return new Proxy(repository, {
get(target, prop, receiver) {
const value = Reflect.get(target, prop, receiver)
if (typeof value !== 'function') return value
return (...args: unknown[]) => breaker.fire(() => value.apply(target, args))
},
})
}
- NestJS provider that wires the breaker into a port:
{
provide: TASK_REPOSITORY,
useFactory: (breaker: CircuitBreakerService, repository: TaskRepository) =>
withCircuitBreaker(breaker, repository),
inject: [CircuitBreakerService, PostgresTaskRepository],
}
// CircuitBreakerService – one instance per dependency
- Caller remains unaware:
@Inject(TASK_REPOSITORY)
private readonly taskRepository: TaskRepository
I added a circuit breaker around the following dependencies:
- PostgreSQL
- Redis
- RabbitMQ (publisher)
You can introduce the breaker at any point in your system. I’m adding it now so the architecture is fully assembled before we start collecting metrics and running tests. It doesn’t change any of the previous sequence diagrams — the only difference is that dependency calls may fail faster depending on the breaker state.
The source code is here.
Step 7. Telemetry
Now that the API is wrapped with several resilience layers, we can add the final piece: observability. Without proper telemetry, it’s impossible to understand how the system behaves under failures or load, and even harder to debug issues when something goes wrong.
For this project, I’m using OpenTelemetry. It has wide language support, solid auto-instrumentation, and integrates easily with the tools we need.
To complete the observability setup, I added:
- Prometheus — collects metrics
- Grafana — visualizes metrics
- Jaeger — collects distributed traces
- OpenTelemetry Collector — receives data and exports traces to Jaeger
The full docker-compose.yml with all services is here.
In the application, telemetry is initialized before the app boots, so all startup operations and dependencies are captured. The init script is here.
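A condensed version of that init script, using the standard OpenTelemetry Node SDK, is sketched below; the service name, ports, and collector URL are assumptions that depend on the docker-compose setup:

// tracing.ts – loaded before the Nest application is bootstrapped.
import { NodeSDK } from '@opentelemetry/sdk-node'
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node'
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http'
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus'

const sdk = new NodeSDK({
  serviceName: 'todo-api', // illustrative name
  traceExporter: new OTLPTraceExporter({ url: 'http://otel-collector:4318/v1/traces' }),
  metricReader: new PrometheusExporter({ port: 9464 }),
  instrumentations: [getNodeAutoInstrumentations()],
})

// Starting the SDK first lets auto-instrumentation patch http, pg, ioredis, amqplib, etc.
sdk.start()

process.on('SIGTERM', () => {
  void sdk.shutdown()
})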
Once everything is running, we get metrics such as:
- service up/down state
- event loop delay and utilization
- garbage collection duration by kind
- heap memory usage
- HTTP request duration and count per endpoint
- DB query duration, count, and connection usage
- and many others
I also added custom metrics for:
- circuit breaker state — open/closed/half-open
- user-facing cache hits/misses
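The breaker-state gauge, for example, can be wired through the OpenTelemetry metrics API roughly like this (the metric name and label are illustrative):

import { metrics } from '@opentelemetry/api'
import type CircuitBreaker from 'opossum'

const meter = metrics.getMeter('resilience')

// 0 = closed, 1 = half-open, 2 = open; one time series per dependency.
const breakerState = meter.createObservableGauge('circuit_breaker_state', {
  description: 'Circuit breaker state per dependency',
})

// Hypothetical helper: call it once per breaker when wiring dependencies.
export function observeBreaker(name: string, breaker: CircuitBreaker): void {
  breakerState.addCallback((result) => {
    result.observe(breaker.opened ? 2 : breaker.halfOpen ? 1 : 0, { dependency: name })
  })
}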
Additionally, I added separate Prometheus exporters for:
RabbitMQ
- up/down
- connections, channels
- delivered / acknowledged / rejected messages
- per-queue metrics
Redis
- up/down
- client count, key count, ops/sec
- command latency
- global cache hits/misses
Prometheus config is here.
Grafana dashboards and alert rules are here — you can explore all of them by following the setup instructions in the repository. I’ll show one of the dashboards in the next section.
To improve debugging, I configured full end-to-end tracing with Jaeger. For example, here’s the message queue flow from Step 5:
You can see the original request failing due to a DB outage, followed by two retries from the worker, and then the successful write. These traces come from two different processes (API and worker), but Jaeger automatically stitches them into a single flow — no custom code required.
And here’s the client polling the queued state:
Expanding a span shows more details:
By default, the OTel Collector captures everything, but sending full traces slows down the system. To keep overhead minimal, I configured tail-based sampling to always keep 5xx errors and 1% of other traces:
processors:
batch:
timeout: 5s
send_batch_size: 1024
send_batch_max_size: 2048
tail_sampling:
decision_wait: 10s
expected_new_traces_per_sec: 900 # expected RPS
num_traces: 18000 # RPS * 2 * wait
policies:
- name: keep_5xx
type: numeric_attribute
numeric_attribute:
key: http.status_code
min_value: 500
max_value: 599
- name: keep_1_percent
type: probabilistic
probabilistic:
sampling_percentage: 1
The full configuration is here.
Architecture Overview
A resilient system starts with a solid architecture. It reduces blast radius when things fail, makes reliability patterns reusable, and lets you swap infrastructure pieces (DB, broker, cache, telemetry) without rewriting core rules. Without structure, retries, fallbacks, and observability tend to leak everywhere and become impossible to maintain.
This project uses a variation of Clean Architecture (inspired by Robert C. Martin’s book), where all dependencies point inward. Each inner layer knows nothing about the outer ones, and outer layers depend on inner layers via explicit ports (interfaces). A single composition root wires everything together.
Domain (entities & rules)
Contains the core entities and the repository port that defines the operations available for each entity. This layer is framework-agnostic.
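For example, the task repository port might be declared in the Domain layer roughly like this (method names are illustrative):

import { Task, TaskStatus } from './task.model' // illustrative path

export interface TaskRepository {
  findAll(): Promise<Task[]>
  findById(id: string): Promise<Task | null>
  create(data: { name: string }): Promise<Task>
  update(id: string, data: { name?: string; status?: TaskStatus }): Promise<Task | null>
  delete(id: string): Promise<void>
  deleteAll(): Promise<void>
}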
Application (use cases)
Contains orchestration logic: interacting with repositories and cache. This layer is framework-agnostic.
Interface (presenters/controllers)
Contains everything that exposes our use cases to the outside world. In this project, that includes HTTP controllers and message publishers & subscribers — both act as entry points that translate external inputs into application-level calls.
Because the inner layers don’t depend on any delivery mechanism, this interface can take many forms:
- HTTP REST/GraphQL
- WebSockets
- Desktop/Web application
- or any future transport
All of these are simply different ways of invoking the same business rules. Swapping or extending them does not require changing a single line in the Application or Domain layers.
Infrastructure (adapters for external dependencies)
This layer contains the concrete implementations of all ports defined in the inner layers. Anything that talks to the outside world lives here: PostgreSQL repositories, Redis cache adapter, RabbitMQ publisher/subscriber, circuit breaker integration, telemetry exporters, etc.
Because the Application and Domain layers depend only on ports, none of these implementations are permanent. We can replace PostgreSQL with MongoDB, Redis with Memcached, or RabbitMQ with Kafka without touching business logic. Even the circuit breaker and telemetry stack can be swapped or extended without affecting any use case or entity.
The Infrastructure layer knows how to communicate with external services, but it doesn’t know why. All decision-making stays inside the Application layer.
Composition Root
The only place that knows about everything. It chooses the concrete adapters, initializes telemetry, attaches circuit breakers, and wires dependencies.
The final component diagram looks like this:
Testing
With the architecture in place, it’s time to see how it behaves under real pressure.
When engineers talk about “tests”, they usually mean unit, integration, or end-to-end testing. Those are essential, but they don’t tell you anything about resilience. This article is about surviving outages, so we’ll skip straight to load tests and controlled chaos scenarios.
For load, we use Grafana k6.
For chaos, we either stop services or inject latency via Toxiproxy.
All tests run in Docker on a MacBook Pro M1 with 8 vCPUs and 8 GB RAM shared by all containers.
The load phase lasts 5 minutes with 100 virtual users generating ~1000 requests/sec. Each VU can create up to 25 entities, giving us a predictable DB size and a consistent read/write pattern.
Request distribution:
- POST /tasks — 15%
- GET /tasks/:id — 50%
- PUT /tasks/:id — 20%
- DELETE /tasks/:id — 15%
(GET /tasks is ignored because the API doesn’t support per-user lists.)
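A heavily simplified version of the k6 script is sketched below; the real one in the repository keeps a per-VU pool of up to 25 task ids and picks operations according to the weights above, while this sketch just exercises the same endpoints:

import http from 'k6/http'
import { check, sleep } from 'k6'

export const options = { vus: 100, duration: '5m' }

const BASE_URL = __ENV.BASE_URL || 'http://localhost:3000'
const headers = { 'Content-Type': 'application/json' }

export default function () {
  // Create a task, then hit the id-based endpoints.
  const created = http.post(
    `${BASE_URL}/tasks`,
    JSON.stringify({ name: `task-${__VU}-${__ITER}` }),
    { headers: { ...headers, 'Idempotency-Key': `${__VU}-${__ITER}` } },
  )
  check(created, { 'write accepted': (r) => r.status === 201 || r.status === 202 })

  if (created.status === 201) {
    const id = created.json('id')
    http.get(`${BASE_URL}/tasks/${id}`)
    http.put(
      `${BASE_URL}/tasks/${id}`,
      JSON.stringify({ status: 'in_progress' }),
      { headers: { ...headers, 'Idempotency-Key': `${__VU}-${__ITER}-upd` } },
    )
  }

  sleep(0.1)
}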
First, let’s look at the baseline when all dependencies are healthy.
Baseline
k6 output:
http_queued_req_duration.......: avg=0s min=0s med=0s max=0s p(90)=0s p(95)=0s
http_queued_req_failed.........: 0.00% 0 out of 0
http_req_duration..............: avg=16.81ms min=745.08µs med=12.21ms max=838.54ms p(90)=35.1ms p(95)=44.31ms
{ scenario:read }............: avg=8.11ms min=745.08µs med=6.96ms max=830ms p(90)=14.01ms p(95)=17.39ms
{ scenario:write }...........: avg=24.78ms min=1.82ms med=21.9ms max=838.54ms p(90)=43.33ms p(95)=52.58ms
http_req_failed................: 0.00% 0 out of 320641
{ scenario:read }............: 0.00% 0 out of 153346
{ scenario:write }...........: 0.00% 0 out of 167293
http_reqs......................: 320641 1068.318129/s
Everything succeeds, and, as expected, the fallback cache and message queue never activate.
Now let’s start breaking things.
Scenario 1: PostgreSQL down for 1 minute
k6 output:
http_queued_req_duration.......: avg=1m48s min=1m7s med=1m25s max=3m2s p(90)=2m45s p(95)=2m51s
http_queued_req_failed.........: 0.00% 0 out of 70
http_queued_reqs...............: 70 0.227933/s
http_req_duration..............: avg=11.94ms min=36.87µs med=6.79ms max=1.17s p(90)=27.6ms p(95)=36.72ms
{ scenario:read }............: avg=5.7ms min=642.75µs med=3.39ms max=1.08s p(90)=10.85ms p(95)=13.98ms
{ scenario:write }...........: avg=20.97ms min=36.87µs med=17.06ms max=1.17s p(90)=38.53ms p(95)=47.78ms
http_req_failed................: 0.00% 0 out of 270681
{ scenario:read }............: 0.00% 0 out of 160154
{ scenario:write }...........: 0.00% 0 out of 110525
http_reqs......................: 270681 881.386655/s
What happens:
- The PostgreSQL circuit opens.
- Reads keep working because fallback cache serves them.
- Writes are deferred — they get enqueued, and the worker processes them once Postgres recovers.
- Virtual users respect the suggested 202 polling interval, so write RPS naturally drops.
In a real system, cache hit ratios would be lower — but with only 25 entities per VU, everyone hits cache reliably.
Scenario 2: Redis down for 1 minute
k6 output:
http_queued_req_duration.......: avg=0s min=0s med=0s max=0s p(90)=0s p(95)=0s
http_queued_req_failed.........: 0.00% 0 out of 0
http_req_duration..............: avg=13.85ms min=362.41µs med=9.11ms max=3.11s p(90)=30.17ms p(95)=39.24ms
{ scenario:read }............: avg=6.28ms min=726.66µs med=4.89ms max=844.76ms p(90)=11.9ms p(95)=15.04ms
{ scenario:write }...........: avg=22.79ms min=362.41µs med=19.11ms max=3.11s p(90)=39.71ms p(95)=48.94ms
http_req_failed................: 0.70% 2045 out of 291874
{ scenario:read }............: 0.00% 0 out of 158010
{ scenario:write }...........: 1.52% 2045 out of 133862
http_reqs......................: 291874 972.431145/s
What happens:
- The Redis circuit opens.
- Reads still succeed because of the DB-first strategy.
- Writes fail with 503 because idempotency cannot be guaranteed without Redis, and accepting write requests without it would risk corruption or duplication.
- Virtual users respect the Retry-After header, so write RPS naturally drops.
Scenario 3: PostgreSQL and RabbitMQ down for 1 minute
k6 output:
http_queued_req_duration.......: avg=0s min=0s med=0s max=0s p(90)=0s p(95)=0s
http_queued_req_failed.........: 0.00% 0 out of 0
http_req_duration..............: avg=15.94ms min=629.95µs med=10.4ms max=2.21s p(90)=34.16ms p(95)=43.32ms
{ scenario:read }............: avg=7.39ms min=648.62µs med=5.66ms max=1.13s p(90)=13.19ms p(95)=16.47ms
{ scenario:write }...........: avg=26.09ms min=629.95µs med=22.26ms max=2.21s p(90)=43.79ms p(95)=53.06ms
http_req_failed................: 0.72% 2082 out of 285945
{ scenario:read }............: 0.00% 0 out of 155091
{ scenario:write }...........: 1.59% 2082 out of 130852
http_reqs......................: 285945 952.711964/s
What happens:
- Both Postgres and RabbitMQ circuits open.
- You see a short latency spike caused by amqplib’s reconnect attempts, which are cut off by the circuit timeout.
- Reads are still fine because of the fallback cache.
- Writes fail because with both DB and message queue unavailable, the system has no safe place to store the operation.
- After Postgres recovers, its breaker closes.
- RabbitMQ stays in half-open mode because no messages need to be published when the DB is healthy.
Scenario 4: Everything down
All breakers open after hitting the error threshold, and the system fails fast, with occasional probes to check if any dependency has recovered.
This is the correct outcome: no retries, no thread starvation, no cascading failures — just clean fast-fail behavior.
All outages shown here were triggered by stopping containers to make Grafana’s up/down charts easier to read. You can also reproduce them by injecting latency: the circuit will gather enough failure data and open automatically, stabilizing the system in the same way you saw in the RabbitMQ case.
More chaos scenarios are available here.
Conclusion
With safe retries, a fallback cache, a message queue, circuit breakers, and solid telemetry, the system behaves much more predictably when things go wrong. The chaos tests simply confirmed that each piece pulls its weight under pressure.
Still, this is just one way to approach backend resilience. I’m sure there are angles I didn’t explore or places that could be improved. If you have ideas, different experiences, or suggestions, I’d genuinely love to hear them.
Thanks for reading!