Majd-sufyan

Posted on Jun 25

System Design Journey — Week 4: Reliability, Failures & Designing a Payment API

#architecture #distributedsystems #sre #systemdesign

Overview

In Week 4, I focused on a topic that every distributed system eventually faces:

Failures are inevitable.

No matter how well a system is designed, networks fail, servers crash, databases become unavailable, and requests time out.

The goal this week was to understand how reliable systems continue operating despite these failures.

My main focus areas were:

Fault tolerance
Retries and timeouts
Circuit breakers
Idempotency
Designing systems that avoid cascading failures

To apply these concepts, I designed a simplified Payment API, where correctness matters more than almost any other requirement.

Reliability vs Availability

One idea that stood out immediately is that a system can be available without being reliable.

For example:

A payment service might always respond
But accidentally charge a customer twice

Technically, the system is available.

But it is not reliable.

This changed how I think about backend systems.

Users care less about whether a request returns a response and more about whether the system behaves correctly.

Timeouts, Retries & Circuit Breakers

Most distributed systems communicate over unreliable networks.

Sometimes a request succeeds, but the response never arrives.

Sometimes a downstream service becomes slow.

Sometimes it becomes completely unavailable.

Timeouts

A timeout prevents requests from waiting forever.

Instead of hanging indefinitely, the request fails after a predefined period.

This frees the threads and connections that a stalled call would otherwise hold, which helps prevent thread exhaustion.
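A minimal sketch of a timeout, assuming an asyncio-based client. The names `call_with_timeout`, `DownstreamTimeout`, `slow_provider`, and `fast_provider` are hypothetical and only illustrate the idea; they are not part of any real payment SDK.

```python
import asyncio


class DownstreamTimeout(Exception):
    """Raised when the downstream call exceeds its time budget."""


async def call_with_timeout(operation, timeout_s):
    """Await operation() but give up after timeout_s seconds."""
    try:
        # wait_for cancels the pending call once the deadline passes
        return await asyncio.wait_for(operation(), timeout=timeout_s)
    except asyncio.TimeoutError:
        raise DownstreamTimeout(f"no response within {timeout_s}s") from None


async def slow_provider():
    # Hypothetical stand-in for a payment provider that responds too slowly
    await asyncio.sleep(1.0)
    return "charged"


async def fast_provider():
    # Hypothetical stand-in for a healthy provider
    return "charged"
```

Calling `asyncio.run(call_with_timeout(slow_provider, 0.05))` raises `DownstreamTimeout` instead of waiting the full second. Note that a timeout only says the caller stopped waiting. It says nothing about whether the downstream work actually happened, which is exactly the "request succeeds, but the response never arrives" case.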

Retries

Retries let a client recover automatically from temporary failures.

Examples:

Temporary network issue
Short database outage
Service restart

However, retries can also be dangerous.

If thousands of clients immediately retry a failing service, they can amplify the outage.

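This is known as a retry storm. Capping the number of attempts and spacing them out with exponential backoff and random jitter reduces the risk.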
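One common way to keep retries from turning into a retry storm is exponential backoff with jitter. The sketch below assumes a hypothetical `TransientError` that marks failures worth retrying. The default values are illustrative, not recommendations.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry_with_backoff(operation, max_attempts=4, base_delay=0.1,
                       max_delay=2.0, sleep=time.sleep):
    """Call operation(), retrying transient failures with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Upper bound grows as base, 2*base, 4*base, ... up to max_delay
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: a random wait in [0, cap] spreads clients apart
            sleep(random.uniform(0, cap))
```

The `sleep` parameter is injectable only so the function can be exercised without real waiting. The random wait is what keeps thousands of clients from retrying at the same instant.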

Circuit Breakers

Circuit breakers help prevent cascading failures.

When a downstream service starts failing repeatedly:

New requests are stopped early
The service is given time to recover
Resources are protected

A circuit breaker works like its electrical namesake: it trips when failures pile up and can be reset once the fault clears.

Instead of allowing one failure to spread across the system, it isolates the problem.
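The behavior described above can be sketched as a small state machine. This is a simplified, single-threaded illustration under my own assumptions: the class name, the thresholds, and the injectable `clock` are hypothetical, and production libraries handle concurrency and metrics that this version ignores.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.recovery_timeout:
                # Open: fail fast and give the downstream time to recover
                raise CircuitOpenError("circuit is open")
            # Recovery window elapsed: let a trial call through (half-open)
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if (self.opened_at is not None
                    or self.failures >= self.failure_threshold):
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and resets the failure count
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, `call` raises `CircuitOpenError` without invoking the operation at all, so a struggling service stops receiving traffic it cannot handle.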

Idempotency: The Most Important Concept This Week

The biggest lesson from Week 4 was idempotency.

An operation is idempotent when performing it multiple times produces the same result as performing it once.

For example, reading a payment's status is idempotent: repeating the request changes nothing.

Creating a payment, by contrast, is not naturally idempotent.

If a payment request is processed twice:

The customer may be charged twice
Financial records become inconsistent
Customer trust is lost

To solve this problem, payment APIs typically require an Idempotency Key.

The client sends a unique identifier with the request:

POST /payments

Idempotency-Key: abc123

If the same request is retried:

The server recognizes the key
Returns the original result
Prevents duplicate charges

This allows clients to safely retry requests when failures occur.
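A minimal in-memory sketch of the server side of this idea. The `PaymentService` class and its fields are hypothetical. A plain dictionary stands in for a shared store, and a real service would also need an atomic check-and-set so that two concurrent requests with the same key cannot both charge.

```python
import uuid


class PaymentService:
    def __init__(self):
        self._results = {}     # idempotency key -> stored response
        self.charges_made = 0  # counts how many real charges happened

    def create_payment(self, idempotency_key, amount_cents):
        # A retry of a request already handled: replay the stored result
        if idempotency_key in self._results:
            return self._results[idempotency_key]

        # First time this key is seen: perform the charge exactly once
        self.charges_made += 1
        response = {
            "payment_id": str(uuid.uuid4()),
            "amount_cents": amount_cents,
            "status": "succeeded",
        }
        self._results[idempotency_key] = response
        return response
```

Calling `create_payment("abc123", 5000)` twice returns the same `payment_id` both times and leaves `charges_made` at 1, which is what makes the client's retry safe.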

Applying the Concepts: Designing a Payment API

To practice these ideas, I designed a simplified payment processing system.

Functional Requirements

Create payments
Retrieve payment status
Prevent duplicate charges
Return payment history

Non-Functional Requirements

High reliability
Strong consistency
Low latency
Fault tolerance
High availability

Unlike in the systems from previous weeks, correctness matters more here than raw performance.

High-Level Architecture

The system consists of:

Stateless API servers
PostgreSQL for durable storage
Redis for idempotency lookups
Load balancer
Payment processor integration

Request flow:

Client submits payment request
API validates the idempotency key
Payment is stored in the database
The external payment provider is called
The result is returned to the client
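The request flow above can be sketched as a single handler. Everything here is an assumption made for illustration: plain dictionaries stand in for Redis and PostgreSQL, `charge` is a hypothetical callable representing the external provider, and leaving a payment in a `pending` state when the provider times out is one possible design choice rather than part of the original design.

```python
class ProviderTimeout(Exception):
    """Hypothetical error: the provider did not answer in time."""


def handle_create_payment(key, amount_cents, idempotency_cache,
                          payments_db, charge):
    # Check the idempotency key before doing any work
    if key in idempotency_cache:
        return payments_db[idempotency_cache[key]]

    # Store the payment as pending before touching the provider
    payment_id = f"pay_{len(payments_db) + 1}"
    payments_db[payment_id] = {
        "id": payment_id,
        "amount_cents": amount_cents,
        "status": "pending",
    }
    idempotency_cache[key] = payment_id

    # Call the external payment provider
    try:
        charge(payment_id, amount_cents)
        payments_db[payment_id]["status"] = "succeeded"
    except ProviderTimeout:
        # Outcome unknown: keep the payment pending rather than guess
        pass

    # Return the current record to the client
    return payments_db[payment_id]
```

Writing the pending record before calling the provider means a crash or timeout in the middle still leaves a trace of the attempt, and a retry with the same key gets that record back instead of triggering a second charge.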

Failure Scenarios

One of the most useful exercises this week was identifying failure modes.

This exercise reinforced an important lesson:

Designing for failure is often more important than designing for success.

What Changed in My Thinking

Before this week, I often thought about performance first.

Now I find myself asking different questions:

What happens if this request runs twice?
What happens if the downstream service fails?
What happens if the response never arrives?
What happens if retries overload the system?

These questions feel much closer to how real production systems are designed.

Reflections

Week 4 was less about scaling and more about correctness.

The most valuable takeaway was realizing that a surprising share of distributed system design goes into handling situations where things go wrong.

Reliability is not achieved by preventing failures.

It is achieved by expecting failures and designing systems that can recover from them.

What’s Next — Week 5

Replication
Consistency models
Read replicas
Leader-follower architectures

The journey continues 🚀

DEV Community