DEV Community

Majd-sufyan

System Design Journey — Week 4: Reliability, Failures & Designing a Payment API

Overview

In Week 4, I focused on a topic that every distributed system eventually faces:

Failures are inevitable.

No matter how well a system is designed, networks fail, servers crash, databases become unavailable, and requests time out.

The goal this week was to understand how reliable systems continue operating despite these failures.

My main focus areas were:

  • Fault tolerance
  • Retries and timeouts
  • Circuit breakers
  • Idempotency
  • Designing systems that avoid cascading failures

To apply these concepts, I designed a simplified Payment API, where correctness matters more than almost any other requirement.


Reliability vs Availability

One idea that stood out immediately is that a system can be available without being reliable.

For example:

  • A payment service might always respond
  • But accidentally charge a customer twice

Technically, the system is available.

But it is not reliable.

This changed how I think about backend systems.

Users care less about whether a request returns a response and more about whether the system behaves correctly.


Timeouts, Retries & Circuit Breakers

Most distributed systems communicate over unreliable networks.

Sometimes a request succeeds, but the response never arrives.

Sometimes a downstream service becomes slow.

Sometimes it becomes completely unavailable.

Timeouts

A timeout prevents requests from waiting forever.

Instead of hanging indefinitely, the request fails after a predefined period.

This releases the threads and connections that a stalled call would otherwise hold, which helps prevent thread exhaustion.
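Here is a minimal sketch of the idea in Python, using `asyncio.wait_for` from the standard library. The `slow_downstream_call` function and the 0.1 second limit are made up for illustration; a real service would pick its limit from measured latencies.

```python
import asyncio


async def slow_downstream_call() -> str:
    # Hypothetical stand-in for a downstream service that has become slow.
    await asyncio.sleep(5)
    return "ok"


async def call_with_timeout(timeout_seconds: float) -> str:
    try:
        # wait_for cancels the inner call once the deadline passes.
        return await asyncio.wait_for(slow_downstream_call(), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        # The caller gets a fast, explicit failure instead of hanging.
        return "timed out"


result = asyncio.run(call_with_timeout(0.1))
print(result)
```

The caller gets an answer after roughly 0.1 seconds instead of waiting the full 5, and can then decide whether to retry or fail the request.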


Retries

Retries allow temporary failures to recover automatically.

Examples:

  • Temporary network issue
  • Short database outage
  • Service restart

However, retries can also be dangerous.

If thousands of clients immediately retry a failing service, they can amplify the outage.

This is known as a retry storm.
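A common way to reduce this risk is exponential backoff with jitter: each retry waits longer than the last, and the wait is randomized so clients do not all retry at the same moment. The sketch below is only an illustration. `TransientError`, `retry_with_backoff`, and the delay values are names and numbers I made up, not part of any library.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical error type for failures that are worth retrying."""


def retry_with_backoff(operation, max_attempts=4, base_delay=0.1,
                       max_delay=2.0, sleep=time.sleep):
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # The upper bound doubles on each attempt, capped at max_delay.
            cap = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: wait a random time in [0, cap] to avoid lockstep retries.
            sleep(random.uniform(0, cap))


calls = {"count": 0}


def flaky_operation():
    # Fails twice, then succeeds, to simulate a short outage.
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("temporary failure")
    return "success"


delays = []  # record the waits instead of actually sleeping
outcome = retry_with_backoff(flaky_operation, sleep=delays.append)
print(outcome, calls["count"], delays)
```

Capping both the number of attempts and the maximum delay keeps a single request from retrying forever.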


Circuit Breakers

Circuit breakers help prevent cascading failures.

When a downstream service starts failing repeatedly:

  • New requests are stopped early
  • The service is given time to recover
  • Resources are protected

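A circuit breaker works much like its electrical namesake: once failures cross a threshold it trips and stops calls from flowing, and after a cool-down it can be reset to try again.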

Instead of allowing one failure to spread across the system, it isolates the problem.
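To make the behavior concrete, here is a small single-threaded sketch of a circuit breaker. The class, its thresholds, and the injected `clock` are my own simplifications for illustration; production libraries add thread safety, metrics, and finer-grained half-open handling.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the downstream service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_seconds = recovery_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.recovery_seconds:
                raise CircuitOpenError("downstream marked unhealthy; failing fast")
            # Cool-down elapsed: fall through and allow one trial call (half-open).
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


fake_now = [0.0]  # controllable clock so the example runs instantly
breaker = CircuitBreaker(failure_threshold=2, recovery_seconds=10.0,
                         clock=lambda: fake_now[0])


def failing():
    raise RuntimeError("downstream error")


events = []
for _ in range(3):
    try:
        breaker.call(failing)
    except CircuitOpenError:
        events.append("rejected")
    except RuntimeError:
        events.append("failed")

fake_now[0] = 11.0  # pretend the cool-down has passed
events.append(breaker.call(lambda: "recovered"))
print(events)
```

The third call never reaches the failing service: the breaker rejects it immediately, which is what protects both the caller's resources and the struggling downstream.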


Idempotency: The Most Important Concept This Week

The biggest lesson from Week 4 was idempotency.

An operation is idempotent when performing it multiple times produces the same result as performing it once.

Creating a payment, for example, is not naturally idempotent.

If a payment request is processed twice:

  • The customer may be charged twice
  • Financial records become inconsistent
  • Customer trust is lost

To solve this problem, payment APIs typically require an Idempotency Key.

The client sends a unique identifier with the request:

POST /payments
Idempotency-Key: abc123

If the same request is retried:

  • The server recognizes the key
  • Returns the original result
  • Prevents duplicate charges

This allows clients to safely retry requests when failures occur.
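The server-side behavior can be sketched with an in-memory dictionary. `PaymentService` and its fields are hypothetical names for illustration. A real implementation would also need an atomic check-and-store (for example, a unique constraint on the key) so that two concurrent requests with the same key cannot both charge.

```python
import uuid


class PaymentService:
    def __init__(self):
        self.results_by_key = {}  # idempotency key -> stored response
        self.charges = []         # every charge actually performed

    def create_payment(self, idempotency_key, amount_cents):
        # A repeated key returns the stored response without charging again.
        if idempotency_key in self.results_by_key:
            return self.results_by_key[idempotency_key]
        payment = {
            "payment_id": str(uuid.uuid4()),
            "amount_cents": amount_cents,
            "status": "succeeded",
        }
        self.charges.append(payment)
        self.results_by_key[idempotency_key] = payment
        return payment


service = PaymentService()
first = service.create_payment("abc123", 2500)
retry = service.create_payment("abc123", 2500)  # e.g. the client timed out and retried
print(first == retry, len(service.charges))
```

Both calls return the same payment, and only one charge is recorded.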


Applying the Concepts: Designing a Payment API

To practice these ideas, I designed a simplified payment processing system.

Functional Requirements

  • Create payments
  • Retrieve payment status
  • Prevent duplicate charges
  • Return payment history

Non-Functional Requirements

  • High reliability
  • Strong consistency
  • Low latency
  • Fault tolerance
  • High availability

Unlike in the systems from previous weeks, correctness here matters more than raw performance.


High-Level Architecture

The system consists of:

  • Stateless API servers
  • PostgreSQL for durable storage
  • Redis for idempotency lookups
  • Load balancer
  • Payment processor integration

Request flow:

  1. Client submits payment request
  2. API checks the idempotency key and returns the stored result if the key was already used
  3. Payment is stored in the database
  4. The external payment provider is called
  5. The result is returned to the client
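The request flow above can be sketched as a single handler. This is a rough illustration, not the actual design: plain dictionaries stand in for Redis and PostgreSQL, `charge_with_provider` is a fake provider that always approves, and the HTTP status codes are my own assumption.

```python
import uuid

idempotency_cache = {}  # stand-in for Redis: idempotency key -> payment id
payments_table = {}     # stand-in for PostgreSQL: payment id -> payment record
provider_calls = []     # records each call made to the fake external provider


def charge_with_provider(payment_id, amount_cents):
    # Hypothetical external payment provider; always approves in this sketch.
    provider_calls.append(payment_id)
    return "succeeded"


def handle_create_payment(idempotency_key, amount_cents):
    # Step 2: check the idempotency key before doing any work.
    if not idempotency_key:
        return {"http_status": 400, "error": "missing Idempotency-Key"}
    if idempotency_key in idempotency_cache:
        payment_id = idempotency_cache[idempotency_key]
        return {"http_status": 200, "payment": payments_table[payment_id]}

    # Step 3: persist the payment as pending before calling the provider.
    payment_id = str(uuid.uuid4())
    payments_table[payment_id] = {
        "id": payment_id,
        "amount_cents": amount_cents,
        "status": "pending",
    }
    idempotency_cache[idempotency_key] = payment_id

    # Step 4: call the external provider and record the outcome.
    payments_table[payment_id]["status"] = charge_with_provider(payment_id, amount_cents)

    # Step 5: return the result to the client.
    return {"http_status": 201, "payment": payments_table[payment_id]}


created = handle_create_payment("abc123", 2500)
replayed = handle_create_payment("abc123", 2500)
print(created["http_status"], replayed["http_status"], len(provider_calls))
```

Storing the payment as pending before calling the provider means a crash between steps 3 and 4 leaves a record that can be reconciled later, rather than a charge nobody knows about.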

Failure Scenarios

One of the most useful exercises this week was identifying failure modes.

This exercise reinforced an important lesson:

Designing for failure is often more important than designing for success.


What Changed in My Thinking

Before this week, I often thought about performance first.

Now I find myself asking different questions:

  • What happens if this request runs twice?
  • What happens if the downstream service fails?
  • What happens if the response never arrives?
  • What happens if retries overload the system?

These questions feel much closer to how real production systems are designed.


Reflections

Week 4 was less about scaling and more about correctness.

The most valuable takeaway was realizing that a surprising amount of distributed-system design goes into handling situations where things go wrong.

Reliability is not achieved by preventing failures.

It is achieved by expecting failures and designing systems that can recover from them.


What’s Next — Week 5

  • Replication
  • Consistency models
  • Read replicas
  • Leader-follower architectures

The journey continues 🚀
