DEV Community

nash9

Building an "Unstoppable" Serverless Payment System on AWS (Circuit Breaker Pattern)

What happens when your payment gateway goes down? In a traditional app, the user sees a spinner, then a "500 Server Error," and you lose the sale.
I wanted to build a system that refuses to crash. Even if the backend database is on fire, the user's order should be accepted, queued, and processed automatically when the system heals.

Here is how I implemented the Circuit Breaker Pattern using AWS Step Functions, Java Lambda, and Event-Driven Architecture—without provisioning a single server.

The Tech Stack

I chose a hybrid, cloud-native stack to enforce strict decoupling between the Frontend and the Backend.

  • Frontend: Python (Streamlit) – Acts as the Store & Admin Dashboard.
  • Orchestration: AWS Step Functions – The "Brain" handling the logic.
  • Compute: AWS Lambda (Java 11) – The "Worker" handling business logic.
  • State Store: Amazon DynamoDB – Stores circuit status (Open/Closed) and Order History.
  • Resiliency: Amazon SQS – The "Parking Lot" for failed orders.
  • Observability: Grafana Cloud (Loki) – Log aggregation.
  • Infrastructure: Terraform – Complete IaC.

Note: Manage all resources with Terraform. As a best practice, keep your Terraform resource definitions in separate files so that creating, updating, or deleting resources stays easy.

The Problem: Cascading Failures

In microservices, if Service A calls Service B and Service B hangs, Service A eventually hangs too. If thousands of users click "Pay," the database gets hammered with retries, and you effectively DDoS yourself.
The Solution? A Circuit Breaker.
Just like in your house: if there is a surge, the breaker trips to save the house from burning down.
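
To make the idea concrete, here is a minimal in-process sketch of a circuit breaker in Python. It is not the Step Functions implementation described in this post; the class name, the failure threshold, and the reset timeout are all assumptions chosen for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Tiny in-process circuit breaker (illustrative only)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before tripping
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None  # None means the circuit is CLOSED

    @property
    def is_open(self):
        if self.opened_at is None:
            return False
        # Once the timeout has passed, allow a trial call to probe the backend.
        return self.clock() - self.opened_at < self.reset_timeout

    def call(self, func, *args, **kwargs):
        if self.is_open:
            raise CircuitOpenError("circuit is open; failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the breaker is open, callers get an immediate CircuitOpenError instead of waiting on a hanging backend, which is what stops the retry storm.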

High Level Architecture:

I designed the system to handle three distinct states:

  1. Green Path (Closed): The backend is healthy. Orders process immediately.
  2. Red Path (Open): The backend is crashing. The system detects this, "Trips" the circuit, and stops sending traffic to the backend.
  3. Yellow Path (Recovery): Orders are routed to a Queue (SQS) to be retried later automatically.

The high-level design (HLD) may look scary, but it is what makes your app "unstoppable."

[Figure: high-level design (HLD) diagram]

How It Works: The Logic Flow

The core of this project is an AWS Step Functions State Machine. It acts as a traffic controller.

  1. The Check: Every time a user clicks "Pay," the workflow first checks DynamoDB. Is the circuit status OPEN?
      • If YES: Skip the backend entirely.
      • If NO: Proceed to the Java Lambda.
  2. The Execution: The workflow invokes a Java Lambda to process the payment.
      • Success: It updates the Order History to COMPLETED and emits an event to EventBridge (triggering a customer email via SNS).
      • Failure: It catches the error and retries with exponential backoff (wait 1s, then 2s).
  3. The "Trip": If the backend fails repeatedly, the Step Function:
      • Writes Status: OPEN to DynamoDB.
      • Alerts the SysAdmin via SNS ("Critical: Circuit Tripped").
      • Marks the order as FAILED in the dashboard.
  4. The Self-Healing (Auto-Retry): This is the coolest part. If the circuit is Open, new orders are not rejected. They are marked as QUEUED and sent to Amazon SQS. A "Retry Handler" Lambda listens to this queue:
      • It waits for a delay (e.g., 30s).
      • It re-submits the order to the Step Function.
      • If the backend is fixed, the order processes. If not, it goes back to the queue.
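
The steps above can be sketched as a small local simulation in Python. This is not the actual state machine definition: plain dicts stand in for the DynamoDB tables, a deque stands in for SQS, and every name here (CIRCUIT_KEY, run_order_workflow, drain_queue, the parameters) is hypothetical. How the circuit status gets set back to CLOSED is not covered in this post, so the sketch leaves that to the caller.

```python
import time
from collections import deque

CIRCUIT_KEY = "payment-backend"  # hypothetical key for the circuit status record


def run_order_workflow(order, circuit_table, orders_table, queue: deque,
                       process_payment, notify=print,
                       max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Mimic one execution of the workflow for a single order."""
    order_id = order["order_id"]

    # The Check: skip the backend entirely when the circuit is OPEN.
    if circuit_table.get(CIRCUIT_KEY) == "OPEN":
        orders_table[order_id] = "QUEUED"
        queue.append(order)
        return "QUEUED"

    # The Execution: try the payment, waiting 1s, then 2s between attempts.
    for attempt in range(max_attempts):
        try:
            process_payment(order)
        except Exception:
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
            continue
        orders_table[order_id] = "COMPLETED"
        return "COMPLETED"

    # The Trip: repeated failures open the circuit and alert the admin.
    circuit_table[CIRCUIT_KEY] = "OPEN"
    notify("Critical: Circuit Tripped")
    orders_table[order_id] = "FAILED"
    return "FAILED"


def drain_queue(queue: deque, **workflow_kwargs):
    """The Self-Healing: re-submit each currently queued order once."""
    for _ in range(len(queue)):
        run_order_workflow(queue.popleft(), queue=queue, **workflow_kwargs)
```

In this sketch, an order that is re-submitted while the circuit is still OPEN simply lands back at the end of the queue, which mirrors the "goes back to the queue" behavior described above.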

[Figure: low-level design (LLD) diagram]

Tested Data Scenarios:

  1. SUCCESS [screenshot: success scenario]
  2. CHAOS Mode [screenshot: chaos scenario]

Observability & Monitoring:

I integrated Grafana Cloud (Loki) to ingest logs from CloudWatch.

  • Streamlit Dashboard: Shows live status of orders (PENDING → COMPLETED or FAILED).
  • Grafana Explore: Allows deep searching of logs using {service="order-processor"} to find specific stack traces.

Key Learnings & Trade-offs

  1. Complexity vs. Reliability: This architecture is more complex than a simple API call. You have more moving parts (queues, state machines). However, the trade-off is high availability: the frontend never sees a crash.
  2. The "Ghost" Data: When a Catch block fires in Step Functions, the error output replaces the original input (including the Order ID) by default. I learned to set ResultPath on the Catch so the error is stored alongside the original data, which let me update the database even after a crash.
  3. Cost Optimization: Step Functions Standard Workflows are billed per state transition, which gets expensive at scale. For production, I would switch this to Express Workflows and use ARM64 (Graviton) for the Lambdas to reduce costs by an estimated ~40%.
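
To illustrate the "Ghost" Data point, here is a rough Python approximation of how a Catch combines the state input with the error output. The dict mirrors the shape of a Catch entry in Amazon States Language; the state name "MarkOrderFailed", the field name "error", and the helper apply_catch are assumptions for illustration, and the helper only handles the two simplest ResultPath forms.

```python
import copy

# Shape of a Catch entry in Amazon States Language, written as a Python dict.
# "MarkOrderFailed" is a hypothetical state name.
CATCH_WITH_RESULT_PATH = {
    "ErrorEquals": ["States.ALL"],
    "ResultPath": "$.error",
    "Next": "MarkOrderFailed",
}


def apply_catch(state_input, error_output, result_path="$"):
    """Approximate how a Catch merges state input and error output.

    Only "$" and single-level paths such as "$.error" are handled here.
    """
    if result_path == "$":
        # Default behavior: the error output replaces the whole input.
        return error_output
    if not result_path.startswith("$.") or "." in result_path[2:]:
        raise ValueError("this sketch only supports '$' or '$.<field>'")
    # With a field path, the input survives and the error is attached to it.
    merged = copy.deepcopy(state_input)
    merged[result_path[2:]] = error_output
    return merged
```

With the default path the next state only sees the Error and Cause fields, so the Order ID is gone; with "$.error" the next state still receives the full order plus the error details.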

What the application looks like:

  1. Order placing UI [screenshot]
  2. Admin UI [screenshot]

Conclusion
This project demonstrates how Event-Driven Architecture allows you to build systems that degrade gracefully. Instead of losing revenue during a crash, we simply "pause" the traffic and process it when the storm passes.
Technologies used: AWS, Java, Python, Terraform, Grafana.

Follow for more. Thanks for reading!
