How Would I Build a Payment System That Doesn't Lose Money
This is the first post in a series called How Would I Build. I take a real engineering problem, reason through it in plain language, then name the concept at the end. Jargon comes last.
I didn't sit down knowing all of this. Most of it came from asking "okay but why would that break" until something clicked.
Starting here: how do you build a payment system that handles 10,000 transactions per second without losing a single one?
Two customers, one suya guy
There's a suya spot near you. One guy grilling. Two customers walk up at the same time and both want the last stick.
He can't serve both at once. One of them has to wait.
Now imagine he ignores that and gives both customers the same stick. That's your payment bug.
Two people hit withdraw on the same wallet at the exact same time. Both requests check the balance. Both see enough money. Both go through. You've paid out twice what was in the account.
My first instinct was: lock it. But I had to understand what "lock it" meant before it made any sense.
Whoever gets there first locks the wallet in the database before the second request can even read the balance. The second request waits. When the first finishes, the second reads the updated balance and finds nothing left.
In Postgres that's SELECT ... FOR UPDATE, run inside a transaction. The first transaction locks the row and the second waits until the first commits or rolls back. No double spend.
Multiple operations running at the same time and stepping on each other is a concurrency problem. The specific bug where both reads happen before either write is a race condition. The lock pattern is pessimistic locking.
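Here's a rough sketch of the same idea in plain Python, with no database involved. The Wallet class and its threading.Lock are my stand-ins for a wallet row and its row lock, not anything from Postgres. The point is only the ordering: take the lock first, then check, then write.

```python
import threading


class Wallet:
    """In-memory stand-in for one wallet row (hypothetical, for illustration)."""

    def __init__(self, balance):
        self.balance = balance
        self._lock = threading.Lock()  # plays the role of the row lock

    def withdraw(self, amount):
        # Take the lock BEFORE reading, so the balance check and the
        # write happen as one uninterrupted unit.
        with self._lock:
            if self.balance < amount:
                return False
            self.balance -= amount
            return True


wallet = Wallet(balance=100)
results = []


def attempt():
    results.append(wallet.withdraw(100))


# Two customers hit withdraw at the same time.
threads = [threading.Thread(target=attempt) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results), wallet.balance)  # [False, True] 0
```

Exactly one withdrawal succeeds, whichever thread gets the lock first. In a real service the lock has to live in the database, because two requests can land on two different servers that don't share memory.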
The suya guy gets popular. He buys a bigger grill.
The queue is long now. So he upgrades everything. Better grill, more charcoal, faster setup.
It helps. But the queue comes back. Because no matter how big the grill is, one person is still running it. There's a ceiling.
That's vertical scaling. More power to the same machine. It buys time, nothing more.
At some point the grill isn't the problem. One location is.
So he opens a second spot. Then a third. Three customers served simultaneously across three locations. Smaller grills, higher total output.
That's horizontal scaling. More machines, not a bigger one.
Not everything needs the main guy
Most people walking up aren't placing new orders. They're checking if their order is ready. Those people are slowing down the grill.
So he puts a separate person on "is my order ready" questions. That person has a copy of the order list and handles all status checks. The main guy handles new orders only.
Read traffic stops choking write performance.
That's read replicas. The main database handles writes. Copies handle reads.
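A minimal sketch of how an application might split that traffic. The Router class, the server names, and the "starts with SELECT" rule are all assumptions I made up for illustration. Real setups usually do this in a driver, a proxy, or the ORM.

```python
import itertools


class Router:
    """Sends writes to the primary and spreads reads across replicas."""

    def __init__(self, primary: str, replicas: list[str]) -> None:
        self.primary = primary
        self._replicas = itertools.cycle(replicas)  # simple round-robin

    def route(self, sql: str) -> str:
        statement = sql.lstrip().upper()
        # Assumption: anything that is not a plain SELECT counts as a write.
        is_read = statement.startswith("SELECT")
        # A locking read has to run where the write will happen,
        # so it goes to the primary too.
        if is_read and "FOR UPDATE" not in statement:
            return next(self._replicas)
        return self.primary


router = Router("primary", ["replica-1", "replica-2"])
print(router.route("SELECT status FROM orders WHERE id = 7"))  # replica-1
print(router.route("SELECT status FROM orders WHERE id = 8"))  # replica-2
print(router.route("INSERT INTO orders (id) VALUES (9)"))      # primary
```

One thing to keep in mind: replicas can lag slightly behind the primary, so a balance check that decides whether money moves should read from the primary, not from a copy.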
But replicas don't fix everything. If 10,000 new orders are coming in per second, the main grill still has a ceiling. To scale writes, you need more main grills: actual separate locations taking orders, not copies. Customers 1 to 2,000 go to location one. Customers 2,001 to 4,000 go to location two.
That's sharding.
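The routing rule can be sketched in a few lines. I'm assuming each location takes a block of 2,000 customer IDs with no overlap, and that IDs start at 1. The function name shard_for is mine.

```python
SHARD_SIZE = 2_000  # customers per location, taken from the suya example


def shard_for(customer_id: int, shard_size: int = SHARD_SIZE) -> int:
    """Return the zero-based shard index for a one-based customer ID."""
    if customer_id < 1:
        raise ValueError("customer IDs start at 1")
    return (customer_id - 1) // shard_size


print(shard_for(1), shard_for(2_000), shard_for(2_001))  # 0 0 1
```

This is range-based sharding. It's easy to reason about, but if your newest customers are also your busiest, the last shard takes most of the load. Hashing the customer ID is a common alternative that spreads traffic more evenly.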
What if someone at spot one wants to pay for a friend at spot three
Spot one collects the money. Before the suya guy can radio spot three to release the order, his phone dies. Spot three never gets the message. The money is gone and the friend got no suya.
Getting both spots to agree before anything moves works, but it's slow. And if the person coordinating disappears mid-process, both spots freeze waiting for a signal that never comes.
The better approach: don't wait for both at once. Spot one takes the money and writes down "I took this, spot three owes a release." If spot three never confirms, spot one reverses the charge. Each step knows exactly what to undo if something breaks. No upfront coordination.
That pattern is a saga. Every step has a compensating action ready before anything runs. It's a common choice for moving money across services that can't share a single database transaction.
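Here's a small sketch of that idea, assuming each step is just a pair of functions: one that does the work and one that undoes it. The run_saga helper and the ledger dict are hypothetical, and everything lives in memory.

```python
from typing import Callable

# Each step is (action, compensation).
Step = tuple[Callable[[], None], Callable[[], None]]


def run_saga(steps: list[Step]) -> bool:
    """Run steps in order; on failure, undo completed steps in reverse."""
    completed: list[Callable[[], None]] = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True


ledger = {"spot_one_cash": 0, "customer": 500}


def charge() -> None:
    ledger["customer"] -= 500
    ledger["spot_one_cash"] += 500


def refund() -> None:
    ledger["spot_one_cash"] -= 500
    ledger["customer"] += 500


def release_order() -> None:
    # Simulates the phone dying before spot three hears anything.
    raise ConnectionError("spot three unreachable")


def cancel_order() -> None:
    pass  # nothing was released, so there is nothing to cancel


ok = run_saga([(charge, refund), (release_order, cancel_order)])
print(ok, ledger)  # False {'spot_one_cash': 0, 'customer': 500}
```

The charge goes through, the release fails, and the refund puts the money back. A production version would also write each step to durable storage, so a crash in the middle of the saga can be picked up and finished later.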
Multiple spots, but who directs customers
Five suya spots now. A new customer walks up and has no idea which one has the shortest queue.
So someone stands at the entrance. Their only job is to look at all five spots and point each new customer to the next free one.
That's a load balancer. It sits in front of your servers and distributes incoming traffic. Nginx handles this well.
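"Point each customer to the next free spot" roughly maps to a least-connections strategy. Below is a sketch under that assumption. The LoadBalancer class and the spot names are made up, and a real balancer like Nginx does this for you (its least_conn setting is the closest match).

```python
class LoadBalancer:
    """Picks the server with the fewest in-flight requests."""

    def __init__(self, servers: list[str]) -> None:
        self.active = {name: 0 for name in servers}

    def acquire(self) -> str:
        # min() returns the first server with the lowest count,
        # so ties go to whichever was listed first.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server: str) -> None:
        # Call this when the request finishes.
        self.active[server] -= 1


lb = LoadBalancer(["spot-1", "spot-2", "spot-3"])
first = lb.acquire()
second = lb.acquire()
lb.release(first)      # the first customer has been served
third = lb.acquire()
print(first, second, third)  # spot-1 spot-2 spot-1
```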
That person at the entrance is now a single point of failure. So a backup stands right next to them. First one goes down, the backup steps in. That switch is failover. Running both is a high availability setup.
Rate limiting sits on top of this. One customer can only place so many orders per minute. It won't help with legitimate volume, but it stops bad actors from making things worse.
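One way to sketch "so many orders per minute" is a sliding window: remember each customer's recent timestamps and refuse once the window is full. The RateLimiter class below is my own illustration. I pass the time in as an argument so the example is easy to follow.

```python
from collections import defaultdict, deque


class RateLimiter:
    """Allows at most `limit` requests per customer per sliding window."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window = window_seconds
        self._hits = defaultdict(deque)  # customer_id -> request timestamps

    def allow(self, customer_id: str, now: float) -> bool:
        hits = self._hits[customer_id]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


limiter = RateLimiter(limit=2, window_seconds=60)
print(limiter.allow("ada", now=0))   # True
print(limiter.allow("ada", now=10))  # True
print(limiter.allow("ada", now=20))  # False
print(limiter.allow("ada", now=61))  # True
```

With several servers behind the load balancer, these counters would need to live somewhere shared rather than in one process's memory, otherwise each server enforces its own separate limit.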
Something breaks and you don't know where
At 10,000 transactions per second, a failed transaction could have broken at the entrance, the order taker, the payment handler, the records person, or the notification sender. Logs across twenty servers.
The fix: stamp every transaction with a unique ID the moment it enters the system, something like txn-8f3a2c, and write that ID into every log it touches. When something breaks, you search that one ID and see the full journey in one place.
That ID is a correlation ID. Following a request across services using it is distributed tracing. Tools like Datadog or the ELK stack pull all those logs into one searchable place.
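A small sketch of the stamping part, using Python's standard logging module. The logger names, the handle_payment function, and the in-memory ListHandler are assumptions for the demo. In production the handler would ship lines to your log system instead of a list.

```python
import logging
import uuid


def new_correlation_id() -> str:
    # Same shape as txn-8f3a2c, but longer to make collisions unlikely.
    return f"txn-{uuid.uuid4().hex[:12]}"


class ListHandler(logging.Handler):
    """Collects formatted log lines in memory; stands in for a log shipper."""

    def __init__(self) -> None:
        super().__init__()
        self.lines = []

    def emit(self, record: logging.LogRecord) -> None:
        self.lines.append(self.format(record))


handler = ListHandler()
handler.setFormatter(logging.Formatter("%(correlation_id)s %(name)s %(message)s"))
payments = logging.getLogger("payments")
payments.setLevel(logging.INFO)
payments.addHandler(handler)
payments.propagate = False  # keep the demo's output out of the root logger


def handle_payment(amount: int) -> str:
    cid = new_correlation_id()
    extra = {"correlation_id": cid}  # attached to every log record below
    logging.getLogger("payments.entrance").info("received %s", amount, extra=extra)
    logging.getLogger("payments.ledger").info("debited %s", amount, extra=extra)
    logging.getLogger("payments.notify").info("receipt sent", extra=extra)
    return cid


cid = handle_payment(500)
handle_payment(900)  # a second, unrelated transaction

# "Search that one ID": filter every collected line by the correlation ID.
trail = [line for line in handler.lines if line.startswith(cid)]
print(len(trail))  # 3
```

Six lines get logged in total, but searching by one ID returns only the three that belong to that transaction. Across real services, the ID would travel with the request (commonly as an HTTP header) so every service logs the same value.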
The full picture
A request comes in. The load balancer sends it to one of several servers, with a standby ready if the active one dies. Rate limiting filters abuse. The server runs SELECT ... FOR UPDATE to lock the wallet row before touching the balance. Reads hit replicas, writes hit the primary. Sharding splits write load across multiple database nodes. Cross-node transactions use sagas so failures don't leave money in limbo. Every step logs a correlation ID that flows into a centralised system where any transaction can be traced end to end.
The suya guy who once stood alone at one grill now runs five locations, a backup at every entrance, a separate person on status checks, and a ledger that knows exactly what to reverse when something goes wrong.
That's a payment system that doesn't lose money.
I'm Damola, a backend engineer. Find the rest of this series on GitHub. Follow me on Dev.to for the next one.