Your server just charged a customer's card. The bank confirmed it — funds reserved, authorization ID returned. Then, a millisecond later, your server crashes.
Your database never got the memo.
Now your system thinks the payment failed. FicMart's order service re-routes the customer to a failure page, maybe even prompts them to retry. But the bank already has a hold on their money. The customer gets charged twice, or worse — their funds are locked in limbo with no order attached.
This isn't a hypothetical. It's the fundamental challenge of payment processing in distributed systems, and it's deceptively easy to ignore until it happens in production. I built FicMart Payment Gateway — a production-grade payment gateway in Go — specifically to confront this problem head-on. Here's how I thought through it.
The Real Enemy: Partial Failures
Most engineers think about failures in binary terms. Either a request succeeds or it fails. But distributed systems introduce a third, nastier category: partial failures — where some things succeed and others don't, with no clean way to tell which is which.
In payment processing, this is especially dangerous because two systems are involved: your gateway and the bank. When you ask the bank to capture $50, the sequence looks like this:
- Gateway calls bank: "Capture $50 for Auth #123"
- Bank processes it: "Done. Capture ID: #456"
- Gateway prepares to save CAPTURED to the database
- Gateway crashes
- Database still says AUTHORIZED
The money has moved. But your system doesn't know it. And because you have no record of Capture #456, you have no way to reconcile without manual intervention.
This is the problem I set out to solve. The solution came down to three interlocking patterns.
Pattern 1: Capture Intent Before Acting
The core insight is simple: your database needs to know what you're about to do, not just what you've done.
Before the gateway makes any external bank call, it persists the payment in an intermediate state. For a capture, that means transitioning from AUTHORIZED to CAPTURING before touching the bank. A naive state machine looks like this:
```
PENDING → AUTHORIZED → CAPTURED
```
But this leaves a blind spot. If the gateway crashes between AUTHORIZED and CAPTURED, there's no record that a capture was ever attempted. Was the bank called? Did it succeed? You don't know.
The intermediate state closes that gap:
```
PENDING → AUTHORIZED → CAPTURING → CAPTURED → REFUNDING → REFUNDED
              ↓
          VOIDING → VOIDED
```
CAPTURING is not just a status — it's a signal of intent. It says: "A capture was started here. If you find me stuck in this state, you know exactly what to do." The transition into it happens inside the same database transaction that acquires the idempotency lock, so the intent is either fully committed or fully rolled back — no ambiguity.
This is borrowed from database engineering: the Write-Ahead Log pattern, where you record what you're about to do before doing it so recovery is always possible.
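A minimal sketch of the intent-first flow, using an in-memory map in place of the real payments table and a stubbed bank client (names here are illustrative, not the gateway's actual API):

```go
package main

import (
	"errors"
	"fmt"
)

// store stands in for the payments table; the real gateway persists
// these states in PostgreSQL inside a transaction.
var store = map[string]string{"pay_1": "AUTHORIZED"}

// bankCapture stands in for the external bank call.
func bankCapture(paymentID string) error { return nil }

// Capture records intent (CAPTURING) before calling the bank, so a
// crash mid-flight always leaves recoverable evidence behind.
func Capture(paymentID string, crashBeforeCommit bool) error {
	if store[paymentID] != "AUTHORIZED" {
		return errors.New("invalid state for capture")
	}
	store[paymentID] = "CAPTURING" // write-ahead: the intent is durable first

	if err := bankCapture(paymentID); err != nil {
		return err
	}
	if crashBeforeCommit {
		// Simulated crash: the bank succeeded, but CAPTURED was never saved.
		return errors.New("crash")
	}
	store[paymentID] = "CAPTURED"
	return nil
}

func main() {
	_ = Capture("pay_1", true)
	fmt.Println(store["pay_1"]) // stuck in CAPTURING — the retry worker's cue
}
```

The point of the sketch: even in the worst-case crash, the row is stuck in CAPTURING rather than silently left in AUTHORIZED, so recovery has something to find.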
For authorizations specifically, this gets more nuanced. PCI compliance means you can never store raw card details, so if a crash happens during authorization, there's no way to retry it — the card data is gone. Rather than pretending this is solvable automatically, PENDING authorizations older than 10 minutes are marked FAILED and flagged for manual reconciliation. Some failures can't be fully automated away, and being honest about that is better than silently losing money.
The domain layer enforces all of this with zero database or HTTP dependencies. Business rules — you can't void a captured payment, you can't refund an unauthorized one — live in pure Go. The domain is the source of truth for what's allowed, completely independent of what's stored.
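A pure-Go transition table is one way such a domain layer can be sketched (a simplified stand-in, not the repository's actual domain code):

```go
package main

import "fmt"

// allowed encodes the legal state machine transitions in pure Go —
// no database or HTTP imports anywhere in the domain layer.
var allowed = map[string][]string{
	"PENDING":    {"AUTHORIZED", "FAILED"},
	"AUTHORIZED": {"CAPTURING", "VOIDING"},
	"CAPTURING":  {"CAPTURED", "FAILED"},
	"CAPTURED":   {"REFUNDING"},
	"REFUNDING":  {"REFUNDED"},
	"VOIDING":    {"VOIDED"},
}

// CanTransition reports whether moving between two states is legal.
func CanTransition(from, to string) bool {
	for _, next := range allowed[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(CanTransition("CAPTURED", "VOIDING"))     // false: can't void a captured payment
	fmt.Println(CanTransition("AUTHORIZED", "CAPTURING")) // true
}
```

Because the table is plain data, the rules are trivially unit-testable without spinning up a database.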
Pattern 2: Background Workers That Heal the System
Intermediate states create the evidence. Background workers act on it.
The RetryWorker polls the database on a configurable interval, looking for payments stuck in CAPTURING, VOIDING, or REFUNDING past their retry window. For each one, it re-invokes the appropriate bank operation using the original idempotency key.
That last part is what makes this safe. Because the bank supports idempotency, sending the same key twice doesn't trigger a second charge — it returns the cached result from the first attempt. The worker doesn't need to know whether the original call succeeded or not. If the bank already processed it, we get the success response back and update the database. If it didn't, we process it now. Either way, the database eventually converges to reality.
Before any retry decision is made, errors are classified:
- Transient errors (timeouts, 500s) — retry with exponential backoff and jitter to avoid hammering the bank
- Permanent errors (card declined, insufficient funds, auth expired) — fail fast, no retry
- Business rule violations (invalid state transitions) — reject immediately at the domain layer
This classification is what separates a robust retry system from one that makes things worse. Retrying a permanent error doesn't fix anything — a declined card won't become approved on the fifth attempt. Treating it as retryable wastes cycles and delays the customer from finding out their payment failed.
The ExpirationWorker handles a different edge case: authorized payments approaching the bank's 7-day authorization window. Rather than trusting the local clock blindly, the worker checks the bank's state before marking anything expired — with a 48-hour grace period to account for distributed clock skew.
Pattern 3: Idempotency as the Safety Net
Recovery workers only work if retrying is safe. That guarantee comes entirely from idempotency.
Every external-facing operation requires an Idempotency-Key header. But the enforcement here goes deeper than in most implementations.
Idempotency state is stored in PostgreSQL, not Redis — deliberately. This means it survives restarts and is subject to ACID guarantees. The idempotency_keys table does two jobs simultaneously.
It's a response cache. Once an operation completes, the result is stored against the key. Future requests with the same key get the cached response instantly, without touching the bank.
It's a distributed lock. A locked_at timestamp is set when an operation begins and cleared when it finishes. If two requests arrive with the same key at the same time, the second enters a polling loop — checking every 100ms — until the first completes, then receives the same response. No double-processing, no race conditions.
There's also a subtler protection: a request_hash (SHA-256 of the request body) stored alongside each key. If a client tries to reuse an idempotency key with different parameters — a different amount, a different payment — the gateway rejects it with an IDEMPOTENCY_MISMATCH error. This prevents a class of silent bugs where key reuse returns a stale result for a completely different operation.
The three patterns form a chain: intermediate states give workers something to act on → workers retry using the original idempotency key → idempotency makes those retries safe. Remove any one of them and the others stop working.
What I'd Do Differently at Scale
Building this taught me as much about the limits of my approach as about its strengths.
The most important change in a high-traffic environment would be moving idempotency lookups to Redis. PostgreSQL works here, but for a gateway handling thousands of requests per second, sub-millisecond idempotency checks matter. I'd keep Postgres as the durable fallback but use Redis as the hot path.
I'd also move to event sourcing for payment state. Right now, the payments table stores the current state — you can see that a payment is CAPTURED, but you can't see the full timeline of how it got there. An append-only payment_events table would make debugging orphaned authorizations significantly easier: you'd be able to reconstruct exactly where the gap between the bank's state and yours opened up.
The retry worker would also benefit from FOR UPDATE SKIP LOCKED on its database queries. Currently, multiple worker instances compete for the same stuck payments. Skip-locked semantics let workers divide the work without blocking each other — a meaningful concurrency improvement once the system is under real load.
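The claim query could take a shape like this (table and column names are illustrative, not the repository's actual schema):

```go
package main

import "fmt"

// claimStuckPayments shows where SKIP LOCKED would sit: rows already
// locked by another worker are skipped rather than waited on, so
// concurrent workers each claim a disjoint batch.
const claimStuckPayments = `
SELECT id, state, idempotency_key
FROM payments
WHERE state IN ('CAPTURING', 'VOIDING', 'REFUNDING')
  AND updated_at < now() - $1::interval
ORDER BY updated_at
LIMIT $2
FOR UPDATE SKIP LOCKED`

func main() {
	fmt.Println(claimStuckPayments)
}
```

Note that SKIP LOCKED only has this effect inside a transaction that holds the row locks until the worker finishes (or hands off) the batch.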
Finally, I'd add chaos testing: deliberately crashing the gateway at the exact millisecond between a bank response and the database commit. That's the failure mode this entire system is designed to handle, and the only way to be truly confident it works is to make it happen on purpose.
What This Really Taught Me
Payment systems forced me to think about a dimension of engineering I hadn't fully internalized before: correctness under failure, not just correctness under normal conditions.
It's easy to build a service that works when everything goes right. The interesting engineering happens when you ask: what is the worst possible moment for this process to crash, and what does the system look like afterward? That question shapes every decision in this gateway — the intermediate states, the write-ahead pattern, the idempotency locking, the recovery workers.
The result is a system that doesn't just handle payments. It handles uncertainty. And in distributed systems, uncertainty is the only thing you can count on.
The full source code is available on GitHub: DanielPopoola/ficmart-payment-gateway