ruth mhlanga

Why We Moved Off Stripe Billing and Built Our Own Ledger on a Barebones Postgres

We needed to accept payments from 14 countries where Stripe, PayPal, and Gumroad are blocked by sanctions or regulatory firewalls.
Our SLA was clear: money must hit the merchant's balance within 20 minutes of a successful card capture so they could pay suppliers the same day.
We started on Stripe's hosted checkout with Connect Express in March 2024 and hit a wall at 37 % decline rates for Turkish cards issued by Isbank.
The error message came back as raw JSON: {"type": "card_error", "code": "invalid_request_error", "decline_code": "insufficient_funds_unspecified"}.
We called Stripe support; they told us it was a sanctions filter they couldn't disable.
Shipping a new regional acquirer would take nine months of compliance paperwork, so we decided to go around the platform entirely.

What We Tried First (And Why It Failed)
We punted to a regional aggregator that promised localized rails.
The service ran on a shared MySQL 8.0 cluster with a read lag of 2–3 s, which was fine for invoicing but unacceptable when we had to verify the same transaction against our own ledger inside a 20-minute SLA.
On Black Friday we processed 78,000 transactions; the aggregator's eventual-consistency window grew to 58 s and we missed our SLA for 1,234 merchants who needed same-day payouts.
We also discovered that the aggregator's ledger was only eventually consistent with its own acquirer. If a chargeback hit 30 days later, we could not debit the merchant until an explicit chargeback webhook arrived, and that webhook came as late as 72 hours after the event.
We had to refund 42 merchants manually and absorbed $118,000 in disputed charges while waiting for their eventual credit.

The Architecture Decision
In May 2025 we decided to run our own ledger on a single 16 vCPU / 64 GB bare-metal Postgres 15 instance in Frankfurt.
We put PgBouncer in front so that 6,000 mostly idle client connections would not each hold a Postgres backend process and starve the box of memory.
We built a tiny idempotent payment service in Go that talks ISO-8583 to the Frankfurt acquirer over a dedicated 1 Gbps private link.
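To make "idempotent" concrete, here is a minimal sketch of the idea, not our production Go code. It uses Python with an in-memory SQLite database standing in for Postgres, and the `idempotency_key`, `merchant_id`, and `amount_minor` columns are hypothetical names chosen for illustration. The point is that a retried capture with the same key cannot create a second ledger row.

```python
import sqlite3


def open_ledger() -> sqlite3.Connection:
    # In-memory SQLite stands in for Postgres; column names are hypothetical.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE payments (
            idempotency_key TEXT PRIMARY KEY,
            merchant_id     TEXT NOT NULL,
            amount_minor    INTEGER NOT NULL
        )
        """
    )
    return conn


def record_capture(conn: sqlite3.Connection, idempotency_key: str,
                   merchant_id: str, amount_minor: int) -> bool:
    """Insert the capture once; return True only if this call created the row."""
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute(
                "INSERT INTO payments VALUES (?, ?, ?)",
                (idempotency_key, merchant_id, amount_minor),
            )
        return True
    except sqlite3.IntegrityError:
        # The key already exists, so this is a retry of an earlier capture.
        return False
```

On Postgres the same effect is usually achieved with `INSERT ... ON CONFLICT DO NOTHING` against a unique key, which avoids raising an error on the retry path.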
Every authorization and capture message is written to an outbox table with WAL-shipping replication to a standby in Amsterdam, so we can fail over in under 45 s.
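The outbox idea can be sketched roughly as follows. This is an illustration under assumptions, again using in-memory SQLite in place of Postgres; the table layout, the `published` flag, and the function names are hypothetical. The ledger row and the outbox message are written in one transaction, so either both exist or neither does, and a separate relay drains unpublished messages.

```python
import json
import sqlite3


def open_db() -> sqlite3.Connection:
    # SQLite stands in for Postgres; the schema here is hypothetical.
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE payments (
            id INTEGER PRIMARY KEY,
            merchant_id TEXT NOT NULL,
            amount_minor INTEGER NOT NULL
        );
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
        """
    )
    return conn


def capture_with_outbox(conn, merchant_id, amount_minor, fail=False):
    """Write the payment and its outbox message atomically."""
    with conn:  # one transaction: commit both rows or roll back both
        cur = conn.execute(
            "INSERT INTO payments (merchant_id, amount_minor) VALUES (?, ?)",
            (merchant_id, amount_minor),
        )
        payment_id = cur.lastrowid
        if fail:
            raise RuntimeError("simulated crash between the two writes")
        message = {"kind": "capture", "payment_id": payment_id,
                   "merchant_id": merchant_id, "amount_minor": amount_minor}
        conn.execute("INSERT INTO outbox (payload) VALUES (?)",
                     (json.dumps(message),))
    return payment_id


def drain_outbox(conn):
    """Return unpublished messages in insertion order and mark them published."""
    with conn:
        rows = conn.execute(
            "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
        ).fetchall()
        conn.executemany("UPDATE outbox SET published = 1 WHERE id = ?",
                         [(row[0],) for row in rows])
    return [json.loads(row[1]) for row in rows]
```

A failure between the two inserts rolls back the payment as well, so the relay never has to guess whether a ledger row is missing its message.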
We still use Stripe for refunds outside the blocked countries; inside the blocked region we generate SEPA Instant credits directly from our own ledger.
The ledger schema has only four tables: payments, transfers, disputes, and merchant_locks.
We backfill balances nightly with a partitioned daily batch (pg_partman) because a real-time SELECT SUM(amount) over 37 million rows gave us 12 ms at p99 but 1.8 s in the worst case, which is fast enough for daily reporting but not for the 20-minute SLA.
All writes are synchronous; we tuned commit_delay to 50 (microseconds) and wal_buffers to 16 MB after profiling latency spikes during fsync storms.
We added a balance_snapshot column that is updated at the end of each minute by a lightweight consumer reading from a logical replication slot, so we can serve a merchant's usable balance in under 50 ms without summing the whole history.
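The snapshot trick is easier to see in code. The sketch below is a simplified in-memory model, not our schema: the `MerchantBalance` class and its fields are hypothetical, and the periodic `refresh_snapshot` call stands in for the once-a-minute update. Reads return the stored snapshot, and each refresh only folds in entries added since the previous one.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MerchantBalance:
    # Append-only ledger entries in minor units (positive = credit, negative = debit).
    entries: List[int] = field(default_factory=list)
    snapshot: int = 0  # balance as of the last refresh
    folded: int = 0    # how many entries the snapshot already includes

    def append(self, amount_minor: int) -> None:
        self.entries.append(amount_minor)

    def refresh_snapshot(self) -> None:
        # Fold in only the entries added since the last refresh,
        # so the cost is proportional to new activity, not total history.
        self.snapshot += sum(self.entries[self.folded:])
        self.folded = len(self.entries)

    def usable_balance(self) -> int:
        # Constant-time read; may trail the ledger by up to one refresh interval.
        return self.snapshot
```

The trade-off is staleness: a balance read can lag the ledger by up to one refresh interval, which is acceptable against a 20-minute SLA.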

What The Numbers Said After
After cutover on 12 June 2025, the 20-minute SLA was met 99.8 % of the time; the outliers were network hiccups on the Frankfurt acquirer side.
The Postgres instance averaged 32 % CPU, 12 GB RAM, and 2,800 TPS with a 10 ms median commit latency.
We shrank our infra bill from $4.2 k/month on Stripe Connect + regional aggregator to $1.1 k/month on the bare-metal box and two small cloud VMs for monitoring.
The ledger's eventual-consistency window dropped to 12 ms; the aggregator's 72-hour chargeback lag is gone because we now debit the merchant immediately and credit back only if the dispute webhook arrives.
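A rough sketch of that debit-first dispute flow, under assumptions: the `Ledger` class, its method names, and the `merchant_won` flag are hypothetical, and I am assuming the dispute webhook carries the final outcome. The merchant is debited as soon as the dispute opens, and the amount is credited back only when a resolution in the merchant's favor arrives.

```python
from collections import defaultdict


class Ledger:
    def __init__(self):
        self.balances = defaultdict(int)  # merchant_id -> balance in minor units
        self.disputes = {}                # dispute_id -> [merchant_id, amount, status]

    def credit(self, merchant_id, amount_minor):
        self.balances[merchant_id] += amount_minor

    def open_dispute(self, dispute_id, merchant_id, amount_minor):
        """Debit immediately; repeated notifications for the same dispute are ignored."""
        if dispute_id in self.disputes:
            return
        self.balances[merchant_id] -= amount_minor
        self.disputes[dispute_id] = [merchant_id, amount_minor, "open"]

    def resolve_dispute(self, dispute_id, merchant_won):
        """Credit back only when the resolution favors the merchant."""
        record = self.disputes.get(dispute_id)
        if record is None or record[2] != "open":
            return  # unknown or already resolved: nothing to do
        merchant_id, amount_minor, _ = record
        if merchant_won:
            self.balances[merchant_id] += amount_minor
        record[2] = "won" if merchant_won else "lost"
```

Keying every step on the dispute ID means a redelivered webhook cannot debit or credit the merchant twice.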
The biggest surprise was the cost of PCI-DSS: we spent $87 k on a QSA audit but saved $124 k in Stripe's platform fees over six months, so the payback period was twelve weeks.
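For anyone who wants to rerun the payback math with their own numbers, here is a small back-of-envelope helper. It assumes savings accrue evenly over the period, which is a simplification; the function name is hypothetical. With the audit cost and six-month fee savings quoted above, the even-accrual assumption gives roughly 18 weeks, so a shorter payback figure implies front-loaded savings or other savings counted alongside the fees.

```python
def payback_weeks(upfront_cost: float, total_savings: float,
                  savings_period_weeks: float) -> float:
    """Weeks until cumulative savings cover the upfront cost, assuming even accrual."""
    weekly_savings = total_savings / savings_period_weeks
    return upfront_cost / weekly_savings


# $87k audit, $124k saved over six months (about 26 weeks).
weeks = payback_weeks(87_000, 124_000, 26)
print(f"{weeks:.1f} weeks")
```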

What I Would Do Differently
I would have started with a dual-write pattern to a shared outbox from day one instead of trying to bolt the aggregator's ledger onto our own.
In the first month we spent 34 engineering hours reconciling the two ledgers every time the acquirer's callback arrived after our internal commit.
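Much of that reconciliation work boils down to a keyed comparison. The sketch below is illustrative only: it assumes each ledger can be exported as a mapping from transaction ID to amount in minor units, and the `reconcile` function and its result keys are hypothetical names.

```python
from typing import Dict, List


def reconcile(internal: Dict[str, int], external: Dict[str, int]) -> Dict[str, List[str]]:
    """Compare two ledgers keyed by transaction ID and list the discrepancies."""
    shared = internal.keys() & external.keys()
    return {
        # Committed internally, but the other side has not reported it yet.
        "missing_external": sorted(internal.keys() - external.keys()),
        # Reported by the other side, but absent from our ledger.
        "missing_internal": sorted(external.keys() - internal.keys()),
        # Present on both sides with different amounts.
        "amount_mismatch": sorted(t for t in shared if internal[t] != external[t]),
    }
```

Running a check like this on a schedule turns a late callback into a `missing_external` entry that clears on the next pass, instead of a manual investigation.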
I would also isolate the ledger on a separate VLAN from the rest of the app to reduce noisy-neighbor risk; during the Black Friday traffic spike we saw 18 % higher fsync latency because the API fleet and the ledger were on the same physical host.
Finally, I would budget for a hot standby in a different availability zone from the start; the Amsterdam standby saved us when the Frankfurt data center had a 15-minute network partition, but the replication slot lag spiked to 2.3 s and we had to switch manually. Next time I'll automate the promotion or use Patroni with Raft.
