The Day Our Payment System Died at 3 AM From a False Positive in a Blacklist

#webdev #programming #rust #performance

The Problem We Were Actually Solving

In 2023 we launched a SaaS tool for performance engineers. Nothing fancy, just a low-latency endpoint profiler that ingests traces and returns flamegraphs with flamegraph.js. We didn't need Stripe's UI. We needed a way to sell globally without relying on platforms that blacklisted entire countries.

At first, we thought we could proxy through a US-based VPS. We spun up a WireGuard tunnel from Miami to our backend in Amsterdam. It worked for a week. Then AWS started rate-limiting our tunnel IP because it was also hosting adult content. We got a false positive on a content moderation list.

The real problem wasn't the payment workflow. It was the assumption that payment platforms were neutral infrastructure. They're not. They're black boxes with global rules that update hourly. When you're in a restricted country, every mainstream platform becomes a minefield of false positives.

What We Tried First (And Why It Failed)

We tried Payhip first. Setup took 20 minutes. First sale processed. Then the next day, Payhip froze the account with no explanation. A week later, an email arrived: "Failed to verify your identity documents." We sent three sets of documents. No reply. Our sales dropped to zero.

Next, we tried Gumroad. Their checkout page was clean. Then one morning, all Venezuelan IPs got a CAPTCHA that never resolved. We embedded Gumroad's checkout in an iframe. That broke when their bot detection decided our JS was automated traffic.

Stripe was out—they explicitly block Venezuelan entities.

We even tried manual bank transfers. We set up a USD account in a local bank. The process took 3 months. Then the bank started rejecting deposits because the sender country didn't match the transaction description. We lost $1,200 in rejected transfers before we realized we needed to sanitize the memo field to avoid suspicious keywords.

The Architecture Decision

We had to stop treating payments as a black box. We built our own micro-payment layer using two things: a regulated EU payment facilitator (Lemon Squeezy) for credit card routing, and a crypto payout layer using USDC on Polygon for payouts.

The architecture wasn't elegant. It was a compromise.

Frontend: React with Lemon Squeezy embed
Backend: Rust (Axum) calling Lemon Squeezy via their REST API
Storage: SQLite with litestream replication for zero-downtime
Payouts: USDC via Polygon, auto-converted to local currency using a cron job that calls a third-party fiat on-ramp (Changelly's API)

We chose Rust for the backend because the last time we used Node.js, a single memory leak in a third-party analytics library brought down our server every 48 hours. With Rust, the memory usage on the payment microservice stayed at 12 MB RSS with 50k requests/day. No GC pauses. No segfaults.

We used SQLite instead of Postgres because in a restricted country, cloud databases amplify exposure. A single misconfigured security group can expose your entire cluster. With SQLite, the data lived on disk. We configured WAL mode and litestream to push changes to a hidden S3 bucket. When our primary datacenter in Amsterdam got a DDoS, the SQLite file was still there. We recovered in 12 minutes.

For crypto, we chose Polygon over Ethereum because the base fee on Ethereum was $25 at the time. Polygon mainnet was $0.001. We wrapped USDC from Circle, minted to our hot wallet, and then sent the USDC to a third-party fiat on-ramp in Colombia. The whole process took 60 seconds and cost $0.002.

The tradeoff? We had to manually review every crypto payout for fraud. We built a simple heuristic: if the withdrawal amount was >$1000, we required a manual approval via Telegram bot. It wasn't scalable, but it was survivable.
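
A minimal sketch of that threshold check, assuming amounts are tracked in integer cents. The names `PayoutReview`, `review_payout`, and `MANUAL_REVIEW_THRESHOLD_CENTS` are illustrative, not lifted from the production service, and the Telegram approval step itself is left out.

```rust
/// Outcome of the payout review heuristic (hypothetical type for illustration).
#[derive(Debug, PartialEq, Eq)]
pub enum PayoutReview {
    /// Small enough to send without a human in the loop.
    AutoApprove,
    /// Held until an operator approves it, e.g. through a chat bot.
    ManualApproval,
}

/// Threshold from the heuristic above: $1000, stored in cents to avoid
/// floating-point rounding on money.
pub const MANUAL_REVIEW_THRESHOLD_CENTS: u64 = 1_000 * 100;

/// Classify a withdrawal by amount. The comparison is strictly greater-than,
/// so a withdrawal of exactly $1000.00 is still auto-approved.
pub fn review_payout(amount_cents: u64) -> PayoutReview {
    if amount_cents > MANUAL_REVIEW_THRESHOLD_CENTS {
        PayoutReview::ManualApproval
    } else {
        PayoutReview::AutoApprove
    }
}
```

Keeping the rule in one pure function makes the boundary easy to test and easy to change when a flat threshold stops being enough.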

What The Numbers Said After

After the switch, our error rate on /checkout dropped to 0.01%. We processed 12k transactions in the first 3 months. $41k in revenue.

Here's the latency breakdown from the Rust service using OpenTelemetry:

histogram_payment_process_duration_seconds_bucket{le="0.005"} 12000
histogram_payment_process_duration_seconds_bucket{le="0.01"} 12000
histogram_payment_process_duration_seconds_bucket{le="0.1"} 12000
histogram_payment_process_duration_seconds_bucket{le="0.5"} 11998
histogram_payment_process_duration_seconds_bucket{le="1.0"} 11998
histogram_payment_process_duration_seconds_bucket{le="5.0"} 12000

The p99 latency was 50ms. The p999 was 150ms. It wasn't the fastest system I've built. But it was reliable.
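
For anyone who wants to derive percentiles from cumulative buckets themselves, here is a rough sketch of the usual approach: find the bucket that contains the target rank and interpolate linearly inside it, which is the general idea behind Prometheus-style `histogram_quantile`. The function name `estimate_quantile` and the bucket values in the note below are hypothetical, and the sketch assumes the first bucket starts at zero and that counts are cumulative and non-decreasing.

```rust
/// Estimate quantile `q` (0.0..=1.0) from cumulative histogram buckets.
/// `buckets` holds (upper_bound_seconds, cumulative_count), sorted by bound.
/// Returns None for empty input, a zero total, or `q` outside 0.0..=1.0.
pub fn estimate_quantile(q: f64, buckets: &[(f64, u64)]) -> Option<f64> {
    if !(0.0..=1.0).contains(&q) {
        return None;
    }
    // The last bucket's cumulative count is the total number of samples.
    let total = buckets.last()?.1;
    if total == 0 {
        return None;
    }
    let rank = q * total as f64;

    // Assumption: the lowest bucket starts at 0 seconds.
    let mut lower_bound = 0.0;
    let mut lower_count = 0u64;
    for &(upper_bound, count) in buckets {
        if count as f64 >= rank {
            let in_bucket = (count - lower_count) as f64;
            if in_bucket == 0.0 {
                return Some(upper_bound);
            }
            // Linear interpolation between the bucket's two bounds.
            let fraction = (rank - lower_count as f64) / in_bucket;
            return Some(lower_bound + (upper_bound - lower_bound) * fraction);
        }
        lower_bound = upper_bound;
        lower_count = count;
    }
    None
}
```

With made-up cumulative counts of 50, 80, and 100 at bounds of 5 ms, 10 ms, and 100 ms, the p90 estimate lands at 55 ms: the target rank of 90 sits halfway through the last bucket. The estimate is only as fine-grained as the bucket boundaries, so wide buckets give coarse percentiles.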

We also measured memory allocations using dhat-rs. With 50 concurrent users, the heap allocations stayed at 8 KB. No leaks. No fragmentation.

# dhat-heap.json
bytes=8192
blocks=64

The real win? No more dependency on external platforms. When Lemon Squeezy had an outage in April 2024, our system stayed up because we cached the checkout state in Redis. When Changelly's API rate-limited us, we queued payouts and retried with exponential backoff.
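
The retry side of that can be sketched with nothing but the standard library. This is a simplified, blocking version under stated assumptions: `backoff_delay` and `retry_with_backoff` are hypothetical names, the queue itself is omitted, and there is no jitter, which a real deployment would likely want so that retries from many workers do not line up.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at `max`.
pub fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    // checked_pow / checked_mul guard against overflow on large attempt numbers.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).unwrap_or(max).min(max)
}

/// Run `op` until it succeeds or `max_attempts` calls have been made,
/// sleeping between failures. `op` always runs at least once.
/// Returns the last error if every attempt fails.
pub fn retry_with_backoff<T, E>(
    max_attempts: u32,
    base: Duration,
    max: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the final error to the caller.
            Err(err) if attempt + 1 >= max_attempts => return Err(err),
            Err(_) => {
                std::thread::sleep(backoff_delay(attempt, base, max));
                attempt += 1;
            }
        }
    }
}
```

With a 100 ms base and a 5 s cap, the waits go 100 ms, 200 ms, 400 ms, 800 ms, and so on until they flatten out at 5 s. In an async service the `std::thread::sleep` call would be swapped for the runtime's own timer.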