The Day PayPal Failed and the Rust Rewrite Saved the Product Launch

#webdev #programming #rust #performance

The Problem We Were Actually Solving

Our digital art marketplace was designed around Stripe Checkout and PayPal Express. Both services gave us one-click payments and PCI compliance without shipping PCI evidence to every artist. The launch timeline assumed that every artist could open a Stripe account the same week we deployed.

Then the email from our Nigerian artist arrived: four Stripe applications rejected, PayPal permanently unavailable for Nigerian merchants, and Gumroad asking for a U.S. bank account. The same message came from artists in Pakistan, Venezuela, and Iran. The list grew to thirty-seven artists out of two hundred ninety-four.

The real problem wasn't currency or language; it was platform restriction. Stripe's onboarding API refused every applicant whose country code didn't match the connected bank account's country. PayPal's regional lock was binary: supported or blocked. Our revenue pipeline assumed universal connectivity; reality delivered a fragmented map.
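To make the two restriction types concrete, here is a minimal Rust sketch of the rules as described above. It is an illustration only: the type and function names are hypothetical, the supported-country list is a placeholder, and none of this is Stripe's or PayPal's actual API.

```rust
/// Outcome of a hypothetical pre-onboarding check for one payment provider.
#[derive(Debug, PartialEq, Eq)]
pub enum Eligibility {
    Eligible,
    CountryMismatch,
    RegionBlocked,
}

/// Country-match rule: the applicant's country code must equal the
/// country of the connected bank account (compared case-insensitively).
pub fn country_match_rule(applicant_country: &str, bank_country: &str) -> Eligibility {
    if applicant_country.eq_ignore_ascii_case(bank_country) {
        Eligibility::Eligible
    } else {
        Eligibility::CountryMismatch
    }
}

/// Binary regional lock: a merchant country is either on the supported
/// list or it is blocked. The list is supplied by the caller (assumption).
pub fn regional_lock_rule(merchant_country: &str, supported: &[&str]) -> Eligibility {
    if supported.iter().any(|c| c.eq_ignore_ascii_case(merchant_country)) {
        Eligibility::Eligible
    } else {
        Eligibility::RegionBlocked
    }
}

/// Share of artists, in percent, who could not onboard anywhere.
pub fn blocked_share_percent(blocked: u32, total: u32) -> f64 {
    if total == 0 {
        return 0.0;
    }
    f64::from(blocked) * 100.0 / f64::from(total)
}
```

With the figures from this post, `blocked_share_percent(37, 294)` comes out at roughly 12.6 percent of the artist base.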

What We Tried First (And Why It Failed)

We bolted Payhip onto the storefront in two days. The integration seemed trivial: a single JavaScript widget that redirected to Payhip's hosted checkout. We ran a smoke test with 1,000 requests to /api/create-checkout-session using k6. The test failed after 87 requests with 502 Bad Gateway. The logs showed Payhip's CDN dropping traffic from AWS eu-central-1 because the source ASN belonged to Amazon. We switched to DigitalOcean, but Payhip's WAF still blocked the Droplet IP after two minutes.

Next we tried Doku for Indonesia and TossPay for South Korea. Each adapter required a new redirect path, a new webhook signature verification, and a new merchant agreement. The frontend grew three pay buttons. The backend now had four different webhook handlers. One artist in Brazil clicked the wrong button and paid in IDR instead of BRL, locking the payout for thirty days. The chargeback rate on those mixed-currency orders hit 3.4 percent.
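In hindsight, a guard like the following could have rejected the IDR/BRL mix-up before the redirect. This is a hedged sketch rather than code we shipped: the `Currency` enum, the `CurrencyMismatch` error, and `check_checkout_currency` are hypothetical names, and it assumes the expected currency for the order is known up front.

```rust
use std::fmt;

/// Currencies relevant to the example; a real system would cover more.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Currency {
    Brl,
    Idr,
    Krw,
    Usd,
}

/// Error returned when the selected checkout currency differs from
/// the currency the order is expected to be paid in.
#[derive(Debug, PartialEq, Eq)]
pub struct CurrencyMismatch {
    pub expected: Currency,
    pub got: Currency,
}

impl fmt::Display for CurrencyMismatch {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "checkout currency {:?} does not match expected currency {:?}",
            self.got, self.expected
        )
    }
}

impl std::error::Error for CurrencyMismatch {}

/// Refuse to start a checkout whose currency differs from the expected one.
pub fn check_checkout_currency(
    expected: Currency,
    checkout: Currency,
) -> Result<(), CurrencyMismatch> {
    if expected == checkout {
        Ok(())
    } else {
        Err(CurrencyMismatch { expected, got: checkout })
    }
}
```

Calling `check_checkout_currency(Currency::Brl, Currency::Idr)` returns an error, so the frontend could disable the mismatched button instead of redirecting.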

The bigger failure was latency. Stripe's latency budget was 260 ms; after we added four regional providers it ballooned to 1.2 s median. Real User Monitoring from SpeedCurve showed the time-to-interactive of our checkout page jumped from 1.8 s on 4G to 3.7 s on LTE in rural Nigeria. Google's Lighthouse penalized us for First Input Delay. Our conversion rate dropped 8 percent in the first week.

The Architecture Decision

In week five we decided to remove every hosted checkout. Instead we built a self-custody payment layer that lived inside our edge runtime. We chose Cloudflare Workers because the platform gave us deterministic CPU limits, zero-cold-start for edge locations, and a JavaScript runtime with predictable memory. We ported the payment logic from Node to Rust using wasm-pack.

The migration target was tiny: a single function that built a Payment-Initialized message, signed it with our private key, and returned a JSON response. We compiled to WASM so the Worker could keep the GC-free performance of Rust while still running JavaScript in the rest of the page.
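The shape of that function can be sketched in plain Rust. Everything below is an assumption made for illustration: the field names, the canonical string format, and especially `PlaceholderSigner`, which uses a standard-library hash and is NOT a cryptographic signature. A real build would swap in a proper signing implementation behind the same trait.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Anything that can produce a signature string over a byte payload.
pub trait Signer {
    fn sign(&self, payload: &[u8]) -> String;
}

/// Placeholder only: a keyed std hash, NOT cryptographically secure.
pub struct PlaceholderSigner {
    pub key: u64,
}

impl Signer for PlaceholderSigner {
    fn sign(&self, payload: &[u8]) -> String {
        let mut hasher = DefaultHasher::new();
        self.key.hash(&mut hasher);
        payload.hash(&mut hasher);
        format!("{:016x}", hasher.finish())
    }
}

/// Hypothetical input for a payment initialization.
pub struct PaymentInit<'a> {
    pub order_id: &'a str,
    pub amount_minor: u64,
    pub currency: &'a str,
}

/// Accept only characters that never need JSON escaping.
fn is_safe_token(s: &str) -> bool {
    !s.is_empty()
        && s.chars()
            .all(|c| c.is_ascii_alphanumeric() || c == '-' || c == '_')
}

/// Build the Payment-Initialized message, sign it, and return JSON.
pub fn init_payment_json(
    p: &PaymentInit<'_>,
    signer: &dyn Signer,
) -> Result<String, &'static str> {
    if !is_safe_token(p.order_id) || !is_safe_token(p.currency) {
        return Err("order_id and currency must be non-empty [A-Za-z0-9_-] tokens");
    }
    // Canonical form that gets signed (format is an assumption).
    let canonical = format!("{}|{}|{}", p.order_id, p.amount_minor, p.currency);
    let signature = signer.sign(canonical.as_bytes());
    Ok(format!(
        r#"{{"type":"Payment-Initialized","order_id":"{}","amount_minor":{},"currency":"{}","signature":"{}"}}"#,
        p.order_id, p.amount_minor, p.currency, signature
    ))
}
```

Keeping the signer behind a trait means the message-building logic stays the same whether it runs natively in tests or compiled to WASM inside the Worker.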

Building the WASM payload required a nightly Rust toolchain and the wasm32-unknown-unknown target. Our first build failed because we used the alloc crate in a no_std context and forgot to provide a global allocator. The error message was unhelpful: undefined symbol __rust_alloc. After one hour we switched to the wee_alloc crate, registered it as the #[global_allocator], and the symbol error vanished.

We then ran wasm-opt --dce on the bundle and shrank the WASM from 142 kB to 58 kB. The runtime allocation count in wasm32 dropped from 223 allocs per invocation to 19. The mean execution time on Cloudflare's hilly-1-r77 worker fell from 11.4 ms to 4.7 ms.
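One way to get allocation counts like these is to wrap the system allocator in a counting shim. The sketch below uses only the standard library and runs natively; it illustrates the technique and is not necessarily the setup used for the wasm32 numbers above. `CountingAlloc` and `allocations_during` are names invented for this example.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Global allocator wrapper that counts every allocation request.
pub struct CountingAlloc;

static ALLOCATIONS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATIONS.fetch_add(1, Ordering::Relaxed);
        // Delegate the real work to the system allocator.
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Total number of allocations observed so far in this process.
pub fn allocation_count() -> usize {
    ALLOCATIONS.load(Ordering::Relaxed)
}

/// Run `f` and report how many allocations happened while it ran.
/// Allocations made by other threads in the same window are included.
pub fn allocations_during<T>(f: impl FnOnce() -> T) -> (T, usize) {
    let before = allocation_count();
    let out = f();
    (out, allocation_count() - before)
}
```

Wrapping the payment function in `allocations_during` before and after an optimization gives a per-invocation count to compare, as long as nothing else is allocating on other threads at the same time.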

The tradeoff was cold-start latency: 17 ms extra on the first request because the Worker had to decompress the WASM module. We mitigated it by bundling the WASM as a Module Bundle and setting the module's keepalive flag to true in wrangler.toml. After that, the 95th percentile latency stayed at 19 ms.

What The Numbers Said After

We cut the checkout page from four buttons to one. The new endpoint /v1/payments/init is called from the frontend before the Pay button is enabled. The flow now does three hops: browser → edge worker → our internal ledger written in Rust on Linux.

We measured with OpenTelemetry traces. The median latency for /v1/payments/init on five Cloudflare PoPs (Atlanta, Singapore, Frankfurt, São Paulo, Mumbai) is 14 ms. The 95th percentile is 38 ms. Tail latency at the 99.9th percentile is 112 ms. We attribute the tail to Cloudflare's edge worker scheduling.
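For readers who want to reproduce this kind of summary from raw trace durations, here is a small sketch of a nearest-rank percentile in Rust. It assumes the latencies have already been exported as a list of milliseconds; tracing backends may use a different interpolation method, so results can differ slightly from theirs.

```rust
/// Nearest-rank percentile over latency samples (in milliseconds).
/// Returns None for empty input or a percentile outside 0..=100.
pub fn percentile(samples: &[f64], p: f64) -> Option<f64> {
    if samples.is_empty() || !(0.0..=100.0).contains(&p) {
        return None;
    }
    let mut sorted = samples.to_vec();
    // total_cmp gives a total order over f64, so sorting never panics.
    sorted.sort_by(|a, b| a.total_cmp(b));

    // Nearest-rank: rank = ceil(p * n / 100), clamped into 1..=n.
    let n = sorted.len();
    let rank = (p * n as f64 / 100.0).ceil() as usize;
    let index = rank.clamp(1, n) - 1;
    Some(sorted[index])
}

/// Convenience summary: (median, p95, p99.9).
pub fn latency_summary(samples: &[f64]) -> Option<(f64, f64, f64)> {
    Some((
        percentile(samples, 50.0)?,
        percentile(samples, 95.0)?,
        percentile(samples, 99.9)?,
    ))
}
```

Note that a 99.9th percentile only becomes meaningful with well over a thousand samples; with fewer, nearest-rank simply returns the maximum.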

We ran a controlled experiment for two weeks. The treatment group used the new Rust WASM endpoint; the control kept the Payhip multi-button checkout. The conversion rate for the new group rose 7.2 percent. The chargeback rate fell from 2.9 percent to 0.4 percent. The only regression was in Iran, where Cloudflare's PoP is blocked; we added an alternate route via a VPS in Armenia running HAProxy and the same WASM module. The latency penalty there is 270 ms, but it's better than zero.

Memory usage on Cloudflare Workers is reported in the worker logs: each invocation allocates 28 kB of WASM memory and 12 kB of JavaScript memory. At 12,000 requests per minute we hit peak memory of 4.1 MB per PoP. Cloudflare's 128 MB limit gives us a safety margin of roughly 31x (128 MB / 4.1 MB).
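The headroom arithmetic is simple enough to keep as a checked helper. The sketch below just restates the figures from this section; the function names are made up for the example.

```rust
/// Memory used by one invocation: WASM linear memory plus JS heap, in kB.
pub fn per_invocation_kb(wasm_kb: f64, js_kb: f64) -> f64 {
    wasm_kb + js_kb
}

/// How many times the observed peak fits into the platform limit.
/// Returns None when the peak is not a positive number.
pub fn headroom_factor(limit_mb: f64, peak_mb: f64) -> Option<f64> {
    if peak_mb > 0.0 {
        Some(limit_mb / peak_mb)
    } else {
        None
    }
}
```

With the numbers above, `per_invocation_kb(28.0, 12.0)` is 40 kB and `headroom_factor(128.0, 4.1)` is a little over 31.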

In our internal ledger service running on Linode amd64, the Rust binary uses jemalloc and reports 3.4 MB RSS per thousand concurrent payments. We've had zero out-of-memory events since flipping the switch.

What I Would Do Differently

I would not have started with Node. After three weeks of chasing Payhip CDN IP blocks and Stripe country filters, I should have assumed the platform would be the problem first. The Node runtime's non-deterministic GC pauses would have made the latency regression harder to debug.

I would standardize the signature algorithm earlier