
whiteknightonhorse

How We Replaced Our x402 Payment Facilitator in a Single Afternoon

This morning I got an email from Coinbase Developer Platform (CDP):

> Your organization has reached the monthly free tier limit for settled x402 payments using the CDP Facilitator. Additional settled payments will require a payment method on file.

Fine. I tried to add a payment method. CDP rejected my KYC documents three times. There is no support channel for KYC issues — just "try again later".

By that point we were already settling roughly 869 x402 payments per day through CDP for our MCP gateway (576 tools, 177 providers, public API at apibase.pro). When the free tier flipped, every paid call would start failing on `/settle`. Agents would get 5xx responses, escrow would leak, and customer USDC would be stranded.

I had two options:

  1. Wait for KYC, lose revenue, hope it resolves.
  2. Replace CDP. Build our own facilitator. Today.

I picked option 2. Here is how it went.

What an x402 facilitator actually does

The x402 protocol standardizes pay-per-call HTTP. The agent (payer) signs an EIP-3009 transferWithAuthorization permit. The merchant verifies the signature and submits the permit on-chain to actually move USDC.

That last part — submitting on-chain — is the facilitator's job. Most x402 services delegate it to a SaaS provider:

| Facilitator | Auth | Cost | KYC | Bazaar discovery |
|---|---|---|---|---|
| CDP (Coinbase) | Ed25519 JWT | Free tier, then unbounded | Required for billing | Yes |
| PayAI | None | Free | None | No |
| x402.org | None | Free, testnet-leaning | None | No |
| Self-hosted | n/a (your own wallet) | Base gas (~$0.0005/settle) | None | No |

The facilitator role is genuinely simple: hold an EOA with ETH for gas, submit USDC transferWithAuthorization calls. The whole thing is public on-chain. There is no proprietary algorithm. CDP just runs a wallet and an HTTP API around viem (or some equivalent).
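To make the permit concrete, here is a sketch of the authorization a payer signs, plus the cheap off-chain pre-checks a merchant can run before spending gas. Field names follow EIP-3009; the interface and helper are illustrative, not the actual @x402 SDK types (signature recovery itself is the SDK's job):

```typescript
// Sketch of an EIP-3009 transferWithAuthorization payload (field names per
// EIP-3009; this interface is illustrative, not the @x402 SDK's own type).
interface TransferAuthorization {
  from: string;        // payer address
  to: string;          // merchant's receiver address
  value: bigint;       // USDC amount, 6 decimals
  validAfter: bigint;  // unix seconds: authorization not valid before this
  validBefore: bigint; // unix seconds: authorization not valid after this
  nonce: string;       // random 32-byte value, single-use on-chain
}

// Cheap off-chain pre-checks before spending gas on a settle.
// EIP-3009 uses strict inequalities for the validity window.
function isAuthorizationUsable(
  auth: TransferAuthorization,
  nowSeconds: bigint,
  expectedReceiver: string,
  minValue: bigint,
): boolean {
  return (
    auth.to.toLowerCase() === expectedReceiver.toLowerCase() &&
    auth.value >= minValue &&
    nowSeconds > auth.validAfter &&
    nowSeconds < auth.validBefore
  );
}
```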

So why does anyone pay for a facilitator? Because operating a hot wallet 24/7 with proper monitoring and key handling is annoying and most teams would rather expense it.

That logic flips the moment the SaaS itself becomes the failure mode.

The build: 350 lines of glue

The good news: the @x402/core and @x402/evm SDKs already include a complete in-process facilitator implementation. It just normally lives behind an HTTP layer in CDP/PayAI's infrastructure. Nothing prevents us from running it ourselves.

The whole architecture, top to bottom:

```
                 x402ResourceServer (SDK, untouched)
                            |
                            v
                 LocalFacilitatorClient (new, ~120 lines)
                  - Redis SETNX lock per operator address
                  - Prometheus metrics
                  - PayAI HTTP fallback on throw
                            |
                            v
                 x402Facilitator (SDK, in-process)
                 + registerExactEvmScheme(signer)
                            |
                            v
                 viem WalletClient + publicActions
                  - operator privkey from .env
                  - Base mainnet, fixed RPC
                  - holds ETH only, never USDC
```

The viem signer is a handful of lines (`BASE_RPC_URL` is an illustrative env var name for our fixed Base RPC endpoint):

```typescript
import { createWalletClient, http, publicActions } from 'viem';
import { privateKeyToAccount } from 'viem/accounts';
import { base } from 'viem/chains';

// Fixed Base mainnet RPC; env var name is illustrative
const rpcUrl = process.env.BASE_RPC_URL ?? 'https://mainnet.base.org';

const account = privateKeyToAccount(
  process.env.X402_OPERATOR_PRIVATE_KEY as `0x${string}`,
);
const signer = createWalletClient({
  account,
  chain: base,
  transport: http(rpcUrl, { retryCount: 2, retryDelay: 200 }),
}).extend(publicActions);
```

The facilitator wiring is another four lines:

```typescript
import { x402Facilitator } from '@x402/core/facilitator';
import { registerExactEvmScheme } from '@x402/evm/exact/facilitator';

const facilitator = new x402Facilitator();
registerExactEvmScheme(facilitator, { signer, networks: 'eip155:8453' });
```

That is genuinely the entire payment-settlement layer. The SDK does the EIP-3009 signature verification, builds the calldata, submits via the signer, waits for the receipt. We add three things on top:

  1. A Redis SETNX lock keyed by operator address (TTL 60s). With multiple containers (api + worker) running on the same operator wallet, two concurrent eth_getTransactionCount calls would return the same nonce and the second tx would fail "nonce too low". The lock serializes settle calls cross-container.
  2. Prometheus metrics: x402_local_settle_total{result} counter, x402_local_settle_duration_seconds histogram with result label, x402_operator_eth_balance gauge. Plus alerts at <0.005 ETH balance and >5% error rate.
  3. PayAI HTTP fallback wired transparently inside LocalFacilitatorClient. If our local settle throws (RPC outage, signer drift, anything), the same call automatically retries against PayAI's free public facilitator. Single-RPC blips do not drop revenue.
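For illustration, the lock pattern can be sketched against a minimal client interface (ioredis and node-redis both expose the `SET ... NX PX` equivalent; the names here are mine, not our actual local-facilitator.ts):

```typescript
// Minimal surface of a Redis-like client (illustrative interface).
interface LockClient {
  // SET key value NX PX ttl -> true iff the key was newly set
  setIfAbsent(key: string, value: string, ttlMs: number): Promise<boolean>;
  get(key: string): Promise<string | null>;
  del(key: string): Promise<void>;
}

// Serialize settles per operator address: spin on SETNX until acquired or
// timed out. Success is tracked with an in-loop boolean (no post-acquire
// GET, so no TTL race of the kind the council review flagged).
async function withOperatorLock<T>(
  client: LockClient,
  operator: string,
  fn: () => Promise<T>,
  ttlMs = 60_000,
  timeoutMs = 10_000,
): Promise<T> {
  const key = `x402:lock:${operator.toLowerCase()}`;
  const token = Math.random().toString(36).slice(2);
  const deadline = Date.now() + timeoutMs;
  let acquired = false;
  while (!acquired) {
    acquired = await client.setIfAbsent(key, token, ttlMs);
    if (!acquired) {
      if (Date.now() > deadline) throw new Error('operator lock timeout');
      await new Promise((r) => setTimeout(r, 50));
    }
  }
  try {
    return await fn();
  } finally {
    // Only release a lock we still own (the TTL may expire mid-settle).
    if ((await client.get(key)) === token) await client.del(key);
  }
}
```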

Two architectural decisions worth explaining:

Operator wallet ≠ receiver wallet. Our receiver address (the one that accumulates USDC revenue) has no private key in the runtime environment. The operator address is a fresh hot wallet that holds only ETH for gas. If the operator key is compromised, the attacker steals unspent gas (~$10), not customer USDC. Compromise of the receiver requires compromising whatever cold custody we use for that address. Keep them separate. Always.
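In `.env` terms, the split looks like this (`X402_OPERATOR_PRIVATE_KEY` is the real variable from our setup; the receiver line and its variable name are illustrative, and it is only an address, never a key):

```shell
# .env on the runtime host (chmod 600, gitignored)
X402_OPERATOR_PRIVATE_KEY=0x...   # hot wallet: holds ETH for gas only
X402_RECEIVER_ADDRESS=0x...       # address only; its key never touches this box
```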

The SDK does not iterate fallback clients automatically. I assumed it did — the existing code I inherited had a comment claiming "if first client throws, tries next." Reading the actual @x402/core/server source disproved this. x402ResourceServer.initialize() populates a (version, network, scheme) → client lookup map, and the FIRST registered client wins. So the fallback semantics had to live inside our LocalFacilitatorClient itself: try local, catch, delegate to a PayAI HTTP client.
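A minimal sketch of that shape (the interfaces here are illustrative stand-ins, not the actual @x402/core types):

```typescript
// Illustrative stand-ins for the SDK's facilitator-client surface.
interface SettleRequest { payload: unknown }
interface SettleResult { txHash: string; via: 'local' | 'payai' }

interface FacilitatorLike {
  settle(req: SettleRequest): Promise<SettleResult>;
}

// Because x402ResourceServer only ever calls the first registered client
// for a (version, network, scheme) key, the fallback must live inside the
// client itself: try local, catch, delegate to the remote facilitator.
class FallbackFacilitatorClient implements FacilitatorLike {
  constructor(
    private local: FacilitatorLike,
    private remote: FacilitatorLike,
    private onFallback: (err: unknown) => void = () => {},
  ) {}

  async settle(req: SettleRequest): Promise<SettleResult> {
    try {
      return await this.local.settle(req);
    } catch (err) {
      this.onFallback(err); // bump a metric, log the cause
      return await this.remote.settle(req);
    }
  }
}
```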

Full code in src/payments/local-facilitator.ts.

Three gotchas that cost us seven minutes of 502s

The implementation worked on the first try. The deployment is what bit me. Three docker-compose footguns, in order:

1. docker compose restart does not re-read env_file

I edited .env to set X402_FACILITATOR_MODE=local, then ran docker compose restart api worker. The container came up healthy. Logs showed mode: "remote".

I assumed the env file load was broken. I read it three times. It was correct. Then I ran docker exec apibase-api-1 env | grep X402_FACILITATOR_MODE — empty.

`restart` does not reload `env_file`. It stops and starts the same container, which keeps the env vars it was created with. To pick up new values you need `docker compose up -d --force-recreate --no-deps api worker`.

Documented nowhere prominent. Cost me ~3 minutes.

2. --force-recreate does not pull new images

The push to main had triggered a CI build that published a new image to ghcr.io. The image included our new LocalFacilitatorClient code. I assumed --force-recreate would pull it.

It did not. It happily reused the locally cached pre-push image. The container started with mode=local (from the new env), but the actual code was old (no LocalFacilitatorClient). Logs showed CDP being registered as primary.

Fix: docker compose pull api worker before recreate. Or --pull always. Cost: another ~2 minutes of confusion.

3. After --force-recreate, the inner nginx caches the old IP

This one was the worst. After recreating api, the new container got a new IP on the internal Docker network. Our edge nginx (apibase-nginx-1) had the old IP cached from its config-load time. Every request to apibase.pro/api/v1/tools returned 502 because nginx kept connecting to a dead IP.

Symptoms: outer health checks pass (200 from static pages), API endpoints all 502, container looks healthy from docker ps, docker exec apibase-nginx-1 wget http://api:3000/health/ready works. Took me a moment to realize nginx was holding 172.18.0.5 from the previous container while the new one was at 172.18.0.7.

Fix: docker exec apibase-nginx-1 nginx -s reload after every recreate. Cost: ~2 minutes of users seeing 502s before I diagnosed it.

I documented all three in our runbook so the next person (or my future self) does not repeat them.
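Condensed, the redeploy sequence that avoids all three footguns looks like this (using the service and container names from this post):

```shell
# 1. Pull the freshly built images; --force-recreate alone will not
docker compose pull api worker

# 2. Recreate the containers so env_file changes are actually re-read
docker compose up -d --force-recreate --no-deps api worker

# 3. Reload the edge nginx so it re-resolves the new container IPs
docker exec apibase-nginx-1 nginx -s reload
```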

Council Review and auto-fix

Before pushing, I ran our internal /councilreview skill — eight specialized reviewers (Security, Performance, Reliability, API Designer, Domain Expert, Crypto Specialist, DevOps, Observability) each look at the diff independently. They surfaced five MEDIUM findings I had not thought of:

  • Crypto Specialist: viem http() transport had no retry config. Added { retryCount: 2, retryDelay: 200 } so transient public-RPC blips are absorbed before triggering PayAI fallback.
  • Reliability Engineer: my Redis lock had a final redis.get(key) !== token check after the SETNX retry loop, which has a tiny race window where the SET succeeds and the TTL expires before the GET. Replaced with a simple in-loop boolean.
  • Performance Engineer: histogram buckets started at 500ms, masking sub-second optimization wins. Tightened to [0.1, 0.25, 0.5, 1, 2, 5, 10, 30].
  • Observability Engineer (×2): settle duration histogram had no result label, so I could not distinguish success-latency from fallback-latency. Also added a brand-new x402_operator_lock_wait_seconds histogram so I would see lock contention rising before requests started timing out.

The skill auto-applied all five fixes (with a TS+ESLint+Jest validation gate that would revert everything if any of the three regressed). Took 90 seconds. Five useful changes, zero extra work.

Day-0 numbers (with honest framing)

I am writing this an hour after mode=local went live. I genuinely do not know how it will perform under a peak hour, an RPC outage, or a Base fee-market spike yet. What I do know:

  • ~15 settles processed through LocalFacilitatorClient so far
  • 0 errors, 0 fallbacks (Day-0 cherry — does not generalize)
  • ~0.001 ETH spent on gas (~$3)
  • Operator wallet balance hit the dashboard, low-balance alert wired and tested
  • Production /health/ready returned 200 throughout (after the nginx reload fix)
  • Base finality on every settle, all transactions visible on basescan

I will follow up in two weeks with a real soak report: 7-day error rate, fallback rate, p99 lock-wait latency, total ETH spent, total USDC settled. The numbers I quote there will be production-true, not hopeful.

What I learned

Three things, ordered from operational to philosophical:

  1. Read the SDK source, do not trust the inline docstrings. The "iterates fallback clients" comment was in our code, not in the SDK. The actual SDK behavior is single-client-wins. If I had not read @x402/core/server/index.js directly, our PayAI fallback would have been dead code.

  2. Hot wallet operations are simpler than they sound. Generate a key, fund with ETH, hold it in .env (chmod 600, gitignored), monitor balance via Prometheus alert. The operational complexity of running our own facilitator is genuinely lower than the operational complexity of dealing with KYC, billing dashboards, and SaaS quota limits. We are saving roughly $50/month at our current volume — not a lot, but the dependency removal is the actual prize.

  3. Vendor dependencies on critical revenue paths are technical debt. A free tier with a paid escalator is a free tier until the day it isn't. CDP did not do anything wrong here — they sent a clear notification, gave a clear path forward, and have legitimate KYC requirements. But "depend on a vendor for the part of our system that converts agent calls to revenue" was a fragile choice we had not noticed making. Self-hosting is one less hop, one less account to manage, one less single point of failure.

Try it yourself

If you run an x402-priced API and want to drop your facilitator dependency:

```shell
# Generate an operator wallet
npx tsx -e "
const { generatePrivateKey, privateKeyToAccount } = require('viem/accounts');
const pk = generatePrivateKey();
console.log('address:', privateKeyToAccount(pk).address);
console.log('privkey:', pk);
"

# Fund the address with ~$10 of ETH on Base via any exchange withdrawal

# Add to .env (chmod 600, gitignored, never log)
echo "X402_OPERATOR_PRIVATE_KEY=<the privkey>" >> .env

# Replace your HTTP facilitator with the in-process one
# (full reference impl: github.com/whiteknightonhorse/APIbase/blob/main/src/payments/local-facilitator.ts)
```

That is the whole pattern. The full architecture write-up — operational design, two-wallet model, observability stack, key rotation procedure — lives in docs/x402-facilitator.md. MIT-licensed, fork freely.

If you ship something on this pattern, open an issue on the repo. I would like to know it works for someone other than us.


APIbase is a unified MCP gateway with 576 tools and 177 providers, paid via x402 USDC on Base or MPP USDC on Tempo. Production endpoint: https://apibase.pro/mcp.
