Every SMS verification reseller eventually hits the same wall: codes don't always arrive. In our production data over the last six months, the failure rate has held steady at 28-32%, depending on country and service. Telegram in Indonesia is closer to 40%; WhatsApp in the US is closer to 18%.
The naive refund flow looks like this:
User buys number → Wait 20 minutes → No code arrived → Refund user
This is what we shipped in v1. It cost us roughly $400 per week in losses before we caught it. Here's what was happening, what we built to fix it, and the Postgres patterns that made the new flow safe.
The bug
The failure mode wasn't subtle once we instrumented it: we were refunding users even when the upstream provider had charged us and not refunded us back. Both providers we use (HeroSMS and 5SIM) have their own internal refund logic — sometimes they refund automatically when no code arrives within their SLA, sometimes they don't, sometimes they do but ~30 minutes later.
Our v1 logic just checked our own clock: 20 minutes after purchase, if status was still "WAITING" in our DB, we'd refund the user. That meant:
- Provider eventually refunds us: we got our money back later, fine, no net loss
- Provider does NOT refund us: we ate the cost of the number, lost it from our margin
- Provider refunds us LATE, after we already refunded the user: double-refund detected on reconciliation, manual cleanup
- Code arrives at minute 21: user already refunded, code is now wasted
The second failure mode was the killer. We were refunding 30% of activations, but only ~70% of those resulted in a corresponding provider refund. So on top of the markup we'd already given back, we were eating the full cost of ~9% of all activations.
With ~50,000 activations a month, that's roughly 4,500 activations a month whose cost we ate outright.
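As a sanity check, the loss math works out like this. A throwaway sketch, using the rates from our data, kept in whole percent so the arithmetic stays exact:

```typescript
// Back-of-envelope check of the loss math described above.
// Rates are whole percentages so every intermediate value is an exact integer.
function eatenActivationsPerMonth(
  activations: number,
  refundPct: number,          // % of activations we refunded to users (~30)
  providerRefundPct: number,  // % of those the provider refunded back to us (~70)
): number {
  // We eat the cost only when we refund the user AND the provider keeps our money.
  return (activations * refundPct * (100 - providerRefundPct)) / 10_000;
}

console.log(eatenActivationsPerMonth(50_000, 30, 70)); // 4500 — ~9% of all activations
```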
The state machine
The fix was to make every activation live in exactly one of these states, with explicit transitions:
┌─────────────────┐
│    PURCHASED    │ ◄── initial, provider returned an activation ID
└────────┬────────┘
         │
┌────────▼────────┐
│     WAITING     │ ◄── polling or webhook for code
└────┬─────────┬──┘
     │         │
 code│         │timeout (20m) OR cancelled
     │         │
┌────▼──────┐ ┌▼────────────────┐
│ RECEIVED  │ │     EXPIRED     │
└────┬──────┘ └────────┬────────┘
     │                 │
┌────▼──────┐ ┌────────▼────────────────┐
│ COMPLETED │ │ PROVIDER_REFUND_PENDING │ ◄── waiting on provider refund
└───────────┘ └───────┬─────────────────┘
                      │
              ┌───────▼───────┐
              │   REFUNDED    │ ◄── user gets credit back, only after provider confirms
              └───────────────┘
Key rules:
- Never refund the user before the PROVIDER_REFUND_PENDING → REFUNDED transition.
- Every state transition is an atomic Postgres UPDATE with a WHERE status = expected_previous guard. No race conditions.
- The provider refund check is a balance delta, not a status field. We track the provider balance every minute, and when the balance goes up after an EXPIRED activation, we tag that delta to the activation and only then refund the user.
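The rules above boil down to a small allowed-transitions table. Here's an illustrative application-side sketch; the names and the CANCELLED handling are assumptions, and the SQL WHERE-clause guard remains the source of truth:

```typescript
// Illustrative allowed-transitions map for the activation state machine.
// This only rejects nonsense transitions early; the database guard is authoritative.
type Status =
  | "PURCHASED" | "WAITING" | "RECEIVED" | "COMPLETED"
  | "EXPIRED" | "PROVIDER_REFUND_PENDING" | "REFUNDED" | "CANCELLED";

const ALLOWED: Record<Status, Status[]> = {
  PURCHASED: ["WAITING"],
  WAITING: ["RECEIVED", "EXPIRED", "CANCELLED"],
  RECEIVED: ["COMPLETED"],
  COMPLETED: [],                      // terminal
  EXPIRED: ["PROVIDER_REFUND_PENDING"],
  PROVIDER_REFUND_PENDING: ["REFUNDED"],
  REFUNDED: [],                       // terminal
  CANCELLED: [],                      // assumption: treated as terminal here
};

function canTransition(from: Status, to: Status): boolean {
  return ALLOWED[from].includes(to);
}

console.log(canTransition("WAITING", "REFUNDED")); // false — must go via EXPIRED
```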
The Postgres part
The table:
CREATE TABLE activations (
  id bigint PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
  user_id uuid NOT NULL REFERENCES users(id),
  provider text NOT NULL CHECK (provider IN ('herosms', '5sim')),
  provider_activation_id text NOT NULL,
  service text NOT NULL,
  country text NOT NULL,
  cost_charged numeric(10,4) NOT NULL,
  status text NOT NULL CHECK (status IN (
    'PURCHASED', 'WAITING', 'RECEIVED', 'COMPLETED',
    'EXPIRED', 'PROVIDER_REFUND_PENDING', 'REFUNDED', 'CANCELLED'
  )),
  status_updated_at timestamptz NOT NULL DEFAULT now(),
  created_at timestamptz NOT NULL DEFAULT now(),
  expires_at timestamptz NOT NULL,
  refund_provider_balance_delta_id bigint REFERENCES provider_balance_deltas(id),
  UNIQUE (provider, provider_activation_id)
);
CREATE INDEX idx_activations_status_expires ON activations (status, expires_at)
WHERE status IN ('WAITING', 'PROVIDER_REFUND_PENDING');
The partial index is the workhorse. We have hundreds of thousands of activations in this table, but only the ones in WAITING or PROVIDER_REFUND_PENDING need to be scanned by the cron. The partial index makes that scan O(active rows), not O(total rows).
The atomic transition function:
CREATE OR REPLACE FUNCTION transition_activation_status(
  p_activation_id bigint,
  p_from_status text,
  p_to_status text
) RETURNS boolean AS $$
DECLARE
  v_updated int;
BEGIN
  UPDATE activations
  SET status = p_to_status,
      status_updated_at = now()
  WHERE id = p_activation_id
    AND status = p_from_status;
  GET DIAGNOSTICS v_updated = ROW_COUNT;
  RETURN v_updated = 1;
END;
$$ LANGUAGE plpgsql;
The WHERE status = p_from_status is the atomic guard. If two crons race to transition the same activation, only one wins; the other returns false. No locks needed, no advisory locks, no Redis. Just Postgres row-level MVCC.
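The compare-and-set semantics can be illustrated in-memory. This is purely a demonstration of why only one of two racing workers wins; production uses the SQL function above:

```typescript
// In-memory illustration of the WHERE status = expected guard.
// Two workers attempt the same transition; exactly one succeeds.
interface Row { id: number; status: string }

function transition(row: Row, from: string, to: string): boolean {
  // Mirrors UPDATE ... WHERE id = $1 AND status = $2:
  // a no-op unless the current status matches the expected previous status.
  if (row.status !== from) return false;
  row.status = to;
  return true;
}

const row: Row = { id: 1, status: "WAITING" };
const results = [
  transition(row, "WAITING", "EXPIRED"), // worker A
  transition(row, "WAITING", "EXPIRED"), // worker B loses the race
];
console.log(results, row.status); // [ true, false ] EXPIRED
```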
The cron that scans for expired activations:
-- Run every 60 seconds
WITH expired AS (
  SELECT id, provider, provider_activation_id, cost_charged
  FROM activations
  WHERE status = 'WAITING'
    AND expires_at < now()
  LIMIT 100
  FOR UPDATE SKIP LOCKED
)
UPDATE activations
SET status = 'EXPIRED',
    status_updated_at = now()
FROM expired
WHERE activations.id = expired.id
RETURNING activations.id, activations.provider, activations.provider_activation_id;
FOR UPDATE SKIP LOCKED is critical. It means multiple cron workers can run in parallel and won't block each other — they'll just claim different rows. We run two workers in different regions for redundancy.
The provider balance reconciliation
This is the part that makes the whole thing work. Every 60 seconds we ping each provider's getBalance endpoint and store the delta:
CREATE TABLE provider_balance_deltas (
  id bigint PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
  provider text NOT NULL,
  balance_before numeric(12,4) NOT NULL,
  balance_after numeric(12,4) NOT NULL,
  delta numeric(12,4) NOT NULL,
  observed_at timestamptz NOT NULL DEFAULT now()
);
When delta > 0 (provider gave us money back), we look at activations in EXPIRED state for that provider where the cost matches the delta within a tolerance, and tag them:
UPDATE activations
SET status = 'PROVIDER_REFUND_PENDING',
    refund_provider_balance_delta_id = $1
WHERE id = (
  SELECT id FROM activations
  WHERE provider = $2
    AND status = 'EXPIRED'
    AND ABS(cost_charged - $3) < 0.001
  ORDER BY status_updated_at ASC
  LIMIT 1
);
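The same matching idea, sketched application-side with the relative tolerance and time window described later in this post. Field names and the one-hour window are illustrative; the production logic is the SQL above:

```typescript
// Match a positive provider balance delta to the oldest EXPIRED activation
// whose cost agrees within a relative tolerance and a time window.
// Field names are illustrative, not actual schema bindings.
interface ExpiredActivation {
  id: number;
  costCharged: number;
  statusUpdatedAt: number; // epoch milliseconds
}

function matchDelta(
  delta: number,
  deltaObservedAt: number,
  expired: ExpiredActivation[],
  tolerance = 0.001,          // 0.1% relative tolerance
  windowMs = 60 * 60 * 1000,  // assumed: only activations expired in the last hour
): ExpiredActivation | undefined {
  return expired
    .filter(a =>
      Math.abs(a.costCharged - delta) <= a.costCharged * tolerance &&
      deltaObservedAt - a.statusUpdatedAt <= windowMs)
    .sort((a, b) => a.statusUpdatedAt - b.statusUpdatedAt)[0]; // oldest first
}

const now = Date.now();
const hit = matchDelta(0.50, now, [
  { id: 7, costCharged: 0.5, statusUpdatedAt: now - 10 * 60 * 1000 },
  { id: 9, costCharged: 1.2, statusUpdatedAt: now - 5 * 60 * 1000 },
]);
console.log(hit?.id); // 7
```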
Then a final cron transitions PROVIDER_REFUND_PENDING → REFUNDED and credits the user wallet in the same transaction.
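A sketch of that final step as one plpgsql function, assuming a user_wallets(user_id, balance) table, which is not our exact schema. Because a function body runs in a single transaction, the status transition and the wallet credit commit or roll back together:

```sql
-- Sketch: refund the user only via the atomic transition, in one transaction.
-- user_wallets(user_id, balance) is an assumed table, not the exact schema.
CREATE OR REPLACE FUNCTION refund_user(p_activation_id bigint) RETURNS boolean AS $$
BEGIN
  -- Atomic guard: only proceed if the provider refund was already confirmed.
  IF NOT transition_activation_status(
    p_activation_id, 'PROVIDER_REFUND_PENDING', 'REFUNDED'
  ) THEN
    RETURN false;  -- wrong state, or another worker already handled it
  END IF;

  -- Same transaction: if this credit fails, the status change rolls back too.
  UPDATE user_wallets w
  SET balance = w.balance + a.cost_charged
  FROM activations a
  WHERE a.id = p_activation_id
    AND w.user_id = a.user_id;

  RETURN true;
END;
$$ LANGUAGE plpgsql;
```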
The key insight: the provider's refund and the user's refund are now linked by a real cash flow, not by hope. If the provider doesn't refund, the user doesn't get refunded. We eat the loss explicitly and visibly, instead of bleeding it silently.
What this changed
- Refund-related losses dropped ~80% on the first deploy.
- User complaints about double-refunds went to zero because we no longer race the provider.
- Reconciliation became boring, which is the highest praise reconciliation can get.
- Support load dropped because the rare "my code arrived after I was refunded" complaints went away — we now hold the refund until after the timeout AND the provider has confirmed.
What I'd do differently
- I'd build the state machine on day one. We retrofitted it on top of v1 and it was painful.
- I'd use a real provider webhook listener for the WAITING → RECEIVED transition, not just polling. We have webhooks now, but they came late.
- I'd tolerance-match on provider balance deltas more carefully. Our first version was too strict and missed legitimate refunds. Now we use a 0.1% tolerance and a time window.
- I'd add a RECONCILIATION_FAILED state for cases where the provider balance came back but didn't match any expired activation. Those go to an admin queue.
Why share this
If you're building anything that resells a paid upstream service — SMS, voice, virtual cards, pay-per-call — this pattern probably applies. The naive refund flow looks fine in dev and bleeds cash in prod. The state machine is not glamorous but it pays for itself in the first month.
For what it's worth, VerifySMS runs on this exact pattern in production today across HeroSMS and 5SIM, on Supabase Edge Functions and Postgres. If you want to see the user-facing side — instant virtual numbers, automatic refunds, no KYC — we're on the App Store and the web.
Questions about the state machine, the balance reconciliation, or the partial index strategy — ask in the comments.