Every SMS verification reseller eventually hits the same wall: codes don't always arrive. In our production data over the last six months, the failure rate has held steady at 28-32%, depending on country and service. Telegram in Indonesia is closer to 40%; WhatsApp in the US is closer to 18%.
The naive refund flow looks like this:
User buys number → Wait 20 minutes → No code arrived → Refund user
This is what we shipped in v1. It cost us roughly $400 per week in losses before we caught it. Here's what was happening, what we built to fix it, and the Postgres patterns that made the new flow safe.
The bug
The failure mode wasn't subtle once we instrumented it: we were refunding users even when the upstream provider had charged us and not refunded us back. Both providers we use (HeroSMS and 5SIM) have their own internal refund logic — sometimes they refund automatically when no code arrives within their SLA, sometimes they don't, sometimes they do but ~30 minutes later.
Our v1 logic just checked our own clock: 20 minutes after purchase, if status was still "WAITING" in our DB, we'd refund the user. That meant:
- Provider eventually refunds us: we got our money back later, fine, no net loss
- Provider does NOT refund us: we ate the cost of the number, lost it from our margin
- Provider refunds us LATE, after we already refunded the user: double-refund detected on reconciliation, manual cleanup
- Code arrives at minute 21: user already refunded, code is now wasted
The second failure mode was the killer. We were refunding 30% of activations, but only ~70% of those resulted in a corresponding provider refund. So on top of the markup we'd already given back, we were eating the full cost of ~9% of all activations.
With ~50,000 activations a month, that's roughly 4,500 activations a month whose cost we ate outright.
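As a sanity check, the loss math works out like this. A throwaway sketch, using the rates from our data, kept in whole percent so the arithmetic stays exact:

```typescript
// Back-of-envelope check of the loss math described above.
// Rates are whole percentages so every intermediate value is an exact integer.
function eatenActivationsPerMonth(
  activations: number,
  refundPct: number,          // % of activations we refunded to users (~30)
  providerRefundPct: number,  // % of those the provider refunded back to us (~70)
): number {
  // We eat the cost only when we refund the user AND the provider keeps our money.
  return (activations * refundPct * (100 - providerRefundPct)) / 10_000;
}

console.log(eatenActivationsPerMonth(50_000, 30, 70)); // 4500 — ~9% of all activations
```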
The state machine
The fix was to make every activation live in exactly one of these states, with explicit transitions:
┌─────────────────┐
│    PURCHASED    │ ◄── initial, provider returned an activation ID
└────────┬────────┘
         │
┌────────▼────────┐
│     WAITING     │ ◄── polling or webhook for code
└────┬─────────┬──┘
     │         │
 code│         │timeout (20m) OR cancelled
     │         │
┌────▼──────┐ ┌▼────────────────┐
│ RECEIVED  │ │     EXPIRED     │
└────┬──────┘ └────────┬────────┘
     │                 │
┌────▼──────┐ ┌────────▼────────────────┐
│ COMPLETED │ │ PROVIDER_REFUND_PENDING │ ◄── waiting on provider refund
└───────────┘ └───────┬─────────────────┘
                      │
              ┌───────▼───────┐
              │   REFUNDED    │ ◄── user gets credit back, only after provider confirms
              └───────────────┘
Key rules:
- Never refund the user before the PROVIDER_REFUND_PENDING → REFUNDED transition.
- Every state transition is an atomic Postgres UPDATE with a WHERE status = expected_previous guard. No race conditions.
- The provider refund check is a balance delta, not a status field. We track the provider balance every minute, and when the balance goes up after an EXPIRED activation, we tag that delta to the activation and only then refund the user.
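The rules above boil down to a small allowed-transitions table. Here's an illustrative application-side sketch; the names and the CANCELLED handling are assumptions, and the SQL WHERE-clause guard remains the source of truth:

```typescript
// Illustrative allowed-transitions map for the activation state machine.
// This only rejects nonsense transitions early; the database guard is authoritative.
type Status =
  | "PURCHASED" | "WAITING" | "RECEIVED" | "COMPLETED"
  | "EXPIRED" | "PROVIDER_REFUND_PENDING" | "REFUNDED" | "CANCELLED";

const ALLOWED: Record<Status, Status[]> = {
  PURCHASED: ["WAITING"],
  WAITING: ["RECEIVED", "EXPIRED", "CANCELLED"],
  RECEIVED: ["COMPLETED"],
  COMPLETED: [],                      // terminal
  EXPIRED: ["PROVIDER_REFUND_PENDING"],
  PROVIDER_REFUND_PENDING: ["REFUNDED"],
  REFUNDED: [],                       // terminal
  CANCELLED: [],                      // assumption: treated as terminal here
};

function canTransition(from: Status, to: Status): boolean {
  return ALLOWED[from].includes(to);
}

console.log(canTransition("WAITING", "REFUNDED")); // false — must go via EXPIRED
```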
The Postgres part
The table:
CREATE TABLE activations (
  id bigint PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
  user_id uuid NOT NULL REFERENCES users(id),
  provider text NOT NULL CHECK (provider IN ('herosms', '5sim')),
  provider_activation_id text NOT NULL,
  service text NOT NULL,
  country text NOT NULL,
  cost_charged numeric(10,4) NOT NULL,
  status text NOT NULL CHECK (status IN (
    'PURCHASED', 'WAITING', 'RECEIVED', 'COMPLETED',
    'EXPIRED', 'PROVIDER_REFUND_PENDING', 'REFUNDED', 'CANCELLED'
  )),
  status_updated_at timestamptz NOT NULL DEFAULT now(),
  created_at timestamptz NOT NULL DEFAULT now(),
  expires_at timestamptz NOT NULL,
  refund_provider_balance_delta_id bigint REFERENCES provider_balance_deltas(id),
  UNIQUE (provider, provider_activation_id)
);
CREATE INDEX idx_activations_status_expires ON activations (status, expires_at)
WHERE status IN ('WAITING', 'PROVIDER_REFUND_PENDING');
The partial index is the workhorse. We have hundreds of thousands of activations in this table, but only the ones in WAITING or PROVIDER_REFUND_PENDING need to be scanned by the cron. The partial index makes that scan O(active rows), not O(total rows).
The atomic transition function:
CREATE OR REPLACE FUNCTION transition_activation_status(
  p_activation_id bigint,
  p_from_status text,
  p_to_status text
) RETURNS boolean AS $$
DECLARE
  v_updated int;
BEGIN
  UPDATE activations
  SET status = p_to_status,
      status_updated_at = now()
  WHERE id = p_activation_id
    AND status = p_from_status;
  GET DIAGNOSTICS v_updated = ROW_COUNT;
  RETURN v_updated = 1;
END;
$$ LANGUAGE plpgsql;
The WHERE status = p_from_status is the atomic guard. If two crons race to transition the same activation, only one wins; the other returns false. No locks needed, no advisory locks, no Redis. Just Postgres row-level MVCC.
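The compare-and-set semantics can be illustrated in-memory. This is purely a demonstration of why only one of two racing workers wins; production uses the SQL function above:

```typescript
// In-memory illustration of the WHERE status = expected guard.
// Two workers attempt the same transition; exactly one succeeds.
interface Row { id: number; status: string }

function transition(row: Row, from: string, to: string): boolean {
  // Mirrors UPDATE ... WHERE id = $1 AND status = $2:
  // a no-op unless the current status matches the expected previous status.
  if (row.status !== from) return false;
  row.status = to;
  return true;
}

const row: Row = { id: 1, status: "WAITING" };
const results = [
  transition(row, "WAITING", "EXPIRED"), // worker A
  transition(row, "WAITING", "EXPIRED"), // worker B loses the race
];
console.log(results, row.status); // [ true, false ] EXPIRED
```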
The cron that scans for expired activations:
-- Run every 60 seconds
WITH expired AS (
  SELECT id, provider, provider_activation_id, cost_charged
  FROM activations
  WHERE status = 'WAITING'
    AND expires_at < now()
  LIMIT 100
  FOR UPDATE SKIP LOCKED
)
UPDATE activations
SET status = 'EXPIRED',
    status_updated_at = now()
FROM expired
WHERE activations.id = expired.id
RETURNING activations.id, activations.provider, activations.provider_activation_id;
FOR UPDATE SKIP LOCKED is critical. It means multiple cron workers can run in parallel and won't block each other — they'll just claim different rows. We run two workers in different regions for redundancy.
The provider balance reconciliation
This is the part that makes the whole thing work. Every 60 seconds we ping each provider's getBalance endpoint and store the delta:
CREATE TABLE provider_balance_deltas (
  id bigint PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
  provider text NOT NULL,
  balance_before numeric(12,4) NOT NULL,
  balance_after numeric(12,4) NOT NULL,
  delta numeric(12,4) NOT NULL,
  observed_at timestamptz NOT NULL DEFAULT now()
);
When delta > 0 (provider gave us money back), we look at activations in EXPIRED state for that provider where the cost matches the delta within a tolerance, and tag them:
UPDATE activations
SET status = 'PROVIDER_REFUND_PENDING',
    refund_provider_balance_delta_id = $1
WHERE id = (
  SELECT id FROM activations
  WHERE provider = $2
    AND status = 'EXPIRED'
    AND ABS(cost_charged - $3) < 0.001
  ORDER BY status_updated_at ASC
  LIMIT 1
);
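The same matching idea, sketched application-side with the relative tolerance and time window described later in this post. Field names and the one-hour window are illustrative; the production logic is the SQL above:

```typescript
// Match a positive provider balance delta to the oldest EXPIRED activation
// whose cost agrees within a relative tolerance and a time window.
// Field names are illustrative, not actual schema bindings.
interface ExpiredActivation {
  id: number;
  costCharged: number;
  statusUpdatedAt: number; // epoch milliseconds
}

function matchDelta(
  delta: number,
  deltaObservedAt: number,
  expired: ExpiredActivation[],
  tolerance = 0.001,          // 0.1% relative tolerance
  windowMs = 60 * 60 * 1000,  // assumed: only activations expired in the last hour
): ExpiredActivation | undefined {
  return expired
    .filter(a =>
      Math.abs(a.costCharged - delta) <= a.costCharged * tolerance &&
      deltaObservedAt - a.statusUpdatedAt <= windowMs)
    .sort((a, b) => a.statusUpdatedAt - b.statusUpdatedAt)[0]; // oldest first
}

const now = Date.now();
const hit = matchDelta(0.50, now, [
  { id: 7, costCharged: 0.5, statusUpdatedAt: now - 10 * 60 * 1000 },
  { id: 9, costCharged: 1.2, statusUpdatedAt: now - 5 * 60 * 1000 },
]);
console.log(hit?.id); // 7
```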
Then a final cron transitions PROVIDER_REFUND_PENDING → REFUNDED and credits the user wallet in the same transaction.
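A sketch of that final step as one plpgsql function, assuming a user_wallets(user_id, balance) table, which is not our exact schema. Because a function body runs in a single transaction, the status transition and the wallet credit commit or roll back together:

```sql
-- Sketch: refund the user only via the atomic transition, in one transaction.
-- user_wallets(user_id, balance) is an assumed table, not the exact schema.
CREATE OR REPLACE FUNCTION refund_user(p_activation_id bigint) RETURNS boolean AS $$
BEGIN
  -- Atomic guard: only proceed if the provider refund was already confirmed.
  IF NOT transition_activation_status(
    p_activation_id, 'PROVIDER_REFUND_PENDING', 'REFUNDED'
  ) THEN
    RETURN false;  -- wrong state, or another worker already handled it
  END IF;

  -- Same transaction: if this credit fails, the status change rolls back too.
  UPDATE user_wallets w
  SET balance = w.balance + a.cost_charged
  FROM activations a
  WHERE a.id = p_activation_id
    AND w.user_id = a.user_id;

  RETURN true;
END;
$$ LANGUAGE plpgsql;
```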
The key insight: the provider's refund and the user's refund are now linked by a real cash flow, not by hope. If the provider doesn't refund, the user doesn't get refunded. We eat the loss explicitly and visibly, instead of bleeding it silently.
What this changed
- Refund-related losses dropped ~80% on the first deploy.
- User complaints about double-refunds went to zero because we no longer race the provider.
- Reconciliation became boring, which is the highest praise reconciliation can get.
- Support load dropped because the rare "my code arrived after I was refunded" complaints went away — we now hold the refund until after the timeout AND the provider has confirmed.
What I'd do differently
- I'd build the state machine on day one. We retrofitted it on top of v1 and it was painful.
- I'd use a real provider webhook listener for the WAITING → RECEIVED transition, not just polling. We have webhooks now, but they came late.
- I'd tolerance-match on provider balance deltas more carefully. Our first version was too strict and missed legitimate refunds. Now we use a 0.1% tolerance and a time window.
- I'd add a RECONCILIATION_FAILED state for cases where the provider balance came back but didn't match any expired activation. Those go to an admin queue.
Why share this
If you're building anything that resells a paid upstream service — SMS, voice, virtual cards, pay-per-call — this pattern probably applies. The naive refund flow looks fine in dev and bleeds cash in prod. The state machine is not glamorous but it pays for itself in the first month.
For what it's worth, VerifySMS runs on this exact pattern in production today across HeroSMS and 5SIM, on Supabase Edge Functions and Postgres. If you want to see the user-facing side — instant virtual numbers, automatic refunds, no KYC — we're on the App Store and the web.
Questions about the state machine, the balance reconciliation, or the partial index strategy — ask in the comments.