Gabriel Anhaia

Posted on May 24

Design a Multi-Device Authentication Service (Sessions vs JWT vs Passkeys)

#systemdesign #interview #security #auth


Sessions, JWTs, refresh tokens, OAuth, magic links, passkeys. Your authentication service has to support all of them, on any device, with a "log out everywhere" button that actually works. Most candidates pick one mechanism and freeze. The interviewer wants to see you weigh the tradeoff and design the seams.

Here are the five decisions that frame the whole thing, the three implementations of the log-out button (cheap to bulletproof), and the gotcha that quietly breaks production at month four.

The five decisions

Before you draw a single box, write these on the whiteboard:

1. Token shape: opaque session ID or self-contained JWT?
2. Stateful or stateless: does verification hit a store on every request?
3. Revocation model: how do you kill a credential before it expires?
4. Device tracking: one session per user, or one per device?
5. MFA gating: always-on, risk-based, or step-up on sensitive actions?

Each decision constrains the next. Pick stateless JWTs and revocation becomes a research project. Pick per-user sessions and "log out my laptop only" becomes impossible. The interviewer is watching you handle the constraint propagation, not memorize a stack.

Session model: server-side store, easy revocation

The classic shape. Login succeeds, the server creates a session row and returns an opaque ID via cookie. Every request hits a store (Redis, Postgres, whatever) to validate.

CREATE TABLE sessions (
  id            BYTEA PRIMARY KEY,          -- 256-bit random
  user_id       UUID NOT NULL,
  device_id     UUID NOT NULL,
  created_at    TIMESTAMPTZ NOT NULL,
  last_seen_at  TIMESTAMPTZ NOT NULL,
  expires_at    TIMESTAMPTZ NOT NULL,
  ip            INET,
  user_agent    TEXT,
  mfa_verified  BOOLEAN NOT NULL DEFAULT false
);

CREATE INDEX sessions_user_id_idx ON sessions (user_id);
CREATE INDEX sessions_expires_idx ON sessions (expires_at);

Revocation is a DELETE. Log out everywhere is DELETE FROM sessions WHERE user_id = $1. Device-tracking is free because every row already names a device. Auditing is free too.
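To make the revocation story concrete, here is a minimal sketch that uses Python's built-in sqlite3 as a stand-in for Postgres or Redis. The table is trimmed to four columns, and the helper names (create_session, validate, logout_device, logout_everywhere) are illustrative assumptions, not part of any framework.

```python
import secrets
import sqlite3
import time
from typing import Optional

# Assumption: in-memory SQLite stands in for the real session store.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE sessions ("
    " id BLOB PRIMARY KEY, user_id TEXT NOT NULL,"
    " device_id TEXT NOT NULL, expires_at REAL NOT NULL)"
)

def create_session(user_id: str, device_id: str, ttl_s: float = 3600.0) -> bytes:
    """Insert a session row and return its opaque 256-bit ID."""
    sid = secrets.token_bytes(32)
    db.execute(
        "INSERT INTO sessions VALUES (?, ?, ?, ?)",
        (sid, user_id, device_id, time.time() + ttl_s),
    )
    return sid

def validate(sid: bytes) -> Optional[str]:
    """Return the user_id for a live session, or None if missing or expired."""
    row = db.execute(
        "SELECT user_id FROM sessions WHERE id = ? AND expires_at > ?",
        (sid, time.time()),
    ).fetchone()
    return row[0] if row else None

def logout_device(user_id: str, device_id: str) -> None:
    """Revoke the sessions of a single device."""
    db.execute(
        "DELETE FROM sessions WHERE user_id = ? AND device_id = ?",
        (user_id, device_id),
    )

def logout_everywhere(user_id: str) -> None:
    """Revoke every session the user has, on every device."""
    db.execute("DELETE FROM sessions WHERE user_id = ?", (user_id,))
```

Because validation is a lookup against the same table, the next request from a deleted session fails immediately. There is no grace window to reason about.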

The cost is a round trip per request. At 50k RPS, that's 50k Redis hits per second. Doable, but you'll cache the session in the gateway's local memory for 5-10 seconds to absorb the load, and you'll size Redis for the working set of active sessions, not the total count.
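The gateway-side cache could look roughly like this. It is a sketch under assumptions: SessionCache and its lookup callback are hypothetical names, the store is whatever function you pass in, and there is no size bound or eviction, which a real gateway would need.

```python
import time
from typing import Callable, Dict, Optional, Tuple

class SessionCache:
    """Tiny in-process TTL cache in front of the session store."""

    def __init__(
        self,
        lookup: Callable[[str], Optional[str]],
        ttl_s: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._lookup = lookup    # hits the real store (Redis, Postgres, ...)
        self._ttl_s = ttl_s
        self._clock = clock      # injectable so tests can control time
        self._entries: Dict[str, Tuple[float, Optional[str]]] = {}

    def get(self, session_id: str) -> Optional[str]:
        """Return the cached user_id, refreshing from the store when stale."""
        now = self._clock()
        hit = self._entries.get(session_id)
        if hit is not None and hit[0] > now:
            return hit[1]                     # fresh enough: skip the store
        user_id = self._lookup(session_id)    # miss or stale: ask the store
        self._entries[session_id] = (now + self._ttl_s, user_id)
        return user_id
```

The tradeoff is explicit: a deleted session can keep validating on a given gateway for up to ttl_s seconds. That number is your real revocation latency, so pick it on purpose.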

The hidden cost: every service that wants to know "who is this" needs the store. Pick this model and your auth service becomes a hard dependency for the whole platform. Plan for it.

JWT model: stateless, fast, revocation is the hard problem

A JWT carries its own claims, signed by the auth service. Any service with the public key can validate it without calling back. No store lookup, no round trip. Beautiful, until you need to kick someone out.

// Verifying a JWT on a downstream service
import { jwtVerify } from 'jose';

const { payload } = await jwtVerify(token, publicKey, {
  issuer: 'auth.example.com',
  audience: 'api.example.com',
});

// payload.sub is the user id. No DB call.

The token's claims are frozen at issue time. If you change a user's role, demote them, or notice their token leaked, the existing JWT keeps working until it expires. That's the whole catch. Every revocation strategy is a workaround for this one property.

Three workarounds, in order of pain:

1. Short expiry + refresh tokens. Issue JWTs that live 5-15 minutes. Compromise window is small. Refresh tokens live in a server store and can be revoked.
2. Revocation list (denylist). Maintain a set of jti claims that should be rejected. Every verifier consults the list. You just put the store back; the win is partial (the list holds only revoked, unexpired tokens, so it stays small enough to cache or replicate to verifiers).
3. Versioned claims. Embed a token_version claim. Bump the version in the user record on revocation. Verifiers compare to the user's current version. Cheap to write, expensive to read (still a store hit).

Notice the pattern: every workaround re-introduces the state the JWT was meant to avoid. Stateless verification is the wrong goal. The right goal is cheap verification, and short JWTs + refresh tokens get you there.

Refresh tokens: the bridge

Refresh tokens are the seam where session-like state meets stateless verification. Issue a short JWT (15 min) plus a long refresh token (30 days). The refresh token sits in a server store and looks like a session row.

CREATE TABLE refresh_tokens (
  id              BYTEA PRIMARY KEY,
  user_id         UUID NOT NULL,
  device_id       UUID NOT NULL,
  family_id       UUID NOT NULL,            -- rotation lineage
  issued_at       TIMESTAMPTZ NOT NULL,
  expires_at      TIMESTAMPTZ NOT NULL,
  revoked_at      TIMESTAMPTZ,
  replaced_by     BYTEA REFERENCES refresh_tokens(id)
);

When the JWT expires, the client posts the refresh token to /auth/refresh. The server validates it, rotates it (issues a new one and revokes the old), and returns a fresh JWT + fresh refresh token. The family_id tracks the rotation lineage.

The rotation matters for one reason: replay detection. If a refresh token gets used twice (once by the legit client, once by an attacker who stole it), the server sees a token in family_id = X being used after it was rotated. That's a hard signal. Revoke the entire family and force re-login.

from datetime import datetime, timezone

def refresh(token_id: bytes) -> NewTokenPair:
    # Run the whole function in one transaction (ideally with SELECT ... FOR UPDATE)
    # so two concurrent refreshes cannot both rotate the same token.
    row = db.fetch_one(
        "SELECT * FROM refresh_tokens WHERE id = $1", token_id
    )
    if row is None:
        raise InvalidToken()
    if row.revoked_at is not None:
        # Replay detected: this token was already rotated. Kill the whole family.
        db.execute(
            "UPDATE refresh_tokens SET revoked_at = now() "
            "WHERE family_id = $1 AND revoked_at IS NULL",
            row.family_id,
        )
        raise InvalidToken()
    if row.expires_at < datetime.now(timezone.utc):
        raise ExpiredToken()

    # Rotate: issue the successor in the same family, then retire this token.
    new_refresh = issue_refresh(row.user_id, row.device_id, row.family_id)
    db.execute(
        "UPDATE refresh_tokens SET revoked_at = now(), replaced_by = $1 "
        "WHERE id = $2",
        new_refresh.id, token_id,
    )
    return NewTokenPair(jwt=issue_jwt(row.user_id), refresh=new_refresh)

This is the architecture worth defending in the interview. Short JWTs for cheap verification, refresh tokens for revocation, families for replay detection. Many modern auth services ship some variant of this shape.
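If the interviewer asks you to walk through a stolen-token scenario, a runnable toy helps. The sketch below keeps refresh tokens in a dict instead of a table; RefreshRow, store, issue_refresh, and rotate are hypothetical names for illustration, and a production store would keep a hash of the token rather than the raw value.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class RefreshRow:
    user_id: str
    family_id: str
    expires_at: float
    revoked_at: Optional[float] = None

class InvalidToken(Exception):
    pass

store: Dict[str, RefreshRow] = {}  # token -> row; stands in for the table

def issue_refresh(user_id: str, family_id: Optional[str] = None,
                  ttl_s: float = 30 * 86400) -> str:
    """Create a refresh token, starting a new family unless one is given."""
    token = secrets.token_urlsafe(32)
    family = family_id or secrets.token_hex(8)
    store[token] = RefreshRow(user_id, family, time.time() + ttl_s)
    return token

def rotate(token: str) -> str:
    """Exchange a refresh token for its successor, detecting reuse."""
    row = store.get(token)
    if row is None:
        raise InvalidToken("unknown token")
    now = time.time()
    if row.revoked_at is not None:
        # Reuse of a rotated token: revoke every live token in the family.
        for other in store.values():
            if other.family_id == row.family_id and other.revoked_at is None:
                other.revoked_at = now
        raise InvalidToken("replay detected")
    if row.expires_at < now:
        raise InvalidToken("expired")
    row.revoked_at = now
    return issue_refresh(row.user_id, row.family_id)
```

The sequence to narrate: the legit client rotates t1 into t2, the attacker replays t1, the server revokes the family, and t2 stops working too. Both parties are logged out, which is the point: the server cannot tell who is who, so it forces a fresh login.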

Passkeys (WebAuthn): the 2026 default

Passkeys replaced "password + TOTP" as the recommended default for new sign-in flows. They're public-key credentials stored in the user's device keychain (or roaming via iCloud Keychain, Google Password Manager, 1Password). The server never sees a secret.

Registration:

Server sends a challenge plus user info.
Browser asks the authenticator (Touch ID, Windows Hello, security key) to generate a keypair.
Authenticator returns the public key plus an attestation. Server stores the public key against the user.

Authentication:

Server sends a challenge.
Authenticator signs it with the matching private key.
Server verifies the signature against the stored public key.
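The challenge only protects against replay if the server makes it random, short-lived, and single-use. Here is a minimal sketch of that bookkeeping. The two-minute TTL and the function names are assumptions, the dict stands in for a shared store, and signature verification itself is left to a WebAuthn library.

```python
import secrets
import time
from typing import Dict, Optional

CHALLENGE_TTL_S = 120.0            # assumption: two minutes per ceremony
_pending: Dict[bytes, float] = {}  # challenge -> expiry timestamp

def new_challenge(now: Optional[float] = None) -> bytes:
    """Create a random challenge and remember when it stops being valid."""
    now = time.time() if now is None else now
    challenge = secrets.token_bytes(32)
    _pending[challenge] = now + CHALLENGE_TTL_S
    return challenge

def consume_challenge(challenge: bytes, now: Optional[float] = None) -> bool:
    """Return True only for a known, unexpired challenge, and only once."""
    now = time.time() if now is None else now
    expires_at = _pending.pop(challenge, None)  # pop makes it single-use
    return expires_at is not None and now < expires_at
```

In a real service you would also bind the challenge to the login attempt or session that requested it, so one user's challenge cannot complete another user's ceremony.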

The server changes you need:

CREATE TABLE webauthn_credentials (
  credential_id   BYTEA PRIMARY KEY,
  user_id         UUID NOT NULL,
  public_key      BYTEA NOT NULL,
  sign_count      BIGINT NOT NULL DEFAULT 0,
  transports      TEXT[],                   -- ['internal','hybrid','usb']
  aaguid          UUID,                     -- authenticator model
  backup_eligible BOOLEAN NOT NULL,
  backup_state    BOOLEAN NOT NULL,
  created_at      TIMESTAMPTZ NOT NULL,
  last_used_at    TIMESTAMPTZ
);

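sign_count is a signature counter the authenticator returns with each assertion. If a credential has reported a non-zero counter and a later value fails to increase, treat that as a signal the credential may have been cloned. Many synced passkey providers always report 0, in which case there is nothing to compare. backup_eligible plus backup_state tell you whether the passkey can be synced and currently is (iCloud Keychain) or is device-bound (a YubiKey).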
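The counter check is small enough to sketch. This version follows the comparison described in the WebAuthn specification as I understand it: skip the check when both values are zero (many synced passkeys never increment the counter), accept a strictly larger value, and flag anything else. The function name and return shape are assumptions, and what you do with a flagged login is a policy decision.

```python
from typing import Tuple

def check_sign_count(stored: int, received: int) -> Tuple[bool, int]:
    """Return (suspicious, value_to_store) for one authentication."""
    if stored == 0 and received == 0:
        # The authenticator does not implement a counter; nothing to compare.
        return False, 0
    if received > stored:
        return False, received   # normal case: the counter advanced
    return True, stored          # equal or lower: possible cloned credential
```

A flagged result is a risk signal, not proof. Reasonable responses range from logging it to forcing a step-up or disabling the credential.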

Once a user signs in with a passkey, your service issues the same JWT + refresh token pair as before. WebAuthn replaces the credential check, not the session layer. That's a useful framing for the interview: passkeys are an authentication factor, the rest of the architecture stays.

The one new failure mode: a user loses every device with the passkey. Your account-recovery flow has to exist, and it can't be a password reset sent to email, which would defeat the point. Use a recovery code printed at enrollment, a verified backup device, or a delegated recovery contact.
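For the recovery-code option, a rough sketch: generate the codes once, show them to the user, store only hashes, and burn each code on use. The code count, code length, and function names are assumptions. A real flow would also rate-limit attempts.

```python
import hashlib
import secrets
from typing import List, Set

def generate_recovery_codes(n: int = 8) -> List[str]:
    """Return n random codes to display once at enrollment."""
    return [secrets.token_hex(10) for _ in range(n)]  # 80 bits each

def hash_code(code: str) -> str:
    """Hash a code so the server never stores it in the clear."""
    return hashlib.sha256(code.encode()).hexdigest()

def redeem(code: str, stored_hashes: Set[str]) -> bool:
    """Consume a code: True the first time, False on any later attempt."""
    candidate = hash_code(code.strip().lower())
    if candidate in stored_hashes:
        stored_hashes.discard(candidate)  # single-use
        return True
    return False
```

At enrollment you would persist {hash_code(c) for c in codes} against the user and never log the raw codes.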

"Log out everywhere": three implementations

This is the question that separates candidates. Three answers, cheap to bulletproof.

1. Delete sessions / refresh tokens. Works if you're session-based or short-JWT + refresh. One SQL statement per table:

DELETE FROM refresh_tokens WHERE user_id = $1;
DELETE FROM sessions       WHERE user_id = $1;

The catch: existing JWTs keep working until they expire. If your JWT lives 15 minutes, the user's old devices have up to 15 minutes of grace. For consumer apps that's usually fine. For a banking app it isn't.

2. Bump a token version. Add token_version to the user row and embed it as a claim (ver in the code that follows) in every JWT. On verify, compare. Mismatch = reject.

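import jwt  # assuming PyJWT

def verify(token: str) -> Claims:
    # PyJWT requires an explicit algorithm allowlist; RS256 is an assumption here.
    claims = jwt.decode(
        token, public_key, algorithms=["RS256"],
        audience="api.example.com", issuer="auth.example.com",
    )
    # One cache hit per request; 'ver' carries the user's token_version.
    user_version = cache.get_user_token_version(claims['sub'])
    if claims['ver'] != user_version:
        raise Revoked()
    return claims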

Log out everywhere becomes UPDATE users SET token_version = token_version + 1 WHERE id = $1. Every existing JWT is now invalid. Verifiers pay one cache hit per request (cache the version aggressively, invalidate on bump).
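"Cache aggressively, invalidate on bump" is easy to say and easy to get wrong, so here is a single-process sketch. TokenVersions is a hypothetical class: one dict plays the users table, another plays the verifier's cache, and db_reads exists only to show how often the store is hit.

```python
from typing import Dict

class Revoked(Exception):
    pass

class TokenVersions:
    """Token versions with a read-through cache that is cleared on bump."""

    def __init__(self) -> None:
        self._db: Dict[str, int] = {}     # stand-in for users.token_version
        self._cache: Dict[str, int] = {}  # verifier-side cache
        self.db_reads = 0

    def current(self, user_id: str) -> int:
        if user_id not in self._cache:
            self.db_reads += 1
            self._cache[user_id] = self._db.get(user_id, 0)
        return self._cache[user_id]

    def bump(self, user_id: str) -> None:
        """Log out everywhere: increment the version, drop the cached copy."""
        self._db[user_id] = self._db.get(user_id, 0) + 1
        self._cache.pop(user_id, None)

def check_version(claims: dict, versions: TokenVersions) -> dict:
    """Reject claims whose 'ver' no longer matches the user's version."""
    if claims["ver"] != versions.current(claims["sub"]):
        raise Revoked()
    return claims
```

With more than one verifier process, the pop in bump only clears the local cache. You would need to broadcast the invalidation or put a short TTL on cached versions, and that TTL becomes your revocation latency.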

3. Per-credential revocation broadcast. The bulletproof option. Maintain a denylist of jti claims with TTLs matching the JWT's exp. Push revocations to every verifier via Redis pub/sub, NATS, or a streaming bus. Verifiers keep a local Bloom filter, fall back to the central store on a hit.

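import jwt  # assuming PyJWT

def verify(token: str) -> Claims:
    # PyJWT requires an explicit algorithm allowlist; RS256 is an assumption here.
    claims = jwt.decode(
        token, public_key, algorithms=["RS256"],
        audience="api.example.com", issuer="auth.example.com",
    )
    # The in-process Bloom filter answers "definitely not revoked" for free.
    # Only a possible hit pays for the round trip to the central store.
    if revocation_bloom.might_contain(claims['jti']):
        if redis.sismember('revoked', claims['jti']):
            raise Revoked()
    return claims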

This combines low-latency verification (Bloom filter is in-process) with hard revocation guarantees. The cost is operational: a streaming bus to maintain, a Bloom filter to size, and a denylist that has to expire entries.
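If you have never built a Bloom filter, the in-process half is only a few lines. This is a sketch with assumed sizing (8192 bits, 4 hashes) and hypothetical names. The central set stands in for the shared store, and real sizing depends on how many revoked tokens are live at once.

```python
import hashlib
from typing import Iterator, Set

class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, some false positives."""

    def __init__(self, size_bits: int = 8192, num_hashes: int = 4) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self._bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self._bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(
            self._bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )

def is_revoked(jti: str, bloom: BloomFilter, central: Set[str]) -> bool:
    """Check the local filter first; confirm possible hits centrally."""
    return bloom.might_contain(jti) and jti in central
```

A plain Bloom filter cannot delete entries, so verifiers typically rebuild it on a schedule from the still-live denylist. Otherwise the false-positive rate creeps up and more requests fall through to the central store.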

For most products, option 1 plus short JWTs is the right answer. Option 2 if your JWTs live longer than a minute and you can't tolerate that grace. Option 3 if you're under regulatory pressure (PCI, HIPAA, SOC 2 with specific revocation SLAs).

Device tracking: sessions per device

A row per (user, device) is what makes "log out my laptop only" possible. The device_id is generated by the client on first auth and persisted in secure storage (Keychain on iOS, Keystore-backed encrypted storage on Android, IndexedDB + a fingerprint on web). It survives reinstalls if you ship a recovery hint, and it survives nothing if you don't.

Track three things per device row:

last_seen_at: update on every refresh, drives "active devices" UI.
ip and a geo lookup: drives "new sign-in from Brazil" emails.
user_agent plus a friendly device label ("Pixel 9 Pro", "Chrome on macOS"): what the user actually recognizes in the settings page.

Avoid hard fingerprinting (canvas, WebGL, font enumeration). It's brittle, it breaks privacy modes, and the GDPR conversation is uncomfortable. A self-generated device ID covers most of the use cases without the legal surface area.
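The self-generated ID is about as simple as it sounds. In this sketch a plain file stands in for the platform's secure storage (an assumption for illustration; on a real client you would use the Keychain or its equivalent), and load_or_create_device_id is a hypothetical helper.

```python
import uuid
from pathlib import Path

def load_or_create_device_id(path: Path) -> str:
    """Return the persisted device ID, generating one on first use."""
    if path.exists():
        return path.read_text().strip()
    device_id = str(uuid.uuid4())  # random; encodes nothing about the hardware
    path.write_text(device_id)
    return device_id
```

Because the ID is random, it identifies an install, not a person or a machine. Wipe the storage and the device shows up as new, which is exactly the behavior the "new sign-in" email relies on.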

MFA gating: risk-based vs always-on

Always-on MFA is the safe answer. Every login asks for a second factor. Users hate it. They also enable it and forget about it, which is the point.

Risk-based MFA is the better UX answer. Score each login attempt: new device, new geo, unusual hour, password reset within 24h, impossible travel since last session. Above a threshold, demand a second factor. Below, skip.

def risk_score(attempt: LoginAttempt) -> int:
    score = 0
    if attempt.device_id not in known_devices(attempt.user_id):
        score += 40
    if attempt.geo != last_geo(attempt.user_id):
        score += 20
    if impossible_travel(attempt):
        score += 60
    if attempt.password_changed_within(hours=24):
        score += 20
    if attempt.hour not in usual_hours(attempt.user_id):
        score += 10
    return score

A score above 50 triggers a step-up: passkey, TOTP, push approval. The implementation lives in the auth service; the rules live in config so you can tune them without a deploy.
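One way to keep the rules in config is to treat the weights and threshold as data. The sketch below assumes a JSON document (RULES_JSON, which in practice would come from a config file or service) and a dict of boolean signals computed elsewhere. The signal names are hypothetical, and the numbers mirror the ones used in this section.

```python
import json
from typing import Dict

# Assumption: in production this text is loaded from config, not inlined.
RULES_JSON = """
{
  "threshold": 50,
  "weights": {
    "new_device": 40,
    "new_geo": 20,
    "impossible_travel": 60,
    "recent_password_change": 20,
    "unusual_hour": 10
  }
}
"""

def load_rules(raw: str) -> dict:
    """Parse the rule document; call again whenever the config changes."""
    return json.loads(raw)

def score_signals(signals: Dict[str, bool], rules: dict) -> int:
    """Sum the weights of every signal that fired; unknown signals score 0."""
    weights = rules["weights"]
    return sum(weights.get(name, 0) for name, fired in signals.items() if fired)

def needs_step_up(signals: Dict[str, bool], rules: dict) -> bool:
    """Demand a second factor when the score exceeds the threshold."""
    return score_signals(signals, rules) > rules["threshold"]
```

Tuning then means editing numbers, not code. A new device alone scores 40 and passes; a new device from a new geo scores 60 and gets the prompt.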

Step-up MFA is the related pattern: sensitive actions (change password, add a new payment method, delete account) trigger MFA even mid-session. Implement it as an mfa_verified_at timestamp on the session, replacing the mfa_verified boolean from the sessions schema. A verification within the last 5 minutes counts; anything older requires a re-prompt.
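The freshness check is a one-liner once the timestamp exists. A minimal sketch, assuming the session exposes a timezone-aware mfa_verified_at (or None if MFA never happened); step_up_required is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

STEP_UP_WINDOW = timedelta(minutes=5)

def step_up_required(
    mfa_verified_at: Optional[datetime],
    now: Optional[datetime] = None,
) -> bool:
    """Return True when a sensitive action must re-prompt for MFA."""
    now = now or datetime.now(timezone.utc)
    if mfa_verified_at is None:
        return True  # this session never completed MFA
    return now - mfa_verified_at > STEP_UP_WINDOW
```

Call it in the handler for each sensitive action. On True, return a challenge instead of performing the action, then update mfa_verified_at once the user passes.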

The 90-second answer

If the interviewer asks for the elevator pitch:

Short-lived JWTs (15 min) for fast, stateless verification at every service. Long-lived refresh tokens (30 days, rotated on use, families tracked for replay detection) backed by a Postgres or Redis store. That's the revocation surface. Per-device session rows so the user can see and kill individual devices. Passkeys (WebAuthn) as the primary factor for new accounts, password-plus-TOTP as the legacy fallback. Risk-based step-up MFA for sensitive actions. "Log out everywhere" bumps a token_version on the user record so existing JWTs fail their next verify, and deletes all refresh tokens so they can't bootstrap a new JWT. Device-tracking via a self-generated client ID, not browser fingerprinting.

That covers the five decisions, names the rotation pattern, includes a real revocation strategy, and namechecks the modern default. It's about 90 seconds spoken.

The gotcha: JWT revocation lists grow without bound

The trap I see in code reviews more than any other: someone implements a JWT denylist as a Redis set keyed on user ID, with no TTL on the entries.

# Don't do this
redis.sadd(f'revoked:{user_id}', token_jti)

A year later, that set has 50k JTIs in it, half of them long expired, and every verify is paying for a membership check against a 200KB set. Eventually the verify path dominates your CPU and someone deletes the set, which silently un-revokes everything.

Two fixes:

Store JTIs with TTLs matching the JWT exp. Use individual keys, not sets. SET revoked:<jti> 1 EX <seconds-until-exp>. Redis expires them on its own.
Use short JWTs. If your JWT lives 5 minutes, you barely need a denylist. Revoke the refresh token, wait 5 minutes, done. The denylist becomes a defense-in-depth layer for the worst case, not the primary mechanism.
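The first fix boils down to "never remember a revocation longer than the token could have lived". Here is an in-memory sketch of that rule. TtlDenylist is a hypothetical stand-in for per-key expiry in Redis, with the clock passed in so the behavior is easy to test.

```python
import time
from typing import Dict, Optional

class TtlDenylist:
    """Denylist whose entries disappear when the token itself would expire."""

    def __init__(self) -> None:
        self._expiry: Dict[str, float] = {}  # jti -> the token's exp

    def revoke(self, jti: str, exp: float, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        if exp <= now:
            return                   # already expired: nothing to remember
        self._expiry[jti] = exp      # entry lives exactly as long as the token

    def is_revoked(self, jti: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        exp = self._expiry.get(jti)
        if exp is None:
            return False
        if exp <= now:
            del self._expiry[jti]    # lazy cleanup, like a key expiring
            return False
        return True

    def __len__(self) -> int:
        return len(self._expiry)
```

The size of the denylist is now bounded by the revocation rate times the token lifetime, instead of growing for as long as the service has existed.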

Or, put bluntly: if your JWTs live an hour and you need real revocation, your JWTs are too long.

The interview takeaway: revocation drives the architecture. Pick a token lifetime that matches your revocation SLA. Five minutes if you can absorb the refresh chatter, fifteen if you can't, never an hour for anything that matters. The rest of the design (refresh families, passkeys, device tracking, MFA gating) falls out of that one decision.

What's your team's revocation SLA, and does your current JWT lifetime actually match it?
