wartzar-bee

Posted on May 29 • Originally published at refresh-guard-guide.pages.dev

The OAuth refresh-token race that logs your users out — and the two-layer fix

#oauth #security #node #webdev

Your auth has worked for months. Then you ship a small change — a page that fires a few API calls in parallel, a worker pool, a second CLI instance, an agent — and suddenly users get logged out at random. The logs say invalid_grant. Sometimes it's worse: refresh_token_reused, and a working session is nuked everywhere.

Nothing in your token flow is wrong. The bug is that you're doing the correct flow concurrently with a token that only tolerates being used once.

The race, step by step

An OAuth2 client holds a short-lived access token and a long-lived refresh token. When the access token expires, you POST the refresh token to the token endpoint and get a new access token.

With refresh-token rotation — now the default at Okta, Auth0, Microsoft, and Salesforce, and recommended by the OAuth 2.0 Security BCP for public clients — that refresh token is single-use. The refresh response carries a new refresh token, and the one you just sent is invalidated the instant the first refresh succeeds.
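
The refresh request described above can be sketched as follows. This is a minimal illustration, not any provider's SDK: the name doRefresh matches the calls later in this post, but the options object (tokenUrl, clientId, fetchImpl) is an assumption made here so the function can be exercised without a network, and it assumes a public client that sends only client_id.

```javascript
// Hypothetical doRefresh. tokenUrl, clientId and fetchImpl are illustrative
// parameters, not part of the original post. Assumes Node 18+ (global fetch).
async function doRefresh(creds, { tokenUrl, clientId, fetchImpl = fetch }) {
  // RFC 6749 section 6: a form-encoded POST with grant_type=refresh_token.
  const body = new URLSearchParams({
    grant_type: "refresh_token",
    refresh_token: creds.refresh_token,
    client_id: clientId,
  });
  const res = await fetchImpl(tokenUrl, {
    method: "POST",
    headers: { "Content-Type": "application/x-www-form-urlencoded" },
    body,
  });
  const json = await res.json();
  if (!res.ok) {
    // Token endpoints report failures as { error: "invalid_grant", ... }.
    const err = new Error(json.error_description || json.error || `HTTP ${res.status}`);
    err.code = json.error;
    throw err;
  }
  // With rotation, json.refresh_token is the NEW single-use token.
  return json; // { access_token, expires_in, refresh_token? }
}
```

To get the one-argument doRefresh(creds) used in the rest of this post, wrap it in a closure that supplies your own token URL and client ID.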

The bug appears whenever more than one request needs a token at the same time. With two callers A and B:

t0   access token is expired (or within the skew window)
t1   caller A reads creds, sees "expired", POSTs refresh_token = R0
t2   caller B reads creds, sees "expired", POSTs refresh_token = R0   // same token!
t3   provider processes A: issues access A1 + rotates R0 -> R1, REVOKES R0
t4   provider processes B: R0 is revoked  ->  400 invalid_grant

Both callers did exactly what the textbook says. The loser of the race presented a token the winner already rotated away. That's the invalid_grant.
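
The timeline can be reproduced in a few lines. The sketch below uses a toy in-memory provider (makeProvider is a stand-in invented for this example, not a real API) that rotates on every refresh and rejects any token it has already rotated away.

```javascript
// Toy provider with rotation: each refresh token works exactly once.
// makeProvider and demoRace are hypothetical names for illustration only.
function makeProvider() {
  let current = "R0";
  let n = 0;
  return {
    async refresh(refreshToken) {
      await new Promise((r) => setTimeout(r, 10)); // simulated network latency
      if (refreshToken !== current) {
        const err = new Error("invalid_grant");
        err.code = "invalid_grant";
        throw err;
      }
      n += 1;
      current = `R${n}`; // rotate: the token just presented is now dead
      return { access_token: `A${n}`, refresh_token: current, expires_in: 3600 };
    },
  };
}

async function demoRace() {
  const provider = makeProvider();
  const stored = { refresh_token: "R0" };
  // Callers A and B both read R0 before either response comes back.
  const results = await Promise.allSettled([
    provider.refresh(stored.refresh_token),
    provider.refresh(stored.refresh_token),
  ]);
  return results.map((r) => (r.status === "fulfilled" ? r.value.access_token : r.reason.code));
}

demoRace().then((out) => console.log(out)); // [ 'A1', 'invalid_grant' ]
```

Neither call is wrong in isolation; the failure only exists because both were in flight with the same R0.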

Why it can be worse than a stray error

Some providers (Okta, Auth0, Salesforce) run refresh-token reuse detection. Presenting an already-rotated refresh token looks identical to a stolen token being replayed — the provider can't tell your innocent race from an attack — so it does the safe thing and revokes the entire refresh-token family, logging the user out everywhere.

That's the difference between a retryable hiccup and a support ticket. On these providers, serializing refresh isn't an optimization — it's a correctness requirement.

The trap: invalid_grant reads like "the user is logged out, re-auth them." Under concurrency it usually means "a sibling request already refreshed; your copy is stale." Re-authenticating on every concurrency-induced invalid_grant produces exactly the "surprise re-login" symptom you're trying to kill.

The fix has two layers — and people ship only one

The whole fix reduces to one rule: make exactly one refresh happen, and have every other caller use its result instead of starting their own. But there are two scopes, and using the wrong-scope fix is the #1 reason the bug "comes back" after you thought you fixed it.

Layer 1 — In-process single-flight (one process, many concurrent calls)

The first caller to see expiry starts the refresh and stores the in-flight Promise. Every other caller awaits that same promise instead of starting its own. JavaScript's single-threaded event loop makes "check the flag, set the promise" atomic — no lock needed.

let inflight = null;            // the single shared refresh promise (null when idle)
let creds = loadCreds();        // { access_token, refresh_token, expires_at }
const SKEW_MS = 60_000;         // refresh ~1 min before real expiry

function isValid(c) {
  return c?.access_token && c.expires_at - SKEW_MS > Date.now();
}

async function getValidToken() {
  if (isValid(creds)) return creds.access_token;     // fast path, no refresh

  // SINGLE-FLIGHT: if a refresh is already running, await THAT one.
  if (!inflight) {
    inflight = doRefresh(creds)
      .then(res => { creds = mergeRotation(creds, res); return creds; })  // rotation-merge (defined below) also sets expires_at
      .finally(() => { inflight = null; });          // clear so the next expiry can refresh
  }
  return (await inflight).access_token;              // every concurrent caller awaits the SAME promise
}

Two details that are easy to get wrong, and both bite in production:

Clear the promise in finally, not then. Otherwise a failed refresh leaves a rejected promise wedged in inflight forever, and every future call re-rejects with the stale error — a "stuck promise." finally clears it on success and failure so the next call retries cleanly.
Store the promise before the first await. Assign inflight synchronously, so a second caller arriving on the next microtask actually sees it.

With 50 callers hitting an expired token, exactly one refresh runs and the other 49 await it. If your token lives in one process — a server, a single worker, a browser tab — single-flight plus rotation-merge (below) is the complete fix. You do not need a lock file.
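
To check both claims (one refresh for 50 callers, and no stuck promise after a failure), here is a small harness. It wraps the same single-flight logic in a factory, makeTokenGetter, a name introduced here so each demo gets its own state. The refresh functions are fakes that return credentials with expires_at already set.

```javascript
// Same single-flight pattern as above, wrapped in a (hypothetical) factory
// so each demo gets fresh state. doRefresh is injected and faked below.
function makeTokenGetter(doRefresh, initial) {
  let creds = initial;
  let inflight = null;
  const isValid = (c) => Boolean(c && c.access_token) && c.expires_at > Date.now();
  return async function getValidToken() {
    if (isValid(creds)) return creds.access_token;
    if (!inflight) {
      inflight = doRefresh(creds)
        .then((next) => { creds = next; return next; })
        .finally(() => { inflight = null; }); // cleared on success AND failure
    }
    return (await inflight).access_token;
  };
}

const expired = { access_token: "A0", refresh_token: "R0", expires_at: 0 };
const freshCreds = () => ({ access_token: "A1", refresh_token: "R1", expires_at: Date.now() + 3_600_000 });

async function demoFiftyCallers() {
  let refreshCalls = 0;
  const fakeRefresh = async () => {
    refreshCalls += 1;
    await new Promise((r) => setTimeout(r, 10)); // simulated latency
    return freshCreds();
  };
  const getValidToken = makeTokenGetter(fakeRefresh, expired);
  const tokens = await Promise.all(Array.from({ length: 50 }, () => getValidToken()));
  return { refreshCalls, distinctTokens: new Set(tokens).size };
}

async function demoRetryAfterFailure() {
  let attempt = 0;
  const flakyRefresh = async () => {
    attempt += 1;
    if (attempt === 1) throw new Error("network down");
    return freshCreds();
  };
  const getValidToken = makeTokenGetter(flakyRefresh, expired);
  const first = await getValidToken().catch((e) => e.message);
  const second = await getValidToken(); // retries because finally cleared inflight
  return { first, second };
}

demoFiftyCallers().then(console.log);      // { refreshCalls: 1, distinctTokens: 1 }
demoRetryAfterFailure().then(console.log); // { first: 'network down', second: 'A1' }
```

If the clearing step is moved from finally into then, the second demo never recovers: every later call awaits the same rejected promise.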

Layer 2 — Cross-process lock (many processes share one credential)

Here's the part people miss. An in-process lock — a shared promise, an async mutex, a library's internal lock — coalesces refreshes within one event loop. Two separate processes each have their own memory and their own inflight variable. They cannot see each other's in-flight refresh. Two CLIs, two workers, two containers, or two agents reading the same credential file are right back in the race; single-flight did nothing for them.

Your topology	What you need
One process, concurrent calls / request fan-out	In-process single-flight (+ rotation-merge)
One token file shared by multiple CLIs / workers / agents	Single-flight per process + a cross-process lock + re-read + atomic write
Many machines sharing one credential	A distributed lock (Redis/DB) or a token-broker service

For the multi-process case you need three things together: an exclusive lock, a re-read after acquiring it, and an atomic write.

async function getValidTokenMultiProcess() {
  let creds = await readToken();
  if (isValid(creds)) return creds.access_token;      // fast path, no lock

  return withTokenLock(async () => {                  // O_EXCL lock file: one process wins
    creds = await readToken();                         // *** RE-READ inside the lock ***
    if (isValid(creds)) return creds.access_token;     // a sibling already refreshed -> done
    const next = mergeRotation(creds, await doRefresh(creds));
    await writeTokenAtomic(next);                      // temp file + rename (atomic swap)
    return next.access_token;
  });
}

The re-read after acquiring the lock is the step everyone forgets — and it's the whole point. By the time you get the lock, the process that held it before you may have already refreshed. If you blindly refresh anyway, you send a just-rotated token and reproduce the exact invalid_grant you were trying to avoid, only now serialized. Re-read, and if it's already fresh, use it and skip the refresh entirely. That converts "two refreshes serialized" (still burns the rotated token on the second) into "one refresh + one cache hit."
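
The lock that the re-read sits inside can be fairly small. The following is one possible sketch of withTokenLock built on an O_EXCL lock file. The options (lockPath, retryMs, timeoutMs) are assumptions added for illustration, and the sketch deliberately does not handle stale locks left behind by a crashed process, which a production version would need to.

```javascript
const fs = require("node:fs/promises");

// Hypothetical withTokenLock: lockPath, retryMs and timeoutMs are illustrative.
// Caveat: a process that crashes while holding the lock leaves the file behind.
async function withTokenLock(fn, { lockPath, retryMs = 50, timeoutMs = 10_000 }) {
  const deadline = Date.now() + timeoutMs;
  let handle;
  for (;;) {
    try {
      // "wx" maps to O_CREAT | O_EXCL: creation fails if the file already exists,
      // so exactly one process can create it.
      handle = await fs.open(lockPath, "wx");
      break;
    } catch (err) {
      if (err.code !== "EEXIST") throw err;
      if (Date.now() > deadline) throw new Error(`timed out waiting for ${lockPath}`);
      await new Promise((r) => setTimeout(r, retryMs)); // someone else holds it; poll
    }
  }
  try {
    return await fn(); // critical section: re-read, maybe refresh, atomic write
  } finally {
    await handle.close();
    await fs.unlink(lockPath); // release so the next waiter can create the file
  }
}
```

To match the one-argument call shown earlier, bind the options once, for example with a wrapper that always passes the same lockPath next to your credential file.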

Order matters: lock → re-read → refresh only if still stale → atomic write → release. Drop the re-read and the lock just serializes the same bug. Drop the atomic write and you trade the network race for a file-corruption race.
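
For the atomic-write step, a minimal sketch of the temp file + rename pattern is below. The function names follow the calls used earlier, but the explicit tokenPath parameter is an assumption added here. It relies on rename replacing the target in a single step when both paths are in the same directory, which holds on POSIX filesystems, and it does not fsync, so durability across power loss is out of scope.

```javascript
const fs = require("node:fs/promises");
const path = require("node:path");

// Hypothetical helpers: tokenPath is passed explicitly for illustration.
async function writeTokenAtomic(tokenPath, creds) {
  // Temp file lives next to the target so rename never crosses filesystems.
  const tmp = path.join(
    path.dirname(tokenPath),
    `.${path.basename(tokenPath)}.${process.pid}.${Date.now()}.tmp`
  );
  try {
    await fs.writeFile(tmp, JSON.stringify(creds), { mode: 0o600 }); // owner-only
    await fs.rename(tmp, tokenPath); // readers see the old file or the new one, never half
  } catch (err) {
    await fs.rm(tmp, { force: true }); // clean up the temp file on failure
    throw err;
  }
}

async function readToken(tokenPath) {
  try {
    return JSON.parse(await fs.readFile(tokenPath, "utf8"));
  } catch (err) {
    if (err.code === "ENOENT") return null; // no credentials stored yet
    throw err;
  }
}
```

Writing in place instead would let a concurrent reader observe a truncated file and fail to parse it, which is the file-corruption race mentioned above.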

The other invalid_grant: rotation-merge

Independent of locking, how you persist the refresh response is its own source of invalid_grant. Providers disagree on what they return:

Rotating providers (Okta, Auth0, Microsoft, Salesforce) return a new refresh_token every refresh — save it, or your next refresh uses a revoked token.
Google returns a refresh_token only on the first authorization; refresh responses omit it. If you overwrite stored credentials with the response as-is, you erase the refresh token and force a full re-consent.

One rule handles both — rotation-merge: if the response carries a refresh_token, use it; if it doesn't, keep the previous one.

function mergeRotation(prev, res) {
  const merged = { ...res };
  if (!merged.refresh_token && prev?.refresh_token) {
    merged.refresh_token = prev.refresh_token;        // Google omitted it -> keep the old one
  }
  if (merged.expires_in && !merged.expires_at) {
    merged.expires_at = Date.now() + merged.expires_in * 1000;
  }
  return merged;
}

Naive overwrite silently works for rotating providers and silently breaks Google. Naive "always keep the old one" silently works for Google and silently breaks rotation. Merge is the only rule correct for both.
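
A quick way to convince yourself is to run the merge against both response shapes. mergeRotation is repeated from above so the snippet runs on its own; the token values are made up.

```javascript
function mergeRotation(prev, res) {
  const merged = { ...res };
  if (!merged.refresh_token && prev?.refresh_token) {
    merged.refresh_token = prev.refresh_token; // response omitted it: keep the old one
  }
  if (merged.expires_in && !merged.expires_at) {
    merged.expires_at = Date.now() + merged.expires_in * 1000;
  }
  return merged;
}

// Google-style response: no refresh_token in the refresh response.
const googleStyle = mergeRotation(
  { access_token: "A0", refresh_token: "G0" },
  { access_token: "A1", expires_in: 3600 }
);
console.log(googleStyle.refresh_token); // "G0" (previous token preserved)

// Rotating provider: the response carries a new refresh_token.
const rotating = mergeRotation(
  { access_token: "A0", refresh_token: "R0" },
  { access_token: "A1", refresh_token: "R1", expires_in: 3600 }
);
console.log(rotating.refresh_token); // "R1" (new token replaces the revoked one)
```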

One more: re-read before failing

Even with single-flight, a race can slip through across processes or at a deploy boundary. So make invalid_grant handling self-healing — before you surface it as "log in again," re-read the stored token once; a sibling may have just refreshed it. Recover silently if so; reserve the disruptive re-login for when the grant is genuinely gone (user revoked, password changed, idle-expired).
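
One way to sketch that self-healing step is a thin wrapper around the refresh call. Everything here is hypothetical: refreshWithRecovery, the injected readToken, refreshOnce, and isValid functions, and the needsReauth flag are names chosen for this example so it stays independent of how you store tokens.

```javascript
// Hypothetical wrapper: all three dependencies are injected by the caller.
//   readToken()    -> currently stored creds (or null)
//   refreshOnce(c) -> performs the refresh, returns new creds, throws on failure
//   isValid(c)     -> true if c holds an unexpired access token
async function refreshWithRecovery({ readToken, refreshOnce, isValid }) {
  const before = await readToken();
  try {
    return await refreshOnce(before);
  } catch (err) {
    if (err.code !== "invalid_grant") throw err; // network errors etc. pass through
    // Re-read ONCE: a sibling may have refreshed while our request was in flight.
    const after = await readToken();
    if (isValid(after)) return after; // recover silently with the sibling's token
    // Still stale: the grant is genuinely gone. Flag it for the caller.
    err.needsReauth = true;
    throw err;
  }
}
```

The caller then sends the user to re-login only when err.needsReauth is set, rather than on every invalid_grant.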

The checklist

In order of leverage (items 1–2, plus the rotation-merge in item 6, fix the single-process case, which is most reports; items 3–5 add the multi-process case; item 7 is the safety net for both):

[ ] Refresh proactively with a skew (30–60s before expiry) so callers don't all hit the cliff at once.
[ ] Single-flight in-process — one shared in-flight Promise; everyone awaits it; cleared in finally.
[ ] If a credential is shared across processes, take an exclusive lock (lock file / O_EXCL).
[ ] Re-read after acquiring the lock and short-circuit if a sibling already rotated.
[ ] Persist atomically — temp file + rename, mode 0600; never write the token file in place.
[ ] Rotation-merge on persist; keep the previous refresh_token when the response omits one.
[ ] Re-read before failing on invalid_grant; only re-auth when the grant is genuinely gone.

If you'd rather not re-derive it

The patterns above are small and the code is complete enough to copy. That's deliberate: this is a build-it-yourself-friendly post. If you'd rather pull in a primitive, refresh-guard is a small, MIT-licensed, zero-dependency library that packages in-process single-flight, correct rotation-merge, and atomic file persistence as one installable thing, with a typed provider-quirks table for the gotchas above.

import { createTokenManager, fileStore } from "refresh-guard";

const tokens = createTokenManager({
  provider: "google",                                // optional: picks a quirks profile
  store: fileStore("~/.myapp/creds.json"),           // atomic temp-file + rename persistence
  refresh: async (prev) => {
    const r = await fetch(TOKEN_URL, { method: "POST", body: form(prev.refresh_token) });
    return await r.json();                           // { access_token, expires_in, refresh_token? }
  }
});

// Call from anywhere, as often as you like — exactly ONE refresh happens:
const accessToken = await tokens.getValidToken();

Honest scope: it solves the in-process case (single-flight) plus rotation-merge and atomic persistence. It does not ship a cross-process lock — if you share one credential across processes, you still layer the lock-file pattern from Layer 2 around it. (Disclosure: I maintain it, and I wrote the vendor-neutral guide it's based on. The patterns work with any OAuth client, or none.)

Full guide with the complete cross-process lock implementation, the provider quirks table, and an FAQ: https://refresh-guard-guide.pages.dev/

Takeaway

invalid_grant under load almost never means "the user is logged out." It means two requests refreshed the same single-use token at once. Make exactly one refresh happen — single-flight inside a process, a re-read-after-lock across processes — merge rotation correctly, and re-read before you ever force a re-login. That's the whole fix.
