How Two Silent Bugs Locked Every User Out of Production

#devops #cloudcomputing #firebase #gcp

A Cloud Run + Firebase App Check CORS postmortem

Users couldn't log in.

No crash. No server error. No deployment alarm. Just a wave of CORS failures quietly rolling through production while everything on our end looked completely normal.

This is the story of how two unrelated bugs combined to cause a login outage — and how we found them.

The Symptom

Browser consoles were showing CORS errors on every API request. The classic red line:

Access to fetch at 'https://api.example.com/...' from origin 'https://app.example.com'
has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present.

First instinct: something changed in the CORS config. But nothing had changed. The config was identical to the day before, when everything worked fine.

So we started digging deeper.

Root Cause #1 — Cloud Build Was Silently Stripping IAM Permissions

Our production deployments run through Google Cloud Build, which deploys to Cloud Run.

Inside cloudbuild-prod.yaml, we had this flag on the deploy step:

'--no-allow-unauthenticated'

This flag tells Cloud Run: "Don't allow unauthenticated invocations." On the surface, that sounds like a sensible security setting. But here's the problem nobody warned us about.

Every time a build failed mid-deploy, this flag actively removed the roles/run.invoker IAM binding from allUsers.

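That binding is what allows the public internet to call the service. Without it, Cloud Run rejects every request at the infrastructure level with an authorization error (typically a 403), before your application code even runs. That includes requests from users who are logged in to the app, because an app login is not a Cloud Run IAM identity. The rejection never passes through Express, so it carries no Access-Control-Allow-Origin header, and the browser reports it as a CORS failure.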

So the sequence of events was:

1. A deploy fails partway through (a fairly normal occurrence).
2. --no-allow-unauthenticated removes the public invoker binding as part of cleanup.
3. The previous working version of the service is still running, but it is no longer publicly callable.
4. Users get blocked. The app looks fine on our end. CORS errors show up on theirs.

The fix: Remove --no-allow-unauthenticated from the build config. Manage IAM invoker permissions separately and explicitly, outside of the deploy step, so they can never be accidentally stripped.
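One way to keep that binding explicit is a post-deploy guard that reads the service's IAM policy and fails loudly if the public binding is gone. Here is a minimal sketch, assuming the policy was fetched with gcloud run services get-iam-policy SERVICE --region REGION --format=json and parsed into an object. SERVICE and REGION are placeholders, and hasPublicInvoker is a hypothetical helper name, not something from our actual pipeline.

```javascript
// Sketch only: checks a parsed Cloud Run IAM policy for the public invoker binding.
// Assumes the standard IAM policy shape: { bindings: [{ role, members: [...] }] }.
const PUBLIC_MEMBER = 'allUsers';
const INVOKER_ROLE = 'roles/run.invoker';

// Hypothetical helper: true if allUsers holds roles/run.invoker in this policy.
function hasPublicInvoker(policy) {
  const bindings = (policy && policy.bindings) || [];
  return bindings.some(
    (binding) =>
      binding.role === INVOKER_ROLE &&
      (binding.members || []).includes(PUBLIC_MEMBER)
  );
}

// Illustrative policies (made-up data, not real output from our project).
const healthy = {
  bindings: [{ role: 'roles/run.invoker', members: ['allUsers'] }],
};
const stripped = { etag: 'ACAB' }; // a policy with no bindings at all

console.log(hasPublicInvoker(healthy));  // true
console.log(hasPublicInvoker(stripped)); // false
```

If a check like this fails, the binding can be restored with gcloud run services add-iam-policy-binding SERVICE --region REGION --member=allUsers --role=roles/run.invoker, run as its own step rather than as a side effect of the deploy.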

Root Cause #2 — Middleware Order Was Killing OPTIONS Requests

Once we fixed the IAM issue, we found a second problem lurking underneath.

Our Express backend uses Firebase App Check to verify that requests come from a legitimate client app. The middleware setup looked roughly like this:

app.use(appCheckMiddleware);  // ← was first
app.use(cors(corsOptions));

Here's why that order is fatal for browser clients.

Before a browser sends a cross-origin request that isn't a "simple" one (a PUT, a POST with a JSON body, or anything carrying a custom header such as an App Check token), it first sends a preflight request: an HTTP OPTIONS call that asks the server, "Are you okay with this?" Only if the server responds correctly does the browser send the real request.

App Check middleware was intercepting those OPTIONS preflight requests and rejecting them — because preflight requests don't carry an App Check token. They're sent automatically by the browser; they have no body, no auth header, and no App Check token.

So CORS headers were never set on the preflight response. The browser never got permission to proceed. Every cross-origin request was dead on arrival.

The fix: Move CORS middleware before App Check:

app.use(cors(corsOptions));       // ← CORS first, always
app.use(appCheckMiddleware);      // ← App Check after
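To see why the order matters, here is a toy model of a middleware chain. It is a simplified stand-in, not Express, the cors package, or the Firebase SDK: handle, fakeCors, and fakeAppCheck are hypothetical names, and the header name x-firebase-appcheck is assumed to be where the client sends its token.

```javascript
// Toy middleware runner: calls each middleware in order until one stops calling next().
function handle(middlewares, req) {
  const res = { status: null, headers: {} };
  let index = 0;
  const next = () => {
    const middleware = middlewares[index++];
    if (middleware) middleware(req, res, next);
  };
  next();
  return res;
}

// Stand-in for cors(): sets the allow-origin header and answers preflights itself.
function fakeCors(req, res, next) {
  res.headers['access-control-allow-origin'] = req.headers.origin;
  if (req.method === 'OPTIONS') {
    res.status = 204; // preflight answered here, chain stops
    return;
  }
  next();
}

// Stand-in for App Check: rejects any request that has no token header.
function fakeAppCheck(req, res, next) {
  if (!req.headers['x-firebase-appcheck']) {
    res.status = 401; // rejected, chain stops before anything else runs
    return;
  }
  next();
}

// A preflight as the browser sends it: no body, no auth, no App Check token.
const preflight = {
  method: 'OPTIONS',
  headers: { origin: 'https://app.example.com' },
};

const broken = handle([fakeAppCheck, fakeCors], preflight);
const fixed = handle([fakeCors, fakeAppCheck], preflight);

console.log(broken.status, Object.keys(broken.headers)); // 401 and no headers
console.log(fixed.status, Object.keys(fixed.headers));   // 204 and the allow-origin header
```

In the broken order the 401 goes out before the CORS stand-in ever runs, which is exactly the response a browser reports as a CORS failure.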

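Additionally, App Check middleware should explicitly pass through OPTIONS requests. With cors() registered first this is mostly defensive, since the cors package answers preflights itself by default, but it keeps App Check from breaking preflights if the order ever regresses: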

function appCheckMiddleware(req, res, next) {
  if (req.method === 'OPTIONS') return next(); // let preflight through
  // ... rest of App Check verification
}
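For a fuller picture, here is one way the whole middleware could look. This is a sketch under assumptions, not our production code: the verifier is injected so the example runs without Firebase, createAppCheckMiddleware is a hypothetical name, and the token is assumed to arrive in the X-Firebase-AppCheck header (lowercased by Node). In a real app the verifier would likely wrap getAppCheck().verifyToken(token) from firebase-admin/app-check.

```javascript
// Sketch: App Check middleware with the token verifier passed in.
// verifyToken is assumed to return a promise that rejects for invalid tokens.
function createAppCheckMiddleware(verifyToken) {
  return async function appCheckMiddleware(req, res, next) {
    // Preflights never carry a token, so let them through untouched.
    if (req.method === 'OPTIONS') return next();

    const token = req.headers['x-firebase-appcheck'];
    if (!token) {
      return res.status(401).json({ error: 'Missing App Check token' });
    }

    try {
      await verifyToken(token);
      return next();
    } catch (err) {
      return res.status(401).json({ error: 'Invalid App Check token' });
    }
  };
}

// Possible wiring (assumes express, cors, and firebase-admin are set up elsewhere):
//   app.use(cors(corsOptions));
//   app.use(createAppCheckMiddleware((token) => getAppCheck().verifyToken(token)));
```

Injecting the verifier also makes the OPTIONS pass-through easy to unit test, which is one way to stop this bug from quietly coming back.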

Why These Two Bugs Were So Hard to See Together

Each bug on its own would have been straightforward to diagnose. Together, they masked each other in a specific way:

The IAM issue meant some requests were failing at the infrastructure level — before reaching the app
The middleware issue meant others were failing inside the app — but without clear error messages
Both produced the same visible symptom: CORS errors in the browser

When you see CORS errors, your brain immediately goes to "something changed in CORS config." But CORS errors are often a symptom, not the cause. The real cause can be anywhere upstream.

Lessons

1. Middleware order is not cosmetic. It's logic.
The sequence of your middleware defines your request pipeline. A wrong order can silently break entire categories of requests. Document it. Review it deliberately.

2. Always audit what your CI/CD flags are doing to IAM.
Deployment flags that touch IAM bindings should be treated as infrastructure changes, not just deploy options. Understand exactly what each flag does on failure, not just on success.

3. When CORS fails in production, trace the OPTIONS request first.
Open DevTools, find the preflight OPTIONS request, and look at its response. If it's getting a 401, a 403, or no CORS headers at all, the problem is upstream of your CORS config.
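That triage rule is simple enough to write down as code. The sketch below is only an illustration of the rule, with a hypothetical function name and made-up result strings. It assumes you copy the preflight's status and response headers (lowercased) out of the Network tab, and it ignores finer points such as credentialed requests.

```javascript
// Sketch: classify a preflight response using the rule of thumb above.
// status: HTTP status of the OPTIONS response
// headers: response headers with lowercase names
// origin: the page's origin, e.g. 'https://app.example.com'
function triagePreflight(status, headers, origin) {
  const allowOrigin = headers['access-control-allow-origin'];

  // Rejected or missing CORS headers: something ran (or refused) before CORS did.
  if (status === 401 || status === 403 || !allowOrigin) {
    return 'upstream of CORS: check IAM and auth middleware';
  }

  // CORS ran, but it did not allow this origin.
  if (allowOrigin !== origin && allowOrigin !== '*') {
    return 'CORS config: origin not allowed';
  }

  return 'preflight ok: inspect the actual request';
}

const origin = 'https://app.example.com';

console.log(triagePreflight(403, {}, origin));
// upstream of CORS: check IAM and auth middleware

console.log(triagePreflight(204, { 'access-control-allow-origin': 'https://other.example.com' }, origin));
// CORS config: origin not allowed

console.log(triagePreflight(204, { 'access-control-allow-origin': origin }, origin));
// preflight ok: inspect the actual request
```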

4. Two unrelated bugs can look like one.
This outage had two root causes with no relationship to each other. Fixing one revealed the other. Don't stop investigating after you find the first issue.

The Takeaway

The scariest production incidents aren't the dramatic ones with stack traces and alarms. They're the quiet ones where everything looks fine on your end — and users are silently blocked on theirs.

CORS errors in particular are notorious for hiding the real problem. Treat them as a starting point, not an answer.

Have you been hit by a similar issue? Drop it in the comments — I'd love to hear how others have navigated CORS and middleware ordering in production.