We moved our production Next.js 16 app from Vercel to Google Cloud Run last week. The whole thing took about ten focused hours from "let's plan this" to "production traffic on GCP." Some parts went well. A few caught us off guard. One bug only showed up on the second deploy and would bite anyone with the same setup.
This is the honest story. If you're thinking about the same move, hopefully you skip the parts I tripped over.
Why we were on Vercel
When the project was three weeks old, I needed a deploy target I could ignore. vercel --prod worked. Custom domains worked. SSL worked. Preview branches just appeared. I had a Next.js 14 app, no time for infra, and a list of customer features that mattered more than where the bits ran.
We sat on Vercel Pro for almost a year. It earned its keep. Everything I complain about below is the predictable cost of an early-stage setup meeting real product needs.
Why we moved
A few unrelated things lined up at the same time.
Most of our backend was already on GCP. Our heavy work runs in background workers that pull from pgmq, a Postgres queue extension. Those workers needed more compute than Vercel functions allowed, so they went on Cloud Run from day one. The web app stayed on Vercel. We had a seam in the architecture nobody wanted to admit was a seam: "the system runs on GCP, except the part the user actually sees."
We also got into the GCP for Startups program. That's about a year of credits sitting in the account, which changes the cost math. The web app was our biggest line item on Vercel and would be one of the smaller ones on GCP.
Most of our users are in India. Cloudflare proxies everything anyway, so end-user TTFB is dominated by Cloudflare's edge in Bombay. But origin-pull latency matters for routes the CDN can't cache — dashboards, server actions, an SSE stream that holds connections open for minutes. Vercel was serving us from bom1 too, but with extra hops through their edge middleware. After cutover I measured ~45ms median TTFB through the new stack against ~170ms before, from the same client.
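If you want to sanity-check the same comparison on your own stack, curl's timing variables are enough. A minimal sketch; the hostname and path are placeholders, and time_starttransfer is reported in seconds:

# Rough median TTFB over 20 requests against an uncached route.
for _ in $(seq 1 20); do
  curl -o /dev/null -s -w '%{time_starttransfer}\n' https://app.example.com/dashboard
done | sort -n | sed -n '10p'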
The last reason is the one that quietly drove me crazy. Workers had alerts in Cloud Monitoring, secrets in Secret Manager, IAM in Terraform. Web had Vercel's UI, Vercel's env var system, Vercel's auth model. Two halves of one product, two mental models for every change.
None of those reasons by itself was enough. Together they were.
The plan, before it met reality
Six phases, each meant to be revertible on its own.
Audit first. Inventory every env var, every domain, every webhook. Compare to what the new Terraform would create. Then build infra: Cloud Run service, Global HTTPS load balancer, Cloud Armor in preview mode, Cloud CDN, Artifact Registry, Workload Identity Federation for GitHub Actions, monitoring alerts. Apply with a placeholder image so the resources exist before any real build pushes one. Then CI/CD: a workflow that builds the Docker image, pushes to Artifact Registry, canary-deploys to Cloud Run with no traffic, waits for the revision to go Ready, then shifts traffic. Then a 3–5 day soak running the new stack in parallel with Vercel, smoke-testing through the LB IP with curl --resolve so no DNS changes yet. Then the actual DNS flip. Then teardown.
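For what it's worth, the curl --resolve smoke test mentioned above is a one-liner. A sketch with placeholder hostname, path, and $LB_IP:

# Hit the production hostname but pin the connection to the LB's IP, skipping DNS.
# -k because the origin cert isn't in curl's trust store.
curl -sk --resolve "app.example.com:443:${LB_IP}" \
  -o /dev/null -w '%{http_code}\n' https://app.example.com/api/health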
I ended up doing the soak in about fifteen minutes instead of five days, but I'll get to that.
The Sensitive env var problem
This is the part I want most people to know about, and the part I want to be careful not to give bad advice on.
Vercel has two main classes of environment variable for secrets: Encrypted and Sensitive. Both are encrypted at rest; the difference is how readable they are after that. Encrypted vars decrypt back to plaintext on Vercel's systems for things like the dashboard, the CLI (vercel env pull), and read APIs. Sensitive vars are protected so the plaintext is only available to your running build and runtime. You can't read them back through the dashboard, the CLI, or any API. The dashboard just shows an empty box, and vercel env ls lists them by name only.
This protection is not theoretical. Vercel had a security incident in April 2026 where an attacker pivoted through a third-party AI tool into a Vercel employee's Google Workspace, and from there into Vercel internal systems. The attacker enumerated and decrypted non-sensitive environment variables for a subset of customers. Sensitive variables were not exposed. Vercel's advice after the incident was to rotate any non-sensitive secrets and move secret material to the Sensitive class going forward.
So when twenty-two of our secrets were Sensitive — payment provider keys, webhook signing secrets, a database encryption key, an LLM provider key, our database service role, Sentry tokens — that wasn't a mistake. That was security working correctly. The "you can't read them back" property is the whole point. If those values could be pulled out from outside the running deployment, an attacker with the right access could pull them too.
The cost of that property is that when you leave the platform, those values are unrecoverable from Vercel. The answer is not to weaken the security class. The answer is to rotate.
So we rotated every Sensitive secret as part of the move. The flow per secret was the same. Generate a new value. Put it in Secret Manager. Update the upstream provider's dashboard to issue or accept the new key. Switch the consuming code in the new infra. Watch traffic confirm the new key is being used, then revoke the old. The old Vercel values stay locked away forever. Which is fine. They were never meant to come back out.
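Mechanically, the GCP side of each rotation is small. A sketch assuming Secret Manager and Cloud Run; every name here is a placeholder:

# 1. Store the new value as a new secret version (self-issued secrets are generated here;
#    provider-issued keys come from the provider's dashboard instead).
printf '%s' "$NEW_VALUE" | gcloud secrets versions add payment-webhook-secret --data-file=-

# 2. Point the Cloud Run service's env var at the latest version.
gcloud run services update "$SERVICE" --region=asia-south1 \
  --update-secrets=PAYMENT_WEBHOOK_SECRET=payment-webhook-secret:latest

# 3. Confirm new-key traffic upstream, then revoke the old key in the provider's dashboard.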
The real work was the choreography. Payment processors usually let you have multiple active API keys, so you add the new one, watch usage shift, revoke the old. Webhook signing secrets are trickier when the receiver only accepts one — most providers let you accept two during a rotation window. OAuth client secrets often allow only one active value, so you eat a short window of failed callbacks or spin up a second OAuth client. Database credentials are easiest if you just create a new DB user for the new infra instead of rotating the existing one.
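The dual-secret window is the one piece that touches application code. A provider-agnostic sketch of the idea, assuming a plain hex HMAC-SHA256 over the raw body; real providers add prefixes, timestamps, or base64 encoding, so adapt to yours:

verify() {
  # Compare the received signature against an HMAC computed with one candidate secret.
  local secret="$1" body_file="$2" received_sig="$3"
  local computed
  computed=$(openssl dgst -sha256 -hmac "$secret" < "$body_file" | awk '{print $NF}')
  [ "$computed" = "$received_sig" ]
}

# During the rotation window, accept the webhook if either secret verifies it.
verify "$NEW_SECRET" body.json "$SIG" || verify "$OLD_SECRET" body.json "$SIG"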
If you're on Vercel today, keep using Sensitive for actual secrets. Rotation is the right path out, and being on Sensitive protects you from the class of incident Vercel disclosed in April.
The Cloudflare origin cert trick
Google-managed SSL certs on a GCP load balancer don't go ACTIVE until at least one of their SAN domains has a DNS record pointing at the LB. This is how Google validates ownership. It also means there's a chicken-and-egg moment at cutover: until DNS flips, the cert is stuck in PROVISIONING.
A few minutes of a not-yet-ready cert is usually fine. We had two reasons to avoid it. Cloudflare in front of the LB connects to origin in Full (strict) mode, which requires a valid origin cert at the TLS handshake. If the origin cert is provisioning at the same moment DNS flips, Cloudflare can fail open or fail closed depending on settings — neither of which I wanted to discover in production. And the managed cert provisioning timeline runs anywhere from 15 to 60 minutes in practice. That's too unpredictable.
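You can watch that provisioning state directly while you wait; the cert name here is a placeholder:

# managed.status goes PROVISIONING → ACTIVE; domainStatus shows per-domain progress.
gcloud compute ssl-certificates describe web-managed-cert --global --format='yaml(managed)'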
The fix is a Cloudflare Origin Certificate. Cloudflare issues you a 15-year cert signed by their internal CA. Their edge fully trusts it. No browser does, but that's fine — browser users only ever talk to Cloudflare's public edge cert. You upload the origin cert to the LB as a self-managed certificate, and the Cloudflare-to-LB hop works the moment you flip DNS.
I bound both certs on the HTTPS proxy. Cloudflare origin cert primary, Google managed cert as fallback. The Google cert eventually provisioned to ACTIVE about 25 minutes after cutover, by which point it didn't matter. Defensive overkill, maybe. The whole worry goes away if you're comfortable with a 30-minute "this might 5xx" window. I wasn't.
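Ours is in Terraform, but in gcloud terms the origin-cert side looks roughly like this; the cert and proxy names are placeholders:

# Upload the Cloudflare origin cert + key as a self-managed LB certificate.
gcloud compute ssl-certificates create cf-origin-cert \
  --certificate=origin.pem --private-key=origin.key --global

# Attach both certs to the HTTPS proxy; the first in the list is the primary.
gcloud compute target-https-proxies update web-https-proxy --global \
  --ssl-certificates=cf-origin-cert,web-managed-cert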
Two bugs that hide until they don't
Both of these are silent failures. Both make your old revision keep serving while everything looks fine in the browser. Both come from things that are easy to get wrong if you copy-paste your deploy script.
The first one is the wait-for-Ready check. Our workflow does this:
gcloud run services update "$SERVICE" \
  --image "$IMAGE" --no-traffic --tag="sha-${SHORT_SHA}"

# "$LATEST" holds the name of the newly created revision.
for i in $(seq 1 60); do
  STATUS=$(gcloud run revisions describe "$LATEST" \
    --format='value(status.conditions[?type=Ready].status)')
  if [ "$STATUS" = "True" ]; then break; fi
  sleep 5
done
The [?type=Ready] is JMESPath filter syntax. It works in the aws CLI. It does not work in gcloud --format=value(...). gcloud silently returns an empty string. The poll times out at five minutes. The deploy step fails. Traffic-shift never runs. The new revision sits Ready on 0% traffic and the old one keeps serving. The fix is to poll the service-level field instead:
READY=$(gcloud run services describe "$SERVICE" \
  --format='value(status.latestReadyRevisionName)')
if [ "$READY" = "$LATEST" ]; then break; fi
That's the gcloud-native pattern. The JMESPath thing got in there because the workflow was adapted from an old AWS deploy script. If your workflow has any inherited syntax, do one pass on it before you cut over.
The second bug only shows up on the second deploy. Our canary flow tags the currently-serving image as :previous-good before shifting traffic, so rollback is one workflow-run away. The command is gcloud artifacts docker tags add against the :previous-good tag. On the first ever deploy, that tag doesn't exist yet, so the command just creates it and everyone is happy. On every deploy after that, the tag already exists pointing at a different image, so the command has to delete the old binding before creating the new one. That delete needs artifactregistry.tags.delete. Our deployer service account had roles/artifactregistry.writer, which covers create but not delete.
Result: the second deploy passed every step except the tag step. Same failure shape as the JMESPath bug — new revision Ready on 0% traffic, old revision still serving. It took me fifteen minutes to figure out because the symptom looked identical to the first bug. The fix was one Terraform line: deployer SA gets roles/artifactregistry.repoAdmin on each repo instead of writer. Same scope, slightly more permission, covers both operations.
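Concretely, the re-tag step and the fix look roughly like the following. We grant the role in Terraform; the gcloud equivalent is shown here, and every name is a placeholder:

# The canary step that moves :previous-good onto the currently-serving digest.
# Moving an existing tag needs artifactregistry.tags.delete, not just create.
gcloud artifacts docker tags add \
  "asia-south1-docker.pkg.dev/${PROJECT}/web/app:${SERVING_SHA}" \
  "asia-south1-docker.pkg.dev/${PROJECT}/web/app:previous-good"

# Give the deployer SA repoAdmin on the repo so both create and delete work.
gcloud artifacts repositories add-iam-policy-binding web \
  --location=asia-south1 \
  --member="serviceAccount:${DEPLOYER_SA}" \
  --role="roles/artifactregistry.repoAdmin"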
If your team has a canary deploy that re-tags an image to mark rollback targets, audit that flow today. Both these bugs are invisible until they aren't.
The actual cutover
I'd planned for three to five days of soak. It ended up being about fifteen minutes.
The smoke test was simple. I edited /etc/hosts to point the production hostname at the GCP LB IP. Then I accepted the cert warning in the browser and walked through the things that matter: OAuth callback, the SSE streaming route, server actions, payments checkout. Everything worked. Cloudflare's orange-cloud proxy means a DNS flip-back is under sixty seconds at any point. Both stacks share the same Supabase backend, so there's no data divergence to worry about either.
The disciplined version of risk management would have been five days of synthetic monitoring before flipping. The pragmatic version was a checklist, a smoke test, and the knowledge that rollback was instant.
I flipped DNS at 23:00 IST. Cloudflare picked up the new origin in under thirty seconds. Cloud Run request volume climbed from zero to real traffic inside the next minute. Twelve hours later, the only anomalies in the logs were the same bot scanners that had been hitting xmlrpc.php and /wp-admin on the old stack.
The defense-in-depth thing I got wrong
I'd been carrying a task that said "Cloud Armor must flip from preview to enforce by day 7." The original plan was Cloud Armor's OWASP rules in front of Cloud Run, plus Cloudflare's free DDoS layer, plus our app-level checks.
Once we actually got the new stack up, I realized we had Cloudflare Enterprise on the zone, which includes Cloudflare's Managed Ruleset, Exposed Credentials Check, and OWASP Core Ruleset. After deploying those, Cloud Armor became the second WAF in the chain, running the same OWASP signatures as Cloudflare but later in the path. Two WAFs blocking the same traffic just doubles the false-positive triage surface.
Cloudflare WAF is now the primary blocker. Cloud Armor stays in preview mode permanently — it logs everything Cloudflare let through, which is useful forensic data, but it doesn't block. The day-7 enforce deadline is gone. If something gets past Cloudflare in an attack, Cloud Armor's logs show the pattern and I can write a tuned Cloudflare rule. The decision lives in the Terraform with a comment explaining why preview is permanent, so whoever joins next doesn't think it's a TODO.
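The preview behavior is a per-rule flag on the Cloud Armor policy. Ours lives in Terraform; the gcloud shape of the same idea, with the policy name and ruleset chosen purely for illustration:

# Evaluate and log the preconfigured WAF rule without enforcing it.
gcloud compute security-policies rules update 1000 \
  --security-policy=web-armor-policy \
  --expression="evaluatePreconfiguredWaf('sqli-v33-stable')" \
  --action=deny-403 \
  --preview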
The Vercel ending
We didn't delete the Vercel project. I downgraded to Hobby and kept the auto-deploy from main running. The project rebuilds on every push. The deployment sits there warm. No production traffic touches it. It stays available as a warm-standby in case GCP has a bad day.
If we ever need to roll back at the infrastructure level, the failover is a Cloudflare DNS A record flip. Sixty seconds. Same backend. That posture would have horrified Vercel-fan-me from a year ago. Pragmatic-me thinks it's a useful redundancy at zero cost.
What I'd do differently
A few things, in order of how much I want to take them back. Plan secret rotation into the migration timeline from day one. Sensitive-class env vars are the correct security posture, and the way they protect values is the point. Bake the provider-dashboard coordination work into the schedule from the start. Audit your deploy workflow for inherited JMESPath syntax before your first cutover, not after. Grant artifactregistry.repoAdmin to the deployer service account on day one, or pick a tagging strategy that doesn't need delete. Use a Cloudflare Origin Certificate even when you also have a managed cert — it's free, it lasts fifteen years, and it removes a class of cutover failure. If your testing is good, skip the soak. Five days of synthetic monitoring does not catch what one careful smoke test catches. And don't run two WAFs in enforce mode at the same time unless someone on the team has time to triage false positives twice.
The shape now
Internet
   ↓
Cloudflare (orange-cloud)
   ├─ WAF Managed Ruleset (Block)
   ├─ Exposed Credentials Check (Block)
   ├─ OWASP Core Ruleset (Log → Block after soak)
   └─ Rate Limiting + DDoS L3/L4/L7
   ↓
GCP Global HTTPS LB (Cloudflare Origin Cert)
   ↓
Cloud Armor (preview, forensic logging only)
   ↓
Cloud CDN (USE_ORIGIN_HEADERS cache mode)
   ↓
Cloud Run v2 (asia-south1, ingress = INTERNAL_LB)
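The INTERNAL_LB shorthand on the last line is Cloud Run's ingress setting, which keeps traffic off the default run.app URL and forces it through the load balancer. In gcloud terms, roughly:

# Only accept traffic arriving through the load balancer, not the public run.app URL.
gcloud run services update "$SERVICE" --region=asia-south1 \
  --ingress=internal-and-cloud-load-balancing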
More moving parts than Vercel. Operational overhead is roughly the same once it's running, because everything is in Terraform and the deploy is a git push. Cost is covered by startup credits for the next year. Without those credits, this stack is more expensive than Vercel Pro at our current always-on worker footprint — be honest with yourself about that math before you commit.
If you read this far and you're sitting on the same decision, I'm happy to talk through your specifics. The secret-rotation choreography and the canary-deploy IAM gotchas are where people get stuck.