I've been monitoring my own SaaS in production for the last two months, and I've watched the same bug pattern hit indie projects over and over:
The app is on fire. Customers are seeing 500s. Stripe webhooks are silently failing. And yet GET /api/health is cheerfully returning 200 OK, every minute, like nothing's wrong.
The reason is almost always the same: the health check is testing the wrong thing.
This post is about what a health check should actually do, the three failure modes that catch people, and a working Next.js 13+ implementation you can paste in.
The "useless 200" pattern
Here's the health check I see most often in indie Next.js codebases:
```ts
// src/app/api/health/route.ts
export async function GET() {
  return Response.json({ ok: true });
}
```
This endpoint can only fail in one way: the Next.js process itself is dead. If that happens, your hosting platform was already going to know — Vercel/Render/Fly notice the process crashed before your monitor does.
What this endpoint cannot tell you:
- Did someone rename `DATABASE_URL` to `DATABASE_POOL_URL` in env vars and forget to update the code?
- Did your Supabase service-role token expire?
- Did the connection pool max out and start refusing connections?
- Did a middleware change start returning 308 redirects to `/login` for everything?
- Is your background queue stuck?
- Is the Stripe webhook handler returning 200 but silently swallowing events?
Every one of those bugs has hit a real production app I know of in the last 90 days. None of them were caught by a return { ok: true } health check. All of them were eventually caught by customer complaints — the worst possible monitor.
The three layers of "healthy"
Before showing the fix, the mental model that makes this easier:
Layer 1: shallow. "Is the function reachable?" This is the useless 200. It tells you the runtime is up, nothing more.
Layer 2: middle. "Are my critical dependencies reachable from this function right now?" Database. Auth provider. Cache. The cheapest possible roundtrip that actually exercises auth and the connection pool.
Layer 3: deep. "Is the entire system functioning?" Background workers running. Cron jobs not stuck. Queue not backed up. This is expensive and runs less often.
Most indie projects only need Layer 2. Layer 1 is what you have today and it doesn't help. Layer 3 is what big shops do; you don't need it yet.
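For the curious, a Layer 3 check might look something like this sketch. It assumes a jobs table with status and created_at columns (names are purely illustrative) and the same service-role Supabase client helper used in the handler below:

```ts
// Layer 3 sketch: "is the queue actually draining?" A pending job older than
// ten minutes is a decent proxy for "the worker is stuck".
import { createSupabaseServiceRole } from "@/lib/supabase/server";

export async function queueIsDraining(): Promise<boolean> {
  const supabase = createSupabaseServiceRole();
  const { data, error } = await supabase
    .from("jobs")
    .select("created_at")
    .eq("status", "pending")
    .order("created_at", { ascending: true })
    .limit(1);

  if (error) throw error;
  if (!data || data.length === 0) return true; // empty queue is a healthy queue

  const ageMs = Date.now() - new Date(data[0].created_at).getTime();
  return ageMs < 10 * 60 * 1000;
}
```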
The rest of this post is about doing Layer 2 correctly in Next.js.
The fix: a real Next.js 13+ health endpoint
Here's the Route Handler I run on my own SaaS. It uses Supabase but the pattern is the same for any DB:
```ts
// src/app/api/health/route.ts
import { NextResponse } from "next/server";
import { createSupabaseServiceRole } from "@/lib/supabase/server";

export const runtime = "nodejs";
export const dynamic = "force-dynamic";

export async function GET() {
  try {
    const supabase = createSupabaseServiceRole();

    // Cheapest possible call that exercises the connection pool + auth.
    // head: true returns no rows — microseconds, no payload.
    const { error } = await supabase
      .from("profiles")
      .select("id", { count: "exact", head: true })
      .limit(1);

    if (error) throw error;

    return NextResponse.json(
      { ok: true, ts: Date.now() },
      { headers: { "Cache-Control": "no-store" } }
    );
  } catch (e) {
    return NextResponse.json(
      { ok: false, error: (e as Error).message },
      { status: 503, headers: { "Cache-Control": "no-store" } }
    );
  }
}
```
There are five non-obvious decisions in those 25 lines. Each one is a bug I've personally watched bite somebody.
1. runtime = "nodejs", not edge
Health checks should hit the same runtime your real traffic hits. If your app runs on the Node.js runtime (most indie SaaS), your health check should too. Otherwise you're testing a runtime your customers never use.
2. dynamic = "force-dynamic"
Without this, Next.js or your CDN can serve a cached 200 even after your DB is down. The cache happily reports "healthy" while every customer request is failing. I've seen this exact bug in production. Hard to debug because the health check looks fine.
3. Cache-Control: no-store on every response
Same reason. Belt and suspenders. CDNs respect no-store even if Next.js gets it wrong.
4. A real DB roundtrip — not SELECT 1
SELECT 1 works for raw Postgres, but it's a half-measure. You want a query that exercises:
- Connection pool acquisition (catches "pool exhausted")
- Auth (catches "service role token expired")
- A real table the app uses (catches "ran migrations on the wrong DB")
The head: true count query above does all three. It costs microseconds and transfers no rows. Use the cheapest possible real query, not the cheapest possible fake query.
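If you're on raw Postgres via node-postgres rather than Supabase, a rough equivalent looks like this. The profiles table name is an assumption; swap in whatever table your app actually reads:

```ts
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// LIMIT 0 still has to acquire a pooled connection, authenticate, and resolve
// a real table, so it catches the same three failures with zero rows returned.
export async function checkDb(): Promise<void> {
  await pool.query("SELECT id FROM profiles LIMIT 0");
}
```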
5. 503 on failure — not 200 with { ok: false }
This is the one people get most wrong. Most upstream monitors — Kubernetes liveness probes, GCP/AWS health checks, external uptime tools — trigger on HTTP status, not body content. If you return 200 { ok: false }, your monitor sees a successful response and your platform never takes the bad pod out of rotation.
503 Service Unavailable is the right status for "I'm running but I can't serve traffic." Use it.
"But I want my health check at /healthz"
The k8s convention is /healthz. Easy with App Router rewrites in next.config.js:
```js
module.exports = {
  rewrites: async () => [
    { source: "/healthz", destination: "/api/health" },
  ],
};
```
Now both URLs work and your liveness probe stays idiomatic.
Monitoring it from outside
Here's the part that costs people the most, because they think a health check is enough by itself.
It isn't. Your platform's liveness probe (Vercel internal, Kubernetes, etc.) checks the pod. It does not check:
- DNS — did your registrar accidentally let the domain expire?
- TLS — did your certificate auto-renewal silently fail?
- CDN edge — is Cloudflare serving stale 502s while origin is fine?
- The path between user and pod — region outage, BGP drama
- Third-party degradation — your code is fine, but Stripe/OpenAI is throwing 500s
The only way to catch these is to hit your public URL from outside your infra, on a schedule, from multiple regions.
You can roll your own with cron-job.org + a Slack webhook in 30 minutes, or use any external uptime monitor. I built SitePulse for this exact reason (the war story below is what kicked it off), but the stack doesn't matter. The point is: don't rely on your own infra to tell you your own infra is broken.
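If you want the roll-your-own version, here's a minimal sketch you could run on any schedule from outside your infra (GitHub Actions cron, a spare VPS, whatever). It assumes Node 18+ for global fetch, plus HEALTH_URL and SLACK_WEBHOOK_URL env vars, both names made up for this example:

```ts
// external-check.ts — run this from somewhere that is NOT your own hosting.
const url = process.env.HEALTH_URL ?? "https://example.com/api/health";

async function main() {
  let failure: string | null = null;
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(10_000) });
    if (!res.ok) failure = `HTTP ${res.status}`;
  } catch (e) {
    // DNS, TLS, timeouts, and connection refused all land here.
    failure = (e as Error).message;
  }

  if (failure) {
    await fetch(process.env.SLACK_WEBHOOK_URL!, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ text: `Health check failed for ${url}: ${failure}` }),
    });
    process.exitCode = 1;
  }
}

main();
```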
The war story that made me write this
Last year I shipped a deploy that renamed an env var. I'd updated .env.example. I'd updated the code. I had not updated the production env var. The deploy went green. The 200-only /api/health kept returning 200. CI passed because tests use a different config.
For 41 minutes, every customer request to the affected endpoint returned 500. I noticed because someone tweeted at me.
If the health check had done a real DB roundtrip with the production config, it would have failed at deploy-time and the platform would have refused to promote the build. Instead it merrily reported "healthy" while 100% of real traffic broke.
That bug cost me a customer. Worse, it cost me trust — they'd been one of my early users.
The two-line fix (real DB query, 503 on failure) would have caught it inside the first request after deploy.
TL;DR
A health check that returns 200 without checking anything tells you the function is reachable. That's it. It's the cheapest possible information and it's almost never the information you need.
A useful health check:
- Hits the same runtime your customers hit (`runtime = "nodejs"` if that's what you use)
- Refuses to be cached (`dynamic = "force-dynamic"` + `Cache-Control: no-store`)
- Does a real, cheap roundtrip to your most fragile dependency
- Returns `503` on failure, not `200` with a flag in the body
- Is checked from outside your infra, not just by your platform's internal probe
Do that and you'll catch the boring bugs that take down indie SaaS — environment drift, expired tokens, silent CDN issues — before your customers do.
If you found this useful, I wrote a shorter version on Stack Overflow covering the same patterns, and I publish more indie-SaaS-on-Vercel posts here on dev.to. The monitoring tool I built (SitePulse) is free for 5 monitors if you want the "external monitor" half of the story without writing it yourself.