I've been paying UptimeRobot for years. It works. The free tier is generous. I have no real beef with them.
But every time I tried to add a 6th monitor, the upgrade modal appeared. Every time I logged in to check a site, the dashboard nudged me toward Pro. Every time I wanted a public status page on my own domain, that was a paid feature too.
Eventually I asked the question every indie dev asks at some point: how hard could this actually be?
It turned out: a weekend to MVP, two weeks to ship to paying customers. Here's the architecture, the parts that surprised me, and the bugs that cost me an afternoon each.
The whole product, on one page
- Probe a list of URLs every minute. HEAD or GET, optional body keyword check.
- Detect "down" reliably — don't email you because of one flaky packet.
- Email when status flips. Don't email every minute the site stays down.
- Render a public status page at a custom slug.
- Bill it. $9/month for 25 monitors, free up to 5.
That's the spec. Anything else I considered building, I asked: "would my own indie projects need this?" The answer for incident management, on-call rotations, request tracing, RUM, and Slack threading was: no. So they didn't get built.
The stack
- Next.js 16 (App Router) on Vercel
- Supabase for Postgres + Auth (Tokyo region — more on this below)
- Vercel Cron runs a single endpoint every minute
- Resend for alert emails
- Stripe Checkout + webhooks for billing
That's it. No queue, no Redis, no separate worker fleet. The whole backend is one cron endpoint and a handful of Server Actions.
The 1-minute heartbeat
Vercel Cron sends a GET to /api/cron/check every minute. A single endpoint handles every monitor on the platform — no per-monitor crons, no fan-out queue.
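The schedule itself is one entry in vercel.json (the path matches the route above):

// vercel.json
{
  "crons": [
    { "path": "/api/cron/check", "schedule": "* * * * *" }
  ]
}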
The flow:
cron tick
→ claim_due_monitors (Postgres function, atomic SELECT FOR UPDATE)
→ process up to 200 monitors in parallel batches of 25
→ fetch each URL with AbortController timeout
→ upsert check result + flip status if needed
→ enqueue alert if status transitioned
The Postgres function is the load-bearing piece. It locks rows that are due for a check, bumps their next_check_at, and returns them in one round-trip. Two cron workers will never claim the same monitor in the same tick, because Postgres handles the contention for me.
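For reference, here's a minimal sketch of the monitors table the function operates on. The columns are inferred from the snippets in this post, so treat it as illustrative; the real schema surely has more:

create table monitors (
  id                uuid primary key default gen_random_uuid(),
  url               text not null,
  method            text not null default 'HEAD',
  keyword           text,                          -- optional body keyword check
  timeout_ms        int not null default 10000,
  interval_seconds  int not null default 60,
  active            boolean not null default true,
  status            text not null default 'up',
  consecutive_fails int not null default 0,        -- for flap suppression (my assumption)
  next_check_at     timestamptz not null default now()
);

And the claim function itself: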
-- simplified
create function claim_due_monitors(p_limit int)
returns setof monitors
language plpgsql
as $$
begin
  return query
  update monitors
  set next_check_at = now() + (interval_seconds * interval '1 second')
  where id in (
    select id from monitors
    where next_check_at <= now() and active = true
    order by next_check_at
    for update skip locked
    limit p_limit
  )
  returning *;
end;
$$;
`for update skip locked` is the magic. It lets a second cron worker (which won't happen here, but you want it to be safe) skip rows that are already being processed instead of waiting for a lock.
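Calling it from the cron route is one RPC round-trip. A minimal sketch, assuming a service-role Supabase client and Vercel's optional CRON_SECRET check (names like admin are mine, not necessarily what ships):

// app/api/cron/check/route.ts (sketch)
import { NextResponse } from "next/server";
import { createClient } from "@supabase/supabase-js";

export async function GET(req: Request) {
  // Vercel sends "Authorization: Bearer <CRON_SECRET>" when CRON_SECRET is set
  if (req.headers.get("authorization") !== `Bearer ${process.env.CRON_SECRET}`) {
    return new NextResponse("unauthorized", { status: 401 });
  }

  const admin = createClient(
    process.env.NEXT_PUBLIC_SUPABASE_URL!,
    process.env.SUPABASE_SERVICE_ROLE_KEY! // bypasses RLS; server-only
  );

  // One round-trip: lock, bump next_check_at, return the claimed rows
  const { data: due, error } = await admin.rpc("claim_due_monitors", { p_limit: 200 });
  if (error) return new NextResponse(error.message, { status: 500 });

  // ...batch `due` through processMonitor (next section)
  return NextResponse.json({ claimed: due?.length ?? 0 });
}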
Probing 25 URLs concurrently in one function
Each tick can hit dozens of URLs. The cron route batches them with Promise.allSettled:
const CONCURRENCY = 25;
for (let i = 0; i < monitors.length; i += CONCURRENCY) {
const slice = monitors.slice(i, i + CONCURRENCY);
const results = await Promise.allSettled(
slice.map((m) => processMonitor(admin, m))
);
// ...tally results, log errors
}
The probe itself is just `fetch` with three things you must get right:

const controller = new AbortController();
const timeout = setTimeout(() => controller.abort(), monitor.timeout_ms);

try {
  const res = await fetch(monitor.url, {
    method: monitor.keyword ? "GET" : monitor.method,
    redirect: "follow",
    signal: controller.signal,
    headers: {
      "user-agent": "SitePulseBot/1.0 (+https://sitepulse.satosushi.co)",
      "cache-control": "no-cache",
    },
    cache: "no-store", // never let Next cache a probe
  });

  // ALWAYS drain the body, even if you don't need it.
  // Otherwise the socket stays open and the next probe pays connect cost.
  if (!monitor.keyword) {
    try { await res.arrayBuffer(); } catch {}
  }

  // ...record result
} finally {
  clearTimeout(timeout);
}
Three subtle things in there:
- `cache: "no-store"` — Next.js will happily cache `fetch` responses in production. You don't want a cached HTTP probe.
- Drain the body — if you don't read the response body, the underlying connection sits in limbo. Across hundreds of probes per minute, this matters.
- `AbortController` for timeouts — `fetch` has no built-in timeout. The default is "wait forever." Don't.
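For completeness, here's roughly what happens after the fetch: the keyword check and the status flip. This is a hedged sketch; the two-failure threshold, the consecutive_fails column, and the sendStatusAlert helper are my assumptions, not the shipped code.

// inside processMonitor, after the fetch above (illustrative)
let isUp = res.ok;
if (monitor.keyword) {
  const body = await res.text(); // keyword monitors read the body (this also drains it)
  isUp = isUp && body.includes(monitor.keyword);
}

// Require two consecutive failures before flipping to "down",
// so one flaky packet never sends an email (threshold is an assumption).
const fails = isUp ? 0 : monitor.consecutive_fails + 1;
const newStatus = isUp ? "up" : fails >= 2 ? "down" : monitor.status;

await admin
  .from("monitors")
  .update({ status: newStatus, consecutive_fails: fails })
  .eq("id", monitor.id);

// Alert only on transitions, never every minute the site stays down
if (newStatus !== monitor.status) {
  await sendStatusAlert(monitor, newStatus); // Resend email behind this helper
}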
The 1-second latency I didn't notice for two days
I deployed the first version and page loads felt sluggish. Not broken — just sluggish. Maybe 800ms-1.2s for the dashboard to render.
Vercel Functions default to iad1 (Washington DC). My Supabase project is in Tokyo. Every Server Component that hit the database was making a US-east → Tokyo → US-east round trip per query. With 3-4 queries per page render, that's a second of pure network sitting between the user and the page.
// vercel.json
{
  "regions": ["hnd1"]
}
One line. Pinning functions to Tokyo (hnd1) drops Server Component render time to under 100ms. The lesson generalises: always colocate compute with your primary data store, especially for Server Components, where every render is a synchronous database conversation.
The Server Component cookie crash
Next.js 16's App Router gives Server Components access to cookies. Supabase's createServerClient wants a setAll callback so it can refresh tokens.
But Server Components are read-only — you can't set cookies during a render in production. If a token refresh happens during a Server Component pass, setAll throws, and the entire page returns a 500 with that lovely React digest:
Server Components render
→ Supabase tries to refresh expired token
→ setAll attempts to write cookies
→ ERR_HTTP_HEADERS_SENT-style error
→ 500 with digest 972974443
The fix is one try/catch in the server client factory (this is the standard @supabase/ssr pattern, with the catch added):

// lib/supabase/server.ts
import { cookies } from "next/headers";
import { createServerClient } from "@supabase/ssr";

export async function createClient() {
  const cookieStore = await cookies();

  return createServerClient(
    process.env.NEXT_PUBLIC_SUPABASE_URL!,
    process.env.NEXT_PUBLIC_SUPABASE_ANON_KEY!,
    {
      cookies: {
        getAll() {
          return cookieStore.getAll();
        },
        setAll(cookiesToSet) {
          try {
            cookiesToSet.forEach(({ name, value, options }) =>
              cookieStore.set(name, value, options)
            );
          } catch {
            // Server Component context — token will be refreshed on next request.
            // Safe to swallow.
          }
        },
      },
    }
  );
}
The Supabase docs hint at this but the existing examples I copied didn't have the try/catch. If your Supabase + Next.js 16 app sometimes 500s on logged-in users after a token expiry, this is probably why.
The trailing whitespace bug that ate three hours
I copied my STRIPE_WEBHOOK_SECRET from the Stripe CLI output into Vercel's env var UI. Webhooks 400'd in production. Local was fine.
The Stripe webhook secret picks up a trailing newline if you copy it from a terminal, and Vercel stores it verbatim — newline included. The signature your handler computes with that secret never matches the Stripe-Signature header, verification fails, and you get a 400 in your logs with no obvious cause.
The fix is to never trust your clipboard:
echo -n "$STRIPE_WEBHOOK_SECRET" | tr -d '\n\r\t ' | vercel env add STRIPE_WEBHOOK_SECRET production
Same gotcha applies to any header-borne secret: API tokens, basic auth, JWT signing keys. If signature verification fails in prod but works locally, check whitespace before checking anything else.
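A belt-and-braces defense is to strip whitespace at the read site too. A minimal sketch of the webhook handler with stripe-node (route path and handling are illustrative):

// app/api/stripe/webhook/route.ts (illustrative)
import Stripe from "stripe";

const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!);

export async function POST(req: Request) {
  const payload = await req.text(); // raw body: verify before parsing
  const signature = req.headers.get("stripe-signature") ?? "";

  try {
    const event = stripe.webhooks.constructEvent(
      payload,
      signature,
      process.env.STRIPE_WEBHOOK_SECRET!.trim() // survives a pasted newline
    );
    // ...handle event.type (checkout.session.completed, etc.)
    return new Response("ok", { status: 200 });
  } catch {
    return new Response("invalid signature", { status: 400 });
  }
}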
What I deliberately didn't build
The list of things people expect from a "real" uptime monitor that I left out:
- Logs / RUM / transaction monitoring. That's what Sentry and Logflare are for.
- Multi-region probing. I check from one region. If Cloudflare has a bad day in São Paulo and your site is down there but fine from my probe region, you and I will both find out at the same time, from your users.
- On-call schedules / rotations. Indie devs are a one-person rotation. If I'm asleep, the alert waits.
- Slack / Discord / PagerDuty / OpsGenie. Email + SMS covers the people I'm building for.
- 5-second checks. 1-minute is enough for indie projects. Sub-minute is genuinely expensive to do reliably.
Cutting features wasn't a sacrifice — it was the product. The competitors I respect (UptimeRobot, BetterStack, Pingdom) all do most of what I left out, and that's exactly why their pricing pages have four columns and a "contact sales" button.
What I learned about flat pricing
The most-discussed part of this product hasn't been the technical stack — it's been the price.
$9/month for 25 monitors. No per-seat. No per-region. No per-channel. Free up to 5 monitors.
The reasoning: when I'm picking a tool for a side project, I don't have time to evaluate three pricing tiers and figure out which one I'd grow into. I want one number. Flat pricing forces the product to do less, which forces me to make better tradeoffs about what to build.
It also means the product can never become "enterprise." That's fine. There are already excellent enterprise uptime monitors. There aren't enough good ones for indie devs.
The link
The product is live at sitepulse.satosushi.co — 5 monitors free, $9/mo for 25, no card to start. If you've been on UptimeRobot and want to see how it stacks up, I wrote a side-by-side comparison too.
Not open source — that's the business — but happy to answer architecture questions in the comments. The cron-claim-and-fan-out pattern in particular has been more reliable than any queue I've shipped, and I think it generalises to a lot of "do this thing every N seconds for N users" problems where you'd otherwise reach for SQS.