
Samed Kahyaoglu

Originally published at drumbeats.io

Shipping uptime monitoring to an existing cron product: the decisions I got wrong first

Quick context: I am Samed, founder of Drumbeats. It started as a cron and background-job monitoring service and now also does HTTP uptime. This post is the build log of adding uptime monitoring on top of a product that was not originally designed for it, mostly so the next person doing this can skip the two decisions I had to redo.

The starting point

Drumbeats' existing model was push-based: your cron job pings a unique URL when it runs, we alert you when it does not ping on schedule. Everything in the system — monitors, notification routing, incidents, status pages — assumes "an event happened (or did not)."

HTTP uptime is the opposite. It is pull-based: we make a request every minute, record the result, and alert when things go wrong. Different data model, different billing implications, different failure modes.

The interesting question was not "can I bolt uptime on." The interesting question was "which decisions do I inherit from the cron side, and which do I reject?"

Four decisions mattered. I got two of them right on the first pass, and two of them wrong.

Decision 1: retry logic before alerting (got this right)

A single failed HTTP check should not page anyone. Networks are noisy. Deploys take 30 seconds. Load balancers briefly return 502s during warmup.

I shipped with 3 retries spaced 5 seconds apart before opening an incident. If any retry succeeds, nothing happens. If all three fail, we open the incident and send notifications.

This was the easy call. The hard call was the parameters:

  • Retry count: 2 is too few (you still get false positives on transient blips), 5 is too many (you are now 25+ seconds late on real outages)
  • Retry spacing: 1 second is too tight (you are just retrying the same network state), 10+ seconds is too slow (real users are getting errors while you wait)

Three retries at five-second gaps means 15 seconds of delay on a true positive, and it filters almost every false positive I have seen in my own checks over the last six months. If someone wants tighter alerting, they can drop the check interval to 1 minute; if they want looser, 5 minutes.

I left the retry parameters hard-coded on purpose. Every exposed configuration knob is a support ticket waiting to happen.
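For concreteness, here is a minimal sketch of the shape of that check cycle. The parameter values come from the post; everything else (`probeOnce`, `openIncident`, the 10-second request timeout) is hypothetical illustration, not Drumbeats' actual code.

```typescript
// Hypothetical sketch of the retry-before-alert flow. Parameter values
// match the post; the function names are invented for illustration.
const RETRY_COUNT = 3;          // hard-coded on purpose
const RETRY_SPACING_MS = 5_000; // 5 seconds between retries

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

// Returns true if the endpoint looks healthy (2xx), false otherwise.
async function probeOnce(url: string): Promise<boolean> {
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(10_000) });
    return res.status >= 200 && res.status < 300;
  } catch {
    return false; // network error, DNS failure, timeout, etc.
  }
}

function openIncident(url: string): void {
  // Placeholder: the real system would create an incident and fan out
  // to the monitor's notification groups.
  console.log(`incident opened for ${url}`);
}

// One scheduled check cycle: a single failure triggers retries, and only
// a full run of failed retries opens an incident.
async function runCheck(url: string): Promise<void> {
  if (await probeOnce(url)) return; // healthy, nothing happens

  for (let attempt = 1; attempt <= RETRY_COUNT; attempt++) {
    await sleep(RETRY_SPACING_MS);
    if (await probeOnce(url)) return; // any retry succeeding cancels the alert
  }
  openIncident(url); // all retries failed: ~15s after the first failure
}
```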

Decision 2: billing model (got this wrong first)

My first draft: uptime monitors get a flat monthly cost per monitor, priced by check interval. Cheap at 5-min, more at 1-min.

This was wrong for three reasons:

  1. It created a second billing mental model on top of the existing one (cron uses a usage-based "Beats" system).
  2. It made retries invisible — a flapping endpoint that triggered 3 retries per check would cost me compute I was not charging for.
  3. It meant I had to explain two different pricing systems on the pricing page.

What I shipped instead: every HTTP check consumes 1 Beat. Every retry consumes 1 Beat. Same unit as everything else in the product. A monitor checking every minute uses ~43,200 Beats/month; every 5 minutes uses ~8,640.

The retry billing is the important part. It aligns my costs with my prices. A customer with a perpetually flapping endpoint is doing 4x the checks (check + 3 retries per cycle when failing) and pays 4x the Beats. My infra cost scales linearly, my revenue scales linearly, I never need to care about their error rate.
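To put numbers on that, here is a back-of-the-envelope version of the Beat accounting. The function and constant names are mine, for illustration only:

```typescript
// Back-of-the-envelope Beat accounting for an uptime monitor.
// Every check attempt, including each retry, costs exactly 1 Beat.
const RETRIES_PER_FAILED_CYCLE = 3;

function beatsPerMonth(checkIntervalMinutes: number, failingFraction = 0): number {
  const cyclesPerMonth = (60 * 24 * 30) / checkIntervalMinutes;
  // A healthy cycle is 1 Beat; a failing cycle is 1 check + 3 retries = 4 Beats.
  const beatsPerCycle = 1 + failingFraction * RETRIES_PER_FAILED_CYCLE;
  return cyclesPerMonth * beatsPerCycle;
}

console.log(beatsPerMonth(1));    // 43200:  every minute, always healthy
console.log(beatsPerMonth(5));    // 8640:   every 5 minutes
console.log(beatsPerMonth(1, 1)); // 172800: perpetually flapping, 4x the Beats
```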

The thing I got wrong: I initially did not bill for retries. I shipped internally with "retries are free" thinking it was a nice gesture. It was not a nice gesture — it was an invitation for pathological customers. Fixed before launch.

Decision 3: reuse notification plumbing (got this right)

I reused the existing notification-group system instead of building an uptime-specific alerting path.

If a user already has Slack + Email configured for their cron monitors, their uptime alerts go to the same places. No second configuration panel, no "oh, uptime alerts have their own routing rules" surprise.

This decision fell out of the cron-side investment I had already made. The notification-group abstraction was built to be monitor-type-agnostic, and this is the first time I got to cash in on that. If I had hard-coded notification logic into the cron monitor type originally, I would have been paying the cost here.
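As a sketch of what "monitor-type-agnostic" means in practice: the group routes alerts without ever branching on the monitor type. The types below are illustrative, not the actual schema.

```typescript
// Illustrative types: the point is that the notification group never
// inspects which monitor type produced the alert.
type MonitorType = "cron" | "uptime";

interface Alert {
  monitorId: string;
  monitorType: MonitorType; // carried for display only, never for routing
  title: string;
  body: string;
}

interface Channel {
  send(alert: Alert): Promise<void>; // Slack, email, webhook, ...
}

// One routing path for every monitor type. Adding uptime monitors
// required zero changes to anything shaped like this.
class NotificationGroup {
  constructor(private channels: Channel[]) {}

  async dispatch(alert: Alert): Promise<void> {
    await Promise.all(this.channels.map((c) => c.send(alert)));
  }
}
```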

Decision 4: status page integration (got this wrong first)

My first draft separated uptime monitors and cron monitors on the public status page. Two lists, two headings.

Real-world customer reaction during beta: confusion. Their status page is for their users, who do not care whether a specific check is implemented as a cron ping or an HTTP probe. They care whether "API" or "Payment processing" is up.

I collapsed it into a single operational view. A cron monitor and an uptime monitor can sit under the same status-page component, and the end user sees one health state. Internally I still treat them as different monitor types — they have different schemas, different check runners, different alerting rules — but the public surface is unified.
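In sketch form, the collapsed model looks something like this: the public component aggregates health across monitors of any type, and only the component name and a single health state are exposed. Names and types here are illustrative.

```typescript
// Sketch of the unified status-page model: a public component rolls up
// health across monitors of any type. Names are invented for illustration.
type Health = "up" | "down";

interface Monitor {
  type: "cron" | "uptime"; // internal distinction only
  health(): Health;
}

// The public surface: "API" or "Payment processing" with one health state,
// regardless of how each underlying check is implemented.
class StatusPageComponent {
  constructor(public name: string, private monitors: Monitor[]) {}

  health(): Health {
    return this.monitors.every((m) => m.health() === "up") ? "up" : "down";
  }
}
```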

Lesson I will keep: internal model boundaries should not leak into the customer-facing surface unless they carry genuine customer value. "The implementation type of this check" is not customer-facing value.

What I am not doing yet

Three things I deliberately cut from v1:

  • Keyword verification (assert response body contains a string)
  • SSL certificate expiry alerts
  • Multi-region checks

All three are on the next list. I shipped without them because none of them are the 80% case — the 80% case is "is my URL returning 2xx on schedule," and adding the other three would have doubled the scope of v1 without changing who could adopt it on day 1.

The multi-region one is the interesting one technically. I have a draft design I am sitting on. Will write about it separately.

The honest retrospective

If I were starting over, I would have:

  1. Designed the billing unit first, before writing a single line of the check runner. Every other decision cascades from it.
  2. Prototyped retry behavior against real third-party endpoints earlier. The "3 retries × 5 seconds" number looks obvious in hindsight. It was not obvious from a whiteboard.
  3. Killed the feature if the notification-group abstraction did not exist. Rebuilding that would have doubled the project. Knowing what abstractions you can stand on is half the engineering decision.

If you are adding a new check type to an existing monitoring product, or adding any pull-based feature to a push-based system, my honest advice is: the product decisions are harder than the code. The code is a week. The decisions are three weeks.
