DEV Community

Justyn Larry for Irin Observability

Posted on • Originally published at irinobservability.com

Stop Relying Entirely on Uptime Kuma for Incident Response

Before I get into this, let me be clear: this post is not a knock on Uptime Kuma. It's a genuinely amazing, easy-to-use piece of software. If you run a homelab or a small fleet and you're not using it, you probably should be. It's free, self-hosted, beautiful, and it does the thing it was built to do better than almost anything else at any price.

There's always a "but," though, so before we get to it I want to spend a little time on what Uptime Kuma does well.

TL;DR:

Uptime Kuma is excellent at telling you when a service becomes unreachable, but it cannot explain why a service is slow or unhealthy while still responding. That requires internal metrics from tools like Prometheus, Grafana, and Alloy. Reachability monitoring and systems monitoring solve different problems, and mature environments typically use both.

Where Uptime Kuma excels

Uptime Kuma answers one question extremely well: is this reachable? It'll ping a host, hit an HTTP endpoint and check the status code, watch a TCP port, validate a TLS cert's expiry, query a DNS record, check a keyword on a page, watch a Docker container, even poke a game server. It checks on a tight interval, shows you a clean history, and when something stops responding it fires a notification through basically any channel you can name. Ninety-plus notification integrations. Status pages you can hand to your users. Two-factor auth. A genuinely nice UI.

For "tell me the moment my website, my reverse proxy, my Plex, or my Home Assistant stops answering," it's close to perfect. The interval is short, setup is measured in minutes, and there's practically no maintenance. It has earned a famously loyal userbase for a reason.

I'm not here to tell you it isn't the answer, or to convince you to ditch it for something else. I still think everyone running infrastructure of any size should have something like it watching their endpoints. I have it running in a Proxmox container on my own homelab. But there's a gap I noticed while using it, and this post is about the specific moment when you ask Uptime Kuma a question it wasn't designed to answer, and what you do when that moment arrives.

Growing pains

There usually comes a time, as your homelab or business grows, when your database starts feeling slow. Queries that used to be instant are taking a little longer. It's not a real problem yet, but you can tell something is off.

The Uptime Kuma dashboard is all green. The database port is answering, the HTTP healthcheck returns 200, every light on the board is on. Uptime Kuma is correctly reporting that the service is up.

And it's not wrong. That's the thing. The service is reachable. But "reachable" and "healthy" mean different things, and you've just walked into the space between them. If the disk that database lives on is pinned at 100% IO utilization because a backup job and a big query are fighting over it, your queries are queuing behind that contention, and from the outside the port still answers in time to pass the check. The board is green, the database is slow, and there's no contradiction.
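
To make that gap concrete, here is a minimal Python sketch of the verdict an outside-in check reaches. The `ProbeResult` type, the `outside_view` function, and the 10-second timeout are hypothetical names and values picked for illustration; this is not Uptime Kuma's actual implementation, only the general shape of a reachability check.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """One observation from an external check (hypothetical type)."""
    status_code: int   # HTTP status returned by the health endpoint
    elapsed_s: float   # seconds until the response arrived


def outside_view(result: ProbeResult, timeout_s: float = 10.0) -> str:
    """Return "UP" if a 2xx response arrived before the timeout, else "DOWN".

    The 10-second timeout is an assumed value; real checks make it configurable.
    """
    answered_in_time = result.elapsed_s < timeout_s
    success = 200 <= result.status_code < 300
    return "UP" if (answered_in_time and success) else "DOWN"


healthy = ProbeResult(status_code=200, elapsed_s=0.05)
struggling = ProbeResult(status_code=200, elapsed_s=4.2)  # far slower, still answers

print(outside_view(healthy))     # UP
print(outside_view(struggling))  # UP
```

Both probes produce the same verdict. The check is answering the question it was asked; the slowdown simply lives below the threshold that would turn the board red.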

Uptime Kuma doesn't see any of that, and the reason isn't a missing feature; it's the architecture. It checks your systems from the outside looking in, so it has no way to see what's happening inside your servers: what the disk, memory, CPU, and kernel are actually doing.

What you need at that moment is something standing inside the box, reading the system from within. That's a different category of tool.

Reachability versus internals

There are two kinds of monitoring, and once you see the split you understand why mature setups end up running both.

Reachability monitoring (Uptime Kuma) asks the basic questions:

  • Can I get to it?
  • Is the port open, the page loading, the cert valid, the container running?

It reports what it can see from the outside, which is exactly what you want for "is my service up and can my users reach it." It's easy, simple, and honest about what it knows.

Systems monitoring (the Prometheus world) asks questions that are a little more involved:

  • What's going on inside the machine?
  • How busy is each CPU core?
  • How much memory is actually available once you account for cache?
  • What's the disk IO utilization, the queue depth, the read and write latency?
  • How much network throughput, how many dropped packets?
  • Is memory slowly leaking over days?

It's an internal view, and it answers why a service is behaving the way it is.
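
As one small example of an inside-the-box question, the "available once you account for cache" number comes from the `MemAvailable` field that Linux exposes in `/proc/meminfo`. The sketch below parses that file format in Python. The function name and the sample values are made up for illustration; on a real machine you would pass in the contents of `/proc/meminfo` instead.

```python
def available_memory_fraction(meminfo_text: str) -> float:
    """Return MemAvailable / MemTotal from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])  # values are reported in kB
    return fields["MemAvailable"] / fields["MemTotal"]


# Illustrative sample, not taken from a real host.
SAMPLE = """\
MemTotal:       16000000 kB
MemFree:          800000 kB
MemAvailable:    9600000 kB
Buffers:          400000 kB
Cached:          8200000 kB
"""

print(f"{available_memory_fraction(SAMPLE):.0%}")  # 60%
```

In this sample `MemFree` is only 5% of the total, which looks alarming, while `MemAvailable` is 60% because most of the "used" memory is reclaimable cache. No external probe can tell those two situations apart.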

Neither replaces the other. Reachability tells you that something is wrong. System metrics tell you why. The database scenario above needs both: Uptime Kuma to eventually notice if the slowness becomes an actual outage, and system metrics to explain the slowness long before it gets there.

The internal view

The standard way to get the inside view on Linux is a tiny agent called node_exporter. It's a small binary that runs on the box, reads metrics straight from the kernel, and exposes them for a time-series database (Prometheus) to collect. Pair it with Grafana for dashboards, and for logs, pair Loki with a shipper. The traditional choice there was Promtail, though Grafana has since moved Promtail into long-term support and now steers you toward Grafana Alloy, which handles both metrics and logs in a single agent. (I wrote a comparison of those two separately.)
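
If you have never looked at what node_exporter actually serves, it is plain text: one sample per line, with `#` lines for help and type information. Here is a minimal Python sketch of reading that format. The `parse_metrics` helper and the sample values are hypothetical, and the parser assumes lines carry no trailing timestamp; it is not a full implementation of the exposition format.

```python
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Map each series (name plus labels) to its value, skipping # lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last whitespace-separated token on the line.
        series, value = line.rsplit(" ", 1)
        samples[series] = float(value)
    return samples


# Illustrative sample in the style of node_exporter output.
SAMPLE = """\
# TYPE node_disk_io_time_seconds_total counter
node_disk_io_time_seconds_total{device="sda"} 1234.5
# TYPE node_memory_MemAvailable_bytes gauge
node_memory_MemAvailable_bytes 9.8304e+09
"""

metrics = parse_metrics(SAMPLE)
print(metrics['node_disk_io_time_seconds_total{device="sda"}'])  # 1234.5
```

By default node_exporter serves this text at `/metrics` on port 9100, and Prometheus does the scraping and parsing for you. The sketch is only here to show there is nothing exotic underneath.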

With either node_exporter or Alloy running, the database scenario stops being a mystery. The exact moment things felt slow, you can pull up:

  • Disk IO utilization on that box, and watch it pin to 100% right when the slowness started.
  • The specific disk and the read/write split, so you can see it was the backup volume contending with queries.
  • CPU broken out by mode, so you can rule out CPU as the cause.
  • Memory availability over the past week, so you can see whether pressure had been building.
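
The disk utilization number in the first bullet is derived from a counter. node_exporter exposes `node_disk_io_time_seconds_total`, the cumulative seconds a device has spent doing I/O, and utilization is how fast that counter grows per second of wall-clock time. In PromQL that is a `rate()` over the counter; the Python sketch below does the same arithmetic on two samples. The function name and the sample numbers are hypothetical.

```python
def io_utilization(prev_value: float, prev_time: float,
                   curr_value: float, curr_time: float) -> float:
    """Fraction of the interval the disk was busy, from two counter samples.

    Values are cumulative busy seconds; times are sample timestamps in seconds.
    """
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    busy = curr_value - prev_value
    # Clamp to [0, 1] so a counter reset or timing jitter cannot
    # produce a nonsensical percentage.
    return min(max(busy / elapsed, 0.0), 1.0)


# Two scrapes 15 seconds apart; the disk was busy for 14.7 of them.
util = io_utilization(prev_value=1000.0, prev_time=0.0,
                      curr_value=1014.7, curr_time=15.0)
print(f"{util:.0%}")  # 98%
```

A value that sits near 100% for minutes at a time is the "pinned" disk from the database scenario.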

And if you have Loki collecting logs alongside the metrics, you can line up the disk IO spike against the log line where the backup job kicked off, and the whole story assembles itself in one view. Uptime Kuma told you the service was up. The system metrics tell you the backup job is strangling your database disk, which is what you actually need to know to fix it before the slowdown turns into an outage.

(If the PromQL behind those dashboards is unfamiliar, I wrote up the five queries you actually need to monitor a Linux server separately.)

The honest part: this is more work

Standing up a stack to see inside your servers is not as simple as setting up Uptime Kuma, and that simplicity is a real part of why Uptime Kuma is so loved. Moving to system metrics comes with a cost.

node_exporter or Alloy goes on every server, with Prometheus running somewhere to collect from them. Grafana dashboards have to be built, or imported from the community and then tweaked until they're readable instead of overwhelming. Alert rules have to be written to fire on real problems without crying wolf. Metric and log retention have to be configured. And then the whole thing needs ongoing maintenance.
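
On the "without crying wolf" part: the usual trick is to require that a condition holds for a while before alerting, which is the idea behind the `for` clause in Prometheus alert rules. A rough Python sketch of that logic follows. The `should_fire` function, the 0.9 threshold, and the sample series are all hypothetical and only illustrate the principle.

```python
from typing import List


def should_fire(samples: List[float], threshold: float, hold: int) -> bool:
    """True only if the most recent `hold` samples all exceed `threshold`."""
    if hold <= 0 or len(samples) < hold:
        return False
    return all(value > threshold for value in samples[-hold:])


# Disk utilization sampled at a fixed interval (illustrative numbers).
brief_spike = [0.20, 0.97, 0.25, 0.22, 0.21]
sustained = [0.30, 0.96, 0.98, 0.99, 0.97]

print(should_fire(brief_spike, threshold=0.9, hold=3))  # False
print(should_fire(sustained, threshold=0.9, hold=3))    # True
```

A single busy sample is normal life for a disk; three in a row is worth a notification. Choosing the threshold and the hold time per metric is most of the tuning work.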

This is the irony nobody warns you about: you now have a monitoring stack that itself needs monitoring, which is partly why you want predictive disk alerts on the box running Prometheus.
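
"Predictive" here means fitting a trend to recent free-space samples and asking when the line hits zero, which is what PromQL's `predict_linear` function is for. The sketch below does a plain least-squares fit in Python. The function name and the sample data are hypothetical, and a straight-line fit is an assumption that works for steady growth but not for bursty writes.

```python
from typing import List, Optional, Tuple


def seconds_until_full(samples: List[Tuple[float, float]]) -> Optional[float]:
    """Estimate seconds until free space reaches zero.

    `samples` is a list of (timestamp_seconds, free_space) pairs.
    Returns None if there is too little data or free space is not shrinking.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    # Least-squares slope and intercept of free space over time.
    slope = sum((t - mean_t) * (y - mean_y) for t, y in samples) / var_t
    if slope >= 0:
        return None  # flat or growing free space: no fill predicted
    intercept = mean_y - slope * mean_t
    time_of_zero = -intercept / slope
    return max(0.0, time_of_zero - samples[-1][0])


# Free space in GiB, sampled hourly, shrinking by 2 GiB per hour (illustrative).
history = [(0.0, 100.0), (3600.0, 98.0), (7200.0, 96.0)]
remaining = seconds_until_full(history)
print(f"{remaining / 3600:.1f} hours")  # 48.0 hours
```

An alert on "less than a day of headroom at the current rate" gives you time to act, where a fixed "90% full" threshold may fire either too early or too late.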

None of it is hard, exactly. But it's an ongoing process with no end, and it's a different commitment than the near-zero maintenance of an Uptime Kuma container you set up once and edit when new services come online. Uptime Kuma is the right tool for reachability and status pages, and it costs almost nothing to run. System metrics cost more, but they become relevant the moment you start wondering why services aren't behaving, even though they're still showing green.

So what do you do with this

The real takeaway here isn't a product, it's the distinction, because that understanding outlives any particular tool. Outside-in tells you something broke. Inside-out tells you why. Uptime Kuma is one of the best outside-in tools ever made, and it'll happily keep doing that job for you forever. It just wasn't built to explain the why.

When you do need the why, you've got two options: run the inside-view stack yourself (node_exporter or Alloy, Prometheus, Grafana, Loki), which is completely viable and a great way to learn, or hand it to someone who runs it for you.

For full disclosure, that second path is the reason I built Irin Observability, a managed version of that inside-view stack for small teams and homelabs that have outgrown pure reachability checks but don't want a second full-time job maintaining a metrics pipeline. It's meant to sit alongside something like Uptime Kuma, not replace it, because reachability and internals are different jobs and the mature answer is to run both.

Either way, keep the green board. Just add the view from inside the box when you start asking why.

Top comments (2)

Luis

This post is making a point a lot of teams eventually hit in production:

Uptime Kuma is great for checks, but weak as a single source of truth for incident response.

Tight summary

The core argument is that people over-rely on Uptime Kuma dashboards and alerts as their incident system, when in reality it is only a signal generator, not an incident management layer.

What’s being missed in most setups

  1. Monitoring ≠ Incident response

Uptime Kuma tells you:

  • something is down
  • something recovered

But it does not handle:

  • incident grouping (deduplication across cascading failures)
  • severity classification (P0 vs P3)
  • escalation policies (who gets paged when)
  • acknowledgment workflows
  • incident timelines / postmortems

So teams end up treating raw alerts as “incidents,” which creates noise and confusion during outages.

  2. Alert storms during real outages

When a dependency fails (DB, network, upstream API), Kuma will:

  • trigger multiple independent alerts
  • flood Slack/Telegram/email
  • give no context about root cause

This leads to “we know everything is down, but nothing is actionable.”

  3. No correlation layer

Modern incident response needs:

  • grouping related failures
  • identifying upstream vs downstream cause
  • suppressing dependent alerts

Uptime Kuma intentionally doesn’t do this — it’s a monitor, not an observability engine.

  4. Missing incident lifecycle

Real incident systems (PagerDuty, Rootly, Opsgenie-style flows) provide:

  • incident creation
  • ownership assignment
  • status updates
  • resolution tracking
  • post-incident review

Kuma stops at “DOWN / UP”.

The real takeaway (what the article is pushing toward)

Uptime Kuma should be treated as a signal input, not the incident brain.

Production setups usually evolve into:

  • Kuma / Prometheus / probes → signal layer
  • Alert manager / routing layer → dedup + grouping
  • Incident system (PagerDuty-style or custom) → workflow + ownership

Practical implication

If you're building production infra:

Use Uptime Kuma for:

  • uptime checks
  • heartbeat monitoring
  • external validation

But pair it with:

  • alert routing (Alertmanager, custom event bus, or webhook processor)
  • incident orchestration layer (even a lightweight internal one)

Otherwise you get the classic failure mode:

“We have monitoring, but no incident system.”

That’s exactly the gap this post is calling out.

Justyn Larry Irin Observability

This is a great breakdown, and you've pushed the point one layer further than my post did.

The article was really aimed at the diagnosis gap, where Kuma tells you a service is down, but the moment you ask why it's behaving that way, you need system metrics underneath it. You've taken that a step further into the incident-response gap, which is a different axis entirely.

Where I'd add a small wrinkle to your model is the routing layer. For many small teams, Alertmanager grouping and inhibition rules solve most of the alert-storm problem before they need a full incident-management platform. If a database failure is already firing, suppressing the downstream application alerts often restores signal quality without introducing the operational overhead of ownership workflows, escalation chains, and incident tracking.

The full incident lifecycle earns its place as environments grow. I think there's a middle ground where good signals plus good routing can get teams further than they expect. Thank you for taking the time to expand on the idea!