Most uptime tools answer the same question: "is service X up?" as seen from one vantage point, the monitoring server. But in a hybrid cloud, "up" is not a property of a service. It's a property of a path.
The problem: "up" depends on where you're standing
A single-server monitor (Uptime Kuma, Pingdom, etc.) checks SQL-01 from itself. If it says UP, you assume everyone can reach it. But your real topology looks like this:
SQL-01 (on-prem), as checked from:
- FuncApp-01 (VNet-A) ❌ DOWN
- VM-01 (VNet-B) ✅ UP
- LogicApp-02 (VNet-A) ❌ DOWN
SQL-01 is fine. VNet-A's route/NSG is broken. A single-vantage monitor either sits in VNet-B (and reports all-green while half your workloads are cut off) or sits in VNet-A (and pages you for a "SQL outage" that isn't one).
The fix is to measure connectivity from the source that actually needs it.
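As a rough illustration of that reasoning, the sketch below groups per-source results by destination and separates "every source fails" from "only some locations fail". The `classify` function, the tuple layout, and the verdict strings are assumptions made for this example, not part of any particular tool.

```python
from collections import defaultdict

def classify(results):
    """results: iterable of (source, source_location, destination, is_up)."""
    by_dest = defaultdict(list)
    for source, location, destination, is_up in results:
        by_dest[destination].append((source, location, is_up))

    verdicts = {}
    for destination, rows in by_dest.items():
        down_locations = sorted({loc for _, loc, up in rows if not up})
        if not down_locations:
            verdicts[destination] = "healthy"
        elif all(not up for _, _, up in rows):
            # No source can reach it: the destination itself is the likely culprit.
            verdicts[destination] = "destination down"
        else:
            # Some sources succeed, so the destination is alive; blame the path.
            verdicts[destination] = "path problem from: " + ", ".join(down_locations)
    return verdicts

# The topology from the example above.
results = [
    ("FuncApp-01", "VNet-A", "SQL-01", False),
    ("VM-01", "VNet-B", "SQL-01", True),
    ("LogicApp-02", "VNet-A", "SQL-01", False),
]
print(classify(results))  # {'SQL-01': 'path problem from: VNet-A'}
```

With a single vantage point the `results` list has one row per destination, so the two failure cases cannot be told apart.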
Architecture: push-based agents, one row per (source × destination)
Instead of one server reaching out, run lightweight agents inside each network location — Azure Functions, AWS Lambda, GCP Cloud Functions, Docker, or on-prem VMs. Each agent runs the check from where it lives and pushes the result to the hub.
Agent (VNet-A) ──push──▶
Agent (VNet-B) ──push──▶ API hub ──▶ source × destination matrix
Agent (on-prem)──push──▶
Push-based matters: agents work behind firewalls, NATs and NSGs with no inbound ports opened. The agent only needs outbound HTTPS to the hub. That's what makes "monitor from literally anywhere" practical — the source measures its own real connectivity, and the matrix pinpoints whether a failure is the destination or a specific path.
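A minimal sketch of such an agent, using only the Python standard library, might look like the following. The payload field names, the bearer-token header, and the hub URL mentioned afterwards are placeholders assumed for illustration; a real hub defines its own contract.

```python
import json
import socket
import time
import urllib.request

def tcp_check(host, port, timeout=3.0):
    """Try a TCP connect from wherever this agent runs."""
    started = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            status = "up"
    except OSError:  # refused, timed out, unroutable, DNS failure
        status = "down"
    latency_ms = round((time.monotonic() - started) * 1000, 1)
    return {"status": status, "latency_ms": latency_ms}

def build_heartbeat(source, destination, result):
    # Hypothetical payload shape: one row per (source, destination) check.
    return {
        "source": source,
        "destination": destination,
        "checked_at": int(time.time()),
        **result,
    }

def push(hub_url, token, heartbeat, timeout=5.0):
    """Outbound-only HTTPS POST; the agent never accepts inbound traffic."""
    request = urllib.request.Request(
        hub_url,
        data=json.dumps(heartbeat).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

An agent in VNet-A would call `tcp_check("sql-01.internal", 1433)`, wrap the result with `build_heartbeat`, and `push` it to something like `https://hub.example.com/api/heartbeats` on a timer. All three of those names are made up for this example.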
Scaling the read side: heartbeat rollups
Distributed monitoring multiplies data: sources × destinations × checks. The dashboard's /summary endpoint refreshes every 30s and originally scanned every raw heartbeat in the last 24h:
- ~monitors × (86400 / interval) rows
- At 200 monitors @ 30s → ~576k rows per request, every 30s, per open dashboard.
That grows linearly with monitor count and check frequency. Solution: pre-aggregate each completed hour once into a heartbeat_rollup table.
24h window = [ completed hourly buckets (rollup) ] + [ current partial hour (raw) ]
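The row counts for both read paths can be checked with a few lines of arithmetic. The function names are illustrative, and the rollup figure assumes the worst case where the current hour is almost complete.

```python
def raw_rows(monitors, interval_s, window_s=86_400):
    """Rows scanned when every raw heartbeat in the window is read."""
    return monitors * (window_s // interval_s)

def rollup_rows(monitors, interval_s, window_hours=24):
    """Rows scanned with hourly rollups plus the raw current hour."""
    completed = monitors * (window_hours - 1)   # one rollup row per monitor per completed hour
    partial = monitors * (3_600 // interval_s)  # worst case: raw rows for a nearly full hour
    return completed + partial

print(raw_rows(200, 30))     # 576000
print(rollup_rows(200, 30))  # 28600
```

Most of the remaining cost is the partial hour of raw heartbeats, which is bounded by one hour of data no matter how much history accumulates.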
Per-request rows drop from ~576k → ~28k and stay roughly flat as history grows. Each rollup row is one (monitor_id, bucket_start) with a time-weighted up_ms / down_ms / stale_ms breakdown, so:
uptime% = sum(up_ms) / (sum(up_ms) + sum(down_ms))
stale_ms (agent offline) is excluded from the denominator — a monitor being blind is not the same as the service being down.
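A small sketch of that calculation over rollup rows follows. The `RollupRow` dataclass mirrors the columns named above, but its exact shape, and the choice to return `None` when only stale time was recorded, are assumptions for this example.

```python
from dataclasses import dataclass

@dataclass
class RollupRow:
    monitor_id: str
    bucket_start: int  # epoch seconds at the start of the hour
    up_ms: int
    down_ms: int
    stale_ms: int

def uptime_pct(rows):
    up = sum(r.up_ms for r in rows)
    down = sum(r.down_ms for r in rows)
    observed = up + down  # stale_ms is deliberately left out
    if observed == 0:
        return None  # nothing was observed, so uptime is unknown rather than 0%
    return 100.0 * up / observed

rows = [
    RollupRow("sql-01", 0, up_ms=3_600_000, down_ms=0, stale_ms=0),
    RollupRow("sql-01", 3_600, up_ms=1_800_000, down_ms=600_000, stale_ms=1_200_000),
]
print(round(uptime_pct(rows), 2))  # 90.0
```

Had the 20 stale minutes in the second hour been counted as downtime, the same data would report 75% instead of 90%.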
The correctness trap: slice ownership
Uptime isn't count(up) / count(total). It's time-weighted. A "slice" spans two consecutive heartbeats and is owned by the later heartbeat's status; it's stale when its full length exceeds 2×interval + grace. When a slice crosses an hour boundary, its duration is split across hours, but its up/down/stale classification is decided once from the full slice — so summing hourly pieces exactly equals computing over the whole window. Carry-in/carry-out heartbeats keep each boundary self-consistent.
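The sketch below is one way to express slice ownership and hour splitting, assuming heartbeats arrive as sorted `(timestamp_ms, status)` pairs with status `"up"` or `"down"`. Function names and the data layout are illustrative; it is not the production implementation.

```python
from collections import defaultdict

HOUR_MS = 3_600_000

def slices(heartbeats, interval_ms, grace_ms):
    """Yield (start_ms, end_ms, kind) for each pair of consecutive heartbeats."""
    for (prev_ts, _), (ts, status) in zip(heartbeats, heartbeats[1:]):
        # Classified once, from the full slice length, before any splitting.
        if ts - prev_ts > 2 * interval_ms + grace_ms:
            kind = "stale"
        else:
            kind = status  # owned by the later heartbeat
        yield prev_ts, ts, kind

def hourly_rollup(heartbeats, interval_ms, grace_ms):
    buckets = defaultdict(lambda: {"up": 0, "down": 0, "stale": 0})
    for start, end, kind in slices(heartbeats, interval_ms, grace_ms):
        cursor = start
        while cursor < end:  # split only the duration across hour boundaries
            bucket_start = cursor - cursor % HOUR_MS
            piece_end = min(end, bucket_start + HOUR_MS)
            buckets[bucket_start][kind] += piece_end - cursor
            cursor = piece_end
    return dict(buckets)

def whole_window(heartbeats, interval_ms, grace_ms):
    totals = {"up": 0, "down": 0, "stale": 0}
    for start, end, kind in slices(heartbeats, interval_ms, grace_ms):
        totals[kind] += end - start
    return totals

heartbeats = [
    (3_540_000, "up"),
    (3_570_000, "up"),
    (3_630_000, "down"),  # 60s slice crossing the hour boundary at 3_600_000
    (3_660_000, "up"),
    (3_900_000, "up"),    # 240s gap: stale at interval=30s, grace=10s
]
rollup = hourly_rollup(heartbeats, 30_000, 10_000)
total = whole_window(heartbeats, 30_000, 10_000)
print(rollup)
print(total)
```

If the stale test were applied to each hourly piece instead of the full slice, a long gap straddling a boundary could be cut into two pieces that each look healthy, and the hourly sums would stop matching the whole-window result.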
The rollup was verified to be identical to the full-window computation, and it stays within ~0.1 percentage points (pp) of an exact sliding 24h window.
Real-time layer: Socket.IO
The rollups make history cheap; Socket.IO makes now live. Status transitions are pushed to open dashboards so the matrix flips red/green in real time instead of waiting for the next 30s poll.
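Pushing only transitions, rather than every heartbeat, can be sketched with a small state tracker. The `TransitionNotifier` class and the `status_change` event name are hypothetical; `emit` is any callable taking an event name and a payload, which in a real deployment could be a Socket.IO server's emit method.

```python
class TransitionNotifier:
    """Remembers the last status per (source, destination) and emits on change."""

    def __init__(self, emit):
        self._emit = emit  # callable(event_name, payload)
        self._last = {}    # (source, destination) -> last seen status

    def observe(self, source, destination, status):
        key = (source, destination)
        previous = self._last.get(key)
        self._last[key] = status
        # The first observation only sets a baseline; it is not a transition.
        if previous is not None and previous != status:
            self._emit("status_change", {
                "source": source,
                "destination": destination,
                "from": previous,
                "to": status,
            })
            return True
        return False

sent = []
notifier = TransitionNotifier(lambda event, payload: sent.append((event, payload)))
notifier.observe("FuncApp-01", "SQL-01", "up")
notifier.observe("FuncApp-01", "SQL-01", "up")    # no change, nothing emitted
notifier.observe("FuncApp-01", "SQL-01", "down")  # emits one status_change
print(sent)
```

Keying the state on the (source, destination) pair means one matrix cell can flip without touching the others.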
Takeaway
If your infrastructure spans more than one network, "is it up?" is the wrong question. Ask "up from where?" — and measure it from the source.
UptimeGrid is a hosted platform built on this model (distributed source × destination monitoring, free tier, no card). Try it at uptimegrid.net. But even if you build your own, the lesson stands: monitor from the source, measure the path, not just the endpoint.