Le Gia Hoang

Posted on Apr 10

How we built multi-region uptime consensus on the BEAM — zero external dependencies

#elixir #erlang #distributed #devops

Originally published at uptrack.app/blog/multi-region-consensus

The 3am problem

UptimeRobot checks your site from one location. A CDN edge goes down in Frankfurt. Your server in Virginia is fine. Your users in Tokyo see no issues. But the single check from Frankfurt fails, and your phone buzzes at 3am.

This is the false alert problem. Single-region monitoring can't distinguish between "the internet is broken between point A and point B" and "your server is actually down."

The fix sounds simple: check from multiple regions, only alert if most agree. But the implementation is surprisingly hard. You need to coordinate checks across continents, collect results in real time, compute consensus, and avoid duplicate alerts, all without adding latency or single points of failure.

Here's how we did it with zero external dependencies using Erlang/OTP primitives.

The architecture: one process per monitor

Uptrack follows the Discord/WhatsApp pattern: one BEAM process per long-lived entity. Each monitor gets its own GenServer that self-schedules checks via Process.send_after.
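To make the self-scheduling idea concrete, here is a minimal sketch of a process that re-arms its own timer after every check. It is not Uptrack's MonitorProcess: the module name, the `interval_ms` and `check_fun` options, and the `runs/1` helper are assumptions made up for illustration.

```elixir
defmodule SelfSchedulingSketch do
  use GenServer

  # Hypothetical options: :interval_ms (delay between checks) and
  # :check_fun (a zero-arity function standing in for the real HTTP check).
  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  # Returns how many checks have run so far (illustration only).
  def runs(pid), do: GenServer.call(pid, :runs)

  @impl true
  def init(opts) do
    state = %{
      interval_ms: Keyword.fetch!(opts, :interval_ms),
      check_fun: Keyword.fetch!(opts, :check_fun),
      runs: 0
    }

    # Run the first check right away; later ones are scheduled in handle_info/2.
    send(self(), :check)
    {:ok, state}
  end

  @impl true
  def handle_info(:check, state) do
    state.check_fun.()

    # Re-arm the timer only after this check has finished,
    # so checks for one monitor never overlap.
    Process.send_after(self(), :check, state.interval_ms)
    {:noreply, %{state | runs: state.runs + 1}}
  end

  @impl true
  def handle_call(:runs, _from, state), do: {:reply, state.runs, state}
end
```

Because the timer lives in the process itself, there is no central scheduler to scale or fail over: stopping the process stops the checks.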

For multi-region, the same monitor runs on every node. Three continents, three processes, one monitor:

┌─────────────────────────────────────────────────────────┐
│              Erlang Cluster (Tailscale mesh)             │
│                                                         │
│  EU (Germany)      Asia (India)      US (Virginia)      │
│  ┌────────────┐   ┌────────────┐   ┌────────────┐      │
│  │MonitorProc │   │MonitorProc │   │MonitorProc │       │
│  │ id: "abc"  │   │ id: "abc"  │   │ id: "abc"  │       │
│  │ Gun → target│  │ Gun → target│  │ Gun → target│      │
│  │ pg group   │◄─►│ pg group   │◄─►│ pg group   │       │
│  └────────────┘   └────────────┘   └────────────┘       │
└─────────────────────────────────────────────────────────┘

Each process holds a persistent Gun HTTP connection to the target. The TLS handshake happens once at startup, not on every check, so each check in the 30-second cycle costs only ~50ms of actual HTTP time.

How pg groups connect the continents

pg is an OTP module that's been in Erlang since OTP 23. It manages process groups across distributed nodes — tested at 5,000 nodes and 150,000 processes by the OTP team.

When a MonitorProcess starts, it joins a pg group named after its monitor ID:

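# The :monitor_checks scope must already be running on each node,
# e.g. started under the application supervisor with :pg.start_link(:monitor_checks)

# On init, each MonitorProcess joins the group
:pg.join(:monitor_checks, monitor_id, self())

# After checking, broadcast the result to the other group members
:pg.get_members(:monitor_checks, monitor_id)
|> Enum.reject(&(&1 == self()))
|> Enum.each(&send(&1, {:region_result, @region, result}))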

That's it. No message broker, no Redis pub/sub, no database polling. When a node dies, pg automatically removes its processes from all groups. Add a 4th region — the new process joins the pg group, and everyone sees it.
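The join-and-broadcast flow can be tried on a single node, since pg does not require distribution to be enabled. The sketch below is an assumption-laden stand-in: two plain spawned processes play the EU and Asia regions, the script itself plays the US, and the `{:joined, ...}` and `{:got, ...}` messages exist only so the script can observe what happened.

```elixir
# Start the scope once per node (normally done under a supervisor).
{:ok, _scope} = :pg.start_link(:monitor_checks)

parent = self()

# Two stand-in "region" processes join the group for monitor "abc"
# and forward the first message they receive back to the script.
members =
  for region <- [:eu, :asia] do
    spawn(fn ->
      :ok = :pg.join(:monitor_checks, "abc", self())
      send(parent, {:joined, region})

      receive do
        msg -> send(parent, {:got, region, msg})
      end
    end)
  end

# Wait until both processes have joined before broadcasting.
for _ <- members do
  receive do
    {:joined, _region} -> :ok
  end
end

# Broadcast a result to every current member of the group.
:pg.get_members(:monitor_checks, "abc")
|> Enum.each(&send(&1, {:region_result, :us, %{status: "up"}}))
```

Both stand-in processes exit after forwarding one message, at which point pg drops them from the group without any explicit leave call.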

The consensus timeline: sub-second across 3 continents

Here's what happens when a site is down in Asia but up everywhere else:

T=0.0s  EU checks target → "up" → pg broadcast to Asia, US
T=0.2s  Asia checks target → "down" → pg broadcast to EU, US
T=0.5s  US checks target → "up" → pg broadcast to EU, Asia

T=0.5s  EU has: EU=up, Asia=down, US=up → 2/3 up → No alert
                  CDN blip in Asia? Your phone stays silent.

Real outage:

T=0.0s  EU checks → "down" → broadcasts
T=0.3s  Asia checks → "down" → broadcasts
T=0.5s  US checks → "down" → broadcasts

T=0.5s  All 3 nodes: 3/3 down → consensus = DOWN
        Home node (EU) increments consecutive_failures
        After 3 consecutive → ALERT

    Total: 30s interval x 3 confirmations + 0.5s consensus
         = ~91 seconds (worst case) from outage to confirmed alert
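Both timelines reduce to one rule: the monitor is down only when a strict majority of reporting regions say so. The following sketch isolates that rule and the latency arithmetic as pure functions. The module and function names are hypothetical, and results are simplified to plain "up"/"down" strings rather than full result maps.

```elixir
defmodule ConsensusSketch do
  # `results` maps region => "up" | "down".
  # "down" wins only with a strict majority; a tie counts as "up".
  def decide(results) when map_size(results) > 0 do
    down = Enum.count(results, fn {_region, status} -> status == "down" end)
    if down * 2 > map_size(results), do: "down", else: "up"
  end

  # Worst-case seconds from outage to alert: the outage starts just after
  # a check, so every confirmation costs a full interval.
  def worst_case_alert_seconds(interval_s, confirmations, consensus_s) do
    interval_s * confirmations + consensus_s
  end
end

ConsensusSketch.decide(%{eu: "up", asia: "down", us: "up"})
# => "up"   (1/3 down, no alert)

ConsensusSketch.decide(%{eu: "down", asia: "down", us: "down"})
# => "down" (3/3 down)

ConsensusSketch.worst_case_alert_seconds(30, 3, 0.5)
# => 90.5
```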

The home node: who gets to press the button?

All 3 nodes compute the same consensus — but only one should trigger the alert. We use deterministic hash-based assignment:

defp home_node?(monitor_id) do
  nodes = [node() | Node.list()] |> Enum.sort()
  hash = :erlang.phash2(monitor_id, length(nodes))
  Enum.at(nodes, hash) == node()
end

defp maybe_trigger_alert(state) do
  if home_node?(state.monitor_id) do
    do_alert(state)
  else
    state  # other nodes track state but stay silent
  end
end

As long as cluster membership is unchanged, the same monitor ID always maps to the same node. If a node dies, the node list shrinks and the hash redistributes monitors across the survivors automatically.
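The assignment is easier to reason about as a pure function of the monitor ID and the node list. This is a sketch under that assumption: `HomeNodeSketch` and the node names are made up, and the node list is passed in explicitly instead of being read from `Node.list()`.

```elixir
defmodule HomeNodeSketch do
  # Returns the node that owns alerting for `monitor_id`.
  # Sorting first makes the result independent of the order in which
  # each node happens to see its peers.
  def home_node(monitor_id, nodes) when nodes != [] do
    sorted = Enum.sort(nodes)
    Enum.at(sorted, :erlang.phash2(monitor_id, length(sorted)))
  end
end

nodes = [:"uptrack@eu", :"uptrack@asia", :"uptrack@us"]

# Every node that sees the same membership computes the same owner.
owner = HomeNodeSketch.home_node("abc", nodes)
^owner = HomeNodeSketch.home_node("abc", Enum.reverse(nodes))

# After a node drops out, the owner is always one of the survivors.
survivors = nodes -- [:"uptrack@asia"]
HomeNodeSketch.home_node("abc", survivors)
```

One property worth knowing: because the hash range is the node count, a membership change can move a monitor between two surviving nodes, not only away from the node that died. Nodes that temporarily disagree about membership can also disagree about the owner.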

Edge cases

Slow region: Asia's check takes 8 seconds (congested submarine cable). Solution: a 10-second timeout. A region that hasn't reported by then is excluded from consensus, counted neither up nor down. 2/2 is still valid.

Node crash: pg automatically removes dead processes from groups. EU and US continue with 2-node consensus. When the node comes back, its processes rejoin on startup.

Network partition: EU can talk to US but Asia is isolated. Each side operates independently and computes consensus over the nodes it can still see, so EU and US continue with 2-node consensus. The isolated node sees only its own result, so its verdict carries no more weight than a single-region check until it rejoins.
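The "excluded, not counted" rule for slow or missing regions can be sketched as a collection step with a deadline. This is an illustration only: `RoundSketch` is a hypothetical name, results are plain strings, and it uses a blocking receive loop, whereas a GenServer would handle the same messages in `handle_info/2` and use a timer message for the deadline.

```elixir
defmodule RoundSketch do
  # Collects {:region_result, region, status} messages until every expected
  # region has reported or `timeout_ms` has elapsed, whichever comes first.
  # Regions that never report are simply absent from the returned map.
  def collect(expected_regions, timeout_ms) do
    deadline = System.monotonic_time(:millisecond) + timeout_ms
    do_collect(MapSet.new(expected_regions), %{}, deadline)
  end

  defp do_collect(pending, results, deadline) do
    if MapSet.size(pending) == 0 do
      results
    else
      # Never wait past the overall deadline, however many messages arrive.
      remaining = max(deadline - System.monotonic_time(:millisecond), 0)

      receive do
        {:region_result, region, status} ->
          do_collect(MapSet.delete(pending, region), Map.put(results, region, status), deadline)
      after
        remaining -> results
      end
    end
  end
end

# EU and US report; Asia stays silent and is left out of the round.
send(self(), {:region_result, :eu, "up"})
send(self(), {:region_result, :us, "up"})
RoundSketch.collect([:eu, :asia, :us], 50)
# => %{eu: "up", us: "up"}
```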

Why not just use the database?

We tried it. Three problems:

1. Staleness. When EU reads Asia's result, it might get the previous check from 30 seconds ago. Average staleness: ~15 seconds. That defeats multi-region consensus.

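2. Advisory lock issues. A singleton aggregator needs leader election. Session-level PostgreSQL advisory locks don't work reliably behind PgBouncer in transaction pooling mode, because consecutive statements from one client can land on different server connections.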

3. The irony. We moved checks OUT of the database to remove the Oban bottleneck and get 100K+ concurrent checks. Putting consensus BACK in the database reintroduces the exact bottleneck we eliminated.

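Messages between pg group members solve all three: results arrive as soon as each check finishes instead of one polling interval later, there is no leader to elect, and the database stays out of the hot path.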

Benchmark: 100K checks on a $10/mo server

On a single Netcup RS 1000 (4 AMD EPYC cores, 8GB RAM):

Concurrent checks	Checks/sec	P50 latency
10,000	2,602	1.6s
50,000	2,782	8.2s
100,000	2,602	16.9s
110,000	2,444	19.1s

110,000 concurrent HTTP checks, zero failures, on hardware that costs less than a Netflix subscription.

With 4 nodes across 3 continents: ~440K monitors before needing a 5th node.

The core consensus code: ~40 lines

defmodule Uptrack.Monitoring.MonitorProcess do
  use GenServer

  # Excerpt: @region, init/1, evaluate_result/1, record_result/1 and
  # maybe_trigger_alert/1 are defined elsewhere in the module.

  # Our own check finished: share it with the other regions, then record it locally.
  @impl true
  def handle_info({:check_result, result}, state) do
    :pg.get_members(:monitor_checks, state.monitor_id)
    |> Enum.reject(&(&1 == self()))
    |> Enum.each(&send(&1, {:region_result, @region, result}))

    state = %{state |
      region_results: Map.put(state.region_results, @region, result),
      checking: false
    }
    maybe_evaluate_consensus(state)
  end

  # Another region reported its result for this monitor.
  def handle_info({:region_result, region, result}, state) do
    state = %{state |
      region_results: Map.put(state.region_results, region, result)
    }
    maybe_evaluate_consensus(state)
  end

  defp maybe_evaluate_consensus(state) do
    results = state.region_results
    total = map_size(results)
    expected = length(:pg.get_members(:monitor_checks, state.monitor_id))

    # Evaluate once two regions have reported (or one, if this process
    # is the only visible group member).
    if total >= min(expected, 2) do
      down_count = Enum.count(results, fn {_, r} -> r.status == "down" end)
      consensus = if down_count > total / 2, do: "down", else: "up"

      # region_results is reset here, so results that arrive later
      # are counted toward the next round.
      state =
        %{state | last_check: %{status: consensus}, region_results: %{}}
        |> evaluate_result()
        |> record_result()
        |> maybe_trigger_alert()

      {:noreply, state}
    else
      {:noreply, state}
    end
  end
end

No Kafka, no Redis, no external coordinator. Just processes sending messages to each other — the thing the BEAM was literally built to do.

What we learned

1. The BEAM's distribution primitives are underused. Most Elixir apps treat nodes as independent units behind a load balancer. But pg, :global, and Erlang distribution enable architectures that would require Kafka + Redis + ZooKeeper in other ecosystems.
2. Don't put coordination in the database. The staleness problem alone killed the DB approach. pg messages are real-time with zero staleness.
3. The Discord/WhatsApp pattern scales to monitoring. One process per entity with self-scheduling and in-memory state works for chat guilds, phone calls, and uptime monitors.
4. $23/mo can compete with $54/mo. UptimeRobot charges $54/mo for multi-region checks. We do it on three $8-10 VPS nodes with better consensus logic.

Uptrack offers 50 free monitors — 10 at 30-second checks, 40 at 1-minute — with multi-region consensus on every plan. uptrack.app
