<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Le Gia Hoang</title>
    <description>The latest articles on DEV Community by Le Gia Hoang (@legiahoang).</description>
    <link>https://dev.to/legiahoang</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F118291%2F82acfc51-42c3-41f4-bd3e-08bc9bd06377.jpeg</url>
      <title>DEV Community: Le Gia Hoang</title>
      <link>https://dev.to/legiahoang</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/legiahoang"/>
    <language>en</language>
    <item>
      <title>How we built multi-region uptime consensus on the BEAM — zero external dependencies</title>
      <dc:creator>Le Gia Hoang</dc:creator>
      <pubDate>Fri, 10 Apr 2026 16:22:30 +0000</pubDate>
      <link>https://dev.to/legiahoang/how-we-built-multi-region-uptime-consensus-on-the-beam-zero-external-dependencies-opi</link>
      <guid>https://dev.to/legiahoang/how-we-built-multi-region-uptime-consensus-on-the-beam-zero-external-dependencies-opi</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://uptrack.app/blog/multi-region-consensus" rel="noopener noreferrer"&gt;uptrack.app/blog/multi-region-consensus&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;The 3am problem&lt;/h2&gt;

&lt;p&gt;UptimeRobot checks your site from one location. A CDN edge goes down in Frankfurt. Your server in Virginia is fine. Your users in Tokyo see no issues. But the single check from Frankfurt fails, and your phone buzzes at 3am.&lt;/p&gt;

&lt;p&gt;This is the false alert problem. Single-region monitoring can't distinguish between "the internet is broken between point A and point B" and "your server is actually down."&lt;/p&gt;

&lt;p&gt;The fix sounds simple: check from multiple regions and alert only if most agree. The implementation is harder than it sounds. You need to coordinate checks across continents, collect results in real time, compute consensus, and avoid duplicate alerts, all without adding latency or single points of failure.&lt;/p&gt;

&lt;p&gt;Here's how we did it with zero external dependencies using Erlang/OTP primitives.&lt;/p&gt;

&lt;h2&gt;The architecture: one process per monitor&lt;/h2&gt;

&lt;p&gt;Uptrack follows the Discord/WhatsApp pattern: one BEAM process per long-lived entity. Each monitor gets its own GenServer that self-schedules checks via &lt;code&gt;Process.send_after&lt;/code&gt;.&lt;/p&gt;
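
&lt;p&gt;As a rough illustration of the self-scheduling idea, here is a minimal sketch of a GenServer that re-arms its own timer after every check. The module name, the &lt;code&gt;checks_run&lt;/code&gt; counter, and the fixed 30-second interval are assumptions made for this example, not Uptrack's actual code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight elixir"&gt;&lt;code&gt;defmodule SelfSchedulingSketch do
  use GenServer

  # Hypothetical interval, chosen to match the 30-second checks in this post.
  @interval_ms 30_000

  def start_link(monitor_id), do: GenServer.start_link(__MODULE__, monitor_id)

  @impl true
  def init(monitor_id) do
    # Trigger the first check immediately; later checks are scheduled below.
    send(self(), :check)
    {:ok, %{monitor_id: monitor_id, checks_run: 0}}
  end

  @impl true
  def handle_info(:check, state) do
    # A real process would perform the HTTP check here.
    # Re-arm the timer so the process keeps scheduling itself.
    Process.send_after(self(), :check, @interval_ms)
    {:noreply, %{state | checks_run: state.checks_run + 1}}
  end
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because the timer lives inside the process, there is no central scheduler to coordinate: stopping the process stops its checks.&lt;/p&gt;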

&lt;p&gt;For multi-region, the same monitor runs on every node. Three continents, three processes, one monitor:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────┐
│              Erlang Cluster (Tailscale mesh)             │
│                                                         │
│  EU (Germany)      Asia (India)      US (Virginia)      │
│  ┌────────────┐   ┌────────────┐   ┌────────────┐      │
│  │MonitorProc │   │MonitorProc │   │MonitorProc │       │
│  │ id: "abc"  │   │ id: "abc"  │   │ id: "abc"  │       │
│  │ Gun → target│  │ Gun → target│  │ Gun → target│      │
│  │ pg group   │◄─►│ pg group   │◄─►│ pg group   │       │
│  └────────────┘   └────────────┘   └────────────┘       │
└─────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each process holds a persistent Gun HTTP connection to the target. The TLS handshake happens once at startup, not on every check, so each check in a 30-second cycle spends only ~50ms on actual HTTP work.&lt;/p&gt;

&lt;h2&gt;How pg groups connect the continents&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;pg&lt;/code&gt; is an OTP module that's been in Erlang since OTP 23. It manages process groups across distributed nodes — tested at 5,000 nodes and 150,000 processes by the OTP team.&lt;/p&gt;

&lt;p&gt;When a MonitorProcess starts, it joins a pg group named after its monitor ID:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight elixir"&gt;&lt;code&gt;&lt;span class="c1"&gt;# On init, each MonitorProcess joins the group&lt;/span&gt;
&lt;span class="ss"&gt;:pg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:monitor_checks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;monitor_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="c1"&gt;# After checking, broadcast result to all group members&lt;/span&gt;
&lt;span class="ss"&gt;:pg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_members&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:monitor_checks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;monitor_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;|&amp;gt;&lt;/span&gt; &lt;span class="no"&gt;Enum&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;each&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;&amp;amp;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="ss"&gt;:region_result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;@region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;}))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No message broker, no Redis pub/sub, no database polling. When a node dies, pg automatically removes its processes from all groups. When a 4th region is added, its process joins the pg group and every existing member sees it.&lt;/p&gt;
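
&lt;p&gt;One detail the snippet above leaves implicit: &lt;code&gt;:monitor_checks&lt;/code&gt; is a pg &lt;em&gt;scope&lt;/em&gt;, and a scope has to be running on each node before any process can join it. A minimal sketch, assuming the scope is started by hand rather than from a supervision tree, and using the illustrative group name &lt;code&gt;"abc"&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight elixir"&gt;&lt;code&gt;# Start the scope once per node. In an application this would normally be a
# supervised child, e.g. %{id: :pg, start: {:pg, :start_link, [:monitor_checks]}}.
{:ok, _pid} = :pg.start_link(:monitor_checks)

# Join a group inside that scope and read the membership back.
:ok = :pg.join(:monitor_checks, "abc", self())
members = :pg.get_members(:monitor_checks, "abc")

# get_members/2 returns members on every connected node, including the caller,
# so a broadcast over this list also delivers the message to the sender.
true = self() in members
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;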

&lt;h2&gt;The consensus timeline: sub-second across 3 continents&lt;/h2&gt;

&lt;p&gt;Here's what happens when a site is down in Asia but up everywhere else:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;T=0.0s  EU checks target → "up" → pg broadcast to Asia, US
T=0.2s  Asia checks target → "down" → pg broadcast to EU, US
T=0.5s  US checks target → "up" → pg broadcast to EU, Asia

T=0.5s  EU has: EU=up, Asia=down, US=up → 2/3 up → No alert
                  CDN blip in Asia? Your phone stays silent.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Real outage:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;T=0.0s  EU checks → "down" → broadcasts
T=0.3s  Asia checks → "down" → broadcasts
T=0.5s  US checks → "down" → broadcasts

T=0.5s  All 3 nodes: 3/3 down → consensus = DOWN
        Home node (EU) increments consecutive_failures
        After 3 consecutive → ALERT

    Worst case: 30s interval x 3 confirmations + 0.5s consensus
         = ~91 seconds from outage to confirmed alert
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;The home node: who gets to press the button?&lt;/h2&gt;

&lt;p&gt;All 3 nodes compute the same consensus, but only one should trigger the alert. We use deterministic hash-based assignment:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight elixir"&gt;&lt;code&gt;&lt;span class="k"&gt;defp&lt;/span&gt; &lt;span class="n"&gt;home_node?&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;monitor_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="n"&gt;nodes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="no"&gt;Node&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;list&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt; &lt;span class="o"&gt;|&amp;gt;&lt;/span&gt; &lt;span class="no"&gt;Enum&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="n"&gt;hash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="ss"&gt;:erlang&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;phash2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;monitor_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
  &lt;span class="no"&gt;Enum&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;at&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="k"&gt;defp&lt;/span&gt; &lt;span class="n"&gt;maybe_trigger_alert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;home_node?&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;monitor_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="n"&gt;do_alert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;else&lt;/span&gt;
    &lt;span class="n"&gt;state&lt;/span&gt;  &lt;span class="c1"&gt;# other nodes track state but stay silent&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



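&lt;p&gt;As long as the node list is stable, the same monitor ID always maps to the same node. If a node dies, it drops out of &lt;code&gt;Node.list()&lt;/code&gt;, so the next call hashes over the survivors and one of them takes over automatically.&lt;/p&gt;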
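
&lt;p&gt;To make the assignment easier to reason about, here is a sketch of the same hashing logic as a pure function that takes the node list as an argument. The module name and the node names are hypothetical, chosen only for illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight elixir"&gt;&lt;code&gt;defmodule HomeNodeSketch do
  # Pure variant of home_node?/1: returns the node that owns the monitor.
  def home_node(monitor_id, nodes) do
    sorted = Enum.sort(nodes)
    Enum.at(sorted, :erlang.phash2(monitor_id, length(sorted)))
  end
end

# Hypothetical node names.
nodes = [:"uptrack@eu", :"uptrack@asia", :"uptrack@us"]

# Sorting first means every node picks the same owner,
# regardless of the order in which it sees its peers.
owner = HomeNodeSketch.home_node("abc", nodes)
^owner = HomeNodeSketch.home_node("abc", Enum.reverse(nodes))

# After a node leaves, the owner is always one of the survivors.
survivors = nodes -- [:"uptrack@asia"]
true = HomeNodeSketch.home_node("abc", survivors) in survivors
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;One property worth knowing about this scheme: because the hash range is the number of nodes, a change in cluster size can also move monitors whose original home node is still alive. That only matters if the home node carries state that other nodes don't.&lt;/p&gt;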

&lt;h2&gt;Edge cases&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Slow region.&lt;/strong&gt; Asia's check takes 8 seconds (congested submarine cable). Solution: a 10-second timeout. A region that doesn't respond within the timeout is excluded from consensus, counted neither up nor down. 2/2 is still valid.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Node crash.&lt;/strong&gt; pg automatically removes dead processes from groups. EU and US continue with 2-node consensus. When the node comes back, its processes rejoin on startup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Network partition.&lt;/strong&gt; EU can talk to US but Asia is isolated. Each side operates independently. No split-brain alerts because consensus requires a majority of &lt;em&gt;visible&lt;/em&gt; nodes.&lt;/p&gt;
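
&lt;p&gt;The majority rule from the timelines, together with the exclusion rule for regions that time out, can be sketched as a small pure function. This is an illustration under assumptions, not the production code: the module name is made up, and results are simplified to plain &lt;code&gt;"up"&lt;/code&gt; / &lt;code&gt;"down"&lt;/code&gt; strings keyed by region. A region that timed out is simply absent from the map:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight elixir"&gt;&lt;code&gt;defmodule ConsensusSketch do
  # results maps each responding region to "up" or "down".
  # "down" wins only with a strict majority of the regions that responded.
  def consensus(results) when map_size(results) &amp;gt; 0 do
    total = map_size(results)
    down = Enum.count(results, fn {_region, status} -&amp;gt; status == "down" end)
    if down * 2 &amp;gt; total, do: "down", else: "up"
  end
end

# Regional blip: 1 of 3 down, no alert.
"up" = ConsensusSketch.consensus(%{eu: "up", asia: "down", us: "up"})

# Real outage: 3 of 3 down.
"down" = ConsensusSketch.consensus(%{eu: "down", asia: "down", us: "down"})

# Asia timed out and is excluded: 2 of 2 down is still a valid majority.
"down" = ConsensusSketch.consensus(%{eu: "down", us: "down"})

# A tie is not a majority, so it resolves to "up".
"up" = ConsensusSketch.consensus(%{eu: "down", us: "up"})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;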

&lt;h2&gt;Why not just use the database?&lt;/h2&gt;

&lt;p&gt;We tried it. Three problems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Staleness.&lt;/strong&gt; When EU reads Asia's result, it might get the &lt;em&gt;previous&lt;/em&gt; check from 30 seconds ago. Average staleness: ~15 seconds. That defeats multi-region consensus.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Advisory lock issues.&lt;/strong&gt; A singleton aggregator needs leader election. PostgreSQL advisory locks don't work reliably with PgBouncer in transaction pooling mode.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The irony.&lt;/strong&gt; We moved checks OUT of the database to remove the Oban bottleneck and get 100K+ concurrent checks. Putting consensus BACK in the database reintroduces the exact bottleneck we eliminated.&lt;/p&gt;

&lt;p&gt;pg messages solve all three: zero staleness, no leader election, no database in the hot path.&lt;/p&gt;

&lt;h2&gt;Benchmark: 100K checks on a $10/mo server&lt;/h2&gt;

&lt;p&gt;On a single Netcup RS 1000 (4 AMD EPYC cores, 8GB RAM):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concurrent&lt;/th&gt;
&lt;th&gt;Checks/sec&lt;/th&gt;
&lt;th&gt;P50&lt;/th&gt;
&lt;th&gt;Failures&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;10,000&lt;/td&gt;
&lt;td&gt;2,602&lt;/td&gt;
&lt;td&gt;1.6s&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50,000&lt;/td&gt;
&lt;td&gt;2,782&lt;/td&gt;
&lt;td&gt;8.2s&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100,000&lt;/td&gt;
&lt;td&gt;2,602&lt;/td&gt;
&lt;td&gt;16.9s&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;110,000&lt;/td&gt;
&lt;td&gt;2,444&lt;/td&gt;
&lt;td&gt;19.1s&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;110,000 concurrent HTTP checks, zero failures, on hardware that costs less than a Netflix subscription.&lt;/p&gt;

&lt;p&gt;With 4 nodes across 3 continents: ~440K monitors before needing a 5th node.&lt;/p&gt;

&lt;h2&gt;The core consensus code: ~40 lines&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight elixir"&gt;&lt;code&gt;&lt;span class="k"&gt;defmodule&lt;/span&gt; &lt;span class="no"&gt;Uptrack&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="no"&gt;Monitoring&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="no"&gt;MonitorProcess&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="kn"&gt;use&lt;/span&gt; &lt;span class="no"&gt;GenServer&lt;/span&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="n"&gt;handle_info&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="ss"&gt;:check_result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="ss"&gt;:pg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_members&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:monitor_checks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;monitor_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;|&amp;gt;&lt;/span&gt; &lt;span class="no"&gt;Enum&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;each&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;&amp;amp;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="ss"&gt;:region_result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;@region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;}))&lt;/span&gt;

    &lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;%{&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;
      &lt;span class="ss"&gt;region_results:&lt;/span&gt; &lt;span class="no"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;@region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="ss"&gt;checking:&lt;/span&gt; &lt;span class="no"&gt;false&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;maybe_evaluate_consensus&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="n"&gt;handle_info&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="ss"&gt;:region_result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;%{&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;
      &lt;span class="ss"&gt;region_results:&lt;/span&gt; &lt;span class="no"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;maybe_evaluate_consensus&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="k"&gt;defp&lt;/span&gt; &lt;span class="n"&gt;maybe_evaluate_consensus&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region_results&lt;/span&gt;
    &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;map_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;expected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:pg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_members&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:monitor_checks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;monitor_id&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
      &lt;span class="n"&gt;down_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Enum&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;"down"&lt;/span&gt; &lt;span class="k"&gt;end&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="n"&gt;consensus&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;down_count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"down"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"up"&lt;/span&gt;

      &lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;%{&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="ss"&gt;last_check:&lt;/span&gt; &lt;span class="p"&gt;%{&lt;/span&gt;&lt;span class="ss"&gt;status:&lt;/span&gt; &lt;span class="n"&gt;consensus&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="ss"&gt;region_results:&lt;/span&gt; &lt;span class="p"&gt;%{}}&lt;/span&gt;
      &lt;span class="o"&gt;|&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;evaluate_result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
      &lt;span class="o"&gt;|&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;record_result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
      &lt;span class="o"&gt;|&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;maybe_trigger_alert&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

      &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="ss"&gt;:noreply&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="ss"&gt;:noreply&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No Kafka, no Redis, no external coordinator. Just processes sending messages to each other — the thing the BEAM was literally built to do.&lt;/p&gt;

&lt;h2&gt;What we learned&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The BEAM's distribution primitives are underused.&lt;/strong&gt; Most Elixir apps treat nodes as independent units behind a load balancer. But &lt;code&gt;pg&lt;/code&gt;, &lt;code&gt;:global&lt;/code&gt;, and Erlang distribution enable architectures that would require Kafka + Redis + ZooKeeper in other ecosystems.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Don't put coordination in the database.&lt;/strong&gt; The staleness problem alone killed the DB approach. pg messages are real-time with zero staleness.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Discord/WhatsApp pattern scales to monitoring.&lt;/strong&gt; One process per entity with self-scheduling and in-memory state works for chat guilds, phone calls, and uptime monitors.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;$23/mo can compete with $54/mo.&lt;/strong&gt; UptimeRobot charges $54/mo for multi-region checks. We do it on three $8-10 VPS nodes with better consensus logic.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;Uptrack offers 50 free monitors — 10 at 30-second checks, 40 at 1-minute — with multi-region consensus on every plan. &lt;a href="https://uptrack.app" rel="noopener noreferrer"&gt;uptrack.app&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>elixir</category>
      <category>erlang</category>
      <category>distributed</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
