DEV Community

Vigilmon


Monitoring Your Phoenix LiveView App with Vigilmon


Elixir is famous for "let it crash." Supervisors restart crashed BEAM processes, supervision trees keep the system healthy, and your app recovers from most failures without human intervention.

But some failures don't self-heal: the node is unreachable from the internet, the Postgres connection pool is exhausted, a LiveView deploy left a port misbound. These require external eyes — a monitoring service that hits your app from outside and tells you when it can't get through.

This tutorial adds production observability to a Phoenix app:

  • A health check plug (no extra dependencies, framework-idiomatic)
  • HTTP uptime monitoring with Vigilmon
  • Multi-region checks that benefit distributed Elixir deployments
  • Heartbeat monitoring for Oban jobs and GenServer-based workers
  • Slack alerts and a status page

Step 1: Add a health check plug

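Phoenix doesn't need a library for a basic health check. A plug is idiomatic, fast, and needs nothing beyond what a typical Phoenix + Ecto app already has (Jason, Ecto, and OTP's :os_mon application for the memory check).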

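# lib/my_app_web/plugs/health_check.ex
defmodule MyAppWeb.Plugs.HealthCheck do
  @moduledoc """
  Answers /health with a JSON summary of dependency checks: 200 when all
  pass, 503 otherwise. The memory check uses :memsup, which is part of OTP's
  :os_mon application (add :os_mon to extra_applications in mix.exs).
  """
  import Plug.Conn

  def init(opts), do: opts

  def call(%Plug.Conn{request_path: "/health"} = conn, _opts) do
    checks = run_checks()
    status = if Enum.all?(checks, fn {_, v} -> v == :ok end), do: 200, else: 503

    conn
    |> put_resp_content_type("application/json")
    |> send_resp(status, Jason.encode!(%{status: status_label(status), checks: checks}))
    |> halt()
  end

  # Every other request passes through untouched.
  def call(conn, _opts), do: conn

  defp run_checks do
    %{
      database: check_database(),
      memory: check_memory()
    }
  end

  defp check_database do
    case Ecto.Adapters.SQL.query(MyApp.Repo, "SELECT 1", []) do
      {:ok, _} -> :ok
      {:error, _} -> :error
    end
  catch
    # e.g. the Repo is not running: report it instead of crashing the request
    _kind, _reason -> :error
  end

  defp check_memory do
    case :memsup.get_system_memory_data() do
      [] ->
        :ok

      data ->
        total = data[:system_total_memory] || Keyword.get(data, :total_memory, 1)
        # :available_memory counts reclaimable page cache; fall back to :free_memory
        available = data[:available_memory] || Keyword.get(data, :free_memory, total)
        used_pct = (total - available) / total * 100
        # Report :error once 90% or more of memory is in use
        if used_pct < 90, do: :ok, else: :error
    end
  end

  defp status_label(200), do: "ok"
  defp status_label(_), do: "degraded"
end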
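The memory rule is easy to get subtly wrong: on Linux the kernel uses idle RAM as page cache, so :free_memory alone can look close to zero on a perfectly healthy machine. Below is a minimal sketch of the threshold logic as a hypothetical MemoryCheck module (not part of the post or of OTP). It takes the keyword list that :memsup.get_system_memory_data/0 returns, so you can unit-test it without :os_mon running. It prefers :available_memory and falls back to free + buffered + cached.

```elixir
defmodule MemoryCheck do
  @moduledoc """
  Hypothetical helper: turns the keyword list returned by
  :memsup.get_system_memory_data/0 into :ok | :error.
  """

  @doc "Returns :error when used memory is at or above `threshold_pct`."
  def evaluate(data, threshold_pct \\ 90)

  # No data (os_mon not running, unsupported platform): don't fail the check.
  def evaluate([], _threshold_pct), do: :ok

  def evaluate(data, threshold_pct) do
    total = Keyword.get(data, :system_total_memory) || Keyword.get(data, :total_memory)
    available = available_bytes(data)

    if total && total > 0 && available do
      used_pct = (total - available) / total * 100
      if used_pct < threshold_pct, do: :ok, else: :error
    else
      # Not enough information to judge; treat as healthy.
      :ok
    end
  end

  # Prefer :available_memory (includes reclaimable cache); otherwise
  # approximate it as free + buffered + cached.
  defp available_bytes(data) do
    case Keyword.get(data, :available_memory) do
      nil ->
        case Keyword.get(data, :free_memory) do
          nil ->
            nil

          free ->
            free + Keyword.get(data, :buffered_memory, 0) +
              Keyword.get(data, :cached_memory, 0)
        end

      available ->
        available
    end
  end
end
```

With 1,000 bytes total, 50 free, 100 buffered, and 300 cached, the fallback treats 450 bytes as available, so usage is 55% and the check passes even though "free" memory is only 5%.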

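Add it to your endpoint before the router so health checks bypass router pipelines such as authentication. Placing it at the top, ahead of Plug.Telemetry and Plug.Parsers, also keeps frequent monitor hits out of your request logs: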

# lib/my_app_web/endpoint.ex
defmodule MyAppWeb.Endpoint do
  use Phoenix.Endpoint, otp_app: :my_app

  plug MyAppWeb.Plugs.HealthCheck  # ← add before the router

  # ... rest of your plugs
  plug MyAppWeb.Router
end

Test it:

mix phx.server
curl http://localhost:4000/health
# {"status":"ok","checks":{"database":"ok","memory":"ok"}}

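When a check fails, the endpoint returns 503 and the body names the failing check, for example {"status":"degraded","checks":{"database":"error","memory":"ok"}}. That precision matters when you're triaging at 2 AM.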
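With an exhausted Postgres pool, SELECT 1 can wait for a connection checkout before failing, which makes the health check itself slow. One option is to run each check in its own task with a deadline, so a hung dependency shows up as :error instead of stalling /health. Here it is sketched as a hypothetical HealthReport module using only the standard library. The 2-second default timeout is an arbitrary assumption; keep it below your monitor's request timeout.

```elixir
defmodule HealthReport do
  @moduledoc """
  Hypothetical helper (not part of Phoenix or Plug): runs named health checks
  concurrently, each with a deadline, and summarizes the results.
  """

  @doc """
  Runs each `{name, fun}` pair. A check passes only if `fun` returns :ok;
  any other return value, an exception, or a timeout counts as :error.
  """
  def run(checks, timeout_ms \\ 2_000) do
    checks
    |> Task.async_stream(fn {name, fun} -> {name, safe_call(fun)} end,
      timeout: timeout_ms,
      on_timeout: :kill_task
    )
    # async_stream keeps input order, so each result lines up with its check
    |> Enum.zip(checks)
    |> Map.new(fn
      {{:ok, {name, result}}, _check} -> {name, result}
      {{:exit, _reason}, {name, _fun}} -> {name, :error}
    end)
  end

  @doc "Maps check results to an HTTP status code and a JSON-ready body."
  def summarize(results) do
    healthy? = Enum.all?(results, fn {_name, result} -> result == :ok end)

    if healthy? do
      {200, %{status: "ok", checks: results}}
    else
      {503, %{status: "degraded", checks: results}}
    end
  end

  # Never let a check crash the caller: anything unexpected is :error.
  defp safe_call(fun) do
    case fun.() do
      :ok -> :ok
      _other -> :error
    end
  catch
    _kind, _reason -> :error
  end
end
```

Inside the plug, run_checks/0 could become HealthReport.run(database: &check_database/0, memory: &check_memory/0), with summarize/1 supplying the status code and body.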

Optional: use the plug_checkup library

If you'd rather use a library, plug_checkup runs check functions you define (for Ecto, Redis, or any HTTP dependency) and renders the JSON report for you:

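# mix.exs
{:plug_checkup, "~> 0.6"}

# lib/my_app/health_checks.ex: each check returns :ok or {:error, reason}
defmodule MyApp.HealthChecks do
  def check_db do
    case Ecto.Adapters.SQL.query(MyApp.Repo, "SELECT 1", []) do
      {:ok, _} -> :ok
      {:error, reason} -> {:error, inspect(reason)}
    end
  end
end

# lib/my_app_web/router.ex, outside any aliased scope such as `scope "/", MyAppWeb`
forward "/health", PlugCheckup,
  PlugCheckup.Options.new(
    json_encoder: Jason,
    checks: [%PlugCheckup.Check{name: "db", module: MyApp.HealthChecks, function: :check_db}]
  )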

Either approach gives you a URL that returns 200 when healthy and a 5xx status with details when not (503 for the custom plug).


Step 2: Set up HTTP monitoring in Vigilmon

Point Vigilmon at your health endpoint:

  1. Sign up at vigilmon.online
  2. Click New Monitor → HTTP
  3. Enter https://yourdomain.com/health
  4. Set check interval: 1 minute (paid) or 5 minutes (free)
  5. Save

Vigilmon pings from multiple geographic regions. This is particularly valuable for Phoenix apps:

Why multi-region checks matter for Elixir:

Phoenix and LiveView apps often run as distributed clusters (libcluster, Fly.io regions, Render multi-region). Multi-region monitoring catches partial outages where your app is reachable from one region but not another, which single-probe monitoring would miss.

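If you're deploying to Fly.io with multiple regions, add a monitor per regional endpoint. Fly's shared app-name.fly.dev hostname is anycast-routed to the nearest healthy region, so treat the region-prefixed hostnames below as placeholders for whatever per-region hostnames your setup exposes: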

# Add monitors for each regional endpoint
https://app-name.fly.dev/health           # primary
https://lhr.app-name.fly.dev/health       # London
https://ord.app-name.fly.dev/health       # Chicago

Each regional failure alerts independently, so you know whether a failure is local or global.
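To make "local or global" concrete, here's a sketch of that reasoning as a hypothetical RegionalOutage module. It isn't a Vigilmon API, and the region names are placeholders. Given the latest result per probe region, it reports whether everything is up, some regions are down, or all of them are.

```elixir
defmodule RegionalOutage do
  @moduledoc """
  Hypothetical helper: classifies per-region probe results to tell a
  local failure from a global one.
  """

  @doc """
  Takes a non-empty map of region => :up | :down and returns
  :healthy, {:regional, down_regions}, or :global.
  """
  def classify(results) when map_size(results) > 0 do
    down =
      results
      |> Enum.filter(fn {_region, state} -> state == :down end)
      |> Enum.map(fn {region, _state} -> region end)
      |> Enum.sort()

    cond do
      down == [] -> :healthy
      length(down) == map_size(results) -> :global
      true -> {:regional, down}
    end
  end
end
```

A {:regional, ["lhr"]} result points at something region-specific, such as a single machine or a regional network path. :global usually means the app itself, its database, or DNS.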


Step 3: Alerts via Slack

In Vigilmon, go to Notifications → New Channel → Slack and paste your Slack incoming webhook URL.

To create a webhook in Slack:

  1. Go to api.slack.com/apps → Create New App → From scratch
  2. Enable Incoming Webhooks → Add New Webhook
  3. Pick your alerts channel and copy the URL

Enable the Slack channel on your monitor. When the health check fails or Phoenix is unreachable, Vigilmon sends a message like:

🔴 DOWN: yourdomain.com/health
Status: 503 Service Unavailable
Detected from: EU-West, US-East
5 minutes ago

And when it recovers:

✅ RECOVERED: yourdomain.com/health
Downtime: 12 minutes

The recovery notification is often the most important one — it tells you when it's safe to stop firefighting.


Step 4: Heartbeat monitoring for Oban jobs and GenServers

LiveView handles its own process restarts. But scheduled Oban jobs and long-running GenServers can fail silently: the process stays up, the supervisor is happy, but work has stopped happening.

Heartbeat pattern: your job or GenServer pings a unique URL at the end of every successful execution cycle. If Vigilmon doesn't receive a ping within the expected window, it alerts you.
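The alerting rule behind a heartbeat monitor is a dead man's switch: the monitor stays up as long as each ping arrives within the expected interval plus some slack. Here is a minimal model of that rule as a hypothetical HeartbeatWindow module. It is not Vigilmon's code, and the grace period is an assumption about how you might absorb scheduling jitter.

```elixir
defmodule HeartbeatWindow do
  @moduledoc """
  Hypothetical model of a heartbeat monitor: :up while the last ping is
  no older than interval + grace seconds, :late afterwards.
  """

  @doc "Returns :up, :late, or :never_pinged as of `now`."
  def status(nil, _interval_s, _grace_s, _now), do: :never_pinged

  def status(%DateTime{} = last_ping, interval_s, grace_s, %DateTime{} = now) do
    deadline = DateTime.add(last_ping, interval_s + grace_s, :second)

    # Strictly after the deadline means the expected ping never arrived.
    if DateTime.compare(now, deadline) == :gt, do: :late, else: :up
  end
end
```

With a 5-minute interval and a 60-second grace period, a ping at 00:00:00 keeps the monitor up through 00:06:00; any check after that reports :late until the next ping arrives.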

Oban job heartbeat

# lib/my_app/workers/daily_digest_worker.ex
defmodule MyApp.Workers.DailyDigestWorker do
  use Oban.Worker, queue: :default

  require Logger

  @impl Oban.Worker
  def perform(%Oban.Job{}) do
    with :ok <- generate_digest(),
         :ok <- send_digest() do
      ping_heartbeat()
      :ok
    else
      error ->
        Logger.error("DailyDigestWorker failed: #{inspect(error)}")
        {:error, error}
    end
  end

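  # Uses the Req HTTP client ({:req, "~> 0.5"} in mix.exs). A failed ping is
  # logged but never fails the job: the work itself already succeeded.
  defp ping_heartbeat do
    url = Application.get_env(:my_app, :vigilmon)[:digest_heartbeat_url]

    if url do
      case Req.get(url, receive_timeout: 5_000) do
        {:ok, %Req.Response{status: status}} when status in 200..299 -> :ok
        other -> Logger.warning("Heartbeat ping failed: #{inspect(other)}")
      end
    end
  end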

  defp generate_digest, do: :ok   # your logic
  defp send_digest, do: :ok        # your logic
end

Add the config:

# config/runtime.exs
config :my_app, :vigilmon,
  digest_heartbeat_url: System.get_env("VIGILMON_DIGEST_HEARTBEAT_URL"),
  # read by MyApp.SyncServer
  sync_heartbeat_url: System.get_env("VIGILMON_SYNC_HEARTBEAT_URL")
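A 24-hour heartbeat window assumes the digest job actually runs once a day, which the worker module alone doesn't guarantee. If you schedule it with Oban's built-in Cron plugin, the config might look like the sketch below. The 07:00 UTC schedule, the queue size, and the use of config/config.exs are assumptions, so merge this into your existing Oban config.

```elixir
# config/config.exs
# Sketch: the Cron plugin enqueues DailyDigestWorker every day at 07:00
# (Oban's cron timezone defaults to Etc/UTC).
config :my_app, Oban,
  repo: MyApp.Repo,
  queues: [default: 10],
  plugins: [
    {Oban.Plugins.Cron,
     crontab: [
       {"0 7 * * *", MyApp.Workers.DailyDigestWorker}
     ]}
  ]
```

Because a failed run is retried and a slow run finishes later than 07:00, a heartbeat window slightly longer than 24 hours avoids false alarms.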

GenServer heartbeat

For long-running GenServers (polling external APIs, syncing data), add a heartbeat on each successful tick:

# lib/my_app/sync_server.ex
defmodule MyApp.SyncServer do
  use GenServer

  require Logger

  @interval :timer.minutes(5)

  def start_link(_), do: GenServer.start_link(__MODULE__, %{}, name: __MODULE__)

  @impl true
  def init(state) do
    schedule_tick()
    {:ok, state}
  end

  @impl true
  def handle_info(:tick, state) do
    case sync_data() do
      :ok ->
        ping_heartbeat()
      {:error, reason} ->
        Logger.error("Sync failed: #{inspect(reason)}")
        # No ping → Vigilmon alerts after the window expires
    end

    schedule_tick()
    {:noreply, state}
  end

  defp schedule_tick, do: Process.send_after(self(), :tick, @interval)

  defp ping_heartbeat do
    url = Application.get_env(:my_app, :vigilmon)[:sync_heartbeat_url]
    if url, do: Req.get(url, receive_timeout: 5_000)
  end

  defp sync_data, do: :ok  # your logic
end
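Because the heartbeat only fires on success, that rule is worth testing directly. The sketch below uses a hypothetical TickingWorker (the name and its :work, :ping, and :interval options are illustrations, not part of the SyncServer above). It keeps the same tick loop but takes the side effects as functions, so a test can observe pings without network access.

```elixir
defmodule TickingWorker do
  # Hypothetical variant of SyncServer with injected side effects:
  #   :work     - zero-arity fun returning :ok or {:error, reason}
  #   :ping     - zero-arity fun called only after a successful :work
  #   :interval - milliseconds between ticks
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    state = %{
      work: Keyword.fetch!(opts, :work),
      ping: Keyword.fetch!(opts, :ping),
      interval: Keyword.get(opts, :interval, :timer.minutes(5))
    }

    schedule_tick(state.interval)
    {:ok, state}
  end

  @impl true
  def handle_info(:tick, state) do
    # Same rule as SyncServer: no successful work, no heartbeat.
    if state.work.() == :ok, do: state.ping.()

    schedule_tick(state.interval)
    {:noreply, state}
  end

  defp schedule_tick(ms), do: Process.send_after(self(), :tick, ms)
end
```

In production you would pass the real sync function and a Req-based ping; in tests, a ping that sends a message to the test process is enough.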

In Vigilmon, create a Heartbeat Monitor for each critical worker:

  1. Click New Monitor → Heartbeat
  2. Set expected interval (e.g. 5 minutes for the sync worker, 24 hours for the digest)
  3. Copy the ping URL
  4. Set it as an env variable in your release config

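Now if a worker stops completing its work (a crash its supervisor gives up restarting, an Oban job that exhausts its retries, or a loop that keeps running but keeps failing), you get an alert rather than a silent gap in your data.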


Step 5: LiveView deployment health

LiveView keeps a persistent WebSocket connection open to the server (falling back to long polling when WebSockets are unavailable). After a deploy, existing clients reconnect to the new node, and if something goes wrong during that reconnect window, users see a broken interface.

Add a monitor specifically for your LiveView websocket endpoint:

https://yourdomain.com/live/websocket

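Vigilmon's HTTP monitor will verify that the endpoint answers. A plain GET without WebSocket upgrade headers typically gets a 4xx status (such as 400) from Phoenix rather than a 200, so set the monitor's expected status to match if it allows that; a 502 or 504 from your proxy, a TLS error, or a refused connection is the real failure signal. This catches port binding failures, SSL termination issues, and misconfigured nginx/Caddy upstreams after a deploy.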

You can also use a keyword check: in Vigilmon's HTTP monitor, add a body keyword match for content that should appear on your homepage (like your app name or a page title). If the response is a 200 but the wrong page, the keyword check fails.
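The keyword rule reduces to a two-part predicate. The sketch below models it as a hypothetical KeywordCheck module; it illustrates the idea and is not Vigilmon's implementation. The probe passes only if the status is 200 and the body contains the expected text, so a default nginx welcome page served with a 200 after a botched deploy still fails.

```elixir
defmodule KeywordCheck do
  @moduledoc """
  Hypothetical model of an HTTP keyword check: pass only when the status
  is 200 AND the body contains the expected keyword.
  """

  def evaluate(status, body, keyword)
      when is_integer(status) and is_binary(body) and is_binary(keyword) do
    cond do
      status != 200 -> {:error, {:unexpected_status, status}}
      not String.contains?(body, keyword) -> {:error, :keyword_missing}
      true -> :ok
    end
  end
end
```

Pick a keyword that only your app renders, such as the text inside your root layout's title tag, so a proxy error page or a placeholder page can't match by accident.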


Step 6: Status page and badge

Status page:

  1. Go to Status Pages → New Status Page in Vigilmon
  2. Add your monitors
  3. Copy the public URL

Share it in your README, error pages, or in your Slack channel topic so the team can check it first when users report issues.

README badge:

![Uptime](https://vigilmon.online/badge/your-monitor-slug)

As an HTML embed:

<a href="https://status.yourdomain.com">
  <img src="https://vigilmon.online/badge/your-monitor-slug" alt="Uptime">
</a>

The badge shows live status and response time.


What you've built

| What | How |
| --- | --- |
| Health check endpoint | Custom HealthCheck plug, no extra dependencies |
| DB + memory checks | Ecto.Adapters.SQL.query/3, :memsup |
| HTTP uptime monitoring | Vigilmon HTTP monitor → /health |
| Multi-region coverage | Vigilmon multi-probe checks |
| Slack downtime alerts | Vigilmon Slack notification channel |
| Oban job monitoring | Heartbeat ping on perform/1 success |
| GenServer monitoring | Heartbeat ping on each successful tick |
| Status page | Vigilmon public status page |
| README badge | /badge/{slug} SVG embed |

BEAM keeps your processes alive. Vigilmon keeps external eyes on the result.


Next steps

  • Add :cpu_sup load averages alongside the :memsup data in your health response for richer monitoring context (both come from OTP's :os_mon application)
  • Use Vigilmon's response time history to spot rising database or connection-pool latency (the /health check runs SELECT 1) before it causes timeouts
  • Add separate heartbeat monitors for every Oban queue that processes business-critical jobs
  • If you run multiple Fly.io regions, add a monitor per region to catch regional failures

Get started free at vigilmon.online.
