DEV Community

Vigilmon

Monitoring Your Django REST API with Vigilmon: Health Checks, Workers & Alerts

Your Django REST API is serving hundreds of requests a minute. But is it actually healthy? A 200 from curl on your laptop doesn't tell you whether the background workers are processing jobs, whether the database connection pool is exhausted, or whether a recent deploy quietly broke your serializers. Vigilmon closes that gap. This tutorial shows you how to instrument a DRF app end-to-end with uptime checks, Celery heartbeats, and alert routing.

What You'll Build

  • A /api/health/ endpoint that checks the database and cache (plus, optionally, Celery)
  • Vigilmon HTTP monitors watching your DRF routes
  • A Celery Beat heartbeat so you know your workers are running
  • A Django management command pattern for background-job heartbeats
  • Email and Slack alert channels

Prerequisites

  • Django 4.x+ with Django REST Framework installed
  • A free Vigilmon account
  • Optionally: Celery + Redis (or RabbitMQ) for async task monitoring

Step 1: Create the Health Check Endpoint

You can expose a health view without touching your existing serializers, either with the django-health-check package or with a small DRF view of your own.

Install django-health-check

pip install django-health-check

Add to INSTALLED_APPS:

INSTALLED_APPS = [
    ...
    "health_check",
    "health_check.db",
    "health_check.cache",
    "health_check.storage",
    # optional: "health_check.contrib.celery",
]

Wire up the URL in urls.py:

from django.urls import path, include

urlpatterns = [
    ...
    path("api/health/", include("health_check.urls")),
]

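GET /api/health/ now returns HTTP 200 when all checks pass and 500 when any check fails. The body is an HTML page by default; send an Accept: application/json header (or add ?format=json) to get JSON instead.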

Roll Your Own (Lightweight Alternative)

If you'd rather not add a dependency, a plain DRF view works just as well:

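# myapp/views.py
from django.core.cache import cache
from django.db import connections
from rest_framework.decorators import (
    api_view,
    authentication_classes,
    permission_classes,
)
from rest_framework.permissions import AllowAny
from rest_framework.response import Response


@api_view(["GET"])
@authentication_classes([])  # monitors call this endpoint anonymously
@permission_classes([AllowAny])  # override project-wide IsAuthenticated defaults
def health_check(request):
    checks = {}
    status_code = 200

    # Database: open a cursor, run a trivial query, and close the cursor
    try:
        with connections["default"].cursor() as cursor:
            cursor.execute("SELECT 1")
        checks["database"] = "ok"
    except Exception as exc:
        checks["database"] = f"error: {exc}"
        status_code = 503

    # Cache: write a short-lived key and read it back
    # (an explicit check instead of assert, which python -O strips out)
    try:
        cache.set("_health", "1", timeout=5)
        if cache.get("_health") != "1":
            raise RuntimeError("cache round-trip returned a different value")
        checks["cache"] = "ok"
    except Exception as exc:
        checks["cache"] = f"error: {exc}"
        status_code = 503

    return Response(
        {"status": "ok" if status_code == 200 else "degraded", "checks": checks},
        status=status_code,
    )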

Register it:

# urls.py
from django.urls import path

from myapp.views import health_check

urlpatterns = [
    ...
    # Use this instead of the django-health-check include(), not alongside it
    path("api/health/", health_check, name="health-check"),
]
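If you want to unit-test the pass/fail logic without a real database or cache, one option is to pull it into a plain function that the view calls. This is a minimal sketch: run_checks is a hypothetical helper name, and each probe is assumed to be any zero-argument callable that raises on failure.

```python
from typing import Callable, Dict, Tuple


def run_checks(checks: Dict[str, Callable[[], None]]) -> Tuple[dict, int]:
    """Run each named probe; a probe signals failure by raising.

    Returns (body, status_code) in the same shape as the health_check view:
    200 with status "ok" when every probe passes, otherwise 503 "degraded".
    """
    results: Dict[str, str] = {}
    status_code = 200
    for name, probe in checks.items():
        try:
            probe()
            results[name] = "ok"
        except Exception as exc:  # any failure marks the service as degraded
            results[name] = f"error: {exc}"
            status_code = 503
    body = {"status": "ok" if status_code == 200 else "degraded", "checks": results}
    return body, status_code
```

In the view, you would pass small functions that run SELECT 1 and the cache round-trip, then return Response(body, status=code). Tests can then feed in lambdas that raise, with no Django setup at all.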

Step 2: Add a Vigilmon HTTP Monitor

Log in to Vigilmon and create a new HTTP monitor:

| Field | Value |
|---|---|
| URL | https://api.yourdomain.com/api/health/ |
| Method | GET |
| Expected status | 200 |
| Check interval | 1 minute |
| Regions | Select 2–3 regions |

Vigilmon polls your endpoint from each selected region. If a check gets a non-200 response or times out, Vigilmon alerts you; with a 1-minute interval, an outage is caught at the next poll.

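Tip: also add a second monitor for your API root (/api/) that accepts 200 or 401 (or 403, which DRF returns instead of 401 when your first authentication class is SessionAuthentication). Any of those proves Django is up and routing requests; a 404 or 5xx means something broke.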
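To run the same check these two monitors perform from your own machine or a CI job (for example, right after a deploy), a small standard-library probe is enough. This is a sketch, not Vigilmon's implementation; probe is a hypothetical name, and the expected status codes mirror the settings above.

```python
import urllib.error
import urllib.request
from typing import Iterable


def probe(url: str, expected: Iterable[int] = (200,), timeout: float = 10.0) -> bool:
    """Return True if a GET to `url` answers with one of the `expected` status codes."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        # urlopen raises for 4xx/5xx responses, but the status code is on the error
        status = exc.code
    except OSError:
        # URLError (DNS failure, refused connection) and timeouts are OSError subclasses,
        # so they count as "down", the same way a monitor treats them
        return False
    return status in set(expected)


# Mirrors the two monitors from this step:
# probe("https://api.yourdomain.com/api/health/")
# probe("https://api.yourdomain.com/api/", expected=(200, 401, 403))
```

Catching HTTPError before OSError matters here: HTTPError is itself an OSError subclass, so reversing the order would report every 401 as a failure.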


Step 3: Monitor Celery Workers with a Heartbeat

This is the part most tutorials skip. Your HTTP endpoint can be healthy while Celery workers are silently dead — queued jobs pile up, emails stop sending, and you find out from a user complaint.

Create a Vigilmon Heartbeat Monitor

In Vigilmon, add a new Heartbeat monitor:

  1. Choose Heartbeat / Cron monitor type
  2. Set the expected ping interval to 10 minutes
  3. Copy the generated ping URL: https://vigilmon.online/api/heartbeat/xxxxxxxx

Ping the Heartbeat from Celery Beat

# myapp/tasks.py
import requests
from celery import shared_task
from django.conf import settings

@shared_task(name="myapp.ping_heartbeat")
def ping_heartbeat():
    url = settings.VIGILMON_HEARTBEAT_URL
    if url:
        try:
            requests.get(url, timeout=5)
        except requests.RequestException:
            pass  # Don't let a monitoring failure crash the task

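Schedule it in CELERY_BEAT_SCHEDULE (this setting name assumes your Celery app loads Django settings with namespace="CELERY", as in Celery's standard Django integration):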

# settings.py
import os

from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    "ping-heartbeat": {
        "task": "myapp.ping_heartbeat",
        "schedule": crontab(minute="*/5"),  # every 5 minutes
    },
    # ... your existing tasks
}

# An empty value disables the ping (the task checks `if url:` first).
# If you use django-environ, env("VIGILMON_HEARTBEAT_URL", default="") works too.
VIGILMON_HEARTBEAT_URL = os.environ.get("VIGILMON_HEARTBEAT_URL", "")

Set VIGILMON_HEARTBEAT_URL in the environment your Django and Celery processes read, for example your .env file:

VIGILMON_HEARTBEAT_URL=https://vigilmon.online/api/heartbeat/xxxxxxxx

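Celery Beat only schedules the task; a worker has to execute it. Each ping therefore proves both are alive, and if either one stops, Vigilmon stops receiving pings and fires an alert once the 10-minute window passes. Pinging every 5 minutes inside a 10-minute window also means a ping that runs a few minutes late won't trigger a false alarm.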
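Vigilmon's exact grace-period logic isn't covered in this tutorial, so treat the following as a simplified model of how a heartbeat (dead man's switch) monitor decides a job is overdue. is_overdue and the two constants are hypothetical names; the values come from this step.

```python
from datetime import datetime, timedelta
from typing import Optional

# Values from this step: the monitor's expected interval and the Beat schedule.
WINDOW = timedelta(minutes=10)
SCHEDULE = timedelta(minutes=5)


def is_overdue(last_ping: Optional[datetime], now: datetime, window: timedelta = WINDOW) -> bool:
    """Simplified heartbeat rule: alert when no ping has arrived within `window`."""
    if last_ping is None:
        return True  # the job has never pinged at all
    return now - last_ping > window


if __name__ == "__main__":
    last = datetime(2024, 1, 1, 12, 0)
    # The 12:05 ping runs 3 minutes late: 8 minutes of silence, still inside the window.
    print(is_overdue(last, last + timedelta(minutes=8)))  # → False
    # Workers died after 12:00: by 12:11 the window has passed.
    print(is_overdue(last, last + timedelta(minutes=11)))  # → True
```

The design choice is the ratio: a window of twice the schedule absorbs queue delays, while a window equal to the schedule would alert on any late ping.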


Step 4: Management Command Heartbeat Pattern

Many Django projects use manage.py commands for long-running jobs (data imports, report generation, nightly syncs). Here's the pattern for monitoring them:

# myapp/management/commands/import_data.py
import requests
from django.core.management.base import BaseCommand
from django.conf import settings

class Command(BaseCommand):
    help = "Import data from upstream source"

    def handle(self, *args, **options):
        self.stdout.write("Starting import...")
        try:
            self._run_import()
            self._ping_success()
            self.stdout.write(self.style.SUCCESS("Import complete."))
        except Exception as exc:
            self.stderr.write(f"Import failed: {exc}")
            raise  # let the process exit non-zero

    def _run_import(self):
        # your actual import logic here
        pass

    def _ping_success(self):
        url = getattr(settings, "VIGILMON_IMPORT_HEARTBEAT_URL", "")
        if url:
            try:
                requests.get(url, timeout=5)
            except requests.RequestException:
                pass

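Create a separate Vigilmon heartbeat monitor for each critical management command, store its ping URL in a setting (VIGILMON_IMPORT_HEARTBEAT_URL in this example), and set the monitor's expected interval to match how often cron runs the command, plus some slack. Because the ping fires only after a successful run, Vigilmon notices the silence whether the command starts crashing or cron stops invoking it.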
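If you have several commands or tasks to cover, the "ping only on success" rule can be factored into a decorator. This is a sketch using only the standard library (urllib instead of requests); ping_on_success is a hypothetical name, not part of Django or Vigilmon.

```python
import functools
import urllib.request
from typing import Any, Callable, Optional


def ping_on_success(url: Optional[str], timeout: float = 5.0) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Wrap a job so its heartbeat URL is hit only after it returns normally.

    If the job raises, no ping is sent and the exception propagates, so the
    heartbeat monitor sees the silence. Failures of the ping itself are swallowed.
    """
    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            result = func(*args, **kwargs)  # an exception here skips the ping
            if url:
                try:
                    with urllib.request.urlopen(url, timeout=timeout):
                        pass
                except OSError:
                    pass  # a monitoring outage must not fail the job
            return result
        return wrapper
    return decorator


# Hypothetical usage inside a management command module:
# @ping_on_success(getattr(settings, "VIGILMON_IMPORT_HEARTBEAT_URL", ""))
# def run_import(): ...
```

Note that the URL is read when the decorator is applied (at import time), so the setting must already be populated when the module loads.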


Step 5: Configure Alert Channels

In Vigilmon's Alerts settings:

Email Alerts

  1. Go to Alert Channels → Email
  2. Add ops@yourdomain.com (or your personal email)
  3. Attach the channel to both your HTTP monitor and heartbeat monitor

Slack Webhook

  1. In Slack, create an Incoming Webhook for your #ops or #alerts channel
  2. In Vigilmon, go to Alert Channels → Webhook
  3. Paste the Slack Webhook URL
  4. Set the payload template to:
{
  "text": "🚨 *{{ monitor.name }}* is DOWN\nStatus: {{ event.status }}\nURL: {{ monitor.url }}\n<https://vigilmon.online|View in Vigilmon>"
}

Once the Slack channel is attached to your monitors (as with the email channel), you'll get one alert when the API goes down and a separate alert when the Celery heartbeat goes silent.
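If a test alert never reaches Slack, it helps to rule out the Slack side by posting to the incoming webhook yourself. Slack incoming webhooks accept a JSON body with a text field and answer 200 with the body "ok". This sketch uses hypothetical helper names and mimics the message shape of the template above; it does not call Vigilmon.

```python
import json
import urllib.request


def build_alert_payload(monitor_name: str, status: str, url: str) -> dict:
    """Build a message shaped like the Vigilmon template above: a single "text" field."""
    return {"text": f"🚨 *{monitor_name}* is DOWN\nStatus: {status}\nURL: {url}"}


def post_to_slack(webhook_url: str, payload: dict, timeout: float = 5.0) -> int:
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Hypothetical usage with your real webhook URL:
# post_to_slack("https://hooks.slack.com/services/...",
#               build_alert_payload("API health", "test", "https://api.yourdomain.com/api/health/"))
```

If this message arrives but Vigilmon's does not, the problem is on the alert-channel side rather than the webhook.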


Step 6: Test It End-to-End

# 1. Verify your health endpoint locally
#    (the Accept header makes django-health-check return JSON instead of HTML)
curl -s -H "Accept: application/json" http://localhost:8000/api/health/ | python -m json.tool

# 2. Simulate a failure: stop your database and hit the endpoint again
# Expected (roll-your-own view): 503 with {"status": "degraded", "checks": {"database": "error: ..."}}
# Expected (django-health-check): 500
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/api/health/

# 3. Stop Celery beat and wait 10 minutes
# Expected: Vigilmon fires a heartbeat alert

# 4. Trigger a test alert from the Vigilmon UI and confirm Slack/email delivery
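For step 2, a short script can report exactly which checks failed instead of eyeballing the JSON. This sketch assumes the roll-your-own view's response shape ({"status": ..., "checks": {...}}); django-health-check's JSON uses different keys, and fetch_health and failing_checks are hypothetical names.

```python
import json
import urllib.error
import urllib.request
from typing import List, Tuple


def fetch_health(url: str, timeout: float = 5.0) -> Tuple[int, dict]:
    """GET the health endpoint as JSON and return (status_code, parsed_body)."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, json.loads(response.read())
    except urllib.error.HTTPError as exc:
        # A 503 still carries the JSON body that says which check failed
        return exc.code, json.loads(exc.read())


def failing_checks(body: dict) -> List[str]:
    """Names of checks whose result is anything other than "ok", sorted."""
    return sorted(name for name, result in body.get("checks", {}).items() if result != "ok")


# Usage against a local dev server:
# code, body = fetch_health("http://localhost:8000/api/health/")
# print(code, failing_checks(body) or "all checks ok")
```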

What You Now Have

| Monitor | Catches |
|---|---|
| HTTP /api/health/ | Server crashes, bad deploys, DB/cache failures |
| HTTP /api/ | Routing failures, auth middleware breakage |
| Celery heartbeat | Dead workers, stalled queues, beat scheduler crashes |
| Management command heartbeat | Silent cron failures, import crashes |

Your team now gets alerted within minutes — not hours — when any of these fail.


Next Steps

  • Add a public status page on Vigilmon and link it from your API docs
  • Set up per-environment monitors (staging vs production) so you catch issues in staging before prod
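  • If /api/health/ is publicly accessible, rate-limit it with DRF's throttle_classes (usable as a decorator on the roll-your-own view; django-health-check's view is a plain Django view), leaving enough headroom for Vigilmon's polls from every selected region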

Found this useful? Vigilmon is free to start — sign up here and have your first monitor live in under 5 minutes.
