How to Build a Monitoring Application Using Golang

Monitoring is a critical part of running reliable software, yet many teams only discover outages after user complaints start rolling in. Imagine getting a Slack message at 2 AM telling you that your APIs have been down for over an hour and nobody noticed until customers started complaining. A monitoring service solves this problem by letting you and your team respond to incidents proactively, before problems escalate.

In this tutorial, I will walk you through building a status monitoring application from scratch. By the end of this article, you will have a system that:

  1. Probes your services on a schedule (HTTP, TCP, DNS, and more)
  2. Detects outages and sends alerts to various communication channels (Teams, Slack, etc)
  3. Tracks incidents with automatic open/close
  4. Exposes metrics for Prometheus and Grafana dashboards
  5. Runs in Docker

For this application, I will be using Go because it is fast, compiles to a single binary for cross-platform support, and handles concurrency well, which is important for an application that needs to monitor multiple endpoints simultaneously.

What We're Building

We will be building a Go application called "StatusD". It reads a config file listing the services to monitor, probes them on a schedule, and creates incidents and fires notifications when something goes wrong.

Tech Stack Used:

  • Golang
  • PostgreSQL
  • Grafana (with Prometheus for metrics)
  • Docker
  • Nginx

Here's the high-level architecture:

┌─────────────────────────────────────────────────────────────────┐
│                        Docker Compose                           │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────────────┐ │
│  │ Postgres │  │Prometheus│  │  Grafana │  │      Nginx       │ │
│  │    DB    │  │ (metrics)│  │(dashboard)│  │ (reverse proxy) │ │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────────┬─────────┘ │
│       │             │             │                  │          │
│       └─────────────┴─────────────┴──────────────────┘          │
│                              │                                  │
│                    ┌─────────┴─────────┐                        │
│                    │      StatusD      │                        │
│                    │   (our Go app)    │                        │
│                    └─────────┬─────────┘                        │
│                              │                                  │
└──────────────────────────────┼──────────────────────────────────┘
                               │
              ┌────────────────┼────────────────┐
              ▼                ▼                ▼
         ┌────────┐       ┌────────┐       ┌────────┐
         │Service │       │Service │       │Service │
         │   A    │       │   B    │       │   C    │
         └────────┘       └────────┘       └────────┘

Project Structure

Before we write code, let's understand how the pieces fit together. Below is our project structure:

status-monitor/
├── cmd/statusd/
│   └── main.go              # Application entry point
├── internal/
│   ├── models/
│   │   └── models.go        # Data structures (Asset, Incident, etc.)
│   ├── probe/
│   │   ├── probe.go         # Probe registry
│   │   └── http.go          # HTTP probe implementation
│   ├── scheduler/
│   │   └── scheduler.go     # Worker pool and scheduling
│   ├── alert/
│   │   └── engine.go        # State machine and notifications
│   ├── notifier/
│   │   └── teams.go         # Teams/Slack integration
│   ├── store/
│   │   └── postgres.go      # Database layer
│   ├── api/
│   │   └── handlers.go      # REST API
│   └── config/
│       └── manifest.go      # Config loading
├── config/
│   ├── manifest.json        # Services to monitor
│   └── notifiers.json       # Notification channels
├── migrations/
│   └── 001_init_schema.up.sql
├── docker-compose.yml
├── Dockerfile
└── entrypoint.sh

The Core Data Models

Here we will define our core types; in other words, we will spell out what a "monitored service" looks like in code.

We will define four types:

  1. Asset: This is a service we want to monitor.

  2. ProbeResult: What happens when we check an Asset; the response, latency, etc.

  3. Incident: This tracks when something goes wrong, i.e., when a probe returns an unexpected result, and when the service recovers.

  4. Notification: This is an alert or message sent to the configured communication channels, e.g. Teams, Slack, or email.

Let's define the types in code:

// internal/models/models.go
package models

import "time"

// Asset represents a monitored service
type Asset struct {
    ID                  string            `json:"id"`
    AssetType           string            `json:"assetType"` // http, tcp, dns, etc.
    Name                string            `json:"name"`
    Address             string            `json:"address"`
    IntervalSeconds     int               `json:"intervalSeconds"`
    TimeoutSeconds      int               `json:"timeoutSeconds"`
    ExpectedStatusCodes []int             `json:"expectedStatusCodes,omitempty"`
    Metadata            map[string]string `json:"metadata,omitempty"`
}

// ProbeResult contains the outcome of a single health check
type ProbeResult struct {
    AssetID   string
    Timestamp time.Time
    Success   bool
    LatencyMs int64
    Code      int    // HTTP status code
    Message   string // Error message if failed
}

// Incident tracks a service outage
type Incident struct {
    ID        string
    AssetID   string
    StartedAt time.Time
    EndedAt   *time.Time // nil if still open
    Severity  string
    Summary   string
}

// Notification is what we send to Slack/Teams
type Notification struct {
    AssetID   string
    AssetName string
    Event     string    // "DOWN", "RECOVERY", "UP"
    Timestamp time.Time
    Details   string
}

Notice the ExpectedStatusCodes field in the Asset type. Not all endpoints return 200; some return 204 or a redirect. This field lets you define what "healthy" means for each service.

Database Schema

We need a place to store probe results and incidents. We will use PostgreSQL for this; here's our schema:

-- migrations/001_init_schema.up.sql

CREATE TABLE IF NOT EXISTS assets (
    id TEXT PRIMARY KEY,
    name TEXT NOT NULL,
    address TEXT NOT NULL,
    asset_type TEXT NOT NULL DEFAULT 'http',
    interval_seconds INTEGER DEFAULT 300,
    timeout_seconds INTEGER DEFAULT 5,
    expected_status_codes TEXT,
    metadata JSONB,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE IF NOT EXISTS probe_events (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    asset_id TEXT NOT NULL REFERENCES assets(id),
    timestamp TIMESTAMP WITH TIME ZONE NOT NULL,
    success BOOLEAN NOT NULL,
    latency_ms BIGINT NOT NULL,
    code INTEGER,
    message TEXT
);

CREATE TABLE IF NOT EXISTS incidents (
    id SERIAL PRIMARY KEY,
    asset_id TEXT NOT NULL REFERENCES assets(id),
    severity TEXT DEFAULT 'INITIAL',
    summary TEXT,
    started_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    ended_at TIMESTAMP
);

-- Indexes for common queries
CREATE INDEX IF NOT EXISTS idx_probe_events_asset_id_timestamp
    ON probe_events(asset_id, timestamp DESC);
CREATE INDEX IF NOT EXISTS idx_incidents_asset_id
    ON incidents(asset_id);
CREATE INDEX IF NOT EXISTS idx_incidents_ended_at
    ON incidents(ended_at);

The most important index is the composite one on probe_events(asset_id, timestamp DESC). Indexing by asset and timestamp in descending order lets us quickly fetch the most recent probe results for a given service.
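
The rest of the code talks to this schema through a store layer (internal/store/postgres.go). We won't walk through the SQL here, but below is a sketch of the interface the other packages assume. The method names match the calls used later in this article; treat the exact signatures as illustrative.

// internal/store/store.go (sketch)
package store

import (
    "context"

    "github.com/yourname/status/internal/models"
)

// Store is the persistence interface the alert engine and the API rely on.
type Store interface {
    // Assets and probe history
    GetAssets(ctx context.Context) ([]models.Asset, error)
    SaveProbeEvent(ctx context.Context, result models.ProbeResult) error
    GetProbeEvents(ctx context.Context, assetID string, limit int) ([]models.ProbeResult, error)

    // Incident lifecycle
    CreateIncident(ctx context.Context, assetID, summary string) (string, error)
    CloseIncident(ctx context.Context, incidentID string) error
    GetOpenIncidents(ctx context.Context) ([]models.Incident, error)
}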

Building the Probe System

Things begin to get interesting here. We want to support probing over multiple protocol types (HTTP, TCP, DNS, etc.) without writing a complex switch statement. To solve this, we use a registry pattern.

First we'll define what a probe looks like:

// internal/probe/probe.go
package probe

import (
    "context"
    "fmt"
    "github.com/yourname/status/internal/models"
)

// Probe defines the interface for checking service health
type Probe interface {
    Probe(ctx context.Context, asset models.Asset) (models.ProbeResult, error)
}

// registry holds all probe types
var registry = make(map[string]func() Probe)

// Register adds a probe type to the registry
func Register(assetType string, factory func() Probe) {
    registry[assetType] = factory
}

// GetProbe returns a probe for the given asset type
func GetProbe(assetType string) (Probe, error) {
    factory, ok := registry[assetType]
    if !ok {
        return nil, fmt.Errorf("unknown asset type: %s", assetType)
    }
    return factory(), nil
}

Now implement the HTTP probe:

// internal/probe/http.go
package probe

import (
    "context"
    "io"
    "net/http"
    "time"
    "github.com/yourname/status/internal/models"
)

func init() {
    Register("http", func() Probe { return &httpProbe{} })
}

type httpProbe struct{}

func (p *httpProbe) Probe(ctx context.Context, asset models.Asset) (models.ProbeResult, error) {
    result := models.ProbeResult{
        AssetID:   asset.ID,
        Timestamp: time.Now(),
    }

    client := &http.Client{
        Timeout: time.Duration(asset.TimeoutSeconds) * time.Second,
    }

    req, err := http.NewRequestWithContext(ctx, http.MethodGet, asset.Address, nil)
    if err != nil {
        result.Success = false
        result.Message = err.Error()
        return result, err
    }

    start := time.Now()
    resp, err := client.Do(req)
    result.LatencyMs = time.Since(start).Milliseconds()

    if err != nil {
        result.Success = false
        result.Message = err.Error()
        return result, err
    }
    defer resp.Body.Close()

    // Drain up to 1MB of the body so the underlying connection can be reused
    _, _ = io.ReadAll(io.LimitReader(resp.Body, 1024*1024))

    result.Code = resp.StatusCode

    // Check if status code is expected
    if len(asset.ExpectedStatusCodes) > 0 {
        for _, code := range asset.ExpectedStatusCodes {
            if code == resp.StatusCode {
                result.Success = true
                return result, nil
            }
        }
        result.Success = false
        result.Message = "unexpected status code"
    } else {
        result.Success = resp.StatusCode < 400
    }

    return result, nil
}

The init() function runs automatically when the probe package is initialized (before main runs). This adds the HTTP probe to the registry without any extra wiring code.

Want to add TCP probes? Create tcp.go, implement the interface, and register it in init().
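
For instance, here is a minimal sketch of what a TCP probe might look like (tcp.go isn't part of the listing above, so treat it as illustrative): it simply checks that a connection to host:port can be opened within the timeout.

// internal/probe/tcp.go (sketch)
package probe

import (
    "context"
    "net"
    "time"

    "github.com/yourname/status/internal/models"
)

func init() {
    Register("tcp", func() Probe { return &tcpProbe{} })
}

type tcpProbe struct{}

// Probe succeeds if a TCP connection to asset.Address (host:port) can be
// established within the configured timeout.
func (p *tcpProbe) Probe(ctx context.Context, asset models.Asset) (models.ProbeResult, error) {
    result := models.ProbeResult{AssetID: asset.ID, Timestamp: time.Now()}

    dialer := net.Dialer{Timeout: time.Duration(asset.TimeoutSeconds) * time.Second}

    start := time.Now()
    conn, err := dialer.DialContext(ctx, "tcp", asset.Address)
    result.LatencyMs = time.Since(start).Milliseconds()

    if err != nil {
        result.Success = false
        result.Message = err.Error()
        return result, err
    }
    conn.Close()

    result.Success = true
    return result, nil
}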

Scheduling and Concurrency

We need to probe all our assets on a schedule, and for this we will use a worker pool. A worker pool lets us run probes concurrently while capping how many run at once, instead of spawning an unbounded number of goroutines.

// internal/scheduler/scheduler.go
package scheduler

import (
    "context"
    "sync"
    "time"
    "github.com/yourname/status/internal/models"
    "github.com/yourname/status/internal/probe"
)

type JobHandler func(result models.ProbeResult)

type Scheduler struct {
    workers int
    jobs    chan models.Asset
    tickers map[string]*time.Ticker
    handler JobHandler
    mu      sync.Mutex
    done    chan struct{}
    wg      sync.WaitGroup
}

func NewScheduler(workerCount int, handler JobHandler) *Scheduler {
    return &Scheduler{
        workers: workerCount,
        jobs:    make(chan models.Asset, 100),
        tickers: make(map[string]*time.Ticker),
        handler: handler,
        done:    make(chan struct{}),
    }
}

func (s *Scheduler) Start(ctx context.Context) {
    for i := 0; i < s.workers; i++ {
        s.wg.Add(1)
        go s.worker(ctx)
    }
}

func (s *Scheduler) ScheduleAssets(assets []models.Asset) error {
    s.mu.Lock()
    defer s.mu.Unlock()

    for _, asset := range assets {
        interval := time.Duration(asset.IntervalSeconds) * time.Second
        ticker := time.NewTicker(interval)
        s.tickers[asset.ID] = ticker

        s.wg.Add(1)
        go s.scheduleAsset(asset, ticker)
    }
    return nil
}

func (s *Scheduler) scheduleAsset(asset models.Asset, ticker *time.Ticker) {
    defer s.wg.Done()
    defer ticker.Stop()
    for {
        select {
        case <-s.done:
            return
        case <-ticker.C:
            // Guard the send with done so shutdown never blocks on a full queue.
            select {
            case s.jobs <- asset:
            case <-s.done:
                return
            }
        }
    }
}

func (s *Scheduler) worker(ctx context.Context) {
    defer s.wg.Done()
    for {
        select {
        case <-s.done:
            return
        case asset := <-s.jobs:
            p, err := probe.GetProbe(asset.AssetType)
            if err != nil {
                continue
            }
            result, _ := p.Probe(ctx, asset)
            s.handler(result)
        }
    }
}

func (s *Scheduler) Stop() {
    // Signal all ticker and worker goroutines to exit, then wait for them.
    // We deliberately don't close s.jobs: a ticker goroutine could still be
    // sending when Stop is called, and sending on a closed channel panics.
    close(s.done)
    s.wg.Wait()
}

Each asset gets its own ticker goroutine that only schedules work. When it's time to check an asset, the ticker sends a probe job into a channel. A fixed number of worker goroutines listen on the channel and do the actual probing.

We don't run probes directly in the ticker goroutines because probes can block while waiting for network responses or timeouts. By using workers, we can control concurrency.

For example, with 4 workers and 100 assets, only 4 probes run at any moment, even if many tickers fire simultaneously. The channel acts as a buffer for pending jobs, and a sync.WaitGroup ensures all goroutines shut down cleanly.
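
Here is a simplified sketch of how the scheduler might be wired up in cmd/statusd/main.go. The real main.go also loads the manifest and notifier config and forwards results to the alert engine; this stripped-down version just logs them.

// cmd/statusd/main.go (simplified sketch)
package main

import (
    "context"
    "log"
    "os"
    "os/signal"

    "github.com/yourname/status/internal/models"
    "github.com/yourname/status/internal/scheduler"
)

func main() {
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()

    // Every probe result lands here; the real app hands it to the alert engine.
    handler := func(result models.ProbeResult) {
        log.Printf("asset=%s success=%v latency=%dms",
            result.AssetID, result.Success, result.LatencyMs)
    }

    sched := scheduler.NewScheduler(4, handler)
    sched.Start(ctx)

    // Normally loaded from config/manifest.json.
    assets := []models.Asset{
        {
            ID: "api-prod", AssetType: "http", Name: "Production API",
            Address:         "https://api.example.com/health",
            IntervalSeconds: 60, TimeoutSeconds: 5,
        },
    }
    if err := sched.ScheduleAssets(assets); err != nil {
        log.Fatal(err)
    }

    // Run until interrupted, then shut the workers down cleanly.
    stop := make(chan os.Signal, 1)
    signal.Notify(stop, os.Interrupt)
    <-stop
    sched.Stop()
}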

Incident Detection: The State Machine

When a probe result flips an asset from up to down, we create an incident and send an alert; when it flips back up, we close the incident and notify again. (In practice you may want to require a couple of consecutive failures before opening an incident, so a single network glitch doesn't page anyone; the engine below keeps things simple and reacts to the first state change.)

This is a state machine: UP → DOWN → UP.

Let's build the engine:

// internal/alert/engine.go
package alert

import (
    "context"
    "fmt"
    "sync"
    "time"
    "github.com/yourname/status/internal/models"
    "github.com/yourname/status/internal/store"
)

type NotifierFunc func(ctx context.Context, notification models.Notification) error

type AssetState struct {
    IsUp           bool
    LastProbeTime  time.Time
    OpenIncidentID string
}

type Engine struct {
    store      store.Store
    notifiers  map[string]NotifierFunc
    mu         sync.RWMutex
    assetState map[string]AssetState
}

func NewEngine(store store.Store) *Engine {
    return &Engine{
        store:      store,
        notifiers:  make(map[string]NotifierFunc),
        assetState: make(map[string]AssetState),
    }
}

func (e *Engine) RegisterNotifier(name string, fn NotifierFunc) {
    e.mu.Lock()
    defer e.mu.Unlock()
    e.notifiers[name] = fn
}

func (e *Engine) Process(ctx context.Context, result models.ProbeResult, asset models.Asset) error {
    e.mu.Lock()
    defer e.mu.Unlock()

    state := e.assetState[result.AssetID]
    state.LastProbeTime = result.Timestamp

    // State hasn't changed? Nothing to do.
    if state.IsUp == result.Success {
        e.assetState[result.AssetID] = state
        return nil
    }

    // Persist the probe event that triggered the state change
    if err := e.store.SaveProbeEvent(ctx, result); err != nil {
        return err
    }

    if result.Success && !state.IsUp {
        // Recovery!
        return e.handleRecovery(ctx, asset, state)
    } else if !result.Success && state.IsUp {
        // Outage!
        return e.handleOutage(ctx, asset, state, result)
    }

    return nil
}

func (e *Engine) handleOutage(ctx context.Context, asset models.Asset, state AssetState, result models.ProbeResult) error {
    incidentID, err := e.store.CreateIncident(ctx, asset.ID, fmt.Sprintf("Service %s is down", asset.Name))
    if err != nil {
        return err
    }

    state.IsUp = false
    state.OpenIncidentID = incidentID
    e.assetState[asset.ID] = state

    notification := models.Notification{
        AssetID:   asset.ID,
        AssetName: asset.Name,
        Event:     "DOWN",
        Timestamp: result.Timestamp,
        Details:   result.Message,
    }

    return e.sendNotifications(ctx, notification)
}

func (e *Engine) handleRecovery(ctx context.Context, asset models.Asset, state AssetState) error {
    if state.OpenIncidentID != "" {
        e.store.CloseIncident(ctx, state.OpenIncidentID)
    }

    state.IsUp = true
    state.OpenIncidentID = ""
    e.assetState[asset.ID] = state

    notification := models.Notification{
        AssetID:   asset.ID,
        AssetName: asset.Name,
        Event:     "RECOVERY",
        Timestamp: time.Now(),
        Details:   "Service has recovered",
    }

    return e.sendNotifications(ctx, notification)
}

func (e *Engine) sendNotifications(ctx context.Context, notification models.Notification) error {
    for name, notifier := range e.notifiers {
        if err := notifier(ctx, notification); err != nil {
            fmt.Printf("notifier %s failed: %v\n", name, err)
        }
    }
    return nil
}

Key insight: we track state in memory (assetState) for fast lookups, but persist incidents to the database for durability. If the process restarts, we can rebuild state from open incidents.
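
Here's a sketch of that startup rebuild, via a hypothetical LoadState method on the Engine (it isn't part of the code above): it restores the open-incident IDs so those incidents get closed when their services recover.

// Added to internal/alert/engine.go (sketch).
// LoadState rebuilds in-memory state from incidents that were still open when
// the process last stopped, so they can be closed on the next successful probe.
func (e *Engine) LoadState(ctx context.Context) error {
    open, err := e.store.GetOpenIncidents(ctx)
    if err != nil {
        return err
    }

    e.mu.Lock()
    defer e.mu.Unlock()
    for _, inc := range open {
        e.assetState[inc.AssetID] = AssetState{
            IsUp:           false,
            OpenIncidentID: inc.ID,
        }
    }
    return nil
}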

Sending Notifications

When something breaks, people need to know, so we send notifications to the configured communication channels.

Let's define our Teams notifier:

// internal/notifier/teams.go
package notifier

import (
    "bytes"
    "context"
    "encoding/json"
    "fmt"
    "net/http"
    "time"
    "github.com/yourname/status/internal/models"
)

type TeamsNotifier struct {
    webhookURL string
    client     *http.Client
}

func NewTeamsNotifier(webhookURL string) *TeamsNotifier {
    return &TeamsNotifier{
        webhookURL: webhookURL,
        client:     &http.Client{Timeout: 10 * time.Second},
    }
}

func (t *TeamsNotifier) Notify(ctx context.Context, n models.Notification) error {
    emoji := "🟢"
    if n.Event == "DOWN" {
        emoji = "🔴"
    }

    card := map[string]interface{}{
        "type": "message",
        "attachments": []map[string]interface{}{
            {
                "contentType": "application/vnd.microsoft.card.adaptive",
                "content": map[string]interface{}{
                    "$schema": "http://adaptivecards.io/schemas/adaptive-card.json",
                    "type":    "AdaptiveCard",
                    "version": "1.4",
                    "body": []map[string]interface{}{
                        {
                            "type":   "TextBlock",
                            "text":   fmt.Sprintf("%s %s - %s", emoji, n.AssetName, n.Event),
                            "weight": "Bolder",
                            "size":   "Large",
                        },
                        {
                            "type": "FactSet",
                            "facts": []map[string]interface{}{
                                {"title": "Service", "value": n.AssetName},
                                {"title": "Status", "value": n.Event},
                                {"title": "Time", "value": n.Timestamp.Format(time.RFC1123)},
                                {"title": "Details", "value": n.Details},
                            },
                        },
                    },
                },
            },
        },
    }

    body, _ := json.Marshal(card)
    req, _ := http.NewRequestWithContext(ctx, "POST", t.webhookURL, bytes.NewReader(body))
    req.Header.Set("Content-Type", "application/json")

    resp, err := t.client.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    if resp.StatusCode >= 300 {
        return fmt.Errorf("Teams webhook returned %d", resp.StatusCode)
    }
    return nil
}

Teams uses Adaptive Cards for rich formatting.
You can define similar notifiers for other communication channels, e.g. Slack or Discord.
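
For example, a minimal Slack version could post a plain text payload to an incoming-webhook URL. This is a sketch (slack.go isn't in the repository listing above):

// internal/notifier/slack.go (sketch)
package notifier

import (
    "bytes"
    "context"
    "encoding/json"
    "fmt"
    "net/http"
    "time"

    "github.com/yourname/status/internal/models"
)

type SlackNotifier struct {
    webhookURL string
    client     *http.Client
}

func NewSlackNotifier(webhookURL string) *SlackNotifier {
    return &SlackNotifier{
        webhookURL: webhookURL,
        client:     &http.Client{Timeout: 10 * time.Second},
    }
}

func (s *SlackNotifier) Notify(ctx context.Context, n models.Notification) error {
    emoji := ":large_green_circle:"
    if n.Event == "DOWN" {
        emoji = ":red_circle:"
    }

    // Slack incoming webhooks accept a simple {"text": "..."} payload.
    payload := map[string]string{
        "text": fmt.Sprintf("%s *%s* - %s\n%s", emoji, n.AssetName, n.Event, n.Details),
    }

    body, err := json.Marshal(payload)
    if err != nil {
        return err
    }

    req, err := http.NewRequestWithContext(ctx, http.MethodPost, s.webhookURL, bytes.NewReader(body))
    if err != nil {
        return err
    }
    req.Header.Set("Content-Type", "application/json")

    resp, err := s.client.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    if resp.StatusCode >= 300 {
        return fmt.Errorf("Slack webhook returned %d", resp.StatusCode)
    }
    return nil
}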

The REST API

We need endpoints to query the status of the services we are monitoring. For this, we will be using Chi, which is a lightweight router that supports route parameters like /assets/{id}.

Let's define the API handlers:

// internal/api/handlers.go
package api

import (
    "encoding/json"
    "net/http"
    "github.com/go-chi/chi/v5"
    "github.com/go-chi/chi/v5/middleware"
    "github.com/yourname/status/internal/store"
)

type Server struct {
    store store.Store
    mux   *chi.Mux
}

func NewServer(s store.Store) *Server {
    srv := &Server{store: s, mux: chi.NewRouter()}

    srv.mux.Use(middleware.Logger)
    srv.mux.Use(middleware.Recoverer)

    srv.mux.Route("/api", func(r chi.Router) {
        r.Get("/health", srv.health)
        r.Get("/assets", srv.listAssets)
        r.Get("/assets/{id}/events", srv.getAssetEvents)
        r.Get("/incidents", srv.listIncidents)
    })

    return srv
}

func (s *Server) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    s.mux.ServeHTTP(w, r)
}

func (s *Server) health(w http.ResponseWriter, r *http.Request) {
    w.Header().Set("Content-Type", "application/json")
    json.NewEncoder(w).Encode(map[string]string{"status": "healthy"})
}

func (s *Server) listAssets(w http.ResponseWriter, r *http.Request) {
    assets, err := s.store.GetAssets(r.Context())
    if err != nil {
        http.Error(w, err.Error(), 500)
        return
    }
    w.Header().Set("Content-Type", "application/json")
    json.NewEncoder(w).Encode(assets)
}

func (s *Server) getAssetEvents(w http.ResponseWriter, r *http.Request) {
    id := chi.URLParam(r, "id")
    events, _ := s.store.GetProbeEvents(r.Context(), id, 100)
    w.Header().Set("Content-Type", "application/json")
    json.NewEncoder(w).Encode(events)
}

func (s *Server) listIncidents(w http.ResponseWriter, r *http.Request) {
    incidents, _ := s.store.GetOpenIncidents(r.Context())
    w.Header().Set("Content-Type", "application/json")
    json.NewEncoder(w).Encode(incidents)
}

The code above defines a small HTTP API server that exposes four read-only endpoints (a minimal serving sketch follows the list):

GET /api/health - Health check (is the service running?)
GET /api/assets - List all monitored services
GET /api/assets/{id}/events - Get probe history for a specific service
GET /api/incidents - List open incidents
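
Because Server implements ServeHTTP, it plugs straight into net/http. A small sketch of serving it from main.go, assuming st is whichever store.Store implementation you constructed earlier (the port matches the -api-port flag the entrypoint passes):

// Somewhere in cmd/statusd/main.go (sketch)
package main

import (
    "log"
    "net/http"

    "github.com/yourname/status/internal/api"
    "github.com/yourname/status/internal/store"
)

func serveAPI(st store.Store) {
    srv := api.NewServer(st)
    // The entrypoint passes -api-port 8080.
    log.Fatal(http.ListenAndServe(":8080", srv))
}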

Dockerizing the Application

Dockerizing the application is pretty straightforward since Go compiles to a single binary. We will use a multi-stage build to keep the final image small:


# Dockerfile
FROM golang:1.24-alpine AS builder
WORKDIR /app

RUN apk add --no-cache git
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o statusd ./cmd/statusd/

FROM alpine:latest
WORKDIR /app
RUN apk --no-cache add ca-certificates
COPY --from=builder /app/statusd .
COPY entrypoint.sh .
RUN chmod +x /app/entrypoint.sh

EXPOSE 8080
ENTRYPOINT ["/app/entrypoint.sh"]

The builder stage compiles the code. The final stage is just Alpine plus our binary—typically under 20MB.

The entrypoint script builds the database connection string from environment variables:

#!/bin/sh
# entrypoint.sh

DB_HOST=${DB_HOST:-localhost}
DB_PORT=${DB_PORT:-5432}
DB_USER=${DB_USER:-status}
DB_PASSWORD=${DB_PASSWORD:-status}
DB_NAME=${DB_NAME:-status_db}

DB_CONN_STRING="postgres://${DB_USER}:${DB_PASSWORD}@${DB_HOST}:${DB_PORT}/${DB_NAME}"

exec ./statusd \
  -manifest /app/config/manifest.json \
  -notifiers /app/config/notifiers.json \
  -db "$DB_CONN_STRING" \
  -workers 4 \
  -api-port 8080

Docker Compose: Putting It All Together

One file to rule them all:


# docker-compose.yml
version: "3.8"

services:
  postgres:
    image: postgres:15-alpine
    container_name: status_postgres
    environment:
      POSTGRES_USER: status
      POSTGRES_PASSWORD: changeme
      POSTGRES_DB: status_db
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./migrations:/docker-entrypoint-initdb.d
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U status"]
      interval: 10s
      timeout: 5s
      retries: 5
    networks:
      - status_network

  statusd:
    build: .
    container_name: status_app
    environment:
      - DB_HOST=postgres
      - DB_PORT=5432
      - DB_USER=status
      - DB_PASSWORD=changeme
      - DB_NAME=status_db
    volumes:
      - ./config:/app/config:ro
    depends_on:
      postgres:
        condition: service_healthy
    networks:
      - status_network

  prometheus:
    image: prom/prometheus:latest
    container_name: status_prometheus
    volumes:
      - ./docker/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    networks:
      - status_network
    depends_on:
      - statusd

  grafana:
    image: grafana/grafana:latest
    container_name: status_grafana
    environment:
      GF_SECURITY_ADMIN_USER: admin
      GF_SECURITY_ADMIN_PASSWORD: admin
    volumes:
      - grafana_data:/var/lib/grafana
    networks:
      - status_network
    depends_on:
      - prometheus

  nginx:
    image: nginx:alpine
    container_name: status_nginx
    volumes:
      - ./docker/nginx/nginx.conf:/etc/nginx/nginx.conf:ro
      - ./docker/nginx/conf.d:/etc/nginx/conf.d:ro
    ports:
      - "80:80"
    depends_on:
      - statusd
      - grafana
      - prometheus
    networks:
      - status_network

networks:
  status_network:
    driver: bridge

volumes:
  postgres_data:
  prometheus_data:
  grafana_data:

A few things to note:

  • PostgreSQL healthcheck: The statusd service waits until Postgres is actually ready, not just started. This prevents "connection refused" errors on first boot.
  • Config mount: We mount ./config as read-only. Edit your manifest locally, and the running container sees the changes.
  • Nginx: Routes external traffic to Grafana and Prometheus dashboards.
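
One piece the compose file assumes but the article hasn't shown: statusd needs to expose a /metrics endpoint for Prometheus to scrape. A minimal sketch using the official client library (prometheus/client_golang) could look like this; the package path and metric names are illustrative:

// internal/metrics/metrics.go (sketch)
package metrics

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    // ProbeTotal counts probes by asset and outcome.
    ProbeTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{Name: "statusd_probe_total", Help: "Probe results by asset and outcome."},
        []string{"asset", "success"},
    )
    // ProbeLatency records probe latency per asset.
    ProbeLatency = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{Name: "statusd_probe_latency_ms", Help: "Probe latency in milliseconds."},
        []string{"asset"},
    )
)

func init() {
    prometheus.MustRegister(ProbeTotal, ProbeLatency)
}

// Handler returns the /metrics handler to mount alongside the REST API.
func Handler() http.Handler {
    return promhttp.Handler()
}

The scheduler's result handler can then call metrics.ProbeTotal.WithLabelValues(result.AssetID, strconv.FormatBool(result.Success)).Inc() after each check, assuming docker/prometheus.yml (not shown here) points Prometheus at statusd's port.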

Configuration Files

The application reads two files: manifest.json and notifiers.json

  1. The manifest.json file lists the assets we want to monitor. Each asset needs an ID, a probe type, and an address. The intervalSeconds controls how often we check (60 = once per minute). expectedStatusCodes lets you define what "healthy" means. Some endpoints return 301 redirects or 204 No Content, and that's fine.
// config/manifest.json
{
  "assets": [
    {
      "id": "api-prod",
      "assetType": "http",
      "name": "Production API",
      "address": "https://api.example.com/health",
      "intervalSeconds": 60,
      "timeoutSeconds": 5,
      "expectedStatusCodes": [200],
      "metadata": {
        "env": "prod",
        "owner": "platform-team"
      }
    },
    {
      "id": "web-prod",
      "assetType": "http",
      "name": "Production Website",
      "address": "https://www.example.com",
      "intervalSeconds": 120,
      "timeoutSeconds": 10,
      "expectedStatusCodes": [200, 301]
    }
  ]
}

  2. The notifiers.json file controls where alerts are sent. You define notification channels (Teams, Slack), then set policies for which channels fire on which events. throttleSeconds: 300 means you won't get spammed more than once every 5 minutes for the same issue (a sketch of how the engine could enforce this follows the config).
// config/notifiers.json
{
  "notifiers": {
    "teams": {
      "type": "teams",
      "webhookUrl": "https://outlook.office.com/webhook/your-webhook-url"
    }
  },
  "notificationPolicy": {
    "onDown": ["teams"],
    "onRecovery": ["teams"],
    "throttleSeconds": 300,
    "repeatAlerts": false
  }
}
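
The engine shown earlier doesn't actually enforce throttleSeconds yet. One simple way to add it is a small per-asset, per-event throttle that sendNotifications consults before fanning out; this is a sketch under that assumption, not code from the repository:

// internal/alert/throttle.go (sketch)
package alert

import (
    "sync"
    "time"

    "github.com/yourname/status/internal/models"
)

// throttle remembers when we last alerted for each asset/event pair and
// suppresses repeats inside the configured window.
type throttle struct {
    window   time.Duration
    mu       sync.Mutex
    lastSent map[string]time.Time // key: assetID + "/" + event
}

func newThrottle(window time.Duration) *throttle {
    return &throttle{window: window, lastSent: make(map[string]time.Time)}
}

// allow reports whether this notification may be sent now, and records the
// send time when it is allowed.
func (t *throttle) allow(n models.Notification) bool {
    t.mu.Lock()
    defer t.mu.Unlock()

    key := n.AssetID + "/" + n.Event
    if last, ok := t.lastSent[key]; ok && time.Since(last) < t.window {
        return false
    }
    t.lastSent[key] = time.Now()
    return true
}

sendNotifications would then skip the fan-out whenever allow(notification) returns false.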

Running It

docker-compose up -d

That's it. Five services spin up:

  1. PostgreSQL stores your data
  2. StatusD probes your services
  3. Prometheus collects metrics
  4. Grafana displays dashboards (http://localhost:80)
  5. Nginx routes everything

Check the logs:

docker logs -f status_app

You should see:

Loading assets manifest...
Loaded 2 assets
Loading notifiers config...
Loaded 1 notifiers
Connecting to database...
Starting scheduler...
[✓] Production API (api-prod): 45ms
[✓] Production Website (web-prod): 120ms

Summary

You now have a monitoring system that:

  1. Reads services from a JSON config
  2. Probes them on a schedule using a worker pool
  3. Detects outages and creates incidents
  4. Sends notifications to Teams/Slack
  5. Exposes metrics for Prometheus
  6. Runs in Docker with one command

This tutorial gets you to a working monitoring system, but there is more under the hood that we glossed over. In a second part, we will cover:

  • Circuit breakers to prevent cascading failures when a service is flapping
  • Multi-tier escalation so alerts go up the chain if the on-call engineer doesn't respond
  • Alert deduplication to prevent notification storms
  • Adaptive probe intervals that check more frequently during incidents
  • Hot-reloading configuration without restarting the service
  • SLA calculations and compliance tracking
