Monitoring is a critical part of running reliable software, yet many teams only discover outages after user complaints start rolling in. Imagine getting a Slack message at 2 AM telling you that your APIs have been down for over an hour and nobody noticed until customers started complaining. A monitoring service solves this problem by letting you and your team respond to incidents proactively, before problems escalate.
In this tutorial, I will take you through building a status monitoring application from scratch. By the end of this article, you will have a system that:
- Probes your services on a schedule (HTTP, TCP, DNS, and more)
- Detects outages and sends alerts to various communication channels (Teams, Slack, etc)
- Tracks incidents with automatic open/close
- Exposes metrics for Prometheus and Grafana dashboards
- Runs in Docker
For this application, I will be using Go because it is fast, compiles to a single binary for easy cross-platform deployment, and has first-class concurrency support, which matters for an application that needs to monitor multiple endpoints simultaneously.
What We're Building
We will be building a Go application called "StatusD". It reads a config file with a list of services to monitor, probes them on a schedule, and creates incidents and fires notifications when something goes wrong.
Tech Stack Used:
- Golang
- PostgreSQL
- Grafana (with Prometheus for metrics)
- Docker
- Nginx
Here's the high-level architecture:
┌─────────────────────────────────────────────────────────────────┐
│ Docker Compose │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────────┐ │
│ │ Postgres │ │Prometheus│ │ Grafana │ │ Nginx │ │
│ │ DB │ │ (metrics)│ │(dashboard)│ │ (reverse proxy) │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────────┬─────────┘ │
│ │ │ │ │ │
│ └─────────────┴─────────────┴──────────────────┘ │
│ │ │
│ ┌─────────┴─────────┐ │
│ │ StatusD │ │
│ │ (our Go app) │ │
│ └─────────┬─────────┘ │
│ │ │
└──────────────────────────────┼──────────────────────────────────┘
│
┌────────────────┼────────────────┐
▼ ▼ ▼
┌────────┐ ┌────────┐ ┌────────┐
│Service │ │Service │ │Service │
│ A │ │ B │ │ C │
└────────┘ └────────┘ └────────┘
Project Structure
Before we write code, let's understand how the pieces fit together. Below is our project structure:
status-monitor/
├── cmd/statusd/
│ └── main.go # Application entry point
├── internal/
│ ├── models/
│ │ └── models.go # Data structures (Asset, Incident, etc.)
│ ├── probe/
│ │ ├── probe.go # Probe registry
│ │ └── http.go # HTTP probe implementation
│ ├── scheduler/
│ │ └── scheduler.go # Worker pool and scheduling
│ ├── alert/
│ │ └── engine.go # State machine and notifications
│ ├── notifier/
│ │ └── teams.go # Teams/Slack integration
│ ├── store/
│ │ └── postgres.go # Database layer
│ ├── api/
│ │ └── handlers.go # REST API
│ └── config/
│ └── manifest.go # Config loading
├── config/
│ ├── manifest.json # Services to monitor
│ └── notifiers.json # Notification channels
├── migrations/
│ └── 001_init_schema.up.sql
├── docker-compose.yml
├── Dockerfile
└── entrypoint.sh
The Core Data Models
Here we will define our 'types'; essentially, we are deciding what a "monitored service" looks like in code.
We will define four of them:
Asset: This is a service we want to monitor.
ProbeResult: What happens when we check an Asset; the response, latency, etc.
Incident: This tracks when something goes wrong, i.e., when ProbeResult returns an unexpected response (and when the service recovers).
Notification: This is an alert or message sent to the defined communications channel, e.g. Teams, Slack, email, etc.
Let's define the types in code:
// internal/models/models.go
package models
import "time"
// Asset represents a monitored service
type Asset struct {
ID string `json:"id"`
AssetType string `json:"assetType"` // http, tcp, dns, etc.
Name string `json:"name"`
Address string `json:"address"`
IntervalSeconds int `json:"intervalSeconds"`
TimeoutSeconds int `json:"timeoutSeconds"`
ExpectedStatusCodes []int `json:"expectedStatusCodes,omitempty"`
Metadata map[string]string `json:"metadata,omitempty"`
}
// ProbeResult contains the outcome of a single health check
type ProbeResult struct {
AssetID string
Timestamp time.Time
Success bool
LatencyMs int64
Code int // HTTP status code
Message string // Error message if failed
}
// Incident tracks a service outage
type Incident struct {
ID string
AssetID string
StartedAt time.Time
EndedAt *time.Time // nil if still open
Severity string
Summary string
}
// Notification is what we send to Slack/Teams
type Notification struct {
AssetID string
AssetName string
Event string // "DOWN", "RECOVERY", "UP"
Timestamp time.Time
Details string
}
Notice the ExpectedStatusCodes field in the Asset type. Not all endpoints return 200; some return 204 or a redirect. This field lets you define what "healthy" means for each service.
Database Schema
We need a place to store the probe results and incidents. We will be using PostgreSQL for this and here's our schema:
-- migrations/001_init_schema.up.sql
CREATE TABLE IF NOT EXISTS assets (
id TEXT PRIMARY KEY,
name TEXT NOT NULL,
address TEXT NOT NULL,
asset_type TEXT NOT NULL DEFAULT 'http',
interval_seconds INTEGER DEFAULT 300,
timeout_seconds INTEGER DEFAULT 5,
expected_status_codes TEXT,
metadata JSONB,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS probe_events (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
asset_id TEXT NOT NULL REFERENCES assets(id),
timestamp TIMESTAMP WITH TIME ZONE NOT NULL,
success BOOLEAN NOT NULL,
latency_ms BIGINT NOT NULL,
code INTEGER,
message TEXT
);
CREATE TABLE IF NOT EXISTS incidents (
id SERIAL PRIMARY KEY,
asset_id TEXT NOT NULL REFERENCES assets(id),
severity TEXT DEFAULT 'INITIAL',
summary TEXT,
started_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
ended_at TIMESTAMP
);
-- Indexes for common queries
CREATE INDEX IF NOT EXISTS idx_probe_events_asset_id_timestamp
ON probe_events(asset_id, timestamp DESC);
CREATE INDEX IF NOT EXISTS idx_incidents_asset_id
ON incidents(asset_id);
CREATE INDEX IF NOT EXISTS idx_incidents_ended_at
ON incidents(ended_at);
The key index is on probe_events(asset_id, timestamp DESC). Indexing by asset and timestamp (in descending order) lets us quickly fetch the most recent probe results for a given service.
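To see that index at work, here is a sketch of how the store layer (internal/store/postgres.go, which we won't show in full) might run such a query with database/sql. The PostgresStore type and the exact column handling are assumptions, not the final implementation:
// internal/store/postgres.go (sketch)
package store

import (
    "context"
    "database/sql"

    "github.com/yourname/status/internal/models"
)

type PostgresStore struct {
    db *sql.DB
}

// GetProbeEvents returns the most recent probe results for one asset.
// Filtering by asset_id and ordering by timestamp DESC is exactly the
// access pattern idx_probe_events_asset_id_timestamp covers.
func (s *PostgresStore) GetProbeEvents(ctx context.Context, assetID string, limit int) ([]models.ProbeResult, error) {
    rows, err := s.db.QueryContext(ctx, `
        SELECT asset_id, timestamp, success, latency_ms, COALESCE(code, 0), COALESCE(message, '')
        FROM probe_events
        WHERE asset_id = $1
        ORDER BY timestamp DESC
        LIMIT $2`, assetID, limit)
    if err != nil {
        return nil, err
    }
    defer rows.Close()

    var events []models.ProbeResult
    for rows.Next() {
        var e models.ProbeResult
        if err := rows.Scan(&e.AssetID, &e.Timestamp, &e.Success, &e.LatencyMs, &e.Code, &e.Message); err != nil {
            return nil, err
        }
        events = append(events, e)
    }
    return events, rows.Err()
}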
Building the Probe System
Things begin to get interesting here. We want to support probing over multiple protocols (HTTP, TCP, DNS, etc.) without writing one giant switch statement. To solve this, we use a registry pattern.
First we'll define what a probe looks like:
// internal/probe/probe.go
package probe
import (
"context"
"fmt"
"github.com/yourname/status/internal/models"
)
// Probe defines the interface for checking service health
type Probe interface {
Probe(ctx context.Context, asset models.Asset) (models.ProbeResult, error)
}
// registry holds all probe types
var registry = make(map[string]func() Probe)
// Register adds a probe type to the registry
func Register(assetType string, factory func() Probe) {
registry[assetType] = factory
}
// GetProbe returns a probe for the given asset type
func GetProbe(assetType string) (Probe, error) {
factory, ok := registry[assetType]
if !ok {
return nil, fmt.Errorf("unknown asset type: %s", assetType)
}
return factory(), nil
}
Now implement the HTTP probe:
// internal/probe/http.go
package probe
import (
"context"
"io"
"net/http"
"time"
"github.com/yourname/status/internal/models"
)
func init() {
Register("http", func() Probe { return &httpProbe{} })
}
type httpProbe struct{}
func (p *httpProbe) Probe(ctx context.Context, asset models.Asset) (models.ProbeResult, error) {
result := models.ProbeResult{
AssetID: asset.ID,
Timestamp: time.Now(),
}
client := &http.Client{
Timeout: time.Duration(asset.TimeoutSeconds) * time.Second,
}
req, err := http.NewRequestWithContext(ctx, http.MethodGet, asset.Address, nil)
if err != nil {
result.Success = false
result.Message = err.Error()
return result, err
}
start := time.Now()
resp, err := client.Do(req)
result.LatencyMs = time.Since(start).Milliseconds()
if err != nil {
result.Success = false
result.Message = err.Error()
return result, err
}
defer resp.Body.Close()
// Read body (limit to 1MB)
io.ReadAll(io.LimitReader(resp.Body, 1024*1024))
result.Code = resp.StatusCode
// Check if status code is expected
if len(asset.ExpectedStatusCodes) > 0 {
for _, code := range asset.ExpectedStatusCodes {
if code == resp.StatusCode {
result.Success = true
return result, nil
}
}
result.Success = false
result.Message = "unexpected status code"
} else {
result.Success = resp.StatusCode < 400
}
return result, nil
}
The init() function runs automatically when the probe package is imported, so the HTTP probe adds itself to the registry without any changes to the registry or the code that uses it.
Want to add TCP probes? Create tcp.go, implement the interface, and register it in init().
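As a sketch of what that could look like (this file isn't part of the article's code, so treat the details as assumptions):
// internal/probe/tcp.go (sketch)
package probe

import (
    "context"
    "net"
    "time"

    "github.com/yourname/status/internal/models"
)

func init() {
    Register("tcp", func() Probe { return &tcpProbe{} })
}

type tcpProbe struct{}

// Probe considers the asset healthy if a TCP connection to asset.Address
// (e.g. "db.example.com:5432") can be established within the timeout.
func (p *tcpProbe) Probe(ctx context.Context, asset models.Asset) (models.ProbeResult, error) {
    result := models.ProbeResult{
        AssetID:   asset.ID,
        Timestamp: time.Now(),
    }
    dialer := &net.Dialer{Timeout: time.Duration(asset.TimeoutSeconds) * time.Second}

    start := time.Now()
    conn, err := dialer.DialContext(ctx, "tcp", asset.Address)
    result.LatencyMs = time.Since(start).Milliseconds()
    if err != nil {
        result.Success = false
        result.Message = err.Error()
        return result, err
    }
    conn.Close()

    result.Success = true
    return result, nil
}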
Scheduling and Concurrency
We need to probe all our assets on a schedule, and for this we will use a worker pool. A worker pool lets us run probes concurrently while capping how many run at once, instead of launching an unbounded number of goroutines.
// internal/scheduler/scheduler.go
package scheduler
import (
"context"
"sync"
"time"
"github.com/yourname/status/internal/models"
"github.com/yourname/status/internal/probe"
)
type JobHandler func(result models.ProbeResult)
type Scheduler struct {
workers int
jobs chan models.Asset
tickers map[string]*time.Ticker
handler JobHandler
mu sync.Mutex
done chan struct{}
wg sync.WaitGroup
}
func NewScheduler(workerCount int, handler JobHandler) *Scheduler {
return &Scheduler{
workers: workerCount,
jobs: make(chan models.Asset, 100),
tickers: make(map[string]*time.Ticker),
handler: handler,
done: make(chan struct{}),
}
}
func (s *Scheduler) Start(ctx context.Context) {
for i := 0; i < s.workers; i++ {
s.wg.Add(1)
go s.worker(ctx)
}
}
func (s *Scheduler) ScheduleAssets(assets []models.Asset) error {
s.mu.Lock()
defer s.mu.Unlock()
for _, asset := range assets {
interval := time.Duration(asset.IntervalSeconds) * time.Second
ticker := time.NewTicker(interval)
s.tickers[asset.ID] = ticker
s.wg.Add(1)
go s.scheduleAsset(asset, ticker)
}
return nil
}
func (s *Scheduler) scheduleAsset(asset models.Asset, ticker *time.Ticker) {
    defer s.wg.Done()
    defer ticker.Stop()
    for {
        select {
        case <-s.done:
            return
        case <-ticker.C:
            // Don't block forever on a full queue if we're shutting down.
            select {
            case s.jobs <- asset:
            case <-s.done:
                return
            }
        }
    }
}
func (s *Scheduler) worker(ctx context.Context) {
defer s.wg.Done()
for {
select {
case <-s.done:
return
case asset := <-s.jobs:
p, err := probe.GetProbe(asset.AssetType)
if err != nil {
continue
}
result, _ := p.Probe(ctx, asset)
s.handler(result)
}
}
}
func (s *Scheduler) Stop() {
    // Signal everything to stop, wait for tickers and workers to exit, and
    // only then close the jobs channel so nothing can send on a closed channel.
    close(s.done)
    s.wg.Wait()
    close(s.jobs)
}
Each asset gets its own ticker goroutine that only schedules work. When it's time to check an asset, the ticker goroutine sends a probe job into a channel. A fixed number of worker goroutines listen on that channel and do the actual probing.
We don't run probes directly in the ticker goroutines because probes can block while waiting for network responses or timeouts. By using workers, we can control concurrency.
For example, with 4 workers and 100 assets, only 4 probes will run at any moment even if tickers fire simultaneously. The channel acts as a buffer for pending jobs, and a sync.WaitGroup ensures all workers shut down cleanly.
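To make the flow concrete, here is a sketch of how cmd/statusd/main.go might wire the scheduler together. The article doesn't show main.go, and in the finished app the assets come from manifest.json; they are hard-coded here only to keep the sketch short:
// cmd/statusd/main.go (sketch)
package main

import (
    "context"
    "log"
    "os"
    "os/signal"

    "github.com/yourname/status/internal/models"
    "github.com/yourname/status/internal/scheduler"
)

func main() {
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()

    // For now, just log each result; later the handler forwards results
    // to the alert engine.
    handler := func(result models.ProbeResult) {
        log.Printf("probe %s success=%v latency=%dms",
            result.AssetID, result.Success, result.LatencyMs)
    }

    sched := scheduler.NewScheduler(4, handler)
    sched.Start(ctx)

    assets := []models.Asset{{
        ID:              "api-prod",
        AssetType:       "http",
        Name:            "Production API",
        Address:         "https://api.example.com/health",
        IntervalSeconds: 60,
        TimeoutSeconds:  5,
    }}
    if err := sched.ScheduleAssets(assets); err != nil {
        log.Fatal(err)
    }

    // Block until interrupted, then shut everything down cleanly.
    stop := make(chan os.Signal, 1)
    signal.Notify(stop, os.Interrupt)
    <-stop
    sched.Stop()
}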
Incident Detection: The State Machine
When a probe result flips an asset's state from up to down, we create an incident and send an alert; when it flips back, we close the incident and notify again. (A production system would usually wait for a few consecutive failures before opening an incident so a single network glitch doesn't page anyone; we keep the logic simple here.)
This is a state machine: UP → DOWN → UP.
Let's build the engine:
// internal/alert/engine.go
package alert
import (
"context"
"fmt"
"sync"
"time"
"github.com/yourname/status/internal/models"
"github.com/yourname/status/internal/store"
)
type NotifierFunc func(ctx context.Context, notification models.Notification) error
type AssetState struct {
IsUp bool
LastProbeTime time.Time
OpenIncidentID string
}
type Engine struct {
store store.Store
notifiers map[string]NotifierFunc
mu sync.RWMutex
assetState map[string]AssetState
}
func NewEngine(store store.Store) *Engine {
return &Engine{
store: store,
notifiers: make(map[string]NotifierFunc),
assetState: make(map[string]AssetState),
}
}
func (e *Engine) RegisterNotifier(name string, fn NotifierFunc) {
e.mu.Lock()
defer e.mu.Unlock()
e.notifiers[name] = fn
}
func (e *Engine) Process(ctx context.Context, result models.ProbeResult, asset models.Asset) error {
e.mu.Lock()
defer e.mu.Unlock()
    state, seen := e.assetState[result.AssetID]
    state.LastProbeTime = result.Timestamp
    // First observation of this asset: record its state without alerting,
    // so a healthy service doesn't trigger a spurious "recovery" on startup.
    if !seen {
        state.IsUp = result.Success
        e.assetState[result.AssetID] = state
        return nil
    }
    // State hasn't changed? Nothing to do.
    if state.IsUp == result.Success {
        e.assetState[result.AssetID] = state
        return nil
    }
// Save probe event
if err := e.store.SaveProbeEvent(ctx, result); err != nil {
return err
}
if result.Success && !state.IsUp {
// Recovery!
return e.handleRecovery(ctx, asset, state)
} else if !result.Success && state.IsUp {
// Outage!
return e.handleOutage(ctx, asset, state, result)
}
return nil
}
func (e *Engine) handleOutage(ctx context.Context, asset models.Asset, state AssetState, result models.ProbeResult) error {
incidentID, err := e.store.CreateIncident(ctx, asset.ID, fmt.Sprintf("Service %s is down", asset.Name))
if err != nil {
return err
}
state.IsUp = false
state.OpenIncidentID = incidentID
e.assetState[asset.ID] = state
notification := models.Notification{
AssetID: asset.ID,
AssetName: asset.Name,
Event: "DOWN",
Timestamp: result.Timestamp,
Details: result.Message,
}
return e.sendNotifications(ctx, notification)
}
func (e *Engine) handleRecovery(ctx context.Context, asset models.Asset, state AssetState) error {
if state.OpenIncidentID != "" {
e.store.CloseIncident(ctx, state.OpenIncidentID)
}
state.IsUp = true
state.OpenIncidentID = ""
e.assetState[asset.ID] = state
notification := models.Notification{
AssetID: asset.ID,
AssetName: asset.Name,
Event: "RECOVERY",
Timestamp: time.Now(),
Details: "Service has recovered",
}
return e.sendNotifications(ctx, notification)
}
func (e *Engine) sendNotifications(ctx context.Context, notification models.Notification) error {
for name, notifier := range e.notifiers {
if err := notifier(ctx, notification); err != nil {
fmt.Printf("notifier %s failed: %v\n", name, err)
}
}
return nil
}
Key insight: we track state in memory (assetState) for fast lookups but persist incidents to the database for durability. If the process restarts, we can rebuild the in-memory state from the open incidents.
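Here is a sketch of what that rebuild could look like as an extra method on the engine above (the article doesn't show this code, so the method name is an assumption):
// Add to internal/alert/engine.go (sketch)
// RestoreState rebuilds the in-memory map from incidents that are still
// open in the database. Call it once at startup, before the scheduler
// starts feeding results into Process.
func (e *Engine) RestoreState(ctx context.Context) error {
    incidents, err := e.store.GetOpenIncidents(ctx)
    if err != nil {
        return err
    }

    e.mu.Lock()
    defer e.mu.Unlock()
    for _, inc := range incidents {
        // An open incident means the asset was down when we last looked.
        e.assetState[inc.AssetID] = AssetState{
            IsUp:           false,
            OpenIncidentID: inc.ID,
        }
    }
    return nil
}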
Sending Notifications
When something breaks, people need to know, so we send notifications to the communication channels the team already uses.
Let's define our Teams notifier:
// internal/notifier/teams.go
package notifier
import (
"bytes"
"context"
"encoding/json"
"fmt"
"net/http"
"time"
"github.com/yourname/status/internal/models"
)
type TeamsNotifier struct {
webhookURL string
client *http.Client
}
func NewTeamsNotifier(webhookURL string) *TeamsNotifier {
return &TeamsNotifier{
webhookURL: webhookURL,
client: &http.Client{Timeout: 10 * time.Second},
}
}
func (t *TeamsNotifier) Notify(ctx context.Context, n models.Notification) error {
emoji := "🟢"
if n.Event == "DOWN" {
emoji = "🔴"
}
card := map[string]interface{}{
"type": "message",
"attachments": []map[string]interface{}{
{
"contentType": "application/vnd.microsoft.card.adaptive",
"content": map[string]interface{}{
"$schema": "http://adaptivecards.io/schemas/adaptive-card.json",
"type": "AdaptiveCard",
"version": "1.4",
"body": []map[string]interface{}{
{
"type": "TextBlock",
"text": fmt.Sprintf("%s %s - %s", emoji, n.AssetName, n.Event),
"weight": "Bolder",
"size": "Large",
},
{
"type": "FactSet",
"facts": []map[string]interface{}{
{"title": "Service", "value": n.AssetName},
{"title": "Status", "value": n.Event},
{"title": "Time", "value": n.Timestamp.Format(time.RFC1123)},
{"title": "Details", "value": n.Details},
},
},
},
},
},
},
}
body, _ := json.Marshal(card)
req, _ := http.NewRequestWithContext(ctx, "POST", t.webhookURL, bytes.NewReader(body))
req.Header.Set("Content-Type", "application/json")
resp, err := t.client.Do(req)
if err != nil {
return err
}
defer resp.Body.Close()
if resp.StatusCode >= 300 {
return fmt.Errorf("Teams webhook returned %d", resp.StatusCode)
}
return nil
}
Teams uses Adaptive Cards for rich formatting.
You can define additional notifiers for other communication channels, e.g. Slack, Discord, etc.
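For example, a minimal Slack notifier using Slack's incoming-webhook "text" payload might look like this (a sketch; the file and wiring are assumptions, not part of the article's code):
// internal/notifier/slack.go (sketch)
package notifier

import (
    "bytes"
    "context"
    "encoding/json"
    "fmt"
    "net/http"
    "time"

    "github.com/yourname/status/internal/models"
)

type SlackNotifier struct {
    webhookURL string
    client     *http.Client
}

func NewSlackNotifier(webhookURL string) *SlackNotifier {
    return &SlackNotifier{
        webhookURL: webhookURL,
        client:     &http.Client{Timeout: 10 * time.Second},
    }
}

func (s *SlackNotifier) Notify(ctx context.Context, n models.Notification) error {
    emoji := ":large_green_circle:"
    if n.Event == "DOWN" {
        emoji = ":red_circle:"
    }

    // Slack incoming webhooks accept a simple {"text": "..."} payload.
    payload := map[string]string{
        "text": fmt.Sprintf("%s *%s* - %s\n%s", emoji, n.AssetName, n.Event, n.Details),
    }
    body, err := json.Marshal(payload)
    if err != nil {
        return err
    }

    req, err := http.NewRequestWithContext(ctx, http.MethodPost, s.webhookURL, bytes.NewReader(body))
    if err != nil {
        return err
    }
    req.Header.Set("Content-Type", "application/json")

    resp, err := s.client.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    if resp.StatusCode >= 300 {
        return fmt.Errorf("slack webhook returned %d", resp.StatusCode)
    }
    return nil
}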
The REST API
We need endpoints to query the status of the services we are monitoring. For this, we will be using Chi, which is a lightweight router that supports route parameters like /assets/{id}.
Let's define the API handlers:
// internal/api/handlers.go
package api
import (
"encoding/json"
"net/http"
"github.com/go-chi/chi/v5"
"github.com/go-chi/chi/v5/middleware"
"github.com/yourname/status/internal/store"
)
type Server struct {
store store.Store
mux *chi.Mux
}
func NewServer(s store.Store) *Server {
srv := &Server{store: s, mux: chi.NewRouter()}
srv.mux.Use(middleware.Logger)
srv.mux.Use(middleware.Recoverer)
srv.mux.Route("/api", func(r chi.Router) {
r.Get("/health", srv.health)
r.Get("/assets", srv.listAssets)
r.Get("/assets/{id}/events", srv.getAssetEvents)
r.Get("/incidents", srv.listIncidents)
})
return srv
}
func (s *Server) ServeHTTP(w http.ResponseWriter, r *http.Request) {
s.mux.ServeHTTP(w, r)
}
func (s *Server) health(w http.ResponseWriter, r *http.Request) {
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(map[string]string{"status": "healthy"})
}
func (s *Server) listAssets(w http.ResponseWriter, r *http.Request) {
assets, err := s.store.GetAssets(r.Context())
if err != nil {
http.Error(w, err.Error(), 500)
return
}
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(assets)
}
func (s *Server) getAssetEvents(w http.ResponseWriter, r *http.Request) {
id := chi.URLParam(r, "id")
events, _ := s.store.GetProbeEvents(r.Context(), id, 100)
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(events)
}
func (s *Server) listIncidents(w http.ResponseWriter, r *http.Request) {
incidents, _ := s.store.GetOpenIncidents(r.Context())
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(incidents)
}
The code above defines a small HTTP API server that exposes four read-only endpoints:
GET /api/health - Health check (is the service running?)
GET /api/assets - List all monitored services
GET /api/assets/{id}/events - Get probe history for a specific service
GET /api/incidents - List open incidents
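These handlers, together with the alert engine, depend on a store.Store interface that we never spelled out. Inferring from the methods called throughout this article, it might look roughly like this sketch (not the definitive implementation):
// internal/store/store.go (sketch)
package store

import (
    "context"

    "github.com/yourname/status/internal/models"
)

// Store is the persistence contract used by the API handlers and the alert engine.
type Store interface {
    // Assets
    GetAssets(ctx context.Context) ([]models.Asset, error)

    // Probe history
    SaveProbeEvent(ctx context.Context, result models.ProbeResult) error
    GetProbeEvents(ctx context.Context, assetID string, limit int) ([]models.ProbeResult, error)

    // Incidents
    CreateIncident(ctx context.Context, assetID, summary string) (string, error)
    CloseIncident(ctx context.Context, incidentID string) error
    GetOpenIncidents(ctx context.Context) ([]models.Incident, error)
}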
Dockerizing the Application
Dockerizing the application is straightforward since Go compiles to a single static binary. We use a multi-stage build to keep the final image small:
# Dockerfile
FROM golang:1.24-alpine AS builder
WORKDIR /app
RUN apk add --no-cache git
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o statusd ./cmd/statusd/
FROM alpine:latest
WORKDIR /app
RUN apk --no-cache add ca-certificates
COPY --from=builder /app/statusd .
COPY entrypoint.sh .
RUN chmod +x /app/entrypoint.sh
EXPOSE 8080
ENTRYPOINT ["/app/entrypoint.sh"]
The builder stage compiles the code. The final stage is just Alpine plus our binary—typically under 20MB.
The entrypoint script builds the database connection string from environment variables:
#!/bin/sh
# entrypoint.sh
DB_HOST=${DB_HOST:-localhost}
DB_PORT=${DB_PORT:-5432}
DB_USER=${DB_USER:-status}
DB_PASSWORD=${DB_PASSWORD:-status}
DB_NAME=${DB_NAME:-status_db}
DB_CONN_STRING="postgres://${DB_USER}:${DB_PASSWORD}@${DB_HOST}:${DB_PORT}/${DB_NAME}"
exec ./statusd \
-manifest /app/config/manifest.json \
-notifiers /app/config/notifiers.json \
-db "$DB_CONN_STRING" \
-workers 4 \
-api-port 8080
Docker Compose: Putting It All Together
One file to rule them all:
# docker-compose.yml
version: "3.8"
services:
postgres:
image: postgres:15-alpine
container_name: status_postgres
environment:
POSTGRES_USER: status
POSTGRES_PASSWORD: changeme
POSTGRES_DB: status_db
volumes:
- postgres_data:/var/lib/postgresql/data
- ./migrations:/docker-entrypoint-initdb.d
healthcheck:
test: ["CMD-SHELL", "pg_isready -U status"]
interval: 10s
timeout: 5s
retries: 5
networks:
- status_network
statusd:
build: .
container_name: status_app
environment:
- DB_HOST=postgres
- DB_PORT=5432
- DB_USER=status
- DB_PASSWORD=changeme
- DB_NAME=status_db
volumes:
- ./config:/app/config:ro
depends_on:
postgres:
condition: service_healthy
networks:
- status_network
prometheus:
image: prom/prometheus:latest
container_name: status_prometheus
volumes:
- ./docker/prometheus.yml:/etc/prometheus/prometheus.yml
- prometheus_data:/prometheus
networks:
- status_network
depends_on:
- statusd
grafana:
image: grafana/grafana:latest
container_name: status_grafana
environment:
GF_SECURITY_ADMIN_USER: admin
GF_SECURITY_ADMIN_PASSWORD: admin
volumes:
- grafana_data:/var/lib/grafana
networks:
- status_network
depends_on:
- prometheus
nginx:
image: nginx:alpine
container_name: status_nginx
volumes:
- ./docker/nginx/nginx.conf:/etc/nginx/nginx.conf:ro
- ./docker/nginx/conf.d:/etc/nginx/conf.d:ro
ports:
- "80:80"
depends_on:
- statusd
- grafana
- prometheus
networks:
- status_network
networks:
status_network:
driver: bridge
volumes:
postgres_data:
prometheus_data:
grafana_data:
A few things to note:
- PostgreSQL healthcheck: the statusd service waits until Postgres is actually ready, not just started. This prevents "connection refused" errors on first boot.
- Config mount: we mount ./config as read-only. Edit your manifest locally, and the running container sees the changes.
- Nginx: routes external traffic to the Grafana and Prometheus dashboards.
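One piece we haven't shown is the /metrics endpoint Prometheus scrapes from StatusD. Here is a minimal sketch using the official prometheus/client_golang library; the package layout and metric names are assumptions, not the article's code:
// internal/metrics/metrics.go (sketch)
package metrics

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"

    "github.com/yourname/status/internal/models"
)

var (
    probeLatency = promauto.NewHistogramVec(prometheus.HistogramOpts{
        Name: "statusd_probe_latency_ms",
        Help: "Probe latency in milliseconds per asset.",
    }, []string{"asset_id"})

    probeUp = promauto.NewGaugeVec(prometheus.GaugeOpts{
        Name: "statusd_asset_up",
        Help: "1 if the last probe succeeded, 0 otherwise.",
    }, []string{"asset_id"})
)

// Record updates metrics from a probe result; call it from the scheduler's JobHandler.
func Record(result models.ProbeResult) {
    probeLatency.WithLabelValues(result.AssetID).Observe(float64(result.LatencyMs))
    if result.Success {
        probeUp.WithLabelValues(result.AssetID).Set(1)
    } else {
        probeUp.WithLabelValues(result.AssetID).Set(0)
    }
}

// Handler returns the endpoint Prometheus scrapes.
func Handler() http.Handler {
    return promhttp.Handler()
}
You would then mount Handler() at /metrics on the HTTP server and point docker/prometheus.yml at it.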
Configuration Files
The application reads two config files: manifest.json and notifiers.json.
- The manifest.json file lists the assets we want to monitor. Each asset needs an ID, a probe type, and an address. The intervalSeconds field controls how often we check (60 = once per minute), and expectedStatusCodes lets you define what "healthy" means. Some endpoints return 301 redirects or 204 No Content, and that's fine. (A sketch of the loader for this file follows the example below.)
// config/manifest.json
{
"assets": [
{
"id": "api-prod",
"assetType": "http",
"name": "Production API",
"address": "https://api.example.com/health",
"intervalSeconds": 60,
"timeoutSeconds": 5,
"expectedStatusCodes": [200],
"metadata": {
"env": "prod",
"owner": "platform-team"
}
},
{
"id": "web-prod",
"assetType": "http",
"name": "Production Website",
"address": "https://www.example.com",
"intervalSeconds": 120,
"timeoutSeconds": 10,
"expectedStatusCodes": [200, 301]
}
]
}
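The manifest maps directly onto the Asset struct from earlier. Here is a sketch of how internal/config/manifest.go (listed in the project structure but not shown above) might load it; the exact shape is an assumption:
// internal/config/manifest.go (sketch)
package config

import (
    "encoding/json"
    "os"

    "github.com/yourname/status/internal/models"
)

type Manifest struct {
    Assets []models.Asset `json:"assets"`
}

// LoadManifest reads manifest.json into the Asset structs defined earlier.
func LoadManifest(path string) (*Manifest, error) {
    data, err := os.ReadFile(path)
    if err != nil {
        return nil, err
    }
    var m Manifest
    if err := json.Unmarshal(data, &m); err != nil {
        return nil, err
    }
    return &m, nil
}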
- The notifiers.json file controls where alerts go. You define notification channels (Teams, Slack), then set policies for which channels fire on which events. throttleSeconds: 300 means you won't get spammed more than once every 5 minutes for the same issue. (A sketch of how this throttle might be enforced follows the example below.)
// config/notifiers.json
{
"notifiers": {
"teams": {
"type": "teams",
"webhookUrl": "https://outlook.office.com/webhook/your-webhook-url"
}
},
"notificationPolicy": {
"onDown": ["teams"],
"onRecovery": ["teams"],
"throttleSeconds": 300,
"repeatAlerts": false
}
}
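The throttle itself has to be enforced somewhere in code. One way to do it is a wrapper around the NotifierFunc type from the alert engine; this is a sketch, the article doesn't show this logic and the helper name is an assumption:
// internal/alert/throttle.go (sketch)
package alert

import (
    "context"
    "sync"
    "time"

    "github.com/yourname/status/internal/models"
)

// Throttled wraps a NotifierFunc so the same asset/event pair fires at most
// once per window (e.g. 300 seconds from notifiers.json).
func Throttled(inner NotifierFunc, window time.Duration) NotifierFunc {
    var mu sync.Mutex
    lastSent := make(map[string]time.Time)

    return func(ctx context.Context, n models.Notification) error {
        key := n.AssetID + ":" + n.Event

        mu.Lock()
        last, seen := lastSent[key]
        if seen && time.Since(last) < window {
            mu.Unlock()
            return nil // still inside the throttle window: drop the alert
        }
        lastSent[key] = time.Now()
        mu.Unlock()

        return inner(ctx, n)
    }
}
You could then register a wrapped notifier with something like engine.RegisterNotifier("teams", alert.Throttled(teamsNotifier.Notify, 300*time.Second)).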
Running It
docker-compose up -d
That's it. Five services spin up:
- PostgreSQL stores your data
- StatusD probes your services
- Prometheus collects metrics
- Grafana displays dashboards (http://localhost:80)
- Nginx routes everything
Check the logs:
docker logs -f status_app
You should see:
Loading assets manifest...
Loaded 2 assets
Loading notifiers config...
Loaded 1 notifiers
Connecting to database...
Starting scheduler...
[✓] Production API (api-prod): 45ms
[✓] Production Website (web-prod): 120ms
Summary
You now have a monitoring system that:
- Reads services from a JSON config
- Probes them on a schedule using a worker pool
- Detects outages and creates incidents
- Sends notifications to Teams/Slack
- Exposes metrics for Prometheus
- Runs in Docker with one command
This tutorial gets you to a working monitoring system, but there is more under the hood that we glossed over. In part two we will cover:
- Circuit breakers to prevent cascading failures when a service is flapping
- Multi-tier escalation when the on-call engineer doesn't respond
- Alert deduplication to prevent notification storms
- Adaptive probe intervals that check more frequently during incidents
- Hot-reloading configuration without restarting the service
- SLA calculations and compliance tracking