DEV Community

ahmet gedik
Monitoring and Alerting for Video Platform Infrastructure with Go

Monitoring Across Time Zones

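TrendVidStream serves the UAE (UTC+4), Finland (UTC+2), the Czech Republic (UTC+1), Denmark (UTC+1), Belgium (UTC+1), Switzerland (UTC+1), the UK (UTC+0), and the US (UTC-5 to UTC-8). These are standard-time offsets; all of these countries except the UAE observe daylight saving time, which shifts the offset by an hour in summer. A cron failure in Finland at 03:00 local time is 01:00 UTC, so monitoring must alert before users notice stale content.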

Health Check Endpoint

package main

import (
    "database/sql"
    "encoding/json"
    "net/http"
    "time"
)

type HealthResponse struct {
    Overall string            `json:"overall"`
    Checks  map[string]string `json:"checks"`
    Uptime  string            `json:"uptime"`
    Regions []string          `json:"regions_active"`
}

var (
    startTime     = time.Now()
    activeRegions = []string{"AE", "FI", "CZ", "DK", "BE", "CH", "GB", "US"}
)

func healthHandler(db *sql.DB) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        checks := map[string]string{}
        overall := "ok"

        if err := db.Ping(); err != nil {
            checks["database"] = "error: " + err.Error()
            overall = "degraded"
        } else {
            checks["database"] = "ok"
        }

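        for _, region := range activeRegions {
            var count int
            // Count rows fetched in the last 8 hours (SQLite date syntax).
            err := db.QueryRow(
                "SELECT COUNT(*) FROM videos WHERE region=? AND fetched_at > datetime('now','-8 hours')",
                region,
            ).Scan(&count)
            switch {
            case err != nil:
                // A failed query is reported as an error rather than as stale data.
                checks["region_"+region] = "error: " + err.Error()
                overall = "degraded"
            case count == 0:
                checks["region_"+region] = "stale"
                overall = "degraded"
            default:
                checks["region_"+region] = "ok"
            }
        }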

        w.Header().Set("Content-Type", "application/json")
        if overall != "ok" {
            w.WriteHeader(http.StatusServiceUnavailable)
        }
        json.NewEncoder(w).Encode(HealthResponse{
            Overall: overall, Checks: checks,
            Uptime: time.Since(startTime).Round(time.Second).String(),
            Regions: activeRegions,
        })
    }
}
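
An external checker can do more with this endpoint than look at the status code, because the body says which regions are affected. The following is a minimal sketch of the consuming side, assuming the JSON shape produced by the handler above; `failingRegions` and the sample payload are hypothetical names and data used only for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// healthResponse mirrors the two JSON fields of the health endpoint
// that this sketch needs.
type healthResponse struct {
	Overall string            `json:"overall"`
	Checks  map[string]string `json:"checks"`
}

// failingRegions returns the sorted region codes whose check is not "ok".
// Checks without the "region_" prefix (such as "database") are skipped.
func failingRegions(body []byte) ([]string, error) {
	var h healthResponse
	if err := json.Unmarshal(body, &h); err != nil {
		return nil, err
	}
	var out []string
	for name, status := range h.Checks {
		if strings.HasPrefix(name, "region_") && status != "ok" {
			out = append(out, strings.TrimPrefix(name, "region_"))
		}
	}
	sort.Strings(out) // map iteration order is random, so sort for stable output
	return out, nil
}

func main() {
	// Hypothetical payload in the shape the handler emits.
	body := []byte(`{"overall":"degraded","checks":{"database":"ok","region_FI":"stale","region_AE":"ok","region_CZ":"stale"}}`)
	regions, err := failingRegions(body)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(regions) // prints [CZ FI]
}
```

Sorting the result keeps alert text stable between polls, which makes duplicate alerts easier to suppress.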

Prometheus Exporter

import "github.com/prometheus/client_golang/prometheus"

var (
    fetchedTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
        Name: "tvs_videos_fetched_total",
        Help: "Total videos fetched per region.",
    }, []string{"region"})

    fetchDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
        Name: "tvs_fetch_duration_seconds", Help: "Fetch duration per region.",
        Buckets: prometheus.DefBuckets,
    }, []string{"region"})

    regionFreshness = prometheus.NewGaugeVec(prometheus.GaugeOpts{
        Name: "tvs_region_freshness_seconds",
        Help: "Seconds since last successful fetch per region.",
    }, []string{"region"})
)

func init() { prometheus.MustRegister(fetchedTotal, fetchDuration, regionFreshness) }
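
The post does not show where `tvs_region_freshness_seconds` gets its value. One possible approach, sketched here with the standard library only, is to record the last successful fetch per region and compute the age when metrics are collected. `freshnessTracker`, `MarkSuccess`, `Age`, `isStale`, and `demoNow` are hypothetical names, not part of the original code; the 8-hour threshold is taken from the health check.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// staleAfter matches the 8-hour window used by the health check.
const staleAfter = 8 * time.Hour

// demoNow is a fixed reference instant so the example is reproducible.
var demoNow = time.Date(2024, time.January, 15, 12, 0, 0, 0, time.UTC)

// freshnessTracker remembers the last successful fetch per region.
type freshnessTracker struct {
	mu   sync.Mutex
	last map[string]time.Time
}

func newFreshnessTracker() *freshnessTracker {
	return &freshnessTracker{last: make(map[string]time.Time)}
}

// MarkSuccess records a successful fetch for region at time t.
func (f *freshnessTracker) MarkSuccess(region string, t time.Time) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.last[region] = t
}

// Age returns the time since the last success, or false if the region
// has never been fetched successfully.
func (f *freshnessTracker) Age(region string, now time.Time) (time.Duration, bool) {
	f.mu.Lock()
	defer f.mu.Unlock()
	t, ok := f.last[region]
	if !ok {
		return 0, false
	}
	return now.Sub(t), true
}

// isStale reports whether an age in seconds exceeds the 8-hour threshold.
func isStale(ageSeconds float64) bool {
	return ageSeconds > staleAfter.Seconds()
}

func main() {
	tr := newFreshnessTracker()
	tr.MarkSuccess("FI", demoNow.Add(-9*time.Hour))
	tr.MarkSuccess("AE", demoNow.Add(-30*time.Minute))

	for _, region := range []string{"FI", "AE"} {
		age, _ := tr.Age(region, demoNow)
		// With the exporter above, this value would be passed to
		// regionFreshness.WithLabelValues(region).Set(age.Seconds()).
		fmt.Printf("%s age=%.0fs stale=%v\n", region, age.Seconds(), isStale(age.Seconds()))
	}
}
```

Keeping timestamps rather than a running counter means the age is always computed from the clock at read time, so a stalled fetcher cannot leave the gauge frozen at a healthy value.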

Uptime Monitor with Per-Timezone Context

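import (
    "fmt"
    "net/http"
    "time"
)

// probe describes one regional endpoint and its standard-time UTC offset.
type probe struct {
    URL    string
    Region string
    UTCOff int
}

var probes = []probe{
    {"https://trendvidstream.com/?region=AE", "AE", +4},
    {"https://trendvidstream.com/?region=FI", "FI", +2},
    {"https://trendvidstream.com/?region=CZ", "CZ", +1},
    {"https://trendvidstream.com/?region=DK", "DK", +1},
    {"https://trendvidstream.com/?region=GB", "GB", 0},
}

// UptimeMonitor is not defined in this post; this shape is assumed from
// how m.client and m.slack are used below.
type UptimeMonitor struct {
    client *http.Client
    slack  string // assumed: Slack webhook URL handed to sendSlack
}

// Start launches one polling goroutine per probe. The goroutines never
// exit, so call Start once at program startup.
func (m *UptimeMonitor) Start(interval time.Duration) {
    for _, p := range probes {
        go func(p probe) {
            ticker := time.NewTicker(interval)
            defer ticker.Stop()
            for range ticker.C {
                start := time.Now()
                resp, err := m.client.Get(p.URL)
                latency := time.Since(start)

                status := 0 // stays 0 when the request itself failed
                if resp != nil {
                    status = resp.StatusCode
                    resp.Body.Close()
                }

                if err != nil || status >= 500 {
                    // Fixed offsets ignore daylight saving time, so the hour is approximate.
                    localHour := (time.Now().UTC().Hour() + p.UTCOff + 24) % 24
                    msg := fmt.Sprintf(
                        ":red_circle: *%s DOWN* (HTTP %d, %dms), local %02d:xx",
                        p.Region, status, latency.Milliseconds(), localHour,
                    )
                    sendSlack(m.slack, msg) // sendSlack is not shown in this post
                }
            }
        }(p)
    }
}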

Including local time in alerts helps on-call engineers judge urgency by showing whether the affected region is in peak hours: a UAE failure at 21:00 local (prime time) is more urgent than a Finnish failure at 04:00.
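
A fixed UTC offset is off by an hour whenever a region is on daylight saving time. If that matters for triage, the local hour can be derived from an IANA time zone instead. The sketch below is one way to do that with the standard library; the `regionZones` mapping, `localHour`, and the two sample instants are assumptions for illustration and are not part of the original monitor.

```go
package main

import (
	"fmt"
	"time"
	_ "time/tzdata" // embeds the zone database so LoadLocation works without system tzdata
)

// regionZones maps region codes to IANA zone names. The mapping is an
// assumption for illustration; the US spans several zones and is omitted.
var regionZones = map[string]string{
	"AE": "Asia/Dubai",
	"FI": "Europe/Helsinki",
	"CZ": "Europe/Prague",
	"DK": "Europe/Copenhagen",
	"GB": "Europe/London",
}

// Sample instants: the same UTC hour in winter and in summer.
var (
	winter = time.Date(2024, time.January, 15, 1, 0, 0, 0, time.UTC)
	summer = time.Date(2024, time.July, 1, 1, 0, 0, 0, time.UTC)
)

// localHour returns the hour of day in the region's zone at instant t,
// taking daylight saving time into account.
func localHour(region string, t time.Time) (int, error) {
	zone, ok := regionZones[region]
	if !ok {
		return 0, fmt.Errorf("no zone configured for region %q", region)
	}
	loc, err := time.LoadLocation(zone)
	if err != nil {
		return 0, err
	}
	return t.In(loc).Hour(), nil
}

func main() {
	for _, t := range []time.Time{winter, summer} {
		h, err := localHour("FI", t)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("%s UTC -> %02d:00 in Helsinki\n", t.Format("2006-01-02 15:04"), h)
	}
}
```

For Finland this yields 03:00 in January and 04:00 in July for the same 01:00 UTC instant, whereas a fixed +2 offset would report 03:00 in both cases.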

Grafana Dashboard Queries

| Panel | PromQL |
| --- | --- |
| Videos fetched rate (per region) | `rate(tvs_videos_fetched_total[5m])` |
| P95 fetch latency | `histogram_quantile(0.95, rate(tvs_fetch_duration_seconds_bucket[5m]))` |
| Region freshness | `tvs_region_freshness_seconds`, alert if > 28800 (8h) |
| Quota remaining | `10000 - sum(tvs_api_quota_used)` |

Region freshness is the most valuable metric here. When a region goes stale, for example because the API quota is exhausted, a Grafana alert on this gauge pages on-call before any user reports missing content on TrendVidStream.


This article is part of the Building TrendVidStream series. Check out TrendVidStream to see these techniques in action.
