Jakub Korečko

Part 5: Full Observability for Free — Grafana Alloy, Prometheus, and Loki

What you'll learn:

  • Why Grafana Alloy replaces both Prometheus Agent and Promtail
  • What metrics are scraped from each service and why
  • How cardinality management keeps you inside Grafana Cloud's free tier
  • How Docker log discovery works with the Alloy pipeline
  • What to build in Grafana dashboards for this stack

One Container for Everything

Before Grafana Alloy, a standard observability setup required:

  • Prometheus or Prometheus Agent for metrics scraping and remote write
  • Promtail for log collection and Loki shipping
  • Separate configs, separate containers, separate log rotation concerns

Grafana Alloy, the successor to Grafana Agent, takes over both jobs. One container and one config file (grafana_alloy.alloy) handle metrics and logs in a single pipeline.

Why not a full Prometheus stack? Self-hosted Prometheus needs storage, retention config, and an Alertmanager. For a single server, that's more infrastructure than the apps it monitors. Grafana Cloud's free tier gives you 14-day metric retention, 30-day log retention, and managed alerting, without running any of that yourself.

The free tier limits that matter:

  • 10,000 active metric series
  • 50GB logs/month
  • 14 days metric retention

Aggressive metric filtering (covered below) keeps this stack well under those limits.


The Alloy Config Structure

Alloy uses its own configuration syntax, which was called River in Grafana Agent Flow and resembles HCL. The full config is at apps/monitoring/grafana_alloy.alloy.

The pipeline follows this pattern:

prometheus.scrape → prometheus.relabel → prometheus.remote_write
loki.source       → loki.process      → loki.write

Each component is declared with a name and wired together via forward_to references. This makes the data flow explicit and easy to trace.


Metrics: What Gets Scraped

Traefik

prometheus.scrape "traefik" {
  targets = [{ __address__ = "traefik:8899" }]
  forward_to      = [prometheus.relabel.traefik.receiver]
  scrape_interval = "30s"
}

prometheus.relabel "traefik" {
  forward_to = [prometheus.remote_write.default.receiver]

  rule {
    source_labels = ["__name__"]
    regex = "(traefik_open_connections|traefik_entrypoint_requests_total|traefik_entrypoint_request_duration_seconds_sum|traefik_entrypoint_request_duration_seconds_bucket|traefik_entrypoint_request_duration_seconds_count|traefik_service_requests_total|traefik_service_request_duration_seconds_bucket|traefik_service_request_duration_seconds_sum|traefik_service_request_duration_seconds_count|traefik_service_requests_bytes_total|traefik_service_responses_bytes_total)"
    action = "keep"
  }
}

Traefik exposes metrics on port 8899 (the metrics entrypoint defined in static config). Raw Traefik output contains more than 50 metric series. The relabeling rule keeps only 11 metric names and drops everything else:

| Metric | What it tells you |
| --- | --- |
| traefik_open_connections | Active connections right now |
| traefik_entrypoint_requests_total | Total requests per entrypoint |
| traefik_entrypoint_request_duration_seconds_* | Latency distribution (histogram) |
| traefik_service_requests_total | Total requests per backend service |
| traefik_service_request_duration_seconds_* | Per-service latency histogram |
| traefik_service_requests_bytes_total | Request bytes per service |
| traefik_service_responses_bytes_total | Response bytes per service |
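
The keep rule can be sanity-checked offline before deploying. The following is a minimal Python sketch, not part of the original config. It assumes Prometheus-style relabeling, where the regex has to match the whole metric name (hence fullmatch), and the sample list of metric names is illustrative rather than real scrape output.

```python
import re

# The regex from the Traefik keep rule, split across lines for readability.
# Relabel regexes are fully anchored, so re.fullmatch is the closest
# Python equivalent.
KEEP = re.compile(
    "(traefik_open_connections"
    "|traefik_entrypoint_requests_total"
    "|traefik_entrypoint_request_duration_seconds_sum"
    "|traefik_entrypoint_request_duration_seconds_bucket"
    "|traefik_entrypoint_request_duration_seconds_count"
    "|traefik_service_requests_total"
    "|traefik_service_request_duration_seconds_bucket"
    "|traefik_service_request_duration_seconds_sum"
    "|traefik_service_request_duration_seconds_count"
    "|traefik_service_requests_bytes_total"
    "|traefik_service_responses_bytes_total)"
)


def kept(metric_names):
    """Return only the metric names that the keep rule would let through."""
    return [name for name in metric_names if KEEP.fullmatch(name)]


# Illustrative sample; the third name is hypothetical and shows that a
# longer name sharing a kept prefix is still dropped.
sample = [
    "traefik_open_connections",
    "traefik_service_requests_total",
    "traefik_service_requests_total_extra",
    "traefik_config_reloads_total",
    "go_goroutines",
]

print(kept(sample))
```

Running a scrape dump through a check like this is a cheap way to see what a regex change would ship before it affects the active series count.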

This is enough to build:

  • Request rate dashboards (requests/sec per service)
  • Latency percentile panels (P50, P95, P99)
  • Error rate panels (compare 2xx vs 4xx/5xx from Traefik access logs)
  • Active connection gauges

Host Metrics (node-exporter)

prometheus.scrape "node_exporter" {
  targets = [{ __address__ = "node-exporter:9100" }]
  forward_to      = [prometheus.relabel.node_exporter.receiver]
  scrape_interval = "30s"
}

prometheus.relabel "node_exporter" {
  forward_to = [prometheus.remote_write.default.receiver]

  rule {
    source_labels = ["__name__"]
    regex = "(node_cpu_seconds_total|node_memory_(MemTotal|MemFree|MemAvailable|Buffers|Cached|SReclaimable|SwapTotal|SwapFree)_bytes|node_filesystem_(size|avail)_bytes|node_network_(receive|transmit)_bytes_total|node_load(1|5|15)|node_time_seconds|node_boot_time_seconds|node_disk_(read_bytes_total|written_bytes_total|io_time_seconds_total))"
    action = "keep"
  }
}

node-exporter by default exposes hundreds of metrics covering every kernel subsystem. The regex keep-rule filters to the essentials:

| Category | Metrics kept |
| --- | --- |
| CPU | node_cpu_seconds_total (all modes) |
| Memory | Total, free, available, buffers, cached, swap |
| Filesystem | Size and available bytes per mount |
| Network | Receive/transmit bytes per interface |
| Load | 1m, 5m, 15m load averages |
| Disk I/O | Read bytes, write bytes, I/O time per device |
| System | Uptime (boot time), clock |

The node-exporter compose file also narrows what the filesystem and network collectors report:

# apps/monitoring/node_exporter.yaml
command:
  - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|run|var/lib/docker).*'
  - '--collector.filesystem.fs-types-exclude=^(sysfs|procfs|autofs|cgroup|devtmpfs|devpts|tmpfs|nsfs|overlay|securityfs|tracefs)$'
  - '--collector.netdev.device-exclude=^(veth|docker|br-).*'

These flags exclude Docker-internal network interfaces (veth pairs, docker0, and br- bridges) and virtual filesystems (tmpfs, overlay, and so on). This prevents a cardinality explosion from Docker's ephemeral per-container veth interfaces: each container creates a new veth, and without filtering you'd accumulate thousands of metric series over time.
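
To see which interfaces survive the netdev exclusion, here is a small Python sketch. It assumes the flag value is applied as an ordinary regex match against each interface name, and the interface names in the list are hypothetical examples of what a Docker host might show.

```python
import re

# Same pattern as the --collector.netdev.device-exclude flag.
DEVICE_EXCLUDE = re.compile(r"^(veth|docker|br-).*")


def reported_devices(devices):
    """Return the interfaces that are NOT excluded by the pattern."""
    return [dev for dev in devices if not DEVICE_EXCLUDE.search(dev)]


# Hypothetical interface names on a Swarm node.
host_devices = [
    "lo",
    "eth0",
    "docker0",
    "docker_gwbridge",
    "br-5f3e9a1c2b7d",
    "veth1a2b3c4",
    "veth9f8e7d6",
]

print(reported_devices(host_devices))
```

Only the stable host interfaces remain, so the number of network series no longer depends on how many containers have come and gone.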

Container Metrics (cAdvisor)

prometheus.relabel "cadvisor" {
  forward_to = [prometheus.remote_write.default.receiver]

  // Drop root/aggregate metrics (no container label)
  rule {
    source_labels = ["name"]
    regex = "^$"
    action = "drop"
  }

  // Extract container name from Docker Swarm task format
  rule {
    source_labels = ["name"]
    regex = "(.+)\\.(\\d+)\\.[a-zA-Z0-9]+$"
    target_label  = "container_name"
    replacement   = "$1.$2"
  }

  // Add service name label
  rule {
    source_labels = ["container_label_com_docker_swarm_service_name"]
    target_label  = "service_name"
  }

  // Keep only essential container metrics
  rule {
    source_labels = ["__name__"]
    regex = "(container_memory_usage_bytes|container_last_seen|container_cpu_user_seconds_total|container_network_(receive|transmit)_bytes_total|container_memory_cache|container_fs_(reads|writes)_bytes_total|container_cpu_usage_seconds_total)"
    action = "keep"
  }

  // Drop unused labels
  rule {
    regex  = "(__name__|container_name|service_name|job|instance)"
    action = "labelkeep"
  }
}

cAdvisor exports metrics for every container on the system, plus root-level aggregate stats that carry no name label. The first rule drops those aggregates. The final labelkeep rule removes the verbose Docker label metadata that cAdvisor attaches, keeping only the labels that matter for querying. Be aware that labelkeep also removes cAdvisor's own labels such as interface and device, so a container with several network interfaces or disks may end up with series that collide after relabeling.

The result: per-container CPU, memory, network, and disk I/O, labeled with service name and container name.
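
The order of the rules matters, so it can help to walk one series through them. The sketch below is a simplified Python imitation of the five cAdvisor rules, not Alloy's implementation. It assumes fully anchored regexes, and the sample label values (task ID, image, job, instance) are made up for illustration.

```python
import re

SWARM_NAME = re.compile(r"(.+)\.(\d+)\.[a-zA-Z0-9]+$")
KEEP_METRICS = re.compile(
    "(container_memory_usage_bytes|container_last_seen"
    "|container_cpu_user_seconds_total"
    "|container_network_(receive|transmit)_bytes_total"
    "|container_memory_cache|container_fs_(reads|writes)_bytes_total"
    "|container_cpu_usage_seconds_total)"
)
KEEP_LABELS = re.compile("(__name__|container_name|service_name|job|instance)")


def relabel(labels):
    """Apply the five rules in order; return None if the series is dropped."""
    labels = dict(labels)
    # Rule 1: drop series whose name label is missing or empty.
    if labels.get("name", "") == "":
        return None
    # Rule 2: container_name = "$1.$2" when the Swarm task pattern matches.
    match = SWARM_NAME.fullmatch(labels["name"])
    if match:
        labels["container_name"] = f"{match.group(1)}.{match.group(2)}"
    # Rule 3: copy the Swarm service label into service_name.
    service = labels.get("container_label_com_docker_swarm_service_name", "")
    if service:
        labels["service_name"] = service
    # Rule 4: keep only the listed metric names.
    if not KEEP_METRICS.fullmatch(labels["__name__"]):
        return None
    # Rule 5: labelkeep, which removes every label not on the list.
    return {k: v for k, v in labels.items() if KEEP_LABELS.fullmatch(k)}


# Hypothetical series as cAdvisor might expose it.
series = {
    "__name__": "container_memory_usage_bytes",
    "name": "traefik_traefik.1.k3j2h1g0f9e8d7c6b5a4",
    "container_label_com_docker_swarm_service_name": "traefik_traefik",
    "image": "traefik:v3",
    "job": "cadvisor",
    "instance": "cadvisor:8080",
}

print(relabel(series))
print(relabel({"__name__": "container_memory_usage_bytes", "name": ""}))
```

The first call leaves five labels on the series. The second returns None because the series has an empty name label and is dropped by the first rule.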


Logs: Docker and System

System Logs

local.file_match "system" {
  path_targets = [{
    __path__ = "/var/log/**/*log",
    job       = "varlogs",
  }]
}

loki.process "system" {
  forward_to = [loki.write.default.receiver]

  stage.drop {
    older_than = "1h0m0s"
  }
}

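Alloy reads all *log files under /var/log/ (mounted from the host as read-only). The stage.drop rule discards log entries older than 1 hour. The intent is to stop Alloy from re-shipping old logs after a restart, which would generate duplicate log entries in Grafana Cloud. Note that older_than compares against the entry's timestamp, and unless an earlier stage.timestamp parses one from the line, that is normally the time Alloy read it, so verify that the rule drops what you expect.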

Docker Container Logs

discovery.docker "docker_swarm" {
  host             = "unix:///var/run/docker.sock"
  refresh_interval = "5s"
}

discovery.relabel "docker_swarm" {
  targets = []

  rule {
    source_labels = ["__meta_docker_container_name"]
    regex         = "(.+)\\.(\\d+)\\.[a-zA-Z0-9]+$"
    target_label  = "container_name"
  }

  rule {
    source_labels = ["__meta_docker_service_name"]
    target_label  = "service_name"
  }

  rule {
    source_labels = ["__meta_docker_service_label_com_docker_stack_namespace"]
    target_label  = "stack_name"
  }

  rule {
    source_labels = ["__meta_docker_container_id"]
    target_label  = "__path__"
    replacement   = "/var/lib/docker/containers/$1/$1-json.log"
  }
}

loki.source.docker "docker_swarm" {
  host          = "unix:///var/run/docker.sock"
  targets       = discovery.docker.docker_swarm.targets
  forward_to    = [loki.process.docker_swarm.receiver]
  relabel_rules = discovery.relabel.docker_swarm.rules
}

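Docker Swarm container names follow the pattern stackname_servicename.replicanumber.taskid. The regex (.+)\.(\d+)\.[a-zA-Z0-9]+$ matches that pattern and leaves the random task ID outside the capture groups, so the resulting container_name stays stable and log streams from the same service don't fragment across container restarts. Because this rule sets no replacement, the default $1 applies and the label holds only the first capture group. The cAdvisor rule earlier uses "$1.$2" to keep the replica number as well.

The stack_name, service_name, and container_name labels are attached to every log stream, making Loki queries like {stack_name="traefik"} or {service_name="traefik_traefik"} work correctly.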
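
The effect of the replacement field on the container name rule can be illustrated in Python. This is a sketch under a few assumptions: relabel regexes are fully anchored, the replacement defaults to $1 when a rule does not set one, Docker reports container names with a leading slash, and the task ID in the sample is invented.

```python
import re

SWARM_NAME = re.compile(r"(.+)\.(\d+)\.[a-zA-Z0-9]+$")


def relabel_replace(value, replacement=r"\g<1>"):
    """Mimic a 'replace' rule: return the new label value, or None if the
    regex does not match (the target label is then left untouched).

    Python writes the relabel references $1 and $2 as \\g<1> and \\g<2>.
    """
    match = SWARM_NAME.fullmatch(value)
    return match.expand(replacement) if match else None


# Hypothetical Swarm task container name as reported by Docker.
name = "/traefik_traefik.1.k3j2h1g0f9e8d7c6b5a4"

print(relabel_replace(name))                     # default replacement ($1)
print(relabel_replace(name, r"\g<1>.\g<2>"))     # equivalent of "$1.$2"
print(relabel_replace("/standalone-container"))  # no match
```

With the default, every replica of a service shares one container_name. With "$1.$2" each replica keeps its own value. Which one is preferable depends on whether you want to tell replicas apart in Loki.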


cAdvisor and node-exporter Compose Files

Both exporters have carefully tuned compose configurations.

cAdvisor

# apps/monitoring/cadvisor.yaml (key sections)
services:
  cadvisor:
    image: ghcr.io/google/cadvisor:0.56.2
    command:
      - --docker_only=true
      - --housekeeping_interval=30s
      - --max_housekeeping_interval=35s
      - --global_housekeeping_interval=10m
      - --storage_duration=10m
    deploy:
      resources:
        limits:
          memory: 128M
        reservations:
          memory: 64M
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3

--docker_only=true restricts cAdvisor to Docker containers only (no system processes). --housekeeping_interval=30s matches the Alloy scrape interval — collecting at a higher frequency than the scrape interval provides no benefit. Memory is capped at 128M because cAdvisor has a known tendency to grow unbounded on busy systems.

The restart policy uses condition: on-failure with max_attempts: 3 rather than Swarm's default of any. Repeated failures indicate a persistent issue that warrants investigation rather than indefinite restart attempts.

node-exporter

services:
  node-exporter:
    image: prom/node-exporter:v1.10.2
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'

node-exporter needs read-only access to /proc, /sys, and the filesystem root to collect host metrics. It runs in the host PID namespace to see all processes. The :ro mounts ensure it can only read, never write.


Remote Write to Grafana Cloud

Both the metrics pipeline (prometheus.remote_write) and the logs pipeline (loki.write) ship to Grafana Cloud using basic auth:

prometheus.remote_write "default" {
  endpoint {
    url = "<YOUR_GRAFANA_CLOUD_PROMETHEUS_URL>/api/prom/push"

    basic_auth {
      username      = "<YOUR_GRAFANA_CLOUD_METRICS_USERNAME>"
      password_file = "/run/secrets/grafana_cloud_passwd"
    }
  }
}

loki.write "default" {
  endpoint {
    url = "<YOUR_GRAFANA_CLOUD_LOKI_URL>/loki/api/v1/push"

    basic_auth {
      username      = "<YOUR_GRAFANA_CLOUD_LOGS_USERNAME>"
      password_file = "/run/secrets/grafana_cloud_passwd"
    }
  }
}

The grafana_cloud_passwd is an API key from Grafana Cloud. It's stored encrypted in the Git repo (SOPS) and decrypted by SwarmCD at deploy time into a Docker secret. The same password is shared between the Prometheus and Loki write endpoints (Grafana Cloud uses the same credential for both).


Suggested Grafana Dashboard Panels

Once data is flowing to Grafana Cloud, here are the most useful panels to build:

Server Overview:

  • CPU usage (% of total, per-core breakdown)
  • Memory usage (total / available / used)
  • Disk usage per mount point
  • Network traffic (bytes/sec in/out)
  • System uptime

Traefik:

  • Requests/sec per service (rate of traefik_service_requests_total)
  • Error rate (4xx + 5xx as % of total)
  • P95 latency per service (histogram from traefik_service_request_duration_seconds_bucket)
  • Active open connections
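
The P95 latency panel works by estimating a quantile from cumulative histogram buckets. The Python sketch below shows the idea with linear interpolation inside the bucket that contains the target rank, which is similar in spirit to PromQL's histogram_quantile but is not Prometheus's actual implementation. The bucket boundaries and counts are invented for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    The list must contain at least one finite bucket plus the +Inf bucket
    (math.inf), mirroring the le label of a Prometheus histogram.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            if count == prev_count:
                return bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical request counts per latency bucket (seconds) over one window.
sample = [(0.1, 50), (0.3, 90), (1.2, 98), (math.inf, 100)]

print(histogram_quantile(0.95, sample))  # roughly 0.86 seconds
print(histogram_quantile(0.99, sample))  # falls in +Inf, so 1.2 is reported
```

The estimate is only as precise as the bucket layout: with a wide 0.3 to 1.2 second bucket, the P95 can land anywhere in that range depending on the interpolation. That is worth remembering when reading the panel.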

Container Resources:

  • Memory usage per service (container_memory_usage_bytes by service_name)
  • CPU usage per service (rate of container_cpu_usage_seconds_total)
  • Container restarts (gaps or jumps in container_last_seen)

Logs (Grafana Explore or dashboard panels):

  • {stack_name="traefik"} — Traefik access logs
  • {service_name="swarmcd_swarmcd"} — SwarmCD deploy events
  • {job="varlogs"} — System logs (SSH, fail2ban bans, etc.)

Summary

The observability stack in this setup achieves:

  • Metrics from 3 sources (Traefik, node-exporter, cAdvisor) shipped to Grafana Cloud Prometheus
  • Logs from 2 sources (Docker containers, system) shipped to Grafana Cloud Loki
  • One container (Grafana Alloy) handling all of the above
  • Zero cost — Grafana Cloud free tier is sufficient with aggressive metric filtering
  • GitOps-deployed — changes to grafana_alloy.alloy trigger automatic redeployment via SwarmCD

The metric filtering is the key insight: Prometheus exporters expose far more data than you need. Being selective about what you ship keeps you within free tier limits and makes dashboards faster to query.


All source code: gitlab.com/sakonn/docker-swarm-gitops
