Jakub Korečko

Posted on May 30

Part 5: Full Observability for Free — Grafana Alloy, Prometheus, and Loki

What you'll learn:

Why Grafana Alloy replaces both Prometheus Agent and Promtail
What metrics are scraped from each service and why
How cardinality management keeps you inside Grafana Cloud's free tier
How Docker log discovery works with the Alloy pipeline
What to build in Grafana dashboards for this stack

One Container for Everything

Before Grafana Alloy, a standard observability setup required:

Prometheus or Prometheus Agent for metrics scraping and remote write
Promtail for log collection and Loki shipping
Separate configs, separate containers, separate log rotation concerns

Grafana Alloy replaces both: it is the successor to Grafana Agent and also takes over Promtail's log collection. One container, one config file (grafana_alloy.alloy), and it handles metrics and logs side by side.

Why not a full Prometheus stack? Self-hosted Prometheus needs storage, retention config, and an Alertmanager. For a single server, that's more infrastructure than the apps it monitors. Grafana Cloud's free tier gives you 14-day retention for metrics and logs plus managed alerting, without running any of that yourself.

The free tier limits that matter:

10,000 active metric series
50GB logs/month
14 days metric retention

Aggressive metric filtering (covered below) keeps this stack well under those limits.

The Alloy Config Structure

Alloy uses its own configuration syntax (formerly called River), which is similar to HCL. The full config is at apps/monitoring/grafana_alloy.alloy.

The pipeline follows this pattern:

prometheus.scrape → prometheus.relabel → prometheus.remote_write
loki.source       → loki.process      → loki.write

Each component is declared with a name and wired together via forward_to references. This makes the data flow explicit and easy to trace.

Metrics: What Gets Scraped

Traefik

prometheus.scrape "traefik" {
  targets = [{ __address__ = "traefik:8899" }]
  forward_to      = [prometheus.relabel.traefik.receiver]
  scrape_interval = "30s"
}

prometheus.relabel "traefik" {
  forward_to = [prometheus.remote_write.default.receiver]

  rule {
    source_labels = ["__name__"]
    regex = "(traefik_open_connections|traefik_entrypoint_requests_total|traefik_entrypoint_request_duration_seconds_sum|traefik_entrypoint_request_duration_seconds_bucket|traefik_entrypoint_request_duration_seconds_count|traefik_service_requests_total|traefik_service_request_duration_seconds_bucket|traefik_service_request_duration_seconds_sum|traefik_service_request_duration_seconds_count|traefik_service_requests_bytes_total|traefik_service_responses_bytes_total)"
    action = "keep"
  }
}

Traefik exposes metrics on port 8899 (the metrics entrypoint defined in static config). The raw Traefik output is 50+ metrics before labels multiply them into individual series. The relabeling rule keeps only 11 metric names:

Metric	What it tells you
`traefik_open_connections`	Active connections right now
`traefik_entrypoint_requests_total`	Total requests per entrypoint
`traefik_entrypoint_request_duration_seconds_*`	Latency distribution (histogram)
`traefik_service_requests_total`	Total requests per backend service
`traefik_service_request_duration_seconds_*`	Per-service latency histogram
`traefik_service_requests_bytes_total`	Request bytes per service
`traefik_service_responses_bytes_total`	Response bytes per service
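
To make the keep rule concrete, here is a small Python sketch that replays the same alternation against a few metric names. It uses re.fullmatch because Prometheus-style relabel regexes are anchored at both ends. The two sample names outside the keep list (traefik_config_reloads_total, go_goroutines) are only illustrative examples of what an exporter might also emit.

```python
import re

# Metric names copied from the prometheus.relabel "traefik" keep rule.
KEEP_NAMES = [
    "traefik_open_connections",
    "traefik_entrypoint_requests_total",
    "traefik_entrypoint_request_duration_seconds_sum",
    "traefik_entrypoint_request_duration_seconds_bucket",
    "traefik_entrypoint_request_duration_seconds_count",
    "traefik_service_requests_total",
    "traefik_service_request_duration_seconds_bucket",
    "traefik_service_request_duration_seconds_sum",
    "traefik_service_request_duration_seconds_count",
    "traefik_service_requests_bytes_total",
    "traefik_service_responses_bytes_total",
]
KEEP_REGEX = re.compile("(" + "|".join(KEEP_NAMES) + ")")


def is_kept(metric_name: str) -> bool:
    """Mimic action = "keep": a series survives only if __name__ fully matches."""
    return KEEP_REGEX.fullmatch(metric_name) is not None


# Sample scrape output; the last two names are illustrative, not from the config.
scraped = [
    "traefik_open_connections",
    "traefik_service_requests_total",
    "traefik_config_reloads_total",
    "go_goroutines",
]
forwarded = [name for name in scraped if is_kept(name)]
print(len(KEEP_NAMES), forwarded)
```

Because the match is anchored, a name that merely starts with a kept metric (say, a future traefik_open_connections_total) would not slip through.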

This is enough to build:

Request rate dashboards (requests/sec per service)
Latency percentile panels (P50, P95, P99)
Error rate panels (compare 2xx vs 4xx/5xx using the code label on the request counters, or from Traefik access logs)
Active connection gauges
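
For the latency percentile panels, it helps to know what a quantile estimate over histogram buckets actually does. The sketch below is a simplified Python version of the linear interpolation idea behind PromQL's histogram_quantile(); it is not the exact Prometheus implementation. The bucket bounds and counts are made-up sample data.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Finds the bucket that contains the target rank and interpolates
    linearly between that bucket's lower and upper bound.
    """
    ordered = sorted(buckets)
    total = ordered[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in ordered:
        if count >= rank:
            if math.isinf(bound):
                # Cannot interpolate into the +Inf bucket; report the last finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative bucket counts for one service (le in seconds).
BUCKETS = [(0.1, 60.0), (0.3, 85.0), (1.2, 97.0), (5.0, 100.0), (math.inf, 100.0)]

p50 = histogram_quantile(0.50, BUCKETS)
p95 = histogram_quantile(0.95, BUCKETS)
print(p50, p95)
```

The takeaway for dashboards: the estimate can never be more precise than the bucket boundaries, which is why the _bucket series are kept alongside _sum and _count.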

Host Metrics (node-exporter)

prometheus.scrape "node_exporter" {
  targets = [{ __address__ = "node-exporter:9100" }]
  forward_to      = [prometheus.relabel.node_exporter.receiver]
  scrape_interval = "30s"
}

prometheus.relabel "node_exporter" {
  forward_to = [prometheus.remote_write.default.receiver]

  rule {
    source_labels = ["__name__"]
    regex = "(node_cpu_seconds_total|node_memory_(MemTotal|MemFree|MemAvailable|Buffers|Cached|SReclaimable|SwapTotal|SwapFree)_bytes|node_filesystem_(size|avail)_bytes|node_network_(receive|transmit)_bytes_total|node_load(1|5|15)|node_time_seconds|node_boot_time_seconds|node_disk_(read_bytes_total|written_bytes_total|io_time_seconds_total))"
    action = "keep"
  }
}

node-exporter by default exposes hundreds of metrics covering every kernel subsystem. The regex keep-rule filters to the essentials:

Category	Metrics kept
CPU	`node_cpu_seconds_total` (all modes)
Memory	Total, free, available, buffers, cached, reclaimable slab, swap (total and free)
Filesystem	Size and available bytes per mount
Network	Receive/transmit bytes per interface
Load	1m, 5m, 15m load averages
Disk I/O	Read bytes, write bytes, I/O time per device
System	Uptime (boot time), clock

The node-exporter compose file further narrows what the filesystem and network collectors report:

# apps/monitoring/node_exporter.yaml
command:
  - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|run|var/lib/docker).*'
  - '--collector.filesystem.fs-types-exclude=^(sysfs|procfs|autofs|cgroup|devtmpfs|devpts|tmpfs|nsfs|overlay|securityfs|tracefs)$'
  - '--collector.netdev.device-exclude=^(veth|docker|br-).*'

Docker-internal network interfaces (veth, docker0, bridge interfaces) are excluded. Virtual filesystems (tmpfs, overlay, etc.) are excluded. This prevents cardinality explosion from Docker's ephemeral per-container veth interfaces — each container creates a new veth, and without filtering you'd accumulate thousands of metric series over time.
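
A quick way to sanity-check the device exclusion pattern is to run it against the interface names you expect on a Swarm host. The Python sketch below does that; the interface names are hypothetical examples, and for a pattern this simple Python's re module behaves the same as the Go regex engine node-exporter uses.

```python
import re
from typing import List

# Pattern from --collector.netdev.device-exclude in the compose file.
DEVICE_EXCLUDE = re.compile(r"^(veth|docker|br-).*")


def reported(interfaces: List[str]) -> List[str]:
    """Return the interfaces that would still produce netdev metrics."""
    return [name for name in interfaces if not DEVICE_EXCLUDE.search(name)]


# Hypothetical interface list from a Docker Swarm host.
interfaces = [
    "lo",
    "eth0",
    "docker0",
    "docker_gwbridge",
    "br-3f9a1c2b7d10",
    "veth1a2b3c4",
]
remaining = reported(interfaces)
print(remaining)
```

Only the physical interface and loopback survive, so the number of network series stays constant no matter how many containers come and go.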

Container Metrics (cAdvisor)

prometheus.relabel "cadvisor" {
  forward_to = [prometheus.remote_write.default.receiver]

  // Drop root/aggregate metrics (no container label)
  rule {
    source_labels = ["name"]
    regex = "^$"
    action = "drop"
  }

  // Extract container name from Docker Swarm task format
  rule {
    source_labels = ["name"]
    regex = "(.+)\\.(\\d+)\\.[a-zA-Z0-9]+$"
    target_label  = "container_name"
    replacement   = "$1.$2"
  }

  // Add service name label
  rule {
    source_labels = ["container_label_com_docker_swarm_service_name"]
    target_label  = "service_name"
  }

  // Keep only essential container metrics
  rule {
    source_labels = ["__name__"]
    regex = "(container_memory_usage_bytes|container_last_seen|container_cpu_user_seconds_total|container_network_(receive|transmit)_bytes_total|container_memory_cache|container_fs_(reads|writes)_bytes_total|container_cpu_usage_seconds_total)"
    action = "keep"
  }

  // Drop unused labels
  rule {
    regex  = "(__name__|container_name|service_name|job|instance)"
    action = "labelkeep"
  }
}

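cAdvisor exports metrics for every container on the system, including Docker's internal containers. The first rule drops "aggregate" metrics that have no name label (cAdvisor's root-level stats). The final labelkeep rule drops all the verbose Docker label metadata that cAdvisor attaches, keeping only the labels that matter for querying. One caveat: labelkeep also removes labels such as interface, device, and cpu, so series that differ only in those labels (for example, network counters for a container with several interfaces) end up with identical label sets and can no longer be told apart. Add those labels to the keep list if you need that breakdown.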

The result: per-container CPU, memory, network, and disk I/O, labeled with service name and container name.
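
One thing worth checking with any labelkeep rule is whether series remain unique after the labels are stripped. The Python sketch below applies the same label regex to two hypothetical network series that differ only in their interface label. The label values (job, instance, image) are invented for the example.

```python
import re
from typing import Dict

# Regex from the final labelkeep rule in the cAdvisor relabel block.
LABELKEEP = re.compile("(__name__|container_name|service_name|job|instance)")


def labelkeep(labels: Dict[str, str]) -> Dict[str, str]:
    """Keep only labels whose name fully matches the regex."""
    return {k: v for k, v in labels.items() if LABELKEEP.fullmatch(k)}


# Hypothetical label set for one container; values are illustrative.
base = {
    "__name__": "container_network_receive_bytes_total",
    "container_name": "traefik_traefik.1",
    "service_name": "traefik_traefik",
    "job": "cadvisor",
    "instance": "cadvisor:8080",
    "image": "traefik:v3",
}
series = [dict(base, interface="eth0"), dict(base, interface="eth1")]
reduced = [labelkeep(s) for s in series]

# After labelkeep both series carry exactly the same labels.
print(reduced[0] == reduced[1])
```

If your containers only ever have one interface this is harmless; otherwise adding interface (and similarly device) to the regex keeps the series distinct at the cost of a few more series.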

Logs: Docker and System

System Logs

local.file_match "system" {
  path_targets = [{
    __path__ = "/var/log/**/*log",
    job       = "varlogs",
  }]
}

loki.process "system" {
  forward_to = [loki.write.default.receiver]

  stage.drop {
    older_than = "1h0m0s"
  }
}

Alloy reads all *log files under /var/log/ (mounted from the host as read-only). The stage.drop rule discards log lines older than 1 hour — this prevents Alloy from re-shipping old logs after a restart, which would generate duplicate log entries in Grafana Cloud.
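
The older_than check boils down to comparing each entry's timestamp with the time it is processed. Here is a minimal Python sketch of that comparison, assuming entries arrive as (timestamp, line) pairs; the fixed clock value and log lines are arbitrary sample data.

```python
from datetime import datetime, timedelta, timezone

OLDER_THAN = timedelta(hours=1)  # same window as older_than = "1h0m0s"


def should_drop(entry_time: datetime, now: datetime) -> bool:
    """True when the entry is older than the allowed window."""
    return now - entry_time > OLDER_THAN


# Arbitrary fixed "now" so the example is deterministic.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
entries = [
    (now - timedelta(minutes=5), "sshd: accepted publickey"),
    (now - timedelta(hours=3), "fail2ban: ban 203.0.113.7"),
]
shipped = [line for ts, line in entries if not should_drop(ts, now)]
print(shipped)
```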

Docker Container Logs

discovery.docker "docker_swarm" {
  host             = "unix:///var/run/docker.sock"
  refresh_interval = "5s"
}

discovery.relabel "docker_swarm" {
  targets = []

  rule {
    source_labels = ["__meta_docker_container_name"]
    regex         = "(.+)\\.(\\d+)\\.[a-zA-Z0-9]+$"
    target_label  = "container_name"
  }

  // discovery.docker exposes Swarm metadata as container labels
  rule {
    source_labels = ["__meta_docker_container_label_com_docker_swarm_service_name"]
    target_label  = "service_name"
  }

  rule {
    source_labels = ["__meta_docker_container_label_com_docker_stack_namespace"]
    target_label  = "stack_name"
  }

  rule {
    source_labels = ["__meta_docker_container_id"]
    target_label  = "__path__"
    replacement   = "/var/lib/docker/containers/$1/$1-json.log"
  }
}

loki.source.docker "docker_swarm" {
  host          = "unix:///var/run/docker.sock"
  targets       = discovery.docker.docker_swarm.targets
  forward_to    = [loki.process.docker_swarm.receiver]
  relabel_rules = discovery.relabel.docker_swarm.rules
}

Docker Swarm container names follow the pattern stackname_servicename.replicanumber.taskid. The regex (.+)\.(\d+)\.[a-zA-Z0-9]+$ extracts a stable name (without the random task ID) so log streams from the same service don't fragment across container restarts.

The stack_name, service_name, and container_name labels are attached to every log stream, so Loki queries like {stack_name="traefik"} or {service_name="traefik_traefik"} select exactly the containers you expect.
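
To see what the task-name regex captures, the Python sketch below replays it on a made-up Swarm container name. It mimics two relabel behaviours: the regex must match the whole value, and a rule without an explicit replacement falls back to "$1". The container names are hypothetical.

```python
import re
from typing import Optional

# Regex used in both the cAdvisor and Docker log relabel rules.
TASK_NAME = re.compile(r"(.+)\.(\d+)\.[a-zA-Z0-9]+$")


def relabel(value: str, replacement: str = "$1") -> Optional[str]:
    """Return the new label value, or None when the rule does not match."""
    match = TASK_NAME.fullmatch(value)
    if match is None:
        return None  # no match: the target label is left untouched
    return replacement.replace("$1", match.group(1)).replace("$2", match.group(2))


# Hypothetical Swarm task container name: service.replica.taskid
name = "traefik_traefik.1.x7k2m9q4w1"

default_value = relabel(name)             # replacement omitted, so "$1"
with_replica = relabel(name, "$1.$2")     # as in the cAdvisor rule
print(default_value, with_replica, relabel("standalone"))
```

Note the difference between the two variants: the default drops the replica number as well as the task ID, while "$1.$2" keeps one stream per replica.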

cAdvisor and node-exporter Compose Files

Both exporters have carefully tuned compose configurations.

cAdvisor

# apps/monitoring/cadvisor.yaml (key sections)
services:
  cadvisor:
    image: ghcr.io/google/cadvisor:0.56.2
    command:
      - --docker_only=true
      - --housekeeping_interval=30s
      - --max_housekeeping_interval=35s
      - --global_housekeeping_interval=10m
      - --storage_duration=10m
    deploy:
      resources:
        limits:
          memory: 128M
        reservations:
          memory: 64M
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3

--docker_only=true restricts cAdvisor to Docker containers only (no system processes). --housekeeping_interval=30s matches the Alloy scrape interval — collecting at a higher frequency than the scrape interval provides no benefit. Memory is capped at 128M because cAdvisor has a known tendency to grow unbounded on busy systems.

The restart policy uses condition: on-failure with max_attempts: 3 rather than restarting indefinitely: repeated failures indicate a persistent issue that warrants investigation, not endless restart attempts.

node-exporter

services:
  node-exporter:
    image: prom/node-exporter:v1.10.2
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'

node-exporter needs read-only access to /proc, /sys, and the filesystem root to collect host metrics. Because the host's /proc is bind-mounted at /host/proc, it sees host-wide process and kernel statistics rather than only its own container's. The :ro mounts ensure it can only read, never write.

Remote Write to Grafana Cloud

Both the metrics pipeline and the logs pipeline ship to Grafana Cloud using basic auth:

prometheus.remote_write "default" {
  endpoint {
    url = "<YOUR_GRAFANA_CLOUD_PROMETHEUS_URL>/api/prom/push"

    basic_auth {
      username      = "<YOUR_GRAFANA_CLOUD_METRICS_USERNAME>"
      password_file = "/run/secrets/grafana_cloud_passwd"
    }
  }
}

loki.write "default" {
  endpoint {
    url = "<YOUR_GRAFANA_CLOUD_LOKI_URL>/loki/api/v1/push"

    basic_auth {
      username      = "<YOUR_GRAFANA_CLOUD_LOGS_USERNAME>"
      password_file = "/run/secrets/grafana_cloud_passwd"
    }
  }
}

The grafana_cloud_passwd secret holds a Grafana Cloud API token. It's stored encrypted in the Git repo (SOPS) and decrypted by SwarmCD at deploy time into a Docker secret. The same token is shared between the Prometheus and Loki write endpoints, so it needs write access to both metrics and logs; only the usernames differ.
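
If you want to test the credentials outside Alloy, the basic auth header is easy to reproduce. The Python sketch below reads a password file and builds the header. The username 123456 and the token value are placeholders, and stripping the trailing newline is a choice made in this sketch rather than a statement about how Alloy reads the file.

```python
import base64
import tempfile
from pathlib import Path
from typing import Dict


def basic_auth_header(username: str, password_file: str) -> Dict[str, str]:
    """Build an HTTP Basic Authorization header from a username and a secret file."""
    password = Path(password_file).read_text().strip()
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return {"Authorization": f"Basic {token}"}


# Demo with a throwaway file standing in for /run/secrets/grafana_cloud_passwd.
with tempfile.TemporaryDirectory() as tmp:
    secret = Path(tmp) / "grafana_cloud_passwd"
    secret.write_text("example-token\n")  # placeholder, not a real token
    header = basic_auth_header("123456", str(secret))

print(header["Authorization"].startswith("Basic "))
```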

Suggested Grafana Dashboard Panels

Once data is flowing to Grafana Cloud, here are the most useful panels to build:

Server Overview:

CPU usage (% of total, per-core breakdown)
Memory usage (total / available / used)
Disk usage per mount point
Network traffic (bytes/sec in/out)
System uptime
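
The CPU usage panel is derived from node_cpu_seconds_total, which is a counter of seconds spent per mode. A minimal Python sketch of the arithmetic, assuming two samples of per-mode totals already summed over cores (the numbers are invented):

```python
from typing import Dict


def cpu_busy_percent(before: Dict[str, float], after: Dict[str, float]) -> float:
    """Share of non-idle CPU time between two counter samples, in percent."""
    deltas = {mode: after[mode] - before.get(mode, 0.0) for mode in after}
    total = sum(deltas.values())
    if total <= 0:
        return 0.0
    return 100.0 * (1.0 - deltas.get("idle", 0.0) / total)


# Hypothetical per-mode CPU seconds at two scrapes 30s apart.
first = {"idle": 1000.0, "user": 200.0, "system": 100.0}
second = {"idle": 1024.0, "user": 204.0, "system": 102.0}

busy = cpu_busy_percent(first, second)
print(busy)
```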

Traefik:

Requests/sec per service (rate of traefik_service_requests_total)
Error rate (4xx + 5xx as % of total)
P95 latency per service (histogram_quantile over traefik_service_request_duration_seconds_bucket)
Active open connections
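
For the error rate panel, the calculation is just errors divided by all requests over the same window. A small Python sketch, assuming you already have per-status-code request increases (as you would get from the code label on traefik_service_requests_total); the counts are sample values:

```python
from typing import Dict


def error_rate_percent(requests_by_code: Dict[str, float]) -> float:
    """Percentage of requests that ended in a 4xx or 5xx status."""
    total = sum(requests_by_code.values())
    if total == 0:
        return 0.0
    errors = sum(
        count
        for code, count in requests_by_code.items()
        if code.startswith(("4", "5"))
    )
    return 100.0 * errors / total


# Hypothetical request increases over a 5 minute window.
window = {"200": 940.0, "404": 45.0, "502": 15.0}
rate = error_rate_percent(window)
print(rate)
```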

Container Resources:

Memory usage per service (container_memory_usage_bytes by service_name)
CPU usage per service (rate of container_cpu_usage_seconds_total)
Container restarts (gaps in container_last_seen per container)

Logs (Grafana Explore or dashboard panels):

{stack_name="traefik"} — Traefik access logs
{service_name="swarmcd_swarmcd"} — SwarmCD deploy events
{job="varlogs"} — System logs (SSH, fail2ban bans, etc.)

Summary

The observability stack in this setup achieves:

Metrics from 3 sources (Traefik, node-exporter, cAdvisor) shipped to Grafana Cloud Prometheus
Logs from 2 sources (Docker containers, system) shipped to Grafana Cloud Loki
One container (Grafana Alloy) handling all of the above
Zero cost — Grafana Cloud free tier is sufficient with aggressive metric filtering
GitOps-deployed — changes to grafana_alloy.alloy trigger automatic redeployment via SwarmCD

The metric filtering is the key insight: Prometheus exporters expose far more data than you need. Being selective about what you ship keeps you within free tier limits and makes dashboards faster to query.

All source code: gitlab.com/sakonn/docker-swarm-gitops