What you'll learn:
- Why Grafana Alloy replaces both Prometheus Agent and Promtail
- What metrics are scraped from each service and why
- How cardinality management keeps you inside Grafana Cloud's free tier
- How Docker log discovery works with the Alloy pipeline
- What to build in Grafana dashboards for this stack
One Container for Everything
Before Grafana Alloy, a standard observability setup required:
- Prometheus or Prometheus Agent for metrics scraping and remote write
- Promtail for log collection and Loki shipping
- Separate configs, separate containers, separate log rotation concerns
Grafana Alloy replaces both. One container with one config file (grafana_alloy.alloy) handles metrics and logs, each in its own pipeline.
Why not a full Prometheus stack? Self-hosted Prometheus needs storage, retention config, and an Alertmanager. For a single server, that's more infrastructure than the apps it monitors. Grafana Cloud's free tier gives you 14-day metric retention, 30-day log retention, and managed alerting, without running any of that yourself.
The free tier limits that matter:
- 10,000 active metric series
- 50 GB of logs per month
- 14 days metric retention
Aggressive metric filtering (covered below) keeps this stack well under those limits.
The Alloy Config Structure
Alloy uses its own configuration syntax, formerly called River and similar to HCL. The full config is at apps/monitoring/grafana_alloy.alloy.
The pipeline follows this pattern:
prometheus.scrape → prometheus.relabel → prometheus.remote_write
loki.source → loki.process → loki.write
Each component is declared with a name and wired together via forward_to references. This makes the data flow explicit and easy to trace.
Metrics: What Gets Scraped
Traefik
prometheus.scrape "traefik" {
  targets = [{ __address__ = "traefik:8899" }]
  forward_to = [prometheus.relabel.traefik.receiver]
  scrape_interval = "30s"
}
prometheus.relabel "traefik" {
  forward_to = [prometheus.remote_write.default.receiver]
  rule {
    source_labels = ["__name__"]
    regex = "(traefik_open_connections|traefik_entrypoint_requests_total|traefik_entrypoint_request_duration_seconds_sum|traefik_entrypoint_request_duration_seconds_bucket|traefik_entrypoint_request_duration_seconds_count|traefik_service_requests_total|traefik_service_request_duration_seconds_bucket|traefik_service_request_duration_seconds_sum|traefik_service_request_duration_seconds_count|traefik_service_requests_bytes_total|traefik_service_responses_bytes_total)"
    action = "keep"
  }
}
Traefik exposes metrics on port 8899 (the metrics entrypoint defined in static config). Unfiltered, Traefik emits roughly 50 or more metric series. The relabeling rule keeps only 11 specific metrics:
| Metric | What it tells you |
|---|---|
| traefik_open_connections | Active connections right now |
| traefik_entrypoint_requests_total | Total requests per entrypoint |
| traefik_entrypoint_request_duration_seconds_* | Latency distribution (histogram) |
| traefik_service_requests_total | Total requests per backend service |
| traefik_service_request_duration_seconds_* | Per-service latency histogram |
| traefik_service_requests_bytes_total | Request bytes per service |
| traefik_service_responses_bytes_total | Response bytes per service |
This is enough to build:
- Request rate dashboards (requests/sec per service)
- Latency percentile panels (P50, P95, P99)
- Error rate panels (compare 2xx vs 4xx/5xx from Traefik access logs)
- Active connection gauges
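The latency percentile panels are estimated from the cumulative `_bucket` counters, not from individual request timings. As a rough illustration of how a P95 comes out of such buckets, here is a minimal Python sketch of the linear-interpolation approach used by Prometheus-style `histogram_quantile`. The bucket bounds and counts are made-up sample values, not output from this stack, and `estimate_quantile` is a simplified, hypothetical helper that ignores edge cases such as negative bounds.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound,
    ending with the +Inf bucket, as in a Prometheus histogram.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                # Empty bucket (only reachable when rank is 0): nothing to interpolate.
                return bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative request counts per latency bucket (seconds).
sample = [(0.1, 50), (0.3, 90), (1.2, 98), (5.0, 100), (math.inf, 100)]
p50 = estimate_quantile(0.50, sample)  # about 0.1 s for this sample
p95 = estimate_quantile(0.95, sample)  # about 0.8625 s for this sample
```

The estimate can only be as fine-grained as the bucket boundaries, which is why the `_bucket` series are worth keeping even though they account for most of the histogram's series count.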
Host Metrics (node-exporter)
prometheus.scrape "node_exporter" {
  targets = [{ __address__ = "node-exporter:9100" }]
  forward_to = [prometheus.relabel.node_exporter.receiver]
  scrape_interval = "30s"
}
prometheus.relabel "node_exporter" {
  forward_to = [prometheus.remote_write.default.receiver]
  rule {
    source_labels = ["__name__"]
    regex = "(node_cpu_seconds_total|node_memory_(MemTotal|MemFree|MemAvailable|Buffers|Cached|SReclaimable|SwapTotal|SwapFree)_bytes|node_filesystem_(size|avail)_bytes|node_network_(receive|transmit)_bytes_total|node_load(1|5|15)|node_time_seconds|node_boot_time_seconds|node_disk_(read_bytes_total|written_bytes_total|io_time_seconds_total))"
    action = "keep"
  }
}
node-exporter by default exposes hundreds of metrics covering every kernel subsystem. The regex keep-rule filters to the essentials:
| Category | Metrics kept |
|---|---|
| CPU | node_cpu_seconds_total (all modes) |
| Memory | Total, free, available, buffers, cached, swap |
| Filesystem | Size and available bytes per mount |
| Network | Receive/transmit bytes per interface |
| Load | 1m, 5m, 15m load averages |
| Disk I/O | Read bytes, write bytes, I/O time per device |
| System | Uptime (boot time), clock |
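Before deploying a keep rule like this, it can help to check which metric names actually survive it. The sketch below is one way to do that in Python, assuming Python's `re` behaves like the relabel engine for this particular pattern (it uses no syntax that differs between the two). Relabel regexes are anchored at both ends, which `fullmatch` mirrors. The candidate list is a hand-picked sample, not a dump from a real node-exporter.

```python
import re

# The keep regex from the node_exporter relabel rule, copied verbatim.
KEEP = re.compile(
    r"(node_cpu_seconds_total|node_memory_(MemTotal|MemFree|MemAvailable|Buffers"
    r"|Cached|SReclaimable|SwapTotal|SwapFree)_bytes|node_filesystem_(size|avail)_bytes"
    r"|node_network_(receive|transmit)_bytes_total|node_load(1|5|15)|node_time_seconds"
    r"|node_boot_time_seconds|node_disk_(read_bytes_total|written_bytes_total"
    r"|io_time_seconds_total))"
)


def is_kept(metric_name):
    # The whole metric name must match, not just a substring of it.
    return KEEP.fullmatch(metric_name) is not None


candidates = [
    "node_load15",
    "node_memory_MemAvailable_bytes",
    "node_memory_Active_bytes",            # not listed in the regex
    "node_filesystem_files_free",          # inode metric, not listed
    "node_network_receive_packets_total",  # packets, not bytes
    "node_disk_read_bytes_total",
]
kept = [name for name in candidates if is_kept(name)]
dropped = [name for name in candidates if not is_kept(name)]
```

Running the same check against the output of the exporter's /metrics endpoint gives a quick estimate of how many metric names a rule will let through.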
The node-exporter compose file further filters which collectors run:
# apps/monitoring/node_exporter.yaml
command:
  - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|run|var/lib/docker).*'
  - '--collector.filesystem.fs-types-exclude=^(sysfs|procfs|autofs|cgroup|devtmpfs|devpts|tmpfs|nsfs|overlay|securityfs|tracefs)$'
  - '--collector.netdev.device-exclude=^(veth|docker|br-).*'
Docker-internal network interfaces (veth, docker0, bridge interfaces) are excluded. Virtual filesystems (tmpfs, overlay, etc.) are excluded. This prevents cardinality explosion from Docker's ephemeral per-container veth interfaces — each container creates a new veth, and without filtering you'd accumulate thousands of metric series over time.
Container Metrics (cAdvisor)
prometheus.relabel "cadvisor" {
  forward_to = [prometheus.remote_write.default.receiver]
  // Drop root/aggregate metrics (no container label)
  rule {
    source_labels = ["name"]
    regex = "^$"
    action = "drop"
  }
  // Extract container name from Docker Swarm task format
  rule {
    source_labels = ["name"]
    regex = "(.+)\\.(\\d+)\\.[a-zA-Z0-9]+$"
    target_label = "container_name"
    replacement = "$1.$2"
  }
  // Add service name label
  rule {
    source_labels = ["container_label_com_docker_swarm_service_name"]
    target_label = "service_name"
  }
  // Keep only essential container metrics
  rule {
    source_labels = ["__name__"]
    regex = "(container_memory_usage_bytes|container_last_seen|container_cpu_user_seconds_total|container_network_(receive|transmit)_bytes_total|container_memory_cache|container_fs_(reads|writes)_bytes_total|container_cpu_usage_seconds_total)"
    action = "keep"
  }
  // Drop unused labels
  rule {
    regex = "(__name__|container_name|service_name|job|instance)"
    action = "labelkeep"
  }
}
cAdvisor exports metrics for every container on the system, including Docker's internal containers. The first rule drops "aggregate" metrics that have no name label (cAdvisor's root-level stats). The final labelkeep rule drops all the verbose Docker label metadata that cAdvisor attaches — keeping only the labels that matter for querying.
The result: per-container CPU, memory, network, and disk I/O, labeled with service name and container name.
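To make the order of the five rules concrete, the following Python sketch walks one label set through them. It is a simplified model of relabeling, not Alloy's implementation: `relabel` is a hypothetical helper, and the sample values for `name`, `image`, `id`, `job`, and `instance` are invented for illustration.

```python
import re

NAME_RE = re.compile(r"(.+)\.(\d+)\.[a-zA-Z0-9]+")
METRIC_RE = re.compile(
    r"(container_memory_usage_bytes|container_last_seen|container_cpu_user_seconds_total"
    r"|container_network_(receive|transmit)_bytes_total|container_memory_cache"
    r"|container_fs_(reads|writes)_bytes_total|container_cpu_usage_seconds_total)"
)
LABELKEEP_RE = re.compile(r"(__name__|container_name|service_name|job|instance)")


def relabel(labels):
    """Return the rewritten label set, or None if the series is dropped."""
    labels = dict(labels)
    # Rule 1: drop series whose name label is empty or missing.
    if labels.get("name", "") == "":
        return None
    # Rule 2: derive container_name when the Swarm task pattern matches.
    match = NAME_RE.fullmatch(labels["name"])
    if match:
        labels["container_name"] = match.expand(r"\1.\2")
    # Rule 3: copy the Swarm service name into service_name (skipped if empty).
    service = labels.get("container_label_com_docker_swarm_service_name", "")
    if service:
        labels["service_name"] = service
    # Rule 4: keep only the allow-listed metric names.
    if not METRIC_RE.fullmatch(labels.get("__name__", "")):
        return None
    # Rule 5: labelkeep removes every label whose name is not allow-listed.
    return {k: v for k, v in labels.items() if LABELKEEP_RE.fullmatch(k)}


sample = {
    "__name__": "container_memory_usage_bytes",
    "name": "traefik_traefik.1.k3x9q2abcdef",  # hypothetical Swarm task name
    "container_label_com_docker_swarm_service_name": "traefik_traefik",
    "image": "traefik:latest",                 # hypothetical, removed by rule 5
    "id": "/docker/0123abcd",                  # hypothetical, removed by rule 5
    "job": "cadvisor",                         # hypothetical job label
    "instance": "cadvisor:8080",               # hypothetical target address
}
result = relabel(sample)
```

Note that the labelkeep rule runs last: the `name` and `container_label_...` labels must still exist when rules 2 and 3 read them, so moving it earlier would leave `container_name` and `service_name` unset.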
Logs: Docker and System
System Logs
local.file_match "system" {
  path_targets = [{
    __path__ = "/var/log/**/*log",
    job = "varlogs",
  }]
}
loki.process "system" {
  forward_to = [loki.write.default.receiver]
  stage.drop {
    older_than = "1h0m0s"
  }
}
Alloy reads all *log files under /var/log/ (mounted from the host as read-only). The stage.drop rule discards log lines older than 1 hour — this prevents Alloy from re-shipping old logs after a restart, which would generate duplicate log entries in Grafana Cloud.
Docker Container Logs
discovery.docker "docker_swarm" {
  host = "unix:///var/run/docker.sock"
  refresh_interval = "5s"
}
discovery.relabel "docker_swarm" {
  targets = []
  rule {
    source_labels = ["__meta_docker_container_name"]
    regex = "(.+)\\.(\\d+)\\.[a-zA-Z0-9]+$"
    target_label = "container_name"
  }
  rule {
    source_labels = ["__meta_docker_service_name"]
    target_label = "service_name"
  }
  rule {
    source_labels = ["__meta_docker_service_label_com_docker_stack_namespace"]
    target_label = "stack_name"
  }
  rule {
    source_labels = ["__meta_docker_container_id"]
    target_label = "__path__"
    replacement = "/var/lib/docker/containers/$1/$1-json.log"
  }
}
loki.source.docker "docker_swarm" {
  host = "unix:///var/run/docker.sock"
  targets = discovery.docker.docker_swarm.targets
  forward_to = [loki.process.docker_swarm.receiver]
  relabel_rules = discovery.relabel.docker_swarm.rules
}
Docker Swarm container names follow the pattern stackname_servicename.replicanumber.taskid. The regex (.+)\.(\d+)\.[a-zA-Z0-9]+$ extracts a stable name (without the random task ID) so log streams from the same service don't fragment across container restarts.
The stack_name, service_name, and container_name labels are added to every log line, making Loki queries like {stack_name="traefik"} or {service_name="traefik_traefik"} work correctly.
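The effect of the container-name regex is easiest to see on concrete names. The Python sketch below applies it to two invented task names for the same replica, one before and one after a restart. It assumes the default relabel replacement of `$1` when a rule sets none (written `\1` in Python templates) and compares it with the `$1.$2` form. `stable_name` is a hypothetical helper, and real names from Docker discovery may be formatted differently from these samples.

```python
import re

# Same pattern as the relabel rule; fullmatch supplies the implicit anchoring.
TASK_NAME = re.compile(r"(.+)\.(\d+)\.[a-zA-Z0-9]+")


def stable_name(container_name, template=r"\1"):
    """Expand the template from the regex groups, or return None on no match."""
    match = TASK_NAME.fullmatch(container_name)
    return match.expand(template) if match else None


# Two hypothetical task names for replica 1 of the same service.
before_restart = "traefik_traefik.1.k3x9q2abcdef"
after_restart = "traefik_traefik.1.z8w7v6u5t4s3"

service_only = stable_name(before_restart)            # "$1" form
with_replica = stable_name(before_restart, r"\1.\2")  # "$1.$2" form
same_stream = stable_name(before_restart) == stable_name(after_restart)
unmatched = stable_name("alloy")  # a name without the task suffix does not match
```

Either form drops the random task ID, so both keep the log stream stable across restarts. The `$1.$2` form additionally separates replicas of the same service into their own streams.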
cAdvisor and node-exporter Compose Files
Both exporters have carefully tuned compose configurations.
cAdvisor
# apps/monitoring/cadvisor.yaml (key sections)
services:
  cadvisor:
    image: ghcr.io/google/cadvisor:0.56.2
    command:
      - --docker_only=true
      - --housekeeping_interval=30s
      - --max_housekeeping_interval=35s
      - --global_housekeeping_interval=10m
      - --storage_duration=10m
    deploy:
      resources:
        limits:
          memory: 128M
        reservations:
          memory: 64M
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
--docker_only=true restricts cAdvisor to Docker containers only (no system processes). --housekeeping_interval=30s matches the Alloy scrape interval — collecting at a higher frequency than the scrape interval provides no benefit. Memory is capped at 128M because cAdvisor has a known tendency to grow unbounded on busy systems.
The restart policy uses condition: on-failure with max_attempts: 3 rather than always. Repeated failures indicate a persistent issue that warrants investigation, not indefinite restart attempts.
node-exporter
services:
  node-exporter:
    image: prom/node-exporter:v1.10.2
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
node-exporter needs read-only access to /proc, /sys, and the filesystem root to collect host metrics. It runs in the host PID namespace to see all processes. The :ro mounts ensure it can only read, never write.
Remote Write to Grafana Cloud
Both the metrics and the logs pipelines ship to Grafana Cloud using basic auth:
prometheus.remote_write "default" {
  endpoint {
    url = "<YOUR_GRAFANA_CLOUD_PROMETHEUS_URL>/api/prom/push"
    basic_auth {
      username = "<YOUR_GRAFANA_CLOUD_METRICS_USERNAME>"
      password_file = "/run/secrets/grafana_cloud_passwd"
    }
  }
}
loki.write "default" {
  endpoint {
    url = "<YOUR_GRAFANA_CLOUD_LOKI_URL>/loki/api/v1/push"
    basic_auth {
      username = "<YOUR_GRAFANA_CLOUD_LOGS_USERNAME>"
      password_file = "/run/secrets/grafana_cloud_passwd"
    }
  }
}
The grafana_cloud_passwd secret holds a Grafana Cloud API key. It's stored encrypted in the Git repo (SOPS) and decrypted by SwarmCD at deploy time into a Docker secret. The Prometheus and Loki write endpoints share the same password (Grafana Cloud uses the same credential for both).
Suggested Grafana Dashboard Panels
Once data is flowing to Grafana Cloud, here are the most useful panels to build:
Server Overview:
- CPU usage (% of total, per-core breakdown)
- Memory usage (total / available / used)
- Disk usage per mount point
- Network traffic (bytes/sec in/out)
- System uptime
Traefik:
- Requests/sec per service (rate of traefik_service_requests_total)
- Error rate (4xx + 5xx as % of total)
- P95 latency per service (histogram from traefik_service_request_duration_seconds_bucket)
- Active open connections
Container Resources:
- Memory usage per service (container_memory_usage_bytes by service_name)
- CPU usage per service (rate of container_cpu_usage_seconds_total)
- Container restarts (gaps in container_last_seen)
Logs (Grafana Explore or dashboard panels):
- {stack_name="traefik"}: Traefik access logs
- {service_name="swarmcd_swarmcd"}: SwarmCD deploy events
- {job="varlogs"}: System logs (SSH, fail2ban bans, etc.)
Summary
The observability stack in this setup achieves:
- Metrics from 3 sources (Traefik, node-exporter, cAdvisor) shipped to Grafana Cloud Prometheus
- Logs from 2 sources (Docker containers, system) shipped to Grafana Cloud Loki
- One container (Grafana Alloy) handling all of the above
- Zero cost — Grafana Cloud free tier is sufficient with aggressive metric filtering
- GitOps-deployed: changes to grafana_alloy.alloy trigger automatic redeployment via SwarmCD
The metric filtering is the key insight: Prometheus exporters expose far more data than you need. Being selective about what you ship keeps you within free tier limits and makes dashboards faster to query.
All source code: gitlab.com/sakonn/docker-swarm-gitops