Scaling Observability: Designing a Resilient Multi-Node Monitoring Stack with Docker, Prometheus & Grafana

Building a monitoring environment on a local machine is a great weekend project, but scaling it up to look after a live fleet of remote servers requires shifts in how you handle configuration stability, dashboard variables, and resilience to host restarts.
In this post, I want to walk through how I configured and optimized a multi-node monitoring stack built on Prometheus, Node Exporter, and Grafana, deployed entirely via Docker Compose.

## 1. The Deployment Architecture

To keep things clean and modular, the entire monitoring core runs as separate containerized services. The telemetry configuration relies on bind mounts, so that if a container is wiped or updated, the custom target definitions stay safe on disk. Here is the docker-compose.yml layout used to spin it up:

```yaml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: always
    volumes:
      - ./prometheus:/etc/prometheus
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: always
    ports:
      - "3000:3000"
```

### Solving the High-Availability Problem

A common issue with basic Docker deployments is that if the physical or virtual host suffers a sudden reboot or power failure, the containers drop into an `Exited` state and stay offline.
By applying the `restart: always` policy to each service, the Docker daemon relaunches the infrastructure automatically as soon as the system boots. No manual SSH intervention required.
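To sanity-check that the policy is active, Docker's built-in inspection works; a quick sketch, assuming the container names from the compose file above:

```bash
# Confirm the restart policy Docker recorded for the container:
docker inspect -f '{{ .HostConfig.RestartPolicy.Name }}' prometheus
# Expected output: always

# Apply the policy to containers that were created before it was added:
docker update --restart=always prometheus grafana
```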

## 2. Scraping Multiple Remote Targets

Inside prometheus.yml, I grouped our infrastructure assets into distinct target blocks. Rather than hardcoding a separate job for every server, pooling identical server profiles under a single targets array keeps filtering dramatically cleaner:

```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'remote_ubuntu_nodes'
    static_configs:
      - targets:
          - '192.168.23.87:9100'
          - '192.168.23.88:9100'
          - '192.168.23.89:9100'
          - '192.168.23.90:9100'
```
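Before pushing new targets live, the config can be validated with promtool, which ships inside the official Prometheus image; a quick check, assuming the container name from the compose file:

```bash
# Validate the scrape config from inside the running container:
docker exec prometheus promtool check config /etc/prometheus/prometheus.yml

# Trigger a live reload without a restart
# (only works if Prometheus was started with --web.enable-lifecycle):
curl -X POST http://localhost:9090/-/reload
```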
## Transitioning to a Fleet View in Grafana

Standard configurations for public dashboards (like the classic Node Exporter Full) default to strict single-select filters. When checking on multiple nodes such as load balancers or app services, clicking through an endless dropdown isn't sustainable.

To move to a comprehensive fleet view, we can tap into Dashboard Settings (the s shortcut in Grafana) and adjust the query variables:

- Multi-value selection: Enabled
- Include All option: Enabled

To prevent the gauges from blending the metrics into a confusing average, you can open the row settings for your graphs and toggle Repeat for: Instance. Grafana will then dynamically duplicate that entire row of health metrics for every machine checking into the cluster.
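For reference, the fleet view hangs off a dashboard variable. Here is a minimal sketch of the queries involved, assuming a Prometheus data source and a variable named instance (the job label matches the scrape config above):

```promql
# Variable query (Dashboard Settings -> Variables), populating "instance":
label_values(node_uname_info{job="remote_ubuntu_nodes"}, instance)

# Panel query — multi-value variables need the regex matcher =~ rather than =:
node_load1{instance=~"$instance"}
```

With Multi-value and Include All enabled, Grafana expands $instance into a pipe-separated regex (e.g. 192.168.23.88:9100|192.168.23.89:9100), which is why the panel queries have to use the `=~` matcher instead of plain equality.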
