
Dnyaneshwar Vitthal Shekade

Posted on • Originally published at dnyan.vercel.app

How I Built a Centralized Monitoring System for 600–700 Servers — Using Open Source Tools

What happens when you're responsible for hundreds of client servers — and have no single pane of glass to see them all? You build one.

Managing infrastructure at scale is one of those challenges that seems simple until you're actually doing it. Scattered dashboards, siloed logs, missed alerts — the classic observability nightmare. Here's how I designed and deployed a centralized monitoring and logging stack that now covers over 600 client servers, all from a single control plane on AWS.



The Problem

Our team manages infrastructure for multiple clients — each with their own servers, environments, and applications. Before this implementation, monitoring was fragmented: each client had different tooling, visibility was reactive, and log investigation meant SSHing into individual machines. We needed something better.

The goal: one place to see everything — metrics, logs, alerts — with zero vendor lock-in and minimal cost.

The Stack I Chose

All open source. All containerized. Deployed on a single Amazon EC2 instance using Docker Compose:

- Prometheus for time-series metrics and alerting rules
- Loki for log aggregation
- Grafana for dashboards and alert routing
- Node Exporter, cAdvisor, and Promtail as the per-server agents
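A minimal Docker Compose sketch of the central stack. Image tags, ports, volume names, and the admin password here are illustrative assumptions, not the exact production file:

```yaml
# docker-compose.yml — central monitoring stack (illustrative sketch)
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml  # scrape targets live here
      - prometheus-data:/prometheus                      # persist the TSDB
    ports:
      - "9090:9090"

  loki:
    image: grafana/loki:latest
    command: -config.file=/etc/loki/local-config.yaml    # default config shipped in the image
    volumes:
      - loki-data:/loki
    ports:
      - "3100:3100"

  grafana:
    image: grafana/grafana:latest
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=change-me             # assumption: set via env for first login
    ports:
      - "3000:3000"
    depends_on:
      - prometheus
      - loki

volumes:
  prometheus-data:
  loki-data:
```

Because everything is declared in one file, `docker compose up -d` on a fresh EC2 instance reproduces the whole control plane.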



Why this over the ELK stack? Loki is index-free by design: it stores logs as compressed chunks and queries them using labels. For our scale, this means dramatically lower storage and compute costs without sacrificing searchability.
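In practice that means a LogQL query narrows by labels first (which are indexed) and only then scans the matching chunks. A small sketch — the label names and the client value `acme` are assumptions for illustration:

```logql
# All Nginx error lines for one client's production servers
{client="acme", environment="prod", job="nginx"} |= "error"

# The label selector {…} is cheap: only labels are indexed.
# The |= line filter greps just the chunks those labels select,
# so no full-text index is ever built or stored.
```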

The Architecture

Each client server runs three lightweight agents, which ship everything to a central EC2 instance where Prometheus, Loki, and Grafana live inside Docker containers.

// System Architecture Flow

Client Servers (600–700)
  │
  ├── Node Exporter     → CPU, RAM, Disk, Network metrics
  ├── cAdvisor          → Container resource metrics
  └── Promtail          → App, Nginx, System & Docker logs
  │
  ▼
Central EC2 — Dockerized Stack
  │
  ├── Prometheus        → Scrapes & stores time-series metrics
  ├── Loki              → Aggregates & indexes logs by labels
  └── Grafana           → Unified dashboards + alerting
  │
  ▼
Alerts → Email / Slack / Webhook

The Multi-Client Challenge

600+ servers means many different clients. The key design decision was label-based isolation: every Promtail agent tags its logs with identifying labels such as client, environment, and host.
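A sketch of what that tagging looks like in a Promtail scrape config — the label names, values, and file path below are illustrative assumptions, not the exact production labels:

```yaml
# promtail-config.yaml — runs on every client server (illustrative sketch)
scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          client: acme-corp          # which customer this server belongs to
          environment: prod          # prod / staging / dev
          host: web-01               # server identity
          job: varlogs
          __path__: /var/log/*.log   # files Promtail tails and ships to Loki
```

Every log line this server emits arrives in Loki already carrying these labels, so isolation between clients is enforced at query time by the label selector.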

In Grafana, dashboard variables are templated, so an operator can switch between client views instantly, or drill down to a specific server, environment, or log level with a single dropdown. No extra dashboards to maintain.
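Under the hood this is a handful of template variables in the dashboard definition. A conceptual fragment of the dashboard JSON — variable names and the datasource name are assumptions:

```json
{
  "templating": {
    "list": [
      {
        "name": "client",
        "type": "query",
        "datasource": "Loki",
        "query": "label_values(client)"
      },
      {
        "name": "environment",
        "type": "query",
        "datasource": "Loki",
        "query": "label_values({client=\"$client\"}, environment)"
      }
    ]
  }
}
```

Panels then reference the variables in their queries, e.g. `{client="$client", environment="$environment"}`, so changing a dropdown re-scopes every panel at once — one dashboard serves all clients.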

Alerting That Actually Works

Alerts are configured directly in Prometheus with rules that fire when thresholds are crossed.

Every alert routes to the right channel — email, Slack, or webhook — depending on severity and client.
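A sketch of what such rules can look like. The thresholds, durations, and the assumption that scrape configs attach a `client` label to each target are mine, not the exact production rules:

```yaml
# alert-rules.yml — loaded by Prometheus (illustrative thresholds)
groups:
  - name: node-alerts
    rules:
      - alert: HighCPUUsage
        # Idle CPU below 10% for 10 minutes; assumes a "client" label
        # is attached to targets in the scrape config.
        expr: 100 - (avg by (instance, client) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU above 90% on {{ $labels.instance }} ({{ $labels.client }})"

      - alert: DiskAlmostFull
        expr: node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes < 0.10
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Less than 10% disk left on {{ $labels.instance }}"
```

The `severity` and `client` labels on each firing alert are what the routing layer matches on to pick email, Slack, or a webhook.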

Key Results & Learnings

  1. Reduced incident detection time from hours to minutes with real-time alerting

  2. Centralized log search eliminated the need to SSH into individual servers for debugging

  3. Loki proved significantly cheaper than ELK at this scale — no full-text indexing overhead

  4. Dashboard templating was the game-changer for multi-client visibility

  5. Docker Compose made the entire stack reproducible and easy to upgrade

  6. Label discipline from day one made filtering and querying effortless at 600+ servers

What's Next?

The stack is already handling 600–700 servers comfortably, but there's always room to grow.

Next on the roadmap: scaling Prometheus with federation or Thanos for long-term metric storage, and exploring Grafana Alloy to replace Promtail for a unified telemetry agent.

If you're dealing with fragmented infrastructure visibility, I genuinely believe this open-source stack is one of the best investments you can make.

Zero license cost. Massive flexibility. Battle-tested at scale.

Building something similar or facing observability challenges at scale?

Drop a comment or connect — happy to share configs, lessons learned, or go deeper on any part of this architecture. Let's talk observability.

