You Don't Need Kubernetes to Monitor 20 Linux VMs

#devops #linux #monitoring #selfhosted

If you've ever tried to set up Prometheus by following the official getting-started path, you've likely found that it doesn't match your infrastructure model. Out of the gate, page one mentions kube-prometheus-stack. Page two wants you to install a Helm chart, and page three assumes you already have a cluster running. The documentation for monitoring plain Linux servers is in there somewhere, but you have to dig for it. When you do find it, the tone suggests you are doing something slightly old-fashioned.

If plain Linux servers are your setup, the tooling is making this harder than it actually is. Monitoring a fleet of Linux VMs is fairly simple and has been for years. It is just obscured behind documentation that would prefer to sell you something bigger.

Modern infrastructure tooling has quietly decided everyone runs Kubernetes. If you don't, the assumption is that you eventually will. Meanwhile, most real-world infrastructure still runs on VMs.

TL;DR: Modern observability documentation often assumes you're running Kubernetes. Most small teams aren't. If you're managing a fleet of Linux VMs, node_exporter plus Prometheus gives you everything you need for infrastructure monitoring with a single lightweight agent and a straightforward deployment model. No cluster required.

VMs are often the answer

For most small businesses, running VMs instead of Kubernetes does not mean you failed to evolve. Most workloads under a certain scale are simpler to run on VMs:

You get one process per box, predictable resource limits, and the ability to ssh in and look at what's happening, which makes the infrastructure as a whole easier to keep track of.
They're cheaper, both financially and in the mental overhead of running them.
Backups and snapshots are straightforward in a way stateful Kubernetes still isn't.
There's no control plane that itself needs monitoring and upgrades and care.

Kubernetes solves problems that mostly pertain to companies with dozens of engineers and hundreds of services. For platforms that consist of 20 VMs, Kubernetes is the wrong tool, and being told you need it before you're allowed to have monitoring is the wrong approach.

What node_exporter actually is

What you need is called node_exporter, a lightweight process you run under systemd.

It's a single Go binary, around 25 MB. It runs as one process on each VM, reads metrics from the kernel through /proc and /sys, and exposes them on an HTTP endpoint, normally port 9100. There's no DaemonSet, operator, sidecar, CRD, cluster, or control plane. It runs quietly in the background and answers HTTP requests with a plain-text list of numbers. You can curl it yourself and read it:

curl http://<localhost or IP>:9100/metrics

What comes back is a few hundred lines of metrics containing CPU time per core per mode, memory broken down by category, disk space per mountpoint, network bytes per interface, load, uptime, and open file handles. It tells you everything the kernel knows about the server, in a format Prometheus reads directly.
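Because the output is plain text, you can pull a single number out of it with standard tools. Here is a minimal sketch; metric_value is a hypothetical helper name of mine, and it only handles samples whose name and labels contain no spaces.

```shell
# metric_value NAME: read Prometheus text-format metrics on stdin and print
# the value of the first sample whose first field equals NAME.
# Comment lines (# HELP, # TYPE) never match because their first field is "#".
metric_value() {
  awk -v name="$1" '$1 == name { print $2; exit }'
}

# Usage against a live exporter (host and port are assumptions):
#   curl -s http://localhost:9100/metrics | metric_value node_load1
```

This is only for poking at a box by hand. Prometheus does the real parsing for you.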

The agent the big observability vendors want to install on your servers is doing this same job. It reads from /proc and exposes metrics, but they've wrapped it in a config model and an update mechanism and a logo. The core of it is what node_exporter has been doing for over a decade. You are not missing out on some sophisticated technology by keeping your system simple. The simple, plain version is the technology.

Setting up one VM

Here's the actual setup on a single box. Check the releases page for the current version before you run this; the version string changes.

# Download the binary
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz

# Extract and install
tar xzf node_exporter-1.8.2.linux-amd64.tar.gz
sudo mv node_exporter-1.8.2.linux-amd64/node_exporter /usr/local/bin/

# Run it as its own unprivileged user
sudo useradd --no-create-home --shell /bin/false node_exporter

# systemd unit
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<'EOF'
[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

# Start it
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter

# Confirm it's alive
curl http://localhost:9100/metrics | head

These eight commands take under five minutes to run. The exporter sits at roughly 20 MB of RAM and you'll likely forget it's there. One thing you should do is lock down port 9100: leave it open to your monitoring server and nothing else. node_exporter exposes details about your system, so it shouldn't be reachable from the public internet.
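One way to lock the port down, sketched under the assumption that ufw is your host firewall. The function name, the RUN variable, and the 10.0.0.5 address are placeholders of mine; with iptables, nftables, or a cloud security group the idea is the same but the commands differ.

```shell
# restrict_exporter_port MONITOR_IP [PORT]: allow only the monitoring server
# to reach node_exporter, then deny everyone else.
# RUN defaults to sudo; set RUN=echo to print the commands instead of running them.
restrict_exporter_port() {
  local monitor_ip="$1" port="${2:-9100}"
  # ufw checks rules in the order they were added, so the allow goes first.
  ${RUN:-sudo} ufw allow from "$monitor_ip" to any port "$port" proto tcp
  ${RUN:-sudo} ufw deny "$port/tcp"
}

# Dry run with a placeholder address:
#   RUN=echo restrict_exporter_port 10.0.0.5
```

Run it with RUN=echo first and read what it would do before you let it touch a live firewall.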

It is a little repetitive

The same setup runs on every machine and is the same for almost all Linux distributions, so once you have more than 5 to 10 servers to monitor it's worth automating. There are a few ways to do that.

If you're already using Ansible, the node_exporter playbook is about 30 lines and is one of the most copy-pasted snippets out there. The cloudalchemy.node_exporter role, whose maintenance has since moved to the prometheus.prometheus collection from prometheus-community, does it for you with reasonable defaults if you'd rather not write your own.

You can also use a shell loop over ssh if you don't want to add new tooling. Walk your hostnames, ssh in, run the commands above. Twenty boxes will probably take around ten minutes.
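A rough sketch of that loop, assuming you have saved the install commands from the previous section as a script and that your ssh user can run sudo without a password prompt on each box. The function name, the SSH variable, and both file names are hypothetical.

```shell
# deploy_all HOSTS_FILE SCRIPT: run SCRIPT on every host listed in HOSTS_FILE
# (one hostname per line; blank lines and lines starting with # are skipped).
# SSH defaults to ssh; set SSH=echo for a dry run.
deploy_all() {
  local hosts_file="$1" script="$2" host
  while IFS= read -r host; do
    case "$host" in ''|'#'*) continue ;; esac
    echo "== $host"
    # "bash -s" reads the script from stdin, so nothing is copied to the host first.
    ${SSH:-ssh} "$host" 'bash -s' < "$script" || echo "!! $host failed" >&2
  done < "$hosts_file"
}

# Example (file names are placeholders):
#   deploy_all hosts.txt install-node-exporter.sh
```

A failure on one host is reported and the loop moves on, so one unreachable box doesn't block the other nineteen.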

If you spin servers up and down often using a VM image or cloud-init, you can just include node_exporter in the base image. Every new VM will show up already monitoring itself.

The monitoring side is one Prometheus instance pointed at the list of servers you want to monitor:

# prometheus/prometheus.yml
scrape_configs:
  - job_name: 'linux-vms'
    static_configs:
      - targets:
          - vm1.example.com:9100
          - vm2.example.com:9100
          - vm3.example.com:9100
          # ...the rest of them
        labels:
          environment: production

For 20 boxes, that static list is genuinely fine. If you add and remove servers a lot, file_sd_configs lets Prometheus pick up target changes from a file without a restart, which carries you much further. The setup isn't too much more complicated:

# prometheus/prometheus.yml
scrape_configs:
  - job_name: 'linux-vms'
    file_sd_configs:
      - files:
          - /etc/prometheus/file_sd/linux-vms.yml
        refresh_interval: 30s

This layout adds a file_sd directory next to prometheus.yml (under /etc/prometheus in the config above):

prometheus/
├── prometheus.yml
└── file_sd/
    └── linux-vms.yml

# file_sd/linux-vms.yml
- targets:
    - vm1.example.com:9100
    - vm2.example.com:9100
  labels:
    environment: production
    role: web

- targets:
    - db1.example.com:9100
  labels:
    environment: production
    role: database

- targets:
    - staging1.example.com:9100
  labels:
    environment: staging
    role: web

If you put each server directly into prometheus.yml, you have to reload or restart Prometheus every time you add one. When your servers live in the file under file_sd, Prometheus picks up changes automatically, by the refresh interval at the latest. That's a little extra structure up front, so if your infrastructure is largely static it isn't really worth it. If you're constantly onboarding or removing servers, the extra layer removes a lot of the maintenance.
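If you script the target file instead of editing it by hand, write it to a temporary file and rename it into place so Prometheus never reads a half-written file. A minimal sketch, with write_targets as a hypothetical helper that emits one target group in the same shape as the file_sd example above:

```shell
# write_targets OUT ENV ROLE HOST...: write one file_sd target group to OUT.
# The file is built in a temp file in the same directory and renamed into
# place, which is atomic on the same filesystem.
write_targets() {
  local out="$1" env="$2" role="$3" tmp host
  shift 3
  tmp=$(mktemp "$out.XXXXXX") || return 1
  {
    echo "- targets:"
    for host in "$@"; do
      echo "    - $host:9100"
    done
    echo "  labels:"
    echo "    environment: $env"
    echo "    role: $role"
  } > "$tmp"
  chmod 644 "$tmp"   # mktemp creates the file 0600; Prometheus may run as another user
  mv "$tmp" "$out"
}

# Example (path matches the config above; hostnames are placeholders):
#   write_targets /etc/prometheus/file_sd/linux-vms.yml production web vm1.example.com vm2.example.com
```

This version writes a single group per file. If you want the web, database, and staging groups from the example, one file per group listed under files works too.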

What you can actually see

With node_exporter on every VM and one Prometheus pulling from them, here are real questions you can answer:

CPU across the whole fleet for the last hour: one query over node_cpu_seconds_total, split by instance.
Which box is closest to full: node_filesystem_avail_bytes against node_filesystem_size_bytes.
When vm7 last rebooted: node_boot_time_seconds.
Which box is dropping the most packets: a rate over node_network_receive_drop_total.
Whether memory has been slowly tightening on anything over the past week: node_memory_MemAvailable_bytes plotted across all instances.
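To make those concrete, here is a sketch that maps each question to a PromQL expression and runs it as an instant query through the Prometheus HTTP API. The expressions are illustrative starting points, not tuned queries; fleet_query, prom_query, and the PROM_URL default are names I made up for this example.

```shell
# fleet_query NAME: print a PromQL expression for one of the questions above.
fleet_query() {
  case "$1" in
    cpu)    echo '100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))' ;;
    disk)   echo 'sort_desc(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes)' ;;
    boot)   echo 'node_boot_time_seconds' ;;
    drops)  echo 'topk(1, sum by (instance) (rate(node_network_receive_drop_total[5m])))' ;;
    memory) echo 'node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes' ;;
    *)      echo "unknown query: $1" >&2; return 1 ;;
  esac
}

# prom_query NAME: send the expression to /api/v1/query and print the JSON reply.
# PROM_URL is an assumption; point it at your own Prometheus server.
prom_query() {
  local q
  q=$(fleet_query "$1") || return 1
  curl -s "${PROM_URL:-http://localhost:9090}/api/v1/query" --data-urlencode "query=$q"
}

# Example:
#   PROM_URL=http://prometheus.example.com:9090 prom_query disk
```

An instant query answers "right now". For the last hour or the past week, graph the same expression in Grafana or use the API's query_range endpoint with a start, end, and step.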

Everything can be viewed in Grafana using queries written in PromQL. In a separate post, I wrote up the five basic queries you need to monitor a Linux server, with each one explained in detail.

That covers what a small fleet typically needs. Monitoring doesn't require Kubernetes or a giant vendor like Datadog. It takes a Go binary on each box and one instance each of Prometheus and Grafana.

Maintenance costs

Getting node_exporter onto 20 VMs and setting up Prometheus and Grafana is relatively easy. It's all open source and available to anyone. What most teams underestimate is everything that comes after: keeping Prometheus healthy and the prometheus.yml and file_sd/*.yml files up to date, building functional dashboards, writing alert rules that fire on real problems without creating noise, sorting out retention, getting alerts somewhere a human will actually see them, and keeping all of it patched as each piece ships new versions. That is ongoing operational work somebody has to own, and it grows in complexity with the fleet. On top of that, the monitoring stack itself can go down, which takes time and effort to troubleshoot and fix.

If you like that sort of work, or you have dedicated people who can take on the additional load, node_exporter, Prometheus, and Grafana are excellent. If you have the money to spend, Datadog is a great option.

Where Irin comes in

Because maintaining the monitoring stack is a burden most small businesses don't have the time or resources for, I built Irin Observability. You keep your attention on running your business and watch your servers through dashboards and alerts that are already built and tuned. Instead of node_exporter, Irin uses Grafana Alloy as the agent. It covers the same infrastructure metrics, ships your logs, supports additional telemetry pipelines, and installs with a single bootstrap command. Instead of a pull-based model that requires you to open a port to your monitoring server, it pushes your data out through an encrypted Cloudflare tunnel. Your dashboards, alerts, and retention live on Irin's infrastructure. The only thing on your boxes is the agent, and it stays out of the way.

The pitch really isn't the point, though, and I'm only scratching the surface of what node_exporter or Alloy can do. The point is that the docs may be telling you a story that isn't true for your situation. You do not need Kubernetes to watch a handful of Linux servers. You need a small binary on each box and something to scrape it. Run that something yourself or pay someone to run it; either is fine. The architecture underneath is simple no matter who operates it, and it's been sitting in plain sight the whole time under a pile of cloud-native marketing.