DEV Community: Justyn Larry

Migrating from Promtail to Alloy for Log Collection

Justyn Larry — Thu, 23 Jul 2026 14:44:33 +0000

TL;DR: Promtail reached end of life in March 2026, making Alloy the supported replacement for shipping logs to Loki. The migration is mostly a matter of replacing Promtail's scrape configuration with loki.source.*, loki.process, and loki.write components. The biggest pitfall isn't the configuration itself—it's assuming how a host writes logs. Detect whether a machine uses the systemd journal or log files before configuring Alloy, or you may end up silently collecting nothing.

In an earlier article I discussed moving host metrics from a scraped node_exporter to a push-based Alloy agent, and followed it up with a piece comparing Alloy against Prometheus Agent mode. Both of those articles discussed metrics and how they move between the host and client nodes. This post covers the log-shipping portion of the same story, and it has a deadline attached that the metrics migration didn't.

Promtail reached end of life on March 2, 2026. It continues to function, but it no longer receives bug fixes or security updates. If you are shipping logs to Loki with Promtail today, you are running an unmaintained agent, and the replacement Grafana points you at is the same Alloy you may already be running for metrics. If you're already running Alloy for metrics, the log migration is mostly adding components to a config that already exists. If you're not, this article will strengthen the case for consolidating onto Alloy rather than running a separate, now-unmaintained log agent.

             Promtail

  Journal/File Logs
         │
         ▼
     Promtail
         │
         ▼
       Loki


             Alloy

 Journal/File Logs
         │
         ▼
  loki.source.*
         │
         ▼
   loki.process
         │
         ▼
    loki.write
         │
         ▼
        Loki

What actually changes

Promtail is a standalone binary with its own YAML config: scrape_configs, clients, and a positions file, built for one job. Alloy replaces that YAML configuration with a component-based pipeline. The host produces log entries, Alloy processes them if required, and forwards them to Loki. There are three component types, all wired together with forward_to references.

Promtail	Alloy
Journal scrape	`loki.source.journal`
File scrape	`loki.source.file`
Pipeline stages	`loki.process`
`clients`	`loki.write`

Between the source and destination, loki.process handles parsing, filtering, label manipulation, and other pipeline stages that previously lived inside Promtail.

The journal case

loki.source.journal "systemd_journal" {
  forward_to = [loki.process.add_labels.receiver]
}

The source reads the journal and hands each line to a loki.process component named add_labels, which is where I attach the same tenant, cluster, environment, and role labels that ride along with the metrics from that host to help maintain separation between clients. You can read more about my tenant isolation model here. Labeling logs and metrics identically at the edge is what lets me line them up later in Grafana, and it is worth getting consistent from the first host rather than fixing it in queries forever after.

The file-based case

local.file_match "logs" {
  path_targets = [
    {"__path__" = "/var/log/syslog"},
    {"__path__" = "/var/log/auth.log"},
    {"__path__" = "/var/log/messages"},
    {"__path__" = "/var/log/secure"},
  ]
}

loki.source.file "log_scrape" {
  targets    = local.file_match.logs.targets
  forward_to = [loki.process.add_labels.receiver]
}

local.file_match resolves glob patterns into concrete targets, and loki.source.file tails whatever it finds. I list both Debian-family paths and RHEL-family paths because which files exist depends on the distro, and missing paths are simply skipped. In practice I detect which log method a host actually uses at install time and inject only the matching block.

The trap: don't guess the log method by distro

I learned this lesson the hard way. I assumed Debian meant /var/log/syslog and RHEL meant journald.

I thought the obvious way to decide between journal and file collection was to base it on distro family. That doesn't cover edge cases. Plenty of Debian systems have rsyslog disabled or absent and log only to the journal, while some hosts have been configured differently by whoever set them up. Alloy starts cleanly, reports itself healthy, and quietly ships no logs because it is watching the wrong source. The first indication anything is wrong is often during an incident, when the logs you expected simply aren't there.

What actually works is an empirical check. Determine whether the host is actively writing journal entries or log files, then configure Alloy accordingly.
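One way to make that check concrete is to look at where the host has written recently. The sketch below is a minimal illustration, not my installer's actual logic: the function names, the ten-minute freshness window, and the choice to prefer the journal when both sources are active are all assumptions. File modification time is only a rough proxy, and journal files in particular can lag behind the latest entry.

```python
import os
import time
from typing import Iterable, Optional


def newest_mtime(paths: Iterable[str]) -> Optional[float]:
    """Return the most recent modification time found under the given paths."""
    newest = None
    for path in paths:
        if os.path.isdir(path):
            # Journal directories contain per-machine subdirectories of files.
            candidates = [
                os.path.join(root, name)
                for root, _dirs, files in os.walk(path)
                for name in files
            ]
        else:
            candidates = [path]
        for candidate in candidates:
            try:
                mtime = os.path.getmtime(candidate)
            except OSError:
                continue  # Missing or unreadable: nothing is being written here.
            if newest is None or mtime > newest:
                newest = mtime
    return newest


def detect_log_method(journal_dirs, log_files, max_age_seconds=600, now=None):
    """Return "journal", "file", or "unknown" based on recent write activity."""
    now = time.time() if now is None else now
    journal = newest_mtime(journal_dirs)
    files = newest_mtime(log_files)
    if journal is not None and now - journal <= max_age_seconds:
        return "journal"  # Assumption: prefer the journal when both are active.
    if files is not None and now - files <= max_age_seconds:
        return "file"
    return "unknown"  # Nothing fresh: stop and investigate instead of guessing.
```

A call such as `detect_log_method(["/var/log/journal", "/run/log/journal"], ["/var/log/syslog", "/var/log/messages"])` returns which block to inject. The "unknown" result matters most: it is the case where a guess would have produced a healthy agent that ships nothing.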

Migrating without a gap

Stand up Alloy's log collection alongside Promtail rather than cutting over immediately. Both can ship to the same Loki, and the extra resource overhead is negligible. You will get some duplicate lines during the overlap, which is a much better failure mode than a hole in your logs. Confirm in Grafana that the Alloy-sourced logs are arriving with the correct labels, then remove Promtail.

Loki drops an incoming entry only when it exactly duplicates one already in the same stream: same labels, same timestamp, same line. If Alloy adds even one different label, the same log line becomes part of a different stream and both copies remain visible. Match your label set before tearing the old agent down.
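A quick way to check that the label sets match is to compare them directly. This is a small sketch under assumptions: the label names and values are hypothetical, and the two dictionaries would be copied by hand from the Promtail-sourced and Alloy-sourced streams as shown in Grafana.

```python
from typing import Dict, Optional, Tuple

Labels = Dict[str, str]


def label_diff(promtail: Labels, alloy: Labels) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Map each mismatched label name to its (promtail, alloy) values."""
    diff = {}
    for name in sorted(set(promtail) | set(alloy)):
        old, new = promtail.get(name), alloy.get(name)
        if old != new:
            diff[name] = (old, new)  # None means the label is absent on that side.
    return diff


def same_stream(promtail: Labels, alloy: Labels) -> bool:
    """Two entries can only land in the same stream if every label matches."""
    return not label_diff(promtail, alloy)


# Hypothetical label sets for the same host during the overlap window.
promtail_labels = {"tenant": "acme", "job": "varlogs", "host": "web-1"}
alloy_labels = {"tenant": "acme", "job": "varlogs", "host": "web-1", "role": "web"}

print(label_diff(promtail_labels, alloy_labels))
```

Here the only difference is the extra role label on the Alloy side, which is enough to split the streams. An empty result means the streams line up and the overlap period can end.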

Where this leaves you

If you already migrated metrics to Alloy, adding log collection is just a handful of additional components. If you are still on Promtail, the March 2026 end-of-life date makes this migration worth prioritizing. The migration itself is straightforward. The only genuinely dangerous part is the silent-failure trap, and that's avoidable if you verify how each host actually writes logs instead of assuming based on its distro.

If you've already made this move, I'm curious whether you ran into the journal-versus-file mismatch too, or whether your fleet was uniform enough that the distro heuristic held.

Monitoring Docker Containers with Grafana Alloy and cAdvisor

Justyn Larry — Tue, 21 Jul 2026 12:29:15 +0000

TL;DR: Host metrics tell you whether a server is healthy. cAdvisor tells
you which container isn't. This article shows how to integrate cAdvisor
into a Grafana Alloy push architecture, avoid the common cardinality
trap, and build a few alerts that catch real problems.

Container Monitoring

In my previous posts, I compared Prometheus Agent Mode with Grafana
Alloy and walked through migrating from node_exporter to Alloy. Both
focused on the agent responsible for shipping host metrics and logs
upstream.

The natural next step is monitoring Docker containers.

One of my servers runs eighteen Docker containers spread across multiple
Compose files. Host-level CPU and memory metrics can tell me the server
is healthy, but they cannot tell me which container is consuming all
of the memory or unexpectedly restarting.

cAdvisor solves that problem. Originally developed by Google, it reads
container resource usage directly from Linux cgroups and namespaces
without requiring any instrumentation inside the containers themselves.
In this article I'll show how it fits cleanly into a push-based Alloy
architecture.

Deployment

cAdvisor runs as its own container with several read-only mounts so it
can inspect Docker and the host's cgroup state:

docker run \
  --name cadvisor \
  --privileged \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:ro \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker:/var/lib/docker:ro \
  --volume=/dev/disk:/dev/disk:ro \
  --publish=<port>:8080 \
  gcr.io/cadvisor/cadvisor:latest

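Although --privileged often raises eyebrows, every mounted volume is
read-only, so cAdvisor observes host state through those mounts rather
than modifying it. The flag itself still grants the container broad
access to the host, so treat the image as trusted software.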

I also recommend avoiding a hardcoded 8080 mapping. It is one of the
most frequently occupied ports on development machines and small
servers. My installer probes for an available port and falls back to
9338, then reports the selected port back so Alloy can generate the
correct scrape target automatically.
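The port probe can be sketched in a few lines. This is an illustration rather than the installer's actual code: the function names are hypothetical, and trying 8080 first before falling back to 9338 is my reading of the behavior described above. Binding to 0.0.0.0 mirrors how Docker publishes ports by default.

```python
import socket


def is_port_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can currently bind to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False  # Typically "address already in use".
    return True


def pick_port(preferred: int = 8080, fallback: int = 9338, host: str = "0.0.0.0") -> int:
    """Pick the preferred port, then the fallback, then any port the OS offers."""
    for candidate in (preferred, fallback):
        if is_port_free(candidate, host):
            return candidate
    # Both are taken: bind to port 0 so the OS assigns a free ephemeral port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


if __name__ == "__main__":
    print(pick_port())
```

The probe is inherently racy, since another process can claim the port between the check and the docker run. For an installer that runs once per host that is usually an acceptable trade-off, but the selected port should still be verified after the container starts.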

Wiring it into Alloy

Rather than exposing cAdvisor to a central Prometheus server, Alloy
scrapes it locally over localhost and forwards those metrics using the
same remote_write pipeline that already carries host metrics and logs.

                    Docker Host
┌──────────────────────────────────────────┐
│                                          │
│  Docker Containers                       │
│        │                                 │
│        ▼                                 │
│    cAdvisor                              │
│        │ localhost scrape                │
│        ▼                                 │
│   Grafana Alloy                          │
│        │ remote_write                    │
└────────┼─────────────────────────────────┘
         │
         ▼
┌───────────────────────────────┐
│ Central Monitoring            │
│                               │
│ Prometheus                    │
│ Loki                          │
│ Grafana                       │
│ Alertmanager                  │
└───────────────────────────────┘

From the backend's perspective, container metrics arrive exactly like
host metrics: already labeled, already authenticated, and without
requiring any inbound connectivity to the client.

This becomes especially valuable for clients behind CGNAT or dynamic
residential IP addresses. Once a push-based pipeline exists, adding
another local exporter is simply another scrape target, not another
monitoring system.

Avoiding the Cardinality Trap

One detail that many getting-started guides overlook is that cAdvisor
exports metrics for every cgroup it can see, not just your Docker
containers.

Without filtering, dashboards become cluttered with unnamed
infrastructure cgroups while your active series count grows for little
benefit.

The simplest fix is to consistently filter container metrics using a
PromQL label matcher:

container_cpu_usage_seconds_total{name!=""}

Alternatively, you can drop unnamed series at scrape time using metric
relabeling in Alloy or Prometheus.
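Before deciding whether to filter at query time or drop at scrape time,
it helps to know how many series are actually unnamed. The sketch below
counts them from the text of a cAdvisor /metrics response. It is a rough
illustration: the function name is hypothetical, and the regex-based
parsing assumes label values contain no escaped quotes.

```python
import re
from collections import Counter

# Match the label called exactly "name", not "container_name" or similar.
NAME_LABEL = re.compile(r'[{,]name="([^"]*)"')


def count_named_series(exposition_text: str) -> Counter:
    """Count container_* series with and without a non-empty name label."""
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # Skip blanks and HELP/TYPE comments.
        if not line.startswith("container_"):
            continue
        match = NAME_LABEL.search(line)
        named = bool(match and match.group(1))
        counts["named" if named else "unnamed"] += 1
    return counts
```

Series without a name label at all are counted as unnamed, which
matches how the name!="" matcher treats them. If the unnamed count
dwarfs the named one, dropping those series before they leave the host
is probably worth the extra configuration.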

Three Useful Alerts

Restart loops:

changes(container_start_time_seconds{name!=""}[1h]) > 3

High sustained CPU:

rate(container_cpu_usage_seconds_total{name!="",name!="POD"}[5m]) * 100 > 80

Container memory as a percentage of total host memory:

(container_memory_usage_bytes{name!=""}
 / on(instance, tenant) group_left node_memory_MemTotal_bytes) * 100 > 85

That final query uses a group_left join to compare a container-level
metric against host memory, producing a percentage that's much easier to
reason about than raw bytes.

Where This Fits

In Irin, Docker monitoring is implemented as an optional module, but the
architecture described here works with any push-based monitoring stack.
Once you've adopted a push-based agent, adding exporters like cAdvisor
becomes incremental work.

The difficult part isn't collecting the metrics. It's deciding which
ones are worth keeping.

If you're already running cAdvisor, I'd be interested to hear whether
you've run into the unnamed cgroup problem or found a filtering strategy
that works even better.

Why LLM Decisions Should Be Deterministic

Justyn Larry — Wed, 15 Jul 2026 14:07:14 +0000

TL;DR: I originally treated deterministic boundaries around LLMs as a consistency mechanism. I now think their real value is auditability. If the system's decisions are made by deterministic code rather than the model, every decision has a reproducible implementation. The LLM can explain that decision to humans, but it should never be the source of the decision itself.

In two previous posts I argued for keeping narration and decision-making separate. One covered monthly reporting. The other covered a real-time alert annotator. Both focused on consistency. This post argues that consistency is only the visible benefit; auditability is the deeper one. By auditability, I mean that a third party can inspect how a decision was reached and reproduce it from the implementation rather than relying on an after-the-fact explanation.

The Deterministic Layer

The alert annotator classifies every alert into one of eight values, enforced in Python against a fixed set, not requested in a prompt. If the model's output does not match one of the eight strings exactly, the field falls back to unknown rather than accepting whatever the model produced. That validation step is more important than the enum itself.

ALLOWED_CAUSES = {
    "memory_pressure", "cpu_saturation", "disk_pressure",
    "service_unavailability", "network_issue",
    "configuration_error", "external_dependency", "unknown",
}

def resolve_cause(model_output: str) -> str:
    cause = model_output.strip()
    if cause not in ALLOWED_CAUSES:
        return "unknown"
    return cause

Given the same input, this function always produces the same output. That means every classification is reproducible from the source code alone.

This is the shape of the check, not a copy of the production code, but the principle is that you should never trust an external system's output by letting it pass through unchecked. This is not an LLM-specific idea; it is the same discipline that should apply to a third-party webhook payload or an API response before it touches your data model. The model just happens to be the least predictable external system I integrate with, which makes the validation step the most visibly valuable.

I originally wrote about this as a consistency fix and failed to discuss its full value. Before the enum existed, the same alert produced different category strings across separate runs, which made cross-tenant pattern analysis impossible. What the validation step actually guarantees is that every classification belongs to a bounded, deterministic set of outcomes. resolve_cause() is a pure function, and every valid result is reproducible from its implementation. The same input yields the same output every time, and the entire decision is just sixteen lines of Python.
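To see the fallback behave as described, the check can be exercised against a few model outputs. The sample strings below are made up for illustration; the function itself is the same shape as the one shown earlier.

```python
ALLOWED_CAUSES = {
    "memory_pressure", "cpu_saturation", "disk_pressure",
    "service_unavailability", "network_issue",
    "configuration_error", "external_dependency", "unknown",
}


def resolve_cause(model_output: str) -> str:
    """Accept the model's answer only if it is exactly one of the allowed values."""
    cause = model_output.strip()
    if cause not in ALLOWED_CAUSES:
        return "unknown"
    return cause


# Hypothetical model outputs, including the near-misses that motivated the check.
samples = ["cpu_saturation", "  disk_pressure\n", "CPU saturation", "memory-pressure", ""]
for raw in samples:
    print(repr(raw), "->", resolve_cause(raw))
```

Only the first two samples survive, because surrounding whitespace is the one variation the function tolerates. The misspelled and differently cased variants are not normalized into a valid category; they become unknown, which keeps the decision rule small enough to audit by reading it.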

Traditional pipeline:

Metrics
   |
   v
Deterministic rules
   |
   |-- classification
   |
   v
LLM narration
   |
   v
Client

LLM-centric pipeline:

Metrics
   |
   v
LLM
   |
   |-- classification
   |-- explanation
   |
   v
Client

The important distinction isn't whether an LLM appears accurate most of the time. It's whether a reviewer can reproduce the decision months later from the same inputs. In the first architecture, the decision exists independently of the model's narration. In the second, the classification and its explanation originate from the same probabilistic process, making them difficult to separate during an audit.

Industry Trends

It turns out a great deal of current AI governance work is aimed at exactly this property, approached from the opposite direction. A large part of the LLM governance tooling market exists because production language model behavior does not offer the guarantee classical software does. The same prompt can produce different answers across runs, so teams are building trace-level evidence systems, output evaluators, and audit logging specifically to reconstruct what a probabilistic system did and why after the fact.

The regulatory backdrop makes the stakes explicit rather than abstract. Under the EU AI Act, certain classes of LLM application are treated as high-risk and come with mandatory logging and human-oversight requirements, with an explicit standard that records must be sufficient to reconstruct the system's operation, not just gesture at it. NIST's AI Risk Management Framework leans on the same assumption from a different angle, treating a reliable and auditable record of system behavior as a prerequisite for its governance functions.

None of that applies to the projects that I'm working on directly. Advisory infrastructure monitoring for small server fleets is not a high-risk category, and I am not building toward EU AI Act compliance. But the underlying problem is the same problem, just at a different scale and a different level of legal consequence. If a decision matters enough that someone might reasonably ask "why did the system do that," the honest answer needs to survive more scrutiny than a model's own account of itself.

Self-Reporting Is Not an Explanation

You can ask a language model why it produced a given output, and it will give you a fluent, plausible answer, but it will not give you the actual causal mechanism. The explanation is generated by a new inference over the conversation history, not by inspecting the internal computation that produced the original answer. In effect, the model makes a guess and presents it as a first-hand account.

Recent AI governance research states the practical consequence directly: systems that need genuinely reproducible decisions should not rely on a probabilistic layer alone. They should put a deterministic enforcement layer underneath it as a secondary safeguard. That is a formal way of describing something a lot of people building on LLMs arrive at independently once they hit production. I arrived at it because an enum kept coming back spelled three different ways, before I had even thought about what the industry was doing. Other people are arriving at it because a regulator asked for a record that would hold up. We're both coming to the same conclusion, but the reasons behind it are different.

Classical software doesn't explain itself; we inspect the implementation. A pure function doesn't tell us why it returned cpu_saturation; we read the code that produced that output. LLMs invert that relationship. They readily generate explanations, but those explanations are themselves model outputs rather than evidence of the underlying computation.

What Does Narration Actually Do?

None of this makes the narration layer pointless; it just refines its allowed job. The prose the model writes for a client report, or the plain-English line attached to an alert, is still probabilistic. I can log exactly what it said. I cannot claim a rigorous account of why it phrased something one way over another, but I do not need one, because that text is not load-bearing. It explains a decision, but it doesn't make one. If the narration layer disappeared entirely tomorrow, every report and every alert would still carry a correct, if blunter, classification, because the classification does not depend on the prose existing.

That is the actual shape of the boundary. Not "the model is untrustworthy so keep it away from important things," which is too broad to be useful, but "know exactly which outputs in your pipeline need to be reproducible, and make sure none of those outputs pass through a step you cannot validate against a fixed answer."
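In code, that boundary can look something like the sketch below. It is an assumption-laden illustration, not the production annotator: the function name, the output field names, and the fallback wording are all hypothetical. The point is only that the classification is passed in already decided, and the narration step can fail without changing it.

```python
from typing import Callable, Dict

FALLBACK_TEMPLATE = "Classified as {cause}. No narration available."


def annotate(cause: str, narrate: Callable[[str], str]) -> Dict[str, str]:
    """Attach optional narration to a classification that was decided elsewhere."""
    text, source = "", "template"
    try:
        text = narrate(cause).strip()
        if text:
            source = "model"
    except Exception:
        text = ""  # Model unavailable or misbehaving: fall through to the template.
    if not text:
        text = FALLBACK_TEMPLATE.format(cause=cause)
    # The cause is returned untouched; narration never feeds back into it.
    return {"cause": cause, "narration": text, "narration_source": source}
```

Recording where the narration came from gives a reviewer a way to tell model prose from the blunt template later, and the cause field is identical in both cases.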

Where This Is Heading

Before the narrative layer ships to a client, it needs a short, plain statement of what the model is permitted to produce, how its output is labeled in the report, and what happens when it is wrong or unavailable. If a system needs to justify a decision months later, the justification should be found in deterministic code and recorded inputs, not in asking the model what it thinks it did. The model narrates a state that was already decided; it never gets to decide the state itself.

Prometheus Agent Mode vs Grafana Alloy: Choosing the Right Push Agent in 2026

Justyn Larry — Tue, 14 Jul 2026 12:36:22 +0000

TL;DR: If you only collect metrics, Prometheus Agent mode is lightweight, familiar, and difficult to beat. If you collect metrics, logs, or traces together, or expect to in the future, Grafana Alloy's unified pipeline is usually worth the additional complexity.

Once you've decided to move from pull-based scraping to a push architecture, the next question is which agent should actually run on each host. In 2026, the two strongest choices are Prometheus Agent mode and Grafana Alloy. I run Alloy across my production fleet, but that doesn't automatically make it the right answer for everyone.

The Shift in the Monitoring Landscape

Over the last couple of years, Grafana has consolidated both metrics and log collection into Grafana Alloy. Grafana Agent reached end of life on November 1, 2025, and Promtail followed on March 2, 2026. Neither receives security fixes anymore.

The practical choice moving forward:

Feature	Prometheus Agent	Grafana Alloy
Metrics	✅	✅
Logs	❌	✅
Traces	❌	✅
Config	Prometheus YAML	Alloy components
Footprint	Smaller	Larger
Learning curve	Low	Moderate
Future direction	Metrics agent	Unified telemetry

The table gives the short answer. The rest of this article explains where those differences actually matter in practice.

Prometheus Agent mode. Run the Prometheus binary with the --agent flag and it stops acting as a full Prometheus server. It no longer stores local TSDB blocks, evaluates alerting rules, or serves queries. Instead, it scrapes targets, buffers samples in a write-ahead log, and forwards them upstream via remote_write. It is Prometheus with the storage and query layers removed.

Grafana Alloy. A single agent that collects metrics, logs, and traces, processes them in a component pipeline, and pushes each signal to its backend. It embeds many exporters directly, so a line like prometheus.exporter.unix "node_exporter" {} gives you full node_exporter functionality without installing a separate binary.

The Case for Prometheus Agent

If you only need metrics, agent mode is hard to argue with.

The configuration is the Prometheus config you are probably already familiar with. The scrape_configs, relabeling, and service discovery are all the same. If your team is fluent in Prometheus YAML, there is nothing new to learn, and every Stack Overflow answer from the last decade is still applicable.

The resource footprint is small and predictable. Agent mode exists specifically to reduce Prometheus to one job: scrape metrics and forward them upstream. On a constrained edge box collecting a modest number of series, it is the lighter option, and it is maintained by the Prometheus project itself. If your logs already have a home, or you don't collect them at all, adding Alloy adds complexity you won't use.

Where Alloy Wins

If you want to monitor logs, the landscape changes dramatically. With agent mode you need a second agent for log shipping, and the tool most people used for that was Promtail, which is now end-of-life. You would probably end up running agent mode plus Alloy, at which point you may as well run one agent instead of two.

That consolidation is what sold me. On every host I monitor, one systemd service collects host metrics through the embedded node_exporter component, tails the journal for logs, and pushes both upstream over the same authenticated tunnel. One binary to install, one service to health-check, one config to manage per host. When I later added container metrics and disk health collection, those became new components in the same pipeline instead of new daemons.

The pipeline model streamlines the operation on the processing side too. Labels get attached at the edge before data ever leaves the client machine: every sample arrives already tagged with tenant, cluster, environment, and role, which is what makes multi-tenant isolation by label possible. That means routing, dashboards, and alerting can all rely on the same label set without additional processing upstream. Doing the equivalent in agent mode means metric relabeling rules, and applying it to logs means a second tool entirely.

Alloy has become Grafana Labs' strategic collection agent following the retirement of Grafana Agent and Promtail. It has first-class OTLP support, so when I added tracing, I was able to add the receiver as a config block instead of installing a new agent. Everything Grafana folded in from Promtail and Grafana Agent now lives here, and this is where Grafana Labs is focusing new collection features.

Although Alloy is developed by Grafana Labs, it isn't tied to Grafana Cloud. It speaks standard protocols such as Prometheus remote_write and OTLP, so it works just as well with self-hosted Prometheus, Loki, Tempo, Mimir, or other compatible backends.

Alloy's Hidden Costs

The configuration language has a real learning curve. Alloy configs are components wired together with forward_to references, in a Terraform-like syntax. I think it is genuinely better than YAML once it clicks, because the pipeline is explicit, and you can read a config top to bottom and see exactly where data flows. Until it clicks, though, small syntax details can create headaches.

Alloy also has a larger runtime footprint because it bundles a much broader telemetry pipeline, including OpenTelemetry Collector capabilities, embedded exporters, and support for multiple signal types. For metrics-only work on tiny hosts, agent mode is leaner.

Fleet management is also more complicated. Alloy configs are per-host and declarative, which is great until you have dozens of them and a label schema change means touching every one. The method I used to streamline this process was to generate configs from a template, then build a sync mechanism where hosts pull updated configs on a schedule.
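As a rough illustration of the templating idea, the sketch below renders one labeling block per host with Python's string.Template. It is not my actual generator: the template contents, the function name, and the host values are hypothetical, and the block assumes a loki.write component named default exists elsewhere in the config.

```python
from string import Template

# Alloy syntax uses braces heavily, so $-style placeholders avoid escaping them.
ADD_LABELS_TEMPLATE = Template('''\
loki.process "add_labels" {
  stage.static_labels {
    values = {
      tenant      = "$tenant",
      cluster     = "$cluster",
      environment = "$environment",
      role        = "$role",
    }
  }
  forward_to = [loki.write.default.receiver]
}
''')


def render_labels_block(host: dict) -> str:
    """Render the labeling block; a missing label raises KeyError instead of shipping blanks."""
    return ADD_LABELS_TEMPLATE.substitute(host)


# Hypothetical inventory entry for one host.
print(render_labels_block({
    "tenant": "acme",
    "cluster": "edge",
    "environment": "prod",
    "role": "web",
}))
```

With the label schema living in one template, a schema change becomes a single edit followed by a re-render, and the per-host differences stay in the inventory data. Values are inserted verbatim, so anything containing quotes would need escaping first.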

Choose the Tool That Fits

If your infrastructure is only tracking metrics and your team is already fluent in Prometheus config, running Prometheus Agent mode is probably the right choice.

If you need to track both metrics and logs or traces, or if you plan to in the future, Alloy is probably the better choice. The single-agent model pays for its learning curve quickly, especially if your business is growing and your infrastructure is expanding.

If you're already running Grafana Agent or Promtail, you don't have a choice anymore, and alloy convert will translate your existing config as a starting point. Treat the output as a draft and verify it against the running system, not the migration guide.

I went with Alloy because I ship logs alongside metrics for every host, embedding node_exporter meant one less binary on client machines, and because edge labeling is load-bearing for how we isolate tenant data. Those reasons are specific to my environment. If I only needed metrics from a handful of systems, I would probably choose Prometheus Agent mode instead. The decision isn't really Prometheus versus Alloy. It's whether you want a dedicated metrics forwarder or a unified telemetry pipeline. Once you know which problem you're solving, the choice becomes much clearer.

Migrating from node_exporter to Grafana Alloy, One Server at a Time

Justyn Larry — Wed, 08 Jul 2026 12:46:19 +0000

If you've been monitoring Linux servers for any length of time, there's a good chance node_exporter was the first thing you installed. It's lightweight, reliable, and exposes a huge amount of machine metrics for Prometheus to scrape. For years, it has been the default answer.
As your infrastructure grows, though, your monitoring stack usually grows with it. First comes log collection. Then traces. Before long you're running node_exporter, a log shipper, and maybe another telemetry agent. Each component has its own configuration, service unit, upgrade cycle, and failure modes.

Grafana Alloy changes that by consolidating those responsibilities into a single telemetry agent.
This post walks through migrating from node_exporter to Alloy on a real fleet, one server at a time, while maintaining continuous visibility throughout the process. These are the exact steps that survived contact with production on the Irin monitoring stack, not the idealized version that looks clean in a diagram.

TL;DR: If you're already running node_exporter, don't replace it overnight. Install Grafana Alloy alongside it, configure Alloy's built-in prometheus.exporter.unix component, verify that metrics are reaching your remote Prometheus instance, and only then retire node_exporter.
Migrating one server at a time minimizes risk, preserves visibility, and positions your infrastructure for logs, traces, and future telemetry without deploying additional agents.

The real difference is the direction of travel

Before getting started, it's worth understanding what actually changes.
This isn't simply replacing one monitoring agent with another.
node_exporter is a server. It listens on a port, typically 9100, and waits for Prometheus to connect and scrape metrics. That means every monitored machine needs an open endpoint, network connectivity from Prometheus, firewall rules, and scrape configurations.

Alloy flips that model around.

Instead of waiting for Prometheus to connect, Alloy collects metrics locally and pushes them to a remote endpoint using Prometheus Remote Write.

On my stack, that outbound traffic travels through a Cloudflare Tunnel. Nothing reaches into the monitored servers. There are no metrics ports exposed to the LAN, no inbound firewall rules to maintain, and no scrape network that has to remain routable. The host's metrics leave through the secure tunnel, and the monitoring stack has no inbound access to the server.

That shift is the real migration. You're not replacing a binary; you're changing the direction your telemetry flows. Once you frame it that way, the rest of the migration makes much more sense.

The component that replaces node_exporter

Alloy is configured using the Alloy configuration syntax (formerly called River), which is less like a traditional configuration file and more like a telemetry pipeline. Each component performs one task before handing data to the next component. As you begin to put your model together, the configuration becomes surprisingly readable.

The component that replaces node_exporter is prometheus.exporter.unix.
Under the hood it's using the same collector code as node_exporter, so the metrics themselves remain familiar.
A minimal configuration looks like this:

// Collect host metrics.
prometheus.exporter.unix "host" {
}

// Scrape those locally collected metrics.
prometheus.scrape "host" {
  targets    = prometheus.exporter.unix.host.targets
  forward_to = [prometheus.remote_write.default.receiver]
}

// Push metrics to Prometheus Remote Write.
prometheus.remote_write "default" {
  endpoint {
    url = "https://metrics.example.internal/api/v1/write"

    // Production deployments typically authenticate here using
    // Basic Auth, bearer tokens, or mTLS.
  }
}

Read from top to bottom, it tells a story:

The exporter gathers metrics.
The scraper collects those metrics internally.
The remote write component sends them to Prometheus.

Instead of maintaining a YAML file full of scrape targets, you're wiring components together into a pipeline. That same pattern extends naturally to logs, traces, profiling, and nearly every other telemetry signal Alloy supports.

One thing to remember during migration: the metrics themselves stay the same, but some labels—particularly the instance label—may change depending on how Alloy identifies the host.
That's expected, and it's important when you start verifying your migration.

Why migrate one server at a time?

The temptation is to deploy Alloy everywhere and immediately disable node_exporter, but the best way is to do it slowly. The safest migration pattern is a canary. Pick a non-critical server, install Alloy beside node_exporter, and let both run simultaneously while you verify that the new telemetry path is working correctly. The resource overhead is negligible, but the confidence you gain is enormous.

Running both agents briefly means you always have a known-good monitoring path while validating the new one. Only after you've confirmed that Alloy is producing fresh, accurate metrics should you retire node_exporter. Once you've done that successfully a few times, batching servers becomes much less stressful.

The migration

1. Pick a canary host

Choose a stable server where a few minutes of metric oddities wouldn't be catastrophic, but would still be noticeable.
Install Grafana Alloy using your distribution's package manager.
Most Linux distributions install Alloy as a systemd service automatically, with the primary configuration file located at:
/etc/alloy/config.alloy
Enable the service, but leave node_exporter exactly as it is.
sudo systemctl enable alloy
sudo systemctl start alloy
At this point nothing has changed from Prometheus' perspective; you’re just running two collectors concurrently.

2. Configure Alloy

Create your Alloy configuration (the syntax was formerly called River) and point the remote_write endpoint at your Prometheus receiver.
Start Alloy and verify the service is healthy.

sudo systemctl status alloy --no-pager

The --no-pager option simply prints the status and returns you to the shell instead of opening the interactive pager, making it much friendlier for automation and scripting. Alloy also exposes a built-in UI on port 12345. Opening it gives you a live view of every component in your telemetry pipeline.
If prometheus.exporter.unix and prometheus.remote_write are healthy, your data is flowing.
If remote_write is unhealthy, the problem is usually one of three things:
• Incorrect endpoint URL
• Network connectivity
• Authentication
...in roughly that order.

3. Verify the new telemetry

Now both node_exporter and Alloy are reporting metrics.
First, verify Alloy locally.

curl http://localhost:12345/metrics | head

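or check for a specific node metric. Alloy's own /metrics endpoint only reports Alloy's internal metrics; the embedded exporter is served under its component path:

curl -s http://localhost:12345/api/v0/component/prometheus.exporter.unix.host/metrics | grep node_memory_MemAvailable_bytes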

Then open Prometheus or Grafana and confirm the new series is arriving. Since Alloy pushes via remote_write instead of being scraped, the standard up metric won't reflect this host anymore — up is generated per scrape target, and there's no scrape target here. Checking it after cutover will look like the host vanished, even though everything is working.
Instead, check for freshness directly:

time() - timestamp(node_uname_info{instance="canary-host"})

A small, stable number means data is arriving on schedule. A growing number means the host has stopped pushing.
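The same freshness check can be scripted against the JSON that Prometheus returns for an instant query. This is a minimal sketch, assuming the response for timestamp(node_uname_info{instance="canary-host"}) has already been fetched from /api/v1/query and parsed; the function names and the 120-second threshold are illustrative and not part of the original setup.

```python
import time


def staleness_seconds(response, now=None):
    """Return seconds since the newest sample in an instant-query response.

    Returns None when no series came back, which usually means the host
    is not pushing at all.
    """
    now = time.time() if now is None else now
    results = response.get("data", {}).get("result", [])
    if not results:
        return None
    # Each vector element looks like {"metric": {...}, "value": [eval_time, "string"]}.
    # With timestamp(), the string value is the sample's own timestamp.
    newest = max(float(item["value"][1]) for item in results)
    return now - newest


def is_fresh(response, max_age=120.0, now=None):
    """True if the newest sample is at most max_age seconds old (assumed threshold)."""
    age = staleness_seconds(response, now=now)
    return age is not None and age <= max_age


sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "canary-host"}, "value": [1000.0, "985"]}
        ],
    },
}
print(staleness_seconds(sample, now=1000.0))  # 15.0
```

Treating an empty result as "not fresh" matters here: a host that has never pushed returns no series at all rather than a large number.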
Then compare a real metric such as available memory:

node_memory_MemAvailable_bytes

Don't expect the labels to match perfectly — the important part is that the values agree and the new series continues updating.
If your dashboards, recording rules, or alerting rules explicitly reference labels such as job="node_exporter" — or reference up for these hosts — now is a good time to identify them before removing the old exporter.
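One way to find those references before the cutover is to scan rule and dashboard files for the two patterns that tend to break. The sketch below is a rough text search, not a PromQL parser; the patterns and the function name are assumptions to adapt to your own labels.

```python
import re

# Patterns that are likely to break after the cutover (assumed; adjust as needed).
STALE_PATTERNS = [
    re.compile(r'job\s*=~?\s*"node_exporter"'),  # explicit job matcher
    re.compile(r'\bup\s*(\{|[=!]=)'),            # up{...} or up == 0 / up != 1
]


def find_stale_references(text):
    """Return (line_number, stripped_line) for every line that matches a pattern."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in STALE_PATTERNS):
            hits.append((lineno, line.strip()))
    return hits


rules = '''groups:
  - name: hosts
    rules:
      - alert: HostDown
        expr: up{job="node_exporter"} == 0
      - alert: LowMemory
        expr: node_memory_MemAvailable_bytes < 1e9
'''

for lineno, line in find_stale_references(rules):
    print(lineno, line)
```

Anything it flags still needs a human decision: rewrite the matcher for the new job label, or replace an up-based alert with a freshness check.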

4. Retire node_exporter

Once you're confident Alloy is working correctly, stop the old exporter.

sudo systemctl stop node_exporter
sudo systemctl disable node_exporter

stop ends the running process.
disable prevents it from quietly returning after the next reboot.
Then remove the server's scrape job from Prometheus and reload the configuration.

curl -X POST http://localhost:9090/-/reload

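Reloading avoids interrupting metric collection for every other host while updating the configuration. The reload endpoint only responds if Prometheus was started with --web.enable-lifecycle; otherwise, send the Prometheus process a SIGHUP to get the same effect.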

5. Watch the canary

Give the server at least one scrape interval under the new path.
Verify that:
• Metrics remain fresh.
• Alerts continue behaving normally.
• Dashboards still populate.
• No recording rules broke because of label changes.

Once everything looks healthy, repeat the process on the next server.
After a handful of successful migrations, you'll have enough confidence to migrate small batches instead of individual hosts.

Common migration mistakes

There are a few issues that show up repeatedly during migrations:
• Removing node_exporter before verifying Alloy.
• Forgetting to reload Prometheus after removing scrape targets.
• Alert rules or dashboards still referencing the old job label.
• Firewall rules blocking outbound Remote Write traffic.
• Assuming label changes won't affect existing dashboards.

None of these are difficult to fix, but catching them during a canary migration is far less stressful than discovering them after migrating twenty servers.

What you gain

The obvious benefit is consolidation.
Instead of deploying separate agents for metrics, logs, and traces, Alloy provides a single telemetry pipeline that grows with your infrastructure.
If you need logs, you can add a Loki component; for OTLP traces, add another component. The overall architecture doesn't change.
The less obvious benefit is security. Once every monitored machine pushes telemetry outward, you can close your metrics ports entirely. There's no longer a listening endpoint on every server, no scrape network to maintain, and no inbound firewall rule whose only purpose is monitoring.

For anyone managing customer infrastructure—or simply trying to reduce attack surface—that's arguably the biggest improvement Alloy brings.

Final thoughts

Five years ago, exposing port 9100 across an internal network wasn't unusual, but today we're steadily moving toward zero-trust networking, outbound-only connectivity, and centralized telemetry pipelines.
Grafana Alloy isn't compelling because it replaces node_exporter. It's compelling because it aligns your monitoring architecture with where modern infrastructure is already heading, and strengthens your security posture by removing exposed endpoints.

Take the migration slowly.
Run both agents for a while.
Prove the new telemetry path.
Then remove the old one.

Done this way, there should never be a moment when a server isn't being watched.

The LLM narrates. The code decides.

Justyn Larry — Tue, 07 Jul 2026 12:51:40 +0000

Most of the "AI for observability" work I see right now hands the language model the judgment: feed it the alert, feed it some metrics, ask it what's wrong and what should be done, and let it make the call. I think that's backwards. Based on my experience working with language models, I decided that inverting the process provides better results.
The short version: in my alerting pipeline, the set of allowable classifications is fixed in deterministic Python, and the model has to pick from it. The LLM's only job is to turn a structured verdict into an easily digestible sentence. It never decides whether something is bad, how bad it is, or what category of problem it is. It narrates within a decision space the code has already locked down.

TL;DR: Instead of letting an LLM decide what's wrong with an alert, I let deterministic Python make every operational decision and restrict the model to explaining the result in plain English. The code classifies, validates, and aggregates; the LLM only narrates. That keeps the data consistent, prevents hallucinated classifications, and ensures the monitoring pipeline continues working even if the model fails.

The problem

I run a small managed monitoring service. Alertmanager fires, a webhook lands, and historically that webhook produced a line like HighMemoryUsage on host web-vm, severity warning, which is accurate, but not terribly helpful. The person reading it still has to know what HighMemoryUsage implies, whether this host always runs hot, and whether to care. I wanted plain-English context attached to the alerts without altering the alert delivery process.

The obvious move was to throw the whole alert at an LLM and ask it to explain. I tried that in the first iteration of this experiment, expecting it to be somewhat accurate but not entirely reliable, and it did not disappoint: the model was confidently inconsistent. The same alert, fired three times, produced three different "root cause" categories. One run called a test alert a "Configuration or setup issue," the next called it "Configuration/Testing," the next something else again. If you're storing that output to do any kind of aggregation later (I am, I want to know when three different clients hit the same class of problem in the same week), free-form model output fragments into noise. Grouping on a field the model changes at random won't work.

I kind of knew from the outset that I wouldn't get amazing results, and that it would be harder than it looked. So I started doing some research, and decided to flip the design. The little voice in the back of my head was right all along: don't let the model make the decisions.

The Split

I created a pipeline that has a hard wall down the middle.
On the deterministic side, Python does the classifying. The output is constrained to an eight-value enum: memory_pressure, cpu_saturation, disk_pressure, service_unavailability, network_issue, configuration_error, external_dependency, unknown. I aggregate on that field because it can only ever be one of eight strings. If nothing fits, the answer is unknown, which is itself a useful signal rather than a hallucinated/variant guess.
On the narrative side, the LLM (llama3:8b, running locally on a box on my own LAN, so no alert data leaves the network) must choose its classification from that fixed eight-value set, and it writes two short fields alongside it: what the alert is, in plain English, and what it means operationally. The code defines the shape of the answer; the model only fills in a slot that already exists. It is explicitly instructed not to suggest fixes and not to invent a cause, so it performs translation instead of analysis.
The model returns strict, grammar-constrained JSON, so I get {what, means, likely_cause_class} every time, and the enum value is validated against the allowed set on the way out. If the model returns something off-list, I record it as a bug instead of storing a row.
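The validation step can be sketched in a few lines. This is an illustration under assumptions, not the pipeline's actual code: the eight enum values and the three field names come from the description above, while the class name, function name, and error handling are hypothetical.

```python
import json
from enum import Enum


class CauseClass(str, Enum):
    """The fixed set of values that is safe to aggregate on."""
    MEMORY_PRESSURE = "memory_pressure"
    CPU_SATURATION = "cpu_saturation"
    DISK_PRESSURE = "disk_pressure"
    SERVICE_UNAVAILABILITY = "service_unavailability"
    NETWORK_ISSUE = "network_issue"
    CONFIGURATION_ERROR = "configuration_error"
    EXTERNAL_DEPENDENCY = "external_dependency"
    UNKNOWN = "unknown"


REQUIRED_FIELDS = ("what", "means", "likely_cause_class")


def parse_annotation(raw):
    """Parse model output; raise ValueError on anything off-contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("model output must be a JSON object")
    missing = [field for field in REQUIRED_FIELDS if field not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    # Looking an Enum up by value raises ValueError for off-list strings.
    data["likely_cause_class"] = CauseClass(data["likely_cause_class"])
    return data


good = '{"what": "Memory is high", "means": "Host may swap", "likely_cause_class": "memory_pressure"}'
print(parse_annotation(good)["likely_cause_class"].value)  # memory_pressure
```

Because the lookup raises on any unlisted string, a variant like "Configuration/Testing" never reaches storage; the caller decides whether to log it, retry, or fall back.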

Context hydration

A naive version of even the narration step gets you alarmist prose. When tuning the system I received a DiskFillPredicted alert, which on its face looked worth investigating. Then I looked at the host in question, which has had a flat disk-utilization baseline for months. The prediction was a rounding artifact, so "your disk is about to fill" would have been actively misleading. The model had no way to know that from the alert alone, so it just wrote something.

I fixed it by giving the model the same context a human would look at before reacting. Prior to the LLM call, Python does a fast lookup against the metrics backend for that host's recent baseline, and the prompt carries an explicit rule: if historical context is provided, weigh it over the alert's literal text. A predicted-disk-fill on a host with a stable months-long baseline is informational, not urgent.

The latency budget for that lookup is five seconds. The LLM call itself takes about eighty seconds, because this runs deliberately on modest CPU-only hardware, a 4th generation Intel i7 with 16GB RAM and no GPU. That is a choice, not a constraint I am apologizing for: the whole posture of the service is that nothing leaves my LAN, so a slow local model beats a fast remote one. And the eighty seconds never reaches the person being alerted. Because the annotator rides alongside the existing path (more on that below), the raw alert lands in Slack and email instantly; the narrated version shows up as a separate annotation a minute or so later. Nothing is ever waiting on the model. Against that eighty-second call, five seconds of pre-fetch is under seven percent overhead and invisible. What mattered most was that if the metrics backend is slow or unreachable, the system fails immediately and falls back to the un-hydrated path, so the enrichment step can never block the result.
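The budgeted lookup and its fallback can be sketched with the standard library. This is an assumed shape, not the author's implementation: fetch_with_budget, build_prompt, and the stand-in lookup are hypothetical names, and a real lookup would query the metrics backend.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_with_budget(fetch, budget_seconds=5.0):
    """Run fetch() but give up after budget_seconds; None means 'no context'."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fetch)
    try:
        return future.result(timeout=budget_seconds)
    except Exception:
        # Timeout, connection error, bad response: all degrade the same way.
        return None
    finally:
        # Do not wait for a slow lookup to finish before moving on.
        executor.shutdown(wait=False)


def build_prompt(alert_text, baseline):
    """Attach the weighting rule only when historical context actually exists."""
    lines = [f"Alert: {alert_text}"]
    if baseline is not None:
        lines.append(f"Historical context: {baseline}")
        lines.append("If historical context is provided, weigh it over the alert's literal text.")
    return "\n".join(lines)


def slow_lookup():
    """Stand-in for an unresponsive metrics backend."""
    time.sleep(0.2)
    return "disk utilization flat for 90 days"


baseline = fetch_with_budget(slow_lookup, budget_seconds=0.05)
print(build_prompt("DiskFillPredicted on web-vm", baseline))
```

When the lookup misses its budget the prompt simply omits the context lines, which is the un-hydrated path described above.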

Fail-Open

That fallback instinct runs through the whole process. The ethos is that the annotator is additive, a 'nice-to-have.' Alertmanager routes to it with continue: true, so it sits alongside the existing Slack and email delivery processes and can never block them. The webhook always returns 200, even when the LLM box is down or the JSON is malformed; in those cases a static fallback annotation is used instead. The worst case is that the end-user gets a less flowery alert, but never misses one. The narration is an amenity layered on top of a delivery path that doesn't depend on it at all.
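The always-answer-200 behavior reduces to one try/except around everything that can go wrong. A minimal sketch, assuming a narrate callable that wraps the LLM call; the handler name and the fallback wording are hypothetical.

```python
import json

# Static annotation used whenever narration is unavailable (wording is illustrative).
FALLBACK_ANNOTATION = {
    "what": "Alert received.",
    "means": "No narration is available; refer to the raw alert.",
    "likely_cause_class": "unknown",
}


def handle_alert_webhook(body, narrate):
    """Return (http_status, annotation). The status is always 200."""
    try:
        alert = json.loads(body)
        annotation = narrate(alert)
        if not isinstance(annotation, dict):
            raise ValueError("narrator returned a non-object")
    except Exception:
        # LLM box down, malformed JSON, bad model output: same static result.
        annotation = dict(FALLBACK_ANNOTATION)
    return 200, annotation


def broken_narrator(alert):
    raise ConnectionError("LLM host unreachable")


print(handle_alert_webhook('{"alertname": "HighMemoryUsage"}', broken_narrator))
```

Returning 200 in every branch keeps Alertmanager from retrying or reporting delivery failures for a component that is meant to be optional.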

Advice, For What it's Worth

"Let the model narrate, not analyze" is the easy version of the lesson, and it's true, but it isn't the hard part. The hard part is the enforcement: constraining the output to a fixed set you can validate, and keeping the model off the critical path so its failures cost you prose and never an alert. The model's value is fluency, not judgment, and fluency is the most replaceable thing in the stack. Push every actual decision into code you can test, constrain anything you'll later aggregate down to an enum, and treat the model as the last, most replaceable stage in the pipeline. If you can swap the model out tomorrow and your data stays clean, you've drawn the line in the right place.
Mine's been running against live infrastructure for a few weeks. The prose is good, and I'm sure it will be better once the hardware is upgraded and a more powerful model is in place, but the reason I (tentatively) trust it is that the prose isn't doing the work.

Stop Relying Entirely on Uptime Kuma for Incident Response

Justyn Larry — Thu, 25 Jun 2026 14:32:45 +0000

Before I get into this, it is not a knock on Uptime Kuma. It's a genuinely amazing, easy-to-use piece of software. If you run a homelab or a small fleet and you're not using it, you probably should be. It's free, self-hosted, beautiful, and it does the thing it was built to do better than almost anything else at any price.

There's always a "but," though, so before we get to it I want to spend a little time on what Uptime Kuma does well.

TL;DR:

Uptime Kuma is excellent at telling you when a service becomes unreachable, but it cannot explain why a service is slow or unhealthy while still responding. That requires internal metrics from tools like Prometheus, Grafana, and Alloy. Reachability monitoring and systems monitoring solve different problems, and mature environments typically use both.

Where Uptime Kuma excels

Uptime Kuma answers one question extremely well: is this reachable? It'll ping a host, hit an HTTP endpoint and check the status code, watch a TCP port, validate a TLS cert's expiry, query a DNS record, check a keyword on a page, watch a Docker container, even poke a game server. It checks on a tight interval, shows you a clean history, and when something stops responding it fires a notification through basically any channel you can name. Ninety-plus notification integrations. Status pages you can hand to your users. Two-factor auth. A genuinely nice UI.

For "tell me the moment my website, my reverse proxy, my Plex, or my Home Assistant stops answering," it's close to perfect. The interval is short, setup is measured in minutes, and there's practically no maintenance. It has earned a famously loyal userbase for a reason.

I'm not here to tell you it isn't the answer, or to convince you to ditch it for something else. I still think everyone running infrastructure of any size should have something like it watching their endpoints. I have it running in a Proxmox container on my own homelab. But there's a gap I noticed while using it, and this post is about the specific moment when you ask Uptime Kuma a question it wasn't designed to answer, and what you do when that moment arrives.

Growing pains

There usually comes a time, as your homelab or business grows, when your database starts feeling slow. Queries that used to be instant are taking a little longer. It's not a real problem yet, but you can tell something is off.

The Uptime Kuma dashboard is all green. The database port is answering, the HTTP healthcheck returns 200, every light on the board is on. Uptime Kuma is correctly reporting that the service is up.

And it's not wrong. That's the thing. The service is reachable. But "reachable" and "healthy" mean different things, and you've just walked into the space between them. If the disk that database lives on is pinned at 100% IO utilization because a backup job and a big query are fighting over it, your queries are queuing behind that contention, and from the outside the port still answers in time to pass the check. The board is green, the database is slow, and there's no contradiction.

Uptime Kuma doesn't see any of that, and the reason it can't isn't a missing feature, it's the architecture. It checks your systems from the outside looking in. It has no way to see what's happening inside your servers. What are the disk, memory, CPU, and kernel actually doing?

What you need at that moment is something standing inside the box, reading the system from within. That's a different category of tool.

Reachability versus internals

There are two kinds of monitoring, and once you see the split you understand why mature setups end up running both.

Reachability monitoring (Uptime Kuma) asks the basic questions:

Can I get to it?
Is the port open, the page loading, the cert valid, the container running?

It reports what it can see from the outside, which is exactly what you want for "is my service up and can my users reach it." It's easy, simple, and honest about what it knows.

Systems monitoring (the Prometheus world) asks questions that are a little more involved:

What's going on inside the machine?
How busy is each CPU core?
How much memory is actually available once you account for cache?
What's the disk IO utilization, the queue depth, the read and write latency?
How much network throughput, how many dropped packets?
Is memory slowly leaking over days?

It's an internal view, and it answers why a service is behaving the way it is.

Neither replaces the other. Reachability tells you that something is wrong. Systems metrics tell you why. The database scenario above needs both: Uptime Kuma to eventually notice if the slowness becomes an actual outage, and system metrics to explain the slowness long before it gets there.

The internal view

The standard way to get the inside view on Linux is a tiny agent called node_exporter. It's a small binary that runs on the box, reads metrics straight from the kernel, and exposes them for a time-series database (Prometheus) to collect. Pair it with Grafana for dashboards, and for logs, pair Loki with a shipper. The traditional choice there was Promtail, though Promtail reached end of life in March 2026 and Grafana now steers you toward Grafana Alloy, which handles both metrics and logs in a single agent. (I wrote a comparison of those two separately.)

With either node_exporter or Alloy running, the database scenario stops being a mystery. The exact moment things felt slow, you can pull up:

Disk IO utilization on that box, and watch it pin to 100% right when the slowness started.
The specific disk and the read/write split, so you can see it was the backup volume contending with queries.
CPU broken out by mode, so you can rule out CPU as the cause.
Memory availability over the past week, so you can see whether pressure had been building.

And if you have Loki collecting logs alongside the metrics, you can line up the disk IO spike against the log line where the backup job kicked off, and the whole story assembles itself in one view. Uptime Kuma told you the service was up. The system metrics tell you the backup job is strangling your database disk, which is what you actually need to know to fix it before it hits production.
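The "pinned at 100%" figure comes from a counter rather than a gauge. node_exporter exposes node_disk_io_time_seconds_total, the cumulative seconds a device has spent doing I/O, and utilization is how fast that counter grows. The sketch below works the arithmetic on two samples; the function name and the sample values are made up for illustration.

```python
def io_utilization(prev, curr):
    """Fraction of wall-clock time the disk was busy between two samples.

    prev and curr are (unix_time, node_disk_io_time_seconds_total) pairs.
    Returns None if the counter went backwards (for example after a reboot).
    """
    (t0, busy0), (t1, busy1) = prev, curr
    elapsed = t1 - t0
    if elapsed <= 0:
        raise ValueError("samples must be in time order")
    if busy1 < busy0:
        return None
    # Busy seconds per elapsed second; cap at 1.0 to absorb timing jitter.
    return min((busy1 - busy0) / elapsed, 1.0)


# Quiet period: 30 busy seconds in a 60 second window.
print(io_utilization((0, 100.0), (60, 130.0)))  # 0.5
# Backup and queries contending: busy for the entire window.
print(io_utilization((60, 130.0), (120, 190.0)))  # 1.0
```

In a dashboard the same calculation is normally expressed as a rate over that counter; the point is only that a value of 1.0 means the device had no idle time in the window.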

(If the PromQL behind those dashboards is unfamiliar, I wrote up the five queries you actually need to monitor a Linux server separately.)

The honest part: this is more work

Standing up a stack to see inside your servers is not as simple as setting up Uptime Kuma, and that simplicity is a real part of why Uptime Kuma is so loved. Moving to system metrics comes with a cost.

node_exporter or Alloy goes on every server, with Prometheus running somewhere to collect from them. Grafana dashboards have to be built, or imported from the community and then tweaked until they're readable instead of overwhelming. Alert rules have to be written to fire on real problems without crying wolf. Metric and log retention have to be configured. And then the whole thing needs ongoing maintenance.

This is the irony nobody warns you about: you now have a monitoring stack that itself needs monitoring, which is partly why you want predictive disk alerts on the box running Prometheus.

None of it is hard, exactly. But it's an ongoing process with no end, and it's a different commitment than the near-zero maintenance of an Uptime Kuma container you set up once and edit when new services come online. Uptime Kuma is the right tool for reachability and status pages, and it costs almost nothing to run. System metrics cost more, but they become relevant the moment you start wondering why services aren't behaving, even though they're still showing green.

So what do you do with this

The real takeaway here isn't a product, it's the distinction, because that understanding outlives any particular tool. Outside-in tells you something broke. Inside-out tells you why. Uptime Kuma is one of the best outside-in tools ever made, and it'll happily keep doing that job for you forever. It just wasn't built to explain the why.

When you do need the why, you've got two options: run the inside-view stack yourself (node_exporter or Alloy, Prometheus, Grafana, Loki), which is completely viable and a great way to learn, or hand it to someone who runs it for you.

For full disclosure, that second path is the reason I built Irin Observability, a managed version of that inside-view stack for small teams and homelabs that have outgrown pure reachability checks but don't want a second full-time job maintaining a metrics pipeline. It's meant to sit alongside something like Uptime Kuma, not replace it, because reachability and internals are different jobs and the mature answer is to run both.

Either way, keep the green board. Just add the view from inside the box when you start asking why.

You Don't Need Kubernetes to Monitor 20 Linux VMs

Justyn Larry — Tue, 23 Jun 2026 12:42:27 +0000

If you've ever tried to set up Prometheus by following the official getting-started path, you've likely found that it doesn't match your infrastructure. Out of the gate, page one mentions kube-prometheus-stack. Page two wants you to install a Helm chart, and page three assumes you already have a cluster running. The documentation for monitoring plain Linux servers is in there somewhere, but you have to dig for it. When you do find it, the tone suggests you are doing something slightly old-fashioned.

If that sounds like your setup, the tooling is making this harder than it actually is. Monitoring a fleet of Linux VMs is fairly simple and has been for years. It is just obscured behind documentation that would prefer to sell you something bigger.

Modern infrastructure tooling has quietly decided everyone runs Kubernetes. If you don't, the assumption is that you eventually will. Meanwhile, most real-world infrastructure still runs on VMs.

TL;DR: Modern observability documentation often assumes you're running Kubernetes. Most small teams aren't. If you're managing a fleet of Linux VMs, node_exporter plus Prometheus gives you everything you need for infrastructure monitoring with a single lightweight agent and a straightforward deployment model. No cluster required.

VMs are often the answer

For most small businesses, running VMs instead of Kubernetes does not mean you failed to evolve. Most workloads under a certain scale perform better on VMs:

One process per box, predictable resource limits, and the ability to ssh in and look at what's happening, which makes it easier to keep track of the infrastructure as a whole.
They're cheaper, both financially and in the mental overhead of running them.
Backups and snapshots are straightforward in a way stateful Kubernetes still isn't.
There's no control plane that itself needs monitoring and upgrades and care.

Kubernetes solves problems that mostly pertain to companies with dozens of engineers and hundreds of services. For platforms that consist of 20 VMs, Kubernetes is the wrong tool, and being told you need it before you're allowed to have monitoring is the wrong approach.

What node_exporter actually is

What you need is called node_exporter, a lightweight process that typically runs as a systemd service.

It's a single Go binary, around 25 MB. It runs as one process on each VM, reads metrics from the kernel through /proc and /sys, and exposes them on an HTTP endpoint, normally port 9100. It's very uncomplicated: there's no daemon set, operator, sidecar, CRD, cluster, or control plane. It runs quietly in the background and answers HTTP on port 9100 with a plain-text list of numbers. You can curl it yourself and read it:

curl http://<localhost or IP>:9100/metrics

What comes back is a few hundred lines of metrics containing CPU time per core per mode, memory broken down by category, disk space per mountpoint, network bytes per interface, load, uptime, and open file handles. It tells you everything the kernel knows about the server, in a format Prometheus reads directly.
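To show how plain that format is, here is a small parser for the common case. It is a simplified sketch, not a full implementation of the exposition format: it ignores optional trailing timestamps and would mis-split a label value that contains a closing brace. The function name and the sample text are illustrative.

```python
import re

# metric_name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels_dict, float_value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


text = '''# HELP node_memory_MemAvailable_bytes Memory information field MemAvailable_bytes.
# TYPE node_memory_MemAvailable_bytes gauge
node_memory_MemAvailable_bytes 8.123456e+09
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67
'''

for sample in parse_metrics(text):
    print(sample)
```

For anything beyond a quick look, a maintained client library is the better choice; this only demonstrates that each line is a name, optional labels, and a number.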

The agent the big observability vendors want to install on your servers is doing this same job. It reads from /proc and exposes metrics, but they've wrapped it in a config model and an update mechanism and a logo. The core of it is what node_exporter has been doing for over a decade. You are not missing out on some sophisticated technology by over-complicating your system. The simple, plain version is the technology.

Setting up one VM

Here's the actual setup on a single box. Check the releases page for the current version before you run this, because the version string changes.

# Download the binary
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz

# Extract and install
tar xzf node_exporter-1.8.2.linux-amd64.tar.gz
sudo mv node_exporter-1.8.2.linux-amd64/node_exporter /usr/local/bin/

# Run it as its own unprivileged user
sudo useradd --no-create-home --shell /bin/false node_exporter

# systemd unit
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<'EOF'
[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

# Start it
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter

# Confirm it's alive
curl http://localhost:9100/metrics | head

With these eight commands, you can have it running in under five minutes. It sits at roughly 20 MB of RAM and you'll likely forget it's there. One thing you should do is lock down port 9100. Leave it open to your monitoring server and nothing else. node_exporter exposes details about your system and it shouldn't be reachable from the public internet. It should be behind your firewall.

It is a little repetitive

The same setup runs on every machine, so there are a few ways to deploy it if you have more than 5 to 10 servers to monitor. The setup is the same for almost all Linux distributions.

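If you're already using Ansible, the node_exporter playbook is about 30 lines and is one of the most copy-pasted snippets out there. The cloudalchemy.node_exporter role does it for you with reasonable defaults if you'd rather not write your own, though that repository has been archived and its maintained successor is the node_exporter role in the prometheus.prometheus collection.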

You can also use a shell loop over ssh if you don't want to add new tooling. Walk your hostnames, ssh in, run the commands above. Twenty boxes will probably take around ten minutes.

If you spin servers up and down often using a VM image or cloud-init, you can just include node_exporter in the base image. Every new VM will show up already monitoring itself.

The monitoring side is one Prometheus instance pointed at the list of servers you want to monitor:

# prometheus/prometheus.yml
scrape_configs:
  - job_name: 'linux-vms'
    static_configs:
      - targets:
          - vm1.example.com:9100
          - vm2.example.com:9100
          - vm3.example.com:9100
          # ...the rest of them
        labels:
          environment: production

For 20 boxes, that static list is genuinely fine. If you add and remove servers a lot, file_sd_configs lets Prometheus pick up target changes from a file without a restart, which carries you much further. The setup isn't too much more complicated:

# prometheus/prometheus.yml
scrape_configs:
  - job_name: 'linux-vms'
    file_sd_configs:
      - files:
          - /etc/prometheus/file_sd/linux-vms.yml
        refresh_interval: 30s

The file structure requires that you add a file_sd directory to the prometheus folder:

prometheus/
├── prometheus.yml
└── file_sd/
    └── linux-vms.yml

# file_sd/linux-vms.yml
- targets:
    - vm1.example.com:9100
    - vm2.example.com:9100
  labels:
    environment: production
    role: web

- targets:
    - db1.example.com:9100
  labels:
    environment: production
    role: database

- targets:
    - staging1.example.com:9100
  labels:
    environment: staging
    role: web

If you put each server directly into prometheus.yml, you have to reload or restart Prometheus every time you add one. By putting your servers in the file under file_sd, Prometheus picks up changes automatically: it watches the file and also re-reads it on the refresh interval. That's a little extra structure up front, so if your infrastructure is largely static it isn't really worth it. If you're constantly onboarding or removing servers, the extra layer removes a lot of the maintenance.
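If the target list comes from an inventory, the file can be generated instead of hand-edited. file_sd_configs also accepts JSON, so the sketch below writes a linux-vms.json using only the standard library; the files entry in prometheus.yml would need to point at that name. The inventory tuples and function names are hypothetical.

```python
import json
import os
import tempfile


def build_groups(inventory, port=9100):
    """Group (host, environment, role) tuples into file_sd target groups."""
    grouped = {}
    for host, environment, role in inventory:
        grouped.setdefault((environment, role), []).append(f"{host}:{port}")
    return [
        {"targets": sorted(targets), "labels": {"environment": env, "role": role}}
        for (env, role), targets in sorted(grouped.items())
    ]


def write_file_sd(path, groups):
    """Write the target file atomically so Prometheus never reads a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(groups, handle, indent=2)
        os.chmod(tmp_path, 0o644)  # mkstemp defaults to 0600; Prometheus must be able to read it
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


inventory = [
    ("vm1.example.com", "production", "web"),
    ("db1.example.com", "production", "database"),
    ("vm2.example.com", "production", "web"),
]
print(json.dumps(build_groups(inventory), indent=2))
```

Writing to a temporary file and renaming it avoids the window where Prometheus could notice the change while the file is only half written.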

What you can actually see

With node_exporter on every VM and one Prometheus pulling from them, here are real questions you can answer:

CPU across the whole fleet for the last hour: one query over node_cpu_seconds_total, split by instance.
Which box is closest to full: node_filesystem_avail_bytes against node_filesystem_size_bytes.
When vm7 last rebooted: node_boot_time_seconds.
Which box is dropping the most packets: a rate over node_network_receive_drop_total.
Whether memory has been slowly tightening on anything over the past week: node_memory_MemAvailable_bytes plotted across all instances.
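As a worked example of the "closest to full" question, the arithmetic behind it is one subtraction per filesystem. The sketch assumes the two metric values have already been read for each mountpoint; the function name and byte counts are invented for illustration.

```python
def fullest_filesystems(samples, top=3):
    """Rank filesystems by used fraction, fullest first.

    samples is a list of (instance, mountpoint, avail_bytes, size_bytes),
    taken from node_filesystem_avail_bytes and node_filesystem_size_bytes.
    """
    ranked = []
    for instance, mountpoint, avail, size in samples:
        if size <= 0:
            continue  # skip pseudo filesystems that report no size
        used_fraction = 1.0 - (avail / size)
        ranked.append((used_fraction, instance, mountpoint))
    ranked.sort(reverse=True)
    return ranked[:top]


GIB = 1024 ** 3
samples = [
    ("vm1", "/", 64 * GIB, 128 * GIB),
    ("vm7", "/", 16 * GIB, 128 * GIB),
    ("db1", "/var", 32 * GIB, 128 * GIB),
]
print(fullest_filesystems(samples))
# [(0.875, 'vm7', '/'), (0.75, 'db1', '/var'), (0.5, 'vm1', '/')]
```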

Everything can be viewed in Grafana using queries written in PromQL. I wrote up the five basic queries you need to monitor a Linux server separately, with each one explained in detail.

That covers what a small fleet typically needs. Monitoring doesn't require Kubernetes, or giant vendors like Datadog, or agent vendors. A Go binary on each box and one instance of Prometheus and Grafana.

Maintenance costs

Getting node_exporter onto 20 VMs and setting up Prometheus and Grafana is relatively easy. It's all open source and available to anyone. But most teams underestimate dashboard design, alert tuning, retention planning, and long-term maintenance. Making sure Prometheus stays healthy and the prometheus.yml and file_sd/*.yml files are all up to date, building functional dashboards, writing alert rules that fire on real problems without creating noise, sorting out retention, getting alerts somewhere a human will actually see them, and keeping all of it patched as each piece ships new versions: that becomes ongoing operational work somebody has to own. All of it grows in complexity with the fleet. On top of that, the monitoring stack itself can go down, which takes time and effort to troubleshoot and fix.

If you like that sort of work, or you have dedicated people who can take on the additional load, node_exporter, Prometheus, and Grafana are excellent. If you have the money to spend, Datadog is a great company.

Where Irin comes in

Because maintaining the monitoring stack is a burden most small businesses don't have the time or resources for, I built Irin Observability. You keep your attention on running your business and keep an eye on it through dashboards and alerts that are already built and tuned. Instead of node_exporter, Irin uses Grafana Alloy as the agent. It covers the same infrastructure metrics, ships your logs, supports additional telemetry pipelines, and installs with a single bootstrap command. Instead of a pull-based model that requires you to open a port to your monitoring server, it pushes your data out through an encrypted Cloudflare tunnel. Your dashboards, alerts, and retention live on Irin's infrastructure. The only thing on your boxes is the agent, and it stays out of the way.

The pitch really isn't the point, though, and I'm only scratching the surface of what node_exporter or Alloy can do. The point is that the docs may be telling you a story that isn't true for your situation. You do not need Kubernetes to watch a handful of Linux servers. You need a small binary on each box and something to scrape it. Run that something yourself or pay someone to run it, either is fine. The architecture underneath is simple no matter who operates it, and it's been sitting in plain sight the whole time under a pile of cloud-native marketing.

The Only 5 PromQL Queries You Really Need to Monitor a Linux Server

Justyn Larry — Tue, 16 Jun 2026 14:45:14 +0000

PromQL has its quirks, and can be difficult, but basic monitoring of a Linux server is not. I’ve boiled it down to five queries that will give you the basic outline of how your system is performing. This article discusses the queries for CPU, memory, disk space, disk IO, and network, with a plain explanation of how each one works.

PromQL has a reputation for being intimidating, and the reputation is half-earned. The full language is genuinely deep, with subtleties around ranges, rates, and vector matching that take a while to learn and understand. What nobody tells you when you are starting out is that monitoring a single Linux box well does not require a comprehensive grasp of the language. It requires about five questions, asked correctly.

TL;DR

You don’t need hundreds of metrics to monitor a Linux server effectively. Five PromQL queries covering CPU, memory, disk space, disk IO, and network traffic will catch the most common server issues. This article explains each query, how it works, and why it belongs on your dashboard.
These queries work with both node_exporter and Grafana Alloy and are commonly used in Grafana dashboards, Prometheus alert rules, and Linux server monitoring setups. If you're looking for practical PromQL examples rather than a full PromQL tutorial, start here.

Quick Reference:

These are the exact PromQL queries used to monitor CPU usage, memory utilization, disk space, disk IO, and network throughput on Linux servers running node_exporter or Grafana Alloy.

CPU Usage

100 - (avg by (instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Memory Usage

100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))

Disk Space

100 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} * 100)

Disk IO Saturation

rate(node_disk_io_time_seconds_total[5m])

Network Throughput

rate(node_network_receive_bytes_total{device!~"lo|veth.*"}[5m])

This article assumes you have node_exporter or Grafana Alloy running and Prometheus scraping it. Alloy’s metrics are identical to node_exporter’s because Alloy’s prometheus.exporter.unix component is node_exporter under the hood, so every query below works for both. If you are still deciding between the two or would like to learn more, we wrote a separate comparison of Alloy and node_exporter that covers when each makes sense; it can be found here.

Before we dive into the queries, it’s important to point out the difference between a gauge and a counter on the Grafana dashboard. A gauge is a value that goes up and down, like memory in use right now or CPU temperature. It shows you what’s happening now, and you read a gauge directly. A counter only ever goes up, like total bytes received since boot or total seconds the CPU has spent working. It’s a count over time. You almost never read a counter directly, because "847 billion bytes since the machine booted" is useless. The relevant question to ask yourself when looking at counters is: how fast is it climbing? That’s what rate() tells you. Three of the five queries discussed below use counters, and once you see why they all use rate(), the pattern becomes clear and PromQL starts making a little more sense.

1. CPU usage (percent busy)

100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

When I created my first Grafana dashboard, I expected this to be the easiest query to write. I think most people expect it to be simple and, like me, are confused when it’s not.

node_exporter does not expose a "CPU percent" metric, because there is no honest single number for it. What it exposes is node_cpu_seconds_total, a counter that tracks how many seconds each CPU core has spent in each mode: idle, user, system, iowait, and a few others. The machine is always doing one of these, so the modes always add up to 100 percent of available CPU time.
The cleanest way to ask "how busy is the CPU" is to measure how much it is not idle, so we work from the idle mode and subtract from 100. Reading the query from the inside out:

node_cpu_seconds_total{mode="idle"} selects just the idle counter, for every core.
rate(...[5m]) is the key piece. It looks at how that counter changed over the last 5 minutes and returns a per-second rate. For the idle counter, the rate is "idle seconds accumulated per second," which is a number between 0 and 1 per core: 1.0 means a core was fully idle, 0.0 means fully busy.
avg by (instance) averages that across all the cores on the machine, so a 4-core box gives you one number instead of four.
* 100 turns the 0-to-1 fraction into a percentage, and 100 - (...) flips "percent idle" into "percent busy."

The [5m] window is a smoothing choice, not a magic number. A wider window like [5m] smooths out brief spikes and shows the server’s sustained load. A narrower window like [1m] is twitchier and catches short bursts. For alerting on a server, sustained load is usually what matters, which is why our own default alert fires on CPU above 80 percent for five-plus minutes rather than reacting to every momentary peak. Extending the window from [1m] to [5m] reduces noise while still showing when there’s a problem.
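
To make the inside-out reading concrete, here is a small Python sketch of the same arithmetic. It is a simplified stand-in, not how Prometheus implements rate(): the real function also handles counter resets and extrapolates to the window edges. The counter values and core names are invented for illustration.

```python
# Toy model of the CPU query: rate() over idle counters, averaged across cores.
# All sample values below are made up.

def per_second_rate(first, last, window_seconds):
    """Simplified stand-in for rate(): counter increase divided by elapsed time."""
    return (last - first) / window_seconds


def cpu_busy_percent(idle_start, idle_end, window_seconds):
    """idle_start and idle_end map a core name to its idle-seconds counter value."""
    idle_rates = [
        per_second_rate(idle_start[core], idle_end[core], window_seconds)
        for core in idle_start
    ]
    avg_idle = sum(idle_rates) / len(idle_rates)  # avg by (instance)
    return 100 - avg_idle * 100                   # percent idle -> percent busy


window = 300  # the [5m] range, in seconds
start = {"cpu0": 1000.0, "cpu1": 2000.0, "cpu2": 3000.0, "cpu3": 4000.0}
end = {"cpu0": 1300.0, "cpu1": 2150.0, "cpu2": 3150.0, "cpu3": 4300.0}

busy = cpu_busy_percent(start, end, window)
print(f"CPU busy: {busy:.1f}%")  # two cores fully idle, two half idle -> 25.0%
```

Two cores gained 300 idle seconds in 300 seconds (rate 1.0) and two gained 150 (rate 0.5), so the average idle fraction is 0.75 and the box was 25 percent busy.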

The production query adds label filters for multi-tenant use; the core logic is identical.

2. Memory usage (percent used)

100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))

Memory is best read as a gauge, so no rate() is used for this query. The values are read at a glance. There is one trap worth understanding, because it’s fairly common.

The naive instinct is to use node_memory_MemFree_bytes, the amount of completely unused memory. It’s a baseline metric that node_exporter provides, and it seems like it makes perfect sense to pull it directly to the panel. On a healthy Linux system, "free" memory is often very low by design. Linux uses otherwise-idle RAM for the page cache, holding recently-read files in memory so it does not have to hit the disk again. That memory looks "used" but is instantly reclaimable the moment a program actually needs it. If you track and alert on low MemFree, you’ll get unnecessary alerts on servers that are working as intended.
The number you need to track is node_memory_MemAvailable_bytes. The kernel calculates this for you. It is the memory genuinely available for new programs to use, after accounting for the cache it can reclaim.

So the query reads: take available memory divided by total memory, which gives you the fraction available. Subtract that from 1 to get the fraction used, and multiply by 100 for a percentage. A good threshold for this panel is 85 percent, or when available memory drops below 15 percent.
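
The gap between the two metrics is easier to see with numbers. The sketch below parses a made-up /proc/meminfo-style sample (the file node_exporter reads these values from) and computes "percent used" both ways. The figures are invented to resemble a healthy server with a large page cache.

```python
# Invented /proc/meminfo-style sample, values in kB.
SAMPLE_MEMINFO = """\
MemTotal:       16000000 kB
MemFree:          800000 kB
MemAvailable:   12000000 kB
Cached:         10500000 kB
"""


def parse_meminfo(text):
    """Return a dict of field name -> integer value in kB."""
    values = {}
    for line in text.splitlines():
        key, rest = line.split(":", 1)
        values[key] = int(rest.split()[0])
    return values


mem = parse_meminfo(SAMPLE_MEMINFO)

# The query from this section: 100 * (1 - MemAvailable / MemTotal)
used_by_available = 100 * (1 - mem["MemAvailable"] / mem["MemTotal"])

# The naive version built on MemFree, for comparison.
used_by_free = 100 * (1 - mem["MemFree"] / mem["MemTotal"])

print(f"used (MemAvailable): {used_by_available:.1f}%")  # 25.0%
print(f"used (MemFree):      {used_by_free:.1f}%")       # 95.0%
```

On this sample the MemFree version would sit above an 85 percent threshold permanently, while the MemAvailable version shows a machine with plenty of headroom.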

The production query adds label filters for multi-tenant use; the core logic is identical.

3. Disk space (percent full)

100 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} * 100)

Disk space is also a gauge, and structurally this is the same shape as the memory query. Take the available divided by total, and turn it into percent used. What makes this query tricky is the label filter, because disk monitoring with unfiltered queries gets noisy.
A Linux machine reports many "filesystems" that are not real disks. Tracking every single one would create a massive amount of noise, and make it difficult to parse out which disks are likely to cause a problem in the near future. tmpfs is memory-backed temporary storage, overlay filesystems belong to running containers, and there are others. If you monitor all of them, your "disk full" dashboard lights up over ephemeral mounts that are largely irrelevant. The filter fstype!~"tmpfs|overlay" strips those out:

fstype is the label node_exporter attaches describing the filesystem type.
!~ means "does not match this regular expression." (=~ would be "does match.")
"tmpfs|overlay" is the regex: the | is an OR, so this matches either type, and !~ excludes both.

This query leaves you with the actual disks on your server. This is also the first time a regex-matching operator has popped up in this article. These two operators, =~ and !~, are how you do most of the flexible filtering in PromQL. Once you can include or exclude by pattern, you can filter metrics any way you need.

One caveat: this query returns one result per mounted disk, which will show metrics for each mounted drive on your server. A server with a separate / and /data should show you both, because either can fill independently. Setting the threshold limit to something like 85 percent full gives you time to act before the disk is full.
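
As a rough illustration of the filter and the one-result-per-mount behavior, the sketch below applies the same logic to a hand-written list of filesystems. The mountpoints, sizes, and the overlay path are hypothetical; only the fstype filter and the percent-full arithmetic come from the query above.

```python
import re

# Hypothetical samples: (mountpoint, fstype, avail_bytes, size_bytes).
FILESYSTEMS = [
    ("/", "ext4", 20 * 10**9, 100 * 10**9),
    ("/data", "xfs", 450 * 10**9, 500 * 10**9),
    ("/run", "tmpfs", 1 * 10**9, 1 * 10**9),
    ("/var/lib/docker/overlay2/abc/merged", "overlay", 20 * 10**9, 100 * 10**9),
]

EXCLUDED_FSTYPES = re.compile(r"tmpfs|overlay")


def percent_full(filesystems):
    """Return {mountpoint: percent full}, skipping tmpfs and overlay mounts."""
    result = {}
    for mount, fstype, avail, size in filesystems:
        if EXCLUDED_FSTYPES.fullmatch(fstype):  # fstype!~"tmpfs|overlay"
            continue
        result[mount] = 100 - (avail / size * 100)
    return result


usage = percent_full(FILESYSTEMS)
for mount, pct in usage.items():
    print(f"{mount}: {pct:.1f}% full")

# One number per host, in the spirit of max by (instance): the fullest disk.
worst = max(usage.values())
print(f"fullest disk: {worst:.1f}%")
```

Taking the maximum collapses the per-mount results into a single "fullest disk" value per host, which is convenient for one alert rule but hides which mount is the problem, so the per-mount view is still worth keeping on the dashboard.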

The production query adds label filters for multi-tenant use; the core logic is identical. The production query uses max by (instance) rather than the simplified version described above.

4. Disk IO (how saturated the disk is)

rate(node_disk_io_time_seconds_total[5m])

The fourth query uses a counter, so rate() returns. This query answers a question that disk-space monitoring doesn’t address. Your disk can have plenty of free space and still create problems because it cannot read and write fast enough to keep up with what the system is demanding of it.
node_disk_io_time_seconds_total counts the total seconds the disk spent actively busy with input/output (IO). Because it is a counter, you wrap it in rate(...[5m]) to get "seconds of IO activity per second," which is effectively a utilization fraction. A result near 1.0 means the disk was busy essentially the entire time, which tells you that the disk is saturated. A result near 0.1 means it was busy about 10 percent of the time, with plenty of headroom.

This is the metric that can help to identify where slowdowns are coming from. When a database gets sluggish, or backups drag but CPU and memory look fine, disk IO saturation is very often the culprit. It’s the kind of problem that simple up-or-down monitoring won’t tell you.

The production query adds label filters for multi-tenant use; the core logic is identical.

5. Network throughput (bytes per second)

rate(node_network_receive_bytes_total{device!~"lo|veth.*"}[5m])

The fifth query is a counter and a regex filter together, which is why I saved it for last. If you understand this one, you’ll have a better understanding of the pattern behind all five.
node_network_receive_bytes_total is a counter of total bytes received on each network interface since boot. rate(...[5m]) turns it into bytes per second, your live inbound throughput. To watch outbound traffic, swap in node_network_transmit_bytes_total, or create a second query in your panel to view the two side by side.
The filter handles the same noise problem that the disk query does. A Linux host has interfaces that you typically don’t need to keep an eye on: lo is the loopback (the machine talking to itself), and veth interfaces are the virtual ethernet links Docker and other container runtimes create, often dozens of them. device!~"lo|veth.*" excludes them:

lo matches the loopback exactly.
veth.* is a regex where . means "any character" and * means "zero or more of the preceding," so veth.* matches veth followed by anything: veth1a2b3c, vethABCD, all of them.

This leaves your physical or primary virtual interface(s), the one(s) carrying traffic that actually matters. The output is in bytes per second, so if you would rather see bits per second to compare against your network provider's numbers, multiply by 8.
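
One detail worth seeing in code is why lo matches only the loopback: PromQL regex matchers are fully anchored, so the pattern has to match the whole label value. Python's re.fullmatch behaves the same way, which makes it a reasonable stand-in for this sketch. The interface names and byte counts below are invented.

```python
import re

DEVICE_EXCLUDE = re.compile(r"lo|veth.*")


def is_excluded(device):
    """Mimic device!~"lo|veth.*" using a full (anchored) match."""
    return DEVICE_EXCLUDE.fullmatch(device) is not None


devices = ["lo", "eth0", "veth1a2b3c", "wlo1"]
kept = [d for d in devices if not is_excluded(d)]
print(kept)  # ['eth0', 'wlo1'] -- wlo1 contains "lo" but is not excluded


def bits_per_second(bytes_start, bytes_end, window_seconds):
    """Simplified rate() over a byte counter, then bytes -> bits."""
    return (bytes_end - bytes_start) / window_seconds * 8


# 1.5 GB received over a 5-minute window.
bps = bits_per_second(0, 1_500_000_000, 300)
print(f"{bps / 1_000_000:.0f} Mbit/s")  # 40 Mbit/s
```

An unanchored search for "lo" would also drop wlo1, a common wireless interface name, which is exactly the kind of silent gap the anchored match avoids.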

The production query adds label filters for multi-tenant use; the core logic is identical. TX is shown as negative so RX and TX can share one panel without overlap.

The pattern

Looking over the five queries discussed here, there are only a few moving parts. Gauges (memory, disk space) you read directly as available-over-total. Counters (CPU, disk IO, network) you wrap in rate() to ask how fast they are climbing. And label filters with =~ and !~ let you cut out the noise so you are watching real disks and real interfaces instead of container ephemera. Five queries, three ideas that give you basic coverage of your servers.

Where this leaves you

If you put these five on a dashboard with sensible thresholds, you have covered the large majority of what goes wrong on a single Linux server: the CPU is overworked, it runs out of memory, a disk fills up, a disk chokes, or its network saturates. There’s always more that you can monitor (node_exporter and Alloy provide a massive amount of system metrics), but anything fancier is a refinement of these fundamentals. You can view Irin’s System Health dashboard here to see these five queries alongside a few others.
Going from "five queries in an expression browser" to "a real monitoring setup" is more work than it looks. Some of the gauge queries are modified and used as time series, so you can see what’s happening over time, not just in that instant. Prometheus needs to be set up to store the data with appropriate retention, Grafana dashboards need to be built around these queries, alert rules need to be wired to thresholds that don’t create noise, and the alerts need somewhere to actually go. Setup isn’t overwhelming, but it is an ongoing process to keep it running and tuned. The monitoring system itself needs to be kept healthy, and thresholds and alerts need to be tuned to your system. Then there’s always the danger of over-monitoring. The first dashboard I created years ago was an endless scroll; it had EVERYTHING, which turned out to be too much. Looking at the dashboard was overwhelming, and I couldn’t just take a glance at it to see how the system was doing, which is the goal.
That recurring chore is the gap Irin Observability exists to fill. We ship these exact queries, pre-built dashboards, and tuned alert thresholds (the 80-percent-CPU, 15-percent-memory, 85-percent-disk defaults referenced above are ours) as a flat-rate managed service, so you get the visibility without becoming the person who maintains the monitoring stack. But whether you run it yourself or hand it off, the five queries above are the foundation either way.

Want the next step? Once you’re familiar with these, the natural follow-up is wiring node_exporter or Alloy up properly and understanding what it can and cannot tell you on its own.

Metrics Tell You Something Broke. Tracing Tells You What, Where, and Why.

Justyn Larry — Thu, 04 Jun 2026 15:00:00 +0000

Complacency is a killer. The monitoring stack that I built works, and it’s reliable, so leaving it alone seems like the most obvious thing to do. Focusing on marketing, documentation, taking time away from it all seem like good options, but there’s always a better way to do something, to solve a problem you didn’t realize you had.

In my spare time, I look through Reddit and Dev.to for ideas or inspiration. Systems that others are using that I’m not, or that I’m not aware of. Distributed traces jumped out at me from both forums — I can tie a system event to the metrics, instead of stumbling around logs? This is a monitoring goldmine. How had I missed this?

WHAT EXACTLY IS DISTRIBUTED TRACING?

For any kind of multi-step processes running on your system, distributed tracing provides a timeline of exactly what happened, and how long each step took. It’s like getting a receipt for the work showing you where time and resources were spent. Each request or job gets a trace ID, and every step records a span — a named block with a start time, end time, and any attributes you want to attach. Those spans assemble into a waterfall, and you can see at a glance where time was spent, what succeeded, and what failed.
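
For readers who have not worked with traces before, the data model is small enough to sketch in plain Python. The ToyTracer class below is a hypothetical illustration of trace IDs, spans, and the waterfall view; it is not the OpenTelemetry API, and the tenant attribute value is invented. The span names are borrowed from the report pipeline described later in this article.

```python
import time
import uuid
from contextlib import contextmanager


class ToyTracer:
    """Minimal stand-in for a tracer, for illustration only."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex  # one ID shared by every span in the job
        self.spans = []
        self._stack = []                  # currently open spans, outermost first

    @contextmanager
    def span(self, name, **attributes):
        record = {
            "trace_id": self.trace_id,
            "name": name,
            "depth": len(self._stack),
            "parent": self._stack[-1]["name"] if self._stack else None,
            "attributes": attributes,
            "start": time.perf_counter(),
            "end": None,
        }
        self.spans.append(record)
        self._stack.append(record)
        try:
            yield record
        finally:
            record["end"] = time.perf_counter()
            self._stack.pop()

    def waterfall(self):
        """Render spans in start order, indented by nesting depth."""
        lines = []
        for s in self.spans:
            ms = (s["end"] - s["start"]) * 1000
            lines.append(f"{'  ' * s['depth']}{s['name']}: {ms:.1f}ms")
        return "\n".join(lines)


tracer = ToyTracer()
with tracer.span("report.generate", tenant="example-tenant"):
    with tracer.span("db.get_contacts"):
        time.sleep(0.01)
    with tracer.span("report.build_pdf"):
        time.sleep(0.02)

print(tracer.waterfall())
```

A real tracer adds span IDs, context propagation across processes, and an exporter, but the shape of the data is the same: named blocks with start and end times, linked to a parent and grouped under one trace ID.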

This added visibility can take a technical team from “this seems slow” to a detailed accounting of how long a process took and what the system was actually doing when the process was lagging.

THE ORIGINAL CORE STACK

Irin Observability runs on Prometheus, Grafana, Loki, Grafana Alloy, and Alertmanager. I’ve built a robust monitoring stack that tracks metrics for request rates, error rates, LLM call counts, and report generation status. There are also logs flowing from all the services through Loki, so overall, I believed that the stack was well-instrumented and very readable.

The alert system that I built runs through five internal services, processing each alert through an alert annotator and then generating a monthly report, in sequence:

An alert comes in from a client’s infrastructure
The alert annotator calls a local LLM to add a plain-English explanation for a panel on one of the dashboards
The annotated result gets pushed back into Loki
At the end of the month, the aggregation script gathers all findings for report generation
The LLM narrative layer writes a summary
The report generator assembles everything into a PDF and sends it

Each of those steps runs in a different process. Some run as Docker containers, some as host Python scripts. When I audited the reports and something didn’t look right, I had to check the logs on the Loki Log Exporter Dashboard or grep logs across multiple services, correlate timestamps manually, and piece together what happened. This was both frustrating and time-consuming. The platform should be telling me what the problem is in addition to telling me that something is wrong.

THE SOLUTION: OPENTELEMETRY

OpenTelemetry (OTel) is an open source standard for collecting telemetry data — traces, metrics, and logs — from applications. It’s vendor-neutral, well-maintained, and has solid Python libraries.

Grafana Tempo is an open source backend for storing and querying traces. It integrates directly with Grafana, so once it’s running you can navigate from a log line to a trace, or from a trace to the logs that were happening at the same time.

Getting this running involved three parts. First, I deployed Tempo as a Docker Compose service, with a config file and a Grafana datasource. The second step was to wire up Grafana Alloy as the collector. Since Alloy is the agent already running on my servers to ship metrics and logs, I was able to add an OTLP receiver block to accept traces from internal services and forward them to Tempo — one config change, and the heartbeat API distributed the updated config files to all the monitored servers. The final step was to instrument the Python services. This is where things got a little more difficult, but it also taught me some valuable lessons.

THE PYTHON IMPLEMENTATION

The OTel Python SDK has two modes. The first is auto-instrumentation, which handles the common cases automatically. If you’re running a Flask or FastAPI app, importing two libraries and calling .instrument() captures every HTTP request with no further changes. If you’re using psycopg2 for Postgres queries, one more library call and every query becomes a span.

The second mode, manual spans, is for the logic your code owns: units of work that typical instrumentation frameworks can’t see automatically. I used these to capture the LLM call itself (duration, prompt size, whether the response parsed cleanly), each section of the aggregation script so I can see which Prometheus query is slow, and the overall per-tenant run so every trace carries a tenant name.

LESSONS LEARNED

Short-lived scripts need an explicit flush.

The aggregation script and report generator run once and exit. The default OTel exporter batches spans and sends them on a timer. If the process exits before the batch fires, you lose all your spans. I fixed it by adding two lines: force_flush() and shutdown() in a try/finally block before exit. I lost my first few test traces before I figured this out.
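
The failure mode is easy to reproduce with a toy model. The ToyBatchExporter class below is a hypothetical stand-in written with the standard library, not the OTel SDK: it buffers span names and only "sends" them when a timer fires or when it is flushed explicitly. The span names are borrowed from the aggregation script, and the 5-second interval is an arbitrary assumption.

```python
import threading


class ToyBatchExporter:
    """Toy batching exporter (not the OTel SDK): buffers spans, sends on a timer."""

    def __init__(self, interval_seconds=5.0):
        self.buffer = []
        self.exported = []
        self._lock = threading.Lock()
        self._timer = threading.Timer(interval_seconds, self.force_flush)
        self._timer.daemon = True  # a background timer does not keep the process alive
        self._timer.start()

    def on_end(self, span_name):
        """Called when a span finishes: queue it, do not send yet."""
        with self._lock:
            self.buffer.append(span_name)

    def force_flush(self):
        """Send everything currently buffered."""
        with self._lock:
            self.exported.extend(self.buffer)
            self.buffer.clear()

    def shutdown(self):
        """Stop the timer and send whatever is left."""
        self._timer.cancel()
        self.force_flush()


exporter = ToyBatchExporter(interval_seconds=5.0)
try:
    # A short-lived script: the work finishes long before the 5-second timer fires.
    exporter.on_end("aggregation.stability")
    exporter.on_end("aggregation.alerts")
finally:
    # Without these two calls the buffered spans would be dropped at exit.
    exporter.force_flush()
    exporter.shutdown()

print(exporter.exported)  # ['aggregation.stability', 'aggregation.alerts']
```

In the real SDK the equivalent calls are made on the tracer provider, but the structure is the same: put the flush and shutdown in a finally block so they run even when the script fails partway through.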

The psycopg2-binary package breaks auto-instrumentation silently.

The OTel instrumentation library checks for a package literally named psycopg2. If you installed psycopg2-binary — the same library, different distribution name — the check fails and you receive no database spans, no error message, nothing reported. The fix is one parameter: Psycopg2Instrumentor().instrument(skip_dep_check=True).

Background tasks break parent-child trace linkage.

My alert annotator returns a 200 response immediately and processes the alert in a background thread. The HTTP span closes when the response is sent, before the real work begins, which means each alert generates two separate traces: a brief HTTP span and an orphaned processing span. This behavior was correct, not a bug, but it looked confusing until I understood the threading model. I accepted it and correlate the two traces by alert fingerprint when necessary.

THE BIG DIFFERENCE

This is where things get interesting, and how the original monitoring stack differs from its current iteration.

Prior to integrating distributed tracing, I knew that the report pipeline ran. That’s it — pass/fail, true/false. If something went wrong, where did it happen, and why? What was the system state at the time of the failure? Now I can open a trace in Grafana Tempo and see:

report.generate: total duration 4m 12s
  db.get_contacts: 41ms
  aggregation.run (per tenant): 2m 18s
    aggregation.stability: 39ms
    aggregation.resources: 1.2s  (slow Prometheus query range)
    aggregation.alerts: 88ms
  llm.narrative_generation: 1m 44s
    llm.build_prompt: 12ms
    llm.call attempt 1: 119s  (timeout)
    llm.call attempt 2: 44s   (success)
    llm.parse: 3ms
  report.build_pdf: 8s
  report.send_email: 2s

That waterfall tells me that the Ollama model timed out on the first attempt and succeeded on the second. I don’t have to go digging through logs in an approximate time frame to figure out what happened. The Prometheus query for resource metrics was the slow step in aggregation. PDF build and email delivery were fast. The problem isn’t solved, but I know exactly what the problem is.

Through the alert annotator, I can now see every alert as a trace. The system shows me the dedup check against Loki, the LLM call, the result push. I can filter by tenant, by alert name, by whether the LLM call succeeded. A 55-second LLM call that I used to see only as a latency spike in a Prometheus histogram is now a named span with the prompt size, the response size, and whether the JSON parsed cleanly.

THE IMPLICATIONS

If you have any experience with monitoring, you have almost certainly hit the “something seems wrong but I can’t tell what” problem. The logs are probably available, you can see the metrics, but you’re stuck sifting through them in sequence trying to reconstruct what happened.

Distributed tracing changes the diagnostic workflow from “search for clues” to “read the receipt.” The trace tells you what happened, in order, with timing, which virtually eliminates investigation time and lets you go directly to the problem at hand.

It also changes how you think about reliability. When I see the LLM call timing out on first attempt consistently, I know to tune the timeout or check model load before it impacts the client. Being proactive in monitoring is a moving target, but it is still the goal.

THE TOOLCHAIN

Everything I used is open source and self-hostable:

OpenTelemetry Python SDK (opentelemetry-sdk, exporter packages, auto-instrumentation libraries)
Grafana Tempo for trace storage and querying
Grafana Alloy as the collector and forwarder
Grafana for visualization, with native Tempo datasource support and log/trace correlation

If you’re already running Prometheus and Grafana for metrics, adding Tempo for traces is a natural extension of the same stack. You can use the same agent, dashboards, and query interface. You’re adding one more signal type, but no new tooling paradigm.

The monitoring stack I run for Irin clients is the same stack I use to observe both Irin and my private infrastructure. It’s what lets me catch instrumentation gotchas and gives me a reliable view of all of my systems. I built Irin because I believe that monitoring your system shouldn’t be a full-time job. If the monitoring stack does what it’s supposed to, you should be able to check it intermittently through the day. It should tell you at a glance if something’s wrong, and send an alert if the problem merits it. If it’s noisy, crowded, and you don’t know where to begin when there’s a problem, the system doesn’t work — and the real problems get drowned out.

When you bring your data home, who is going to keep an eye on it?

Justyn Larry — Wed, 27 May 2026 16:27:58 +0000

Cloud providers have always sold convenience. Compute on demand, storage that scales, and somewhere in the fine print, the implied promise that someone else is watching the infrastructure. For a lot of teams, that last item was the most valuable thing they were paying for, whether they knew it or not.
That arrangement is starting to come apart.

The Numbers

Cloudian's 2026 research report surveyed 212 senior IT decision makers and found that 75% had moved workloads from the cloud back to on-premises infrastructure in the prior 24 months. That is not a rounding error or a niche trend. Three out of four senior IT professionals at organizations large enough to have senior IT professionals made a deliberate choice to bring their data and compute closer to home.
The reasons are not surprising. Security and compliance pressure is one driver, and the growth of AI workloads is another. Michael Gale, CMO at EDB, put it plainly in a recent IT Brew piece, “If you want to use AI and data, you’ve got to be secure and compliant, they’ve got to be next to each other.” Sending proprietary data to a third-party cloud provider to feed a general-purpose model is increasingly hard to justify when purpose-built, containerized, on-premises alternatives exist.
Egress fees are the third driver, and arguably the most compelling one. Cloud providers charge you to store data, and then they charge you to process it. And when you eventually decide you want it back, they charge you for that as well. Andy Stone, CTO for the Americas at Everpure, described it clearly: “They’re saying, as long as your data lives here, we’re cool; you want to take your data out, we’re going to charge you on the back end. In your data center, you don’t have that, you’re not going to pay an egress charge. It’s a benefit you derive, but the move itself takes time, a lot of planning and effort, and it’s certainly not easy in most cases.” So companies pay not only for usage but also to get their data back, and once it is back, the onus of monitoring and its associated costs transfer to the company as well.

What Moves With the Data

The part of this conversation that does not get enough attention is what teams lose when they leave the cloud, beyond the convenience of managed services.
AWS CloudWatch, Azure Monitor, Google Cloud Operations. These tools exist because cloud providers understand that customers need to be able to see their infrastructure to troubleshoot it, and customers who cannot troubleshoot it generate support tickets. Visibility was bundled into the cost of cloud compute because the cloud needed it to function at scale. Informed customers generate fewer support tickets, so monitoring in a cloud environment became an amenity, when in reality it lowers their support costs.
When a company repatriates its workloads, that visibility disappears. Now that the servers and the data are in house, so is the burden of monitoring the system. In the IT Brew piece, Stone notes that repatriation requires a lot of architecting and planning, including managing the applications consuming and producing data. That’s accurate, and monitoring sits at the center of it. It’s hard to manage what you can’t see, and managing infrastructure on-premises creates a monitoring gap that needs to be filled, either internally or externally.

Unforeseen Migration Gaps

The teams making this move are not all large enterprises with dedicated platform engineering staff. It is reasonable to assume that some portion of that 75% are organizations with lean technical teams making a deliberate architectural choice to prioritize control. They have the skills to manage their own hardware, they’ve made the cost calculation and decided it made sense. What they frequently do not have is the time or the desire to build and maintain a production-grade observability stack on top of everything else that’s migrating from the cloud.
This is where the repatriation trend creates a genuinely new problem rather than just a different version of an old one. The cloud abstracted away the operational burden of monitoring. On-premises infrastructure exposes it directly. Companies need to be made aware that a disk is filling up before it causes an outage, alert routing needs to reach someone when a service goes down in the middle of the night, and log retention should go back far enough to reconstruct the events that occurred during an incident.
Building a monitoring stack is not the hard part; most teams can easily deploy the tooling. The open source tooling available for collecting telemetry is genuinely excellent. The real problem created by building an in-house monitoring system is the ongoing operational overhead and figuring out which team members will own the maintenance. It’s an ongoing process that requires dedicated personnel to configure the tools, tune them, keep them running, and revisit the alert thresholds as the infrastructure changes. After months of planning and executing data repatriation, these teams are now faced with creating and maintaining monitoring for their infrastructure, and allocating resources they may not have to that endeavor.

The Path Ahead

The repatriation trend is not likely to lose momentum in any meaningful way. The AI data sovereignty argument is too strong, the cost of cloud computing is too high, and security is becoming a bigger issue. If anything, the next wave of AI agent deployments will accelerate it. Gale's estimate of up to 300 million agents operating in US enterprises is speculative but directionally correct. Agents need data, that data needs to be governed, and governance is substantially easier when you control the physical location of the data.
As companies continue to pull their data in-house, a large and growing number of technical teams will find themselves responsible for infrastructure that requires monitoring, with limited time and resources to build and maintain it. Cloud-provided tools and infrastructure demonstrated the need for good visibility, and altering the deployment model should not mean changing how teams monitor their systems.
Companies that navigate this well will be the ones who treat observability as a priority from the start of the repatriation process, not something to revisit once the migration is complete.

Adding an LLM Narration Layer to a Self-Hosted Observability Stack

Justyn Larry — Tue, 12 May 2026 17:21:59 +0000

I almost made the classic AI architecture mistake.

I could easily just dump raw Prometheus metrics and Loki logs into an LLM and ask it to summarize anomalies and trends. What could possibly go wrong? The more I thought about it, the more obvious it became that I needed more guardrails and smarter preprocessing, not just more AI.
Right now, it feels like every company is trying to answer the same question:
“How can we add AI to this?”

The more important question is whether AI belongs there at all, and if it does, how to implement it responsibly.

Over the last year, I built a self-hosted observability platform running Prometheus, Grafana, Loki, Alertmanager, and Grafana Alloy on bare metal infrastructure. Clients sign up through a web portal, run a bootstrap script hosted by an internal API, and receive dashboards, alerts, and monthly PDF health reports delivered by email.

The reporting system is where introducing an LLM actually started to make sense.
The reports already contained the raw information:
• CPU, memory, and disk trends
• uptime summaries
• alert history
• cost optimization findings
But raw information is not the same thing as insight.
If a client is already looking at Grafana dashboards, they already have access to the data. What they actually need is context:
• what changed,
• what matters,
• what should concern them,
• and what can probably be ignored.
That sent me down a path I spent the better part of a week wrestling with:

Do I actually need AI in this stack?

What the report system looks like right now

Each client gets a monthly PDF that covers:
• CPU, memory, and disk trends per server
• Alert history and incident counts
• Uptime summary
• A cost optimization section (flagging underutilized servers)

The report is generated by a Python script that queries Prometheus and Loki, builds a structured JSON findings object, pulls panel screenshots from Grafana Image Renderer, and assembles everything into a PDF via ReportLab. It goes out through Resend on a cron schedule.

Currently, the sections that require judgment are either static templated text or stubbed as null. These are the sections where an LLM could add actual value, specifically in the anomaly narrative: the ability to tell a client “here’s what happened this month and this is why it matters” or “server X has averaged 4% CPU for 30 days, you are paying for capacity you are not using.” Providing server-specific information and cost optimization recommendations is a heavy lift at scale. Maybe I do need AI…

The wrong answer is always the most tempting

My first instinct was to feed the raw Prometheus metrics and Loki logs straight into an LLM prompt and ask it to summarize the trends and flag any anomalies.

The simplicity of that idea raised a red flag, and the reasons became obvious when I thought through what the model actually receives.

Raw Prometheus output is a time series: thousands of data points, repeated metric names, label sets, and timestamps in Unix epoch format. An LLM has no built-in statistical reasoning about time series data. It reads the data as a flat list of numbers, producing summaries that bury the signal in noise and arriving at conclusions that sound confident but are mathematically hollow.

The second problem is client data isolation. With multi-tenant data, an improper implementation risks leaking context between tenants in the prompt. Even with careful prompt engineering, raw metric dumps from multiple clients could bleed into one another and pollute the report data.

Cost and latency at scale posed a problem as well. With five clients, calling a cloud LLM API per client per month is manageable, but at fifty clients, the compute requirements and API costs scale aggressively.

Preprocess first, always

The correct pattern is to preprocess the metrics into structured summaries before the LLM ever sees them. I didn’t want the LLM to perform data analysis; I wanted it to narrate.

This is the approach that I settled on:

Step 1: Query Prometheus and Loki with purpose

Instead of dumping raw time series, compute the statistics that matter:
• Average CPU utilization per server over the reporting period
• Peak CPU, with timestamp, over the same period
• Memory trend (growing, stable, shrinking)
• Disk utilization and projected time to threshold at current growth rate
• Alert counts by severity
• Error log counts and top recurring patterns from Loki
The Python script already does most of this to build the findings.json object. The change for me here was that instead of rendering that JSON directly into a PDF template, the system would need to also pass a structured summary of it to the LLM.
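
As a rough illustration of the kind of preprocessing described in Step 1, here is a minimal sketch that reduces one series to an average, a timestamped peak, and a coarse trend label. The function name `summarize_series`, the `(unix_ts, value)` input shape, and the `stable_band` threshold are assumptions made for this example, not the actual report script. The trend rule (compare the mean of the second half of the period with the first half) is one simple choice among many.

```python
from datetime import datetime, timezone
from statistics import mean


def summarize_series(samples, stable_band=2.0):
    """Reduce a time series to the handful of numbers the report needs.

    samples: list of (unix_timestamp, percent_value) tuples, oldest first.
    stable_band: hypothetical threshold (in percentage points) below which
    a change between the two halves of the period counts as "stable".
    """
    if not samples:
        raise ValueError("cannot summarize an empty series")

    values = [value for _, value in samples]
    peak_ts, peak_value = max(samples, key=lambda s: s[1])

    # Coarse trend: mean of the later half minus mean of the earlier half.
    half = len(values) // 2
    delta = mean(values[half:]) - mean(values[:half]) if half else 0.0
    if delta > stable_band:
        trend = "growing"
    elif delta < -stable_band:
        trend = "shrinking"
    else:
        trend = "stable"

    peak_at = datetime.fromtimestamp(peak_ts, tz=timezone.utc)
    return {
        "average": round(mean(values), 1),
        "peak": peak_value,
        "peak_at": peak_at.strftime("%Y-%m-%d %H:%M UTC"),
        "trend": trend,
    }
```

The output is a small dictionary per metric, which is far easier to place into a prompt than the raw samples it was computed from.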

Step 2: Build a structured prompt, not a data dump

The input to the LLM looks something like this:

Server: web-01
Reporting period: April 2026

CPU: Average 68%, peak 94% on April 14 at 02:17 UTC
Memory: Average 71%, stable trend
Disk: 61% used, growing approximately 2% per month at current rate
Alerts fired: 3 (2 high CPU, 1 disk warning)
Error logs: 847 total, top pattern: "connection timeout to db-01" (312 occurrences)

Task: Write a 2-3 sentence plain-English summary of this server's behavior
during the reporting period. Note anything that warrants client attention.
Do not use technical jargon.

By setting up the prompt this way, I could lean into a job that an LLM performs well. The preprocessing pipeline handles the statistical analysis before the LLM ever sees the data. The model’s job is reduced to converting structured findings into readable prose, which dramatically lowers the chance of hallucination or incorrect conclusions.
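
A prompt in the shape shown in Step 2 can be rendered from a findings dictionary with a plain template. This is only a sketch: the key names (`cpu_avg`, `top_error`, and so on) are hypothetical and do not come from the real findings.json schema.

```python
# Template mirroring the example prompt above. All field names are
# illustrative assumptions, not the real findings.json keys.
PROMPT_TEMPLATE = """Server: {server}
Reporting period: {period}

CPU: Average {cpu_avg}%, peak {cpu_peak}% on {cpu_peak_at}
Memory: Average {mem_avg}%, {mem_trend} trend
Disk: {disk_used}% used, growing approximately {disk_growth}% per month at current rate
Alerts fired: {alert_count} ({alert_breakdown})
Error logs: {error_total} total, top pattern: "{top_error}" ({top_error_count} occurrences)

Task: Write a 2-3 sentence plain-English summary of this server's behavior
during the reporting period. Note anything that warrants client attention.
Do not use technical jargon.
"""


def build_prompt(findings):
    """Render the structured prompt for one server.

    str.format raises KeyError when a field is missing, so an incomplete
    findings object fails loudly instead of producing a half-empty prompt.
    """
    return PROMPT_TEMPLATE.format(**findings)
```

Because the template only accepts precomputed values, there is no path for raw time series data to end up in the prompt by accident.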

Step 3: Isolate per tenant, per server

To eliminate the possibility of tenant data mixing, each LLM call covers one server for one tenant. The prompt contains only the preprocessed summary for that server.
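
One way to sketch that isolation rule in code is to generate exactly one job per (tenant, server) pair, then run a cheap sanity check before each call. The nested-dictionary shape and both function names are assumptions for illustration. The check is deliberately naive: it only looks for server names belonging to other tenants, and it would produce false positives if two tenants used the same server name.

```python
def iter_narration_jobs(findings_by_tenant):
    """Yield one LLM job per (tenant, server) pair.

    findings_by_tenant: {tenant_id: {server_name: prompt_text}} where
    prompt_text is the preprocessed summary for that server only.
    """
    for tenant_id, servers in findings_by_tenant.items():
        for server_name, prompt_text in servers.items():
            yield {"tenant": tenant_id, "server": server_name, "prompt": prompt_text}


def foreign_names_in(job, findings_by_tenant):
    """Return server names owned by other tenants that appear in the prompt."""
    foreign = {
        name
        for tenant_id, servers in findings_by_tenant.items()
        if tenant_id != job["tenant"]
        for name in servers
    }
    return sorted(name for name in foreign if name in job["prompt"])
```

A job would only be sent to the model when `foreign_names_in` returns an empty list. Anything else is treated as a bug in the preprocessing step rather than something to fix in the prompt.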

The privacy angle, and why it matters for SMB clients

The LLM runs locally on my LAN so client telemetry never leaves my infrastructure.
That decision was partly cost-driven, but mostly about data boundaries. Monitoring systems already require a significant amount of operational trust. Sending client metrics and logs to an external AI provider adds an additional layer of exposure that I was uncomfortable with.
Being able to say that the AI analysis of their logs runs on hardware I own and control, never outsourced, is a meaningful trust signal. The data never leaves the monitoring environment.

Error handling

This piece of the architecture took a little thought. Ultimately the LLM is an optional enrichment layer, not a report dependency. If local inference is unavailable for whatever reason, the report still ships.
The flow looks like this:

The LLM is an enrichment layer: static reports ship immediately on failure, with AI narratives following as a supplement only if local inference recovers.

This way, the client always gets a report. If the LLM is unavailable when the report is generated, the narrative section is absent from that report. If inference recovers later, the narrative reaches the client as a supplement without re-sending the full report. Static report generation is never blocked by or reliant on LLM availability.
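
The degradation path can be sketched as a wrapper that treats every inference failure as "no narrative". Here `call_llm` is a hypothetical callable you would supply for your own local inference endpoint, and the `narrative_pending` flag is an assumed way of marking reports that should get a follow-up supplement later.

```python
def narrate_or_none(prompt, call_llm, timeout_s=30.0):
    """Return narrative text, or None if inference fails or returns junk.

    call_llm is a hypothetical callable: call_llm(prompt, timeout_s) -> str.
    Any exception (connection refused, timeout, bad response) is swallowed
    here on purpose, because the narrative is optional.
    """
    try:
        text = call_llm(prompt, timeout_s)
    except Exception:
        return None
    if not isinstance(text, str) or not text.strip():
        return None
    return text.strip()


def build_report(findings, prompt, call_llm):
    """Assemble the report; the static sections never depend on the LLM."""
    report = {"findings": findings, "narrative": None, "narrative_pending": False}
    narrative = narrate_or_none(prompt, call_llm)
    if narrative is None:
        # Ship the static report now; a later job can send the supplement.
        report["narrative_pending"] = True
    else:
        report["narrative"] = narrative
    return report
```

Passing the inference call in as a parameter also makes the failure path easy to test, since a stub that raises stands in for a dead endpoint.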

The LLM is not the analyst

If you are building something like this and starting fresh, the one architectural principle worth internalizing early is this: the LLM is a narrator, not an analyst. Do the analysis yourself in code and hand the result off to the LLM. Give the model clean, structured summaries and a well-defined writing task. The results are dramatically better than dumping raw data into a prompt and hoping for insight.

Second, design for failure from the beginning. The pipeline should degrade gracefully when the inference endpoint is down, slow, or returning unusable data. Delivering a report without the narrative section is better than delivering no report at all.

So, do I need AI in my monitoring stack?
The honest answer? I’m still not entirely sure.

This experiment has made me think differently about LLM integration. I no longer see the model as the system performing the analysis. The deterministic systems still do the reasoning. Prometheus, Loki, and the preprocessing pipeline establish the facts. The LLM’s job is to translate structured findings into readable context.

That distinction ended up mattering far more than the model itself.

If you are building something similar, my biggest takeaway is this:
Let the LLM be the narrator, not the creator. Keep the reasoning in your deterministic systems, and prompt the model to explain the result, not discover it.
