Azeem Siddiqui

Posted on Jun 22

Centralized Logging Across a Mixed-OS Server Fleet with Grafana Alloy and Loki

#ai #grafana #devops #logs

For a long time, finding a log on our fleet meant SSH-ing into the box and running grep. Which box? Depends. If you knew exactly where the thing ran, fine. If you didn't — and during an incident you usually don't — you were hopping between hosts, scrolling Tomcat logs, trying to line up timestamps across three machines by eye. It works right up until the moment you need it to work fast.

So I rolled out centralized logging: every log line from every server, in one place, queryable, with a Slack command so people didn't even have to open Grafana to check something. The stack is Grafana Alloy on each host shipping to Loki. This post is that rollout — including the parts the docs don't warn you about, which is most of what made it take longer than the happy-path tutorial suggests.

A quick word on why Alloy and Loki, since there are other options. I'd run ELK before, and the Elasticsearch side is a lot of machine to feed and babysit for what amounts to "let me search my logs." Loki indexes labels, not the full log text, so it's dramatically cheaper to run and the storage footprint is sane. The paid SaaS log tools are great until you see the bill for a chatty fleet. Alloy is Grafana's agent — it replaced Promtail and the old Grafana Agent — and it does the collecting. The combination gave me searchable logs without standing up a search cluster or signing a contract.

What the pipeline looks like

  each host                          central
┌───────────────┐
│  app logs     │
│  /var/log/... │
│      │        │
│   Grafana     │   push (HTTP)    ┌──────────┐     ┌──────────┐
│   Alloy       │ ───────────────► │   Loki   │ ◄── │ Grafana  │
│  (per box)    │                  └──────────┘     └──────────┘
└───────────────┘                       ▲
                                         │ LogQL over HTTP API
                                   ┌──────────┐
                                   │ Slack bot│
                                   └──────────┘

Alloy reads files on each host, tags each line with a few labels (which host, which environment, which service), and pushes to Loki. Grafana queries Loki for the dashboard view. The Slack bot hits Loki's HTTP API directly so people can pull logs without leaving the channel they're already in. That last piece turned out to be the one everybody actually uses.

Standing up Loki

For trying this out, a single Loki container is plenty. Here's a docker-compose.yml that gets you Loki and Grafana side by side:

services:
  loki:
    image: grafana/loki:3.0.0
    ports:
      - "3100:3100"
    command: -config.file=/etc/loki/config.yaml
    volumes:
      - ./loki-config.yaml:/etc/loki/config.yaml
      - loki-data:/loki

  grafana:
    image: grafana/grafana:11.0.0
    ports:
      - "3000:3000"
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin

volumes:
  loki-data:

And a minimal loki-config.yaml that stores everything on the local filesystem:

auth_enabled: false

server:
  http_listen_port: 3100

common:
  instance_addr: 127.0.0.1
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

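compactor:
  working_directory: /loki/compactor
  retention_enabled: true            # without this, retention_period is ignored
  delete_request_store: filesystem   # required in Loki 3.x once retention is on

limits_config:
  retention_period: 744h   # 31 days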

docker compose up -d and you've got Loki listening on 3100 and Grafana on 3000.
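Before any agent is involved, it can be worth confirming that Loki accepts writes at all. Here's a minimal sketch that pushes one test line through Loki's push endpoint. It assumes Node 18+ (for the built-in fetch) and Loki on localhost:3100 as in the compose file above; buildPushPayload, pushSmokeTest and the job="smoke" label are names made up for this example.

```javascript
// Build the JSON body Loki's push endpoint expects:
// { streams: [{ stream: {labels}, values: [[ts_ns_string, line], ...] }] }
function buildPushPayload(labels, lines, nowMs = Date.now()) {
  // Timestamps are nanoseconds, sent as strings; BigInt keeps them exact.
  const base = BigInt(nowMs) * 1000000n;
  // Give each line its own nanosecond so entries keep their order.
  const values = lines.map((line, i) => [String(base + BigInt(i)), line]);
  return { streams: [{ stream: labels, values }] };
}

// Hypothetical helper: push a single test line to a local Loki.
async function pushSmokeTest(baseUrl = "http://localhost:3100") {
  const payload = buildPushPayload({ job: "smoke" }, ["hello from the smoke test"]);
  const res = await fetch(`${baseUrl}/loki/api/v1/push`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  });
  if (!res.ok) {
    throw new Error(`Loki push failed: ${res.status} ${await res.text()}`);
  }
}
```

Call pushSmokeTest() once, then look for {job="smoke"} in Grafana's Explore view. If the line shows up, the Loki side is working and anything that goes wrong later is on the agent side.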

Filesystem storage is fine for a single node and for following along. In a real deployment I point object_store at S3 (or whatever object storage you've got) — Loki is built to keep chunks in object storage, and that's where it gets cheap and durable. Don't run a real fleet on local disk. But for getting the pipeline working end to end, this is one less thing to set up.

The Alloy agent

Now the part that runs on every host. Install on Debian/Ubuntu from Grafana's apt repo:

sudo mkdir -p /etc/apt/keyrings/
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor \
  | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" \
  | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt-get update
sudo apt-get install -y alloy

On RHEL-family boxes (Rocky, Alma, and so on) it's the rpm repo and dnf install alloy instead. Same agent, different package manager — keep that in mind, because "mixed fleet" is the whole theme here and the install step is the first place it shows up.

Alloy's config lives at /etc/alloy/config.alloy and it speaks its own config language. Here's a working config that watches an app's logs and ships them:

// Discover the files to read, and stamp each one with labels.
local.file_match "app" {
  path_targets = [
    {
      __path__ = "/var/log/myapp/*.log",
      host     = constants.hostname,
      env      = "prod",
      service  = "api",
    },
  ]
}

// Read those files.
loki.source.file "app" {
  targets    = local.file_match.app.targets
  forward_to = [loki.process.app.receiver]
}

// Process: stitch multiline entries, pull out the log level.
loki.process "app" {
  stage.multiline {
    firstline     = "^\\d{4}-\\d{2}-\\d{2}[ T]\\d{2}:\\d{2}:\\d{2}"
    max_wait_time = "3s"
  }

  stage.regex {
    expression = "\\s(?P<level>TRACE|DEBUG|INFO|WARN|ERROR|FATAL)\\s"
  }

  stage.labels {
    values = { level = "" }
  }

  forward_to = [loki.write.central.receiver]
}

// Ship to Loki.
loki.write "central" {
  endpoint {
    url = "http://loki-host:3100/loki/api/v1/push"
  }
}

The shape is always the same: find files, read files, process, write. The labels I set in local.file_match — host, env, service — ride along with every line, and they're what you'll select on later. constants.hostname fills in the box's hostname so you don't have to hardcode it per machine, which matters when you're templating this across a fleet.

Reload after editing:

sudo systemctl reload alloy
journalctl -u alloy -f

Watch that journal the first time. It'll tell you if it can't find files, can't reach Loki, or doesn't like your config — and one of those three is almost always your first problem.

Where "mixed-OS fleet" stops being a footnote

The happy-path tutorial assumes every box is the same and every log is readable. A real fleet is neither. Here's what actually slowed me down.

Permissions, every time. Alloy runs as the alloy user. Your application logs are very often owned by the app's user and group, mode 0640, which means a process that isn't root and isn't in that group can't read them. Alloy starts up clean, reports no errors, and ships nothing — because it can't open the file, and "can't read it" looks a lot like "nothing to read." The fix is to put the alloy user in the app's group:

sudo usermod -aG tomcat alloy
sudo systemctl restart alloy

I lost more time to this than I'd like to admit, precisely because it fails silently. If a host is shipping nothing and the journal looks fine, check whether the alloy user can actually cat the file. That's the first thing I check now.

The box that's two years past EOL. Every fleet has one. The distro's so old the vendor stopped shipping security patches, nobody's volunteering to migrate it, and it is of course running something important. The agent will usually still install, but the older init and a couple of defaults don't behave like they do on a current release, and you don't want a routine package update yanking the version out from under you. I pin the package version on those boxes and treat them as their own little special case rather than pretending they're like the rest.

Ripping out the old shipper. Most fleets that have been around a while already have some log agent — a leftover vendor thing, an old Promtail, a Filebeat somebody set up in 2019. Leave it running next to Alloy and you've got two processes tailing the same files, double-shipping, and a config nobody remembers the login for. Part of this rollout was finding and removing those:

sudo systemctl stop legacy-log-agent
sudo systemctl disable legacy-log-agent

Decommissioning the old thing is genuinely part of the job, not a cleanup task for later. Two agents disagreeing about what they've shipped is a debugging session you don't want.

Multiline: the parsing that bites everyone

Here's the one that everybody hits. A Java stack trace is one logical event spread across forty physical lines. By default, Alloy treats every line as its own log entry — so your stack trace lands in Loki as forty separate, context-free entries, and when you go looking for the error during an incident, you find the first line and none of the cause.

The stage.multiline block fixes it, and the whole thing turns on one field: firstline. That regex describes what the start of a new entry looks like. Anything that doesn't match gets appended to the entry above it. So if your log lines begin with a timestamp:

stage.multiline {
  firstline     = "^\\d{4}-\\d{2}-\\d{2}[ T]\\d{2}:\\d{2}:\\d{2}"
  max_wait_time = "3s"
}

a new line that starts with 2024-01-15 10:23:45 is a new entry, and the indented at com.example... frames underneath it get glued on where they belong. One readable event instead of forty fragments.
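To make the rule concrete, here's a small simulation of the grouping logic. It's a simplified model for illustration, not Alloy's implementation: it ignores max_wait_time and any size limits, and groupMultiline is a name I made up.

```javascript
// Group physical lines into logical entries the way a firstline regex does:
// a line matching `firstline` starts a new entry, anything else is appended
// to the entry before it. (Simplified: no max_wait_time, no size limits.)
function groupMultiline(lines, firstline) {
  const entries = [];
  for (const line of lines) {
    if (firstline.test(line) || entries.length === 0) {
      entries.push(line);
    } else {
      entries[entries.length - 1] += "\n" + line;
    }
  }
  return entries;
}

// Same pattern as the Alloy config, written as a JavaScript regex literal.
const firstline = /^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}/;

const sample = [
  "2024-01-15 10:23:45 ERROR Request failed",
  "java.lang.IllegalStateException: boom",
  "    at com.example.Api.handle(Api.java:42)",
  "2024-01-15 10:23:46 INFO Recovered",
];

console.log(groupMultiline(sample, firstline).length); // 2
```

Four physical lines come out as two entries: the error with its exception and stack frame attached, and the recovery message on its own.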

Now the part that actually cost me time. One firstline regex does not cover a mixed fleet, because different applications start their lines differently. One service logs ISO timestamps. Another leads with a bracketed [2024-01-15T10:23:45]. A third uses a syslog-style Jan 15 10:23:45. A regex tuned for the first silently fails on the others — and "silently" is the dangerous word. It doesn't error. It mostly looks fine. The single-line entries parse correctly, so the dashboards look healthy, and you don't find out the multiline is broken until you're staring at a stack trace shredded into thirty rows in the middle of an outage. That is the worst possible moment to learn your firstline regex was written for a different app.

So I stopped trying to write one config to rule them all. Each application family gets its own firstline pattern, and I template the Alloy config per app-type — same skeleton, the firstline and the paths swapped in. The discipline is: when you onboard a new kind of app, go look at three or four raw log lines and confirm what the start-of-entry marker actually is before you trust the multiline. Thirty seconds of looking saves you that incident.
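That "go look at a few raw lines" step is easy to script. The sketch below runs sample lines against one candidate pattern per family and reports what share of lines each pattern treats as a first line. The three patterns are my guesses at the formats described above, and FAMILIES and firstlineCoverage are hypothetical names; adjust them to what your apps really emit.

```javascript
// Hypothetical start-of-entry patterns, one per log format family.
const FAMILIES = {
  iso: /^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}/,
  bracketed: /^\[\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}/,
  syslog: /^[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2}/,
};

// Return the share of sample lines each pattern recognizes as a first line.
function firstlineCoverage(lines, families = FAMILIES) {
  const coverage = {};
  for (const [name, pattern] of Object.entries(families)) {
    const hits = lines.filter((line) => pattern.test(line)).length;
    coverage[name] = lines.length ? hits / lines.length : 0;
  }
  return coverage;
}

const sample = [
  "[2024-01-15T10:23:45] ERROR Request failed",
  "java.lang.IllegalStateException: boom",
  "    at com.example.Api.handle(Api.java:42)",
  "[2024-01-15T10:23:46] INFO Recovered",
];

console.log(firstlineCoverage(sample));
// { iso: 0, bracketed: 0.5, syslog: 0 }
```

A family that scores zero on a real sample is the wrong pattern for that app. A score well below 1 is normal, since continuation lines are not supposed to match.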

One more knob: max_wait_time is how long Alloy waits for continuation lines before it gives up and flushes what it has. Too short and you'll occasionally guillotine the tail of a slow-written stack trace. A few seconds is a sane default.

Querying it: LogQL

With logs flowing, you query them with LogQL. It reads like PromQL's cousin. You start by selecting a stream with labels, then filter the text:

{env="prod", service="api"} |= "ERROR"

That's "lines from the prod api service containing ERROR." The label selector narrows down which streams Loki even looks at — that's the cheap part — and the |= filters the text. A few more that I reach for constantly:

{env="prod"} |= "ERROR" != "favicon"          # errors fleet-wide, minus the noise
{service="api"} | json | status >= 500         # parse JSON logs, filter on a field
{host="web-03"} | level="ERROR"                 # the label we extracted earlier
sum by (service) (rate({env="prod"} |= "ERROR" [5m]))   # error rate per service

That last one is a metric built straight out of logs, which is where Loki starts earning its place beyond grep-in-a-box: you can alert on it.

Logs in Slack

The dashboard is fine, but the thing that got people to actually adopt this was the Slack bot. Nobody wants to context-switch to a browser to answer "is the api throwing errors right now." So the bot queries Loki's HTTP API and posts results back in the channel.

The core is one function that hits Loki's query_range endpoint:

async function queryLoki({ query, sinceMs, limit = 50 }) {
  const end = `${Date.now()}000000`;            // ns; Loki wants nanoseconds
  const start = `${Date.now() - sinceMs}000000`;

  const url = new URL("http://loki-host:3100/loki/api/v1/query_range");
  url.searchParams.set("query", query);
  url.searchParams.set("start", start);
  url.searchParams.set("end", end);
  url.searchParams.set("limit", String(limit));
  url.searchParams.set("direction", "backward"); // newest first

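  const res = await fetch(url);
  if (!res.ok) {
    // Surface Loki's own error text instead of failing later on a missing field.
    throw new Error(`Loki ${res.status}: ${await res.text()}`);
  }
  const body = await res.json();

  // result: [{ stream: {labels...}, values: [[ts_ns, line], ...] }]
  // Each stream is ordered on its own, so merge them and re-sort newest first.
  return body.data.result
    .flatMap((s) => s.values.map(([ts, line]) => ({ ts, line })))
    .sort((a, b) => (BigInt(a.ts) < BigInt(b.ts) ? 1 : BigInt(a.ts) > BigInt(b.ts) ? -1 : 0));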
}

A small gotcha lives in that first line. Loki timestamps are nanoseconds, and Date.now() gives milliseconds, so you'd naturally write Date.now() * 1e6. Don't: that number is bigger than JavaScript's safe integer range, so it gets rounded to the nearest representable double and the timestamp you send is only approximately the one you meant. Building the string by hand (${Date.now()}000000, six zeros for ms-to-ns) sidesteps the whole floating-point problem. This one is not in the docs; it's in the "why are my time ranges off" stage of debugging.
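If you'd rather do the arithmetic than concatenate zeros, BigInt is another way around the same problem. A minimal sketch; msToNs is a name made up for this example, and it assumes an integer millisecond value like the one Date.now() returns.

```javascript
// Convert epoch milliseconds to a nanosecond string without going through
// a double. BigInt arithmetic is exact at any size.
function msToNs(ms) {
  return (BigInt(ms) * 1000000n).toString();
}

const ms = 1718000000123;
console.log(Number.isSafeInteger(ms * 1e6)); // false
console.log(msToNs(ms));                     // 1718000000123000000
```

Both approaches produce the same string. The BigInt version is easier to extend if you later need to add or subtract offsets in nanoseconds.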

Now wire it to a slash command. Using Slack's Bolt framework, a /logs command with a tiny argument parser handles the cases people actually ask for:

import pkg from "@slack/bolt";
const { App } = pkg;

const app = new App({
  token: process.env.SLACK_BOT_TOKEN,
  signingSecret: process.env.SLACK_SIGNING_SECRET,
});

// "/logs service=api ERROR"        -> {service="api"} |= "ERROR", last 15m
// "/logs tail service=api"         -> same selector, last few lines
// "/logs service=api ERROR yesterday" -> widen the window to the last 24h
function parse(text) {
  const tokens = text.trim().split(/\s+/).filter(Boolean);
  const selectors = [];
  const filters = [];
  let sinceMs = 15 * 60 * 1000;
  let limit = 50;

  for (const t of tokens) {
    if (t === "tail") { limit = 20; }
    else if (t === "yesterday") { sinceMs = 24 * 60 * 60 * 1000; }
    else if (t.includes("=")) {
      // Split on the first "=" only, so values that contain "=" survive intact.
      const i = t.indexOf("=");
      selectors.push(`${t.slice(0, i)}="${t.slice(i + 1)}"`);
    } else {
      filters.push(`|= "${t}"`);
    }
  }

  const query = `{${selectors.join(", ")}} ${filters.join(" ")}`.trim();
  return { query, sinceMs, limit };
}

app.command("/logs", async ({ command, ack, respond }) => {
  await ack();
  const { query, sinceMs, limit } = parse(command.text);

  let rows;
  try {
    rows = await queryLoki({ query, sinceMs, limit });
  } catch (err) {
    await respond(`Query failed: \`${query}\``);
    return;
  }

  if (rows.length === 0) {
    await respond(`No matches for \`${query}\` in the window.`);
    return;
  }

  const block = rows
    .reverse()                       // oldest -> newest reads better
    .map((r) => r.line)
    .join("\n")
    .slice(0, 3500);                 // stay under Slack's message limit

  await respond("```\n" + block + "\n```");
});

await app.start();

So in any channel: /logs service=api ERROR gives you the last fifteen minutes of API errors, tail trims it to a handful of recent lines, and yesterday widens the window to the last 24 hours when you're chasing something from the day before. The parser is deliberately dumb: label-ish tokens become selectors, bare words become line filters, and two keywords adjust the limit and the window. It covers the questions people actually ask, and it's small enough to extend in an afternoon when they ask for more.
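One thing the parser above skips is escaping: whatever someone types goes straight into the LogQL string, so a stray double quote breaks the query, and a command with no label=value token produces an empty selector that Loki rejects. Here's a sketch of a stricter query builder you could swap in. escapeLogQL, LABEL_NAME and buildQuery are hypothetical names, and the validation rules are my assumptions about what's reasonable to accept from a chat command.

```javascript
// Escape a value for use inside a double-quoted LogQL string.
function escapeLogQL(value) {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"');
}

// Accept only plain label names: letters, digits and underscores.
const LABEL_NAME = /^[A-Za-z_][A-Za-z0-9_]*$/;

// Build a query from already-tokenized input; throws on anything suspicious.
function buildQuery(selectors, filters) {
  const matchers = Object.entries(selectors).map(([name, value]) => {
    if (!LABEL_NAME.test(name)) throw new Error(`Invalid label name: ${name}`);
    return `${name}="${escapeLogQL(value)}"`;
  });
  // Loki rejects a selector with no matchers, so fail early with a clear message.
  if (matchers.length === 0) {
    throw new Error("At least one label=value is required");
  }
  const lineFilters = filters.map((f) => `|= "${escapeLogQL(f)}"`);
  return [`{${matchers.join(", ")}}`, ...lineFilters].join(" ");
}

console.log(buildQuery({ service: "api" }, ["ERROR"]));
// {service="api"} |= "ERROR"
```

Throwing here fits the existing handler, as long as the call sits inside its try block so the failure comes back to the channel as a message.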

(If you want to take this further, the same Loki results make a clean input to an LLM for "summarize what went wrong here" — but that's a different post.)

Before you trust it in production

A few things I'd make sure are handled before leaning on this:

Object storage, not local disk. Point Loki at S3 or equivalent. Local filesystem is for the demo; it doesn't survive the node, and it doesn't scale.
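Retention, set on purpose. Decide how long you keep logs and configure retention_period to match, for compliance reasons, for cost reasons, or both. Retention is enforced by the compactor, so it only takes effect with retention_enabled turned on there. Don't let it default into "forever" by accident.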
Label cardinality is the trap. This is the Loki mistake. Labels are an index; every unique combination of label values is a separate stream. Put something high-cardinality in a label — a request ID, a user ID, a raw timestamp — and you'll blow up the index and bring Loki to its knees. Keep labels low-cardinality (host, env, service, level) and filter on everything else inside the query with |= and | json. When in doubt, it's a filter, not a label.
Alert on patterns. Once error rate is a LogQL metric, wire an alert to it — a Grafana alert rule into a Slack channel, say — so a spike pages you instead of waiting to be noticed. The logs being centralized is step one; the logs telling you something's wrong before a human looks is the actual payoff.
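On the cardinality point, a quick way to spot a label that's getting out of hand is to count distinct values per label across the streams a query returns. A minimal sketch: it takes a list of label objects, the same shape as the stream field in a query_range result. labelCardinality and suspiciousLabels are made-up names, and the default threshold of 100 is an arbitrary number, not a Loki limit.

```javascript
// Count distinct values per label across a set of streams.
// `streams` is a list of label objects, e.g. [{ host: "web-01", env: "prod" }].
function labelCardinality(streams) {
  const seen = new Map();
  for (const labels of streams) {
    for (const [name, value] of Object.entries(labels)) {
      if (!seen.has(name)) seen.set(name, new Set());
      seen.get(name).add(value);
    }
  }
  return Object.fromEntries([...seen].map(([name, values]) => [name, values.size]));
}

// Flag labels whose distinct-value count exceeds an arbitrary threshold.
function suspiciousLabels(streams, threshold = 100) {
  return Object.entries(labelCardinality(streams))
    .filter(([, count]) => count > threshold)
    .map(([name]) => name);
}
```

Feed it the stream objects from a broad query such as {env="prod"} over a short window. A label like host should land at roughly the size of the fleet; anything that grows with traffic belongs in the log line, not in a label.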

Wrapping up

The rollout itself is the easy part: install Alloy, point it at your files, label the streams, ship to Loki. The work that doesn't show up in the quickstart is the fleet being a fleet — the permission that fails silently, the EOL box that needs special handling, the old agent you have to evict, and the multiline regex that's wrong for half your apps and won't admit it until the worst possible moment. None of those are hard once you know to look. They're just the difference between "it worked on my one test box" and "it works across everything I actually run."

What you get at the end is worth the fuss: every line from every server in one place, a query language that turns logs into alertable metrics, and a Slack command that means people check the logs because it's easy instead of avoiding it because it isn't. The day an incident hits and the answer is one /logs away instead of a tour of five machines, the whole thing pays for itself.
