Justyn Larry for Irin Observability

Originally published at irinobservability.com

Adding an LLM Narration Layer to a Self-Hosted Observability Stack

I almost made the classic AI architecture mistake.

I could easily just dump raw Prometheus metrics and Loki logs into an LLM and ask it to summarize anomalies and trends. What could possibly go wrong? The more I thought about it, the more obvious it became that I needed more guardrails and smarter preprocessing, not just more AI.
Right now, it feels like every company is trying to answer the same question:
“How can we add AI to this?”

The more important question is whether AI belongs there at all, and if it does, how to implement it responsibly.

Over the last year, I built a self-hosted observability platform running Prometheus, Grafana, Loki, Alertmanager, and Grafana Alloy on bare metal infrastructure. Clients sign up through a web portal, run a bootstrap script hosted by an internal API, and receive dashboards, alerts, and monthly PDF health reports delivered by email.

The reporting system is where introducing an LLM actually started to make sense.
The reports already contained the raw information:
• CPU, memory, and disk trends
• uptime summaries
• alert history
• cost optimization findings
But raw information is not the same thing as insight.
A client who is looking at Grafana dashboards already has access to the data. What they actually need is context:
• what changed,
• what matters,
• what should concern them,
• and what can probably be ignored.
That sent me down a path I spent the better part of a week wrestling with:

Do I actually need AI in this stack?

What the report system looks like right now

Each client gets a monthly PDF that covers:
• CPU, memory, and disk trends per server
• Alert history and incident counts
• Uptime summary
• A cost optimization section (flagging underutilized servers)

The report is generated by a Python script that queries Prometheus and Loki, builds a structured JSON findings object, pulls panel screenshots from Grafana Image Renderer, and assembles everything into a PDF via ReportLab. It goes out through Resend on a cron schedule.
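Stripped of detail, that generation step has roughly the following shape. This is a minimal sketch rather than the actual script: the function names, the Prometheus address, and the findings layout are all illustrative.

import json
import requests

PROM_URL = "http://prometheus.internal:9090"  # illustrative internal address

def query_prometheus(promql: str) -> list:
    # Instant query against the Prometheus HTTP API
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=30)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

def build_findings(client: str) -> dict:
    # One entry per server: the numbers the PDF template (and later the LLM) consumes
    return {
        "client": client,
        "servers": {
            # "web-01": {"cpu_avg": 68, "cpu_peak": 94, "alerts": 3, ...}
        },
    }

def generate_report(client: str) -> None:
    findings = build_findings(client)
    with open(f"{client}-findings.json", "w") as fh:
        json.dump(findings, fh, indent=2)
    # ...pull panel screenshots from Grafana Image Renderer, assemble the PDF
    # with ReportLab, and hand the file to the mailer (Resend) from here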

Currently, the sections that require judgment are either static templated text or stubbed as null. An LLM could add real value to these sections, specifically the anomaly narrative: the ability to tell a client "here's what happened this month and why it matters," or "server X has averaged 4% CPU for 30 days, you are paying for capacity you are not using." Writing server-specific commentary and cost optimization recommendations by hand is a heavy lift at scale. Maybe I do need AI…

The wrong answer is always the most tempting

My first instinct was to take the raw Prometheus metrics and Loki logs and just feed them straight into an LLM prompt, and ask it to summarize its findings, summarize the trends, and flag any anomalies.

The simplicity of that idea raised a red flag, and the reasons became obvious when I thought through what the model actually receives.

Raw Prometheus output is a time series: thousands of data points, repeated metric names, label sets, timestamps in Unix epoch format. An LLM has no built-in statistical reasoning about time series data; it reads the data as a flat list of numbers, producing summaries that bury the signal in noise and conclusions that sound confident but are mathematically hollow.

The second problem is client data isolation. Handled carelessly, multi-tenant data risks leaking context between tenants inside the prompt. Even with careful prompt engineering, raw metric dumps from multiple clients could bleed into one another and pollute the report data.

Cost and latency at scale posed a problem as well. With five clients, calling a cloud LLM API per client per month is manageable, but at fifty clients, the compute requirements and API costs scale aggressively.

Preprocess first, always

The correct pattern, and the one I settled on, is to preprocess the metrics into structured summaries before the LLM ever sees them. I didn't want the LLM to perform data analysis; I wanted it to narrate.

The approach breaks down into three steps:

Step 1: Query Prometheus and Loki with purpose

Instead of dumping raw time series, compute the statistics that matter:
• Average CPU utilization per server over the reporting period
• Peak CPU, with timestamp, over the same period
• Memory trend (growing, stable, shrinking)
• Disk utilization and projected time to threshold at current growth rate
• Alert counts by severity
• Error log counts and top recurring patterns from Loki
The Python script already does most of this to build the findings.json object. The change here is that instead of rendering that JSON directly into a PDF template, the system also passes a structured summary of it to the LLM.
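As a rough illustration of what "query with purpose" means in practice, the averages and peaks can be computed inside PromQL itself rather than in Python. A sketch, assuming standard node_exporter metric names; the Prometheus address, instance labels, mountpoint, and helper names are illustrative:

import requests

PROM_URL = "http://prometheus.internal:9090"  # illustrative internal address

def instant_value(promql: str) -> float:
    # Run an instant query and return the first sample's value
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=30)
    resp.raise_for_status()
    return float(resp.json()["data"]["result"][0]["value"][1])

def cpu_summary(instance: str, days: int = 30) -> dict:
    # CPU busy percentage, derived from the idle counter
    usage = (
        f'100 - avg(rate(node_cpu_seconds_total{{mode="idle",instance="{instance}"}}[5m])) * 100'
    )
    window = f"[{days}d:5m]"  # PromQL subquery: evaluate `usage` every 5m over the period
    return {
        "cpu_avg": round(instant_value(f"avg_over_time(({usage}){window})"), 1),
        "cpu_peak": round(instant_value(f"max_over_time(({usage}){window})"), 1),
    }

def disk_projection(instance: str, days_ahead: int = 90) -> float:
    # Projected free bytes on / in `days_ahead` days at the current growth rate
    q = (
        f'predict_linear(node_filesystem_avail_bytes{{instance="{instance}",mountpoint="/"}}[30d], '
        f"{days_ahead * 86400})"
    )
    return instant_value(q)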

Step 2: Build a structured prompt, not a data dump

The input to the LLM looks something like this:

Server: web-01
Reporting period: April 2026

CPU: Average 68%, peak 94% on April 14 at 02:17 UTC
Memory: Average 71%, stable trend
Disk: 61% used, growing approximately 2% per month at current rate
Alerts fired: 3 (2 high CPU, 1 disk warning)
Error logs: 847 total, top pattern: "connection timeout to db-01" (312 occurrences)

Task: Write a 2-3 sentence plain-English summary of this server's behavior
during the reporting period. Note anything that warrants client attention.
Do not use technical jargon.

Setting the prompt up this way lets the model do a job it is genuinely good at. The preprocessing pipeline handles the statistical analysis before the LLM ever sees the data. The model's job is reduced to converting structured findings into readable prose, which dramatically lowers the chance of hallucination or incorrect conclusions.
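Concretely, the prompt above is just string formatting over a per-server summary. A minimal sketch, with illustrative field names rather than the real findings schema:

def build_prompt(server: str, period: str, s: dict) -> str:
    # Render one server's preprocessed summary into the narration prompt
    return (
        f"Server: {server}\n"
        f"Reporting period: {period}\n\n"
        f"CPU: Average {s['cpu_avg']}%, peak {s['cpu_peak']}% on {s['cpu_peak_at']}\n"
        f"Memory: Average {s['mem_avg']}%, {s['mem_trend']} trend\n"
        f"Disk: {s['disk_used']}% used, growing approximately {s['disk_growth']}% per month at current rate\n"
        f"Alerts fired: {s['alerts_total']} ({s['alerts_breakdown']})\n"
        f"Error logs: {s['errors_total']} total, top pattern: \"{s['top_error']}\" ({s['top_error_count']} occurrences)\n\n"
        "Task: Write a 2-3 sentence plain-English summary of this server's behavior\n"
        "during the reporting period. Note anything that warrants client attention.\n"
        "Do not use technical jargon.\n"
    )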

Step 3: Isolate per tenant, per server

To eliminate the possibility of tenant data mixing, each LLM call covers one server for one tenant. The prompt contains only the preprocessed summary for that server.
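In code, that isolation falls out naturally when the loop runs per server and each prompt is built from nothing but that server's summary. A sketch against Ollama's HTTP generate endpoint (the local runtime used in the failure-handling flow described below), reusing the build_prompt helper above; the endpoint address and model name are placeholders:

import requests

OLLAMA_URL = "http://ollama.internal:11434"  # assumed LAN / Tailscale address
MODEL = "llama3.1"                           # illustrative local model

def narrate(prompt: str) -> str:
    # Single non-streaming completion from the local Ollama instance
    resp = requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={"model": MODEL, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"].strip()

def narrate_all(findings: dict, period: str) -> dict:
    # One LLM call per server; nothing from any other tenant or server is in scope
    return {
        server: narrate(build_prompt(server, period, summary))
        for server, summary in findings["servers"].items()
    }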

The privacy angle, and why it matters for SMB clients

The LLM runs locally on my LAN, so client telemetry never leaves my infrastructure.
That decision was partly cost-driven, but mostly about data boundaries. Monitoring systems already require a significant amount of operational trust. Sending client metrics and logs to an external AI provider adds an additional layer of exposure that I was uncomfortable with.
Being able to say that the AI analysis of their logs runs on hardware I own and control, never outsourced, is a meaningful trust signal. The data never leaves the monitoring environment.

Error handling

This piece of the architecture took a little thought. Ultimately the LLM is an optional enrichment layer, not a report dependency. If local inference is unavailable for whatever reason, the report still ships.
The flow looks like this:

Flowchart: scheduled report generation preprocesses metrics and writes structured findings to JSON before calling a local Ollama-based LLM over Tailscale. Successful responses are inserted into the final report as narrative summaries. If the LLM is unavailable due to a timeout or connection failure, the system logs an internal alert, skips the narrative section, renders a static PDF, and delivers the report on schedule. A retry process runs the next morning, optionally sending a supplemental narrative-only email if inference later succeeds.

The LLM is an enrichment layer: static reports ship immediately on failure, with AI narratives following as a supplement only if local inference recovers.

This way, the client always gets a report. If the LLM is unavailable, the narrative section is absent. If the LLM is down temporarily, the narrative eventually reaches the client without re-sending the full report, and static report generation is never blocked by or reliant on LLM availability.
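The degradation itself is unremarkable code, which is the point. A sketch of the wrapper, assuming the narrate_all helper above, where any inference failure means "no narrative" rather than "no report":

import logging
import requests

def narratives_or_none(findings: dict, period: str) -> dict | None:
    # Enrichment only: swallow inference failures and fall back to the static report
    try:
        return narrate_all(findings, period)
    except (requests.Timeout, requests.ConnectionError, requests.HTTPError, KeyError) as exc:
        logging.warning("LLM narration unavailable, shipping static report: %s", exc)
        return None

The existing ReportLab and Resend path then runs exactly as before, skipping the narrative section when it gets None back, and the next-morning retry can call the same helper again.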

The LLM is not the analyst

If you are building something like this and starting fresh, the one architectural principle worth internalizing early is this: the LLM is a narrator, not an analyst. Do the analysis yourself in code and hand the result off to the LLM. Give the model clean, structured summaries and a well-defined writing task. The results are dramatically better than dumping raw data into a prompt and hoping for insight.

Secondly, as with everything, design for failure from the beginning. The pipeline should degrade gracefully when the inference endpoint is down, slow, or returning unusable data. Delivering a report without the narrative section is better than no report at all.

So, do I need AI in my monitoring stack?

The honest answer? I’m still not entirely sure.

This experiment has made me think differently about LLM integration. I no longer see the model as the system performing the analysis. The deterministic systems still do the reasoning. Prometheus, Loki, and the preprocessing pipeline establish the facts. The LLM’s job is to translate structured findings into readable context.

That distinction ended up mattering far more than the model itself.

If you are building something similar, my biggest takeaway is this:
Let the LLM be the narrator, not the creator. Keep the reasoning in your deterministic systems, and prompt the model to explain the result, not discover it.
