<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Irin Observability</title>
    <description>The latest articles on DEV Community by Irin Observability (@irinobservability).</description>
    <link>https://dev.to/irinobservability</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F13187%2Feeadef7c-e964-4da3-85a3-a003298e0816.png</url>
      <title>DEV Community: Irin Observability</title>
      <link>https://dev.to/irinobservability</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/irinobservability"/>
    <language>en</language>
    <item>
      <title>Adding an LLM Narration Layer to a Self-Hosted Observability Stack</title>
      <dc:creator>Justyn Larry</dc:creator>
      <pubDate>Tue, 12 May 2026 17:21:59 +0000</pubDate>
      <link>https://dev.to/irinobservability/adding-an-llm-narration-layer-to-a-self-hosted-observability-stack-p35</link>
      <guid>https://dev.to/irinobservability/adding-an-llm-narration-layer-to-a-self-hosted-observability-stack-p35</guid>
      <description>&lt;p&gt;I almost made the classic AI architecture mistake.&lt;/p&gt;

&lt;p&gt;I could easily just dump raw Prometheus metrics and Loki logs into an LLM and ask it to summarize anomalies and trends.  What could possibly go wrong?  The more I thought about it, the more obvious it became that I needed more guardrails and smarter preprocessing, not just more AI.&lt;br&gt;
Right now, it feels like every company is trying to answer the same question:&lt;br&gt;
“How can we add AI to this?”&lt;/p&gt;

&lt;p&gt;The more important question is whether AI belongs there at all, and if it does, how to implement it responsibly.&lt;/p&gt;

&lt;p&gt;Over the last year, I built a self-hosted observability platform running Prometheus, Grafana, Loki, Alertmanager, and Grafana Alloy on bare metal infrastructure. Clients sign up through a web portal, run a bootstrap script hosted by an internal API, and receive dashboards, alerts, and monthly PDF health reports delivered by email.&lt;/p&gt;

&lt;p&gt;The reporting system is where introducing an LLM actually started to make sense.&lt;br&gt;
The reports already contained the raw information:&lt;br&gt;
    • CPU, memory, and disk trends&lt;br&gt;
    • uptime summaries&lt;br&gt;
    • alert history&lt;br&gt;
    • cost optimization findings&lt;br&gt;
But raw information is not the same thing as insight.&lt;br&gt;
If a client is already looking at Grafana dashboards, they already have access to the data. What they actually need is context:&lt;br&gt;
    • what changed,&lt;br&gt;
    • what matters,&lt;br&gt;
    • what should concern them,&lt;br&gt;
    • and what can probably be ignored.&lt;br&gt;
That led to a question I spent the better part of a week wrestling with:&lt;/p&gt;

&lt;p&gt;Do I actually need AI in this stack?&lt;/p&gt;
&lt;h2&gt;
  
  
  What the report system looks like right now
&lt;/h2&gt;

&lt;p&gt;Each client gets a monthly PDF that covers:&lt;br&gt;
    • CPU, memory, and disk trends per server &lt;br&gt;
    • Alert history and incident counts &lt;br&gt;
    • Uptime summary &lt;br&gt;
    • A cost optimization section (flagging underutilized servers) &lt;/p&gt;

&lt;p&gt;The report is generated by a Python script that queries Prometheus and Loki, builds a structured JSON findings object, pulls panel screenshots from Grafana Image Renderer, and assembles everything into a PDF via ReportLab.  It goes out through Resend on a cron schedule.&lt;/p&gt;

&lt;p&gt;Currently, the sections that require judgment are static templated text or stubbed as null.  An LLM could add actual value to these sections, specifically in the anomaly narrative: the ability to tell a client "here’s what happened this month and this is why it matters" or "server X has averaged 4% CPU for 30 days, you are paying for capacity you are not using."  Writing server-specific narratives and cost optimization recommendations by hand is a heavy lift at scale.  Maybe I do need AI…&lt;/p&gt;
&lt;h2&gt;
  
  
  The wrong answer is always the most tempting
&lt;/h2&gt;

&lt;p&gt;My first instinct was to take the raw Prometheus metrics and Loki logs, feed them straight into an LLM prompt, and ask it to summarize the trends and flag any anomalies.&lt;/p&gt;

&lt;p&gt;The simplicity of that idea raised a red flag, and the reasons became obvious when I thought through what the model actually receives.&lt;/p&gt;

&lt;p&gt;Raw Prometheus output is a time series: thousands of data points, repeated metric names, label sets, and timestamps in Unix epoch format.  An LLM has no built-in statistical reasoning about time series data.  It reads the data as a flat list of numbers, producing summaries that bury the signal in noise and arrive at conclusions that sound confident but are mathematically hollow.&lt;/p&gt;

&lt;p&gt;The second problem is client data isolation.  In a multi-tenant system, a careless implementation risks mixing context from different tenants in the same prompt.  Even with careful prompt engineering, raw metric dumps from multiple clients could bleed into one another and pollute the report data.&lt;/p&gt;

&lt;p&gt;Cost and latency at scale posed a problem as well.  With five clients, calling a cloud LLM API per client per month is manageable, but at fifty clients, the compute requirements and API costs scale aggressively.  &lt;/p&gt;
&lt;h2&gt;
  
  
  Preprocess first, always
&lt;/h2&gt;

&lt;p&gt;The correct pattern, and the one I settled on, is to preprocess the metrics into structured summaries before the LLM ever sees them. I didn’t want the LLM to perform data analysis; I wanted it to narrate.&lt;/p&gt;

&lt;p&gt;The approach breaks down into three steps:&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1: Query Prometheus and Loki with purpose
&lt;/h3&gt;

&lt;p&gt;Instead of dumping raw time series, compute the statistics that matter:&lt;br&gt;
    • Average CPU utilization per server over the reporting period &lt;br&gt;
    • Peak CPU, with timestamp, over the same period &lt;br&gt;
    • Memory trend (growing, stable, shrinking) &lt;br&gt;
    • Disk utilization and projected time to threshold at current growth rate &lt;br&gt;
    • Alert counts by severity &lt;br&gt;
    • Error log counts and top recurring patterns from Loki &lt;br&gt;
The Python script already does most of this to build the findings.json object. The change is that instead of only rendering that JSON into a PDF template, the system also passes a structured summary of it to the LLM.&lt;/p&gt;
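&lt;p&gt;As a rough illustration of the preprocessing described above, here is a minimal sketch that reduces a list of (Unix timestamp, percent) samples to an average, a peak with its timestamp, and a trend label, plus a linear projection for disk usage.  The function names, the half-versus-half trend comparison, the one-point tolerance, and the 80% disk threshold are all assumptions made for this example, not the actual report script.&lt;/p&gt;

```python
from datetime import datetime, timezone
from statistics import mean


def summarize_series(samples, tolerance=1.0):
    """Reduce (unix_timestamp, percent) samples to average, peak, and trend."""
    if 2 > len(samples):
        raise ValueError("need at least two samples to estimate a trend")
    values = [value for _, value in samples]
    peak_ts, peak_value = max(samples, key=lambda sample: sample[1])

    # Trend: mean of the second half of the period minus mean of the first half.
    # The tolerance (in percentage points) is a hypothetical cutoff for "stable".
    half = len(values) // 2
    delta = mean(values[half:]) - mean(values[:half])
    if delta > tolerance:
        trend = "growing"
    elif -delta > tolerance:
        trend = "shrinking"
    else:
        trend = "stable"

    peak_time = datetime.fromtimestamp(peak_ts, tz=timezone.utc)
    return {
        "average": round(mean(values), 1),
        "peak": peak_value,
        "peak_at": peak_time.strftime("%B %d at %H:%M UTC"),
        "trend": trend,
    }


def months_until_threshold(used_percent, growth_per_month, threshold=80.0):
    """Linear projection of how many months until disk usage hits the threshold."""
    if growth_per_month > 0 and threshold > used_percent:
        return (threshold - used_percent) / growth_per_month
    return None  # not growing, or already at or past the threshold
```

&lt;p&gt;With a disk at 61% growing 2% per month, months_until_threshold(61.0, 2.0) returns 9.5 under the assumed 80% threshold.  That is the kind of single, checkable number worth handing to the model in place of the raw series.&lt;/p&gt;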
&lt;h3&gt;
  
  
  Step 2: Build a structured prompt, not a data dump
&lt;/h3&gt;

&lt;p&gt;The input to the LLM looks something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Server: web-01
Reporting period: April 2026

CPU: Average 68%, peak 94% on April 14 at 02:17 UTC
Memory: Average 71%, stable trend
Disk: 61% used, growing approximately 2% per month at current rate
Alerts fired: 3 (2 high CPU, 1 disk warning)
Error logs: 847 total, top pattern: "connection timeout to db-01" (312 occurrences)

Task: Write a 2-3 sentence plain-English summary of this server's behavior
during the reporting period. Note anything that warrants client attention.
Do not use technical jargon.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Setting up the prompt this way plays to what an LLM does well.  The preprocessing pipeline handles the statistical analysis before the LLM ever sees the data.  The model’s job is reduced to converting structured findings into readable prose, which dramatically lowers the chance of hallucination or incorrect conclusions.&lt;/p&gt;
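&lt;p&gt;One way to produce that prompt is a plain string template filled from the findings object.  The sketch below assumes a flat dictionary with hypothetical key names (cpu_avg, mem_trend, and so on) and an alerts mapping of alert name to count; the real findings.json layout may look different.&lt;/p&gt;

```python
PROMPT_TEMPLATE = """Server: {server}
Reporting period: {period}

CPU: Average {cpu_avg}%, peak {cpu_peak}% on {cpu_peak_at}
Memory: Average {mem_avg}%, {mem_trend} trend
Disk: {disk_used}% used, growing approximately {disk_growth}% per month at current rate
Alerts fired: {alert_total} ({alert_breakdown})
Error logs: {log_total} total, top pattern: "{log_pattern}" ({log_pattern_count} occurrences)

Task: Write a 2-3 sentence plain-English summary of this server's behavior
during the reporting period. Note anything that warrants client attention.
Do not use technical jargon.
"""


def build_prompt(findings):
    """Render one server's preprocessed findings into the narration prompt."""
    alerts = findings["alerts"]  # hypothetical shape: {"high CPU": 2, "disk warning": 1}
    breakdown = ", ".join(f"{count} {name}" for name, count in alerts.items())
    fields = {key: value for key, value in findings.items() if key != "alerts"}
    return PROMPT_TEMPLATE.format(
        alert_total=sum(alerts.values()),
        alert_breakdown=breakdown,
        **fields,
    )
```

&lt;p&gt;Because the template only accepts named, precomputed fields, a missing statistic raises a KeyError at render time, so a gap in the findings never reaches the model as a silent blank.&lt;/p&gt;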

&lt;h3&gt;
  
  
  Step 3: Isolate per tenant, per server
&lt;/h3&gt;

&lt;p&gt;To eliminate the possibility of tenant data mixing, each LLM call covers one server for one tenant.  The prompt contains only the preprocessed summary for that server.&lt;/p&gt;
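&lt;p&gt;In code, that isolation can be as simple as never building a prompt from more than one server's findings.  The sketch below assumes a nested dictionary of tenant to server to findings, and takes the prompt builder and the LLM call as parameters; all of these names are hypothetical.&lt;/p&gt;

```python
def narration_jobs(findings_by_tenant):
    """Yield one (tenant, server, findings) job per server, never a batch."""
    for tenant, servers in findings_by_tenant.items():
        for server, findings in servers.items():
            # Copy so a job cannot mutate or share state with another job.
            yield tenant, server, dict(findings)


def narrate_all(findings_by_tenant, build_prompt, call_llm):
    """Run one stateless LLM call per server and key the result by tenant and server."""
    narratives = {}
    for tenant, server, findings in narration_jobs(findings_by_tenant):
        prompt = build_prompt(findings)  # fresh prompt, no shared conversation history
        narratives[(tenant, server)] = call_llm(prompt)
    return narratives
```

&lt;p&gt;The trade-off is more calls, one per server in place of one per tenant, in exchange for a prompt that cannot contain another tenant's data by construction.&lt;/p&gt;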

&lt;h2&gt;
  
  
  The privacy angle, and why it matters for SMB clients
&lt;/h2&gt;

&lt;p&gt;The LLM runs locally on my LAN so client telemetry never leaves my infrastructure.&lt;br&gt;
That decision was partly cost-driven, but mostly about data boundaries.  Monitoring systems already require a significant amount of operational trust.  Sending client metrics and logs to an external AI provider adds an additional layer of exposure that I was uncomfortable with.&lt;br&gt;
Being able to say that the AI analysis of their logs runs on hardware I own and control, never outsourced, is a meaningful trust signal.  The data never leaves the monitoring environment.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Error handling
&lt;/h2&gt;

&lt;p&gt;This piece of the architecture took a little thought.  Ultimately the LLM is an optional enrichment layer, not a report dependency. If local inference is unavailable for whatever reason, the report still ships.&lt;br&gt;
The flow looks like this: &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbmy9510fm04rq7fgt5yn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbmy9510fm04rq7fgt5yn.png" alt="Flowchart showing a fault-tolerant reporting pipeline for a self-hosted observability platform. Scheduled report generation preprocesses metrics and writes structured findings to JSON before calling a local Ollama-based LLM over Tailscale. Successful responses are inserted into the final report as narrative summaries. If the LLM is unavailable due to timeout or connection failure, the system logs an internal alert, skips the narrative section, renders a static PDF, and delivers the report on schedule. A retry process runs the next morning, optionally sending a supplemental narrative-only email if inference later succeeds."&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The LLM is an enrichment layer: static reports ship immediately on failure, with AI narratives following as a supplement only if local inference recovers.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This way, the client always gets a report.  If the LLM is unavailable at generation time, the report ships without the narrative section.  If inference recovers on the next morning’s retry, the narrative reaches the client as a supplemental email without re-sending the full report.  Static report generation is never blocked by or reliant on LLM availability.&lt;/p&gt;
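&lt;p&gt;A minimal sketch of that fallback, assuming the local model is served by Ollama as in the flowchart.  The base URL, model name, and timeout are placeholders for illustration, and the retry and supplemental email steps are left out.  The wrapper returns None on any failure so the caller can render the static PDF.&lt;/p&gt;

```python
import json
import logging
import urllib.request

log = logging.getLogger("report.narrative")


def ollama_generate(prompt, base_url="http://localhost:11434", model="llama3", timeout=60):
    """Call Ollama's /api/generate endpoint (non-streaming) and return the text.

    base_url, model, and timeout are placeholder values, not the real deployment.
    """
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["response"]


def narrative_or_none(prompt, generate=ollama_generate):
    """Return narrative text, or None so the caller ships the static report."""
    try:
        text = generate(prompt)
    except (OSError, ValueError, KeyError) as error:
        # Connection errors and timeouts are OSError subclasses,
        # malformed JSON is a ValueError, a missing field is a KeyError.
        log.warning("LLM unavailable, skipping narrative: %s", error)
        return None
    text = (text or "").strip()
    return text or None  # an empty answer counts as unusable
```

&lt;p&gt;The report builder then only has to check for None: render the narrative section when there is text, skip it otherwise, and queue the server for the next morning's retry.&lt;/p&gt;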

&lt;h2&gt;
  
  
  The LLM is not the analyst
&lt;/h2&gt;

&lt;p&gt;If you are building something like this and starting fresh, the one architectural principle worth internalizing early is this: the LLM is a narrator, not an analyst. Do the analysis yourself in code and hand the result off to the LLM. Give the model clean, structured summaries and a well-defined writing task. The results are dramatically better than dumping raw data into a prompt and hoping for insight.&lt;/p&gt;

&lt;p&gt;Secondly, as with everything, design for failure from the beginning.  The pipeline should degrade gracefully when the inference endpoint is down, slow, or returning unusable data. Delivering a report without the narrative section is better than no report at all.&lt;/p&gt;

&lt;p&gt;So, do I &lt;em&gt;need&lt;/em&gt; AI in my monitoring stack?&lt;br&gt;
The honest answer?  I’m still not entirely sure.&lt;/p&gt;

&lt;p&gt;This experiment has made me think differently about LLM integration.  I no longer see the model as the system performing the analysis. The deterministic systems still do the reasoning. Prometheus, Loki, and the preprocessing pipeline establish the facts. The LLM’s job is to translate structured findings into readable context.&lt;/p&gt;

&lt;p&gt;That distinction ended up mattering far more than the model itself.&lt;/p&gt;

&lt;p&gt;If you are building something similar, my biggest takeaway is this:&lt;br&gt;
Let the LLM be the narrator, not the creator.  Keep the reasoning in your deterministic systems, and prompt the model to explain the result, not discover it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>monitoring</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Multi-tenant observability on two servers: architecture tradeoffs and isolation challenges</title>
      <dc:creator>Justyn Larry</dc:creator>
      <pubDate>Wed, 29 Apr 2026 15:00:00 +0000</pubDate>
      <link>https://dev.to/irinobservability/multi-tenant-observability-on-two-servers-architecture-tradeoffs-and-isolation-challenges-ome</link>
      <guid>https://dev.to/irinobservability/multi-tenant-observability-on-two-servers-architecture-tradeoffs-and-isolation-challenges-ome</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnwskh1m7iz67q75gmzxo.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnwskh1m7iz67q75gmzxo.jpg" alt=" " width="712" height="1421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;About six months ago I was managing infrastructure across several environments and ran into a consistent limitation: there wasn’t a clean way to provide per-environment observability with real isolation without duplicating the entire monitoring stack. Dashboard variables solved for presentation, not security, and any admin could still access everything. Spinning up separate Prometheus instances fixed isolation, but at the cost of operational overhead and fragmentation. Neither approach scaled cleanly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The stack&lt;/strong&gt;&lt;br&gt;
The core is standard: Prometheus for metrics, Loki for logs, Grafana for visualization, Alertmanager for routing, Blackbox for website endpoints, and Grafana Alloy as the agent on client hosts.  Everything runs in Docker Compose on two Lenovo ThinkCentre M75s: one primary server and one warm standby.  MinIO provides S3-compatible object storage for Loki chunks, while PostgreSQL backs the portal and streams to the replica.  Nginx and Cloudflare tunnels handle ingress.&lt;br&gt;
Nothing exotic. The interesting decisions are in how the pieces fit together, not which pieces were chosen.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The architecture decision that defined everything&lt;/strong&gt;&lt;br&gt;
Early on I had to choose how to handle high availability at the data layer. The obvious approach is server-side replication: run Prometheus remote_write from the primary to the replica so the replica stays current. I tried it. Then I removed it.&lt;br&gt;
The problem with server-side replication is that it creates a dependency between the two servers. If the primary is the bottleneck, the replica suffers. If the remote_write endpoint is misconfigured, you get silent data loss with no indication anything went wrong. And when you eventually need to promote the replica, you're never quite sure how much data it really has.&lt;br&gt;
The approach I landed on is client-side dual-push.  Each client's Alloy agent pushes metrics and logs to both of our servers simultaneously through two separate Cloudflare tunnels, without creating any substantial overhead for the client’s servers.  The primary and replica servers have no knowledge of each other at the metrics layer.  Each Prometheus instance receives the same data independently.  Each Loki instance receives the same logs independently and stores them in its own MinIO instance.&lt;br&gt;
The practical result is that the warm standby isn't warm, it's live.  If the primary goes down, the replica has current data up to the moment of failure.  Failover is a Cloudflare tunnel redirect and a PostgreSQL promotion.  No data replay, no gap in metrics, no complicated reconciliation.&lt;br&gt;
The tradeoff is double the egress from every client host and double the ingestion load on our internal network.  At current scale that's not meaningful.  At a few hundred tenants it becomes a real consideration.  We’re currently in the process of planning how to manage that future problem.&lt;/p&gt;
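&lt;p&gt;In the real system Alloy handles the dual-push through its own configuration.  The property that matters is that each destination is written independently, which a few lines of Python can illustrate.  The endpoint names and the send callable here are hypothetical stand-ins.&lt;/p&gt;

```python
def dual_push(batch, endpoints, send):
    """Send the same batch to every endpoint; one failure never blocks the others."""
    results = {}
    for name, url in endpoints.items():
        try:
            send(url, batch)
            results[name] = "ok"
        except OSError as error:
            # A real agent would buffer and retry; this sketch only records the failure.
            results[name] = f"failed: {error}"
    return results
```

&lt;p&gt;If the primary's tunnel is down, the replica still receives the batch, which is why the standby holds current data at the moment of failure.&lt;/p&gt;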

&lt;p&gt;&lt;strong&gt;Three-layer tenant isolation&lt;/strong&gt;&lt;br&gt;
The isolation model runs at three independent layers, and the independence is intentional. Any single layer failing shouldn't compromise the others.&lt;br&gt;
The first layer is Prometheus labels.  Every metric series that arrives at the ingestion endpoint carries a tenant label injected by Alloy before the push.  The platform doesn't rely on the client to label correctly: the label is set in the Alloy config file generated server-side at registration time. A client cannot mislabel their own series, even if they try.&lt;br&gt;
The second layer is separate Grafana organizations.  Each tenant gets their own org.  Users in that org can only see dashboards scoped to their org.  The data sources in each org have a preset label filter applied, so even if someone found a way to query directly, they'd only see their own tenant's data.&lt;br&gt;
The third layer is per-tenant Cloudflare Access service tokens.  Each tenant authenticates their Alloy push through a unique token.  Revoke the token and that tenant's agents stop pushing immediately.  There’s no Prometheus config change, no restart, no waiting for a scrape interval.  It's the fastest lever in the decommissioning flow.&lt;br&gt;
A compromised token exposes one tenant's data only, not any other tenant’s.  The next improvement in the roadmap is moving from per-tenant tokens to per-server tokens.  By doing so, a compromised token would then expose one machine rather than one organization. That's a Phase 2 item.&lt;/p&gt;
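&lt;p&gt;The idea behind the preset label filter in the second layer can be sketched as a selector builder that always overwrites the tenant matcher with a server-side value.  This is an illustration only: the label name tenant and the function are assumptions, and the real filtering happens in the Grafana data source configuration, not in application code.&lt;/p&gt;

```python
def scoped_selector(metric, tenant, matchers=None):
    """Build a PromQL selector whose tenant label always comes from the server side."""
    labels = dict(matchers or {})
    labels["tenant"] = tenant  # overwrite anything the caller supplied
    # Note: this sketch does not escape quotes or backslashes in label values.
    inner = ",".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    return f"{metric}{{{inner}}}"
```

&lt;p&gt;Even if a caller passes its own tenant matcher, the returned selector is pinned to the tenant the server resolved for that organization.&lt;/p&gt;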

&lt;p&gt;&lt;strong&gt;Design Evolution&lt;/strong&gt;&lt;br&gt;
The first iteration of this project ran node_exporter and promtail on each server, which worked great on a local network, but as a production model it fell short.  Asking a client to expose multiple ports and poke holes in their firewalls felt like an unnecessary security risk.  One of our core beliefs is that we should require as little as possible from clients and be as unobtrusive as possible in their infrastructure.  Our clients should not have to worry about anything we install on their system, and we should not ask them to change anything about their infrastructure to accommodate us.  Keeping all of this in mind, we rebuilt the entire stack from scratch with Grafana Alloy as the remote agent, connecting to our servers through an encrypted Cloudflare tunnel.&lt;br&gt;
This innocent initial design flaw immediately made me think about the bigger picture in every design decision.  The focus shifted to thinking ahead and making every build decision as production-ready as feasible, without going down the rabbit hole of continuous innovation at the expense of production readiness.  It also crystallized the idea that we should take an in-depth look at all the software options available and ensure that any options we choose best serve the end users.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I got wrong&lt;/strong&gt;&lt;br&gt;
Three things worth being honest about.&lt;br&gt;
The first problem I came across was documentation drift.  I documented a decision to remove client-side dual-push in the architecture log after briefly experimenting with server-side replication.  The dual-push was never actually removed from the client configs.  I discovered this weeks later when reviewing the Alloy config on a client host.  The lesson: verify the running system, not the documentation.&lt;br&gt;
The second was backup coverage.  The entire stack is backed up in triplicate, but when I first set up the PBS backup script, I was capturing compose files, configs, and scripts, but not the actual data volume where Prometheus, Loki, Grafana, and PostgreSQL store their data.  The entire data layer was unprotected.  I found this during a backup verification exercise and fixed it immediately, but it's the kind of gap that only shows up when you look carefully.&lt;br&gt;
The third was an mTLS legacy issue in the Grafana data source configuration.  After a Grafana admin account recovery, the data sources had stale TLS settings from an old PKI that no longer existed.  Grafana reported healthy, but the data sources were silently misconfigured.  The fix was straightforward once found; the problem was that nothing surfaced it automatically.  I now run a data source health check after any Grafana restart.&lt;/p&gt;
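&lt;p&gt;The backup gap in particular is easy to check mechanically.  Below is a minimal sketch that compares a list of paths that must be protected against the paths a backup job actually captures.  The directory layout in the usage note is made up for the example, and reading the path list out of a real backup tool is not shown.&lt;/p&gt;

```python
from pathlib import PurePosixPath


def missing_from_backup(required_paths, backed_up_paths):
    """Return required paths that no backed-up path equals or contains."""
    backed_up = [PurePosixPath(path) for path in backed_up_paths]
    missing = []
    for required in map(PurePosixPath, required_paths):
        covered = any(
            required == root or root in required.parents for root in backed_up
        )
        if not covered:
            missing.append(str(required))
    return missing
```

&lt;p&gt;For a hypothetical layout where only /srv/stack/compose and /srv/stack/configs are captured, a required path of /srv/stack/data/prometheus comes back as missing.  That is the same gap described above, surfaced by a script in place of a manual review.&lt;/p&gt;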

&lt;p&gt;&lt;strong&gt;Where it stands&lt;/strong&gt;&lt;br&gt;
The platform is running, the architecture is validated, and I'm looking for a small number of beta testers willing to run it on real infrastructure and tell me honestly what's missing.  The free tier covers three servers with no credit card required, but for beta-testing I’m flexible.  The bootstrap script installs Alloy, registers the server against the API, and exits.  That means no ongoing shell access, no cron jobs, and no modifications outside the Alloy install path.&lt;br&gt;
If you're running infrastructure without good visibility into it, or if you've looked at pricing from bigger companies and decided it doesn't fit, I'd like to hear about it.  The full script is at &lt;a href="https://monitor.irinobservability.com/bootstrap.sh" rel="noopener noreferrer"&gt;https://monitor.irinobservability.com/bootstrap.sh&lt;/a&gt; if you want to read it before running anything.&lt;br&gt;
&lt;a href="https://irinobservability.com/signup" rel="noopener noreferrer"&gt;https://irinobservability.com/signup&lt;/a&gt;&lt;/p&gt;

</description>
      <category>grafana</category>
      <category>devops</category>
      <category>selfhosted</category>
      <category>prometheus</category>
    </item>
  </channel>
</rss>
