<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Justyn Larry</title>
    <description>The latest articles on DEV Community by Justyn Larry (@justyn_larry_e12a0d9779f4).</description>
    <link>https://dev.to/justyn_larry_e12a0d9779f4</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2596002%2Ffdfef0d8-b625-4804-adf5-f3fab1c34777.jpg</url>
      <title>DEV Community: Justyn Larry</title>
      <link>https://dev.to/justyn_larry_e12a0d9779f4</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/justyn_larry_e12a0d9779f4"/>
    <language>en</language>
    <item>
      <title>Metrics Tell You Something Broke. Tracing Tells You What, Where, and Why.</title>
      <dc:creator>Justyn Larry</dc:creator>
      <pubDate>Thu, 04 Jun 2026 15:00:00 +0000</pubDate>
      <link>https://dev.to/irinobservability/metrics-tell-you-something-broke-tracing-tells-you-what-where-and-why-3j6b</link>
      <guid>https://dev.to/irinobservability/metrics-tell-you-something-broke-tracing-tells-you-what-where-and-why-3j6b</guid>
      <description>&lt;p&gt;Complacency is a killer. The monitoring stack that I built works, and it’s reliable, so leaving it alone seems like the most obvious thing to do. Focusing on marketing, documentation, taking time away from it all seem like good options, but there’s always a better way to do something, to solve a problem you didn’t realize you had.&lt;/p&gt;

&lt;p&gt;In my spare time, I look through Reddit and Dev.to for ideas or inspiration. Systems that others are using that I’m not, or that I’m not aware of. Distributed traces jumped out at me from both forums — I can tie a system event to the metrics, instead of stumbling around logs? This is a monitoring goldmine. How had I missed this?&lt;/p&gt;

&lt;h2&gt;
  
  
  WHAT EXACTLY IS DISTRIBUTED TRACING?
&lt;/h2&gt;

&lt;p&gt;For any multi-step process running on your system, distributed tracing provides a timeline of exactly what happened and how long each step took. It’s like getting a receipt for the work, showing you where time and resources were spent. Each request or job gets a trace ID, and every step records a span: a named block with a start time, end time, and any attributes you want to attach. Those spans assemble into a waterfall, and you can see at a glance where time was spent, what succeeded, and what failed.&lt;/p&gt;

&lt;p&gt;This added visibility can take a technical team from “this seems slow” to a detailed accounting of how long a process took and what the system was actually doing when the process was lagging.&lt;/p&gt;
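
&lt;p&gt;To make the trace and span vocabulary concrete, here is a minimal standard-library sketch of a span recorder. It is a toy model, not the OpenTelemetry API: the names Span, ToyTracer, and span() are hypothetical, and the tenant value is made up. It only shows how a shared trace ID, start and end times, attributes, and nesting assemble into a waterfall.&lt;/p&gt;

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str
    name: str
    parent: Optional[str]
    depth: int
    start: float
    end: float = 0.0
    attributes: dict = field(default_factory=dict)


class ToyTracer:
    """Hypothetical single-threaded recorder; not the OpenTelemetry API."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex  # one ID shared by every span in the job
        self.spans = []
        self._stack = []  # currently open spans, innermost last

    @contextmanager
    def span(self, name, **attributes):
        parent = self._stack[-1].name if self._stack else None
        record = Span(self.trace_id, name, parent, len(self._stack),
                      time.perf_counter(), attributes=attributes)
        self.spans.append(record)
        self._stack.append(record)
        try:
            yield record
        finally:
            # Close the span even if the wrapped step raised.
            record.end = time.perf_counter()
            self._stack.pop()

    def waterfall(self):
        lines = []
        for s in self.spans:
            ms = (s.end - s.start) * 1000
            lines.append(f"{'  ' * s.depth}{s.name}: {ms:.1f}ms {s.attributes}")
        return "\n".join(lines)


tracer = ToyTracer()
with tracer.span("report.generate", tenant="example-tenant"):
    with tracer.span("db.get_contacts"):
        time.sleep(0.01)
    with tracer.span("report.build_pdf"):
        time.sleep(0.02)
print(tracer.waterfall())
```

&lt;p&gt;A real tracing SDK adds what this sketch leaves out: span IDs, propagation across processes, and export to a backend.&lt;/p&gt;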

&lt;h2&gt;
  
  
  THE ORIGINAL CORE STACK
&lt;/h2&gt;

&lt;p&gt;Irin Observability runs on Prometheus, Grafana, Loki, Grafana Alloy, and Alertmanager. I’ve built a robust monitoring stack that tracks metrics for request rates, error rates, LLM call counts, and report generation status. There are also logs flowing from all the services through Loki, so overall, I believed that the stack was well-instrumented and very readable.&lt;/p&gt;

&lt;p&gt;The alert system that I built passes through five internal services in sequence, first annotating each alert and later generating a monthly report:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An alert comes in from a client’s infrastructure&lt;/li&gt;
&lt;li&gt;The alert annotator calls a local LLM to add a plain-English explanation for a panel on one of the dashboards&lt;/li&gt;
&lt;li&gt;The annotated result gets pushed back into Loki&lt;/li&gt;
&lt;li&gt;At the end of the month, the aggregation script gathers all findings for report generation&lt;/li&gt;
&lt;li&gt;The LLM narrative layer writes a summary&lt;/li&gt;
&lt;li&gt;The report generator assembles everything into a PDF and sends it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of those steps runs in a different process. Some run as Docker containers, some as host Python scripts. When I audited the reports and something didn’t look right, I had to check the logs on the Loki Log Exporter Dashboard or grep logs across multiple services, correlate timestamps manually, and piece together what happened. This was both frustrating and time-consuming. The platform should be telling me what the problem is in addition to telling me that something is wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  THE SOLUTION: OPENTELEMETRY
&lt;/h2&gt;

&lt;p&gt;OpenTelemetry (OTel) is an open source standard for collecting telemetry data — traces, metrics, and logs — from applications. It’s vendor-neutral, well-maintained, and has solid Python libraries.&lt;/p&gt;

&lt;p&gt;Grafana Tempo is an open source backend for storing and querying traces. It integrates directly with Grafana, so once it’s running you can navigate from a log line to a trace, or from a trace to the logs that were happening at the same time.&lt;/p&gt;

&lt;p&gt;Getting this running involved three parts. First, I deployed Tempo as a Docker Compose service, with a config file and a Grafana datasource. The second step was to wire up Grafana Alloy as the collector. Since Alloy is the agent already running on my servers to ship metrics and logs, I was able to add an OTLP receiver block to accept traces from internal services and forward them to Tempo — one config change, and the heartbeat API distributed the updated config files to all the monitored servers. The final step was to instrument the Python services. This is where things got a little more difficult, but it also taught me some valuable lessons.&lt;/p&gt;

&lt;h2&gt;
  
  
  THE PYTHON IMPLEMENTATION
&lt;/h2&gt;

&lt;p&gt;The OTel Python SDK has two modes. The first is auto-instrumentation, which handles the common cases automatically. If you’re running a Flask or FastAPI app, importing two libraries and calling .instrument() captures every HTTP request with no further changes. If you’re using psycopg2 for Postgres queries, one more library call and every query becomes a span.&lt;/p&gt;

&lt;p&gt;The second mode, manual spans, is for the logic your code owns: units of work that auto-instrumentation can’t see. I used these to capture the LLM call itself (duration, prompt size, whether the response parsed cleanly), each section of the aggregation script so I can see which Prometheus query is slow, and the overall per-tenant run so every trace carries a tenant name.&lt;/p&gt;

&lt;h2&gt;
  
  
  LESSONS LEARNED
&lt;/h2&gt;

&lt;p&gt;Short-lived scripts need an explicit flush.&lt;/p&gt;

&lt;p&gt;The aggregation script and report generator run once and exit. The default OTel exporter batches spans and sends them on a timer. If the process exits before the batch fires, you lose all your spans. I fixed it by adding two lines: force_flush() and shutdown() in a try/finally block before exit. I lost my first few test traces before I figured this out.&lt;/p&gt;

&lt;p&gt;The psycopg2-binary package breaks auto-instrumentation silently.&lt;/p&gt;

&lt;p&gt;The OTel instrumentation library checks for a package literally named psycopg2. If you installed psycopg2-binary (the same library under a different distribution name), the check fails and you receive no database spans, no error message, nothing reported. The fix is one parameter: Psycopg2Instrumentor().instrument(skip_dep_check=True).&lt;/p&gt;

&lt;p&gt;Background tasks break parent-child trace linkage.&lt;/p&gt;

&lt;p&gt;My alert annotator returns a 200 response immediately and processes the alert in a background thread. The HTTP span closes when the response is sent, before the real work begins, which means each alert generates two separate traces: a brief HTTP span and an orphaned processing span. The behavior was correct, not a bug, but it looked confusing until I understood the threading model. I accepted it and now correlate the two traces by alert fingerprint when necessary.&lt;/p&gt;
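
&lt;p&gt;The orphaned span comes from how Python threads treat context. The sketch below uses only the standard library and assumes the tracing context lives in a contextvars.ContextVar, which is how OpenTelemetry’s Python context works by default. The names current_trace, process_alert, and handle_request are hypothetical stand-ins, not part of the annotator. A worker thread only sees the request’s trace if the caller hands it a copy of the context explicitly.&lt;/p&gt;

```python
import contextvars
import threading

# Hypothetical stand-in for the tracing library's "current trace" slot.
current_trace = contextvars.ContextVar("current_trace", default="no-trace")
seen = {}


def process_alert(label):
    # Record which trace the background work believes it belongs to.
    seen[label] = current_trace.get()


def handle_request():
    current_trace.set("trace-abc123")  # the HTTP layer would set this

    # Plain thread: on standard CPython builds it starts with an empty
    # context, so the worker cannot see the request's trace.
    plain = threading.Thread(target=process_alert, args=("plain",))

    # Explicit hand-off: snapshot the caller's context and run the worker in it.
    ctx = contextvars.copy_context()
    linked = threading.Thread(target=ctx.run, args=(process_alert, "linked"))

    for worker in (plain, linked):
        worker.start()
        worker.join()


handle_request()
print(seen)
```

&lt;p&gt;Whether to propagate context like this or accept two traces is a design choice. Correlating by alert fingerprint, as described above, avoids touching the threading code at all.&lt;/p&gt;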
&lt;h2&gt;
  
  
  THE BIG DIFFERENCE
&lt;/h2&gt;

&lt;p&gt;This is where things get interesting, and how the original monitoring stack differs from its current iteration.&lt;/p&gt;

&lt;p&gt;Prior to integrating distributed tracing, I knew that the report pipeline ran. That’s it — pass/fail, true/false. If something went wrong, where did it happen, and why? What was the system state at the time of the failure? Now I can open a trace in Grafana Tempo and see:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;report.generate: total duration 4m 12s
  db.get_contacts: 41ms
  aggregation.run (per tenant): 2m 18s
    aggregation.stability: 39ms
    aggregation.resources: 1.2s  (slow Prometheus query range)
    aggregation.alerts: 88ms
  llm.narrative_generation: 1m 44s
    llm.build_prompt: 12ms
    llm.call attempt 1: 119s  (timeout)
    llm.call attempt 2: 44s   (success)
    llm.parse: 3ms
  report.build_pdf: 8s
  report.send_email: 2s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That waterfall tells me that the Ollama model timed out on the first attempt and succeeded on the second. I don’t have to go digging through logs in an approximate time frame to figure out what happened. The Prometheus query for resource metrics was the slow step in aggregation. PDF build and email delivery were fast. The problem isn’t solved, but I know exactly what the problem is.&lt;/p&gt;

&lt;p&gt;Through the alert annotator, I can now see every alert as a trace. The system shows me the dedup check against Loki, the LLM call, the result push. I can filter by tenant, by alert name, by whether the LLM call succeeded. A 55-second LLM call that I used to see only as a latency spike in a Prometheus histogram is now a named span with the prompt size, the response size, and whether the JSON parsed cleanly.&lt;/p&gt;

&lt;h2&gt;
  
  
  THE IMPLICATIONS
&lt;/h2&gt;

&lt;p&gt;If you have any experience with monitoring, you have almost certainly hit the “something seems wrong but I can’t tell what” problem. The logs are probably available, you can see the metrics, but you’re stuck sifting through them in sequence trying to reconstruct what happened.&lt;/p&gt;

&lt;p&gt;Distributed tracing changes the diagnostic workflow from “search for clues” to “read the receipt.” The trace tells you what happened, in order, with timing, which removes most of the investigation time and lets you go directly to the problem at hand.&lt;/p&gt;

&lt;p&gt;It also changes how you think about reliability. When I see the LLM call timing out on first attempt consistently, I know to tune the timeout or check model load before it impacts the client. Being proactive in monitoring is a moving target, but it is still the goal.&lt;/p&gt;

&lt;h2&gt;
  
  
  THE TOOLCHAIN
&lt;/h2&gt;

&lt;p&gt;Everything I used is open source and self-hostable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenTelemetry Python SDK (opentelemetry-sdk, exporter packages, auto-instrumentation libraries)&lt;/li&gt;
&lt;li&gt;Grafana Tempo for trace storage and querying&lt;/li&gt;
&lt;li&gt;Grafana Alloy as the collector and forwarder&lt;/li&gt;
&lt;li&gt;Grafana for visualization, with native Tempo datasource support and log/trace correlation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’re already running Prometheus and Grafana for metrics, adding Tempo for traces is a natural extension of the same stack. You can use the same agent, dashboards, and query interface. You’re adding one more signal type, but no new tooling paradigm.&lt;/p&gt;

&lt;p&gt;The monitoring stack I run for Irin clients is the same stack I use to observe both Irin and my private infrastructure. It’s what lets me catch instrumentation gotchas and gives me a reliable view of all of my systems. I built Irin because I believe that monitoring your system shouldn’t be a full-time job. If the monitoring stack does what it’s supposed to, you should be able to check it intermittently through the day. It should tell you at a glance if something’s wrong, and send an alert if the problem merits it. If it’s noisy, crowded, and you don’t know where to begin when there’s a problem, the system doesn’t work — and the real problems get drowned out.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>distributedsystems</category>
      <category>monitoring</category>
      <category>sre</category>
    </item>
    <item>
      <title>When you bring your data home, who is going to keep an eye on it?</title>
      <dc:creator>Justyn Larry</dc:creator>
      <pubDate>Wed, 27 May 2026 16:27:58 +0000</pubDate>
      <link>https://dev.to/irinobservability/when-you-bring-your-data-home-who-is-going-to-keep-an-eye-on-it-gap</link>
      <guid>https://dev.to/irinobservability/when-you-bring-your-data-home-who-is-going-to-keep-an-eye-on-it-gap</guid>
      <description>&lt;p&gt;Cloud providers have always sold convenience.  Compute on demand, storage that scales, and somewhere in the fine print, the implied promise that someone else is watching the infrastructure.  For a lot of teams, that last item was the most valuable thing they were paying for, whether they knew it or not.&lt;br&gt;
That arrangement is starting to come apart.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;p&gt;Cloudian's 2026 research report surveyed 212 senior IT decision makers and found that 75% had moved workloads from the cloud back to on-premises infrastructure in the prior 24 months.  That is not a rounding error or a niche trend.  Three out of four senior IT professionals at organizations large enough to have senior IT professionals made a deliberate choice to bring their data and compute closer to home.&lt;br&gt;
The reasons are not surprising. Security and compliance pressure is one driver, and the growth of AI workloads is another.  Michael Gale, CMO at EDB, put it plainly in a recent IT Brew piece, “If you want to use AI and data, you’ve got to be secure and compliant, they’ve got to be next to each other.”  Sending proprietary data to a third-party cloud provider to feed a general-purpose model is increasingly hard to justify when purpose-built, containerized, on-premises alternatives exist.&lt;br&gt;
Egress fees are the third driver, and arguably the most compelling one. Cloud providers charge you to store data, and then they charge you to process it. And when you eventually decide you want it back, they charge you for that as well. Andy Stone, CTO for the Americas at Everpure, described it clearly: “They’re saying, as long as your data lives here, we’re cool; you want to take your data out, we’re going to charge you on the back end.  In your data center, you don’t have that, you’re not going to pay an egress charge. It’s a benefit you derive, but the move itself takes time, a lot of planning and effort, and it’s certainly not easy in most cases.”  Companies pay for usage and pay again to get their data back, and once the data is home, the onus of monitoring, along with its costs, shifts back to them as well.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Moves With the Data
&lt;/h2&gt;

&lt;p&gt;The part of this conversation that does not get enough attention is what teams lose when they leave the cloud, beyond the convenience of managed services.&lt;br&gt;
AWS CloudWatch, Azure Monitor, and Google Cloud Operations exist because cloud providers understand that customers need to be able to see their infrastructure to troubleshoot it, and customers who cannot troubleshoot it generate support tickets.  Visibility was bundled into the cost of cloud compute because the cloud needed it to function at scale.  Monitoring in a cloud environment came across as an amenity, when in reality it lowers the provider’s support costs.&lt;br&gt;
When a company repatriates its workloads, that visibility disappears.  Now that the servers and the data are in house, so is the burden of monitoring the system.  In the IT Brew piece, Stone notes that repatriation requires a lot of architecting and planning, including managing the applications consuming and producing data.  That’s accurate, and monitoring sits at the center of it. It’s hard to manage what you can’t see, and managing infrastructure on-premises creates a monitoring gap that needs to be filled, either internally or externally.&lt;/p&gt;

&lt;h2&gt;
  
  
  Unforeseen Migration Gaps
&lt;/h2&gt;

&lt;p&gt;The teams making this move are not all large enterprises with dedicated platform engineering staff.  It is reasonable to assume that some portion of that 75% are organizations with lean technical teams making a deliberate architectural choice to prioritize control.  They have the skills to manage their own hardware, they’ve made the cost calculation and decided it made sense. What they frequently do not have is the time or the desire to build and maintain a production-grade observability stack on top of everything else that’s migrating from the cloud.&lt;br&gt;
This is where the repatriation trend creates a genuinely new problem rather than just a different version of an old one.  The cloud abstracted away the operational burden of monitoring. On-premises infrastructure exposes it directly.  Companies need to be made aware that a disk is filling up before it causes an outage, alert routing needs to reach someone when a service goes down in the middle of the night, and log retention should go back far enough to reconstruct the events that occurred during an incident.&lt;br&gt;
Building a monitoring stack is not the hard part; most teams can easily deploy the tooling, and the open source tooling available for collecting telemetry is genuinely excellent.  The real problem created by building an in-house monitoring system is the burden of ongoing operational overhead and figuring out which team members will own the maintenance.  It’s an ongoing process that requires dedicated personnel to configure the tools, tune them, keep them running, and revisit the alert thresholds as the infrastructure changes.  After spending several months planning and executing data repatriation, teams are now faced with creating and maintaining monitoring for their infrastructure, and with allocating resources they may not have to that endeavor.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Path Ahead
&lt;/h2&gt;

&lt;p&gt;The repatriation trend is not likely to lose momentum in any meaningful way.  The AI data sovereignty argument is too strong, the cost of cloud computing is too high, and security is becoming a bigger issue.  If anything, the next wave of AI agent deployments will accelerate it.  Gale's estimate of up to 300 million agents operating in US enterprises is speculative but directionally correct.  Agents need data, that data needs to be governed, and governance is substantially easier when you control the physical location of the data.&lt;br&gt;
As companies continue to pull their data in-house, a large and growing number of technical teams will find themselves responsible for infrastructure that requires monitoring, with limited time and resources to build and maintain it.  Cloud-provided tools and infrastructure demonstrated the need for good visibility, and altering the deployment model should not mean changing how teams monitor their systems.&lt;br&gt;
Companies that navigate this well will be the ones who treat observability as a priority from the start of the repatriation process, not something to revisit once the migration is complete.  &lt;/p&gt;

</description>
      <category>linux</category>
      <category>infrastructure</category>
      <category>cloud</category>
      <category>devops</category>
    </item>
    <item>
      <title>Adding an LLM Narration Layer to a Self-Hosted Observability Stack</title>
      <dc:creator>Justyn Larry</dc:creator>
      <pubDate>Tue, 12 May 2026 17:21:59 +0000</pubDate>
      <link>https://dev.to/irinobservability/adding-an-llm-narration-layer-to-a-self-hosted-observability-stack-p35</link>
      <guid>https://dev.to/irinobservability/adding-an-llm-narration-layer-to-a-self-hosted-observability-stack-p35</guid>
      <description>&lt;p&gt;I almost made the classic AI architecture mistake.&lt;/p&gt;

&lt;p&gt;I could easily just dump raw Prometheus metrics and Loki logs into an LLM and ask it to summarize anomalies and trends.  What could possibly go wrong?  The more I thought about it, the more obvious it became that I needed more guardrails and smarter preprocessing, not just more AI.&lt;br&gt;
Right now, it feels like every company is trying to answer the same question:&lt;br&gt;
“How can we add AI to this?”&lt;/p&gt;

&lt;p&gt;The more important question is whether AI belongs there at all, and if it does, how to implement it responsibly.&lt;/p&gt;

&lt;p&gt;Over the last year, I built a self-hosted observability platform running Prometheus, Grafana, Loki, Alertmanager, and Grafana Alloy on bare metal infrastructure. Clients sign up through a web portal, run a bootstrap script hosted by an internal API, and receive dashboards, alerts, and monthly PDF health reports delivered by email.&lt;/p&gt;

&lt;p&gt;The reporting system is where introducing an LLM actually started to make sense.&lt;br&gt;
The reports already contained the raw information:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU, memory, and disk trends&lt;/li&gt;
&lt;li&gt;uptime summaries&lt;/li&gt;
&lt;li&gt;alert history&lt;/li&gt;
&lt;li&gt;cost optimization findings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But raw information is not the same thing as insight.&lt;br&gt;
If a client is already looking at Grafana dashboards, they already have access to the data. What they actually need is context:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what changed,&lt;/li&gt;
&lt;li&gt;what matters,&lt;/li&gt;
&lt;li&gt;what should concern them,&lt;/li&gt;
&lt;li&gt;and what can probably be ignored.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That sent me down a path I spent the better part of a week wrestling with:&lt;/p&gt;

&lt;p&gt;Do I actually need AI in this stack?&lt;/p&gt;
&lt;h2&gt;
  
  
  What the report system looks like right now
&lt;/h2&gt;

&lt;p&gt;Each client gets a monthly PDF that covers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU, memory, and disk trends per server&lt;/li&gt;
&lt;li&gt;Alert history and incident counts&lt;/li&gt;
&lt;li&gt;Uptime summary&lt;/li&gt;
&lt;li&gt;A cost optimization section (flagging underutilized servers)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The report is generated by a Python script that queries Prometheus and Loki, builds a structured JSON findings object, pulls panel screenshots from Grafana Image Renderer, and assembles everything into a PDF via ReportLab.  It goes out through Resend on a cron schedule.&lt;/p&gt;

&lt;p&gt;Currently, the sections that require judgment are static templated text or stubbed as null.  An LLM could add actual value to these sections, specifically in the anomaly narrative: the ability to tell a client “here’s what happened this month and this is why it matters” or “server X has averaged 4% CPU for 30 days, you are paying for capacity you are not using.”  Providing server-specific information and cost optimization recommendations is a heavy lift at scale.  Maybe I do need AI…&lt;/p&gt;
&lt;h2&gt;
  
  
  The wrong answer is always the most tempting
&lt;/h2&gt;

&lt;p&gt;My first instinct was to take the raw Prometheus metrics and Loki logs and just feed them straight into an LLM prompt, and ask it to summarize its findings, summarize the trends, and flag any anomalies.&lt;/p&gt;

&lt;p&gt;The simplicity of that idea raised a red flag, and the reasons became obvious when I thought through what the model actually receives.&lt;/p&gt;

&lt;p&gt;Raw Prometheus output is a time series.  Thousands of data points, repeated metric names, label sets, timestamps in Unix epoch format.  An LLM does not have built-in statistical reasoning about time series data and reads data as a flat list of numbers, producing summaries that bury the signal in noise and arrive at conclusions that sound confident but are mathematically hollow.&lt;/p&gt;

&lt;p&gt;The second problem is client data isolation.  Improper implementation with multi-tenant data risks leaking context between tenants in the prompt.  Even with careful prompt engineering, raw metric dumps from multiple clients could potentially leak into one another, polluting the report data.&lt;/p&gt;

&lt;p&gt;Cost and latency at scale posed a problem as well.  With five clients, calling a cloud LLM API per client per month is manageable, but at fifty clients, the compute requirements and API costs scale aggressively.  &lt;/p&gt;
&lt;h2&gt;
  
  
  Preprocess first, always
&lt;/h2&gt;

&lt;p&gt;The correct pattern, and the one I settled on, is to preprocess the metrics into structured summaries before the LLM ever sees them. I didn’t want the LLM to perform data analysis, I wanted it to narrate.  &lt;/p&gt;

&lt;p&gt;It breaks down into three steps:&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1: Query Prometheus and Loki with purpose
&lt;/h3&gt;

&lt;p&gt;Instead of dumping raw time series, compute the statistics that matter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Average CPU utilization per server over the reporting period&lt;/li&gt;
&lt;li&gt;Peak CPU, with timestamp, over the same period&lt;/li&gt;
&lt;li&gt;Memory trend (growing, stable, shrinking)&lt;/li&gt;
&lt;li&gt;Disk utilization and projected time to threshold at current growth rate&lt;/li&gt;
&lt;li&gt;Alert counts by severity&lt;/li&gt;
&lt;li&gt;Error log counts and top recurring patterns from Loki&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Python script already does most of this to build the findings.json object. The change here was that, instead of only rendering that JSON into a PDF template, the system also needs to pass a structured summary of it to the LLM.&lt;/p&gt;
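
&lt;p&gt;A minimal sketch of that kind of preprocessing, using only the standard library. The function names summarize_series and months_to_threshold, the 5-point trend tolerance, the 90% disk threshold, and the sample values are all assumptions for illustration; the real script’s queries and thresholds are not shown in this post.&lt;/p&gt;

```python
from datetime import datetime, timezone
from statistics import mean


def summarize_series(samples, tolerance=5.0):
    """Reduce (unix_ts, percent) samples to the few facts a report needs."""
    values = [value for _, value in samples]
    peak_ts, peak = max(samples, key=lambda sample: sample[1])
    half = len(values) // 2
    # Trend: compare the mean of the second half against the first half.
    delta = mean(values[half:]) - mean(values[:half])
    if delta > tolerance:
        trend = "growing"
    elif -delta > tolerance:
        trend = "shrinking"
    else:
        trend = "stable"
    peak_at = datetime.fromtimestamp(peak_ts, tz=timezone.utc)
    return {
        "average": round(mean(values), 1),
        "peak": peak,
        "peak_at": peak_at.strftime("%B %d at %H:%M UTC"),
        "trend": trend,
    }


def months_to_threshold(used_pct, growth_pct_per_month, threshold_pct=90.0):
    """Linear projection of when disk usage crosses the threshold."""
    if not growth_pct_per_month > 0:
        return None  # flat or shrinking: no projected crossing
    return max(0.0, (threshold_pct - used_pct) / growth_pct_per_month)


def ts(day, hour, minute):
    # Helper for building made-up sample timestamps in April 2026.
    return datetime(2026, 4, day, hour, minute, tzinfo=timezone.utc).timestamp()


cpu = [(ts(1, 0, 0), 66.0), (ts(14, 2, 17), 94.0),
       (ts(20, 0, 0), 72.0), (ts(30, 0, 0), 88.0)]
summary = summarize_series(cpu)
print(summary)
print(months_to_threshold(61.0, 2.0))
```

&lt;p&gt;A half-versus-half mean is a crude trend estimate. It is here because it is easy to read, not because it is the right statistic for every metric.&lt;/p&gt;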
&lt;h3&gt;
  
  
  Step 2: Build a structured prompt, not a data dump
&lt;/h3&gt;

&lt;p&gt;The input to the LLM looks something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Server: web-01
Reporting period: April 2026

CPU: Average 68%, peak 94% on April 14 at 02:17 UTC
Memory: Average 71%, stable trend
Disk: 61% used, growing approximately 2% per month at current rate
Alerts fired: 3 (2 high CPU, 1 disk warning)
Error logs: 847 total, top pattern: "connection timeout to db-01" (312 occurrences)

Task: Write a 2-3 sentence plain-English summary of this server's behavior
during the reporting period. Note anything that warrants client attention.
Do not use technical jargon.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By setting up the prompt this way, I could lean into a job an LLM could perform at a high level.  The preprocessing pipeline handles the statistical analysis before the LLM ever sees the data.  The model’s job is reduced to converting structured findings into readable prose, which dramatically lowers the chance of hallucination or incorrect conclusions.&lt;/p&gt;
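
&lt;p&gt;One way to render that prompt from a findings object, sketched with plain string formatting. The dictionary keys and the build_prompt helper are hypothetical, since the post does not show the real findings.json schema. The sample values are the ones from the prompt above.&lt;/p&gt;

```python
PROMPT_TEMPLATE = """Server: {server}
Reporting period: {period}

CPU: Average {cpu_avg}%, peak {cpu_peak}% on {cpu_peak_at}
Memory: Average {mem_avg}%, {mem_trend} trend
Disk: {disk_used}% used, growing approximately {disk_growth}% per month at current rate
Alerts fired: {alert_total} ({alert_breakdown})
Error logs: {error_total} total, top pattern: "{top_pattern}" ({top_count} occurrences)

Task: Write a 2-3 sentence plain-English summary of this server's behavior
during the reporting period. Note anything that warrants client attention.
Do not use technical jargon."""


def build_prompt(findings):
    """Fill the template from one server's preprocessed findings.

    str.format raises KeyError on a missing field, so an incomplete
    summary fails loudly instead of producing a half-empty prompt.
    """
    return PROMPT_TEMPLATE.format(**findings)


findings = {
    "server": "web-01",
    "period": "April 2026",
    "cpu_avg": 68,
    "cpu_peak": 94,
    "cpu_peak_at": "April 14 at 02:17 UTC",
    "mem_avg": 71,
    "mem_trend": "stable",
    "disk_used": 61,
    "disk_growth": 2,
    "alert_total": 3,
    "alert_breakdown": "2 high CPU, 1 disk warning",
    "error_total": 847,
    "top_pattern": "connection timeout to db-01",
    "top_count": 312,
}
prompt = build_prompt(findings)
print(prompt)
```

&lt;p&gt;Keeping the template as a constant means the wording of the task can be reviewed and versioned separately from the code that computes the numbers.&lt;/p&gt;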

&lt;h3&gt;
  
  
  Step 3: Isolate per tenant, per server
&lt;/h3&gt;

&lt;p&gt;To eliminate the possibility of tenant data mixing, each LLM call covers one server for one tenant.  The prompt contains only the preprocessed summary for that server.&lt;/p&gt;
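
&lt;p&gt;The isolation rule can be sketched as a generator that emits exactly one prompt per tenant and server pair. The tenant names, server names, summary strings, and both helper functions are made up for illustration and do not reflect the real data layout.&lt;/p&gt;

```python
def build_prompt(server, summary):
    # Hypothetical renderer; a real one would fill the full template.
    return f"Server: {server}\n{summary}\nTask: Write a 2-3 sentence summary."


def prompts_per_server(findings_by_tenant):
    """Yield (tenant, server, prompt) with exactly one server per prompt."""
    for tenant, servers in findings_by_tenant.items():
        for server, summary in servers.items():
            # The prompt is built from this server's summary only, so no
            # other tenant's data can reach the same LLM call.
            yield tenant, server, build_prompt(server, summary)


findings_by_tenant = {
    "tenant-a": {"web-01": "CPU: Average 68%", "db-01": "CPU: Average 22%"},
    "tenant-b": {"app-01": "CPU: Average 41%"},
}
jobs = list(prompts_per_server(findings_by_tenant))
for tenant, server, prompt in jobs:
    print(tenant, server, len(prompt))
```

&lt;p&gt;The cost is more LLM calls, one per server instead of one per tenant. The benefit is that isolation is enforced by the structure of the loop instead of by prompt wording.&lt;/p&gt;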

&lt;h2&gt;
  
  
  The privacy angle, and why it matters for SMB clients
&lt;/h2&gt;

&lt;p&gt;The LLM runs locally on my LAN so client telemetry never leaves my infrastructure.&lt;br&gt;
That decision was partly cost-driven, but mostly about data boundaries.  Monitoring systems already require a significant amount of operational trust.  Sending client metrics and logs to an external AI provider adds an additional layer of exposure that I was uncomfortable with.&lt;br&gt;
Being able to say that the AI analysis of their logs runs on hardware I own and control, never outsourced, is a meaningful trust signal.  The data never leaves the monitoring environment.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Error handling
&lt;/h2&gt;

&lt;p&gt;This piece of the architecture took a little thought.  Ultimately the LLM is an optional enrichment layer, not a report dependency. If local inference is unavailable for whatever reason, the report still ships.&lt;br&gt;
The flow looks like this: &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbmy9510fm04rq7fgt5yn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbmy9510fm04rq7fgt5yn.png" alt="Flowchart showing a fault-tolerant reporting pipeline for a self-hosted observability platform. Scheduled report generation preprocesses metrics and writes structured findings to JSON before calling a local Ollama-based LLM over Tailscale. Successful responses are inserted into the final report as narrative summaries. If the LLM is unavailable due to timeout or connection failure, the system logs an internal alert, skips the narrative section, renders a static PDF, and delivers the report on schedule. A retry process runs the next morning, optionally sending a supplemental narrative-only email if inference later succeeds."&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The LLM is an enrichment layer: static reports ship immediately on failure, with AI narratives following as a supplement only if local inference recovers.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This way, the client always gets a report.  If the LLM is unavailable, the narrative section is absent.  If the LLM is down temporarily, the narrative eventually reaches the client without re-sending the full report, and static report generation is never blocked by or reliant on LLM availability.&lt;/p&gt;
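
&lt;p&gt;The fallback path can be sketched as a small wrapper. Everything here is hypothetical scaffolding: call_llm stands in for whatever client talks to the local inference endpoint, and the retry queue is a plain list standing in for the next-morning retry job.&lt;/p&gt;

```python
import logging

log = logging.getLogger("report")


def generate_report(findings, call_llm, retry_queue):
    """Build the report; attach a narrative only if inference succeeds."""
    report = {"findings": findings, "narrative": None}
    try:
        report["narrative"] = call_llm(findings)
    except (TimeoutError, ConnectionError) as exc:
        # Inference is optional: log it, queue a retry, ship the static report.
        log.warning("LLM unavailable, shipping static report: %s", exc)
        retry_queue.append(findings["server"])
    return report


def llm_down(findings):
    # Simulates an unreachable or slow inference endpoint.
    raise TimeoutError("inference endpoint timed out")


def llm_up(findings):
    return f"{findings['server']} ran normally this month."


queue = []
static = generate_report({"server": "web-01"}, llm_down, queue)
full = generate_report({"server": "web-01"}, llm_up, queue)
print(static["narrative"], full["narrative"], queue)
```

&lt;p&gt;Catching only timeout and connection errors is deliberate in this sketch. A bug in the report code itself should still fail loudly instead of being mistaken for an inference outage.&lt;/p&gt;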

&lt;h2&gt;
  
  
  The LLM is not the analyst
&lt;/h2&gt;

&lt;p&gt;If you are building something like this and starting fresh, the one architectural principle worth internalizing early is this: the LLM is a narrator, not an analyst. Do the analysis yourself in code and hand the result off to the LLM. Give the model clean, structured summaries and a well-defined writing task. The results are dramatically better than dumping raw data into a prompt and hoping for insight.&lt;/p&gt;

&lt;p&gt;Secondly, as with everything, design for failure from the beginning.  The pipeline should degrade gracefully when the inference endpoint is down, slow, or returning unusable data. Delivering a report without the narrative section is better than no report at all.&lt;/p&gt;

&lt;p&gt;So, do I &lt;em&gt;need&lt;/em&gt; AI in my monitoring stack?&lt;br&gt;
The honest answer?  I’m still not entirely sure.&lt;/p&gt;

&lt;p&gt;This experiment has made me think differently about LLM integration.  I no longer see the model as the system performing the analysis. The deterministic systems still do the reasoning. Prometheus, Loki, and the preprocessing pipeline establish the facts. The LLM’s job is to translate structured findings into readable context.&lt;/p&gt;

&lt;p&gt;That distinction ended up mattering far more than the model itself.&lt;/p&gt;

&lt;p&gt;If you are building something similar, my biggest takeaway is this:&lt;br&gt;
Let the LLM be the narrator, not the creator.  Keep the reasoning in your deterministic systems, and prompt the model to explain the result, not discover it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>monitoring</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Multi-tenant observability on two servers: architecture tradeoffs and isolation challenges</title>
      <dc:creator>Justyn Larry</dc:creator>
      <pubDate>Wed, 29 Apr 2026 15:00:00 +0000</pubDate>
      <link>https://dev.to/irinobservability/multi-tenant-observability-on-two-servers-architecture-tradeoffs-and-isolation-challenges-ome</link>
      <guid>https://dev.to/irinobservability/multi-tenant-observability-on-two-servers-architecture-tradeoffs-and-isolation-challenges-ome</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnwskh1m7iz67q75gmzxo.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnwskh1m7iz67q75gmzxo.jpg" alt=" " width="712" height="1421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;About six months ago I was managing infrastructure across several environments and ran into a consistent limitation: there wasn’t a clean way to provide per-environment observability with real isolation without duplicating the entire monitoring stack. Dashboard variables solved for presentation, not security, and any admin could still access everything. Spinning up separate Prometheus instances fixed isolation, but at the cost of operational overhead and fragmentation. Neither approach scaled cleanly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The stack&lt;/strong&gt;&lt;br&gt;
The core is standard: Prometheus for metrics, Loki for logs, Grafana for visualization, Alertmanager for routing, Blackbox Exporter for probing website endpoints, and Grafana Alloy as the agent on client hosts.  Everything runs in Docker Compose on two Lenovo ThinkCentre M75s: one primary server and one warm standby.  MinIO provides S3-compatible object storage for Loki chunks, while PostgreSQL backs the portal and streams to the replica.  Nginx and Cloudflare tunnels handle ingress.&lt;br&gt;
Nothing exotic. The interesting decisions are in how the pieces fit together, not which pieces were chosen.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The architecture decision that defined everything&lt;/strong&gt;&lt;br&gt;
Early on I had to choose how to handle high availability at the data layer. The obvious approach is server-side replication: run Prometheus remote_write from the primary to the replica so the replica stays current. I tried it. Then I removed it.&lt;br&gt;
The problem with server-side replication is that it creates a dependency between the two servers. If the primary is the bottleneck, the replica suffers. If the remote_write endpoint is misconfigured, you get silent data loss with no indication anything went wrong. And when you eventually need to promote the replica, you're never quite sure how much data it really has.&lt;br&gt;
The approach I landed on is client-side dual-push.  Each client's Alloy agent pushes metrics and logs to both of our servers simultaneously through two separate Cloudflare tunnels, without adding substantial overhead on the client’s servers.  The primary and replica servers have no knowledge of each other at the metrics layer.  Each Prometheus instance receives the same data independently.  Each Loki instance receives the same logs independently and stores them in its own MinIO instance.&lt;br&gt;
The practical result is that the warm standby isn't merely warm; it's live.  If the primary goes down, the replica has current data up to the moment of failure.  Failover is a Cloudflare tunnel redirect and a PostgreSQL promotion.  No data replay, no gap in metrics, no complicated reconciliation.&lt;br&gt;
The tradeoff is double the egress from every client host and double the ingestion load on our internal network.  At current scale that's not meaningful.  At a few hundred tenants it becomes a real consideration, and we’re already planning how to handle it.&lt;/p&gt;
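
&lt;p&gt;To make the independence property and the egress cost concrete, here is a small Python model of dual-push. It is a sketch under stated assumptions, not how Alloy is implemented: &lt;code&gt;dual_push&lt;/code&gt;, &lt;code&gt;egress_bytes&lt;/code&gt;, and the sink callables are hypothetical names, and the sinks are called one after another for simplicity, where a real agent would send concurrently and buffer failed batches for retry.&lt;/p&gt;

```python
from typing import Callable, Dict, List

Batch = List[dict]


def dual_push(batch: Batch, sinks: Dict[str, Callable[[Batch], None]]) -> Dict[str, bool]:
    """Send the same batch to every sink; one sink failing never blocks another."""
    results: Dict[str, bool] = {}
    for name, send in sinks.items():
        try:
            send(batch)
            results[name] = True
        except Exception:
            # Illustrative only: a real agent would queue the batch and retry.
            results[name] = False
    return results


def egress_bytes(batch_bytes: int, sink_count: int) -> int:
    """Client-side egress grows linearly with the number of sinks."""
    return batch_bytes * sink_count
```

&lt;p&gt;With two sinks, &lt;code&gt;egress_bytes&lt;/code&gt; returns twice the batch size, which is the doubled egress described above. If the primary sink raises, the replica still receives the full batch.&lt;/p&gt;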

&lt;p&gt;&lt;strong&gt;Three-layer tenant isolation&lt;/strong&gt;&lt;br&gt;
The isolation model runs at three independent layers, and the independence is intentional. Any single layer failing shouldn't compromise the others.&lt;br&gt;
The first layer is Prometheus labels.  Every metric series that arrives at the ingestion endpoint carries a tenant label injected by Alloy before the push.  The platform doesn't rely on the client to label correctly: Alloy handles it, using a label set in the config file that is generated server-side at registration time. A client cannot mislabel their own series, even if they try.&lt;br&gt;
The second layer is separate Grafana organizations.  Each tenant gets their own org.  Users in that org can only see dashboards scoped to their org.  The data sources in each org have a preset label filter applied, so even if someone found a way to query directly, they'd only see their own tenant's data.&lt;br&gt;
The third layer is per-tenant Cloudflare Access service tokens.  Each tenant authenticates their Alloy push through a unique token.  Revoke the token and that tenant's agents stop pushing immediately.  There’s no Prometheus config change, no restart, no waiting for a scrape interval.  It's the fastest lever in the decommissioning flow.&lt;br&gt;
A compromised token exposes only one tenant's data, never another tenant’s.  The next improvement on the roadmap is moving from per-tenant tokens to per-server tokens, so that a compromised token would expose one machine rather than one organization. That's a Phase 2 item.&lt;/p&gt;
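
&lt;p&gt;The first two layers boil down to two small operations: stamp the tenant label on the way in, and pin it in the selector on the way out. The Python sketch below is illustrative only. &lt;code&gt;stamp_tenant&lt;/code&gt; and &lt;code&gt;scoped_selector&lt;/code&gt; are hypothetical helpers, and the label name &lt;code&gt;tenant&lt;/code&gt; is assumed from the description above; in the real stack Alloy and the Grafana data source settings do this work.&lt;/p&gt;

```python
from typing import Dict


def stamp_tenant(labels: Dict[str, str], tenant: str) -> Dict[str, str]:
    """Return a copy of the labels with the tenant label forced to `tenant`.

    Any tenant value the client supplied is overwritten, never merged.
    """
    stamped = dict(labels)
    stamped["tenant"] = tenant
    return stamped


def scoped_selector(metric: str, tenant: str) -> str:
    """Build a selector that only matches the given tenant's series."""
    # Escape backslashes and double quotes so the value stays inside the string.
    escaped = tenant.replace("\\", "\\\\").replace('"', '\\"')
    return f'{metric}{{tenant="{escaped}"}}'
```

&lt;p&gt;For example, &lt;code&gt;scoped_selector("up", "acme")&lt;/code&gt; produces &lt;code&gt;up{tenant="acme"}&lt;/code&gt;, and a series that arrives claiming to belong to another tenant leaves &lt;code&gt;stamp_tenant&lt;/code&gt; carrying the tenant assigned at registration.&lt;/p&gt;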

&lt;p&gt;&lt;strong&gt;Design Evolution&lt;/strong&gt;&lt;br&gt;
The first iteration of this project ran node_exporter and promtail on each server. That worked well on a local network, but it fell short as a production model.  Asking a client to expose multiple ports and poke holes in their firewalls felt like an unnecessary security risk. One of our core beliefs is that we should require as little as possible from clients and be as unobtrusive as possible in their infrastructure.  Our clients should not have to worry about anything we install on their systems, and we should not ask them to change anything about their infrastructure to accommodate us.  With all of this in mind, we rebuilt the entire stack from scratch with Grafana Alloy as the remote agent, connecting to our servers over an encrypted Cloudflare tunnel.&lt;br&gt;&lt;br&gt;
That early design flaw pushed me to think about the bigger picture in every design decision.  The focus shifted to thinking ahead and making every part of the build as production-ready as feasible, without going down the rabbit hole of continuous innovation at the expense of production readiness.  It also crystallized the idea that we should take an in-depth look at all the software options available and make sure that whatever we choose best serves the end users.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I got wrong&lt;/strong&gt;&lt;br&gt;
Three things worth being honest about.&lt;br&gt;
The first problem I came across was documentation drift.  I documented a decision to remove client-side dual-push in the architecture log after briefly experimenting with server-side replication.  The dual-push was never actually removed from the client configs.  I discovered this weeks later when reviewing the Alloy config on a client host.  The lesson: verify the running system, not the documentation.&lt;br&gt;
The second was the data volume and proper backup protocols.  The entire stack is backed up in triplicate, but when I first set up the PBS backup script, it captured compose files, configs, and scripts and missed the actual data volume where Prometheus, Loki, Grafana, and PostgreSQL store their data.  The entire data layer was unprotected.  I found this during a backup verification exercise and fixed it immediately, but it's the kind of gap that only shows up when you look carefully.&lt;br&gt;
The third was a legacy mTLS issue in the Grafana data source configuration.  After a Grafana admin account recovery, the data sources had stale TLS settings from an old PKI that no longer existed.  Grafana reported healthy, but the data source queries were silently misconfigured.  The fix was straightforward once found; the problem was that nothing surfaced it automatically.  I now run a data source health check after any Grafana restart.&lt;/p&gt;
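
&lt;p&gt;The backup gap is the kind of thing a few lines of code can catch. Here is a minimal Python sketch of a coverage check, assuming you can list the directories your backup job captures and the directories that must be protected. The function name and every path in the test data are hypothetical; this is not the actual PBS script.&lt;/p&gt;

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def uncovered_paths(required: Iterable[str], backed_up: Iterable[str]) -> List[str]:
    """Return the required paths that no backed-up directory contains."""
    roots = [PurePosixPath(p) for p in backed_up]
    missing: List[str] = []
    for path in required:
        target = PurePosixPath(path)
        # Covered if the path is a backup root itself or sits under one.
        covered = any(target == root or root in target.parents for root in roots)
        if not covered:
            missing.append(path)
    return missing
```

&lt;p&gt;An empty result means every required path sits under something the backup captures. Anything else is a list of unprotected directories, which makes it a reasonable check to fail a verification run on.&lt;/p&gt;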

&lt;p&gt;&lt;strong&gt;Where it stands&lt;/strong&gt;&lt;br&gt;
The platform is running, the architecture is validated, and I'm looking for a small number of beta testers willing to run it on real infrastructure and tell me honestly what's missing.  The free tier covers three servers with no credit card required, but for beta-testing I’m flexible.  The bootstrap script installs Alloy, registers the server against the API, and exits.  That means there’s no ongoing shell access, no cron jobs, and no modifications outside the Alloy install path.&lt;br&gt;
If you're running infrastructure without good visibility into it, or if you've looked at pricing from bigger companies and decided it doesn't fit, I'd like to hear about it.  The full script is at &lt;a href="https://monitor.irinobservability.com/bootstrap.sh" rel="noopener noreferrer"&gt;https://monitor.irinobservability.com/bootstrap.sh&lt;/a&gt; if you want to read it before running anything.&lt;br&gt;
&lt;a href="https://irinobservability.com/signup" rel="noopener noreferrer"&gt;https://irinobservability.com/signup&lt;/a&gt;&lt;/p&gt;

</description>
      <category>grafana</category>
      <category>devops</category>
      <category>selfhosted</category>
      <category>prometheus</category>
    </item>
  </channel>
</rss>
