DEV Community

How I Built a Production Observability Stack — And Broke It Twice Before It Worked

I used to dismiss monitoring as something you bolt on after the real engineering is done. Logs were noise. Metrics were "a later problem." Alerts were for teams with dedicated SREs, not a small startup running three service types on Render.

I was wrong. Badly wrong. And it took a self-inflicted incident — where my own monitoring system became the thing that needed monitoring — to understand why observability is engineering, not an afterthought.

This is a detailed account of building our observability stack from scratch: what we built, what broke, why it broke, and what the architecture looks like today.


The starting point

We run three service types on Render:

  • A web service — the main API and frontend server
  • Background workers — async job processors (queuing, retries, scheduled tasks)
  • Key-value stores — Render's managed Redis-compatible service

Before this project, each service logged to Render's default log drain. If something broke in production, you'd open the Render dashboard, pick a service, and scroll. No correlation across services, no metrics, no alerting, no history beyond what Render retained.

The goal was simple to state: centralize everything into a self-hosted Grafana stack, set up alerting, and gain actual visibility into what the system was doing.


The stack

Before getting into what went wrong, here's what we ended up with:

  • Caddy — Reverse proxy and automatic TLS. All external traffic enters here.
  • syslog-proxy — Custom Python container. Enforces token validation on inbound syslog, strips the token, forwards clean RFC 5424 to Alloy.
  • Vector — Disk-backed telemetry buffer between syslog-proxy and Alloy. Prevents message loss during restarts or slowdowns.
  • Alloy — Grafana's telemetry collector. Receives logs, applies filters and transformations, routes to the right backend.
  • Loki — Log storage. Stores log lines compressed and indexed by labels. Queried with LogQL.
  • Mimir — Metrics storage. Long-term Prometheus-compatible time-series backend.
  • Tempo — Trace storage. Stores distributed traces with minimal indexing — cheap to run.
  • Grafana — The UI. Dashboards, alert rules, and unified querying across Loki, Mimir, and Tempo.

The full data flow:

Render services
    → syslog drain (TCP, RFC 5424)
    → syslog-proxy (token validation + strip)
    → Vector (disk-backed buffer)
    → Alloy (filter, transform, route)
    → Loki (logs) / Mimir (metrics) / Tempo (traces)
    → Grafana (dashboards + alerts)
    → Datadog + BetterStack (fan-out for external alerting)

All of this runs on a single 4 GB DigitalOcean droplet, currently at ~30% total memory usage. That number matters — it means the stack is right-sized, not just "scaled until it stopped crashing."
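Once logs land in Loki, dashboards and ad-hoc debugging are LogQL queries against label selectors. Two hedged sketches — the label names (service, level) are assumptions about our label scheme, not Loki defaults:

```logql
# Error lines from the web service that mention timeouts
{service="web", level="error"} |= "timeout"

# Per-service error rate over the last five minutes, for a dashboard panel
sum by (service) (rate({level="error"}[5m]))
```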


Problem 1: No token enforcement on the log stream

Render lets you configure a syslog log drain — a TCP endpoint that receives log lines in RFC 5424 format. You give Render a URL with a token embedded in it, and Render forwards logs there.

The problem: that token is in the URL, but there's no native mechanism to validate it before your collector ingests the message. Render's outbound IP range is also shared across tenants. Any service within that range — from any organisation — can technically send data to your syslog endpoint if they know the address. And since we were pointing directly at Alloy, anything that reached the port got ingested.

This is less of an active security threat and more of a correctness and isolation problem. You want to know that the logs you're querying are your logs.

The fix: we wrote a lightweight Python container — syslog-proxy — that sits between Render and Alloy. Every inbound TCP connection goes through it first. It reads the syslog message, extracts the token from the structured-data field (the [SD-ID key="value"] block that RFC 5424 reserves for exactly this kind of metadata), validates it against a shared secret, strips it from the message, and forwards clean RFC 5424 to Alloy's syslog listener.

If the token is missing or wrong, the connection is dropped. No entry.

The key distinction: the proxy isn't adding authentication to the stream. Render was already sending a token. The proxy is the enforcement layer that was missing — the thing that actually checks it before data moves downstream.


Problem 2: Connection drops under load

After deploying the proxy, we started seeing intermittent message loss. Not every time, not obviously, but present — we'd notice gaps in log sequences that shouldn't have gaps.

The root cause was straightforward: the proxy was synchronous and had no internal buffer. When Alloy was slow to accept a connection (startup, GC pause, momentary backpressure), the proxy would drop the TCP connection rather than queue the message. Lost message, no retry, no error surfaced to the user.

The fix: we inserted Vector between syslog-proxy and Alloy.

Vector is a high-performance, Rust-based telemetry agent built specifically for this kind of pipeline work. The relevant feature here is disk-backed buffering. Messages are written to disk immediately on receipt, then forwarded to the downstream destination. If Alloy is slow, Vector queues. If Alloy restarts entirely, Vector holds the messages and delivers once the connection is re-established.

The pipeline now looks like: syslog-proxy validates and strips → hands off to Vector → Vector buffers to disk → Vector forwards to Alloy.
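A minimal Vector configuration for this hop might look like the fragment below. The component names, addresses, ports, and buffer size are all assumptions for illustration, not our exact config:

```toml
# Source: accept syslog from syslog-proxy over TCP.
[sources.from_proxy]
type = "syslog"
address = "0.0.0.0:5140"
mode = "tcp"

# Sink: forward lines to Alloy's syslog listener.
[sinks.to_alloy]
type = "socket"
inputs = ["from_proxy"]
address = "alloy:1514"
mode = "tcp"
encoding.codec = "text"

# The part that matters: a disk-backed buffer, so a slow or
# restarting Alloy causes queueing instead of loss.
[sinks.to_alloy.buffer]
type = "disk"
max_size = 536870912   # 512 MiB of on-disk buffer
when_full = "block"    # apply backpressure rather than drop
```

The when_full = "block" choice is deliberate: backpressure propagates upstream to the proxy, which is visible and recoverable, whereas dropped messages are neither.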

Connection drops stopped immediately. We have not lost a log message to transport failure since.


The incident

With the pipeline stable, we validated it end-to-end on our highest-volume service — the web API. It processed tens of thousands of log lines without issue. Alloy filtered and routed correctly. Loki ingested cleanly. Grafana showed the data. Everything looked good.

So we did what seemed logical: we added all remaining services to the syslog drain simultaneously.

Within two hours, the system was on fire.

Both Datadog and BetterStack — external services we fan logs to in parallel for alerting and long-term retention — were overwhelmed. The Datadog exporter started returning unexpected EOF from their intake API. This is Datadog actively closing connections, not just timing out. They were rejecting us.

Alert rules fired across the board. The alert channel, which was supposed to tell us when our application was unhealthy, was now full of noise about the monitoring pipeline being unhealthy. The observer had become the thing being observed.

We rolled back all drain configurations to Render's defaults within minutes.

Post-mortem

The root cause was a combination of three things that compounded each other:

1. Unfiltered background worker logs. Background job processors emit a lot of lifecycle noise by default — job enqueued, job started, job completed, job retried, and so on. These events fire on every job. At our job volume, a single background worker service can generate thousands of log lines per minute. None of this is signal for ongoing observability. It's useful when debugging a specific job failure, not for dashboards.

2. Fan-out multiplication. We were routing logs to three destinations simultaneously: Loki, Datadog, and BetterStack. A 3x spike in inbound log volume therefore becomes 9x the baseline single-stream volume on the outbound side, and every destination gets hit with the same burst at the same time.

3. All services at once. Testing on one service — even the highest-volume one — told us the pipeline could handle a single source. It told us nothing about what happens when five sources open simultaneously. The aggregate volume was an order of magnitude higher than anything we'd validated.

What we changed

Incremental rollout. After rolling back, we added services back one at a time with 24 hours of observation between each. If volume, error rates, and downstream health looked stable for 24 hours, we added the next one. The full rollout took five days instead of five minutes.

Log filtering in Alloy. We defined explicit filter rules per service type. Background worker INFO-level lifecycle events (enqueued, started, completed) are now dropped by Alloy before they reach any destination. Only WARNING and above, plus specific job failure patterns, pass through. This cut background worker log volume by roughly 80%.
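In Alloy terms, that filtering is a drop stage in a loki.process component. This is a hedged sketch — the component names, the regex, and the assumption that lifecycle events are matchable from the log line are ours:

```
loki.process "worker_filter" {
  // Drop INFO-level lifecycle noise from background workers
  // before it reaches Loki or the external fan-out.
  stage.drop {
    expression = "level=INFO .*(job enqueued|job started|job completed)"
  }

  forward_to = [loki.write.default.receiver]
}
```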

Volume alerts on the ingestion pipeline. We added alert rules on Alloy's own metrics — ingestion rate, error rate, downstream write failures. If the pipeline itself starts showing stress, we know before it cascades.
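Shaped as a Prometheus-style alert rule, that looks roughly like the following. The metric name here is a placeholder — substitute whatever ingestion counter your collector actually exposes on its /metrics endpoint:

```yaml
groups:
  - name: pipeline-health
    rules:
      - alert: LogIngestStalled
        # pipeline_ingested_lines_total is a placeholder metric name
        expr: rate(pipeline_ingested_lines_total[5m]) == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "No logs ingested for 10 minutes — pipeline may be down"
```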


The detail that almost cost us — the 8 KB syslog limit

After the system was stable, we noticed something odd: certain application errors were present in Render's own log view but missing from Loki. Not all of them. Just the large ones.

The ones that were missing were stack traces. Full Java-style stack traces can easily exceed 100 KB. Most syslog implementations cap message size at around 8 KB by default (RFC 5424 itself leaves the ceiling to the transport, only recommending that receivers accept at least 2 KB). Messages that exceed the limit are silently truncated or dropped, depending on the implementation.

Silent. No error. No warning. No indication in any metric that a message was lost. The data simply didn't arrive.

We increased the max_message_len parameter in the syslog-proxy and Alloy's syslog listener to 256 KB. The missing stack traces appeared immediately.

The lesson here is broader than syslog: check the defaults of every tool in your pipeline. Buffer sizes, message limits, timeout values, retry caps — these are all set to something that was reasonable for a generic use case. They may not be reasonable for yours. And when they're wrong, most tools will not tell you.


Current state

The stack has been running stably for several weeks. Current metrics:

  • Droplet: 4 GB RAM, ~30% utilization under normal load
  • Log ingestion: all three service types, filtered, continuous
  • Storage: local filesystem (S3 planned for long-term retention)
  • External fan-out: Datadog and BetterStack, with volume alerts before we approach their intake limits
  • Alert coverage: application error rates, pipeline health, job failure patterns, infrastructure metrics via Mimir

The Grafana dashboards now show a real-time view of system health across all services. When something breaks, we know about it from an alert before a user reports it. That has happened twice since launch, and both times the alert fired before any user-facing degradation was detectable.


Key takeaways

Not all logs are worth storing. Define what matters per service before you start ingesting. Background worker lifecycle events are not observability — they're debug information that belongs in a trace, not a log aggregator. Decide at collection time, not retention time.

Buffer every I/O boundary. Silent drops are worse than backpressure. Vector saved us from losing data during restarts and slowdowns. Put a buffer anywhere data moves between two systems that can fail independently.

Fan-out multiplies everything. A 3x spike in log volume doesn't hit three destinations at 3x each — it hits three destinations at 3x simultaneously, and each downstream system now has to handle the same burst. Design for the aggregate.

Roll out incrementally. One source, observe for 24 hours, then next. Validating on a single service tells you almost nothing about aggregate behavior. The incident would not have happened if we had added services one at a time.

Check your defaults. The 8 KB syslog limit is a good example of a default that works for most cases and silently breaks for edge cases. Every tool in your pipeline has limits like this. Read the config reference. Set explicit values.

Monitor the monitor. Your observability pipeline is a production system. It needs its own health metrics, its own alerts, and its own runbook. If the pipeline goes down during an incident, you're blind exactly when you need visibility most.


Stack

Grafana · Loki · Mimir · Tempo · Alloy · Vector · Caddy · Render · DigitalOcean · S3 (planned)


If you're building something similar or have questions about any part of the architecture, feel free to reach out.
