Onuoha Chiemerie

Posted on May 20

Production-Grade Observability: Building a Complete LGTM Stack with SLOs, DORA Metrics, and Intelligent Alerting

#architecture #devops #monitoring #sre

Introduction
In modern DevOps, simply knowing whether your application is "up" or "down" isn't enough. Users care about latency, reliability, and the consistency of your service. To meet these expectations at scale, we built a production-grade observability platform using the LGTM stack (Loki, Grafana, Tempo, and Prometheus in the metrics slot that the acronym usually assigns to Mimir), integrated DORA metrics for CI/CD visibility, and implemented SLI/SLO/error budget frameworks to align engineering with business outcomes.
This blog post walks through our complete implementation—from architecture and infrastructure-as-code to burn-rate alerting, incident management, and live chaos testing. We'll show you how to move beyond CPU/RAM monitoring into meaningful reliability engineering.

Why LGTM Over Managed Alternatives?
We evaluated several observability solutions: Datadog, New Relic, Splunk, and managed ELK. Here's why we chose LGTM:

Cost transparency: No surprise bills based on volume spikes. Fixed infrastructure cost with predictable scaling.
Full control: We own the alerting rules, dashboard definitions, and data retention policies—all version-controlled in Git.
Composable: Each tool (Prometheus, Loki, Tempo, Grafana) is best-in-class for its job. You're not locked into one vendor's compromise.
Community-driven: The ecosystem is mature, well-documented, and battle-tested by thousands of companies.
Learning value: Understanding how these tools work teaches deeper systems thinking than a button-click interface ever could.

Architecture: A Complete Data Flow
Our stack comprises nine services orchestrated via Docker Compose.
Data Flow

Metrics: The sample app exports Prometheus metrics (counters, histograms, gauges). Prometheus scrapes the app, node-exporter, blackbox-exporter, and the DORA exporter at configurable intervals (default 15s).
Logs: The app sends structured JSON logs to the OpenTelemetry Collector via OTLP. The collector forwards them to Loki, which indexes by service_name and other labels.
Traces: The app instruments request spans and ships them to the collector, which exports to Tempo for distributed tracing.
Alerts: Prometheus evaluates rules (all version-controlled). Alert rules fire to Alertmanager, which groups, inhibits, and routes to Slack with structured payloads.

Key detail: Loki's datasource is configured with a derived field for trace_id. When you spot an error in logs, one click drills into Tempo to see the full trace—the ultimate observability feedback loop.

The Four Golden Signals as SLIs
The Four Golden Signals are a framework for measuring service health: Latency, Traffic, Errors, and Saturation. We define each as a Service Level Indicator (SLI)—a measurable proxy for what users care about.

Latency: How long does it take to serve a request?
The latency SLI is the 95th percentile of request latency over a 5-minute window, computed in PromQL. We measure successful requests (status < 500) separately from error responses, because a slow error is easier to tolerate than a slow success.
Our SLI target: 95% of successful requests complete in under 500ms.
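To make the percentile concrete, here is a minimal Python sketch of how a quantile can be estimated from cumulative histogram buckets by linear interpolation, in the spirit of PromQL's histogram_quantile. It is a simplification (no +Inf bucket handling), and the bucket bounds and counts are invented for illustration rather than taken from our app.

```python
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative buckets [(upper_bound, cumulative_count), ...].

    Buckets must be sorted by upper bound. The value is interpolated linearly
    inside the first bucket whose cumulative count reaches the target rank.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        raise ValueError("no observations")
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, cumulative in buckets:
        if cumulative >= rank:
            in_bucket = cumulative - lower_count
            if in_bucket == 0:
                return upper_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, cumulative
    return buckets[-1][0]


# Hypothetical cumulative request counts; upper bounds are in seconds.
example_buckets = [(0.1, 600.0), (0.25, 850.0), (0.5, 960.0), (1.0, 1000.0)]
p95 = estimate_quantile(0.95, example_buckets)
meets_target = p95 < 0.5  # compare against the 500ms SLI target
```

Because the estimate is interpolated, its accuracy depends on how closely the bucket boundaries sit around the target; a boundary at exactly 0.5s keeps the 500ms comparison meaningful.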
Traffic: How much demand is the system handling?
The traffic SLI is requests per second. Traffic acts as a leading indicator: if traffic spikes before errors, you can scale proactively. If traffic drops suddenly, that's also a sign of trouble upstream.
Our SLI target: No hard limit; we monitor for anomalies via trends.
Errors: What fraction of requests fail?
The error SLI is the 5-minute error rate, expressed as the ratio of failed requests to all requests. "Errors" include explicit 5xx responses, timeouts, and policy violations (e.g., slow requests that hit a timeout count as errors for SLO purposes).
Our SLI target: Error rate < 1% (equivalently, 99% success rate).
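As a rough illustration of the ratio, the sketch below derives an error rate from two samples of a request counter and an error counter. The function names and sample values are hypothetical, and the reset handling is a simplified version of what a Prometheus rate() does.

```python
from typing import Optional


def counter_increase(first: float, last: float) -> float:
    """Increase between two counter samples; a drop is treated as a restart from zero."""
    return last - first if last >= first else last


def error_ratio(
    errors_first: float,
    errors_last: float,
    total_first: float,
    total_last: float,
) -> Optional[float]:
    """Fraction of requests that failed in the window, or None when there was no traffic."""
    total = counter_increase(total_first, total_last)
    if total == 0:
        return None
    return counter_increase(errors_first, errors_last) / total


# Hypothetical samples taken five minutes apart.
ratio = error_ratio(errors_first=120, errors_last=135, total_first=40_000, total_last=43_000)
within_sli = ratio is not None and ratio < 0.01  # SLI target: error rate < 1%
```

Returning None for an empty window keeps "no traffic" distinct from "no errors", which matters when the result feeds an alert.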
Saturation: How full is the service?
Saturation covers CPU, memory, disk, and connection pools. When saturation approaches 100%, performance degrades sharply. Early warning of saturation lets you scale before users hit the wall.
Our SLI targets:

CPU: warn at 80%, critical at 90%.
Memory: warn at 80%, critical at 90%.
Disk: warn at 75%, critical at 90%.

SLOs, Error Budgets, and the Policy That Binds Them
An SLO is a commitment: "We promise 99% availability over a 30-day window." The allowed shortfall, 1% bad events over 30 days, is your error budget.

Calculating Error Budget Burn
Suppose your error SLO is 99% success over 30 days. The error budget is 1% of all requests in that window; in time terms, that is equivalent to 1% of 720 hours = 7.2 hours of total outage. Burn rate is the observed error rate divided by the budgeted error rate (1% here), so a burn rate of 1x spends the budget in exactly 30 days.
Fast burn rate (14.4x burn): Consuming 2% of the budget in 1 hour. Example: error rate jumps to 14.4% (all budget burned in ~50 hours). Action: Immediate incident response, rollback, or emergency fix.
Slow burn rate (5x burn): Consuming about 4% of the budget in 6 hours. Example: error rate rises to 5% (all budget consumed in ~6 days). Action: Schedule a reliability sprint; don't rush, but prioritize.
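The arithmetic behind burn rates is easy to get wrong, so here is a small sketch of it, assuming the common definition that burn rate equals the observed error rate divided by the error budget (1 minus the SLO). The function names are ours for illustration only.

```python
WINDOW_HOURS = 30 * 24  # 30-day SLO window


def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    budget = 1.0 - slo
    if budget <= 0:
        raise ValueError("SLO must be below 100%")
    return error_rate / budget


def error_rate_threshold(rate: float, slo: float) -> float:
    """Error rate that corresponds to a given burn rate (useful as an alert threshold)."""
    return rate * (1.0 - slo)


def budget_fraction_consumed(rate: float, hours: float) -> float:
    """Share of the whole window's budget spent after `hours` at this burn rate."""
    return rate * hours / WINDOW_HOURS


def hours_to_exhaustion(rate: float) -> float:
    """Time until the full budget is gone if the burn rate stays constant."""
    return WINDOW_HOURS / rate


fast_threshold = error_rate_threshold(14.4, slo=0.99)
fast_spent_in_1h = budget_fraction_consumed(14.4, hours=1)
fast_exhaustion = hours_to_exhaustion(14.4)
```

Keeping these four relationships in one place makes it simple to re-derive thresholds if the SLO or window changes.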
Our Error Budget Policy
At 50% consumed: Service owner reviews recent changes and opens mitigation tasks.
At 75% consumed: Non-critical features require service-owner approval before merging.
At 100% consumed: Feature freeze—focus entirely on reliability until SLO recovers or leadership signs off on the risk.
Review cadence: Every two weeks or after significant incidents. If burn-rate alerts are noisy, we adjust thresholds rather than ignoring them.
Ownership: Service owner owns SLO health and burn rate decisions. Platform engineering owns observability tooling correctness.
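The policy thresholds above can be expressed as a small lookup. This is only a sketch: the action labels and the request counts are hypothetical, and a real check would read the counts from Prometheus rather than take them as arguments.

```python
def budget_consumed(total_requests: int, bad_requests: int, slo: float) -> float:
    """Fraction of the error budget used so far in the SLO window."""
    allowed_bad = total_requests * (1.0 - slo)
    if allowed_bad <= 0:
        raise ValueError("SLO must be below 100% and traffic must be non-zero")
    return bad_requests / allowed_bad


def policy_action(consumed: float) -> str:
    """Map budget consumption to the policy tiers (labels are illustrative)."""
    if consumed >= 1.0:
        return "feature-freeze"
    if consumed >= 0.75:
        return "approval-required"
    if consumed >= 0.5:
        return "owner-review"
    return "normal"


# Hypothetical window: 1,000,000 requests, 6,000 of them bad, 99% SLO.
consumed = budget_consumed(total_requests=1_000_000, bad_requests=6_000, slo=0.99)
action = policy_action(consumed)
```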
Prometheus Alert Rules: From Theory to Action
All alert rules live in version-controlled YAML files under prometheus/alerts/. Here are the key categories:
Infrastructure Alerts

yaml
- alert: HighCPUWarning
  expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
  for: 5m
  annotations:
    summary: "CPU usage is above 80%"
    dashboard_url: "http://localhost:3000/d/node-exporter/node-exporter"
    runbook_url: "runbooks/high-cpu.md"

This fires when CPU exceeds 80% for 5+ consecutive minutes. The for clause prevents alert flapping—transient spikes don't trigger paging.
Key design choice: We alert on warning (80%) before critical (90%), giving operators time to respond gracefully. Both alerts include runbook links, so responders know their first investigation steps.
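To show why the for clause filters out transient spikes, here is a simplified simulation of the inactive, pending, and firing states. It assumes one evaluation per fixed interval and invented CPU samples; it is a sketch of the behaviour, not Prometheus's actual implementation.

```python
from typing import Iterable, List, Optional


def alert_states(
    samples: Iterable[float],
    threshold: float,
    for_seconds: int,
    interval_seconds: int,
) -> List[str]:
    """Return the alert state at each evaluation.

    The alert is 'pending' once the value exceeds the threshold and becomes
    'firing' only after it has stayed above it for `for_seconds`. Any
    evaluation at or below the threshold resets the timer.
    """
    states: List[str] = []
    active_since: Optional[int] = None
    for index, value in enumerate(samples):
        now = index * interval_seconds
        if value > threshold:
            if active_since is None:
                active_since = now
            held = now - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None
            states.append("inactive")
    return states


# Hypothetical CPU percentages, one evaluation per minute.
cpu = [70, 85, 88, 75, 82, 84, 86, 91, 83, 85, 90]
states = alert_states(cpu, threshold=80, for_seconds=300, interval_seconds=60)
```

In this run the short spike at minutes 1 and 2 never gets past pending, while the sustained pressure starting at minute 4 starts firing at minute 9.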

SLO Burn-Rate Alerts

yaml
- alert: FastBurnRate
  expr: |
    (
      sum(rate(http_requests_total{status_code=~"5.."}[5m]))
      /
      clamp_min(sum(rate(http_requests_total[5m])), 1)
    ) > 0.144  # 14.4x burn rate on a 99% SLO = 14.4% error rate
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Fast error-budget burn detected"
    dashboard_url: "http://localhost:3000/d/slo-error-budget/slo-error-budget"
    runbook_url: "runbooks/fast-burn.md"

The threshold 0.144 represents a 14.4x burn rate: the error rate is 14.4 times the 1% that the SLO allows, so the budget is consumed 14.4 times faster than a steady burn that would last the full 30 days. This threshold is calibrated to the SLO and measurement window. The clamp_min guard avoids division by zero, at the cost of understating the ratio when traffic is below 1 request per second.

Why burn-rate alerts work: They're based on how quickly the error budget is being consumed, not on a fixed threshold in isolation. A 2% error rate sustained for an hour uses only a small slice of a 30-day budget and can be handled as a slow burn, while an error rate above 14.4% sustained for five minutes signals a fast burn that needs an immediate response. This reduces alert fatigue and focuses on what users actually experience.

Alertmanager Routing and Slack Templates
Alerts funnel through Alertmanager, which groups, inhibits, and routes them:

yaml
route:
  receiver: "slack"
  group_by: ["service", "severity"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        service: sample-app
      receiver: "slack-app"
      continue: true

inhibit_rules:
  - source_match:
      severity: critical
      service: node
    target_match:
      severity: warning
    equal: ["instance"]

Inhibition: If a host is fully unreachable (critical), suppress CPU/memory/latency noise on the same host. This prevents alert storms during outages.
Grouping: Alerts for the same service/severity are bundled into a single Slack message every 5 minutes. If you have 10 CPU warnings on different hosts, you get one grouped notification, not 10 individual pings.
The Slack template includes:

Alert name and severity
Current metric value (e.g., "CPU 94.2%")
Affected instance/service
Grafana dashboard link for drill-down
Runbook link for action items
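The inhibition and grouping behaviour can be sketched in a few lines of Python. This is a simplified model of what Alertmanager does with exact-match labels, not its real code, and the HostDown alert name and host names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Alert = Dict[str, str]


def _matches(alert: Alert, matcher: Dict[str, str]) -> bool:
    """True when every label in the matcher has the same value on the alert."""
    return all(alert.get(label) == value for label, value in matcher.items())


def inhibit(
    alerts: List[Alert],
    source_match: Dict[str, str],
    target_match: Dict[str, str],
    equal: Sequence[str],
) -> List[Alert]:
    """Drop target alerts when a source alert shares all `equal` label values."""
    sources = [a for a in alerts if _matches(a, source_match)]
    kept = []
    for alert in alerts:
        suppressed = _matches(alert, target_match) and any(
            all(alert.get(label) == source.get(label) for label in equal)
            for source in sources
            if source is not alert
        )
        if not suppressed:
            kept.append(alert)
    return kept


def group(alerts: List[Alert], group_by: Sequence[str]) -> Dict[Tuple[str, ...], List[Alert]]:
    """Bundle alerts that share the same values for the group_by labels."""
    groups: Dict[Tuple[str, ...], List[Alert]] = defaultdict(list)
    for alert in alerts:
        groups[tuple(alert.get(label, "") for label in group_by)].append(alert)
    return dict(groups)


alerts = [
    {"alertname": "HostDown", "service": "node", "severity": "critical", "instance": "host-a"},
    {"alertname": "HighCPUWarning", "service": "node", "severity": "warning", "instance": "host-a"},
    {"alertname": "HighCPUWarning", "service": "node", "severity": "warning", "instance": "host-b"},
    {"alertname": "HighCPUWarning", "service": "node", "severity": "warning", "instance": "host-c"},
]
remaining = inhibit(
    alerts,
    source_match={"severity": "critical", "service": "node"},
    target_match={"severity": "warning"},
    equal=["instance"],
)
notifications = group(remaining, group_by=["service", "severity"])
```

Here the CPU warning on host-a is suppressed by the critical alert on the same instance, and the two remaining warnings end up in one notification group.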

DORA Metrics: Connecting Engineering Velocity to Reliability
DORA (DevOps Research & Assessment) metrics measure engineering team performance and correlate it with system stability:

Deployment Frequency (DF): How often do you ship?

Elite: > 1 per day
High: 1–7 per week
Medium: 1–4 per month
Low: < 1 per month

Lead Time for Changes (LTC): How long from commit to production?

Elite: < 1 hour
High: 1–24 hours
Medium: 1–7 days
Low: > 1 week

Change Failure Rate (CFR): What % of deployments cause incidents?

Elite: 0–15%
High: 15–45%
Medium: 45–60%
Low: > 60%

Mean Time to Restore (MTTR): How long to fix failed deployments?

Elite: < 1 hour
High: 1–24 hours
Medium: 1–7 days
Low: > 1 week

Our Implementation
We built a custom DORA exporter (dora-exporter/app.py) that queries the GitHub Actions API:

Scrapes workflow runs and their success/failure status
Extracts deployment timestamps, commit hashes, and PR merge times
Exposes metrics as Prometheus gauges and counters
Calculates deployment frequency and classifications
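As an illustration of the calculation step, the sketch below derives deployment frequency and change failure rate from a list of simplified run records and classifies the frequency using the tiers listed earlier. It is not the code in dora-exporter/app.py; the record shape (a dict with a conclusion of "success" or "failure") and the sample data are assumptions.

```python
from typing import Dict, List

Run = Dict[str, str]


def deployment_frequency_per_day(runs: List[Run], window_days: int) -> float:
    """Successful deployments per day over the observation window."""
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    successes = sum(1 for run in runs if run["conclusion"] == "success")
    return successes / window_days


def change_failure_rate(runs: List[Run]) -> float:
    """Share of deployment runs that ended in failure."""
    if not runs:
        return 0.0
    failures = sum(1 for run in runs if run["conclusion"] == "failure")
    return failures / len(runs)


def classify_deployment_frequency(per_day: float) -> str:
    """Apply the tiers from this post: >1/day, weekly, monthly, less than monthly."""
    if per_day > 1:
        return "elite"
    if per_day * 7 >= 1:
        return "high"
    if per_day * 30 >= 1:
        return "medium"
    return "low"


# Hypothetical two-week window: 18 successful and 2 failed deployment runs.
runs = [{"conclusion": "success"}] * 18 + [{"conclusion": "failure"}] * 2
frequency = deployment_frequency_per_day(runs, window_days=14)
tier = classify_deployment_frequency(frequency)
cfr = change_failure_rate(runs)
```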

The Grafana dashboard shows:

Deployment frequency with DORA classification banner
Lead time breakdown: commit → pipeline trigger → pipeline complete → deployment confirmed
Change failure rate trended over 7 and 30-day windows with SLO threshold line
MTTR distribution (how long alerts take to resolve)

Why DORA Matters
Engineering metrics are business metrics. High deployment frequency with low CFR and MTTR signals a mature CI/CD process and strong incident response. Teams improving these metrics often see:

Faster incident recovery (MTTR)
Fewer rollbacks and hotfixes (CFR)
Higher customer satisfaction
Less on-call burden

Grafana Dashboards: The View Into Your System
All five dashboards are provisioned as JSON, never hand-configured in the UI:

Unified Observability Dashboard: This is the most important one. A user sees a latency spike. They:

Click the spike in the latency panel
See logs from that time window in Loki
Click a trace_id derived field link
Land in Tempo with the full distributed trace
Identify the slow service and failing endpoint

The underlying magic: Loki's derived fields interpret trace_id in logs and link directly to Tempo.
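A derived field boils down to a regular expression with one capture group applied to each log line. The sketch below shows that extraction step in Python; the pattern and the sample log line are assumptions about a JSON log format, not our actual datasource configuration.

```python
import re
from typing import Optional

# Assumed pattern: a JSON key "trace_id" holding a lowercase hex identifier.
TRACE_ID_PATTERN = re.compile(r'"trace_id"\s*:\s*"([0-9a-f]+)"')


def extract_trace_id(log_line: str) -> Optional[str]:
    """Return the trace ID captured from a structured log line, if present."""
    match = TRACE_ID_PATTERN.search(log_line)
    return match.group(1) if match else None


# Hypothetical structured log line emitted by the app.
line = (
    '{"level":"error","service_name":"sample-app",'
    '"trace_id":"4bf92f3577b34da6a3ce929d0e0e4736","msg":"upstream timeout"}'
)
trace_id = extract_trace_id(line)
```

Whatever the capture group returns is what gets substituted into the link to the tracing backend, so the regex has to match the exact log format the app emits.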

SLO & Error Budget Dashboard shows:

SLI vs. SLO gauges (e.g., "Current error rate: 0.3% vs. SLO 1%")
Error budget remaining in percentage and absolute time
Burn rate time series with fast/slow thresholds as red/yellow bands
7-day and 30-day SLO compliance history

When the burn rate line crosses into the red zone, operators know they're in fast-burn territory and should engage incident response.

DORA Metrics Dashboard shows:

Deployment frequency trend with DORA classification badge
Lead time breakdown by phase
CFR and MTTR distributions
Recent deployments with outcomes

Node Exporter Dashboard: System-level metrics, including CPU (total and per-core), memory (used/cached/available), disk I/O, network I/O, and load averages.
Blackbox Exporter Dashboard: Synthetic monitoring, including an uptime/downtime timeline, HTTP response time percentiles (p50, p90, p99), SSL expiration countdown, and probe success rate.

All dashboards link to each other and to runbooks, creating a cohesive incident response workflow.

Runbooks: From Alert to Resolution
A runbook is a structured troubleshooting guide. For each alert, we provide:

What is this alert? Plain-English definition and SLO/metric context.
Why did it fire? Likely causes (bad deploy, traffic spike, resource leak, etc.).
First 3 investigation steps: Specific queries and checks to narrow the problem.
How do I resolve it? Step-by-step fix procedures.
When do I roll back? Criteria for reverting the last deploy.
When do I escalate? Conditions for involving leadership (e.g., no progress in 30 minutes).

Game Day: Chaos and Failure Scenarios
To validate the entire stack, we ran three production-like failure scenarios:
Scenario 1: Deployment Failure
Trigger: Force a failed GitHub Actions workflow.
Expected: DORA exporter detects failure, CFR alert fires in Slack, dashboard CFR panel updates, team follows runbook to determine if rollback is needed.
Validation: Screenshot showing alert in Slack, CFR spike in Grafana, and timeline of detection-to-notification.
Scenario 2: Latency Injection
Trigger: Call /slow endpoint repeatedly via load testing.
Expected:

Latency histogram in Grafana rises (95th percentile > 500ms)
SLI degrades (error rate approaches threshold)
Burn-rate alert fires (fast or slow)
Logs show slow requests from the app
Click trace_id link → Tempo shows the slow span

Validation: Screenshots of each drill-down step.
Scenario 3: Resource Pressure
Trigger: CPU spike (stress tool) or memory exhaustion.
Expected:

Warning alert (80% CPU) fires first → runbook
Critical alert (90% CPU) fires if pressure continues
As pressure clears, recovery notification sent
Dashboard shows metrics normalizing

Validation: Timeline showing warning → critical → recovery sequence with recovery notification in Slack.
All three scenarios confirm that:

Alerts fire reliably with correct timing
Runbooks are actionable and lead to resolution
Dashboard drill-downs (logs → traces) work end-to-end
Team confidence in incident response increases

Toil Identification and Automation
We identified three sources of toil (repetitive manual work):

Dashboard Screenshot Collection. Before: Manually opening each dashboard in Grafana, taking screenshots, and uploading them to Slack for weekly reports. After: An automated script uses the Grafana API (GET /api/dashboards/db/:slug, export as image) to generate a weekly PDF report. Reduces toil by ~4 hours/week.
Incident Timeline Assembly. Before: Manually sifting through logs, traces, and alert history to construct a timeline during post-incident reviews. After: A Python script queries Loki for logs and Tempo for traces by alert timestamp, auto-generating a structured timeline with severity progression. Reduces PIR prep by ~2 hours/incident.
Runbook Maintenance. Before: Runbooks drift from reality as systems change; alerts point to outdated procedures. After: Runbook links embedded in alerts trigger a CI/CD check: does the linked file exist and contain the required sections? Missing or broken links fail the build. Reduces runbook staleness and improves the on-call experience.
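A runbook check of this kind can be quite small. The following is a sketch under stated assumptions, not our actual CI job: it assumes runbook_url values are quoted relative paths in the rule files, and it uses a few of the runbook section titles from this post as the required sections. The demo function builds a throwaway directory just to exercise the check.

```python
import re
import tempfile
from pathlib import Path
from typing import List, Sequence

RUNBOOK_LINK = re.compile(r'runbook_url:\s*"([^"]+)"')


def find_runbook_links(rule_text: str) -> List[str]:
    """Collect every quoted runbook_url value from alert rule text."""
    return RUNBOOK_LINK.findall(rule_text)


def check_runbooks(rule_text: str, repo_root: Path, required_sections: Sequence[str]) -> List[str]:
    """Return a list of problems; an empty list means the build may pass."""
    problems: List[str] = []
    for link in find_runbook_links(rule_text):
        path = repo_root / link
        if not path.is_file():
            problems.append(f"{link}: file not found")
            continue
        content = path.read_text(encoding="utf-8").lower()
        for section in required_sections:
            if section.lower() not in content:
                problems.append(f"{link}: missing section '{section}'")
    return problems


def demo() -> List[str]:
    """Run the check against a temporary repo with one incomplete runbook."""
    rules = 'runbook_url: "runbooks/high-cpu.md"\nrunbook_url: "runbooks/fast-burn.md"\n'
    sections = ("Why did it fire?", "How do I resolve it?", "When do I escalate?")
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "runbooks").mkdir()
        (root / "runbooks" / "high-cpu.md").write_text(
            "## Why did it fire?\n## How do I resolve it?\n", encoding="utf-8"
        )
        return check_runbooks(rules, root, sections)
```

In CI, a non-empty result from check_runbooks would translate into a non-zero exit code so that the build fails.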

Infrastructure as Code: One-Command Deployment
The entire stack deploys with:
bash
cp .env.example .env
make up
Under the hood:

docker-compose.yml orchestrates 9 services
Prometheus, Alertmanager, Loki, Tempo configs are volume-mounted from version-controlled files
Grafana datasources and dashboards are auto-provisioned via /etc/grafana/provisioning/
Alert rules are read from prometheus/alerts/*.yml

For Terraform users, we provide a wrapper:
bash
cd terraform
terraform init
terraform apply
This abstracts Docker infrastructure (if deploying to AWS, Kubernetes, etc.) while keeping the stack definition identical.
Philosophy: Infrastructure should be repeatable and auditable. A new team member should be able to run make up and have a fully functional observability platform in 3 minutes, no "click here" steps.

Lessons Learned and Future Improvements
What Worked Well

SLO-driven alerting: Burn-rate alerts are far less noisy than absolute thresholds. Team adoption was immediate.
Derived fields in Loki: The one-click logs → traces drill-down is transformative for troubleshooting.
Version-controlled dashboards: Dashboard-as-code eliminates UI-drift and enables code review on observability changes.
Inhibition rules: Suppressing secondary alerts during outages prevents alert storms.

What We'd Improve

Cardinality management: High-cardinality labels (e.g., user IDs, request paths) can balloon Prometheus storage. We'd implement stricter label policies and use recording rules more aggressively.
Temporal data retention: 7 days of logs / 5 days of traces is tight for deep investigations. Moving to S3-backed storage would extend retention without local-disk bloat.
Multi-team onboarding: Today, dashboard creation requires JSON knowledge. A dashboard-builder UI (while still exporting as JSON) would democratize observability.

Conclusion
Building a production-grade observability platform is a journey, not a destination. The LGTM stack gives you:

Metrics (Prometheus): Understand system behavior at scale.
Logs (Loki): Contextualize metrics with event sequences.
Traces (Tempo): See causal chains across services.
Visualization (Grafana): Unify all signals in dashboards.

Layered with SLOs, error budgets, and DORA metrics, you move from "Is it up?" to "Are we meeting user expectations?" That shift is where reliability engineering begins.
The observability platform is code, the alert rules are testable, and the runbooks are executable—just like the services you're monitoring. Treat observability as a first-class infrastructure concern, and your team will spend less time fighting fires and more time building features.
