Building a Practical Observability Toolkit for a Modern Web Service
Observability is not just about logs and dashboards; it’s a disciplined approach to making your system’s behavior visible, debuggable, and maintainable. This tutorial walks you through designing and implementing a practical observability toolkit for a production web service. You’ll get concrete code examples, a minimal yet usable architecture, and step-by-step guidance you can adapt to your stack.
Overview: what you’ll build
- A lightweight telemetry layer that emits structured, low-overhead events from your service.
- A centralized, scalable backend for collection, storage, and querying of traces, metrics, and logs.
- A developer-friendly surface: dashboards, alerting, and on-call runbooks that empower engineers to diagnose issues quickly.
- A reproducible local development setup to simulate faults and verify observability end-to-end.
Design goals
- End-to-end visibility: you can trace requests from entry to exit, gauge latency, and identify bottlenecks.
- Low overhead: minimal impact on response times and resource usage.
- Modularity: swap in different backends (e.g., OpenTelemetry, Jaeger, Prometheus) without rewriting your application code.
- Developer ergonomics: easy instrumentation, sane defaults, and meaningful dashboards.
Chapter 1. Instrumentation strategy
- Core signals: traces (request-level), metrics (system-level), logs (event-level).
- Trace model: each request has a trace_id, span_id, parent_span_id, name, start_time, end_time, attributes.
- Metrics model: counter, gauge, histogram; label dimensions such as service, endpoint, status, region.
- Logs: structured JSON logs with at least timestamp, level, message, trace_id, span_id.
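To make the log model concrete, here is a minimal sketch of a formatter that emits one JSON object per log line with the five fields listed above. It uses only the Python standard library. The names JsonTraceFormatter and build_logger are hypothetical, and the trace_id and span_id values are placeholders passed in by hand; in a real service they would come from the active span.

```python
import json
import logging
from datetime import datetime, timezone


class JsonTraceFormatter(logging.Formatter):
    """Render each record as one JSON object: timestamp, level, message, trace_id, span_id."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            # trace_id / span_id arrive through logging's `extra=` argument and
            # fall back to None when a line is logged outside of any span.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        }
        return json.dumps(payload)


def build_logger(name: str = "my-service") -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonTraceFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger()
# Placeholder IDs for illustration only.
logger.info("payment confirmed", extra={"trace_id": "abc123", "span_id": "xyz789"})
```

Because every line carries trace_id, a log backend can filter on it to show all log lines that belong to one trace.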
Code: a minimal OpenTelemetry setup (Python example)
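- Prereqs: Python 3.11+, plus opentelemetry-api, opentelemetry-sdk, and an OTLP exporter (e.g., opentelemetry-exporter-otlp-proto-http). The example below also needs fastapi, uvicorn, and opentelemetry-instrumentation-fastapi.
pip install fastapi uvicorn opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp-proto-http opentelemetry-instrumentation-fastapi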
import asyncio

from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Register the SDK tracer provider (the configurable one lives in opentelemetry.sdk.trace)
provider = TracerProvider(resource=Resource.create({SERVICE_NAME: "my-service"}))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

# Export traces to an OTLP/HTTP endpoint (e.g., your collector)
otlp_exporter = OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces")
provider.add_span_processor(BatchSpanProcessor(otlp_exporter))

# Optional: log spans to the console during development
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))

# Instrument FastAPI right after creating the app, before it starts serving
app = FastAPI()
FastAPIInstrumentor.instrument_app(app)


@app.get("/items/{item_id}")
async def read_item(item_id: int, q: str | None = None):
    with tracer.start_as_current_span("handler.read_item") as span:
        span.set_attribute("item_id", item_id)
        await asyncio.sleep(0.05)  # simulate work without blocking the event loop
        if q:
            # simulate a branch
            await asyncio.sleep(0.02)
        return {"item_id": item_id, "q": q}
If you’re not using Python, the same concept applies across languages: initialize a tracer provider, add span processors, and instrument frameworks or middleware to create spans for requests. The key is to propagate trace context across service boundaries.
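As a rough illustration of what propagating trace context means on the wire, the sketch below parses and builds the W3C Trace Context traceparent HTTP header (version 00) using only the standard library. The function names and the SpanContext dataclass are hypothetical, and the sketch deliberately ignores the tracestate header and future header versions; in practice your OpenTelemetry SDK’s propagators handle this for you.

```python
import re
import secrets
from dataclasses import dataclass
from typing import Optional

# version "00" - 32 hex trace id - 16 hex parent span id - 2 hex flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


@dataclass(frozen=True)
class SpanContext:
    trace_id: str  # 32 lowercase hex characters
    span_id: str   # 16 lowercase hex characters
    sampled: bool


def parse_traceparent(header: Optional[str]) -> Optional[SpanContext]:
    """Return the caller's context, or None if the header is absent or malformed."""
    match = _TRACEPARENT.match(header.strip()) if header else None
    if not match:
        return None
    trace_id, span_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None  # all-zero IDs are invalid
    return SpanContext(trace_id, span_id, sampled=bool(int(flags, 16) & 0x01))


def child_context(parent: Optional[SpanContext]) -> SpanContext:
    """Continue the caller's trace with a new span ID, or start a new trace."""
    trace_id = parent.trace_id if parent else secrets.token_hex(16)
    sampled = parent.sampled if parent else True
    return SpanContext(trace_id, secrets.token_hex(8), sampled)


def format_traceparent(ctx: SpanContext) -> str:
    """Build the header value to attach to outgoing requests."""
    return f"00-{ctx.trace_id}-{ctx.span_id}-{'01' if ctx.sampled else '00'}"
```

A service would call parse_traceparent on the incoming request, create a child context for its own span, and send format_traceparent(child) on every downstream call so that all spans share one trace_id.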
Chapter 2. Centralized backend: a minimal, scalable stack
Choose a backend that supports traces, metrics, and logs with a unified view. A practical option is to run the OpenTelemetry Collector as a lightweight forwarder in front of Jaeger for traces, Prometheus for metrics, and Loki for logs. You can also use a cloud-hosted observability platform if you prefer.
Recommended stack (low friction):
- Traces: Jaeger (or OpenTelemetry Collector → Jaeger)
- Metrics: Prometheus + Grafana
- Logs: Loki + Grafana
- Optional: OpenTelemetry Collector as a central router
Deployment sketch:
- Instrument your service to export OTLP traces to the OpenTelemetry Collector.
- Collector exports traces to Jaeger, metrics to Prometheus, and logs to Loki (via appropriate exporters).
Minimal Docker Compose (illustrative)
version: "3.9"
services:
  app:
    build: ./app
    ports:
      - "8000:8000"
    environment:
      # Signal-specific variable: takes the full URL, including /v1/traces
      OTEL_EXPORTER_OTLP_TRACES_ENDPOINT: http://collector:4318/v1/traces
  collector:
    image: otel/opentelemetry-collector-contrib:0.72.0
    command: ["--config=/etc/collector-config.yaml"]
    volumes:
      - ./collector-config.yaml:/etc/collector-config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "8888:8888"   # the collector's own metrics
  jaeger:
    image: jaegertracing/all-in-one:1.38
    ports:
      - "16686:16686" # Jaeger UI
      - "14268:14268" # Jaeger collector (Thrift over HTTP)
  grafana:
    image: grafana/grafana:9.0
    ports:
      - "3000:3000"
collector-config.yaml (high level)
receivers:
  otlp:
    protocols:
      http:
      grpc:
exporters:
  # Thrift-over-HTTP exporter, matching Jaeger's 14268 port and /api/traces path
  jaeger_thrift:
    endpoint: "http://jaeger:14268/api/traces"
  logging:
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [jaeger_thrift, logging]
This is a starting point. In production, you’ll fine-tune sampling, batch settings, and security (TLS, auth).
Chapter 3. Instrumentation guidelines and best practices
- Start with critical paths: instrument request handlers, database calls, external API calls.
- Propagate context consistently: pass trace_id and span_id through all downstream calls (for example via the W3C traceparent HTTP header, or library-provided context propagation).
- Keep spans meaningful: name spans by operation (e.g., "db.query.user" rather than "db_call").
- Include useful attributes: endpoint, version, environment, user_id (if safe), error codes, latency.
- Set sensible sampling: sample 1-10% of traces by default in prod; raise the rate temporarily while you investigate an issue.
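The sampling guideline can be sketched as a deterministic, ratio-based decision keyed on the trace ID, so that every service seeing the same trace makes the same keep-or-drop choice. This is a simplified illustration in the spirit of ratio-based samplers, not the exact algorithm of any particular SDK; should_sample is a hypothetical name, and the sketch assumes trace IDs are uniformly random hex strings.

```python
def should_sample(trace_id: str, ratio: float) -> bool:
    """Keep roughly `ratio` of traces, deciding deterministically from the trace ID.

    `trace_id` is a hex string of at least 16 characters.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Treat the low 64 bits of the trace ID as a uniform random number and
    # keep the trace when it falls below ratio * 2**64.
    bound = int(ratio * (1 << 64))
    return int(trace_id[-16:], 16) < bound
```

With ratio=0.05 this keeps about 5% of traces; temporarily raising the ratio for a single service is one way to get more detail while investigating an incident.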
Code: a simple middleware to auto-instrument HTTP requests (Node.js/Express sample)
If you’re using Node.js, you can rely on OpenTelemetry auto-instrumentation and a minimal middleware to enrich logs with trace context.
// Start the SDK before loading express so auto-instrumentation can patch it.
const { context, trace, diag, DiagConsoleLogger, DiagLogLevel } = require('@opentelemetry/api');
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.INFO);

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({ url: 'http://localhost:4318/v1/traces' }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();

const express = require('express');
const app = express();

app.use((req, res, next) => {
  // Trace context is propagated by the instrumentations; read it here for log correlation.
  const start = Date.now();
  const span = trace.getSpan(context.active());
  const traceId = span ? span.spanContext().traceId : 'none';
  res.on('finish', () => {
    // example: log status, latency, and the trace ID
    const t = Date.now() - start;
    console.log(`request ${req.method} ${req.originalUrl} => ${res.statusCode} (${t}ms) trace_id=${traceId}`);
  });
  next();
});

app.get('/health', (req, res) => res.send('ok'));
app.listen(3000, () => console.log('Service listening on 3000'));
Chapter 4. Observability as code: dashboards and alerts
- Dashboards: create a “service health” overview with latency p95, error rate, request rate; a “root cause” dashboard with traces for the latest incidents.
- Alerts: define threshold-based rules with clear runbooks. Examples:
- Latency: p95 > 300ms for 5 minutes → trigger SRE on-call.
- Error rate: 5xx rate > 1% for 10 minutes → trigger alert.
- Saturation: CPU > 85% for 5 minutes → alert on capacity.
- Runbooks: for each alert, include steps to reproduce, suspected causes, data to check (logs, traces, metrics), and remediation actions.
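The “for 5 minutes” part of these rules matters: an alert should fire only when the condition holds continuously, not on a single spike. The sketch below shows that logic in plain Python. ThresholdRule is a hypothetical class for illustration; in the stack described here, Prometheus or Grafana alerting would evaluate this for you.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ThresholdRule:
    """Fire when a value stays above `threshold` for at least `for_seconds`."""

    name: str
    threshold: float
    for_seconds: float
    _breach_started: Optional[float] = field(default=None, repr=False)

    def observe(self, timestamp: float, value: float) -> bool:
        """Feed one evaluation; return True while the alert should be firing."""
        if value <= self.threshold:
            self._breach_started = None  # condition cleared, reset the timer
            return False
        if self._breach_started is None:
            self._breach_started = timestamp  # first evaluation above threshold
        return timestamp - self._breach_started >= self.for_seconds


# Example: p95 latency above 300 ms for 5 minutes, evaluated once per minute.
rule = ThresholdRule(name="latency_p95_high", threshold=300.0, for_seconds=300.0)
states = [rule.observe(t, 350.0) for t in range(0, 361, 60)]
```

In this example the rule starts firing at the evaluation 300 seconds after the first breach; a single reading at or below the threshold resets the timer.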
Example Grafana alerting concepts
- Use a single datasource per data type (Jaeger for traces, Prometheus for metrics, Loki for logs).
- Create a dashboard with panels:
- Latency distribution (P99) by endpoint.
- Error rate by route.
- Throughput by service.
- Alerts:
- Prometheus alert rules: for example, one that fires when up == 0 for a dependent service, and a latency_high rule for the p95 threshold above.
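To make the latency panels less abstract, here is a small sketch that computes a percentile per endpoint from raw latency samples using the nearest-rank method. The function names are hypothetical. Note the assumption: metrics backends such as Prometheus usually estimate quantiles from histogram buckets rather than raw samples, so real dashboard values are approximations of what this computes.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def latency_by_endpoint(samples: Iterable[Tuple[str, float]], pct: float = 99) -> Dict[str, float]:
    """Group (endpoint, latency_ms) samples and compute one percentile per endpoint."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for endpoint, latency_ms in samples:
        grouped[endpoint].append(latency_ms)
    return {endpoint: percentile(vals, pct) for endpoint, vals in grouped.items()}
```

Percentiles are preferred over averages on these panels because a small share of very slow requests can hide behind a healthy-looking mean.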
Chapter 5. Local development and fault injection
- Use a canary-like approach: run a dev instance of the service in which selected requests are routed through a “fault injector” that returns errors or adds latency.
- Add a simple fault injector middleware to your app for testing.
Python example: fault injection middleware
import asyncio

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()


@app.middleware("http")
async def fault_injector(request: Request, call_next):
    fault = request.query_params.get("fault")
    if fault == "latency":
        await asyncio.sleep(0.5)  # 500ms delay
    if fault == "error":
        # Return the response directly: an HTTPException raised inside
        # middleware is not handled by FastAPI's exception handlers.
        return JSONResponse(status_code=500, content={"detail": "Injected fault"})
    return await call_next(request)
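Usage: run curl "http://localhost:8000/items/1?fault=latency" to test a latency spike, or pass ?fault=error to simulate errors. Quote the URL so the shell does not interpret the ? character. Then confirm that the injected delay and the 500 responses show up in your traces, metrics, and logs.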
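If you want background faults in addition to the explicit query parameter, one possible extension is to inject them at a configurable rate. The sketch below is an assumption layered on top of the middleware above, not part of it: choose_fault, error_rate, and latency_rate are hypothetical names, and the random generator is passed in so that runs are reproducible.

```python
import random
from typing import Mapping, Optional


def choose_fault(
    query: Mapping[str, str],
    rng: random.Random,
    error_rate: float = 0.0,
    latency_rate: float = 0.0,
) -> Optional[str]:
    """Return "error", "latency", or None for one request.

    An explicit ?fault=... parameter always wins; otherwise a fault is drawn
    at random with the configured probabilities.
    """
    if error_rate < 0 or latency_rate < 0 or error_rate + latency_rate > 1:
        raise ValueError("rates must be non-negative and sum to at most 1")
    explicit = query.get("fault")
    if explicit in ("latency", "error"):
        return explicit
    roll = rng.random()  # uniform in [0, 1)
    if roll < error_rate:
        return "error"
    if roll < error_rate + latency_rate:
        return "latency"
    return None
```

The middleware could call choose_fault(request.query_params, rng, ...) and branch on the result. Keep the rates at zero outside of local or test environments.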
Chapter 6. Observability for teams: policy and culture
- Ownership: assign teams to own specific services’ observability dashboards.
- Documentation: maintain a runbook for each major alert with the steps to diagnose.
- Training: run periodic “observability drills” to rehearse incident detection and response.
- Privacy and compliance: avoid collecting sensitive user data in traces; sanitize attributes where necessary.
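As one way to apply the privacy guideline, the sketch below scrubs span attributes before they are recorded. The key lists and the sanitize_attributes name are hypothetical and should be adapted to your own data. One caveat is stated as an assumption in the code: an unsalted hash of a low-entropy identifier can be reversed by enumeration, so treat hashing as pseudonymization, not anonymization.

```python
import hashlib
from typing import Any, Dict, Mapping

# Hypothetical key lists; adjust to the attributes your service actually emits.
REDACTED_KEYS = frozenset({"password", "authorization", "email", "credit_card"})
HASHED_KEYS = frozenset({"user_id"})


def sanitize_attributes(attrs: Mapping[str, Any]) -> Dict[str, Any]:
    """Return a copy of `attrs` that is safer to attach to a span."""
    clean: Dict[str, Any] = {}
    for key, value in attrs.items():
        lowered = key.lower()
        if lowered in REDACTED_KEYS:
            clean[key] = "[REDACTED]"
        elif lowered in HASHED_KEYS:
            # Stable pseudonym: lets you group by user without storing the raw ID.
            # Assumption: unsalted SHA-256 is acceptable for your threat model.
            digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
            clean[key] = digest[:16]
        else:
            clean[key] = value
    return clean
```

Calling this in one place, for example in a small helper that wraps span.set_attribute, keeps the policy consistent across handlers.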
Chapter 7. A compact starter blueprint you can adapt
1) Instrumentation
- Add a tracer provider and a span for each request in your web framework.
- Add attributes for endpoint, environment, version, and user_id if safe.
2) Backend
- Deploy OpenTelemetry Collector with exporters to Jaeger (traces), Loki (logs), and Prometheus (metrics).
- Configure Prometheus to scrape the metrics endpoints that your service and the Collector expose, and ship logs to Loki.
3) Dashboards
- Create a Grafana workspace with:
- A “Service Health” dashboard (p95 latency, error rate, throughput by endpoint).
- A “Trace Explorer” view for the top slow endpoints.
- A “Logs by Trace” view to correlate a trace with related logs.
4) Alerts
- Latency: p95 > 300ms for 5 minutes.
- Error rate: HTTP 5xx rate > 1% for 10 minutes.
- Saturation: CPU > 85% for 5 minutes.
5) Local validation
- Run the service locally with OTLP endpoint pointing to a local collector.
- Use curl with various parameters to exercise healthy paths and fault injections.
- Confirm traces appear in Jaeger, metrics in Prometheus, and logs in Loki.
Illustrative example: tracing a sample request flow
- A user requests GET /payments/confirm.
- The service starts a trace: trace_id=abc123, span_id=xyz789.
- It calls a database: span db_query with duration 40ms.
- It makes an external API call: span external_api with duration 120ms.
- The handler finishes with 200 OK; trace shows overall latency ~200ms.
- Jaeger shows the trace path; Grafana dashboards aggregate latency and error metrics.
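The flow above can be modeled with the trace fields from Chapter 1 to show where the ~200ms goes. In this sketch the child span IDs and the start offsets are made-up values for illustration; only the durations (40ms, 120ms, ~200ms) come from the example. The self_times helper is hypothetical and assumes child spans do not overlap in time.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_span_id: Optional[str]
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def self_times(spans: List[Span]) -> Dict[str, float]:
    """Time spent in each span excluding its direct children (non-overlapping children assumed)."""
    child_total: Dict[str, float] = {}
    for span in spans:
        if span.parent_span_id is not None:
            child_total[span.parent_span_id] = (
                child_total.get(span.parent_span_id, 0.0) + span.duration_ms
            )
    return {s.name: s.duration_ms - child_total.get(s.span_id, 0.0) for s in spans}


# Trace abc123: one root span and two children (offsets are illustrative).
spans = [
    Span("xyz789", None, "GET /payments/confirm", 0, 200),
    Span("db0001", "xyz789", "db_query", 10, 50),
    Span("api001", "xyz789", "external_api", 60, 180),
]
breakdown = self_times(spans)
slowest = max(breakdown, key=breakdown.get)
```

Here the external API call accounts for 120ms of the 200ms total, the database for 40ms, and the handler’s own work for the remaining 40ms, which is the kind of breakdown a trace view gives you at a glance.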
What to watch for during adoption
- Over-instrumentation slows down the request path. Keep sampling reasonable and avoid heavy synchronous work in critical paths.
- Ensure trace context propagation across services; missed propagation leads to broken traces.
- Plan for data retention: traces can be large; set retention policies that balance debugging value with storage costs.
Conclusion
A practical observability toolkit combines lightweight instrumentation, a sane backend stack, developer-friendly dashboards, and disciplined incident response. This approach gives your teams the visibility needed to diagnose issues quickly, improve reliability, and iterate confidently in production.
Rizwan Saleem | https://rizwansaleem.co