DEV Community

Rizwan Saleem

How to set up monitoring and observability that actually helps you sleep at night

Monitoring and Observability for Engineers: A Practical Setup Guide

The Three Pillars: Logs, Metrics, Traces

Observability gives you end-to-end visibility into complex systems, helping you troubleshoot faster and improve user experiences. The three fundamental pillars are:

| Pillar | What it is | What it answers | Best used for |
| --- | --- | --- | --- |
| Logs | Discrete events with detailed context at a specific moment | What happened? | Debugging errors, auditing, detailed context |
| Metrics | Numerical measurements aggregated over time | How much/many? | Alerting, trend analysis, performance monitoring |
| Traces | Individual requests flowing through distributed systems | Where did it happen? | Finding bottlenecks, dependencies, root causes |

Metrics tell you when problems occur, traces show you where problems live, and logs explain why problems happened. Combining all three enables a holistic view of system behavior.

Structured Logging: Best Practices

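Structured logging is foundational to effective observability: machine-parseable fields let you filter, aggregate, and correlate log entries instead of searching free text.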

Core Principles

  1. Use JSON format - Avoid plain-text logs for better parsing
  2. Design a schema first - Agree on field names and types across your organization
  3. Use severity levels consistently: debug, info, warn, error, fatal
  4. Include trace_id in every log - This correlates logs with traces
  5. Log at boundaries - HTTP requests, database calls, external service interactions
  6. Avoid sensitive data - Never log passwords, tokens, or PII
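  7. Watch cardinality - High-cardinality values (user IDs, request IDs) hurt query performance and cost when used as indexed labels or metric labels; keep them as plain log fields instead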

Example (JSON structured log)

{
  "timestamp": "2026-05-29T22:15:30Z",
  "level": "error",
  "message": "Database query failed",
  "trace_id": "abc123def456",
  "span_id": "span789",
  "service": "user-api",
  "user_id": "user_12345",
  "duration_ms": 1523,
  "error": "connection timeout"
}
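A logger that emits entries in roughly this shape could be sketched with Python's standard `logging` module. This is a minimal illustration, not a production setup: the `JsonFormatter` class, the `build_logger` helper, the `fields` key passed through `extra`, and the `SENSITIVE_KEYS` deny-list are all hypothetical names chosen for this example.

```python
import json
import logging
from datetime import datetime, timezone

# Hypothetical deny-list; adapt it to your own log schema.
SENSITIVE_KEYS = {"password", "token", "authorization"}

# Map Python's level names onto the severity names used in this guide.
LEVEL_NAMES = {"WARNING": "warn", "CRITICAL": "fatal"}


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.isoformat(timespec="seconds").replace("+00:00", "Z"),
            "level": LEVEL_NAMES.get(record.levelname, record.levelname.lower()),
            "message": record.getMessage(),
            "service": self.service,
        }
        # Structured fields arrive via extra={"fields": {...}}.
        # Sensitive keys are masked rather than silently dropped.
        for key, value in getattr(record, "fields", {}).items():
            entry[key] = "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        return json.dumps(entry)


def build_logger(service: str, stream=None) -> logging.Logger:
    """Return a logger that writes one JSON object per line to `stream`."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)  # defaults to stderr when stream is None
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.propagate = False
    return logger


log = build_logger("user-api")
log.error(
    "Database query failed",
    extra={"fields": {"trace_id": "abc123def456", "duration_ms": 1523, "token": "s3cr3t"}},
)
```

Masking by key name only catches fields you anticipated, so it complements rather than replaces the rule of never passing secrets to the logger in the first place.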

Centralization

Ship all logs to a centralized system like ELK (Elasticsearch, Logstash, Kibana), EFK (Elasticsearch, Fluentd, Kibana), or Grafana + Loki.

Distributed Tracing: Tracking Requests Across Services

Distributed tracing is crucial for microservice architectures where requests flow through multiple services.

Key Implementation Steps

  1. Use OpenTelemetry - The vendor-neutral standard for generating and exporting telemetry data
  2. Propagate trace_id and span_id automatically - Use libraries like Micrometer Tracing (the successor to Spring Cloud Sleuth for Spring Boot 3) or OpenTelemetry auto-instrumentation (multi-language)
  3. Always include trace_id in logs - This enables correlation between logs and traces
  4. Use tracing backends - Jaeger, Zipkin, or commercial APM tools (Datadog, New Relic)
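To make the propagation step concrete, here is a small sketch of what a tracing library does when it passes context between services, using the W3C Trace Context `traceparent` header format (`version-traceid-spanid-flags`). In practice an OpenTelemetry SDK or auto-instrumentation handles this for you; the `SpanContext` type and the helper functions below are hypothetical names for illustration only.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version 00 traceparent: 32 hex chars of trace id, 16 of span id, 2 of flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def new_root() -> SpanContext:
    """Start a new trace at the edge of the system."""
    return SpanContext(secrets.token_hex(16), secrets.token_hex(8), True)


def child_of(parent: SpanContext) -> SpanContext:
    """A child span keeps the trace_id and gets a fresh span_id."""
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.sampled)


def to_header(ctx: SpanContext) -> str:
    """Serialize the context for an outgoing request header."""
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def from_header(value: str) -> Optional[SpanContext]:
    """Parse an incoming header; return None if it is malformed."""
    match = _TRACEPARENT.match(value.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero ids are defined as invalid by the specification.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


# Service A starts a trace and calls service B.
outgoing = to_header(new_root())

# Service B continues the same trace with its own span.
incoming = from_header(outgoing)
span_in_b = child_of(incoming) if incoming else new_root()
```

Because the `trace_id` survives every hop, writing it into each log entry is what lets you jump from a single log line to the full request path.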

What Traces Reveal

  • Request path through services
  • Timing of each operation
  • Dependencies between services
  • Bottlenecks and slow operations

Metrics: What to Collect

Essential Metrics Categories

| Category | Key Metrics |
| --- | --- |
| Infrastructure | CPU usage, memory, disk I/O, network |
| Application | Request rate, error rate, response time (p50/p95/p99) |
| Business | Active users, conversions, transaction volume |
| Queue/System | Queue depth, cache hit rate, connection pool usage |

RED Method (for services)

  • Rate: requests per second
  • Errors: error rate (percentage or count)
  • Duration: response time distribution
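As a rough illustration of the RED method, the three numbers can be derived from a window of request records. The sketch below makes several assumptions that are not from the article: the `Request` record shape, treating HTTP status 500 and above as an error, and using the nearest-rank method for percentiles. A metrics backend would normally compute these from histograms rather than raw samples.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Request:
    duration_ms: float
    status: int  # HTTP status code


def percentile(sorted_values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile of a sorted, non-empty sequence."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def red_summary(requests: Sequence[Request], window_seconds: float) -> dict:
    """Summarize Rate, Errors, and Duration for one time window."""
    empty: Optional[float] = None
    if not requests:
        return {"rate_rps": 0.0, "error_rate": 0.0,
                "p50_ms": empty, "p95_ms": empty, "p99_ms": empty}
    durations = sorted(r.duration_ms for r in requests)
    # Assumption: only server-side failures (5xx) count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }


sample = [Request(duration_ms=120.0, status=200),
          Request(duration_ms=340.0, status=200),
          Request(duration_ms=1523.0, status=503)]
summary = red_summary(sample, window_seconds=60.0)
```

Reporting percentiles rather than an average keeps slow outliers visible: in the sample above the single timeout dominates p95 and p99 even though most requests were fast.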

USE Method (for infrastructure)

  • Utilization: CPU, memory, disk usage
  • Saturation: queue lengths, load
  • Errors: hardware/software errors

Setting Up Meaningful Alerts

Alerting strategy should focus on user impact rather than just technical thresholds.

Alert Best Practices

  1. Start with clear goals - Define what you want to improve (reduce downtime, improve UX, detect security issues)
  2. Alert on symptoms, not causes - Alert on "high error rate" not "database CPU at 90%"
  3. Use SLOs (Service Level Objectives) - Configure alerts based on SLO burn rates
  4. Avoid alert fatigue - Remove unnecessary alerts, tune thresholds
  5. Make alerts actionable - Every alert should have a clear next step
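The SLO burn-rate idea from the list above can be sketched in a few lines. Burn rate is the observed error rate divided by the error budget (1 minus the SLO target), so a burn rate of 1 spends the budget exactly over the SLO period. The function names are hypothetical, and the 14.4 threshold with a long and a short window is a commonly cited starting point for a 30-day SLO, taken here as an assumption rather than a recommendation from the article.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than 'sustainable' the error budget is being spent."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0 to leave an error budget")
    return error_rate / budget


def should_page(
    long_window_error_rate: float,
    short_window_error_rate: float,
    slo_target: float,
    threshold: float = 14.4,  # assumed value; tune to your own SLO period
) -> bool:
    """Page only if both windows burn fast.

    The long window (for example 1 hour) shows the problem is significant;
    the short window (for example 5 minutes) shows it is still happening.
    """
    return (
        burn_rate(long_window_error_rate, slo_target) >= threshold
        and burn_rate(short_window_error_rate, slo_target) >= threshold
    )


# With a 99.9% SLO, a sustained 2% error rate burns the budget about 20x too fast.
page_now = should_page(
    long_window_error_rate=0.02,
    short_window_error_rate=0.03,
    slo_target=0.999,
)
```

Requiring both windows to exceed the threshold helps with alert fatigue: a brief spike that has already recovered fails the short-window check and does not page anyone.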

Alert Types by Severity

| Severity | When to use | Example |
| --- | --- | --- |
| Critical | User impact, requires immediate action | Error rate > 5%, site down |
| Warning | Degradation, can be addressed soon | Response time p99 > 2s |
| Info | Trend notification, no immediate action | Daily traffic 20% above average |
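The severity table can be read as a small decision function. The sketch below encodes only the example thresholds from the table (error rate above 5%, p99 above 2 seconds, traffic 20% above average); the `classify` function and its parameter names are hypothetical, and the "site down" case is left out because it needs a separate availability check.

```python
from enum import Enum
from typing import Optional


class Severity(Enum):
    CRITICAL = "critical"
    WARNING = "warning"
    INFO = "info"


def classify(
    error_rate: float,          # fraction of failed requests, 0.0 to 1.0
    p99_seconds: float,         # 99th percentile response time
    traffic_vs_average: float,  # today's traffic divided by the daily average
) -> Optional[Severity]:
    """Return the highest applicable severity, or None if nothing should fire."""
    if error_rate > 0.05:
        return Severity.CRITICAL
    if p99_seconds > 2.0:
        return Severity.WARNING
    if traffic_vs_average >= 1.20:
        return Severity.INFO
    return None


result = classify(error_rate=0.01, p99_seconds=2.5, traffic_vs_average=1.0)
```

Checking the most severe condition first means a single incident produces one notification at the right level instead of three overlapping ones.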

Building Dashboards That Help Debugging

Dashboard Design Principles

  1. Build unified dashboards - Bring logs, metrics, and traces together in one view
  2. Surface what matters - Highlight critical user impact
  3. Use pre-built + custom dashboards - Start with templates, then customize
  4. Make it actionable - Include links to traces, logs, and runbooks

Essential Dashboard Sections

  • Health overview: Error rate, latency, traffic, saturation (the four Golden Signals)
  • Infrastructure: CPU, memory, disk, network
  • Dependencies: Database, cache, external APIs
  • Business metrics: Active users, conversions
  • Recent deploys: Correlation with performance changes

Building Observability Into Your System From Day One

Step-by-Step Setup Guide

  1. Define clear goals - What do you want to achieve? (faster incident resolution, better UX)
  2. Start small - Focus on a critical service before expanding
  3. Select tools that unify data - Choose tools that bring telemetry together consistently
  4. Instrument with OpenTelemetry - Standardized data collection across languages
  5. Enable structured logging - JSON format with trace_id from day one
  6. Enable distributed tracing - Activate for all services
  7. Set up dashboards - Use pre-built or custom dashboards for key metrics
  8. Define alerts and SLOs - Configure alert policies based on user impact
  9. Integrate with existing tools - Connect to Kubernetes, AWS, Azure, CI/CD
  10. Train your team - Knowledge sharing on reading telemetry and troubleshooting

Observability-as-Code

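Keep dashboards, alert rules, and SLO definitions in version control and apply them through your deployment pipeline, so that changes are reviewed and every environment stays consistent.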
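One way to picture this is to define alert rules as typed data, validate them, and render a deterministic file that can be committed and reviewed. The sketch below uses a made-up schema: `AlertRule`, its fields, the `render` helper, the expression strings, and the example.com runbook URLs are all hypothetical and do not match any specific tool's format. Real setups would more likely use the native configuration of Terraform, Prometheus rule files, or a vendor API.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable

ALLOWED_SEVERITIES = {"critical", "warning", "info"}


@dataclass(frozen=True)
class AlertRule:
    name: str
    expression: str  # query text in whatever language your backend uses
    severity: str
    runbook_url: str

    def __post_init__(self) -> None:
        if self.severity not in ALLOWED_SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")
        if not self.runbook_url:
            # Every alert should come with a clear next step.
            raise ValueError(f"alert {self.name!r} has no runbook")


def render(rules: Iterable[AlertRule]) -> str:
    """Serialize rules in a stable order so diffs stay small and reviewable."""
    ordered = sorted(rules, key=lambda rule: rule.name)
    names = [rule.name for rule in ordered]
    if len(names) != len(set(names)):
        raise ValueError("duplicate alert names")
    return json.dumps([asdict(rule) for rule in ordered], indent=2, sort_keys=True)


rules = [
    AlertRule("slow-responses", "p99_latency_seconds > 2", "warning",
              "https://example.com/runbooks/slow-responses"),
    AlertRule("high-error-rate", "error_rate > 0.05", "critical",
              "https://example.com/runbooks/high-error-rate"),
]
config_text = render(rules)
```

Validation at definition time means a rule without a runbook or with a misspelled severity fails in code review or CI rather than during an incident.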

Review and Refine Regularly

No system is static. Regular reviews help you:

  • Remove unnecessary alerting
  • Update visualizations
  • Confirm alignment with goals

Best Practices Checklist

✓ Use OpenTelemetry for standardized data
✓ Enable code profiling for deeper insights
✓ Implement Observability-as-Code
✓ Build unified dashboards with actionable alerts
✓ Attach correlation IDs for distributed tracing
✓ Follow log retention policies

Quick Start Tools Reference

| Category | Open Source | Commercial |
| --- | --- | --- |
| Logging | ELK, EFK, Grafana Loki | Datadog Logs, New Relic |
| Metrics | Prometheus, Grafana | Datadog, New Relic, Grafana Cloud |
| Tracing | Jaeger, Zipkin | Datadog APM, New Relic APM |
| Unified | Grafana (with plugins) | Datadog, New Relic, Splunk |

Start with one critical service, implement all three pillars, then expand systematically.
Understanding how logs, metrics, and traces complement each other transforms incident response from guesswork into systematic investigation. When used in harmony, they provide a holistic view enabling rapid troubleshooting and proactive problem-solving.

What's your current tech stack (e.g., Kubernetes, AWS, microservices)? I can provide more specific tool recommendations for your setup.


Rizwan Saleem — https://rizwansaleem.co
