Rizwan Saleem

Posted on May 30

How to set up monitoring and observability that actually helps you sleep at night

#frontend #webdev

Monitoring and Observability for Engineers: A Practical Setup Guide

The Three Pillars: Logs, Metrics, Traces

Observability gives you end-to-end visibility into complex systems, helping you troubleshoot faster and improve user experiences. The three fundamental pillars are:

Pillar	What it is	What it answers	Best used for
Logs	Discrete events with detailed context at a specific moment	What happened?	Debugging errors, auditing, detailed context
Metrics	Numerical measurements aggregated over time	How much/many?	Alerting, trend analysis, performance monitoring
Traces	Individual requests flowing through distributed systems	Where did it happen?	Finding bottlenecks, dependencies, root causes

Metrics tell you when problems occur, traces show you where problems live, and logs explain why problems happened. Combining all three enables a holistic view of system behavior.

Structured Logging: Best Practices

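Structured logging is foundational to effective observability: machine-readable fields let you filter, aggregate, and correlate logs without writing a parser for every message format.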

Core Principles

Use JSON format - Structured fields parse reliably, while plain-text logs need fragile regexes
Design a schema first - Agree on field names and types across your organization
Use severity levels consistently: debug, info, warn, error, fatal
Include trace_id in every log - This correlates logs with traces
Log at boundaries - HTTP requests, database calls, external service interactions
Avoid sensitive data - Never log passwords, tokens, or PII
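Watch cardinality - Fields with many unique values (user IDs, request IDs) hurt query performance and cost when used as index labels or metric dimensions, so keep them as plain log fields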
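One way to enforce the "avoid sensitive data" principle is to mask known-sensitive keys before an event reaches the logger. The sketch below is a minimal illustration, not a complete PII solution: the `SENSITIVE_KEYS` deny-list and the `redact` helper are hypothetical names, and the keys listed are assumptions you would replace with your own schema.

```python
import json

# Hypothetical deny-list; replace with the field names your schema agrees on.
SENSITIVE_KEYS = {"password", "token", "authorization", "email", "ssn"}

def redact(fields: dict) -> dict:
    """Return a copy of fields with sensitive values masked, recursing into nested dicts."""
    clean = {}
    for key, value in fields.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact(value)
        else:
            clean[key] = value
    return clean

event = {
    "message": "login failed",
    "user_id": "user_12345",
    "password": "hunter2",
    "headers": {"Authorization": "Bearer abc"},
}
print(json.dumps(redact(event)))
```

A deny-list only catches keys you anticipated. Values embedded in free-text messages still need care at the call site.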

Example (JSON structured log)

{
  "timestamp": "2026-05-29T22:15:30Z",
  "level": "error",
  "message": "Database query failed",
  "trace_id": "abc123def456",
  "span_id": "span789",
  "service": "user-api",
  "user_id": "user_12345",
  "duration_ms": 1523,
  "error": "connection timeout"
}
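A log line shaped like the one above can be produced with Python's standard `logging` module and a small custom formatter. This is a sketch under assumptions: `JsonFormatter` and its `CONTEXT_FIELDS` tuple are illustrative names, and the output is written to an in-memory stream rather than shipped anywhere.

```python
import io
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    # Hypothetical context fields, mirroring the example log above.
    CONTEXT_FIELDS = ("trace_id", "span_id", "user_id", "duration_ms", "error")

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            # Python's level names are WARNING and CRITICAL, so map them
            # yourself if your schema uses "warn" and "fatal".
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "service": self.service,
        }
        # Values passed via `extra=` become attributes on the record.
        for field in self.CONTEXT_FIELDS:
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)

stream = io.StringIO()  # stand-in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter(service="user-api"))

logger = logging.getLogger("user-api")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.error(
    "Database query failed",
    extra={"trace_id": "abc123def456", "span_id": "span789",
           "duration_ms": 1523, "error": "connection timeout"},
)
print(stream.getvalue(), end="")
```

Emitting one JSON object per line keeps the output easy for log shippers to split and parse.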

Centralization

Ship all logs to a centralized system like ELK (Elasticsearch, Logstash, Kibana), EFK (Elasticsearch, Fluentd, Kibana), or Grafana + Loki.

Distributed Tracing: Tracking Requests Across Services

Distributed tracing is crucial for microservice architectures where requests flow through multiple services.

Key Implementation Steps

Use OpenTelemetry - The vendor-neutral industry standard for generating and exporting telemetry data
Propagate trace_id and span_id automatically - Use libraries like Spring Cloud Sleuth (Java, superseded by Micrometer Tracing in Spring Boot 3) or OpenTelemetry auto-instrumentation (multi-language)
Always include trace_id in logs - This enables correlation between logs and traces
Use tracing backends - Jaeger, Zipkin, or commercial APM tools (Datadog, New Relic)
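To make "propagate trace_id and span_id" concrete, here is a hand-rolled sketch of what instrumentation libraries do for you, using the W3C Trace Context `traceparent` header format (`version-traceid-parentid-flags`). In practice you would let OpenTelemetry handle this. The helper names `parse_traceparent` and `outgoing_headers` are hypothetical, and the code assumes header names have already been lowercased.

```python
import re
import secrets
from typing import Optional

# W3C Trace Context, version 00: 32 hex trace-id, 16 hex parent-id, 2 hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def parse_traceparent(header: Optional[str]):
    """Return (trace_id, parent_span_id, flags), or None if absent or malformed."""
    if not header:
        return None
    match = TRACEPARENT_RE.match(header.strip())
    if not match:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid per the specification.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return trace_id, span_id, flags

def outgoing_headers(incoming: dict) -> dict:
    """Continue the caller's trace if present, otherwise start a new one."""
    parsed = parse_traceparent(incoming.get("traceparent"))
    if parsed:
        trace_id, _parent, flags = parsed
    else:
        trace_id, flags = secrets.token_hex(16), "01"  # new sampled trace
    span_id = secrets.token_hex(8)  # fresh span ID for this hop
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}

incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
print(outgoing_headers(incoming))
```

The trace ID stays constant across every hop while each service mints its own span ID. That shared trace ID is the value to copy into the `trace_id` field of your logs.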

What Traces Reveal

Request path through services
Timing of each operation
Dependencies between services
Bottlenecks and slow operations

Metrics: What to Collect

Essential Metrics Categories

Category	Key Metrics
Infrastructure	CPU usage, memory, disk I/O, network
Application	Request rate, error rate, response time (p50/p95/p99)
Business	Active users, conversions, transaction volume
Queue/System	Queue depth, cache hit rate, connection pool usage

RED Method (for services)

Rate: requests per second
Errors: error rate (percentage or count)
Duration: response time distribution
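The three RED numbers can be derived from raw request records. The sketch below assumes a hypothetical `Request` record, treats HTTP 5xx as errors, and uses the nearest-rank percentile definition. A real metrics system would typically use histograms rather than sorting raw samples.

```python
import math
from dataclasses import dataclass

@dataclass
class Request:
    duration_ms: float
    status: int

def percentile(sorted_values, p):
    """Nearest-rank percentile on an already sorted, non-empty list."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]

def red_summary(requests, window_seconds):
    """Compute Rate, Errors, and Duration for one time window."""
    if not requests:
        raise ValueError("no requests in window")
    durations = sorted(r.duration_ms for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)  # assumption: 5xx = error
    return {
        "rate_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }

# Synthetic data: 100 requests in a 10-second window, durations 1..100 ms,
# with the five slowest requests failing.
sample = [Request(duration_ms=i, status=500 if i > 95 else 200) for i in range(1, 101)]
print(red_summary(sample, window_seconds=10))
```

Reporting percentiles rather than an average keeps slow outliers visible, which is why the table above lists p50/p95/p99.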

USE Method (for infrastructure)

Utilization: CPU, memory, disk usage
Saturation: queue lengths, load
Errors: hardware/software errors

Setting Up Meaningful Alerts

Alerting strategy should focus on user impact rather than just technical thresholds.

Alert Best Practices

Start with clear goals - Define what you want to improve (reduce downtime, improve UX, detect security issues)
Alert on symptoms, not causes - Alert on "high error rate" not "database CPU at 90%"
Use SLOs (Service Level Objectives) - Configure alerts based on SLO burn rates
Avoid alert fatigue - Remove unnecessary alerts, tune thresholds
Make alerts actionable - Every alert should have a clear next step
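Burn rate is the observed error rate divided by the error budget (1 minus the SLO target). A burn rate of 1.0 uses up the budget exactly over the SLO period, and higher values use it up faster. The sketch below assumes a 99.9% SLO and a two-window check. The 14.4 threshold is an assumption taken from common multi-window practice: burning 2% of a 30-day budget in one hour gives 0.02 × 720 = 14.4. The function names are illustrative.

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """How fast the error budget is being consumed; 1.0 means exactly on budget."""
    if total == 0:
        return 0.0
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return (errors / total) / error_budget

def should_page(long_window, short_window, slo_target=0.999, threshold=14.4):
    """Page only if both windows burn fast.

    The long window (for example 1 hour) shows the problem is significant.
    The short window (for example 5 minutes) shows it is still happening.
    Each window is an (errors, total) tuple.
    """
    return (burn_rate(*long_window, slo_target) >= threshold
            and burn_rate(*short_window, slo_target) >= threshold)

# 2% errors over the last hour, 3% over the last five minutes: page.
print(should_page(long_window=(20, 1000), short_window=(3, 100)))
# Same hour, but the last five minutes are clean: the incident has likely ended.
print(should_page(long_window=(20, 1000), short_window=(0, 100)))
```

Requiring both windows to exceed the threshold is one way to reduce alert fatigue, because the alert clears soon after the errors stop.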

Alert Types by Severity

Severity	When to use	Example
Critical	User impact, requires immediate action	Error rate > 5%, site down
Warning	Degradation, can be addressed soon	Response time p99 > 2s
Info	Trend notification, no immediate action	Daily traffic 20% above average
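The severity table can be encoded as a small function that checks the most severe condition first. This is only an illustration that reuses the example thresholds from the table. The function name `classify` and its parameters are hypothetical, and real thresholds should come from your SLOs.

```python
def classify(error_rate: float, p99_seconds: float, traffic_vs_average: float) -> str:
    """Map current measurements to a severity, checking the worst case first.

    traffic_vs_average is today's traffic divided by the daily average,
    so 1.20 means 20% above average.
    """
    if error_rate > 0.05:
        return "critical"   # user impact, act immediately
    if p99_seconds > 2.0:
        return "warning"    # degradation, address soon
    if traffic_vs_average > 1.20:
        return "info"       # trend notification only
    return "ok"

print(classify(error_rate=0.08, p99_seconds=0.4, traffic_vs_average=1.0))
print(classify(error_rate=0.01, p99_seconds=2.5, traffic_vs_average=1.0))
```

Ordering the checks by severity means a single evaluation never produces a warning and a critical alert for the same incident.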

Building Dashboards That Help Debugging

Dashboard Design Principles

Build unified dashboards - Bring logs, metrics, and traces together in one view
Surface what matters - Highlight critical user impact
Use pre-built + custom dashboards - Start with templates, then customize
Make it actionable - Include links to traces, logs, and runbooks

Essential Dashboard Sections

Health overview: Error rate, latency, traffic, saturation (the four Golden Signals)
Infrastructure: CPU, memory, disk, network
Dependencies: Database, cache, external APIs
Business metrics: Active users, conversions
Recent deploys: Correlation with performance changes

Building Observability Into Your System From Day One

Step-by-Step Setup Guide

Define clear goals - What do you want to achieve? (faster incident resolution, better UX)
Start small - Focus on a critical service before expanding
Select tools that unify data - Choose tools that bring telemetry together consistently
Instrument with OpenTelemetry - Standardized data collection across languages
Enable structured logging - JSON format with trace_id from day one
Enable distributed tracing - Activate for all services
Set up dashboards - Use pre-built or custom dashboards for key metrics
Define alerts and SLOs - Configure alert policies based on user impact
Integrate with existing tools - Connect to Kubernetes, AWS, Azure, CI/CD
Train your team - Knowledge sharing on reading telemetry and troubleshooting

Observability-as-Code

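Keep dashboards, alert rules, and SLO definitions in version-controlled files and apply them through your deployment pipeline, so changes are reviewed, reproducible, and consistent across environments.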
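As a minimal sketch of the idea, alert rules can live in the repository as typed data and be validated in CI before they are applied. Everything here is an assumption for illustration: the `AlertRule` shape, the `validate` checks, the rule names, and the example runbook URL do not correspond to any specific tool's format.

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class AlertRule:
    name: str
    expression: str   # backend-specific query, kept opaque here
    severity: str     # critical | warning | info
    runbook_url: str

def validate(rules):
    """Return a list of problems; an empty list means the rule set is acceptable."""
    problems = []
    seen = set()
    for rule in rules:
        if rule.name in seen:
            problems.append(f"{rule.name}: duplicate name")
        seen.add(rule.name)
        if rule.severity not in {"critical", "warning", "info"}:
            problems.append(f"{rule.name}: unknown severity {rule.severity!r}")
        if not rule.runbook_url:
            problems.append(f"{rule.name}: missing runbook_url")
    return problems

RULES = [
    AlertRule("HighErrorRate", "error_rate > 0.05", "critical",
              "https://runbooks.example.com/high-error-rate"),
    AlertRule("SlowP99", "p99_seconds > 2", "warning", ""),
]

print(validate(RULES))
print(json.dumps([asdict(rule) for rule in RULES], indent=2))
```

Failing the build when `validate` returns problems is one way to enforce that every alert has a clear next step.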

Review and Refine Regularly

No system is static. Regular reviews help you:

Remove unnecessary alerting
Update visualizations
Confirm alignment with goals

Best Practices Checklist

✓ Use OpenTelemetry for standardized data
✓ Enable code profiling for deeper insights
✓ Implement Observability-as-Code
✓ Build unified dashboards with actionable alerts
✓ Attach correlation IDs for distributed tracing
✓ Follow log retention policies

Quick Start Tools Reference

Category	Open Source	Commercial
Logging	ELK, EFK, Grafana Loki	Datadog Logs, New Relic
Metrics	Prometheus, Grafana	Datadog, New Relic, Grafana Cloud
Tracing	Jaeger, Zipkin	Datadog APM, New Relic APM
Unified	Grafana (with plugins)	Datadog, New Relic, Splunk

Start with one critical service, implement all three pillars, then expand systematically.
Understanding how logs, metrics, and traces complement each other transforms incident response from guesswork into systematic investigation. When used in harmony, they provide a holistic view enabling rapid troubleshooting and proactive problem-solving.

Rizwan Saleem — https://rizwansaleem.co
