How to set up monitoring and observability that actually helps you sleep at night
Monitoring and Observability for Engineers: A Practical Setup Guide
The Three Pillars: Logs, Metrics, Traces
Observability gives you end-to-end visibility into complex systems, helping you troubleshoot faster and improve user experiences. The three fundamental pillars are:
| Pillar | What it is | What it answers | Best used for |
|---|---|---|---|
| Logs | Discrete events with detailed context at a specific moment | What happened? | Debugging errors, auditing, detailed context |
| Metrics | Numerical measurements aggregated over time | How much/many? | Alerting, trend analysis, performance monitoring |
| Traces | Individual requests flowing through distributed systems | Where did it happen? | Finding bottlenecks, dependencies, root causes |
Metrics tell you when problems occur, traces show you where problems live, and logs explain why problems happened. Combining all three enables a holistic view of system behavior.
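As a rough illustration of how the signals fit together during an incident, the sketch below takes a trace ID (say, one surfaced by a latency alert), picks the slowest span to show where the time went, and pulls the error logs sharing that ID to suggest why. The `investigate` function, the `self_ms` field (time spent in a span excluding its children), and the sample records are all hypothetical; real backends expose this through their own query languages.

```python
def investigate(trace_id: str, spans: list, logs: list):
    """Summarize one trace: slowest span (where) plus its error logs (why)."""
    trace_spans = [s for s in spans if s["trace_id"] == trace_id]
    if not trace_spans:
        return None
    slowest = max(trace_spans, key=lambda s: s["self_ms"])
    reasons = [
        entry["message"]
        for entry in logs
        if entry["trace_id"] == trace_id and entry["level"] in ("error", "fatal")
    ]
    return {"where": slowest["service"], "self_ms": slowest["self_ms"], "why": reasons}


# Hypothetical sample data, shaped like records a tracing and a logging backend might return.
SPANS = [
    {"trace_id": "t1", "service": "gateway", "self_ms": 40},
    {"trace_id": "t1", "service": "user-api", "self_ms": 1523},
    {"trace_id": "t2", "service": "gateway", "self_ms": 12},
]
LOGS = [
    {"trace_id": "t1", "level": "error", "message": "Database query failed: connection timeout"},
    {"trace_id": "t2", "level": "info", "message": "ok"},
]

print(investigate("t1", SPANS, LOGS))
```

The join only works because both record types carry the same `trace_id`, which is why the logging and tracing sections both insist on it.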
Structured Logging: Best Practices
Structured logging is foundational to effective observability.
Core Principles
- Use JSON format - Structured fields are easier to parse and query than plain-text logs
- Design a schema first - Agree on field names and types across your organization
- Use severity levels consistently: `debug`, `info`, `warn`, `error`, `fatal`
- Include `trace_id` in every log - This correlates logs with traces
- Log at boundaries - HTTP requests, database calls, external service interactions
- Avoid sensitive data - Never log passwords, tokens, or PII
- Watch cardinality - High-cardinality fields hurt query performance
Example (JSON structured log)
```json
{
  "timestamp": "2026-05-29T22:15:30Z",
  "level": "error",
  "message": "Database query failed",
  "trace_id": "abc123def456",
  "span_id": "span789",
  "service": "user-api",
  "user_id": "user_12345",
  "duration_ms": 1523,
  "error": "connection timeout"
}
```
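A logger that emits records in roughly this shape could look like the sketch below, which uses only Python's standard `logging` module. The `JsonFormatter` class, the `fields` key passed through `extra`, and the `SENSITIVE_KEYS` deny-list are illustrative assumptions, not part of any particular logging library; dedicated structured-logging packages cover the same ground with more features.

```python
import json
import logging
import sys
from datetime import datetime, timezone

# Hypothetical deny-list; extend it to match your own schema.
SENSITIVE_KEYS = {"password", "token", "authorization", "email"}

# Map Python's level names onto the severity names used in this guide.
LEVEL_NAMES = {"WARNING": "warn", "CRITICAL": "fatal"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": LEVEL_NAMES.get(record.levelname, record.levelname.lower()),
            "message": record.getMessage(),
            "service": self.service,
        }
        # Fields passed via extra={"fields": {...}} are merged in,
        # with sensitive keys masked rather than written out.
        for key, value in getattr(record, "fields", {}).items():
            entry[key] = "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        return json.dumps(entry)


def build_logger(service: str, stream=sys.stdout) -> logging.Logger:
    """Create a logger whose only handler writes JSON lines to `stream`."""
    logger = logging.getLogger(service)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("user-api")
    log.error(
        "Database query failed",
        extra={"fields": {
            "trace_id": "abc123def456",
            "span_id": "span789",
            "duration_ms": 1523,
            "error": "connection timeout",
            "token": "s3cr3t",  # masked in the output
        }},
    )
```

Masking by key name is a coarse safeguard: it catches accidental fields such as `token`, but it will not find secrets embedded in free-text messages, so the "never log it" rule still applies at the call site.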
Centralization
Ship all logs to a centralized system like ELK (Elasticsearch, Logstash, Kibana), EFK (Elasticsearch, Fluentd, Kibana), or Grafana + Loki.
Distributed Tracing: Tracking Requests Across Services
Distributed tracing is crucial for microservice architectures where requests flow through multiple services.
Key Implementation Steps
- Use OpenTelemetry - The vendor-neutral industry standard for collecting telemetry data
- Propagate `trace_id` and `span_id` automatically - Use libraries like Spring Cloud Sleuth (Java; succeeded by Micrometer Tracing in Spring Boot 3) or OpenTelemetry auto-instrumentation (multi-language)
- Always include `trace_id` in logs - This enables correlation between logs and traces
- Use tracing backends - Jaeger, Zipkin, or commercial APM tools (Datadog, New Relic)
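To make the propagation step concrete, here is a hand-rolled sketch of what instrumentation libraries do on your behalf: read the W3C Trace Context `traceparent` header from an incoming request, start a child span in the same trace, and write a new header for the outgoing call. The function names are hypothetical and the code assumes lowercase header keys; in a real service you would let OpenTelemetry's propagators handle this rather than writing it yourself.

```python
import re
import secrets
from typing import Optional

# W3C Trace Context header: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def extract_context(headers: dict) -> Optional[dict]:
    """Return the caller's trace context, or None if absent or malformed."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero ids are invalid according to the specification.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {"trace_id": trace_id, "span_id": parent_id, "flags": flags}


def start_span(incoming: Optional[dict]) -> dict:
    """Continue the caller's trace, or start a new trace at the edge."""
    return {
        "trace_id": incoming["trace_id"] if incoming else secrets.token_hex(16),
        "parent_span_id": incoming["span_id"] if incoming else None,
        "span_id": secrets.token_hex(8),
        "flags": incoming["flags"] if incoming else "01",  # 01 = sampled
    }


def inject_context(span: dict, headers: dict) -> dict:
    """Copy the headers and add a traceparent naming this span as the parent."""
    out = dict(headers)
    out["traceparent"] = f"00-{span['trace_id']}-{span['span_id']}-{span['flags']}"
    return out


if __name__ == "__main__":
    request_headers = {
        "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    }
    span = start_span(extract_context(request_headers))
    print(inject_context(span, {"accept": "application/json"}))
```

The trace ID stays constant across every hop while each service contributes its own span ID, which is what lets a backend reassemble the request path afterwards.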
What Traces Reveal
- Request path through services
- Timing of each operation
- Dependencies between services
- Bottlenecks and slow operations
### Metrics: What to Collect
Essential Metrics Categories
| Category | Key Metrics |
|---|---|
| Infrastructure | CPU usage, memory, disk I/O, network |
| Application | Request rate, error rate, response time (p50/p95/p99) |
| Business | Active users, conversions, transaction volume |
| Queue/System | Queue depth, cache hit rate, connection pool usage |
RED Method (for services)
- Rate: requests per second
- Errors: error rate (percentage or count)
- Duration: response time distribution
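The three RED numbers can be derived from raw request records, as in the sketch below. It assumes a hypothetical `Request` record, counts only 5xx responses as errors, and uses the nearest-rank percentile definition; a metrics system such as Prometheus would normally compute these from counters and histograms instead of holding every request in memory.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP status code


def percentile(sorted_values: list, pct: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def red_summary(requests: list, window_seconds: float) -> dict:
    """Rate, Errors, Duration for the requests seen in one time window."""
    if not requests:
        return {"rate_rps": 0.0, "error_rate": 0.0,
                "p50_ms": None, "p95_ms": None, "p99_ms": None}
    durations = sorted(r.duration_ms for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }


if __name__ == "__main__":
    # 100 requests over 10 seconds; the five slowest fail with a 500.
    sample = [Request(float(ms), 500 if ms > 95 else 200) for ms in range(1, 101)]
    print(red_summary(sample, window_seconds=10))
```

Reporting p95 and p99 alongside the median matters because an average can look healthy while the slowest few percent of users wait many times longer.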
USE Method (for infrastructure)
- Utilization: CPU, memory, disk usage
- Saturation: queue lengths, load
- Errors: hardware/software errors
### Setting Up Meaningful Alerts
Alerting strategy should focus on user impact rather than just technical thresholds.
Alert Best Practices
- Start with clear goals - Define what you want to improve (reduce downtime, improve UX, detect security issues)
- Alert on symptoms, not causes - Alert on "high error rate" not "database CPU at 90%"
- Use SLOs (Service Level Objectives) - Configure alerts based on SLO burn rates
- Avoid alert fatigue - Remove unnecessary alerts, tune thresholds
- Make alerts actionable - Every alert should have a clear next step
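For the SLO burn-rate practice above, a minimal sketch of the arithmetic follows. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), so 1.0 means the budget would last exactly the SLO period. The default threshold of 14.4 and the pairing of a long and a short window are assumptions borrowed from commonly cited multi-window guidance (at 14.4x, one hour consumes about 2% of a 30-day budget); tune both to your own SLOs.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_errors: float, short_window_errors: float,
                slo_target: float = 0.999, threshold: float = 14.4) -> bool:
    """Page only when both windows burn fast.

    The long window (for example 1 hour) shows the problem is significant;
    the short window (for example 5 minutes) shows it is still happening.
    """
    return (burn_rate(long_window_errors, slo_target) >= threshold
            and burn_rate(short_window_errors, slo_target) >= threshold)


if __name__ == "__main__":
    # 2% of requests failing against a 99.9% SLO is roughly a 20x burn rate.
    print(round(burn_rate(0.02, 0.999), 1))
    print(should_page(long_window_errors=0.02, short_window_errors=0.02))
    # The incident has already recovered in the short window: do not page.
    print(should_page(long_window_errors=0.02, short_window_errors=0.001))
```

Requiring both windows to exceed the threshold is one way to reduce alert fatigue: brief spikes never satisfy the long window, and resolved incidents stop satisfying the short one.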
Alert Types by Severity
| Severity | When to use | Example |
|---|---|---|
| Critical | User impact, requires immediate action | Error rate > 5%, site down |
| Warning | Degradation, can be addressed soon | Response time p99 > 2s |
| Info | Trend notification, no immediate action | Daily traffic 20% above average |
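The example thresholds in the table can be expressed as a small classifier, sketched below. The function name, the argument names, and the order of checks (most severe first) are assumptions for illustration, and the numbers are simply the table's examples rather than recommended values.

```python
def classify(error_rate: float, p99_seconds: float, traffic_vs_average: float) -> str:
    """Map current readings to a severity, checking the most severe rule first.

    error_rate is a fraction (0.05 = 5%); traffic_vs_average is a ratio
    (1.20 = 20% above the daily average).
    """
    if error_rate > 0.05:
        return "critical"  # user impact, act immediately
    if p99_seconds > 2.0:
        return "warning"   # degradation, address soon
    if traffic_vs_average >= 1.20:
        return "info"      # trend notification only
    return "ok"


if __name__ == "__main__":
    print(classify(error_rate=0.08, p99_seconds=3.0, traffic_vs_average=1.0))
    print(classify(error_rate=0.01, p99_seconds=2.5, traffic_vs_average=1.0))
    print(classify(error_rate=0.01, p99_seconds=0.4, traffic_vs_average=1.25))
```

Evaluating rules from most to least severe means a single incident raises one alert at the right level instead of three overlapping ones.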
Building Dashboards That Help Debugging
Dashboard Design Principles
- Build unified dashboards - Bring logs, metrics, and traces together in one view
- Surface what matters - Highlight critical user impact
- Use pre-built + custom dashboards - Start with templates, then customize
- Make it actionable - Include links to traces, logs, and runbooks
Essential Dashboard Sections
- Health overview: Error rate, latency, traffic, saturation (the four Golden Signals)
- Infrastructure: CPU, memory, disk, network
- Dependencies: Database, cache, external APIs
- Business metrics: Active users, conversions
- Recent deploys: Correlation with performance changes
### Building Observability Into Your System From Day One
Step-by-Step Setup Guide
- Define clear goals - What do you want to achieve? (faster incident resolution, better UX)
- Start small - Focus on a critical service before expanding
- Select tools that unify data - Choose tools that bring telemetry together consistently
- Instrument with OpenTelemetry - Standardized data collection across languages
- Enable structured logging - JSON format with `trace_id` from day one
- Enable distributed tracing - Activate for all services
- Set up dashboards - Use pre-built or custom dashboards for key metrics
- Define alerts and SLOs - Configure alert policies based on user impact
- Integrate with existing tools - Connect to Kubernetes, AWS, Azure, CI/CD
- Train your team - Knowledge sharing on reading telemetry and troubleshooting
Observability-as-Code
Keep dashboards, alert rules, and SLO definitions in version-controlled files so that changes are reviewed, repeatable, and consistent across environments.
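One possible shape for this, kept deliberately tool-agnostic, is sketched below: alert rules are declared as data, validated in CI, and rendered to deterministic JSON for review. The `AlertRule` fields, the query strings, and the runbook URLs are hypothetical placeholders; in practice you would render to whatever format your backend or a tool such as Terraform expects.

```python
import json
from dataclasses import asdict, dataclass

ALLOWED_SEVERITIES = {"critical", "warning", "info"}


@dataclass(frozen=True)
class AlertRule:
    name: str
    expression: str   # query in whatever language your backend uses
    severity: str
    runbook_url: str  # every alert should point at a clear next step


def validate(rules: list) -> list:
    """Return a list of problems; an empty list means the rules can be applied."""
    problems = []
    seen = set()
    for rule in rules:
        if rule.name in seen:
            problems.append(f"{rule.name}: duplicate name")
        seen.add(rule.name)
        if rule.severity not in ALLOWED_SEVERITIES:
            problems.append(f"{rule.name}: unknown severity {rule.severity!r}")
        if not rule.runbook_url:
            problems.append(f"{rule.name}: missing runbook_url")
    return problems


def render(rules: list) -> str:
    """Deterministic JSON so diffs in code review show only real changes."""
    ordered = sorted(rules, key=lambda r: r.name)
    return json.dumps([asdict(r) for r in ordered], indent=2, sort_keys=True)


# Hypothetical rules; the expressions and URLs are placeholders.
RULES = [
    AlertRule("HighErrorRate", "error_rate > 0.05", "critical",
              "https://example.com/runbooks/high-error-rate"),
    AlertRule("SlowP99", "latency_p99_seconds > 2", "warning",
              "https://example.com/runbooks/slow-p99"),
]

if __name__ == "__main__":
    issues = validate(RULES)
    print(issues if issues else render(RULES))
```

Running `validate` as a CI check turns the "make alerts actionable" guideline into something enforceable: a rule without a runbook link never reaches production.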
Review and Refine Regularly
No system is static. Regular reviews help you:
- Remove unnecessary alerting
- Update visualizations
- Confirm alignment with goals
Best Practices Checklist
✓ Use OpenTelemetry for standardized data
✓ Enable code profiling for deeper insights
✓ Implement Observability-as-Code
✓ Build unified dashboards with actionable alerts
✓ Attach correlation IDs for distributed tracing
✓ Follow log retention policies
Quick Start Tools Reference
| Category | Open Source | Commercial |
|---|---|---|
| Logging | ELK, EFK, Grafana Loki | Datadog Logs, New Relic |
| Metrics | Prometheus, Grafana | Datadog, New Relic, Grafana Cloud |
| Tracing | Jaeger, Zipkin | Datadog APM, New Relic APM |
| Unified | Grafana (with plugins) | Datadog, New Relic, Splunk |
Start with one critical service, implement all three pillars, then expand systematically.
Understanding how logs, metrics, and traces complement each other transforms incident response from guesswork into systematic investigation. When used in harmony, they provide a holistic view enabling rapid troubleshooting and proactive problem-solving.
Rizwan Saleem — https://rizwansaleem.co