If you’ve ever deployed an application to production, you know the sinking feeling that comes with a PagerDuty alert at 3 AM. It’s not the alert itself that hurts; it’s the uncertainty. What went wrong? Is it the database? The network? A slow API? A memory leak?
Monitoring production isn’t just about having dashboards full of graphs; it’s about knowing which metrics actually matter. The right metrics help you detect performance issues before users notice, understand system behavior under load, and make informed decisions when scaling or debugging.
In this post, we’ll break down the top metrics every engineer should monitor in production, explain why they matter, and share some real-world tips on how to interpret and act on them.
1. Latency – The First Signal of Trouble
Latency is often the canary in the coal mine. Whether it’s a slow database query or an external API dragging you down, latency tells you how responsive your system feels to end users.
What to measure:
- API response time (p50, p95, p99)
- Database query duration
- External service call latency
- Page load times (for front-end monitoring)
Why it matters:
High latency doesn’t always mean failure, but it often means frustration. Users might tolerate a few errors, but they won’t wait forever. A 500ms delay might sound small, but across thousands of requests, it can crush throughput and user experience.
Pro tip:
Always look beyond the average. Tail latency (p95/p99) reveals how bad things get for your slowest requests, and that’s what users remember.
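To see the tail at all, you need a histogram rather than a single average. Here’s a minimal sketch using Python’s prometheus_client; the metric name and bucket boundaries are illustrative choices, not a standard:

```python
# pip install prometheus_client
import random
import time

from prometheus_client import Histogram, start_http_server

# Buckets chosen to bracket a typical web SLO; tune them to your latency profile.
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "Time spent handling a request",
    buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)

def handle_request():
    # The .time() context manager observes elapsed time into the histogram.
    with REQUEST_LATENCY.time():
        time.sleep(random.uniform(0.01, 0.3))  # stand-in for real work

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for your scraper
    while True:
        handle_request()
```

From the exported buckets, your query layer (for example, PromQL’s histogram_quantile) can derive p50/p95/p99; an average alone would hide exactly the tail this section warns about.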
2. Error Rates – The Health Pulse of Your Application
You can have blazing-fast responses, but if half of them are 500 errors, your system isn’t healthy. Error rate monitoring helps you catch exceptions, failed requests, and dependency issues early.
What to measure:
- HTTP 4xx and 5xx responses
- Exception rates (application-level errors)
- Failed background jobs
- Timeout errors from dependencies
Why it matters:
Errors don’t just indicate broken code; they often reveal systemic issues: bad deployments, exhausted DB connections, API limits, or missing environment configs.
Pro tip:
Correlate your error spikes with deployment events or dependency changes. You’ll often find the cause hiding right there.
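That correlation is much easier if your error metric carries labels you can slice by. A hedged sketch with prometheus_client; the metric and label names are my own convention:

```python
from prometheus_client import Counter

# Labelled counter: slice error spikes by route and status code when debugging.
HTTP_ERRORS = Counter(
    "http_errors_total",
    "Count of 4xx/5xx responses",
    ["route", "status"],
)

def record_response(route: str, status_code: int) -> None:
    # Only 4xx/5xx responses land in this metric; successes stay out.
    if status_code >= 400:
        HTTP_ERRORS.labels(route=route, status=str(status_code)).inc()

record_response("/checkout", 502)
record_response("/login", 401)
```

Overlay http_errors_total on a dashboard with deploy markers, and the cause of a spike is often visible at a glance.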
3. Throughput – How Much Work Is Happening
Throughput tells you how much traffic your system is handling, whether it’s requests per second, messages processed, or jobs completed. It’s the metric that connects performance with business activity.
What to measure:
- Requests per second (RPS) for APIs
- Jobs processed per minute for background workers
- Transactions or sessions completed per unit of time
Why it matters:
Sudden drops in throughput may indicate issues like bottlenecks, queuing delays, or even upstream outages. On the other hand, unexpected spikes could mean your system is under attack or your marketing team just launched a campaign.
Pro tip:
Always pair throughput with latency and error rate. High throughput and high latency usually mean an overloaded system. Low throughput with high errors means something’s broken.
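Here’s a small in-process sketch of a rolling requests-per-second counter. In practice you’d export a plain counter and let your metrics backend compute the rate, but the windowing logic is the same idea:

```python
import time
from collections import deque

class ThroughputTracker:
    """Rolling requests-per-second over a fixed window."""

    def __init__(self, window_seconds: float = 60.0):
        self.window = window_seconds
        self.timestamps = deque()

    def record(self) -> None:
        self.timestamps.append(time.monotonic())

    def rps(self) -> float:
        # Drop samples that have aged out of the window, then average.
        cutoff = time.monotonic() - self.window
        while self.timestamps and self.timestamps[0] < cutoff:
            self.timestamps.popleft()
        return len(self.timestamps) / self.window

tracker = ThroughputTracker()
tracker.record()
print(f"{tracker.rps():.2f} rps")
```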
4. CPU Usage – When the System Starts to Sweat
CPU usage is one of the most basic yet essential system metrics. It helps you understand how efficiently your application code and infrastructure resources are being utilized.
What to measure:
- CPU utilization (%) per container, node, or service
- Context switches and system load averages
- Process-level CPU usage (for your main app process)
Why it matters:
Sustained high CPU can mean inefficient code, tight loops, or runaway processes. Low CPU doesn’t always mean “all good” either; sometimes it signals that your app is idle due to bottlenecks elsewhere, like I/O wait.
Pro tip:
Plot CPU usage against request throughput. If CPU rises faster than throughput, your app might not be scaling linearly.
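A quick way to get all three views (machine, load average, process) is psutil. A minimal sketch, assuming the script runs inside the process you care about:

```python
# pip install psutil
import psutil

# Whole-machine view: percentage over a 1-second sampling interval.
print("system cpu %:", psutil.cpu_percent(interval=1))

# Load averages (1/5/15 min); compare against core count, not against 100%.
one, five, fifteen = psutil.getloadavg()
print(f"load avg: {one:.2f} / {five:.2f} / {fifteen:.2f} on {psutil.cpu_count()} cores")

# Process-level view for your main app process.
proc = psutil.Process()
proc.cpu_percent(interval=None)  # first call primes the measurement
print("process cpu %:", proc.cpu_percent(interval=1))
```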
5. Memory Utilization – Spotting Leaks and Inefficiencies
Memory is another silent killer in production systems. A small leak or unoptimized cache can slowly eat up RAM until your service crashes or the OOM killer strikes.
What to measure:
- Memory usage per process or container
- Heap usage (for managed runtimes like the JVM, Node.js, or Python)
- Garbage collection frequency and duration
Why it matters:
Monitoring memory helps you catch leaks early and identify inefficient patterns, such as caching too aggressively or holding onto large objects.
Pro tip:
Set alerts for steady upward trends over time rather than static thresholds. Memory leaks often grow slowly, not in bursts.
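One way to detect that upward trend is to fit a line to RSS samples instead of comparing against a fixed ceiling. A sketch using psutil and the standard library (statistics.linear_regression needs Python 3.10+); the sampling cadence and alert threshold are illustrative:

```python
import statistics
import time

import psutil  # pip install psutil

def rss_trend_mb_per_hour(samples: int = 30, interval_s: float = 10.0) -> float:
    """Sample our own RSS and fit a line; a persistently positive slope
    across many windows is the leak signature, not any single reading."""
    proc = psutil.Process()
    xs, ys = [], []
    for i in range(samples):
        xs.append(i * interval_s)
        ys.append(proc.memory_info().rss / 1024 / 1024)  # MB
        time.sleep(interval_s)
    slope, _intercept = statistics.linear_regression(xs, ys)
    return slope * 3600  # MB per hour

# Alerting idea (threshold is illustrative, not a standard):
# if rss_trend_mb_per_hour() > 50: page someone before the OOM killer does.
```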
6. Disk I/O and Storage Utilization
When your system suddenly becomes I/O bound, everything slows down. Disk read/write speeds and available storage directly impact performance, especially for databases and logging-heavy services.
What to measure:
- Disk read/write latency
- Disk queue length
- Storage usage per volume
Why it matters:
Full disks lead to failed writes, corrupted logs, and crashed services. Slow I/O can make even simple queries crawl.
Pro tip:
Monitor free disk space and inode usage (yes, that can fill up too). Rotate logs and prune old data regularly.
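Both checks fit in a few lines of standard-library Python; note that statvfs (and therefore the inode check) is Unix-only:

```python
import os
import shutil

# Free space per volume; alert well before 100%.
usage = shutil.disk_usage("/")
print(f"disk used: {usage.used / usage.total:.1%}")

# Inode usage: a volume can be "full" with plenty of bytes free.
st = os.statvfs("/")
if st.f_files:  # some filesystems report 0 total inodes
    print(f"inodes used: {1 - st.f_ffree / st.f_files:.1%}")
```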
7. Network Metrics – The Hidden Layer of Performance
If you’ve ever debugged a “slow system” that turned out to be a DNS issue, you know how crucial network visibility is.
What to measure:
- Network latency (ping times, connection setup times)
- Packet loss and retransmission rates
- Bandwidth usage per node or service
Why it matters:
Network problems often manifest as app-level latency or timeouts. By monitoring network metrics, you can quickly tell whether the issue is inside your app or somewhere upstream.
Pro tip:
Correlate network latency spikes with dependency performance. You might find that your “slow database” is just a network hop away.
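When you suspect the network, it helps to split “slow” into DNS time versus connection time. A standard-library sketch; example.com is a placeholder host:

```python
import socket
import time

def probe(host: str, port: int = 443) -> tuple[float, float]:
    """Measure DNS resolution and TCP connect time separately."""
    t0 = time.perf_counter()
    socket.getaddrinfo(host, port)  # DNS resolution
    dns_s = time.perf_counter() - t0

    t1 = time.perf_counter()
    with socket.create_connection((host, port), timeout=3):  # TCP handshake
        pass
    connect_s = time.perf_counter() - t1
    return dns_s, connect_s

dns_s, connect_s = probe("example.com")
print(f"dns: {dns_s * 1000:.1f} ms, connect: {connect_s * 1000:.1f} ms")
```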
8. Database Metrics – Your App’s Backbone
Databases are often the bottleneck in production systems. Even with well-optimized queries, indexing strategies, and connection pooling, they deserve dedicated monitoring.
What to measure:
- Query latency and slow query count
- Connection pool utilization
- Cache hit/miss ratio
- Lock wait times
Why it matters:
A few slow queries can cascade into request timeouts and user frustration. Monitoring helps identify hotspots and scaling needs.
Pro tip:
Enable slow query logs and trace them in your APM tool. You’ll often find 80% of slowness coming from 20% of queries.
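If your database or APM tool isn’t wired up yet, you can get an app-side view with a simple decorator. The threshold here is illustrative, and fetch_orders is a hypothetical data-access function:

```python
import functools
import logging
import time

logger = logging.getLogger("slow_queries")
SLOW_QUERY_SECONDS = 0.2  # illustrative; tune to your SLO

def log_if_slow(func):
    """Log any wrapped call that exceeds the threshold. This captures the
    app's view, including network and driver overhead, which the database's
    own slow-query log won't show."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            if elapsed > SLOW_QUERY_SECONDS:
                logger.warning("slow query: %s took %.3fs", func.__name__, elapsed)
    return wrapper

@log_if_slow
def fetch_orders(user_id):  # hypothetical data-access function
    ...
```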
9. Application-Specific Business Metrics
Not every important metric is technical. Sometimes the best way to detect issues is by monitoring your business metrics.
What to measure:
- Number of signups, checkouts, or orders
- Failed payments or cart abandonments
- API usage per customer
Why it matters:
A drop in business KPIs often signals underlying technical problems like an endpoint failing silently or a bug in a workflow.
Pro tip:
Tie technical metrics to user outcomes. If latency spikes align with checkout failures, you know where to dig.
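Business metrics can ride the same pipeline as technical ones. A sketch with prometheus_client; the metric names are examples, so pick ones that match your domain:

```python
from prometheus_client import Counter

CHECKOUTS = Counter("checkouts_total", "Completed checkouts")
FAILED_PAYMENTS = Counter(
    "failed_payments_total", "Payment attempts that failed", ["provider"]
)

def on_checkout_completed():
    CHECKOUTS.inc()

def on_payment_failed(provider: str):
    # The label lets you tell a provider outage from a bug on your side.
    FAILED_PAYMENTS.labels(provider=provider).inc()
```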
10. External Dependencies – The “Unknown Unknowns”
Modern systems rely heavily on third-party APIs, SaaS tools, and microservices. These dependencies are often outside your control, but you still need visibility.
What to measure:
- API response times and error rates from external services
- Dependency uptime (ping checks, synthetic tests)
- Circuit breaker trips (if you use one)
Why it matters:
An outage in a payment gateway or an authentication provider can take your service down just as effectively as your own bugs.
Pro tip:
Set up synthetic monitoring for critical external endpoints. Don’t wait for customers to tell you Stripe is down.
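A synthetic check can be as simple as a scheduled script that hits a health endpoint and records the result. A sketch with requests; the URL is a placeholder for a real status or health endpoint:

```python
# pip install requests
import requests

def check_dependency(url: str, timeout_s: float = 5.0) -> dict:
    """Probe an external dependency; run this on a schedule (cron, CronJob)
    and alert on failures or slow responses."""
    try:
        resp = requests.get(url, timeout=timeout_s)
        return {
            "up": resp.status_code < 500,
            "status": resp.status_code,
            "latency_s": resp.elapsed.total_seconds(),
        }
    except requests.RequestException as exc:
        return {"up": False, "error": str(exc)}

print(check_dependency("https://api.example.com/health"))
```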
11. Log Volume and Patterns
Logs tell stories when you know what to look for. Monitoring log patterns helps detect new exceptions, spikes in warning messages, or missing events that should have occurred.
What to measure:
- Log event count per service
- Error/warning log ratios
- Missing or unexpected log patterns
Why it matters:
A sudden drop in logs may mean a service isn’t running at all. A flood of error logs might point to an edge case that escaped your tests.
Pro tip:
Feed logs into a centralized aggregator (like ELK, Loki, or Datadog Logs) and tag them by service and environment. That context is gold during debugging.
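Even before a full aggregator, you can watch level ratios in-process with a custom logging handler. A minimal sketch using only the standard library:

```python
import logging

class LevelCounterHandler(logging.Handler):
    """Counts log records per level, so you can watch the error/warning
    ratio or notice when a normally chatty service goes silent."""

    def __init__(self):
        super().__init__()
        self.counts: dict[str, int] = {}

    def emit(self, record: logging.LogRecord) -> None:
        self.counts[record.levelname] = self.counts.get(record.levelname, 0) + 1

counter = LevelCounterHandler()
logging.getLogger().addHandler(counter)

logging.warning("disk filling up")
logging.error("payment webhook failed")
print(counter.counts)  # {'WARNING': 1, 'ERROR': 1}
```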
12. Uptime and Availability
At the end of the day, uptime is the simplest but most important metric. None of the fancier metrics matter if your service isn’t reachable.
What to measure:
- Service uptime percentage (SLA tracking)
- Endpoint availability via synthetic checks
- Regional availability (for multi-region setups)
Why it matters:
Every minute of downtime costs revenue and reputation. Monitoring helps ensure you meet SLAs and identify weak spots in your infrastructure.
Pro tip:
Measure uptime externally, not just internally. Sometimes your internal checks look fine while users face DNS or CDN issues.
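It’s also worth internalizing what an SLA actually allows. The arithmetic is simple enough to keep in a helper:

```python
def allowed_downtime_minutes(sla_percent: float, period_days: int = 30) -> float:
    """How much downtime a given SLA leaves you per period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)

for sla in (99.0, 99.9, 99.99):
    print(f"{sla}% -> {allowed_downtime_minutes(sla):.1f} min/month")
# 99.0%  -> 432.0 min/month
# 99.9%  -> 43.2 min/month
# 99.99% -> 4.3 min/month
```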
13. Queue Metrics – Don’t Let the Backlog Grow
If your system relies on background jobs, queues, or messaging systems like Kafka or RabbitMQ, queue health is critical.
What to measure:
- Queue length and processing rate
- Job age and retry count
- Consumer lag (Kafka)
Why it matters:
Growing queues mean your workers aren’t keeping up. It can indicate bottlenecks, worker crashes, or input spikes.
Pro tip:
Watch for trending growth in queue length, not just spikes. A slowly growing backlog is often worse; it signals sustained undercapacity.
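Detecting “trending growth” can be crude and still useful: flag when every sample in a window is larger than the last. A sketch, with the window size and queue depths as made-up values:

```python
from collections import deque

class BacklogWatch:
    """Flags sustained queue growth rather than one-off spikes."""

    def __init__(self, window: int = 10):
        self.samples: deque[int] = deque(maxlen=window)

    def observe(self, queue_length: int) -> bool:
        self.samples.append(queue_length)
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough data yet
        pairs = zip(self.samples, list(self.samples)[1:])
        return all(later > earlier for earlier, later in pairs)

watch = BacklogWatch()
for depth in (100, 120, 150, 180, 220, 260, 310, 370, 440, 520):
    if watch.observe(depth):
        print(f"sustained backlog growth at depth {depth}; scale workers?")
```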
14. Deployment and Version Metrics
Ever seen performance tank right after a deploy? Version tracking can save hours of guesswork.
What to measure:
- Version tags per request or container
- Deploy frequency and failure rate
- Rollback count
Why it matters:
By correlating metrics with deployment versions, you can quickly identify regressions or bugs introduced in specific releases.
Pro tip:
Tag metrics and traces with the deployment version. You’ll thank yourself later when debugging.
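Tagging is often one line at startup. A sketch using prometheus_client’s Info metric; APP_VERSION and GIT_SHA are assumed environment variables that your CI/CD pipeline would set:

```python
import os

from prometheus_client import Info

# Expose the running version as a metric so dashboards can overlay deploys.
build_info = Info("app_build", "Build/version info for the running process")
build_info.info({
    "version": os.environ.get("APP_VERSION", "unknown"),
    "commit": os.environ.get("GIT_SHA", "unknown"),
})
```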
15. User Experience Metrics
Your backend might look healthy, but what about the actual experience users get?
What to measure:
- Frontend load time and Core Web Vitals (LCP, CLS, and INP, which replaced FID)
- Mobile app response times
- API latency from user regions
Why it matters:
Sometimes the bottleneck is the browser, the CDN, or the client network. These metrics give a user-centric perspective of performance.
Pro tip:
Combine RUM (Real User Monitoring) with APM data. Together, they close the feedback loop between system health and user experience.
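One way to close that loop is a small beacon endpoint that receives browser-reported vitals. A hedged sketch with Flask; the route and payload shape are assumptions, designed to pair with the web-vitals JavaScript library, which reports values in milliseconds:

```python
# pip install flask prometheus_client
from flask import Flask, request
from prometheus_client import Histogram

app = Flask(__name__)

# Buckets loosely follow the Web Vitals "good" threshold of 2.5s for LCP.
LCP = Histogram(
    "web_vitals_lcp_seconds",
    "Largest Contentful Paint, as reported by real browsers",
    buckets=(0.5, 1.0, 1.5, 2.5, 4.0, 8.0),
)

@app.post("/vitals")
def vitals():
    data = request.get_json(force=True)
    if data.get("name") == "LCP":
        LCP.observe(data["value"] / 1000)  # milliseconds to seconds
    return "", 204
```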
Bringing It All Together
Monitoring production isn’t about collecting every possible metric; it’s about choosing the right ones that tell you when something’s off.
A solid approach is to group your metrics around the four “Golden Signals” from Google’s SRE book:
- Latency
- Traffic
- Errors
- Saturation
If you monitor those across your services, you’ll catch most production issues before your users do.
Then, layer in:
- Business KPIs (to track real impact)
- Dependency health (to catch third-party outages)
- Deployment visibility (to connect changes with performance)
Final Thoughts
Good monitoring is like good storytelling. It helps you see what’s happening behind the scenes. As engineers, our goal isn’t just to put out fires but to build systems that tell us when they’re getting too hot.
Start small, pick a few key metrics from this list, and evolve your dashboards as your system grows. Over time, you’ll develop a sixth sense for production health, backed not by intuition but by solid, observable data.
And when that next PagerDuty alert goes off at 3 AM, you’ll know exactly where to look.