If you’ve ever deployed an application to production, you know the sinking feeling that comes with a PagerDuty alert at 3 AM. It’s not the alert itself that hurts; it’s the uncertainty. What went wrong? Is it the database? The network? A slow API? A memory leak?
Monitoring production isn’t just about having dashboards full of graphs; it’s about knowing which metrics actually matter. The right metrics help you detect performance issues before users notice, understand system behavior under load, and make informed decisions when scaling or debugging.
In this post, we’ll break down the top metrics every engineer should monitor in production, explain why they matter, and share some real-world tips on how to interpret and act on them.
1. Latency – The First Signal of Trouble
Latency is often the canary in the coal mine. Whether it’s a slow database query or an external API dragging you down, latency tells you how responsive your system feels to end users.
What to measure:
- API response time (p50, p95, p99)
- Database query duration
- External service call latency
- Page load times (for front-end monitoring)
Why it matters:
High latency doesn’t always mean failure, but it often means frustration. Users might tolerate a few errors, but they won’t wait forever. A 500ms delay might sound small, but across thousands of requests, it can crush throughput and user experience.
Pro tip:
Always look beyond the average. Tail latency (p95/p99) reveals how bad things get for your slowest requests, and that’s what users remember.
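To see the tail at all, you need a histogram rather than a single average. Here’s a minimal sketch using Python’s prometheus_client; the metric name and bucket boundaries are illustrative choices, not a standard:

```python
# pip install prometheus_client
import random
import time

from prometheus_client import Histogram, start_http_server

# Buckets chosen to bracket a typical web SLO; tune them to your latency profile.
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "Time spent handling a request",
    buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)

def handle_request():
    # The .time() context manager observes elapsed time into the histogram.
    with REQUEST_LATENCY.time():
        time.sleep(random.uniform(0.01, 0.3))  # stand-in for real work

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for your scraper
    while True:
        handle_request()
```

From the exported buckets, your query layer (for example, PromQL’s histogram_quantile) can derive p50/p95/p99; an average alone would hide exactly the tail this section warns about.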
2. Error Rates – The Health Pulse of Your Application
You can have blazing-fast responses, but if half of them are 500 errors, your system isn’t healthy. Error rate monitoring helps you catch exceptions, failed requests, and dependency issues early.
What to measure:
- HTTP 4xx and 5xx responses
- Exception rates (application-level errors)
- Failed background jobs
- Timeout errors from dependencies
Why it matters:
Errors don’t just indicate broken code; they often reveal systemic issues: bad deployments, exhausted DB connections, API limits, or missing environment configs.
Pro tip:
Correlate your error spikes with deployment events or dependency changes. You’ll often find the cause hiding right there.
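That correlation is much easier if your error metric carries labels you can slice by. A hedged sketch with prometheus_client; the metric and label names are my own convention:

```python
from prometheus_client import Counter

# Labelled counter: slice error spikes by route and status code when debugging.
HTTP_ERRORS = Counter(
    "http_errors_total",
    "Count of 4xx/5xx responses",
    ["route", "status"],
)

def record_response(route: str, status_code: int) -> None:
    # Only 4xx/5xx responses land in this metric; successes stay out.
    if status_code >= 400:
        HTTP_ERRORS.labels(route=route, status=str(status_code)).inc()

record_response("/checkout", 502)
record_response("/login", 401)
```

Overlay http_errors_total on a dashboard with deploy markers, and the cause of a spike is often visible at a glance.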
3. Throughput – How Much Work Is Happening
Throughput tells you how much traffic your system is handling, whether it’s requests per second, messages processed, or jobs completed. It’s the metric that connects performance with business activity.
What to measure:
- Requests per second (RPS) for APIs
- Jobs processed per minute for background workers
- Transactions or sessions completed per unit of time
Why it matters:
Sudden drops in throughput may indicate issues like bottlenecks, queuing delays, or even upstream outages. On the other hand, unexpected spikes could mean your system is under attack or your marketing team just launched a campaign.
Pro tip:
Always pair throughput with latency and error rate. High throughput and high latency usually mean an overloaded system. Low throughput with high errors means something’s broken.
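Here’s a small in-process sketch of a rolling requests-per-second counter. In practice you’d export a plain counter and let your metrics backend compute the rate, but the windowing logic is the same idea:

```python
import time
from collections import deque

class ThroughputTracker:
    """Rolling requests-per-second over a fixed window."""

    def __init__(self, window_seconds: float = 60.0):
        self.window = window_seconds
        self.timestamps = deque()

    def record(self) -> None:
        self.timestamps.append(time.monotonic())

    def rps(self) -> float:
        # Drop samples that have aged out of the window, then average.
        cutoff = time.monotonic() - self.window
        while self.timestamps and self.timestamps[0] < cutoff:
            self.timestamps.popleft()
        return len(self.timestamps) / self.window

tracker = ThroughputTracker()
tracker.record()
print(f"{tracker.rps():.2f} rps")
```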
4. CPU Usage – When the System Starts to Sweat
CPU usage is one of the most basic yet essential system metrics. It helps you understand how efficiently your application code and infrastructure resources are being utilized.
What to measure:
- CPU utilization (%) per container, node, or service
- Context switches and system load averages
- Process-level CPU usage (for your main app process)
Why it matters:
Sustained high CPU can mean inefficient code, tight loops, or runaway processes. Low CPU doesn’t always mean “all good” either; sometimes it signals that your app is idle due to bottlenecks elsewhere, like I/O wait.
Pro tip:
Plot CPU usage against request throughput. If CPU rises faster than throughput, your app might not be scaling linearly.
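A quick way to get all three views (machine, load average, process) is psutil. A minimal sketch, assuming the script runs inside the process you care about:

```python
# pip install psutil
import psutil

# Whole-machine view: percentage over a 1-second sampling interval.
print("system cpu %:", psutil.cpu_percent(interval=1))

# Load averages (1/5/15 min); compare against core count, not against 100%.
one, five, fifteen = psutil.getloadavg()
print(f"load avg: {one:.2f} / {five:.2f} / {fifteen:.2f} on {psutil.cpu_count()} cores")

# Process-level view for your main app process.
proc = psutil.Process()
proc.cpu_percent(interval=None)  # first call primes the measurement
print("process cpu %:", proc.cpu_percent(interval=1))
```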
5. Memory Utilization – Spotting Leaks and Inefficiencies
Memory is another silent killer in production systems. A small leak or unoptimized cache can slowly eat up RAM until your service crashes or the OOM killer strikes.
What to measure:
- Memory usage per process or container
- Heap usage (for managed runtimes like the JVM, Node.js, or Python)
- Garbage collection frequency and duration
Why it matters:
Monitoring memory helps you catch leaks early and identify inefficient patterns, such as caching too aggressively or holding onto large objects.
Pro tip:
Set alerts for steady upward trends over time rather than static thresholds. Memory leaks often grow slowly, not in bursts.
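One way to detect that upward trend is to fit a line to RSS samples instead of comparing against a fixed ceiling. A sketch using psutil and the standard library (statistics.linear_regression needs Python 3.10+); the sampling cadence and alert threshold are illustrative:

```python
import statistics
import time

import psutil  # pip install psutil

def rss_trend_mb_per_hour(samples: int = 30, interval_s: float = 10.0) -> float:
    """Sample our own RSS and fit a line; a persistently positive slope
    across many windows is the leak signature, not any single reading."""
    proc = psutil.Process()
    xs, ys = [], []
    for i in range(samples):
        xs.append(i * interval_s)
        ys.append(proc.memory_info().rss / 1024 / 1024)  # MB
        time.sleep(interval_s)
    slope, _intercept = statistics.linear_regression(xs, ys)
    return slope * 3600  # MB per hour

# Alerting idea (threshold is illustrative, not a standard):
# if rss_trend_mb_per_hour() > 50: page someone before the OOM killer does.
```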
6. Disk I/O and Storage Utilization
When your system suddenly becomes I/O bound, everything slows down. Disk read/write speeds and available storage directly impact performance, especially for databases and logging-heavy services.
What to measure:
- Disk read/write latency
- Disk queue length
- Storage usage per volume
Why it matters:
Full disks lead to failed writes, corrupted logs, and crashed services. Slow I/O can make even simple queries crawl.
Pro tip:
Monitor free disk space and inode usage (yes, that can fill up too). Rotate logs and prune old data regularly.
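Both checks fit in a few lines of standard-library Python; note that statvfs (and therefore the inode check) is Unix-only:

```python
import os
import shutil

# Free space per volume; alert well before 100%.
usage = shutil.disk_usage("/")
print(f"disk used: {usage.used / usage.total:.1%}")

# Inode usage: a volume can be "full" with plenty of bytes free.
st = os.statvfs("/")
if st.f_files:  # some filesystems report 0 total inodes
    print(f"inodes used: {1 - st.f_ffree / st.f_files:.1%}")
```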
7. Network Metrics – The Hidden Layer of Performance
If you’ve ever debugged a “slow system” that turned out to be a DNS issue, you know how crucial network visibility is.
What to measure:
- Network latency (ping times, connection setup times)
- Packet loss and retransmission rates
- Bandwidth usage per node or service
Why it matters:
Network problems often manifest as app-level latency or timeouts. By monitoring network metrics, you can quickly tell whether the issue is inside your app or somewhere upstream.
Pro tip:
Correlate network latency spikes with dependency performance. You might find that your “slow database” is just a network hop away.
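When you suspect the network, it helps to split “slow” into DNS time versus connection time. A standard-library sketch; example.com is a placeholder host:

```python
import socket
import time

def probe(host: str, port: int = 443) -> tuple[float, float]:
    """Measure DNS resolution and TCP connect time separately."""
    t0 = time.perf_counter()
    socket.getaddrinfo(host, port)  # DNS resolution
    dns_s = time.perf_counter() - t0

    t1 = time.perf_counter()
    with socket.create_connection((host, port), timeout=3):  # TCP handshake
        pass
    connect_s = time.perf_counter() - t1
    return dns_s, connect_s

dns_s, connect_s = probe("example.com")
print(f"dns: {dns_s * 1000:.1f} ms, connect: {connect_s * 1000:.1f} ms")
```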
8. Database Metrics – Your App’s Backbone
Databases are often the bottleneck in production systems. Even with well-optimized queries, indexing strategies, and connection pooling, they deserve dedicated monitoring.
What to measure:
- Query latency and slow query count
- Connection pool utilization
- Cache hit/miss ratio
- Lock wait times
Why it matters:
A few slow queries can cascade into request timeouts and user frustration. Monitoring helps identify hotspots and scaling needs.
Pro tip:
Enable slow query logs and trace them in your APM tool. You’ll often find 80% of slowness coming from 20% of queries.
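If your database or APM tool isn’t wired up yet, you can get an app-side view with a simple decorator. The threshold here is illustrative, and fetch_orders is a hypothetical data-access function:

```python
import functools
import logging
import time

logger = logging.getLogger("slow_queries")
SLOW_QUERY_SECONDS = 0.2  # illustrative; tune to your SLO

def log_if_slow(func):
    """Log any wrapped call that exceeds the threshold. This captures the
    app's view, including network and driver overhead, which the database's
    own slow-query log won't show."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            if elapsed > SLOW_QUERY_SECONDS:
                logger.warning("slow query: %s took %.3fs", func.__name__, elapsed)
    return wrapper

@log_if_slow
def fetch_orders(user_id):  # hypothetical data-access function
    ...
```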
9. Application-Specific Business Metrics
Not every important metric is technical. Sometimes the best way to detect issues is by monitoring your business metrics.
What to measure:
- Number of signups, checkouts, or orders
- Failed payments or cart abandonments
- API usage per customer
Why it matters:
A drop in business KPIs often signals underlying technical problems like an endpoint failing silently or a bug in a workflow.
Pro tip:
Tie technical metrics to user outcomes. If latency spikes align with checkout failures, you know where to dig.
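Business metrics can ride the same pipeline as technical ones. A sketch with prometheus_client; the metric names are examples, so pick ones that match your domain:

```python
from prometheus_client import Counter

CHECKOUTS = Counter("checkouts_total", "Completed checkouts")
FAILED_PAYMENTS = Counter(
    "failed_payments_total", "Payment attempts that failed", ["provider"]
)

def on_checkout_completed():
    CHECKOUTS.inc()

def on_payment_failed(provider: str):
    # The label lets you tell a provider outage from a bug on your side.
    FAILED_PAYMENTS.labels(provider=provider).inc()
```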
10. External Dependencies – The “Unknown Unknowns”
Modern systems rely heavily on third-party APIs, SaaS tools, and microservices. These dependencies are often outside your control, but you still need visibility.
What to measure:
- API response times and error rates from external services
- Dependency uptime (ping checks, synthetic tests)
- Circuit breaker trips (if you use one)
Why it matters:
An outage in a payment gateway or an authentication provider can take your service down just as effectively as your own bugs.
Pro tip:
Set up synthetic monitoring for critical external endpoints. Don’t wait for customers to tell you Stripe is down.
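A synthetic check can be as simple as a scheduled script that hits a health endpoint and records the result. A sketch with requests; the URL is a placeholder for a real status or health endpoint:

```python
# pip install requests
import requests

def check_dependency(url: str, timeout_s: float = 5.0) -> dict:
    """Probe an external dependency; run this on a schedule (cron, CronJob)
    and alert on failures or slow responses."""
    try:
        resp = requests.get(url, timeout=timeout_s)
        return {
            "up": resp.status_code < 500,
            "status": resp.status_code,
            "latency_s": resp.elapsed.total_seconds(),
        }
    except requests.RequestException as exc:
        return {"up": False, "error": str(exc)}

print(check_dependency("https://api.example.com/health"))
```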
11. Log Volume and Patterns
Logs tell stories when you know what to look for. Monitoring log patterns helps detect new exceptions, spikes in warning messages, or missing events that should have occurred.
What to measure:
- Log event count per service
- Error/warning log ratios
- Missing or unexpected log patterns
Why it matters:
A sudden drop in logs may mean a service isn’t running at all. A flood of error logs might point to an edge case that escaped your tests.
Pro tip:
Feed logs into a centralized aggregator (like ELK, Loki, or Datadog Logs) and tag them by service and environment. That context is gold during debugging.
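Even before a full aggregator, you can watch level ratios in-process with a custom logging handler. A minimal sketch using only the standard library:

```python
import logging

class LevelCounterHandler(logging.Handler):
    """Counts log records per level, so you can watch the error/warning
    ratio or notice when a normally chatty service goes silent."""

    def __init__(self):
        super().__init__()
        self.counts: dict[str, int] = {}

    def emit(self, record: logging.LogRecord) -> None:
        self.counts[record.levelname] = self.counts.get(record.levelname, 0) + 1

counter = LevelCounterHandler()
logging.getLogger().addHandler(counter)

logging.warning("disk filling up")
logging.error("payment webhook failed")
print(counter.counts)  # {'WARNING': 1, 'ERROR': 1}
```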
12. Uptime and Availability
At the end of the day, uptime is the simplest but most important metric. None of the fancier metrics matter if your service isn’t reachable.
What to measure:
- Service uptime percentage (SLA tracking)
- Endpoint availability via synthetic checks
- Regional availability (for multi-region setups)
Why it matters:
Every minute of downtime costs revenue and reputation. Monitoring helps ensure you meet SLAs and identify weak spots in your infrastructure.
Pro tip:
Measure uptime externally, not just internally. Sometimes your internal checks look fine while users face DNS or CDN issues.
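It’s also worth internalizing what an SLA actually allows. The arithmetic is simple enough to keep in a helper:

```python
def allowed_downtime_minutes(sla_percent: float, period_days: int = 30) -> float:
    """How much downtime a given SLA leaves you per period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)

for sla in (99.0, 99.9, 99.99):
    print(f"{sla}% -> {allowed_downtime_minutes(sla):.1f} min/month")
# 99.0%  -> 432.0 min/month
# 99.9%  -> 43.2 min/month
# 99.99% -> 4.3 min/month
```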
13. Queue Metrics – Don’t Let the Backlog Grow
If your system relies on background jobs, queues, or messaging systems like Kafka or RabbitMQ, queue health is critical.
What to measure:
- Queue length and processing rate
- Job age and retry count
- Consumer lag (Kafka)
Why it matters:
Growing queues mean your workers aren’t keeping up. It can indicate bottlenecks, worker crashes, or input spikes.
Pro tip:
Watch for trending growth in queue length, not just spikes. A slowly growing backlog is often worse; it signals sustained undercapacity.
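Detecting “trending growth” can be crude and still useful: flag when every sample in a window is larger than the last. A sketch, with the window size and queue depths as made-up values:

```python
from collections import deque

class BacklogWatch:
    """Flags sustained queue growth rather than one-off spikes."""

    def __init__(self, window: int = 10):
        self.samples: deque[int] = deque(maxlen=window)

    def observe(self, queue_length: int) -> bool:
        self.samples.append(queue_length)
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough data yet
        pairs = zip(self.samples, list(self.samples)[1:])
        return all(later > earlier for earlier, later in pairs)

watch = BacklogWatch()
for depth in (100, 120, 150, 180, 220, 260, 310, 370, 440, 520):
    if watch.observe(depth):
        print(f"sustained backlog growth at depth {depth}; scale workers?")
```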
14. Deployment and Version Metrics
Ever seen performance tank right after a deploy? Version tracking can save hours of guesswork.
What to measure:
- Version tags per request or container
- Deploy frequency and failure rate
- Rollback count
Why it matters:
By correlating metrics with deployment versions, you can quickly identify regressions or bugs introduced in specific releases.
Pro tip:
Tag metrics and traces with the deployment version. You’ll thank yourself later when debugging.
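Tagging is often one line at startup. A sketch using prometheus_client’s Info metric; APP_VERSION and GIT_SHA are assumed environment variables that your CI/CD pipeline would set:

```python
import os

from prometheus_client import Info

# Expose the running version as a metric so dashboards can overlay deploys.
build_info = Info("app_build", "Build/version info for the running process")
build_info.info({
    "version": os.environ.get("APP_VERSION", "unknown"),
    "commit": os.environ.get("GIT_SHA", "unknown"),
})
```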
15. User Experience Metrics
Your backend might look healthy, but what about the actual experience users get?
What to measure:
- Frontend load time and Core Web Vitals (LCP, CLS, and INP, which replaced FID)
- Mobile app response times
- API latency from user regions
Why it matters:
Sometimes the bottleneck is the browser, the CDN, or the client network. These metrics give a user-centric perspective of performance.
Pro tip:
Combine RUM (Real User Monitoring) with APM data. Together, they close the feedback loop between system health and user experience.
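One way to close that loop is a small beacon endpoint that receives browser-reported vitals. A hedged sketch with Flask; the route and payload shape are assumptions, designed to pair with the web-vitals JavaScript library, which reports values in milliseconds:

```python
# pip install flask prometheus_client
from flask import Flask, request
from prometheus_client import Histogram

app = Flask(__name__)

# Buckets loosely follow the Web Vitals "good" threshold of 2.5s for LCP.
LCP = Histogram(
    "web_vitals_lcp_seconds",
    "Largest Contentful Paint, as reported by real browsers",
    buckets=(0.5, 1.0, 1.5, 2.5, 4.0, 8.0),
)

@app.post("/vitals")
def vitals():
    data = request.get_json(force=True)
    if data.get("name") == "LCP":
        LCP.observe(data["value"] / 1000)  # milliseconds to seconds
    return "", 204
```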
Bringing It All Together
Monitoring production isn’t about collecting every possible metric; it’s about choosing the right ones that tell you when something’s off.
A solid approach is to group your metrics around the four “Golden Signals” from Google’s SRE book:
- Latency
- Traffic
- Errors
- Saturation
If you monitor those across your services, you’ll catch most production issues before your users do.
Then, layer in:
- Business KPIs (to track real impact)
- Dependency health (to catch third-party outages)
- Deployment visibility (to connect changes with performance)
Final Thoughts
Good monitoring is like good storytelling. It helps you see what’s happening behind the scenes. As engineers, our goal isn’t just to put out fires but to build systems that tell us when they’re getting too hot.
Start small, pick a few key metrics from this list, and evolve your dashboards as your system grows. Over time, you’ll develop a sixth sense for production health, backed not by intuition but by solid, observable data.
And when that next PagerDuty alert goes off at 3 AM, you’ll know exactly where to look.