<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vibhuti Sharma</title>
    <description>The latest articles on DEV Community by Vibhuti Sharma (@vibhuti_sharma).</description>
    <link>https://dev.to/vibhuti_sharma</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3240942%2F03c25aa6-7df8-4907-aaba-f7015318fb62.png</url>
      <title>DEV Community: Vibhuti Sharma</title>
      <link>https://dev.to/vibhuti_sharma</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vibhuti_sharma"/>
    <language>en</language>
    <item>
      <title>Monitoring AWS Batch Jobs with CloudWatch Custom Metrics</title>
      <dc:creator>Vibhuti Sharma</dc:creator>
      <pubDate>Thu, 26 Mar 2026 07:38:58 +0000</pubDate>
      <link>https://dev.to/vibhuti_sharma/monitoring-aws-batch-jobs-with-cloudwatch-custom-metrics-25ke</link>
      <guid>https://dev.to/vibhuti_sharma/monitoring-aws-batch-jobs-with-cloudwatch-custom-metrics-25ke</guid>
      <description>&lt;p&gt;AWS Batch service is used for various compute workloads like data processing pipelines, background jobs and scheduled compute tasks. AWS provides many infrastructure-level metrics for Batch in CloudWatch, however there is a significant gap when it comes to job status monitoring. For example, the number of jobs that are RUNNABLE, RUNNING, FAILED, or SUCCEEDED are not available by default in CloudWatch. These metrics are visible on the AWS Batch dashboard but it does not exist in CloudWatch as a metric. This makes it difficult to answer operational questions such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Are jobs accumulating in a RUNNABLE state?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Are the jobs failing frequently?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Is the system keeping up with the workload?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without these metrics, building meaningful dashboards or alerts for Batch workloads becomes challenging.&lt;/p&gt;

&lt;p&gt;In this blog post, we will close this observability gap by exporting custom AWS Batch job status metrics to CloudWatch, where they can be consumed by any third-party observability tool. The post walks through a custom setup that exports these metrics using EventBridge, a Lambda function, the Batch API, and CloudWatch custom metrics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which AWS Batch metrics does CloudWatch publish by default?
&lt;/h2&gt;

&lt;p&gt;AWS Batch publishes a limited set of infrastructure metrics to CloudWatch under the ECS/ContainerInsights namespace (Container Insights). These metrics primarily describe compute environment capacity and resource utilization rather than the status of jobs.&lt;/p&gt;

&lt;p&gt;Examples of metrics available by default include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;StorageReadBytes: number of bytes read from storage on the instance&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;NetworkTxBytes: number of bytes transmitted by the resource&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CpuReserved: CPU units reserved by tasks in the resource&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;StorageWriteBytes: number of bytes written to storage in the resource&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;EphemeralStorageReserved: number of bytes reserved from ephemeral storage in the resource&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;TaskCount: number of tasks running in the cluster&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;MemoryReserved: memory that is reserved by tasks in the resource&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;EphemeralStorageUtilized: number of bytes used from ephemeral storage in the resource&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;NetworkRxBytes: number of bytes received by the resource&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CpuUtilized: CPU units used by tasks in the resource&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ServiceCount: number of services in the cluster&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ContainerInstanceCount: number of EC2 instances running the Amazon ECS agent that are registered with a cluster&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;MemoryUtilized: memory being used by tasks in the resource&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, job status metrics are not published by default. These missing metrics include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Number of RUNNABLE jobs&lt;/li&gt;
&lt;li&gt;Number of RUNNING jobs&lt;/li&gt;
&lt;li&gt;Number of FAILED jobs&lt;/li&gt;
&lt;li&gt;Number of SUCCEEDED jobs&lt;/li&gt;
&lt;li&gt;Number of SUBMITTED jobs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These metrics are critical for monitoring Batch workloads because they indicate system health and throughput.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A growing RUNNABLE job count may indicate insufficient compute capacity.&lt;/li&gt;
&lt;li&gt;A spike in FAILED jobs may indicate application or infrastructure issues.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To obtain these metrics, we need to query the AWS Batch API and publish the results ourselves.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to export AWS Batch job status metrics to CloudWatch
&lt;/h2&gt;

&lt;p&gt;In this solution, we periodically query the AWS Batch API for job states and publish the per-status job counts as custom CloudWatch metrics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Components used&lt;/strong&gt;&lt;br&gt;
The setup consists of four components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;EventBridge Rule
&lt;ul&gt;
&lt;li&gt;Runs on a schedule (for example, every 5 minutes)&lt;/li&gt;
&lt;li&gt;Triggers a Lambda function&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Lambda Function
&lt;ul&gt;
&lt;li&gt;Calls the AWS Batch API&lt;/li&gt;
&lt;li&gt;Retrieves job counts by status&lt;/li&gt;
&lt;li&gt;Aggregates the results&lt;/li&gt;
&lt;li&gt;Publishes metrics to CloudWatch&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;AWS Batch API
&lt;ul&gt;
&lt;li&gt;Provides job information through API calls such as ListJobs, which returns a jobSummaryList&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;CloudWatch Custom Metrics
&lt;ul&gt;
&lt;li&gt;Stores the exported job status metrics&lt;/li&gt;
&lt;li&gt;Exposes them for dashboards and alerts&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Workflow&lt;/strong&gt;&lt;br&gt;
The process works as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;EventBridge triggers the Lambda function on a schedule.&lt;/li&gt;
&lt;li&gt;Lambda queries AWS Batch for job counts in each status.&lt;/li&gt;
&lt;li&gt;The job counts are aggregated.&lt;/li&gt;
&lt;li&gt;Lambda publishes these counts as custom metrics to CloudWatch.&lt;/li&gt;
&lt;li&gt;Grafana or another observability tool reads these metrics from CloudWatch.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This architecture is serverless, inexpensive, and easy to extend, and it does not require changes to existing Batch workloads.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxv7x197ggt51ykdo77v2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxv7x197ggt51ykdo77v2.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Cost considerations
&lt;/h2&gt;

&lt;p&gt;The cost of this setup is minimal because it relies entirely on serverless services and lightweight API calls.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;EventBridge:&lt;/strong&gt; EventBridge scheduled rules cost a fraction of a cent per million invocations. With a schedule of every 5 minutes, the cost is negligible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lambda:&lt;/strong&gt; The Lambda function only performs a small number of API calls and executes for a short duration. In most cases, this will remain well within the Lambda free tier.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CloudWatch Custom Metrics:&lt;/strong&gt; CloudWatch custom metrics are the primary cost factor. CloudWatch charges per metric per month. However, since the setup only publishes a small number of metrics (typically 4–6), the total cost remains low.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For example, publishing metrics for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RUNNABLE&lt;/li&gt;
&lt;li&gt;RUNNING&lt;/li&gt;
&lt;li&gt;FAILED&lt;/li&gt;
&lt;li&gt;SUCCEEDED&lt;/li&gt;
&lt;li&gt;SUBMITTED&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This results in only a handful of custom metrics. Overall, the monthly cost of this setup is typically very small compared to the operational visibility it provides.&lt;/p&gt;
&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;

&lt;p&gt;The implementation consists of four main steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Creating the Lambda function&lt;/li&gt;
&lt;li&gt;Querying the Batch API&lt;/li&gt;
&lt;li&gt;Publishing metrics to CloudWatch&lt;/li&gt;
&lt;li&gt;Scheduling the function using EventBridge&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Lambda function logic
&lt;/h2&gt;

&lt;p&gt;The Lambda function performs the following actions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retrieves job queues&lt;/li&gt;
&lt;li&gt;Queries jobs by status&lt;/li&gt;
&lt;li&gt;Counts jobs in each state&lt;/li&gt;
&lt;li&gt;Publishes the counts to CloudWatch&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example Python implementation:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;

&lt;span class="n"&gt;batch_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;batch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cloudwatch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloudwatch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;batch_metrics_exporter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;job_queues&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JOB_QUEUES&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;compute_env_mapping&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;COMPUTE_ENV_MAPPING&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

        &lt;span class="n"&gt;all_metrics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;queue_name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;job_queues&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;job_counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_job_counts_by_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;queue_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;compute_env&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;compute_env_mapping&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;queue_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;unknown&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;job_counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="n"&gt;all_metrics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;MetricName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;Jobs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Dimensions&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;JobQueue&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;queue_name&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ComputeEnvironment&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;compute_env&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
                    &lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Unit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Timestamp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt;
                &lt;span class="p"&gt;})&lt;/span&gt;

        &lt;span class="n"&gt;batch_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_metrics&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;cloudwatch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put_metric_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;Namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;YOUR_NAMESPACE&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;MetricData&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;all_metrics&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;statusCode&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Successfully published &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_metrics&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; metrics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;queues_processed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;job_queues&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error in lambda_handler: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;statusCode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)})}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_job_counts_by_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job_queue_name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;job_statuses&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SUBMITTED&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PENDING&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;RUNNABLE&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;STARTING&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;RUNNING&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SUCCEEDED&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;FAILED&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;job_counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Submitted&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Runnable&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Running&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Failed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Succeeded&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;job_statuses&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;batch_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_jobs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jobQueue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;job_queue_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jobStatus&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;jobSummaryList&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]))&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SUBMITTED&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PENDING&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
                &lt;span class="n"&gt;job_counts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Submitted&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;
            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;RUNNABLE&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;STARTING&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
                &lt;span class="n"&gt;job_counts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Runnable&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;
            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;RUNNING&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;job_counts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Running&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;
            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;FAILED&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;job_counts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Failed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;
            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SUCCEEDED&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;job_counts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Succeeded&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error getting job counts for queue &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;job_queue_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;job_counts&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
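&lt;p&gt;One caveat about the example above: list_jobs returns at most 100 job summaries per call, so busy queues can be undercounted. A paginator-based counting helper (a sketch; count_jobs_paginated is not part of the original function, and batch_client is assumed to be a boto3 Batch client created elsewhere) would look like this:&lt;/p&gt;

```python
def count_jobs_paginated(batch_client, job_queue_name, status):
    """Count every job in a given status, following pagination.

    batch_client is assumed to be a boto3 Batch client, e.g.
    boto3.client("batch"). list_jobs returns at most 100 summaries
    per page, so we walk every page via the built-in paginator.
    """
    paginator = batch_client.get_paginator("list_jobs")
    total = 0
    for page in paginator.paginate(jobQueue=job_queue_name, jobStatus=status):
        total += len(page.get("jobSummaryList", []))
    return total
```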

&lt;h2&gt;
  
  
  IAM permissions required for the Lambda function
&lt;/h2&gt;

&lt;p&gt;The Lambda function requires permissions for Batch APIs and CloudWatch metrics.&lt;/p&gt;

&lt;p&gt;Minimal IAM policy example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"batch:DescribeJobQueues"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"batch:ListJobs"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cloudwatch:PutMetricData"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Setting up the EventBridge schedule&lt;/strong&gt;&lt;br&gt;
Create an EventBridge rule with a schedule expression, and set the rate to match how fresh you need the metrics to be.&lt;/p&gt;

&lt;p&gt;Example: &lt;code&gt;rate(5 minutes)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Attach the Lambda function as the target. This ensures the metrics are refreshed periodically.&lt;/p&gt;
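&lt;p&gt;As a sketch, the schedule can be wired up with the AWS CLI. The rule name, function name, account ID, and region below are placeholders and should be replaced with your own:&lt;/p&gt;

```shell
# Create a rule that fires every 5 minutes (rule name is a placeholder)
aws events put-rule \
  --name batch-job-metrics-schedule \
  --schedule-expression "rate(5 minutes)"

# Point the rule at the Lambda function (replace the ARN with yours)
aws events put-targets \
  --rule batch-job-metrics-schedule \
  --targets 'Id=1,Arn=arn:aws:lambda:us-east-1:123456789012:function:batch-job-metrics'

# Allow EventBridge to invoke the function
aws lambda add-permission \
  --function-name batch-job-metrics \
  --statement-id allow-eventbridge \
  --action lambda:InvokeFunction \
  --principal events.amazonaws.com \
  --source-arn arn:aws:events:us-east-1:123456789012:rule/batch-job-metrics-schedule
```

&lt;p&gt;Terraform or CloudFormation work equally well; the essential pieces are the schedule expression, the Lambda target, and the invoke permission.&lt;/p&gt;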

&lt;h2&gt;
  
  
  How to verify AWS Batch metrics in CloudWatch
&lt;/h2&gt;

&lt;p&gt;Once the Lambda function begins publishing metrics, they will appear in CloudWatch under the custom namespace used in the implementation.&lt;/p&gt;

&lt;p&gt;Navigate to: CloudWatch → Metrics → the custom namespace published by the Lambda function.&lt;/p&gt;

&lt;p&gt;You should see metrics such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JobCount (Status=RUNNABLE)&lt;/li&gt;
&lt;li&gt;JobCount (Status=RUNNING)&lt;/li&gt;
&lt;li&gt;JobCount (Status=FAILED)&lt;/li&gt;
&lt;li&gt;JobCount (Status=SUCCEEDED)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each metric will also include dimensions such as the job queue name, allowing filtering per queue. These metrics can now be queried, visualized, or used for alerting.&lt;/p&gt;
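&lt;p&gt;For alerting, an alarm on a growing RUNNABLE backlog could be sketched with the AWS CLI. The namespace, dimension names, threshold, and SNS topic here are assumptions and must match whatever your Lambda actually publishes:&lt;/p&gt;

```shell
# Alarm when the RUNNABLE backlog stays above 50 jobs
# (namespace, dimensions, and ARN are placeholders)
aws cloudwatch put-metric-alarm \
  --alarm-name batch-runnable-backlog \
  --namespace "AWSBatch/JobMetrics" \
  --metric-name JobCount \
  --dimensions Name=Status,Value=RUNNABLE Name=JobQueue,Value=my-job-queue \
  --statistic Maximum \
  --period 300 \
  --evaluation-periods 3 \
  --threshold 50 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:batch-alerts
```

&lt;p&gt;With three 5-minute evaluation periods, the alarm fires only after the backlog stays above the threshold for roughly 15 minutes, which avoids paging on brief spikes.&lt;/p&gt;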

&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;p&gt;One of the limitations of the Batch API is that it returns point-in-time snapshots rather than time series data.&lt;/p&gt;

&lt;p&gt;This means the metrics represent the number of jobs in each state at the moment the Lambda function runs, rather than a continuous stream of job events.&lt;/p&gt;

&lt;p&gt;However, this limitation can be worked around downstream: if the snapshots are scraped into a time-series system such as Prometheus, PromQL queries (visualized in Grafana, for example) can derive trends across them.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deriving job failure rates&lt;/li&gt;
&lt;li&gt;Calculating trends in runnable job backlog&lt;/li&gt;
&lt;li&gt;Detecting abnormal changes in job states&lt;/li&gt;
&lt;/ul&gt;
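&lt;p&gt;As an illustration of deriving a rate from point-in-time snapshots, the same idea can be sketched in plain Python. The function and sample data are hypothetical and not part of the Lambda above:&lt;/p&gt;

```python
def failures_per_minute(samples):
    """Estimate a failure rate from chronological (timestamp_seconds, failed_count)
    snapshots, such as those published by the scheduled Lambda.

    A drop in the counter (e.g. jobs aging out of the FAILED list) is
    treated as zero new failures rather than a negative rate.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        new_failures = max(c1 - c0, 0)   # ignore counter resets
        minutes = (t1 - t0) / 60.0
        rates.append(new_failures / minutes if minutes > 0 else 0.0)
    return rates

# Snapshots taken 5 minutes apart: 3 new failures, then none, then a reset
samples = [(0, 2), (300, 5), (600, 5), (900, 3)]
print(failures_per_minute(samples))  # [0.6, 0.0, 0.0]
```

&lt;p&gt;This is exactly what PromQL functions like &lt;code&gt;rate()&lt;/code&gt; do over scraped samples, which is why the snapshot limitation matters less once the metrics land in a time-series store.&lt;/p&gt;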

&lt;p&gt;Another limitation is data delay, which depends on the EventBridge schedule. If the rule runs every 5 minutes, the metrics can lag by up to five minutes.&lt;/p&gt;

&lt;p&gt;Reducing the schedule interval improves freshness but increases API usage.&lt;/p&gt;

&lt;h2&gt;
  
  
  When should you use this setup?
&lt;/h2&gt;

&lt;p&gt;This approach is most useful when Batch workloads involve long-running or heavy compute jobs. In such environments, understanding job queue health is important for operational stability.&lt;/p&gt;

&lt;p&gt;Examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data processing pipelines&lt;/li&gt;
&lt;li&gt;Machine learning workloads&lt;/li&gt;
&lt;li&gt;ETL systems&lt;/li&gt;
&lt;li&gt;Background compute services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, the setup may be less useful for very short-lived jobs that start and complete within seconds. In those cases, the scheduled polling approach may miss transient states.&lt;/p&gt;

&lt;p&gt;Therefore, this solution is most effective when jobs run long enough to be captured within the scheduled polling interval.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;AWS Batch provides strong compute orchestration capabilities but lacks native job-level observability metrics in CloudWatch. By combining EventBridge, Lambda, the Batch API, and CloudWatch custom metrics, it is possible to export job status metrics and integrate them into existing observability dashboards.&lt;/p&gt;

&lt;p&gt;This setup provides visibility into queue backlog, job failures, and system throughput, enabling better operational monitoring of Batch workloads. In practice, this solution has proven useful for tracking job health and building meaningful dashboards around Batch-based workloads. With minimal infrastructure and low cost, it significantly improves observability for production Batch environments.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.cloudraft.io" rel="noopener noreferrer"&gt;https://www.cloudraft.io&lt;/a&gt; on March 24, 2026.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>observ</category>
      <category>aws</category>
      <category>awsbatch</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Implementing Compliance-first Observability with OpenTelemetry</title>
      <dc:creator>Vibhuti Sharma</dc:creator>
      <pubDate>Fri, 20 Jun 2025 12:52:57 +0000</pubDate>
      <link>https://dev.to/vibhuti_sharma/implementing-compliance-first-observability-with-opentelemetry-14c4</link>
      <guid>https://dev.to/vibhuti_sharma/implementing-compliance-first-observability-with-opentelemetry-14c4</guid>
      <description>&lt;h2&gt;
  
  
  Observability isn’t optional and neither is Compliance
&lt;/h2&gt;

&lt;p&gt;When we talk about observability, one thing often gets missed in the conversation: compliance. We all know observability is essential. When you’re running any kind of modern application or infrastructure, having good visibility through logs, metrics, and traces is not just helpful, it’s how you keep systems stable, catch issues early, and move with confidence.&lt;/p&gt;

&lt;p&gt;But while we’ve gotten better at collecting and analyzing the data, we haven’t always paid enough attention to what that data contains or where it ends up. These logs and data can easily include sensitive information. Things like user details, access tokens, or internal system behavior often get logged without much thought. And if that data is exposed or mishandled, it turns into a serious risk both legally and operationally.&lt;/p&gt;

&lt;p&gt;In this blog post, I’ll walk you through how to build observability pipelines that are not only functional but also secure, compliant, and built with intention. We’ll look at how OpenTelemetry can help with that, and why its processors are one of the most effective ways to protect and control the flow of telemetry data.&lt;/p&gt;

&lt;h2&gt;
  
  
  What It Costs When Compliance Fails
&lt;/h2&gt;

&lt;p&gt;We often think of compliance as just legal formalities or contracts. But when compliance fails, the consequences are real and can hit a business far harder than expected. Even a small oversight, like an email address showing up in a debug log, or a trace containing sensitive user data being sent to an external service without proper filtering, can become a much larger issue. These incidents do not just violate policies; they also break customers’ trust, trigger audits, and can lead to financial and legal damage.&lt;/p&gt;

&lt;p&gt;According to the &lt;a href="https://www.ibm.com/reports/data-breach" rel="noopener noreferrer"&gt;IBM Cost of a Data Breach Report&lt;/a&gt;, the global average cost of a breach in 2024 was over 4.9 million dollars. In regulated industries such as healthcare, finance, and insurance, that number tends to be even higher. And the cost isn’t just about regulatory fines. A significant portion comes from the loss of business, system downtime, incident response, and long-term brand reputation issues. Even if your organization is not governed by strict regulations like GDPR, HIPAA, or PCI-DSS, your users still expect their data to be treated with care. Once trust is lost, it’s incredibly difficult to win back.&lt;/p&gt;

&lt;p&gt;That is why compliance can’t be treated as something you can add later. It has to be built into the foundation, and that includes how we handle observability data. When telemetry pipelines are left unguarded, they can quietly become one of the biggest liabilities in your stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenTelemetry and Its Role in Data Protection
&lt;/h2&gt;

&lt;p&gt;OpenTelemetry has quickly become the standard framework for collecting telemetry data across modern, distributed systems. It provides a consolidated approach to gathering logs, metrics, and traces and offers the flexibility to send that data to a variety of destinations, from observability platforms to self-hosted backends or data lakes.&lt;/p&gt;

&lt;p&gt;While OpenTelemetry is excellent at solving how data is collected and transported, it places the responsibility for what data is captured and how securely it is handled entirely on the user. Its flexibility is a strength, but also a risk. OpenTelemetry will not automatically prevent sensitive data from flowing through your pipelines.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What Kind of Telemetry Data Needs Protection?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Before going into the protection strategies, let’s first identify the types of data that could pose compliance risks. Common examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Personally Identifiable Information (PII):&lt;/strong&gt; emails, phone numbers, user IDs&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sensitive system metadata:&lt;/strong&gt; IP addresses, internal hostnames&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Confidential business context:&lt;/strong&gt; debug logs, internal environment tags&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Regulatory-bound attributes:&lt;/strong&gt; region identifiers (e.g., for GDPR)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If this data makes its way into your telemetry stream, it will continue through your system unless you explicitly configure rules to stop or modify it.&lt;/p&gt;

&lt;p&gt;This is where the &lt;a href="https://opentelemetry.io/docs/collector/" rel="noopener noreferrer"&gt;OpenTelemetry Collector&lt;/a&gt; becomes critical. Acting as the central hub between data sources and their destinations, the Collector offers a place where telemetry data can be inspected, filtered, transformed, or enriched before it moves any further. It is here that organizations gain control over what data stays, what data is modified, and what data never leaves the boundary at all.&lt;/p&gt;

&lt;p&gt;With the right configurations, the Collector becomes more than just a routing tool. It becomes a guardrail for enforcing data protection standards, filtering out sensitive information, and helping ensure compliance with security and privacy requirements. OpenTelemetry, when used thoughtfully, is not just an observability solution. It is a foundational piece of your data protection strategy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbse8cl43owfrtoxk4qm1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbse8cl43owfrtoxk4qm1.jpg" alt="OpenTelemetry Collector" width="800" height="487"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Solving Real Compliance Challenges with OpenTelemetry Processors
&lt;/h2&gt;

&lt;p&gt;Processors are one of the most important functions within the OpenTelemetry Collector when it comes to enforcing data protection and compliance. Positioned between data collection and export, they serve as the transformation and control layer where critical compliance logic can be applied before telemetry leaves your environment.&lt;/p&gt;

&lt;p&gt;The strength of processors lies in their flexibility. They allow you to redact, suppress, enrich, or reshape telemetry based on your organization's security and privacy requirements. This feature is essential when dealing with sensitive or regulated data flowing through your observability systems.&lt;/p&gt;

&lt;p&gt;Here are some of the practical ways processors help us address real-world compliance concerns:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Redacting Sensitive Information&lt;/strong&gt;&lt;br&gt;
Logs and traces often contain personal or confidential data such as email addresses, user IDs, or access tokens. Processors like attributesprocessor or transformprocessor can be configured to automatically remove or anonymize these values, helping prevent unintentional exposure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Filtering Non-Compliant Data&lt;/strong&gt;&lt;br&gt;
Telemetry that includes content that violates policies can be filtered out entirely before reaching any downstream systems. This helps reduce risk and ensures that observability does not become a liability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enforcing Data Residency and Routing Rules&lt;/strong&gt;&lt;br&gt;
For organizations subject to regional data protection laws, processors can route or drop telemetry based on attributes such as geography or service type. This ensures that data remains within defined boundaries and complies with jurisdictional requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Normalizing and Structuring Telemetry for Audits&lt;/strong&gt;&lt;br&gt;
Compliance often requires structured, consistent data. Processors can standardize field names, values, and formats so that logs, metrics, and traces align with internal audit and reporting standards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reducing Noise to Highlight What Matters&lt;/strong&gt;&lt;br&gt;
Not all telemetry is useful, and excessive data can obscure important signals. Processors help reduce noise by removing redundant spans or unnecessary attributes, making it easier to focus on meaningful insights while keeping compliance in check.&lt;/p&gt;
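&lt;p&gt;As one concrete sketch of noise reduction, the contrib &lt;strong&gt;probabilistic_sampler&lt;/strong&gt; processor keeps only a fraction of traces (the percentage here is illustrative, not a recommendation):&lt;/p&gt;

```yaml
processors:
  # Keep roughly 10% of traces; the rest are dropped before export
  probabilistic_sampler:
    sampling_percentage: 10
```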

&lt;p&gt;By configuring processors with compliance in mind, organizations can ensure that observability pipelines are secure, responsible, and aligned with compliance goals. This control layer not only supports regulatory requirements but also promotes better data hygiene and operational clarity. When designed properly, processors become more than just a technical feature; they represent a proactive step toward secure and compliant observability.&lt;/p&gt;
&lt;h2&gt;
  
  
  Practical Examples of Compliance-first Telemetry Pipelines
&lt;/h2&gt;

&lt;p&gt;After exploring the role of processors in enforcing compliance, let’s look at how to bring it all together in real-world telemetry pipelines. Building compliance-first observability is not just about theory; it is about designing workflows that consistently protect data across environments.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;List of Processors in OpenTelemetry Collector&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor" rel="noopener noreferrer"&gt;OpenTelemetry Collector provides several processors&lt;/a&gt; out of the box. The most commonly used ones for compliance are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;attributesprocessor&lt;/strong&gt; – Add, remove, update, or redact specific attributes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;filterprocessor&lt;/strong&gt; – Filter spans or logs based on matching criteria&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;routingprocessor&lt;/strong&gt; – Route telemetry conditionally based on resource or attribute values&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;transformprocessor&lt;/strong&gt; – Use expressions to rename, update, or drop fields&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How to Choose the Right Processor&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Pick the processor that matches the job at hand:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Processor&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Remove or redact sensitive fields&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;attributesprocessor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Drop unnecessary or risky logs/spans&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;filterprocessor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Route telemetry based on geography or team&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;routingprocessor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Normalize field names for audit compliance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;transformprocessor&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Example 1: Redacting PII in application traces&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In many applications, traces can unintentionally carry personally identifiable information like user email addresses or phone numbers. To address this, you can build a pipeline that begins with the &lt;strong&gt;otlp&lt;/strong&gt; receiver, processes the trace data through an &lt;strong&gt;attributesprocessor&lt;/strong&gt; configured to detect and redact sensitive fields such as &lt;strong&gt;user.email&lt;/strong&gt; or &lt;strong&gt;user.phone&lt;/strong&gt;, and finally exports it to a tracing backend like Jaeger or another OTLP-compatible service.&lt;br&gt;
&lt;strong&gt;Example Config:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;protocols&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;grpc&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;

&lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;attributes/pii_redaction&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;actions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;user.email&lt;/span&gt;
        &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;delete&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;user.phone&lt;/span&gt;
        &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;delete&lt;/span&gt;

&lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;jaeger&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http://jaeger-collector:14250'&lt;/span&gt;
    &lt;span class="na"&gt;tls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;insecure&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;traces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;attributes/pii_redaction&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;jaeger&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Outcome:&lt;/strong&gt;&lt;br&gt;
This pipeline deletes any attribute named &lt;code&gt;user.email&lt;/code&gt; or &lt;code&gt;user.phone&lt;/code&gt; before data is exported to Jaeger, ensuring no PII leaves the pipeline. With this setup, you preserve the diagnostic value of the trace without risking exposure of personal data. This approach helps maintain user privacy and stay aligned with data protection policies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example 2: Filtering internal debug logs in production&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Developers often include verbose debug logs during development, but these logs are rarely suitable for production. In a compliance-first pipeline, you can start with a &lt;strong&gt;filelog&lt;/strong&gt; or &lt;strong&gt;fluentforward&lt;/strong&gt; receiver and pass the logs through a &lt;strong&gt;filterprocessor&lt;/strong&gt; that drops entries where severity is set to &lt;strong&gt;"DEBUG"&lt;/strong&gt; or the environment tag indicates it's development-only. The cleaned logs are then sent to a system like Google Cloud Logging or Datadog.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example Config:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;filelog&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;include&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/var/log/app/*.log'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;filter/drop_debug_logs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;logs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;log_record&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;severity_text&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DEBUG'&lt;/span&gt;
&lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;googlecloud&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-production-project&lt;/span&gt;
&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;logs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;filelog&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;filter/drop_debug_logs&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;googlecloud&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Outcome:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This ensures that only production-relevant and compliant log data is exported, reducing both operational risk and unnecessary storage or processing costs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example 3: Ensuring data residency compliance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s say your organization collects telemetry from EU-based services and must comply with regional data residency laws. The pipeline begins with the &lt;strong&gt;otlp&lt;/strong&gt; receiver and uses a &lt;strong&gt;routingprocessor&lt;/strong&gt; to inspect attributes like &lt;strong&gt;region = eu-west1&lt;/strong&gt;. Based on this, telemetry is selectively routed to an EU-based backend only.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example Config:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;protocols&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;grpc&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;routing/data_residency&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;eu-west1&lt;/span&gt;
        &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;eu_backend&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;statement&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;match(resource.attributes["region"], "eu-west1")&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
        &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;non_eu_backend&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;eu_backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;eu-collector.mycompany.com'&lt;/span&gt;
  &lt;span class="na"&gt;non_eu_backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;us-collector.mycompany.com'&lt;/span&gt;
&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;traces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;routing/data_residency&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Outcome:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;EU data is routed exclusively to EU-compliant systems, supporting regional legal obligations. This architecture ensures that regulated data never leaves its permitted geographic boundary, keeping your observability setup aligned with legal and contractual obligations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example 4: Standardizing attributes for compliance audits&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In regulated industries, audit requirements often demand consistent telemetry formats. A compliance-aligned pipeline might start with receivers like &lt;strong&gt;prometheus&lt;/strong&gt;, &lt;strong&gt;otlp&lt;/strong&gt;, or &lt;strong&gt;filelog&lt;/strong&gt;, and pass the data through a &lt;strong&gt;transformprocessor&lt;/strong&gt; that renames fields. For instance, &lt;strong&gt;user_id&lt;/strong&gt; becomes &lt;strong&gt;user.id&lt;/strong&gt;, and &lt;strong&gt;txn_amount&lt;/strong&gt; becomes &lt;strong&gt;transaction.amount&lt;/strong&gt;. The processed data is then exported to a SIEM system or centralized log storage for long-term analysis. This kind of field normalization supports auditability and ensures that all downstream systems operate with uniform telemetry schemas, improving clarity and compliance readiness.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example Config:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;protocols&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;grpc&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;transform/standardize_fields&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;trace_statements&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;span&lt;/span&gt;
        &lt;span class="na"&gt;statements&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;rename(attributes["user_id"], "user.id")&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;rename(attributes["txn_amount"], "transaction.amount")&lt;/span&gt;
&lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;logging&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;loglevel&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;debug&lt;/span&gt;
&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;traces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;transform/standardize_fields&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;logging&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Outcome:&lt;/strong&gt;&lt;br&gt;
With consistent attribute names, you improve audit readiness and make logs easier to correlate.&lt;/p&gt;

&lt;p&gt;These examples show how easy it is to tailor your observability pipeline for compliance without sacrificing performance or visibility. By using the Collector as a policy engine, you ensure that compliance checks are built into your telemetry flow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Designing a Secure and Compliant Observability Architecture
&lt;/h2&gt;

&lt;p&gt;By now, it’s clear that securing telemetry data is not just about selecting the right tools. It involves designing the entire observability architecture with compliance in mind from the very beginning.&lt;/p&gt;

&lt;p&gt;To do this effectively, observability should be treated as a data supply chain. Each stage, from ingestion through processing to export, must actively enforce protections, not just pass data along.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Centralize Control&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The OpenTelemetry Collector sits at the center of a secure observability setup. It serves as the control point for managing ingestion, sanitization, transformation, routing, and export. This enables consistent enforcement of policies, regardless of where the data originates. If you need to redact PII before logs leave a Kubernetes cluster, route metrics from the EU to region-specific storage, or standardize trace data for audit readiness, the Collector is where those rules are applied.&lt;/p&gt;
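
&lt;p&gt;As a minimal sketch of such a rule (the attribute keys here are illustrative, not a standard), an &lt;strong&gt;attributes&lt;/strong&gt; processor can scrub PII before any exporter sees it:&lt;/p&gt;

```yaml
# Illustrative snippet: scrub PII attributes before export.
# Key names (user.email, user.ssn) are examples only.
processors:
  attributes/scrub_pii:
    actions:
      - key: user.email
        action: hash      # keep a stable correlation value, hide the raw address
      - key: user.ssn
        action: delete    # remove the attribute entirely
```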

&lt;p&gt;As observability grows, managing multiple Collector instances across environments can become complex. To manage this, you can adopt the &lt;a href="https://opentelemetry.io/docs/specs/opamp/" rel="noopener noreferrer"&gt;Open Agent Management Protocol (OpAMP)&lt;/a&gt;. OpAMP provides a standardized way to remotely manage OpenTelemetry Collectors at scale. It enables you to push configuration updates, monitor agent health, and enforce policy changes without logging into each node manually. It’s an essential addition for teams aiming to maintain observability governance while reducing operational overhead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Keep Processing and Export Logic Outside Application Code&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A frequent mistake is embedding telemetry logic directly within application code. This introduces risk, increases complexity, and makes enforcement inconsistent across services. A more secure approach moves that logic into centrally managed Collector configurations. This allows teams to update rules without deploying new code and gives compliance teams the ability to audit pipelines independently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Encrypt Telemetry in Transit and at Rest&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All telemetry data, including logs, metrics, and traces, should be encrypted while in transit and when stored. Use TLS to secure communication between agents and Collectors, and ensure encryption at rest is enabled in your observability backends such as OpenSearch, Datadog, or GCP.&lt;/p&gt;
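
&lt;p&gt;As a rough sketch (certificate paths are placeholders, and the exact keys can vary between Collector versions), TLS can be enabled on both the receiving and exporting sides of the Collector:&lt;/p&gt;

```yaml
# Illustrative TLS settings; file paths and endpoint are placeholders.
receivers:
  otlp:
    protocols:
      grpc:
        tls:
          cert_file: /etc/otel/certs/server.crt
          key_file: /etc/otel/certs/server.key
exporters:
  otlphttp:
    endpoint: https://backend.example.com:4318
    tls:
      ca_file: /etc/otel/certs/ca.crt
```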

&lt;p&gt;&lt;strong&gt;Avoid Overcollection and Excessive Retention&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Collecting or retaining more data than necessary increases your risk exposure. Implement filtering at the source and within the Collector to discard irrelevant data. Align retention policies with legal and compliance requirements to ensure that sensitive data is not kept longer than necessary.&lt;/p&gt;
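
&lt;p&gt;For example (a sketch assuming OTTL-style conditions in the filter processor; the route value is illustrative), health-check spans can be dropped inside the Collector before they are ever stored:&lt;/p&gt;

```yaml
# Drop telemetry nobody needs to retain, e.g. health-check spans.
processors:
  filter/drop_health_checks:
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.route"] == "/healthz"'
```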

&lt;p&gt;&lt;strong&gt;Enforce Separation of Duties&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not every team member needs access to all telemetry. Design the system to enforce access controls, both through infrastructure-level mechanisms like IAM or RBAC and within observability platforms using scoped dashboards or tenant-aware indexing. This limits access, reduces internal risk, and simplifies compliance audits.&lt;/p&gt;

&lt;h2&gt;
  
  
  Additional Layers of Data Protection Beyond Processors
&lt;/h2&gt;

&lt;p&gt;While OpenTelemetry processors play a critical role in securing and shaping telemetry data, they should be part of a broader data protection strategy. Ensuring compliance requires a layered approach that includes infrastructure-level security, backend configurations, and organizational access controls.&lt;/p&gt;

&lt;p&gt;Below are key layers that complement the processor-level protections:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. End-to-End Encryption&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Encryption must be enforced at every stage of telemetry flow. Use TLS for all communication between agents, collectors, and backend systems. Whether data is being transmitted over gRPC or HTTP, encrypted channels prevent interception and unauthorized access during transit.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Secure and Compliant Backends&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;After data is processed, it is stored or analyzed in backends such as OpenSearch, Google Cloud Logging, or Datadog. These systems must be configured to encrypt data at rest and enforce strict access controls. Ensure that backend permissions align with your organization's compliance policies.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. Role-Based Access Control (RBAC) and Principle of Least Privilege&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Limit access to telemetry data and configuration files using IAM or RBAC mechanisms. Each user or team should have access only to the data necessary for their responsibilities. This reduces the risk of accidental exposure and simplifies audit processes.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;4. Protected Configuration Management&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Treat OpenTelemetry configuration files as sensitive assets. Store them in secure, version-controlled repositories with restricted access. Use secrets management tools like HashiCorp Vault or GCP Secret Manager to inject credentials and tokens securely, instead of embedding them in plaintext within configuration files.&lt;/p&gt;
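
&lt;p&gt;For instance (a sketch; the variable name is illustrative), the Collector can read credentials from environment variables populated by your secrets manager instead of hardcoding them:&lt;/p&gt;

```yaml
# Token injected at runtime (e.g. from Vault or GCP Secret Manager),
# never committed to the config file itself.
exporters:
  otlphttp:
    endpoint: https://backend.example.com:4318
    headers:
      Authorization: "Bearer ${env:OTEL_BACKEND_TOKEN}"
```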

&lt;h3&gt;
  
  
  &lt;strong&gt;5. Routine Compliance Reviews and Audits&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Security and compliance are ongoing responsibilities. Schedule periodic reviews of telemetry pipelines, access controls, and retention policies. Auditing configurations and data flows regularly helps identify outdated settings, over-permissive access, or unintentional data leakage.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;6. Data Minimization Principles&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Collect only what is necessary. Overcollection not only adds noise but also increases the surface area for compliance risk. Apply filters early in the pipeline, remove legacy or redundant telemetry sources, and periodically reassess what is being collected across environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Build Trust Into Your Observability Stack
&lt;/h2&gt;

&lt;p&gt;Observability has come a long way, and today, building trust into it begins with intentional design. From deciding what to collect to how data is handled, OpenTelemetry offers the flexibility and control needed to embed security and compliance into every stage of the pipeline. By shaping telemetry as it flows, you enable teams to maintain visibility while reducing risk. I hope this article provides you with practical guidance to create observability pipelines that are not just effective, but also secure and compliant by design.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Top Metrics To Watch In Kubernetes</title>
      <dc:creator>Vibhuti Sharma</dc:creator>
      <pubDate>Tue, 13 May 2025 00:00:58 +0000</pubDate>
      <link>https://dev.to/vibhuti_sharma/top-metrics-to-watch-in-kubernetes-427k</link>
      <guid>https://dev.to/vibhuti_sharma/top-metrics-to-watch-in-kubernetes-427k</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frvw2zjdct1g7g3omz6br.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frvw2zjdct1g7g3omz6br.png" alt="Image description" width="800" height="323"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Introduction&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If you’ve ever found yourself knee-deep in a Kubernetes incident, watching a production microservice fail with mysterious 5xx errors, you know the drill: alerts are firing, dashboards are lit up like a Christmas tree, and your team is scrambling to make sense of a flood of metrics across every layer of the stack. It’s not a question of if this happens; it’s when.&lt;/p&gt;

&lt;p&gt;In that high-pressure moment, the true challenge isn’t just debugging; it’s knowing where to look. For seasoned SREs and technical founders who live and breathe Kubernetes, the ability to quickly zero in on the right signals can make the difference between a five-minute fix and a five-hour outage.&lt;/p&gt;

&lt;p&gt;So what are the metrics that actually move the needle? And how do you filter signal from noise when your platform is under fire?&lt;/p&gt;

&lt;p&gt;This article breaks down the critical Kubernetes metrics that every high-performing team should keep an eye on, before the next incident catches you off guard.&lt;/p&gt;

&lt;p&gt;If you don’t have a monitoring system in place, you’re already behind the curve. Kubernetes is a complex system with many moving parts, and without proper monitoring, you’re flying blind.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why Every Minute Counts in Kubernetes Outages&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When Kubernetes systems break, the impact isn’t just technical; it’s also financial, contractual, and reputational.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Real Cost of Downtime&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://blogs.cisco.com/cloud/the-cost-of-downtime-in-multicloud-it#:~:text=According%20to%20Gartner%2C%20the%20average,World%2C%20this%20gets%20even%20worse" rel="noopener noreferrer"&gt;According to Gartner&lt;/a&gt;, the average cost of IT downtime is $5,600 per minute, which adds up to over $330,000 per hour. Imagine this happening during peak traffic, a product launch, or a high-stakes client demo. The longer you spend guessing which part of the system failed, the more your business takes the hit. Often, it’s not even clear whether the issue lies in the network, storage, or application layer, leading to costly delays in diagnosis and resolution.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Tight SLAs &amp;amp; Tighter Repercussions&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;For teams managing Kubernetes clusters on behalf of clients, Service Level Agreements (SLAs) can feel like a sword hanging overhead. These agreements set strict limits on factors like downtime and error rates, and breaching them doesn’t just mean a few angry emails: it can lead to financial penalties, escalations, or even losing the client altogether. Without knowing which metrics reflect health and which signal red flags, such teams are always one step away from trouble.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Mean Time to Recovery&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The Mean Time to Recovery (MTTR) is a critical KPI for SRE and DevOps teams. It reflects how long it takes to detect, troubleshoot, and restore service after a failure. A low MTTR means your systems are resilient and your team is effective. But reducing MTTR is only possible if you’re looking at the right data when the incident hits, and that’s where the top Kubernetes metrics come in.&lt;/p&gt;

&lt;p&gt;That is exactly what this blog is here for. We will walk you through the most critical Kubernetes metrics to monitor, the ones that give you real insight into the health of your system, help reduce downtime, and improve your response during incidents. Whether you’re running a small dev cluster or a complex multi-tenant setup, this guide will help you prioritize the right signals.&lt;/p&gt;

&lt;h3&gt;
  
  
  Significance of The Four Golden Signals
&lt;/h3&gt;

&lt;p&gt;If you have spent any time in the world of monitoring or Site Reliability Engineering, you have probably come across the Four Golden Signals: Latency, Traffic, Errors, and Saturation. Originally popularized in the &lt;a href="https://sre.google/books/" rel="noopener noreferrer"&gt;Google SRE book, a comprehensive guide to site reliability&lt;/a&gt;, these signals remain the gold standard when it comes to what you should measure to understand your system’s health.&lt;/p&gt;

&lt;p&gt;Even in Kubernetes environments where complexity multiplies with microservices, dynamic scaling, and distributed components, &lt;a href="https://developer.cisco.com/articles/what-are-the-golden-signals/what-are-the-golden-signals-that-sre-teams-use-to-detect-issues/#what-are-the-golden-signals%20" rel="noopener noreferrer"&gt;The Four Golden Signals&lt;/a&gt; help you aim at the right targets. They map directly onto the Kubernetes metrics covered in this article, so understanding them leads to more focused monitoring and better results.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Latency helps you detect slowdowns even before users start complaining about them. Metrics like API server latency or HTTP request durations show where bottlenecks live.&lt;/li&gt;
&lt;li&gt;Traffic metrics (like request rate, network throughput) help you understand demand and stress levels across your system.&lt;/li&gt;
&lt;li&gt;Errors surface failing pods, HTTP 5xx rates, and crash loops; these are your early warning signs.&lt;/li&gt;
&lt;li&gt;Saturation tells you when you’re about to hit resource limits, whether it’s CPU, memory, or disk I/O on nodes and pods.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In a distributed system like Kubernetes, problems rarely announce themselves clearly. Golden Signals offer a language to interpret cluttered data, spot anomalies, and prioritize what truly needs fixing. Knowing how your app performs against these four dimensions makes your metrics strategy more focused, your alerts more meaningful, and your team more responsive.&lt;/p&gt;
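
&lt;p&gt;As a rough Prometheus sketch (assuming standard API server and node-exporter metric names are being scraped), the four signals might look like:&lt;/p&gt;

```promql
# Latency: p95 API server request duration
histogram_quantile(0.95, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le))
# Traffic: API requests per second
sum(rate(apiserver_request_total[5m]))
# Errors: share of 5xx responses
sum(rate(apiserver_request_total{code=~"5.."}[5m])) / sum(rate(apiserver_request_total[5m]))
# Saturation: cluster-wide memory utilization
1 - sum(node_memory_MemAvailable_bytes) / sum(node_memory_MemTotal_bytes)
```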

&lt;h3&gt;
  
  
  The RED &amp;amp; USE Methods
&lt;/h3&gt;

&lt;p&gt;Just like the Four Golden Signals, there are two other powerful frameworks that help teams make sense of their monitoring data: RED and USE. These methods offer a structured way to prioritize what to measure and where to look during troubleshooting. While Golden Signals give you a high-level overview of system health, RED and USE help you go deeper with intent, depending on whether you’re debugging an application-level issue or digging into infrastructure problems.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;RED Method For Applications &amp;amp; Services&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The RED method focuses on user-facing services and microservices, and is all about how your application is performing from a user’s perspective. It tracks three critical signals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requests per second (traffic)&lt;/li&gt;
&lt;li&gt;Errors per second (failures)&lt;/li&gt;
&lt;li&gt;Duration of requests (latency)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of RED as your first defense against a bad user experience. It closely aligns with the Four Golden Signals and is commonly visualized using pre-built RED dashboards in tools like Prometheus and Grafana. For a deeper dive, check out the official blog on &lt;a href="https://grafana.com/blog/2018/08/02/the-red-method-how-to-instrument-your-services/" rel="noopener noreferrer"&gt;The RED Method by Tom Wilkie&lt;/a&gt;&lt;/p&gt;
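
&lt;p&gt;A minimal RED sketch in PromQL, assuming your services expose the common &lt;em&gt;http_requests_total&lt;/em&gt; and &lt;em&gt;http_request_duration_seconds&lt;/em&gt; instrumentation (the &lt;em&gt;job&lt;/em&gt; label is illustrative):&lt;/p&gt;

```promql
# Rate: requests per second
sum(rate(http_requests_total{job="myapp"}[5m]))
# Errors: failing requests per second
sum(rate(http_requests_total{job="myapp", status=~"5.."}[5m]))
# Duration: p99 request latency
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="myapp"}[5m])) by (le))
```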

&lt;h4&gt;
  
  
  &lt;strong&gt;USE Method For Infrastructure &amp;amp; System Health&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The USE method is geared toward lower-level system resources such as nodes, disks, and network interfaces. It tracks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Utilization — How much of a resource is being used?&lt;/li&gt;
&lt;li&gt;Saturation — Is the resource at or near capacity?&lt;/li&gt;
&lt;li&gt;Errors — Are there any failures in the resource?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is especially useful when you’re debugging performance bottlenecks or checking node health in Kubernetes. For example, using the USE method, you might quickly spot a disk I/O bottleneck or excessive memory pressure on a node.&lt;/p&gt;
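
&lt;p&gt;A USE-style sketch for a node, using common node-exporter metric names:&lt;/p&gt;

```promql
# Utilization: fraction of CPU busy per node
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
# Saturation: time spent doing disk I/O
rate(node_disk_io_time_seconds_total[5m])
# Errors: NIC receive errors
rate(node_network_receive_errs_total[5m])
```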

&lt;p&gt;RED and USE complement each other and help you design focused dashboards, meaningful alerts, and faster incident response workflows. For a deeper dive, check out the official blog on &lt;a href="https://www.brendangregg.com/usemethod.html" rel="noopener noreferrer"&gt;The USE Method by Brendan Gregg&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Layers of Kubernetes Monitoring
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6su31wdq1jbl1zyqnyes.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6su31wdq1jbl1zyqnyes.png" alt="Image description" width="800" height="1142"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Before we go deeper into metrics, it is important to understand where they come from. Kubernetes is a layered system, and each layer gives its own signals. If you want complete observability, you need to collect metrics from every layer.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Cluster Layer&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;This is the big-picture view. At this level, you track overall cluster health, how many nodes are active, how many are unschedulable, how many pods are in a crash loop, or if your autoscaler is working as expected. Metrics from the Kube Controller Manager, Cloud Controller, and Cluster Autoscaler belong here.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Control Plane&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;This is the brain of a cluster. Components like the API server, scheduler, and ETCD are responsible for making everything work. Metrics from this layer help you answer questions like “Is the scheduler under pressure?”, “Is the ETCD server healthy and responding on time?”, “Are API requests getting throttled?”.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Nodes&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;These are the worker machines (virtual or physical) that run your workloads. Key node-level metrics include CPU, memory, disk I/O, and network throughput. If nodes are overloaded, your pods will suffer even if your app code is flawless.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Pods &amp;amp; Containers&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;This is the execution layer. Monitoring pod status, container restarts, resource requests/limits, and OOM (Out of Memory) kills can quickly tell you if your workloads are running as expected or if they’re crashing silently in the background.&lt;/p&gt;
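
&lt;p&gt;As one way to surface silent OOM kills (a sketch that assumes &lt;em&gt;jq&lt;/em&gt; is installed):&lt;/p&gt;

```shell
# List pods whose containers were last terminated by the OOM killer
kubectl get pods -A -o json \
  | jq -r '.items[]
      | select(.status.containerStatuses[]?.lastState.terminated.reason == "OOMKilled")
      | "\(.metadata.namespace)/\(.metadata.name)"'
```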

&lt;h4&gt;
  
  
  &lt;strong&gt;Applications&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Finally, we reach the business logic that is the code you deploy. Application-level metrics include request latency, error rates, throughput, and custom business KPIs. These metrics help tie technical issues to user-facing problems, which is especially important when debugging customer-impacting incidents.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kubernetes Metrics That Matter The Most
&lt;/h3&gt;

&lt;p&gt;Once you understand the layers of observability in Kubernetes, the next step is knowing what to look at in each layer. Not all metrics are created equal; some help you react quickly, while others help you prevent issues entirely. Here are the top metrics across each layer that monitoring teams should track.&lt;/p&gt;

&lt;h4&gt;
  
  
  Cluster-Level Metrics
&lt;/h4&gt;

&lt;p&gt;Let’s say your Kubernetes cluster is experiencing performance issues; maybe workloads are failing to schedule, pods are restarting, or users are complaining about latency. Instead of jumping into individual pod metrics, let’s start from the top. Here’s a practical flow to investigate issues at the cluster level and narrow down potential root causes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Confirm the Symptoms at Scale&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Start with basic observations. Are these problems isolated or affecting the entire cluster?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Check the number of unschedulable pods:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pods --all-namespaces --field-selector=status.phase=Pending
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAMESPACE     NAME                                READY   STATUS    RESTARTS   AGE
default       myapp-frontend-7d8f9c6d8b-abcde     0/1     Pending   0          2m
kube-system   coredns-558bd4d5db-xyz12            0/1     Pending   0          1m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The pods above are stuck in a Pending state. To determine the root cause, investigate further. Look for frequent restarts:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pods --all-namespaces | grep -v '0/' | grep 'CrashLoopBackOff\|Error'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAMESPACE   NAME                             READY   STATUS             RESTARTS   AGE
default     api-server-5d9f8f6d8b-xyz12      0/1     CrashLoopBackOff   5          10m
default     db-service-7d8f9c6d8b-abcde      0/1     Error              3          8m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Are nodes under pressure?
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl top nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME     CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node-1   1800m        75%    7000Mi          85%
node-2   1500m        80%    6500Mi          90%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see the nodes are under pressure. If multiple namespaces or workloads are impacted, it’s likely a cluster-level issue, not just an app problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Assess Node Health and Availability&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Node problems ripple across the entire cluster. Let’s check how many are healthy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME     STATUS     ROLES   AGE   VERSION
node-1   Ready      worker  10d   v1.25.0
node-2   NotReady   worker  10d   v1.25.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Watch out for nodes in a NotReady or Unknown state; these can cause workload evictions, failed scheduling, and data plane failures. If some nodes are out, look at recent cluster events:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get events --sort-by=.lastTimestamp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LAST SEEN   TYPE      REASON              OBJECT                     MESSAGE
2m          Warning   NodeNotReady        node/node-2                Node is not ready
1m          Warning   FailedScheduling    pod/myapp-frontend-xyz12   0/2 nodes are available: 1 node(s) were not ready
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pay attention to messages like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;NodeNotReady&lt;/li&gt;
&lt;li&gt;FailedScheduling&lt;/li&gt;
&lt;li&gt;ContainerGCFailed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Detect Resource Bottlenecks&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Even if nodes are “Ready,” they might not have capacity. Check CPU and memory pressure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl describe nodes | grep -A5 "Conditions:"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Conditions:
  Type             Status   LastHeartbeatTime       Reason
  MemoryPressure   True     2025-05-06T16:00:00Z     KubeletHasInsufficientMemory
  DiskPressure     False    2025-05-06T16:00:00Z     KubeletHasSufficientDisk
  PIDPressure      False    2025-05-06T16:00:00Z     KubeletHasSufficientPID
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look for MemoryPressure, DiskPressure, or PIDPressure.&lt;/p&gt;

&lt;p&gt;Check what resources the scheduler sees as allocatable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl describe node node-name | grep -A10 "Allocatable"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Allocatable:
  cpu: 2000m
  memory: 8192Mi
  pods: 110
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If everything looks maxed out, your cluster may be underprovisioned; it’s time to scale nodes or clean up unused resources.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Investigate Networking or DNS Issues&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Latency complaints and failing pod readiness probes often come down to network problems.&lt;/p&gt;

&lt;p&gt;Use your Prometheus dashboards to check for network errors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;rate(container_network_receive_errors_total[5m])&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Check for CoreDNS issues:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl logs -n kube-system -l k8s-app=kube-dns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.:53
2025/05/06 16:05:00 [INFO] CoreDNS-1.8.0
2025/05/06 16:05:00 [INFO] plugin/reload: Running configuration MD5 = 1a2b3c4d5e6f
2025/05/06 16:05:00 [ERROR] plugin/errors: 2 123.45.67.89:12345 - 0000 /etc/resolv.conf: dial tcp: lookup example.com on 10.96.0.10:53: server misbehaving
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Spot dropped packets or erratic latencies in inter-pod communication.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connect the Dots&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now correlate your findings. Ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Are failing pods being scheduled on overloaded or failing nodes?&lt;/li&gt;
&lt;li&gt;Are pods restarting due to OOMKills or image pull issues?&lt;/li&gt;
&lt;li&gt;Do networking or DNS failures match the timing of user complaints?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By this point, a pattern should emerge, and you should be able to narrow the cause down to one of these reasons. From the example outputs above, this is most likely a cluster-level problem caused by an over-utilized or partially unavailable node.&lt;/p&gt;

&lt;h4&gt;
  
  
  Control Plane Metrics
&lt;/h4&gt;

&lt;p&gt;Let’s say you’ve ruled out node failures and cluster resource issues, but your workloads are still acting strange. Pods remain in Pending for too long, deployments aren’t progressing, and even basic kubectl commands feel sluggish.&lt;/p&gt;

&lt;p&gt;That’s your signal that the control plane might be the bottleneck. Here is how to troubleshoot Kubernetes control plane health using metrics, and trace the problem back to its source.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gauge API Server Responsiveness&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The API server is the front door to your cluster. If it’s slow, everything slows down: kubectl, CI/CD pipelines, controllers, autoscalers.&lt;/p&gt;

&lt;p&gt;Check API server latency:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;histogram_quantile(0.95, rate(apiserver_request_duration_seconds_bucket[5m]))&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A spike here means users and internal components are all experiencing degraded interactions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Look for API Server Errors&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Latency might be caused by underlying failures, especially from ETCD, which backs all API state.&lt;/p&gt;

&lt;p&gt;Check for 5xx errors from the API server:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;rate(apiserver_request_total{code=~"5.."}[5m])&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A sustained increase could mean:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ETCD is overloaded or unhealthy&lt;/li&gt;
&lt;li&gt;API server is under too much load&lt;/li&gt;
&lt;li&gt;Network/storage latency is impacting ETCD reads/writes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If error rates correlate with latency spikes, check ETCD performance next.&lt;/p&gt;
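
&lt;p&gt;Two useful starting points in PromQL, assuming ETCD and API server metrics are being scraped:&lt;/p&gt;

```promql
# ETCD disk health: p99 WAL fsync latency
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))
# API server's view: latency of requests to ETCD, by operation
histogram_quantile(0.99, sum(rate(etcd_request_duration_seconds_bucket[5m])) by (le, operation))
```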

&lt;p&gt;&lt;strong&gt;Investigate Scheduler Delays&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Maybe your pods are Pending and not getting scheduled even though nodes look healthy. This could be a scheduler problem, not a resource issue.&lt;/p&gt;

&lt;p&gt;Check how long the scheduler is taking to place pods:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;histogram_quantile(0.95, rate(scheduler_scheduling_duration_seconds_bucket[5m]))&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;High values here mean the scheduler is overwhelmed, blocked, or crashing.&lt;/p&gt;

&lt;p&gt;Correlate this with pod age:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pods --all-namespaces --sort-by=.status.startTime
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAMESPACE     NAME                                      READY   STATUS    RESTARTS   AGE
default       myapp-frontend-7d8f9c6d8b-abcde           0/1     Pending   0          18m
default       api-server-5d9f8f6d8b-xyz12               0/1     CrashLoopBackOff  5  20m
default       db-service-7d8f9c6d8b-def45               0/1     Error     3          19m
kube-system   coredns-558bd4d5db-xyz12                  0/1     Pending   0          21m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;New pods have been Pending for over 15 minutes, suggesting the scheduler is delayed and the API server isn’t responding fast enough to resource or binding requests. If new pods sit in Pending this long, this is your bottleneck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitor Controller Workqueues&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Controller Manager keeps the desired state in sync: scaling replicas, rolling updates, service endpoints, and so on. If it’s backed up, changes won’t propagate.&lt;/p&gt;

&lt;p&gt;Look at the depth of workqueues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;sum(workqueue_depth{name=~".+"})&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most Kubernetes controllers are designed to quickly process items in their workqueues. A queue depth of 0–5 is generally normal and healthy. It means the controller is keeping up. Short spikes (up to ~10–20) can occur during events like rolling updates or scaling, and are usually harmless if they drop quickly. Start investigating if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;workqueue_depth stays above 50–100 consistently&lt;/li&gt;
&lt;li&gt;workqueue_adds_total keeps rising rapidly&lt;/li&gt;
&lt;li&gt;workqueue_work_duration_seconds shows long processing times&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These symptoms suggest the controller is backed up, leading to delays in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rolling out deployments&lt;/li&gt;
&lt;li&gt;Updating service endpoints&lt;/li&gt;
&lt;li&gt;Reconciling desired vs. actual state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also check:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;sum(workqueue_adds_total)&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;avg(workqueue_work_duration_seconds)&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Spikes here mean your controllers are overloaded, possibly due to a flood of changes or downstream API slowdowns.&lt;/p&gt;
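&lt;p&gt;The rules of thumb above can be folded into a small health check. A hedged Python sketch, with thresholds mirroring the ranges discussed (they are guidelines, not official values):&lt;/p&gt;

```python
def workqueue_status(depth_samples, sustained_threshold=50):
    """Classify a controller workqueue from a window of depth samples.
    'backed-up' if depth stays above the sustained threshold for the
    whole window; 'spiking' for short excursions past ~20; otherwise
    'healthy'. Thresholds follow the rules of thumb above."""
    if all(d > sustained_threshold for d in depth_samples):
        return "backed-up"
    if max(depth_samples) > 20:
        return "spiking"
    return "healthy"

print(workqueue_status([2, 4, 1, 3]))       # healthy
print(workqueue_status([80, 95, 120, 77]))  # backed-up
print(workqueue_status([3, 25, 4, 2]))      # spiking
```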

&lt;p&gt;&lt;strong&gt;Pull it All Together&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;From the example outputs above, we can conclude that the issue is ETCD and API server latency, which causes cascading delays in the control plane:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scheduler can’t assign pods quickly due to slow API server.&lt;/li&gt;
&lt;li&gt;Controller Manager queues are backing up as desired state changes (like ReplicaSet creations) take too long to commit.&lt;/li&gt;
&lt;li&gt;kubectl and system components (like CoreDNS or autoscalers) are affected by poor responsiveness from the API server, which relies on ETCD.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More generally, this combination of symptoms points to systemic control plane stress:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High API latency&lt;/li&gt;
&lt;li&gt;Elevated 5xx errors&lt;/li&gt;
&lt;li&gt;Scheduler latency spikes&lt;/li&gt;
&lt;li&gt;Controller queues backed up&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When control plane metrics go bad, symptoms ripple through the whole system. Tracking these metrics as a cohesive unit helps you catch early signals before workloads break.&lt;/p&gt;

&lt;h4&gt;
  
  
  Node-Level Metrics: Digging into the Machine Layer
&lt;/h4&gt;

&lt;p&gt;If control plane metrics look healthy but problems persist (pods getting OOMKilled, apps slowing down, workloads behaving inconsistently), it’s time to inspect the nodes themselves. These are the machines that run your actual workloads. Here’s how to walk through node-level metrics to find the culprit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Identify Which Nodes Are Affected&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Start by getting a quick snapshot of node health:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME      STATUS     ROLES    AGE   VERSION
node-1    Ready      worker   10d   v1.25.0
node-2    NotReady   worker   10d   v1.25.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look for any nodes not in Ready state. If nodes are marked NotReady, Unknown, or SchedulingDisabled, that's your first signal. Then describe them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl describe node node-2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Conditions:
  Type             Status  LastHeartbeatTime            Reason
  MemoryPressure   False   2025-05-06T16:00:00Z         KubeletHasSufficientMemory
  DiskPressure     True    2025-05-06T16:00:00Z         KubeletHasDiskPressure
  PIDPressure      False   2025-05-06T16:00:00Z         KubeletHasSufficientPID

Taints:
  node.kubernetes.io/disk-pressure:NoSchedule
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Disk pressure is explicitly reported, which is likely the source of the pod issues. Focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Conditions: Look for MemoryPressure, DiskPressure, or PIDPressure&lt;/li&gt;
&lt;li&gt;Taints: Check if workloads are being prevented from scheduling&lt;/li&gt;
&lt;/ul&gt;
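&lt;p&gt;This check is easy to script. A minimal Python sketch that flags pressure conditions from &lt;em&gt;kubectl get node node-2 -o json&lt;/em&gt; output (field names follow the Kubernetes API; the sample below is trimmed to match the describe output above):&lt;/p&gt;

```python
import json

def pressure_conditions(node_json):
    """Return the pressure conditions currently True on a node,
    given the parsed JSON from `kubectl get node node-2 -o json`."""
    flagged = []
    for cond in node_json["status"]["conditions"]:
        if cond["type"].endswith("Pressure") and cond["status"] == "True":
            flagged.append(cond["type"])
    return flagged

# Trimmed sample mirroring the node-2 output above:
sample = json.loads("""{
  "status": {"conditions": [
    {"type": "MemoryPressure", "status": "False"},
    {"type": "DiskPressure",   "status": "True"},
    {"type": "PIDPressure",    "status": "False"}
  ]}
}""")
print(pressure_conditions(sample))  # ['DiskPressure']
```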

&lt;p&gt;&lt;strong&gt;Check Resource Saturation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If nodes are Ready but workloads are misbehaving, they might just be under pressure. Get real-time usage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl top nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME     CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node-1   1200m        60%    6000Mi          70%
node-2   800m         40%    5800Mi          68%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Based on the example output, CPU and memory are normal, so the disk is likely the bottleneck. In general, look for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High CPU%: Indicates throttling&lt;/li&gt;
&lt;li&gt;High Memory%: Can cause OOMKills or evictions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If a node is maxed out, describe the pods on it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pods --all-namespaces -o wide | grep node-2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;default       api-cache-678d456b7b-xyz11       0/1   Evicted           0     10m   node-2
default       order-db-7c9b5d49f-vx12c         0/1   Error             2     15m   node-2
default       analytics-app-67d945c78c-qwe78   0/1   CrashLoopBackOff  4     12m   node-2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Identify noisy neighbors or pods consuming abnormal resources. Here, multiple failing pods and evictions suggest disk-pressure-driven pod disruption.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Investigate Frequent Pod Restarts or Evictions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Pods restarting or getting evicted? Check the reason:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pod pod-name -n namespace -o jsonpath="{.status.containerStatuses[*].lastState}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{"terminated":{"reason":"Evicted","message":"The node was low on disk."}}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Common reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OOMKilled: memory overuse&lt;/li&gt;
&lt;li&gt;Evicted: node pressure (memory, disk, or PID)&lt;/li&gt;
&lt;li&gt;CrashLoopBackOff: instability in app or runtime&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then verify which node they were running on; repeated issues from the same node point to a node-level problem.&lt;/p&gt;
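&lt;p&gt;The mapping from termination reason to likely cause can be sketched in Python (the categories mirror the list above; the helper name is ours):&lt;/p&gt;

```python
def classify_restart(last_state):
    """Map a container's lastState (as returned by the jsonpath query
    above) to a likely cause, following the common reasons listed."""
    reason = last_state.get("terminated", {}).get("reason", "")
    return {
        "OOMKilled": "memory overuse",
        "Evicted": "node pressure (memory, disk, or PID)",
    }.get(reason, "app or runtime instability")

evicted = {"terminated": {"reason": "Evicted",
                          "message": "The node was low on disk."}}
print(classify_restart(evicted))  # node pressure (memory, disk, or PID)
```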

&lt;p&gt;&lt;strong&gt;Check Disk and Network Health&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Some failures are subtler: slow apps, stuck I/O, DNS errors. These often come from disk or network bottlenecks.&lt;/p&gt;

&lt;p&gt;Use your Prometheus dashboard to check:&lt;/p&gt;

&lt;p&gt;Disk I/O:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;rate(node_disk_reads_completed_total[5m])&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Network errors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;rate(node_network_receive_errs_total[5m])&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;rate(node_network_transmit_errs_total[5m])&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These can indicate bad NICs, over-saturated interfaces, or DNS resolution failures affecting pods on that node. If Prometheus isn’t available, SSH into the node and use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;iostat -xz 1 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;evice:         rrqm/s wrqm/s r/s   w/s  rkB/s  wkB/s avgrq-sz avgqu-sz await  svctm  %util
nvme0n1         0.00   12.00  50.00 250.00 1024.00 8192.00 60.00    8.50     30.00 1.00   99.90
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dmesg | grep -i error
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ 10452.661212] blk_update_request: I/O error, dev nvme0n1, sector 768
[ 10452.661217] EXT4-fs error (device nvme0n1): ext4_find_entry:1463: inode #131072: comm kubelet: reading directory lblock 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look for high I/O wait, dropped packets, or NIC errors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Review Node Stability &amp;amp; Uptime&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sometimes the issue is churn; nodes going up/down too frequently due to reboots or cloud spot termination.&lt;/p&gt;

&lt;p&gt;Check uptime:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;uptime
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; 16:15:03 up 2 days, 2:44, 1 user, load average: 5.12, 4.98, 3.80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or with Prometheus:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;node_time_seconds - node_boot_time_seconds&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Frequent reboots suggest infrastructure problems or autoscaler misbehavior. If it’s spot nodes, review instance interruption rates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Correlate and Isolate&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this example case, node-2 is experiencing disk I/O congestion, confirmed by DiskPressure, pod evictions due to low disk, iostat metrics showing 99%+ utilization and 30ms I/O latency, and kernel logs showing read errors. This node is the root cause of pod disruptions and degraded application behavior. However, other factors can present similarly. Let’s say you find:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One node has 90%+ memory usage&lt;/li&gt;
&lt;li&gt;That node also shows disk IO spikes and network errors&lt;/li&gt;
&lt;li&gt;Most failing pods are running on that node&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When those signals converge on a single machine, that node is almost certainly your culprit. Node-level issues are often the hidden root of noisy, hard-to-trace application problems. Always include node health in your diagnostic workflow, even when app logs seem to tell a different story.&lt;/p&gt;

&lt;h4&gt;
  
  
  Pod &amp;amp; Deployment-Level Issues (RED Metrics)
&lt;/h4&gt;

&lt;p&gt;If node-level metrics look healthy but problems persist (some pods are slow, users are getting errors, latency seems off), it’s time to check what is wrong at the pod or deployment level. Here’s how to tackle it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spot the Symptoms&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Start by identifying which services or deployments are affected. Are users reporting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Slow API responses?&lt;/li&gt;
&lt;li&gt;Errors in requests?&lt;/li&gt;
&lt;li&gt;Timeouts?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Correlate with actual service/pod behavior using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pods -A
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAMESPACE     NAME                                READY   STATUS             RESTARTS   AGE
default       auth-api-7f8b45dd8f-abc12            0/1     CrashLoopBackOff   5          10m
default       auth-api-7f8b45dd8f-xyz89            0/1     CrashLoopBackOff   5          10m
default       payment-api-6f9c7f9b44-123qw         1/1     Running            0          20m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look for pods in CrashLoopBackOff, Pending, or Error states. Here, for example, the auth-api pods are failing, which means something is wrong with that deployment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check the Request Rate&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This tells you if the service is even receiving traffic, and whether it suddenly dropped.&lt;/p&gt;

&lt;p&gt;If you’re using Prometheus + instrumentation (e.g., HTTP handlers exporting metrics):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;rate(http_requests_total[5m])&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A sharp drop in traffic might mean the pod isn’t even reachable due to readiness/liveness issues or a misconfigured ingress.&lt;/p&gt;

&lt;p&gt;Also check the load balancer/ingress controller logs (e.g., NGINX, Istio) for clues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check the Error Rate&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This reveals if pods are throwing 5xx or 4xx errors, a sign of broken internal logic or downstream service failures.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;rate(http_requests_total{status=~"5.."}[5m])&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also inspect the pods:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl logs pod-name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error: Missing required environment variable DATABASE_URL
    at config.js:12:15
    at bootstrapApp (/app/index.js:34:5)
    ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look for exceptions, failed database calls, or panics. Here we see the pod is crashing due to missing DATABASE_URL, which might be a config issue during deployment.&lt;/p&gt;

&lt;p&gt;Use kubectl describe pod for events like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Failing readiness/liveness probes&lt;/li&gt;
&lt;li&gt;Container crashes&lt;/li&gt;
&lt;li&gt;Volume mount errors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Warning  Unhealthy  2m (x5 over 10m)   kubelet            Readiness probe failed: HTTP probe failed with statuscode: 500
  Warning  BackOff    2m (x10 over 10m)  kubelet            Back-off restarting failed container
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Check the Request Duration (Latency)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;High latency with no errors means something is &lt;em&gt;slow&lt;/em&gt;, not broken.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;
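&lt;p&gt;To build intuition for what &lt;em&gt;histogram_quantile&lt;/em&gt; computes, here is a rough Python sketch of the same idea: linear interpolation inside the first cumulative bucket that covers the requested quantile (simplified relative to Prometheus’s actual implementation):&lt;/p&gt;

```python
def histogram_quantile(q, buckets):
    """Rough sketch of PromQL's histogram_quantile: linearly
    interpolate inside the first bucket whose cumulative count
    covers the requested quantile.
    buckets: sorted (upper_bound_seconds, cumulative_count) pairs."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Illustrative request-duration buckets: 100 requests total,
# 90 of them completed in under 0.5s.
print(histogram_quantile(0.95, [(0.1, 60), (0.5, 90), (1.0, 100)]))  # 0.75
```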

&lt;p&gt;If request durations spike:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Check if dependent services (e.g., database, Redis) are under pressure&lt;/li&gt;
&lt;li&gt;Use tracing tools (e.g., Jaeger, OpenTelemetry) if set up&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Look at CPU throttling with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl top pod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME                             CPU(cores)   MEMORY(bytes)
auth-api-7f8b45dd8f-abc12        15m          128Mi
payment-api-6f9c7f9b44-123qw     80m          200Mi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the scenario we considered, there are no resource throttling or usage issues; the crash is logic-related, not pressure-related. You can also check CPU throttling in Prometheus:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;rate(container_cpu_cfs_throttled_seconds_total[5m])&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Correlate with Deployment Events&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sometimes your pods are healthy but something changed in the deployment process (bad rollout, config error).&lt;/p&gt;

&lt;p&gt;Check rollout history:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl rollout history deployment deployment-name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;deployment.apps/auth-api
REVISION  CHANGE-CAUSE
1         Initial deployment
2         Misconfigured env var for DATABASE_URL
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See if a new revision broke things. If yes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl rollout undo deployment auth-api
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;deployment.apps/auth-api rolled back
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also review deployment description for more information:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl describe deployment auth-api
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Name:                   auth-api
Namespace:              default
Replicas:               2 desired | 0 updated | 0 available | 2 unavailable
StrategyType:           RollingUpdate
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Progressing    False   ProgressDeadlineExceeded
  Available      False   MinimumReplicasUnavailable

Environment:
  DATABASE_URL:  &amp;lt;unset&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From the output, check:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Were all replicas successfully scheduled?&lt;/li&gt;
&lt;li&gt;Did resource limits or readiness probes cause issues?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Spot Trends in Replica Behavior&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you suspect scaling problems (e.g., not enough replicas to handle load):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;sum(kube_deployment_spec_replicas) by (deployment)&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;sum(kube_deployment_status_replicas_available) by (deployment)&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A mismatch between these indicates rollout issues, pod crashes, or scheduling failures.&lt;/p&gt;
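&lt;p&gt;Comparing the two series programmatically is straightforward. A minimal sketch (the function name and data shapes are ours, standing in for the results of the two queries above):&lt;/p&gt;

```python
def rollout_gaps(spec_replicas, available_replicas):
    """Compare desired vs. available replicas per deployment and
    return only the deployments with a shortfall.
    Inputs are dicts of deployment name to replica count, standing in
    for the kube_deployment_* query results above."""
    return {name: spec_replicas[name] - available_replicas.get(name, 0)
            for name in spec_replicas
            if spec_replicas[name] != available_replicas.get(name, 0)}

desired = {"auth-api": 2, "payment-api": 3}
available = {"auth-api": 0, "payment-api": 3}
print(rollout_gaps(desired, available))  # {'auth-api': 2}
```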

&lt;p&gt;&lt;strong&gt;Final Diagnosis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By following this flow, you’ll isolate whether your pods are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unavailable&lt;/strong&gt; (readiness or probe issues)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Throwing errors&lt;/strong&gt; (broken logic, bad config)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slow&lt;/strong&gt; (upstream delays, resource throttling)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Or unstable&lt;/strong&gt; (bad rollout, crashing containers)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Troubleshooting Application-Level Issues
&lt;/h4&gt;

&lt;p&gt;If pods are running fine, nodes are healthy, and there are no deployment issues, but users are still complaining, then something could be wrong in the app itself. At this stage the cluster looks fine, so it’s likely an internal app logic, dependency, or performance issue. Here’s how to troubleshoot it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trace the Symptoms from the Top&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What are users actually experiencing?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is a specific endpoint slow?&lt;/li&gt;
&lt;li&gt;Is authentication failing?&lt;/li&gt;
&lt;li&gt;Are pages timing out intermittently?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Start by querying RED metrics from your app’s own observability (assuming it’s instrumented with Prometheus, OpenTelemetry, etc.):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Request rate per endpoint:&lt;/strong&gt; &lt;em&gt;rate(http_requests_total{job="your-app"}[5m])&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error rate (e.g., 4xx/5xx):&lt;/strong&gt; &lt;em&gt;rate(http_requests_total{status=~"5.."}[5m])&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency distribution:&lt;/strong&gt; &lt;em&gt;histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job="your-app"}[5m]))&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This will quickly show which part of your app is misbehaving.&lt;/p&gt;
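&lt;p&gt;Those three signals can be condensed into a per-endpoint verdict. A hedged sketch with illustrative thresholds (the error budget and latency SLO values here are made up, not standards):&lt;/p&gt;

```python
def red_summary(requests_per_sec, errors_per_sec, p95_latency_s,
                error_budget=0.01, latency_slo_s=0.5):
    """Condense RED metrics for one endpoint into a verdict.
    Thresholds are illustrative defaults, not standard values."""
    error_ratio = errors_per_sec / requests_per_sec if requests_per_sec else 0
    issues = []
    if error_ratio > error_budget:
        issues.append("error rate above budget")
    if p95_latency_s > latency_slo_s:
        issues.append("p95 latency above SLO")
    return issues or ["healthy"]

print(red_summary(200, 12, 0.9))  # both thresholds breached
print(red_summary(200, 0, 0.1))   # ['healthy']
```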

&lt;p&gt;&lt;strong&gt;Use Traces to Follow the Journey&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If metrics are the “what”, traces are the “why.”&lt;/p&gt;

&lt;p&gt;Use tracing (Jaeger, Tempo, or OpenTelemetry backends) to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Trace slow or failed requests&lt;/li&gt;
&lt;li&gt;Identify downstream service delays (e.g., DB, external APIs)&lt;/li&gt;
&lt;li&gt;Measure time spent in each span&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Look for patterns like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Long DB query spans&lt;/li&gt;
&lt;li&gt;Retries or timeouts from third-party APIs&lt;/li&gt;
&lt;li&gt;Deadlocks or slow code paths&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Profile Resource-Intensive Paths&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sometimes, the issue is an internal performance bug like memory leaks, CPU spikes, or thread contention.&lt;/p&gt;

&lt;p&gt;Use profiling tools like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pyroscope, Parca, Go pprof, or Node.js Inspector&lt;/li&gt;
&lt;li&gt;Flame graphs to visualize CPU/memory hotspots&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Check Dependencies &amp;amp; DB Metrics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your app might be healthy, but its dependencies might not be.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Is the database under pressure:&lt;/strong&gt; &lt;em&gt;rate(mysql_global_status_threads_running[5m])&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Are Redis queries timing out:&lt;/strong&gt; &lt;em&gt;rate(redis_commands_duration_seconds_bucket[5m])&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Are queue workers backed up:&lt;/strong&gt; &lt;em&gt;sum(rabbitmq_queue_messages_ready)&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also watch for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Connection pool exhaustion&lt;/li&gt;
&lt;li&gt;Slow queries&lt;/li&gt;
&lt;li&gt;Locks or deadlocks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even subtle latency in DB or cache can bubble up as app slowdowns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;External Services or 3rd Party APIs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Check whether your app relies on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Payment gateways&lt;/li&gt;
&lt;li&gt;Auth providers (like OAuth)&lt;/li&gt;
&lt;li&gt;External APIs (e.g., geolocation, email, analytics)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use Prometheus metrics or custom app logs to track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Latency of external calls&lt;/li&gt;
&lt;li&gt;Error rates (timeouts, HTTP 503s)&lt;/li&gt;
&lt;li&gt;Retry storms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Add circuit breakers or timeouts to avoid cascading failures.&lt;/p&gt;
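&lt;p&gt;A circuit breaker can be as simple as a failure counter with a cooldown. A minimal sketch, not tied to any particular library’s API:&lt;/p&gt;

```python
import time

class CircuitBreaker:
    """Minimal circuit-breaker sketch: after max_failures consecutive
    failures, calls are rejected for cooldown seconds instead of
    hammering a struggling dependency."""

    def __init__(self, max_failures=3, cooldown=30.0):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            elapsed = time.monotonic() - self.opened_at
            if elapsed >= self.cooldown:
                # Half-open: allow one trial request through.
                self.opened_at = None
                self.failures = 0
            else:
                raise RuntimeError("circuit open, request rejected")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

&lt;p&gt;Real deployments usually reach for a battle-tested implementation (a service mesh policy or a resilience library) rather than rolling their own, but the state machine is the same.&lt;/p&gt;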

&lt;p&gt;&lt;strong&gt;Validate Configuration &amp;amp; Feature Flags&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sometimes the issue is &lt;strong&gt;human&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Was a feature flag turned on for everyone?&lt;/li&gt;
&lt;li&gt;Did a bad config rollout silently break behavior?&lt;/li&gt;
&lt;li&gt;Was a critical env var left empty?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Review:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl describe deployment your-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Name: your-app
Namespace: default
CreationTimestamp: Mon, 06 May 2025 14:21:52 +0000
Labels: app=your-app
Selector: app=your-app
Replicas: 3 desired | 3 updated | 3 total | 3 available | 0 unavailable
StrategyType: RollingUpdate
Conditions:
  Type Status Reason
  ---- ------ ------
  Available True MinimumReplicasAvailable
  Progressing True NewReplicaSetAvailable

Pod Template:
  Containers:
   your-app:
    Image: ghcr.io/your-org/your-app:2025.05.06
    Port: 8080/TCP
    Environment:
      FEATURE_BACKGROUND_REINDEXING: "true"
      DATABASE_URL: "postgres://db.svc.cluster.local"
    Mounts:
      /etc/config from config-volume (ro)
      /etc/secrets from secret-volume (ro)
Volumes:
  config-volume:
    ConfigMapName: app-config
  secret-volume:
    SecretName: db-secret
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check env vars, ConfigMaps, and secret mounts. Also audit Git or your config source of truth. In the example output above, all pods are healthy and the rollout was successful, but the environment variable FEATURE_BACKGROUND_REINDEXING is enabled, likely triggering background operations that were not meant for production and causing performance regressions.&lt;/p&gt;
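&lt;p&gt;Audits like this can be automated. A minimal sketch that flags feature-style env vars not approved for the environment (the FEATURE_ naming convention is an assumption borrowed from the example above):&lt;/p&gt;

```python
def unexpected_flags(env, allowed_flags):
    """Return feature-flag env vars that are enabled but not on the
    approved list. Assumes flags follow a FEATURE_ naming convention,
    as in the example deployment above."""
    return [k for k, v in env.items()
            if k.startswith("FEATURE_") and v == "true"
            and k not in allowed_flags]

env = {"FEATURE_BACKGROUND_REINDEXING": "true",
       "DATABASE_URL": "postgres://db.svc.cluster.local"}
print(unexpected_flags(env, allowed_flags=set()))
# ['FEATURE_BACKGROUND_REINDEXING']
```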

&lt;p&gt;&lt;strong&gt;Final Diagnosis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you’ve ruled out infrastructure and Kubernetes mechanics, your issue is almost certainly in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Business logic&lt;/li&gt;
&lt;li&gt;Misbehaving external systems&lt;/li&gt;
&lt;li&gt;Unoptimized code paths&lt;/li&gt;
&lt;li&gt;Bad configs or feature toggles&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With solid RED metrics, tracing, profiling, and dependency checks you’ll isolate the slowest or weakest part of the app lifecycle.&lt;/p&gt;

&lt;h3&gt;
  
  
  Common Challenges in Monitoring Kubernetes
&lt;/h3&gt;

&lt;p&gt;Monitoring a Kubernetes environment isn’t just about scraping some metrics and throwing them into dashboards. In real-world scenarios, especially large-scale, multi-team clusters, there are unique challenges that can cripple even the best monitoring setups. Here are some of the most common ones teams face:&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Metric Overload&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;With so many layers (clusters, nodes, control planes, pods, apps), it’s easy to end up with thousands of metrics. But more metrics does not equal better observability. Without a clear signal-to-noise ratio, teams get stuck chasing anomalies that don’t matter, while missing critical signals that do.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Inconsistent Metric Sources&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Kubernetes components expose metrics in different formats and via different tools (Prometheus, ELK/EFK Stack, etc). This fragmentation can lead to incomplete or duplicated data, and sometimes even conflicting insights, making root cause analysis harder.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Multi-Tenancy Complexity&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;In shared clusters, multiple teams deploy and monitor their own apps. Without clear namespacing, labeling, and role-based access, it becomes hard to isolate responsibility or debug performance issues without stepping on each other’s toes.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Scaling Problems&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;At smaller scales, you might get by with basic dashboards. But as your workloads grow, so do the cardinality of metrics, storage costs, and processing load on your observability stack. Without a scalable monitoring setup, you risk cluttered dashboards and missed alerts.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Monitoring the Monitoring System&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Ironically, one of the most overlooked gaps is keeping tabs on your observability stack itself. What happens if Prometheus crashes? Or if Alertmanager silently dies? Monitoring the monitor ensures you’re not blind when it matters the most.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Break-Glass Mechanisms&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Sometimes, no matter how well things are set up, you need to bypass the dashboards and go straight to logs, live debugging, or kubectl inspections. Having a documented “break-glass” process with emergency steps to dig deeper can save time during production outages.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to Overcome These Challenges With Best Practices
&lt;/h3&gt;

&lt;p&gt;While Kubernetes observability can feel overwhelming, a thoughtful strategy and the right tools can make all the difference.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Focus on High-Value Metrics&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Instead of tracking everything, prioritize Golden Signals, RED/USE metrics, and metrics tied to SLAs and SLOs. Create dashboards with intent, not clutter.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Standardize Your Metric Sources&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Use a centralized metrics pipeline, typically &lt;a href="https://www.cloudraft.io/blog/prometheus-best-practices" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt;, with exporters like kube-state-metrics, node-exporter, and custom app exporters. Stick to consistent naming conventions and labels to avoid confusion across teams.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Use Labels &amp;amp; Namespaces Effectively&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Organize metrics by namespace, team, or application, and apply proper labels to distinguish tenants. Use tools like Prometheus’ relabeling and Grafana’s variable filters to slice metrics cleanly per use case.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Design for Scale&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Enable metric retention policies, recording rules, and downsampling. Consider remote write to long-term storage (like &lt;a href="https://www.cloudraft.io/grafana-mimir-support" rel="noopener noreferrer"&gt;Grafana Mimir&lt;/a&gt;) for large environments. Test how your dashboards perform under load.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Monitor Your Monitoring&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Set up alerts for your observability stack (e.g., “Is Prometheus scraping?”, “Is Alertmanager up?”). Include basic health checks for Grafana, Prometheus, exporters, and data sources.&lt;/p&gt;
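&lt;p&gt;The meta-monitoring checks above can be expressed as ordinary Prometheus alerting rules; this is a minimal sketch, with thresholds chosen for illustration:&lt;/p&gt;

```yaml
groups:
  - name: meta-monitoring
    rules:
      # Fires when any scrape target has been down for 5 minutes
      - alert: TargetDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
      # Fires when Prometheus fails to deliver notifications to Alertmanager
      - alert: AlertmanagerNotificationsFailing
        expr: rate(prometheus_notifications_errors_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
```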

&lt;h4&gt;
  
  
  &lt;strong&gt;Establish “Break-Glass” Documents&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Have documented steps for when observability fails, like which logs to tail, which kubectl commands to run, or how to access emergency dashboards. Practice chaos drills so everyone knows what to do.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tools That Help You Monitor These Metrics
&lt;/h3&gt;

&lt;p&gt;Understanding what to monitor is only half the task; the other half is how you actually collect, store, and visualize that data in a scalable and insightful way. The Kubernetes ecosystem has a rich set of observability tools that make this easier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prometheus and Grafana&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.cloudraft.io/monitoring-with-prometheus" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt; is the de facto standard for scraping, storing, and querying time-series metrics in Kubernetes.&lt;/li&gt;
&lt;li&gt;Grafana lets you visualize those metrics and set up alerting.&lt;/li&gt;
&lt;li&gt;With exporters like &lt;code&gt;node-exporter&lt;/code&gt; and &lt;code&gt;kube-state-metrics&lt;/code&gt;, you can cover everything from node health to pod status and custom application metrics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Best for teams looking for full control, custom dashboards, and open-source extensibility.&lt;/p&gt;
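&lt;p&gt;As a sketch, wiring both exporters into Prometheus is a matter of two scrape jobs. The service DNS names below assume the exporters run in the &lt;code&gt;monitoring&lt;/code&gt; and &lt;code&gt;kube-system&lt;/code&gt; namespaces; ports 9100 and 8080 are the exporters' defaults.&lt;/p&gt;

```yaml
scrape_configs:
  # Node-level hardware and OS metrics
  - job_name: node-exporter
    static_configs:
      - targets: ['node-exporter.monitoring.svc:9100']
  # Cluster-object state: deployments, pods, nodes
  - job_name: kube-state-metrics
    static_configs:
      - targets: ['kube-state-metrics.kube-system.svc:8080']
```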

&lt;p&gt;&lt;strong&gt;kube-state-metrics&lt;/strong&gt; This is a service that listens to the Kubernetes API server and generates metrics about the state of Kubernetes objects like deployments, nodes, and pods. It complements Prometheus by exposing high-level cluster state metrics (e.g., number of ready pods, desired replicas, node conditions). Best for cluster-level insights and higher-order metrics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;External Monitoring Services (VictoriaMetrics, Jaeger, OpenTelemetry, etc.)&lt;/strong&gt; These open-source tools form a powerful observability stack for Kubernetes environments. &lt;a href="https://docs.victoriametrics.com/guides/k8s-monitoring-via-vm-cluster/" rel="noopener noreferrer"&gt;VictoriaMetrics&lt;/a&gt; handles efficient metric storage, &lt;a href="https://www.cloudraft.io/blog/open-telemetry-auto-instrumentation" rel="noopener noreferrer"&gt;OpenTelemetry&lt;/a&gt; standardizes tracing and metrics across services, and with Jaeger, engineers can monitor and troubleshoot distributed transactions. Together, they give you flexibility, cost savings, and full control over your monitoring pipeline, without vendor lock-in.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Gathering data is only one aspect of monitoring Kubernetes; gathering the appropriate data, so that prompt, well-informed decisions can be made, is another. Knowing which metrics to watch is your best defense, whether you’re a platform team scaling across clusters, a DevOps engineer optimizing performance, or a Site Reliability Engineer fighting a late-night outage.&lt;/p&gt;

&lt;p&gt;That is why having a well-defined &lt;a href="https://www.cloudraft.io/blog/guide-to-observability" rel="noopener noreferrer"&gt;observability strategy&lt;/a&gt;, one that cuts through clutter, highlights what is needed, and adapts as your architecture evolves, is no longer optional. Teams are increasingly turning to frameworks, tooling, and purpose-built observability solutions that support this shift toward proactive, insight-driven operations. At the end of the day, metrics are your map, but only if you’re reading the right signs. Focus on these key signals, and you’ll spend less time digging through data and more time solving real problems.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://www.cloudraft.io/blog/top-metrics-to-watch-in-kubernetes" rel="noopener noreferrer"&gt;&lt;em&gt;https://www.cloudraft.io&lt;/em&gt;&lt;/a&gt; &lt;em&gt;on May 13, 2025.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
