<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vibhuti Sharma</title>
    <description>The latest articles on DEV Community by Vibhuti Sharma (@vibhuti_sharma).</description>
    <link>https://dev.to/vibhuti_sharma</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3240942%2F03c25aa6-7df8-4907-aaba-f7015318fb62.png</url>
      <title>DEV Community: Vibhuti Sharma</title>
      <link>https://dev.to/vibhuti_sharma</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vibhuti_sharma"/>
    <language>en</language>
    <item>
      <title>Monitoring AWS Batch Jobs with CloudWatch Custom Metrics</title>
      <dc:creator>Vibhuti Sharma</dc:creator>
      <pubDate>Thu, 26 Mar 2026 07:38:58 +0000</pubDate>
      <link>https://dev.to/vibhuti_sharma/monitoring-aws-batch-jobs-with-cloudwatch-custom-metrics-25ke</link>
      <guid>https://dev.to/vibhuti_sharma/monitoring-aws-batch-jobs-with-cloudwatch-custom-metrics-25ke</guid>
      <description>&lt;p&gt;AWS Batch service is used for various compute workloads like data processing pipelines, background jobs and scheduled compute tasks. AWS provides many infrastructure-level metrics for Batch in CloudWatch, however there is a significant gap when it comes to job status monitoring. For example, the number of jobs that are RUNNABLE, RUNNING, FAILED, or SUCCEEDED are not available by default in CloudWatch. These metrics are visible on the AWS Batch dashboard but it does not exist in CloudWatch as a metric. This makes it difficult to answer operational questions such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Are jobs accumulating in a RUNNABLE state?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Are the jobs failing frequently?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Is the system keeping up with the workload?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without these metrics, building meaningful dashboards or alerts for Batch workloads becomes challenging.&lt;/p&gt;

&lt;p&gt;In this blog post, we will close this observability gap by exporting custom AWS Batch job status metrics to CloudWatch, where they can be consumed by any third-party observability tool. The post walks through a custom setup that exports these metrics using EventBridge, a Lambda function, the Batch API, and CloudWatch custom metrics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which AWS Batch metrics does CloudWatch publish by default?
&lt;/h2&gt;

&lt;p&gt;AWS Batch publishes a limited set of infrastructure metrics to CloudWatch under the ECS/ContainerInsights namespace (Container Insights). These metrics primarily describe compute environment capacity and resource utilization rather than the status of jobs.&lt;/p&gt;

&lt;p&gt;Examples of metrics available by default include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;StorageReadBytes: number of bytes read from storage on the instance&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;NetworkTxBytes: number of bytes transmitted by the resource&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CpuReserved: CPU units reserved by tasks in the resource&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;StorageWriteBytes: number of bytes written to storage in the resource&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;EphemeralStorageReserved: number of bytes reserved from ephemeral storage in the resource&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;TaskCount: number of tasks running in the cluster&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;MemoryReserved: memory that is reserved by tasks in the resource&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;EphemeralStorageUtilized: number of bytes used from ephemeral storage in the resource&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;NetworkRxBytes: number of bytes received by the resource&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CpuUtilized: CPU units used by tasks in the resource&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ServiceCount: number of services in the cluster&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ContainerInstanceCount: number of EC2 instances running the Amazon ECS agent that are registered with a cluster&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;MemoryUtilized: memory being used by tasks in the resource&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, job status metrics are not published by default. These missing metrics include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Number of RUNNABLE jobs&lt;/li&gt;
&lt;li&gt;Number of RUNNING jobs&lt;/li&gt;
&lt;li&gt;Number of FAILED jobs&lt;/li&gt;
&lt;li&gt;Number of SUCCEEDED jobs&lt;/li&gt;
&lt;li&gt;Number of SUBMITTED jobs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These metrics are critical for monitoring Batch workloads because they indicate system health and throughput.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A growing RUNNABLE job count may indicate insufficient compute capacity.&lt;/li&gt;
&lt;li&gt;A spike in FAILED jobs may indicate application or infrastructure issues.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To obtain these metrics, we need to query the AWS Batch API and publish the results ourselves.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to export AWS Batch job status metrics to CloudWatch
&lt;/h2&gt;

&lt;p&gt;In this solution, we periodically query the AWS Batch API for job states and publish the per-status job counts as custom CloudWatch metrics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Components used&lt;/strong&gt;&lt;br&gt;
The setup consists of four components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;EventBridge Rule
&lt;ul&gt;
&lt;li&gt;Runs on a schedule (for example, every 5 minutes)&lt;/li&gt;
&lt;li&gt;Triggers a Lambda function&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Lambda Function
&lt;ul&gt;
&lt;li&gt;Calls the AWS Batch API&lt;/li&gt;
&lt;li&gt;Retrieves job counts by status&lt;/li&gt;
&lt;li&gt;Aggregates the results&lt;/li&gt;
&lt;li&gt;Publishes metrics to CloudWatch&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;AWS Batch API
&lt;ul&gt;
&lt;li&gt;Provides job information through API calls such as ListJobs, which returns a jobSummaryList&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;CloudWatch Custom Metrics
&lt;ul&gt;
&lt;li&gt;Stores the exported job status metrics&lt;/li&gt;
&lt;li&gt;Exposes them for dashboards and alerts&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Workflow&lt;/strong&gt;&lt;br&gt;
The process works as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;EventBridge triggers the Lambda function on a schedule.&lt;/li&gt;
&lt;li&gt;Lambda queries AWS Batch for job counts in each status.&lt;/li&gt;
&lt;li&gt;The job counts are aggregated.&lt;/li&gt;
&lt;li&gt;Lambda publishes these counts as custom metrics to CloudWatch.&lt;/li&gt;
&lt;li&gt;Grafana or another observability tool reads these metrics from CloudWatch.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This architecture is serverless, inexpensive, and easy to extend, and it does not require changes to existing Batch workloads.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxv7x197ggt51ykdo77v2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxv7x197ggt51ykdo77v2.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Cost considerations
&lt;/h2&gt;

&lt;p&gt;The cost of this setup is minimal because it relies entirely on serverless services and lightweight API calls.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;EventBridge:&lt;/strong&gt; EventBridge scheduled rules cost a fraction of a cent per million invocations. With a schedule of every 5 minutes, the cost is negligible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lambda:&lt;/strong&gt; The Lambda function only performs a small number of API calls and executes for a short duration. In most cases, this will remain well within the Lambda free tier.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CloudWatch Custom Metrics:&lt;/strong&gt; CloudWatch custom metrics are the primary cost factor. CloudWatch charges per metric per month. However, since the setup only publishes a small number of metrics (typically 4–6), the total cost remains low.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For example, publishing metrics for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RUNNABLE&lt;/li&gt;
&lt;li&gt;RUNNING&lt;/li&gt;
&lt;li&gt;FAILED&lt;/li&gt;
&lt;li&gt;SUCCEEDED&lt;/li&gt;
&lt;li&gt;SUBMITTED&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This results in only a handful of custom metrics. Overall, the monthly cost of this setup is typically very small compared to the operational visibility it provides.&lt;/p&gt;
&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;

&lt;p&gt;The implementation consists of four main steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Creating the Lambda function&lt;/li&gt;
&lt;li&gt;Querying the Batch API&lt;/li&gt;
&lt;li&gt;Publishing metrics to CloudWatch&lt;/li&gt;
&lt;li&gt;Scheduling the function using EventBridge&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Lambda function logic
&lt;/h2&gt;

&lt;p&gt;The Lambda function performs the following actions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retrieves job queues&lt;/li&gt;
&lt;li&gt;Queries jobs by status&lt;/li&gt;
&lt;li&gt;Counts jobs in each state&lt;/li&gt;
&lt;li&gt;Publishes the counts to CloudWatch&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example Python implementation:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;

&lt;span class="n"&gt;batch_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;batch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cloudwatch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloudwatch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;batch_metrics_exporter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;job_queues&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JOB_QUEUES&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;compute_env_mapping&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;COMPUTE_ENV_MAPPING&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

        &lt;span class="n"&gt;all_metrics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;queue_name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;job_queues&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;job_counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_job_counts_by_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;queue_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;compute_env&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;compute_env_mapping&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;queue_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;unknown&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;job_counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="n"&gt;all_metrics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;MetricName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;Jobs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Dimensions&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;JobQueue&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;queue_name&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ComputeEnvironment&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;compute_env&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
                    &lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Unit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Timestamp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt;
                &lt;span class="p"&gt;})&lt;/span&gt;

        &lt;span class="n"&gt;batch_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_metrics&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;cloudwatch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put_metric_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;Namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;YOUR_NAMESPACE&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;MetricData&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;all_metrics&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;statusCode&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Successfully published &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_metrics&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; metrics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;queues_processed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;job_queues&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error in lambda_handler: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;statusCode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)})}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_job_counts_by_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job_queue_name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;job_statuses&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SUBMITTED&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PENDING&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;RUNNABLE&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;STARTING&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;RUNNING&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SUCCEEDED&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;FAILED&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;job_counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Submitted&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Runnable&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Running&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Failed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Succeeded&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;job_statuses&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;batch_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_jobs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jobQueue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;job_queue_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jobStatus&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;jobSummaryList&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]))&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SUBMITTED&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PENDING&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
                &lt;span class="n"&gt;job_counts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Submitted&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;
            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;RUNNABLE&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;STARTING&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
                &lt;span class="n"&gt;job_counts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Runnable&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;
            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;RUNNING&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;job_counts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Running&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;
            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;FAILED&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;job_counts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Failed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;
            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SUCCEEDED&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;job_counts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Succeeded&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error getting job counts for queue &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;job_queue_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;job_counts&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
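&lt;p&gt;One caveat about the example above: list_jobs returns at most 100 job summaries per call, so busy queues can be undercounted. A paginator-based counting helper (a sketch; count_jobs_paginated is not part of the original function, and batch_client is assumed to be a boto3 Batch client created elsewhere) would look like this:&lt;/p&gt;

```python
def count_jobs_paginated(batch_client, job_queue_name, status):
    """Count every job in a given status, following pagination.

    batch_client is assumed to be a boto3 Batch client, e.g.
    boto3.client("batch"). list_jobs returns at most 100 summaries
    per page, so we walk every page via the built-in paginator.
    """
    paginator = batch_client.get_paginator("list_jobs")
    total = 0
    for page in paginator.paginate(jobQueue=job_queue_name, jobStatus=status):
        total += len(page.get("jobSummaryList", []))
    return total
```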

&lt;h2&gt;
  
  
  IAM permissions required for the Lambda function
&lt;/h2&gt;

&lt;p&gt;The Lambda function requires permissions for Batch APIs and CloudWatch metrics.&lt;/p&gt;

&lt;p&gt;Minimal IAM policy example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"batch:DescribeJobQueues"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"batch:ListJobs"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cloudwatch:PutMetricData"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Setting up the EventBridge schedule&lt;/strong&gt;&lt;br&gt;
Create an EventBridge rule with a schedule expression, and set the rate to match how fresh you need the metrics to be.&lt;/p&gt;

&lt;p&gt;Example: &lt;code&gt;rate(5 minutes)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Attach the Lambda function as the target. This ensures the metrics are refreshed periodically.&lt;/p&gt;
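&lt;p&gt;As a sketch, the schedule can be wired up with the AWS CLI. The rule name, function name, account ID, and region below are placeholders and should be replaced with your own:&lt;/p&gt;

```shell
# Create a rule that fires every 5 minutes (rule name is a placeholder)
aws events put-rule \
  --name batch-job-metrics-schedule \
  --schedule-expression "rate(5 minutes)"

# Point the rule at the Lambda function (replace the ARN with yours)
aws events put-targets \
  --rule batch-job-metrics-schedule \
  --targets 'Id=1,Arn=arn:aws:lambda:us-east-1:123456789012:function:batch-job-metrics'

# Allow EventBridge to invoke the function
aws lambda add-permission \
  --function-name batch-job-metrics \
  --statement-id allow-eventbridge \
  --action lambda:InvokeFunction \
  --principal events.amazonaws.com \
  --source-arn arn:aws:events:us-east-1:123456789012:rule/batch-job-metrics-schedule
```

&lt;p&gt;Terraform or CloudFormation work equally well; the essential pieces are the schedule expression, the Lambda target, and the invoke permission.&lt;/p&gt;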

&lt;h2&gt;
  
  
  How to verify AWS Batch metrics in CloudWatch
&lt;/h2&gt;

&lt;p&gt;Once the Lambda function begins publishing metrics, they will appear in CloudWatch under the custom namespace used in the implementation.&lt;/p&gt;

&lt;p&gt;Navigate to: CloudWatch → Metrics → the custom namespace published by the Lambda function.&lt;/p&gt;

&lt;p&gt;You should see metrics such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JobCount (Status=RUNNABLE)&lt;/li&gt;
&lt;li&gt;JobCount (Status=RUNNING)&lt;/li&gt;
&lt;li&gt;JobCount (Status=FAILED)&lt;/li&gt;
&lt;li&gt;JobCount (Status=SUCCEEDED)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each metric will also include dimensions such as the job queue name, allowing filtering per queue. These metrics can now be queried, visualized, or used for alerting.&lt;/p&gt;
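&lt;p&gt;For alerting, an alarm on a growing RUNNABLE backlog could be sketched with the AWS CLI. The namespace, dimension names, threshold, and SNS topic here are assumptions and must match whatever your Lambda actually publishes:&lt;/p&gt;

```shell
# Alarm when the RUNNABLE backlog stays above 50 jobs
# (namespace, dimensions, and ARN are placeholders)
aws cloudwatch put-metric-alarm \
  --alarm-name batch-runnable-backlog \
  --namespace "AWSBatch/JobMetrics" \
  --metric-name JobCount \
  --dimensions Name=Status,Value=RUNNABLE Name=JobQueue,Value=my-job-queue \
  --statistic Maximum \
  --period 300 \
  --evaluation-periods 3 \
  --threshold 50 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:batch-alerts
```

&lt;p&gt;With three 5-minute evaluation periods, the alarm fires only after the backlog stays above the threshold for roughly 15 minutes, which avoids paging on brief spikes.&lt;/p&gt;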

&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;p&gt;One of the limitations of the Batch API is that it returns point-in-time snapshots rather than time series data.&lt;/p&gt;

&lt;p&gt;This means the metrics represent the number of jobs in each state at the moment the Lambda function runs, rather than a continuous stream of job events.&lt;/p&gt;

&lt;p&gt;However, this limitation can be worked around downstream: if the snapshots are scraped into a time-series system such as Prometheus, PromQL queries (visualized in Grafana, for example) can derive trends across them.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deriving job failure rates&lt;/li&gt;
&lt;li&gt;Calculating trends in runnable job backlog&lt;/li&gt;
&lt;li&gt;Detecting abnormal changes in job states&lt;/li&gt;
&lt;/ul&gt;
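&lt;p&gt;As an illustration of deriving a rate from point-in-time snapshots, the same idea can be sketched in plain Python. The function and sample data are hypothetical and not part of the Lambda above:&lt;/p&gt;

```python
def failures_per_minute(samples):
    """Estimate a failure rate from chronological (timestamp_seconds, failed_count)
    snapshots, such as those published by the scheduled Lambda.

    A drop in the counter (e.g. jobs aging out of the FAILED list) is
    treated as zero new failures rather than a negative rate.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        new_failures = max(c1 - c0, 0)   # ignore counter resets
        minutes = (t1 - t0) / 60.0
        rates.append(new_failures / minutes if minutes > 0 else 0.0)
    return rates

# Snapshots taken 5 minutes apart: 3 new failures, then none, then a reset
samples = [(0, 2), (300, 5), (600, 5), (900, 3)]
print(failures_per_minute(samples))  # [0.6, 0.0, 0.0]
```

&lt;p&gt;This is exactly what PromQL functions like &lt;code&gt;rate()&lt;/code&gt; do over scraped samples, which is why the snapshot limitation matters less once the metrics land in a time-series store.&lt;/p&gt;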

&lt;p&gt;Another limitation is data delay, which depends on the EventBridge schedule. If the rule runs every 5 minutes, the metrics can lag by up to five minutes.&lt;/p&gt;

&lt;p&gt;Reducing the schedule interval improves freshness but increases API usage.&lt;/p&gt;

&lt;h2&gt;
  
  
  When should you use this setup?
&lt;/h2&gt;

&lt;p&gt;This approach is most useful when Batch workloads involve long-running or heavy compute jobs. In such environments, understanding job queue health is important for operational stability.&lt;/p&gt;

&lt;p&gt;Examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data processing pipelines&lt;/li&gt;
&lt;li&gt;Machine learning workloads&lt;/li&gt;
&lt;li&gt;ETL systems&lt;/li&gt;
&lt;li&gt;Background compute services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, the setup may be less useful for very short-lived jobs that start and complete within seconds. In those cases, the scheduled polling approach may miss transient states.&lt;/p&gt;

&lt;p&gt;Therefore, this solution is most effective when jobs run long enough to be captured within the scheduled polling interval.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;AWS Batch provides strong compute orchestration capabilities but lacks native job-level observability metrics in CloudWatch. By combining EventBridge, Lambda, the Batch API, and CloudWatch custom metrics, it is possible to export job status metrics and integrate them into existing observability dashboards.&lt;/p&gt;

&lt;p&gt;This setup provides visibility into queue backlog, job failures, and system throughput, enabling better operational monitoring of Batch workloads. In practice, this solution has proven useful for tracking job health and building meaningful dashboards around Batch-based workloads. With minimal infrastructure and low cost, it significantly improves observability for production Batch environments.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.cloudraft.io" rel="noopener noreferrer"&gt;https://www.cloudraft.io&lt;/a&gt; on March 24, 2026.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>observ</category>
      <category>aws</category>
      <category>awsbatch</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Implementing Compliance-first Observability with OpenTelemetry</title>
      <dc:creator>Vibhuti Sharma</dc:creator>
      <pubDate>Fri, 20 Jun 2025 12:52:57 +0000</pubDate>
      <link>https://dev.to/vibhuti_sharma/implementing-compliance-first-observability-with-opentelemetry-14c4</link>
      <guid>https://dev.to/vibhuti_sharma/implementing-compliance-first-observability-with-opentelemetry-14c4</guid>
      <description>&lt;h2&gt;
  
  
  Observability isn’t optional and neither is Compliance
&lt;/h2&gt;

&lt;p&gt;When we talk about observability, one thing often gets missed in the conversation: compliance. We all know observability is essential. When you’re running any kind of modern application or infrastructure, having good visibility through logs, metrics, and traces is not just helpful, it’s how you keep systems stable, catch issues early, and move with confidence.&lt;/p&gt;

&lt;p&gt;But while we’ve gotten better at collecting and analyzing the data, we haven’t always paid enough attention to what that data contains or where it ends up. These logs and data can easily include sensitive information. Things like user details, access tokens, or internal system behavior often get logged without much thought. And if that data is exposed or mishandled, it turns into a serious risk both legally and operationally.&lt;/p&gt;

&lt;p&gt;In this blog post, I’ll walk you through how to build observability pipelines that are not only functional but also secure, compliant, and built with intention. We’ll look at how OpenTelemetry can help with that, and why its processors are one of the most effective ways to protect and control the flow of telemetry data.&lt;/p&gt;

&lt;h2&gt;
  
  
  What It Costs When Compliance Fails
&lt;/h2&gt;

&lt;p&gt;We often think of compliance as just legal formalities or contracts. But when compliance fails, the consequences are real and can hit a business far harder than expected. Even a small oversight, like an email address showing up in a debug log, or a trace containing sensitive user data being sent to an external service without proper filtering, can become a much larger issue. These incidents do not just violate policies; they also break customers’ trust, trigger audits, and can lead to financial and legal damage.&lt;/p&gt;

&lt;p&gt;According to the &lt;a href="https://www.ibm.com/reports/data-breach" rel="noopener noreferrer"&gt;IBM Cost of a Data Breach Report&lt;/a&gt;, the global average cost of a breach in 2024 was over 4.9 million dollars. In regulated industries such as healthcare, finance, and insurance, that number tends to be even higher. And the cost isn’t just about regulatory fines. A significant portion comes from the loss of business, system downtime, incident response, and long-term brand reputation issues. Even if your organization is not governed by strict regulations like GDPR, HIPAA, or PCI-DSS, your users still expect their data to be treated with care. Once trust is lost, it’s incredibly difficult to win back.&lt;/p&gt;

&lt;p&gt;That is why compliance can’t be treated as something you can add later. It has to be built into the foundation, and that includes how we handle observability data. When telemetry pipelines are left unguarded, they can quietly become one of the biggest liabilities in your stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenTelemetry and Its Role in Data Protection
&lt;/h2&gt;

&lt;p&gt;OpenTelemetry has quickly become the standard framework for collecting telemetry data across modern, distributed systems. It provides a consolidated approach to gathering logs, metrics, and traces and offers the flexibility to send that data to a variety of destinations, from observability platforms to self-hosted backends or data lakes.&lt;/p&gt;

&lt;p&gt;While OpenTelemetry is excellent at solving how data is collected and transported, it places the responsibility for what data is captured and how securely it is handled entirely on the user. Its flexibility is a strength, but also a risk. OpenTelemetry will not automatically prevent sensitive data from flowing through your pipelines.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What Kind of Telemetry Data Needs Protection?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Before going into the protection strategies, let’s first identify the types of data that could pose compliance risks. Common examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Personally Identifiable Information (PII):&lt;/strong&gt; emails, phone numbers, user IDs&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sensitive system metadata:&lt;/strong&gt; IP addresses, internal hostnames&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Confidential business context:&lt;/strong&gt; debug logs, internal environment tags&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Regulatory-bound attributes:&lt;/strong&gt; region identifiers (e.g., for GDPR)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If this data makes its way into your telemetry stream, it will continue through your system unless you explicitly configure rules to stop or modify it.&lt;/p&gt;

&lt;p&gt;This is where the &lt;a href="https://opentelemetry.io/docs/collector/" rel="noopener noreferrer"&gt;OpenTelemetry Collector&lt;/a&gt; becomes critical. Acting as the central hub between data sources and their destinations, the Collector offers a place where telemetry data can be inspected, filtered, transformed, or enriched before it moves any further. It is here that organizations gain control over what data stays, what data is modified, and what data never leaves the boundary at all.&lt;/p&gt;

&lt;p&gt;With the right configurations, the Collector becomes more than just a routing tool. It becomes a guardrail for enforcing data protection standards, filtering out sensitive information, and helping ensure compliance with security and privacy requirements. OpenTelemetry, when used thoughtfully, is not just an observability solution. It is a foundational piece of your data protection strategy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbse8cl43owfrtoxk4qm1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbse8cl43owfrtoxk4qm1.jpg" alt="OpenTelemetry Collector" width="800" height="487"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Solving Real Compliance Challenges with OpenTelemetry Processors
&lt;/h2&gt;

&lt;p&gt;Processors are one of the most important functions within the OpenTelemetry Collector when it comes to enforcing data protection and compliance. Positioned between data collection and export, they serve as the transformation and control layer where critical compliance logic can be applied before telemetry leaves your environment.&lt;/p&gt;

&lt;p&gt;The strength of processors lies in their flexibility. They allow you to redact, suppress, enrich, or reshape telemetry based on your organization's security and privacy requirements. This feature is essential when dealing with sensitive or regulated data flowing through your observability systems.&lt;/p&gt;

&lt;p&gt;Here are some of the practical ways processors help us address real-world compliance concerns:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Redacting Sensitive Information&lt;/strong&gt;&lt;br&gt;
Logs and traces often contain personal or confidential data such as email addresses, user IDs, or access tokens. Processors like attributesprocessor or transformprocessor can be configured to automatically remove or anonymize these values, helping prevent unintentional exposure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Filtering Non-Compliant Data&lt;/strong&gt;&lt;br&gt;
Telemetry that includes content that violates policies can be filtered out entirely before reaching any downstream systems. This helps reduce risk and ensures that observability does not become a liability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enforcing Data Residency and Routing Rules&lt;/strong&gt;&lt;br&gt;
For organizations subject to regional data protection laws, processors can route or drop telemetry based on attributes such as geography or service type. This ensures that data remains within defined boundaries and complies with jurisdictional requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Normalizing and Structuring Telemetry for Audits&lt;/strong&gt;&lt;br&gt;
Compliance often requires structured, consistent data. Processors can standardize field names, values, and formats so that logs, metrics, and traces align with internal audit and reporting standards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reducing Noise to Highlight What Matters&lt;/strong&gt;&lt;br&gt;
Not all telemetry is useful, and excessive data can obscure important signals. Processors help reduce noise by removing redundant spans or unnecessary attributes, making it easier to focus on meaningful insights while keeping compliance in check.&lt;/p&gt;
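&lt;p&gt;As one concrete sketch of noise reduction, the contrib &lt;strong&gt;probabilistic_sampler&lt;/strong&gt; processor keeps only a fraction of traces (the percentage here is illustrative, not a recommendation):&lt;/p&gt;

```yaml
processors:
  # Keep roughly 10% of traces; the rest are dropped before export
  probabilistic_sampler:
    sampling_percentage: 10
```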

&lt;p&gt;By configuring processors with compliance in mind, organizations can ensure that observability pipelines are secure, responsible, and aligned with compliance goals. This control layer not only supports regulatory requirements but also promotes better data hygiene and operational clarity. When designed properly, processors become more than just a technical feature; they represent a proactive step toward secure and compliant observability.&lt;/p&gt;
&lt;h2&gt;
  
  
  Practical Examples of Compliance-first Telemetry Pipelines
&lt;/h2&gt;

&lt;p&gt;After exploring the role of processors in enforcing compliance, let’s look at how to bring it all together in real-world telemetry pipelines. Building compliance-first observability is not just about theory; it is about designing workflows that consistently protect data across environments.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;List of Processors in OpenTelemetry Collector&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor" rel="noopener noreferrer"&gt;OpenTelemetry Collector provides several processors&lt;/a&gt; out of the box. The most commonly used ones for compliance are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;attributesprocessor&lt;/strong&gt; – Add, remove, update, or redact specific attributes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;filterprocessor&lt;/strong&gt; – Filter spans or logs based on matching criteria&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;routingprocessor&lt;/strong&gt; – Route telemetry conditionally based on resource or attribute values&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;transformprocessor&lt;/strong&gt; – Use expressions to rename, update, or drop fields&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How to Choose the Right Processor&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Pick the processor that matches the job at hand:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Processor&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Remove or redact sensitive fields&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;attributesprocessor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Drop unnecessary or risky logs/spans&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;filterprocessor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Route telemetry based on geography or team&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;routingprocessor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Normalize field names for audit compliance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;transformprocessor&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Example 1: Redacting PII in application traces&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In many applications, traces can unintentionally carry personally identifiable information like user email addresses or phone numbers. To address this, you can build a pipeline that begins with the &lt;strong&gt;otlp&lt;/strong&gt; receiver, processes the trace data through an &lt;strong&gt;attributesprocessor&lt;/strong&gt; configured to detect and redact sensitive fields such as &lt;strong&gt;user.email&lt;/strong&gt; or &lt;strong&gt;user.phone&lt;/strong&gt;, and finally exports it to a tracing backend like Jaeger or another OTLP-compatible service.&lt;br&gt;
&lt;strong&gt;Example Config:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;protocols&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;grpc&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;

&lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;attributes/pii_redaction&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;actions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;user.email&lt;/span&gt;
        &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;delete&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;user.phone&lt;/span&gt;
        &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;delete&lt;/span&gt;

&lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;jaeger&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http://jaeger-collector:14250'&lt;/span&gt;
    &lt;span class="na"&gt;tls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;insecure&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;traces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;attributes/pii_redaction&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;jaeger&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Outcome:&lt;/strong&gt;&lt;br&gt;
This pipeline deletes any attribute named &lt;code&gt;user.email&lt;/code&gt; or &lt;code&gt;user.phone&lt;/code&gt; before data is exported to Jaeger, ensuring no PII leaves the pipeline. With this setup, you preserve the diagnostic value of the trace without risking exposure of personal data. This approach helps maintain user privacy and stay aligned with data protection policies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example 2: Filtering internal debug logs in production&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Developers often include verbose debug logs during development, but these logs are rarely suitable for production. In a compliance-first pipeline, you can start with a &lt;strong&gt;filelog&lt;/strong&gt; or &lt;strong&gt;fluentforward&lt;/strong&gt; receiver and pass the logs through a &lt;strong&gt;filterprocessor&lt;/strong&gt; that drops entries where severity is set to &lt;strong&gt;"DEBUG"&lt;/strong&gt; or the environment tag indicates it's development-only. The cleaned logs are then sent to a system like Google Cloud Logging or Datadog.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example Config:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;filelog&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;include&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/var/log/app/*.log'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;filter/drop_debug_logs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;logs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;log_record&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;severity_text&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DEBUG'&lt;/span&gt;
&lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;googlecloud&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-production-project&lt;/span&gt;
&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;logs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;filelog&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;filter/drop_debug_logs&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;googlecloud&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Outcome:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This ensures that only production-relevant and compliant log data is exported, reducing both operational risk and unnecessary storage or processing costs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example 3: Ensuring data residency compliance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s say your organization collects telemetry from EU-based services and must comply with regional data residency laws. The pipeline begins with the &lt;strong&gt;otlp&lt;/strong&gt; receiver and uses a &lt;strong&gt;routingprocessor&lt;/strong&gt; to inspect attributes like &lt;strong&gt;region = eu-west1&lt;/strong&gt;. Based on this, telemetry is selectively routed to an EU-based backend only.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example Config:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;protocols&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;grpc&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;routing/data_residency&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;eu-west1&lt;/span&gt;
        &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;eu_backend&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;statement&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;match(resource.attributes["region"], "eu-west1")&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
        &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;non_eu_backend&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;eu_backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;eu-collector.mycompany.com'&lt;/span&gt;
  &lt;span class="na"&gt;non_eu_backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;us-collector.mycompany.com'&lt;/span&gt;
&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;traces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;routing/data_residency&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Outcome:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;EU data is routed exclusively to EU-compliant systems, supporting regional legal obligations. This architecture ensures that regulated data never leaves its permitted geographic boundary, keeping your observability setup aligned with legal and contractual obligations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example 4: Standardizing attributes for compliance audits&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In regulated industries, audit requirements often demand consistent telemetry formats. A compliance-aligned pipeline might start with receivers like &lt;strong&gt;prometheus&lt;/strong&gt;, &lt;strong&gt;otlp&lt;/strong&gt;, or &lt;strong&gt;filelog&lt;/strong&gt;, and pass the data through a &lt;strong&gt;transformprocessor&lt;/strong&gt; that renames fields. For instance, &lt;strong&gt;user_id&lt;/strong&gt; becomes &lt;strong&gt;user.id&lt;/strong&gt;, and &lt;strong&gt;txn_amount&lt;/strong&gt; becomes &lt;strong&gt;transaction.amount&lt;/strong&gt;. The processed data is then exported to a SIEM system or centralized log storage for long-term analysis. This kind of field normalization supports auditability and ensures that all downstream systems operate with uniform telemetry schemas, improving clarity and compliance readiness.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example Config:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;protocols&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;grpc&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;transform/standardize_fields&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;trace_statements&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;span&lt;/span&gt;
        &lt;span class="na"&gt;statements&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;rename(attributes["user_id"], "user.id")&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;rename(attributes["txn_amount"], "transaction.amount")&lt;/span&gt;
&lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;logging&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;loglevel&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;debug&lt;/span&gt;
&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;traces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;transform/standardize_fields&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;logging&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Outcome:&lt;/strong&gt;&lt;br&gt;
With consistent attribute names, you improve audit readiness and make logs easier to correlate.&lt;/p&gt;

&lt;p&gt;These examples show how easy it is to tailor your observability pipeline for compliance without sacrificing performance or visibility. By using the Collector as a policy engine, you ensure that compliance checks are built into your telemetry flow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Designing a Secure and Compliant Observability Architecture
&lt;/h2&gt;

&lt;p&gt;By now, it’s clear that securing telemetry data is not just about selecting the right tools. It involves designing the entire observability architecture with compliance in mind from the very beginning.&lt;/p&gt;

&lt;p&gt;To do this effectively, observability should be treated as a data supply chain. Each stage, from ingestion through processing to export, must actively enforce protections, not just pass data along.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Centralize Control&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The OpenTelemetry Collector sits at the center of a secure observability setup. It serves as the control point for managing ingestion, sanitization, transformation, routing, and export. This enables consistent enforcement of policies, regardless of where the data originates. If you need to redact PII before logs leave a Kubernetes cluster, route metrics from the EU to region-specific storage, or standardize trace data for audit readiness, the Collector is where those rules are applied.&lt;/p&gt;
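
&lt;p&gt;As a minimal sketch of such a rule (the attribute keys here are illustrative, not a standard), an &lt;strong&gt;attributes&lt;/strong&gt; processor can scrub PII before any exporter sees it:&lt;/p&gt;

```yaml
# Illustrative snippet: scrub PII attributes before export.
# Key names (user.email, user.ssn) are examples only.
processors:
  attributes/scrub_pii:
    actions:
      - key: user.email
        action: hash      # keep a stable correlation value, hide the raw address
      - key: user.ssn
        action: delete    # remove the attribute entirely
```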

&lt;p&gt;As observability grows, managing multiple Collector instances across environments can become complex. To manage this, you can adopt the &lt;a href="https://opentelemetry.io/docs/specs/opamp/" rel="noopener noreferrer"&gt;Open Agent Management Protocol (OpAMP)&lt;/a&gt;. OpAMP provides a standardized way to remotely manage OpenTelemetry Collectors at scale. It enables you to push configuration updates, monitor agent health, and enforce policy changes without logging into each node manually. It’s an essential addition for teams aiming to maintain observability governance while reducing operational overhead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Keep Processing and Export Logic Outside Application Code&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A frequent mistake is embedding telemetry logic directly within application code. This introduces risk, increases complexity, and makes enforcement inconsistent across services. A more secure approach moves that logic into centrally managed Collector configurations. This allows teams to update rules without deploying new code and gives compliance teams the ability to audit pipelines independently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Encrypt Telemetry in Transit and at Rest&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All telemetry data, including logs, metrics, and traces, should be encrypted while in transit and when stored. Use TLS to secure communication between agents and Collectors, and ensure encryption at rest is enabled in your observability backends such as OpenSearch, Datadog, or GCP.&lt;/p&gt;
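
&lt;p&gt;As a rough sketch (certificate paths are placeholders, and the exact keys can vary between Collector versions), TLS can be enabled on both the receiving and exporting sides of the Collector:&lt;/p&gt;

```yaml
# Illustrative TLS settings; file paths and endpoint are placeholders.
receivers:
  otlp:
    protocols:
      grpc:
        tls:
          cert_file: /etc/otel/certs/server.crt
          key_file: /etc/otel/certs/server.key
exporters:
  otlphttp:
    endpoint: https://backend.example.com:4318
    tls:
      ca_file: /etc/otel/certs/ca.crt
```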

&lt;p&gt;&lt;strong&gt;Avoid Overcollection and Excessive Retention&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Collecting or retaining more data than necessary increases your risk exposure. Implement filtering at the source and within the Collector to discard irrelevant data. Align retention policies with legal and compliance requirements to ensure that sensitive data is not kept longer than necessary.&lt;/p&gt;
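
&lt;p&gt;For example (a sketch assuming OTTL-style conditions in the filter processor; the route value is illustrative), health-check spans can be dropped inside the Collector before they are ever stored:&lt;/p&gt;

```yaml
# Drop telemetry nobody needs to retain, e.g. health-check spans.
processors:
  filter/drop_health_checks:
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.route"] == "/healthz"'
```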

&lt;p&gt;&lt;strong&gt;Enforce Separation of Duties&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not every team member needs access to all telemetry. Design the system to enforce access controls, both through infrastructure-level mechanisms like IAM or RBAC and within observability platforms using scoped dashboards or tenant-aware indexing. This limits access, reduces internal risk, and simplifies compliance audits.&lt;/p&gt;

&lt;h2&gt;
  
  
  Additional Layers of Data Protection Beyond Processors
&lt;/h2&gt;

&lt;p&gt;While OpenTelemetry processors play a critical role in securing and shaping telemetry data, they should be part of a broader data protection strategy. Ensuring compliance requires a layered approach that includes infrastructure-level security, backend configurations, and organizational access controls.&lt;/p&gt;

&lt;p&gt;Below are key layers that complement the processor-level protections:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. End-to-End Encryption&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Encryption must be enforced at every stage of telemetry flow. Use TLS for all communication between agents, collectors, and backend systems. Whether data is being transmitted over gRPC or HTTP, encrypted channels prevent interception and unauthorized access during transit.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Secure and Compliant Backends&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;After data is processed, it is stored or analyzed in backends such as OpenSearch, Google Cloud Logging, or Datadog. These systems must be configured to encrypt data at rest and enforce strict access controls. Ensure that backend permissions align with your organization's compliance policies.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. Role-Based Access Control (RBAC) and Principle of Least Privilege&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Limit access to telemetry data and configuration files using IAM or RBAC mechanisms. Each user or team should have access only to the data necessary for their responsibilities. This reduces the risk of accidental exposure and simplifies audit processes.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;4. Protected Configuration Management&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Treat OpenTelemetry configuration files as sensitive assets. Store them in secure, version-controlled repositories with restricted access. Use secrets management tools like HashiCorp Vault or GCP Secret Manager to inject credentials and tokens securely, instead of embedding them in plaintext within configuration files.&lt;/p&gt;
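
&lt;p&gt;For instance (a sketch; the variable name is illustrative), the Collector can read credentials from environment variables populated by your secrets manager instead of hardcoding them:&lt;/p&gt;

```yaml
# Token injected at runtime (e.g. from Vault or GCP Secret Manager),
# never committed to the config file itself.
exporters:
  otlphttp:
    endpoint: https://backend.example.com:4318
    headers:
      Authorization: "Bearer ${env:OTEL_BACKEND_TOKEN}"
```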

&lt;h3&gt;
  
  
  &lt;strong&gt;5. Routine Compliance Reviews and Audits&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Security and compliance are ongoing responsibilities. Schedule periodic reviews of telemetry pipelines, access controls, and retention policies. Auditing configurations and data flows regularly helps identify outdated settings, over-permissive access, or unintentional data leakage.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;6. Data Minimization Principles&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Collect only what is necessary. Overcollection not only adds noise but also increases the surface area for compliance risk. Apply filters early in the pipeline, remove legacy or redundant telemetry sources, and periodically reassess what is being collected across environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Build Trust Into Your Observability Stack
&lt;/h2&gt;

&lt;p&gt;Observability has come a long way, and today, building trust into it begins with intentional design. From deciding what to collect to how data is handled, OpenTelemetry offers the flexibility and control needed to embed security and compliance into every stage of the pipeline. By shaping telemetry as it flows, you enable teams to maintain visibility while reducing risk. I hope this article provides you with practical guidance to create observability pipelines that are not just effective, but also secure and compliant by design.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Top Metrics To Watch In Kubernetes</title>
      <dc:creator>Vibhuti Sharma</dc:creator>
      <pubDate>Tue, 13 May 2025 00:00:58 +0000</pubDate>
      <link>https://dev.to/vibhuti_sharma/top-metrics-to-watch-in-kubernetes-427k</link>
      <guid>https://dev.to/vibhuti_sharma/top-metrics-to-watch-in-kubernetes-427k</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frvw2zjdct1g7g3omz6br.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frvw2zjdct1g7g3omz6br.png" alt="Image description" width="800" height="323"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Introduction&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If you’ve ever found yourself knee-deep in a Kubernetes incident, watching a production microservice fail with mysterious 5xx errors, you know the drill: alerts are firing, dashboards are lit up like a Christmas tree, and your team is scrambling to make sense of a flood of metrics across every layer of the stack. It’s not a question of if this happens; it’s when.&lt;/p&gt;

&lt;p&gt;In that high-pressure moment, the true challenge isn’t just debugging; it’s knowing where to look. For seasoned SREs and technical founders who live and breathe Kubernetes, the ability to quickly zero in on the right signals can make the difference between a five-minute fix and a five-hour outage.&lt;/p&gt;

&lt;p&gt;So what are the metrics that actually move the needle? And how do you filter signal from noise when your platform is under fire?&lt;/p&gt;

&lt;p&gt;This article breaks down the critical Kubernetes metrics that every high-performing team should keep an eye on, before the next incident catches you off guard.&lt;/p&gt;

&lt;p&gt;If you don’t have a monitoring system in place, you’re already behind the curve. Kubernetes is a complex system with many moving parts, and without proper monitoring, you’re flying blind.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why Every Minute Counts in Kubernetes Outages&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When Kubernetes systems break, the impact isn’t just technical; it’s also financial, contractual, and reputational.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Real Cost of Downtime&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://blogs.cisco.com/cloud/the-cost-of-downtime-in-multicloud-it#:~:text=According%20to%20Gartner%2C%20the%20average,World%2C%20this%20gets%20even%20worse" rel="noopener noreferrer"&gt;According to Gartner&lt;/a&gt;, the average cost of IT downtime is $5,600 per minute, which adds up to over $330,000 per hour. Imagine this happening during peak traffic, a product launch, or a high-stakes client demo. The longer you spend guessing which part of the system failed, the more your business takes the hit. Often, it’s not even clear whether the issue lies in the network, storage, or application layer, leading to costly delays in diagnosis and resolution.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Tight SLAs &amp;amp; Tighter Repercussions&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;For teams managing Kubernetes clusters on behalf of clients, Service Level Agreements (SLAs) can feel like a sword hanging overhead. These agreements set strict limits on factors like downtime and error rates, and breaching them doesn’t just mean a few angry emails: it can lead to financial penalties, escalations, or even losing the client altogether. Without knowing which metrics reflect health and which signal red flags, such teams are always one step away from trouble.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Mean Time to Recovery&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The Mean Time to Recovery (MTTR) is a critical KPI for SRE and DevOps teams. It reflects how long it takes to detect, troubleshoot, and restore service after a failure. A low MTTR means your systems are resilient and your team is effective. But reducing MTTR is only possible if you’re looking at the right data when the incident hits, and that’s where the top Kubernetes metrics come in.&lt;/p&gt;

&lt;p&gt;That is exactly what this blog is here for. We will walk you through the most critical Kubernetes metrics to monitor, the ones that give you real insight into the health of your system, help reduce downtime, and improve your response during incidents. Whether you’re running a small dev cluster or a complex multi-tenant setup, this guide will help you prioritize the right signals.&lt;/p&gt;

&lt;h3&gt;
  
  
  Significance of The Four Golden Signals
&lt;/h3&gt;

&lt;p&gt;If you have spent any time in the world of monitoring or Site Reliability Engineering, you have probably come across the Four Golden Signals: Latency, Traffic, Errors, and Saturation. Originally popularized in the &lt;a href="https://sre.google/books/" rel="noopener noreferrer"&gt;Google SRE book, a comprehensive guide to site reliability&lt;/a&gt;, these signals remain the gold standard when it comes to what you should measure to understand your system’s health.&lt;/p&gt;

&lt;p&gt;Even in Kubernetes environments where complexity multiplies with microservices, dynamic scaling, and distributed components, &lt;a href="https://developer.cisco.com/articles/what-are-the-golden-signals/what-are-the-golden-signals-that-sre-teams-use-to-detect-issues/#what-are-the-golden-signals%20" rel="noopener noreferrer"&gt;The Four Golden Signals&lt;/a&gt; help you aim at the right targets. They map directly onto the Kubernetes metrics covered in this article, so understanding them leads to more focused monitoring and better results.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Latency helps you detect slowdowns even before users start complaining about them. Metrics like API server latency or HTTP request durations show where bottlenecks live.&lt;/li&gt;
&lt;li&gt;Traffic metrics (like request rate, network throughput) help you understand demand and stress levels across your system.&lt;/li&gt;
&lt;li&gt;Errors surface failing pods, HTTP 5xx rates, and crash loops; these are your early warning signs.&lt;/li&gt;
&lt;li&gt;Saturation tells you when you’re about to hit resource limits, whether it’s CPU, memory, or disk I/O on nodes and pods.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In a distributed system like Kubernetes, problems rarely announce themselves clearly. Golden Signals offer a language to interpret cluttered data, spot anomalies, and prioritize what truly needs fixing. Knowing how your app performs against these four dimensions makes your metrics strategy more focused, your alerts more meaningful, and your team more responsive.&lt;/p&gt;
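
&lt;p&gt;As a rough Prometheus sketch (assuming standard API server and node-exporter metric names are being scraped), the four signals might look like:&lt;/p&gt;

```promql
# Latency: p95 API server request duration
histogram_quantile(0.95, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le))
# Traffic: API requests per second
sum(rate(apiserver_request_total[5m]))
# Errors: share of 5xx responses
sum(rate(apiserver_request_total{code=~"5.."}[5m])) / sum(rate(apiserver_request_total[5m]))
# Saturation: cluster-wide memory utilization
1 - sum(node_memory_MemAvailable_bytes) / sum(node_memory_MemTotal_bytes)
```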

&lt;h3&gt;
  
  
  The RED &amp;amp; USE Methods
&lt;/h3&gt;

&lt;p&gt;Just like the Four Golden Signals, there are two other powerful frameworks that help teams make sense of their monitoring data: RED and USE. These methods offer a structured way to prioritize what to measure and where to look during troubleshooting. While Golden Signals give you a high-level overview of system health, RED and USE help you go deeper with intent, depending on whether you’re debugging an application-level issue or digging into infrastructure problems.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;RED Method For Applications &amp;amp; Services&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The RED method focuses on user-facing services and microservices, and is all about how your application is performing from a user’s perspective. It tracks three critical signals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requests per second (traffic)&lt;/li&gt;
&lt;li&gt;Errors per second (failures)&lt;/li&gt;
&lt;li&gt;Duration of requests (latency)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of RED as your first defense against a bad user experience. It closely aligns with the Four Golden Signals and is commonly visualized using pre-built RED dashboards in tools like Prometheus and Grafana. For a deeper dive, check out the official blog on &lt;a href="https://grafana.com/blog/2018/08/02/the-red-method-how-to-instrument-your-services/" rel="noopener noreferrer"&gt;The RED Method by Tom Wilkie&lt;/a&gt;&lt;/p&gt;
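
&lt;p&gt;A minimal RED sketch in PromQL, assuming your services expose the common &lt;em&gt;http_requests_total&lt;/em&gt; and &lt;em&gt;http_request_duration_seconds&lt;/em&gt; instrumentation (the &lt;em&gt;job&lt;/em&gt; label is illustrative):&lt;/p&gt;

```promql
# Rate: requests per second
sum(rate(http_requests_total{job="myapp"}[5m]))
# Errors: failing requests per second
sum(rate(http_requests_total{job="myapp", status=~"5.."}[5m]))
# Duration: p99 request latency
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="myapp"}[5m])) by (le))
```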

&lt;h4&gt;
  
  
  &lt;strong&gt;USE Method For Infrastructure &amp;amp; System Health&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The USE method is geared toward lower-level system resources such as nodes, disks, and network interfaces. It tracks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Utilization — How much of a resource is being used?&lt;/li&gt;
&lt;li&gt;Saturation — Is the resource at or near capacity?&lt;/li&gt;
&lt;li&gt;Errors — Are there any failures in the resource?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is especially useful when you’re debugging performance bottlenecks or checking node health in Kubernetes. For example, using the USE method, you might quickly spot a disk I/O bottleneck or excessive memory pressure on a node.&lt;/p&gt;
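
&lt;p&gt;A USE-style sketch for a node, using common node-exporter metric names:&lt;/p&gt;

```promql
# Utilization: fraction of CPU busy per node
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
# Saturation: time spent doing disk I/O
rate(node_disk_io_time_seconds_total[5m])
# Errors: NIC receive errors
rate(node_network_receive_errs_total[5m])
```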

&lt;p&gt;RED and USE complement each other and help you design focused dashboards, meaningful alerts, and faster incident response workflows. For a deeper dive, check out the official blog on &lt;a href="https://www.brendangregg.com/usemethod.html" rel="noopener noreferrer"&gt;The USE Method by Brendan Gregg&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Layers of Kubernetes Monitoring
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6su31wdq1jbl1zyqnyes.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6su31wdq1jbl1zyqnyes.png" alt="Image description" width="800" height="1142"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Before we go deeper into metrics, it is important to understand where they come from. Kubernetes is a layered system, and each layer gives its own signals. If you want complete observability, you need to collect metrics from every layer.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Cluster Layer&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;This is the big-picture view. At this level, you track overall cluster health, how many nodes are active, how many are unschedulable, how many pods are in a crash loop, or if your autoscaler is working as expected. Metrics from the Kube Controller Manager, Cloud Controller, and Cluster Autoscaler belong here.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Control Plane&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;This is the brain of a cluster. Components like the API server, scheduler, and ETCD are responsible for making everything work. Metrics from this layer help you answer questions like “Is the scheduler under pressure?”, “Is the ETCD server healthy and responding on time?”, “Are API requests getting throttled?”.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Nodes&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;These are the worker machines (virtual or physical) that run your workloads. Key node-level metrics include CPU, memory, disk I/O, and network throughput. If nodes are overloaded, your pods will suffer even if your app code is flawless.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Pods &amp;amp; Containers&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;This is the execution layer. Monitoring pod status, container restarts, resource requests/limits, and OOM (Out of Memory) kills can quickly tell you if your workloads are running as expected or if they’re crashing silently in the background.&lt;/p&gt;
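
&lt;p&gt;As one way to surface silent OOM kills (a sketch that assumes &lt;em&gt;jq&lt;/em&gt; is installed):&lt;/p&gt;

```shell
# List pods whose containers were last terminated by the OOM killer
kubectl get pods -A -o json \
  | jq -r '.items[]
      | select(.status.containerStatuses[]?.lastState.terminated.reason == "OOMKilled")
      | "\(.metadata.namespace)/\(.metadata.name)"'
```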

&lt;h4&gt;
  
  
  &lt;strong&gt;Applications&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Finally, we reach the business logic that is the code you deploy. Application-level metrics include request latency, error rates, throughput, and custom business KPIs. These metrics help tie technical issues to user-facing problems, which is especially important when debugging customer-impacting incidents.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kubernetes Metrics That Matter The Most
&lt;/h3&gt;

&lt;p&gt;Once you understand the layers of observability in Kubernetes, the next step is knowing what to look at in each layer. Not all metrics are created equal; some help you react quickly, while others help you prevent issues entirely. Here are the top metrics across each layer that monitoring teams should track.&lt;/p&gt;

&lt;h4&gt;
  
  
  Cluster-Level Metrics
&lt;/h4&gt;

&lt;p&gt;Let’s say your Kubernetes cluster is experiencing performance issues; maybe workloads are failing to schedule, pods are restarting, or users are complaining about latency. Instead of jumping into individual pod metrics, let’s start from the top. Here’s a practical flow to investigate issues at the cluster level and narrow down potential root causes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Confirm the Symptoms at Scale&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Start with basic observations. Are these problems isolated or affecting the entire cluster?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Check the number of unschedulable pods:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pods --all-namespaces --field-selector=status.phase=Pending
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAMESPACE     NAME                                READY   STATUS    RESTARTS   AGE
default       myapp-frontend-7d8f9c6d8b-abcde     0/1     Pending   0          2m
kube-system   coredns-558bd4d5db-xyz12            0/1     Pending   0          1m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The pods above are stuck in a Pending state. To determine the root cause, investigate further. Look for frequent restarts:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pods --all-namespaces | grep -v '0/' | grep 'CrashLoopBackOff\|Error'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAMESPACE   NAME                             READY   STATUS             RESTARTS   AGE
default     api-server-5d9f8f6d8b-xyz12      0/1     CrashLoopBackOff   5          10m
default     db-service-7d8f9c6d8b-abcde      0/1     Error              3          8m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Are nodes under pressure?
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl top nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME     CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node-1   1800m        75%    7000Mi          85%
node-2   1500m        80%    6500Mi          90%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see the nodes are under pressure. If multiple namespaces or workloads are impacted, it’s likely a cluster-level issue, not just an app problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Assess Node Health and Availability&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Node problems ripple across the entire cluster. Let’s check how many are healthy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME     STATUS     ROLES   AGE   VERSION
node-1   Ready      worker  10d   v1.25.0
node-2   NotReady   worker  10d   v1.25.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Watch out for nodes in a NotReady or Unknown state; these can cause workload evictions, failed scheduling, and data plane failures. If some nodes are out, look at recent cluster events:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get events --sort-by=.lastTimestamp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LAST SEEN   TYPE      REASON              OBJECT                     MESSAGE
2m          Warning   NodeNotReady        node/node-2                Node is not ready
1m          Warning   FailedScheduling    pod/myapp-frontend-xyz12   0/2 nodes are available: 1 node(s) were not ready
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pay attention to messages like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;NodeNotReady&lt;/li&gt;
&lt;li&gt;FailedScheduling&lt;/li&gt;
&lt;li&gt;ContainerGCFailed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Detect Resource Bottlenecks&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Even if nodes are “Ready,” they might not have capacity. Check CPU and memory pressure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl describe nodes | grep -A5 "Conditions:"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Conditions:
  Type             Status   LastHeartbeatTime       Reason
  MemoryPressure   True     2025-05-06T16:00:00Z     KubeletHasInsufficientMemory
  DiskPressure     False    2025-05-06T16:00:00Z     KubeletHasSufficientDisk
  PIDPressure      False    2025-05-06T16:00:00Z     KubeletHasSufficientPID
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look for MemoryPressure, DiskPressure, or PIDPressure.&lt;/p&gt;

&lt;p&gt;Check what resources the scheduler sees as allocatable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl describe node node-name | grep -A10 "Allocatable"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Allocatable:
  cpu: 2000m
  memory: 8192Mi
  pods: 110
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If everything looks maxed out, your cluster may be underprovisioned; it’s time to scale nodes or clean up unused resources.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Investigate Networking or DNS Issues&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Latency complaints and failing pod readiness probes often come down to network problems.&lt;/p&gt;

&lt;p&gt;Use your Prometheus dashboards to check for network errors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;rate(container_network_receive_errors_total[5m])&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Check for CoreDNS issues:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl logs -n kube-system -l k8s-app=kube-dns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.:53
2025/05/06 16:05:00 [INFO] CoreDNS-1.8.0
2025/05/06 16:05:00 [INFO] plugin/reload: Running configuration MD5 = 1a2b3c4d5e6f
2025/05/06 16:05:00 [ERROR] plugin/errors: 2 123.45.67.89:12345 - 0000 /etc/resolv.conf: dial tcp: lookup example.com on 10.96.0.10:53: server misbehaving
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Spot dropped packets or erratic latencies in inter-pod communication.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connect the Dots&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now correlate your findings. Ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Are failing pods being scheduled on overloaded or failing nodes?&lt;/li&gt;
&lt;li&gt;Are pods restarting due to OOMKills or image pull issues?&lt;/li&gt;
&lt;li&gt;Do networking or DNS failures match the timing of user complaints?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By this point, a pattern should emerge, and you should be able to narrow the cause down to one of these reasons. From the example outputs above, this is most likely a cluster-level problem caused by an over-utilized or partially unavailable node.&lt;/p&gt;

&lt;h4&gt;
  
  
  Control Plane Metrics
&lt;/h4&gt;

&lt;p&gt;Let’s say you’ve ruled out node failures and cluster resource issues, but your workloads are still acting strange. Pods remain in Pending for too long, deployments aren’t progressing, and even basic kubectl commands feel sluggish.&lt;/p&gt;

&lt;p&gt;That’s your signal that the control plane might be the bottleneck. Here is how to troubleshoot Kubernetes control plane health using metrics, and trace the problem back to its source.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gauge API Server Responsiveness&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The API server is the front door to your cluster. If it’s slow, everything slows down: kubectl, CI/CD pipelines, controllers, autoscalers.&lt;/p&gt;

&lt;p&gt;Check API server latency:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;histogram_quantile(0.95, rate(apiserver_request_duration_seconds_bucket[5m]))&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A spike here means users and internal components are all experiencing degraded interactions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Look for API Server Errors&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Latency might be caused by underlying failures, especially from ETCD, which backs all API state.&lt;/p&gt;

&lt;p&gt;Check for 5xx errors from the API server:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;rate(apiserver_request_total{code=~"5.."}[5m])&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A sustained increase could mean:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ETCD is overloaded or unhealthy&lt;/li&gt;
&lt;li&gt;API server is under too much load&lt;/li&gt;
&lt;li&gt;Network/storage latency is impacting ETCD reads/writes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If error rates correlate with latency spikes, check ETCD performance next.&lt;/p&gt;
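
&lt;p&gt;Two useful starting points in PromQL, assuming ETCD and API server metrics are being scraped:&lt;/p&gt;

```promql
# ETCD disk health: p99 WAL fsync latency
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))
# API server's view: latency of requests to ETCD, by operation
histogram_quantile(0.99, sum(rate(etcd_request_duration_seconds_bucket[5m])) by (le, operation))
```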

&lt;p&gt;&lt;strong&gt;Investigate Scheduler Delays&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Maybe your pods are Pending and not getting scheduled even though nodes look healthy. This could be a scheduler problem, not a resource issue.&lt;/p&gt;

&lt;p&gt;Check how long the scheduler is taking to place pods:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;histogram_quantile(0.95, rate(scheduler_scheduling_duration_seconds_bucket[5m]))&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;High values here mean the scheduler is overwhelmed, blocked, or crashing.&lt;/p&gt;

&lt;p&gt;Correlate this with pod age:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pods --all-namespaces --sort-by=.status.startTime
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAMESPACE     NAME                                      READY   STATUS    RESTARTS   AGE
default       myapp-frontend-7d8f9c6d8b-abcde           0/1     Pending   0          18m
default       api-server-5d9f8f6d8b-xyz12               0/1     CrashLoopBackOff  5  20m
default       db-service-7d8f9c6d8b-def45               0/1     Error     3          19m
kube-system   coredns-558bd4d5db-xyz12                  0/1     Pending   0          21m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;New pods have been Pending for over 15 minutes, suggesting the scheduler is delayed and the API server isn’t responding fast enough to resource or binding requests. If new pods sit in Pending this long, this is your bottleneck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitor Controller Workqueues&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Controller Manager keeps the desired state in sync: scaling replicas, rolling updates, service endpoints, and so on. If it’s backed up, changes won’t propagate.&lt;/p&gt;

&lt;p&gt;Look at the depth of workqueues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;sum(workqueue_depth{name=~".+"})&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most Kubernetes controllers are designed to quickly process items in their workqueues. A queue depth of 0–5 is generally normal and healthy. It means the controller is keeping up. Short spikes (up to ~10–20) can occur during events like rolling updates or scaling, and are usually harmless if they drop quickly. Start investigating if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;workqueue_depth stays above 50–100 consistently&lt;/li&gt;
&lt;li&gt;workqueue_adds_total keeps rising rapidly&lt;/li&gt;
&lt;li&gt;workqueue_work_duration_seconds shows long processing times&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These symptoms suggest the controller is backed up, leading to delays in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rolling out deployments&lt;/li&gt;
&lt;li&gt;Updating service endpoints&lt;/li&gt;
&lt;li&gt;Reconciling desired vs. actual state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also check:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;sum(workqueue_adds_total)&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;avg(workqueue_work_duration_seconds)&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Spikes here mean your controllers are overloaded, possibly due to a flood of changes or downstream API slowdowns.&lt;/p&gt;
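&lt;p&gt;The rules of thumb above can be folded into a small health check. A hedged Python sketch, with thresholds mirroring the ranges discussed (they are guidelines, not official values):&lt;/p&gt;

```python
def workqueue_status(depth_samples, sustained_threshold=50):
    """Classify a controller workqueue from a window of depth samples.
    'backed-up' if depth stays above the sustained threshold for the
    whole window; 'spiking' for short excursions past ~20; otherwise
    'healthy'. Thresholds follow the rules of thumb above."""
    if all(d > sustained_threshold for d in depth_samples):
        return "backed-up"
    if max(depth_samples) > 20:
        return "spiking"
    return "healthy"

print(workqueue_status([2, 4, 1, 3]))       # healthy
print(workqueue_status([80, 95, 120, 77]))  # backed-up
print(workqueue_status([3, 25, 4, 2]))      # spiking
```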

&lt;p&gt;&lt;strong&gt;Pull it All Together&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;From the example outputs above, we can conclude that the issue is ETCD and API server latency, which causes cascading delays in the control plane:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scheduler can’t assign pods quickly due to slow API server.&lt;/li&gt;
&lt;li&gt;Controller Manager queues are backing up as desired state changes (like ReplicaSet creations) take too long to commit.&lt;/li&gt;
&lt;li&gt;kubectl and system components (like CoreDNS or autoscalers) are affected by poor responsiveness from the API server, which relies on ETCD.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More generally, this combination of symptoms points to systemic control plane stress:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High API latency&lt;/li&gt;
&lt;li&gt;Elevated 5xx errors&lt;/li&gt;
&lt;li&gt;Scheduler latency spikes&lt;/li&gt;
&lt;li&gt;Controller queues backed up&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When control plane metrics go bad, symptoms ripple through the whole system. Tracking these metrics as a cohesive unit helps you catch early signals before workloads break.&lt;/p&gt;

&lt;h4&gt;
  
  
  Node-Level Metrics: Digging into the Machine Layer
&lt;/h4&gt;

&lt;p&gt;If control plane metrics look healthy but problems persist (pods getting OOMKilled, apps slowing down, workloads behaving inconsistently), it’s time to inspect the nodes themselves. These are the machines that run your actual workloads. Here’s how to walk through node-level metrics to find the culprit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Identify Which Nodes Are Affected&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Start by getting a quick snapshot of node health:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME      STATUS     ROLES    AGE   VERSION
node-1    Ready      worker   10d   v1.25.0
node-2    NotReady   worker   10d   v1.25.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look for any nodes not in Ready state. If nodes are marked NotReady, Unknown, or SchedulingDisabled, that's your first signal. Then describe them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl describe node node-2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Conditions:
  Type             Status  LastHeartbeatTime            Reason
  MemoryPressure   False   2025-05-06T16:00:00Z         KubeletHasSufficientMemory
  DiskPressure     True    2025-05-06T16:00:00Z         KubeletHasDiskPressure
  PIDPressure      False   2025-05-06T16:00:00Z         KubeletHasSufficientPID

Taints:
  node.kubernetes.io/disk-pressure:NoSchedule
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Disk pressure is explicitly reported, which is likely the source of the pod issues. Focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Conditions: Look for MemoryPressure, DiskPressure, or PIDPressure&lt;/li&gt;
&lt;li&gt;Taints: Check if workloads are being prevented from scheduling&lt;/li&gt;
&lt;/ul&gt;
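&lt;p&gt;This check is easy to script. A minimal Python sketch that flags pressure conditions from &lt;em&gt;kubectl get node node-2 -o json&lt;/em&gt; output (field names follow the Kubernetes API; the sample below is trimmed to match the describe output above):&lt;/p&gt;

```python
import json

def pressure_conditions(node_json):
    """Return the pressure conditions currently True on a node,
    given the parsed JSON from `kubectl get node node-2 -o json`."""
    flagged = []
    for cond in node_json["status"]["conditions"]:
        if cond["type"].endswith("Pressure") and cond["status"] == "True":
            flagged.append(cond["type"])
    return flagged

# Trimmed sample mirroring the node-2 output above:
sample = json.loads("""{
  "status": {"conditions": [
    {"type": "MemoryPressure", "status": "False"},
    {"type": "DiskPressure",   "status": "True"},
    {"type": "PIDPressure",    "status": "False"}
  ]}
}""")
print(pressure_conditions(sample))  # ['DiskPressure']
```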

&lt;p&gt;&lt;strong&gt;Check Resource Saturation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If nodes are Ready but workloads are misbehaving, they might just be under pressure. Get real-time usage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl top nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME     CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node-1   1200m        60%    6000Mi          70%
node-2   800m         40%    5800Mi          68%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Based on the example output, CPU and memory are normal, so the disk is likely the bottleneck. In general, look for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High CPU%: Indicates throttling&lt;/li&gt;
&lt;li&gt;High Memory%: Can cause OOMKills or evictions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If a node is maxed out, describe the pods on it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pods --all-namespaces -o wide | grep node-2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;default       api-cache-678d456b7b-xyz11       0/1   Evicted           0     10m   node-2
default       order-db-7c9b5d49f-vx12c         0/1   Error             2     15m   node-2
default       analytics-app-67d945c78c-qwe78   0/1   CrashLoopBackOff  4     12m   node-2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Identify noisy neighbors or pods consuming abnormal resources. Here, multiple failing pods and evictions suggest disk-pressure-driven pod disruption.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Investigate Frequent Pod Restarts or Evictions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Pods restarting or getting evicted? Check the reason:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pod pod-name -n namespace -o jsonpath="{.status.containerStatuses[*].lastState}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{"terminated":{"reason":"Evicted","message":"The node was low on disk."}}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Common reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OOMKilled: memory overuse&lt;/li&gt;
&lt;li&gt;Evicted: node pressure (memory, disk, or PID)&lt;/li&gt;
&lt;li&gt;CrashLoopBackOff: instability in app or runtime&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then verify which node they were running on; repeated issues from the same node point to a node-level problem.&lt;/p&gt;
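&lt;p&gt;The mapping from termination reason to likely cause can be sketched in Python (the categories mirror the list above; the helper name is ours):&lt;/p&gt;

```python
def classify_restart(last_state):
    """Map a container's lastState (as returned by the jsonpath query
    above) to a likely cause, following the common reasons listed."""
    reason = last_state.get("terminated", {}).get("reason", "")
    return {
        "OOMKilled": "memory overuse",
        "Evicted": "node pressure (memory, disk, or PID)",
    }.get(reason, "app or runtime instability")

evicted = {"terminated": {"reason": "Evicted",
                          "message": "The node was low on disk."}}
print(classify_restart(evicted))  # node pressure (memory, disk, or PID)
```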

&lt;p&gt;&lt;strong&gt;Check Disk and Network Health&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Some failures are subtler: slow apps, stuck I/O, DNS errors. These often come from disk or network bottlenecks.&lt;/p&gt;

&lt;p&gt;Use your Prometheus dashboard to check:&lt;/p&gt;

&lt;p&gt;Disk I/O:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;rate(node_disk_reads_completed_total[5m])&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Network errors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;rate(node_network_receive_errs_total[5m])&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;rate(node_network_transmit_errs_total[5m])&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These can indicate bad NICs, over-saturated interfaces, or DNS resolution failures affecting pods on that node. If Prometheus isn’t available, SSH into the node and use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;iostat -xz 1 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;evice:         rrqm/s wrqm/s r/s   w/s  rkB/s  wkB/s avgrq-sz avgqu-sz await  svctm  %util
nvme0n1         0.00   12.00  50.00 250.00 1024.00 8192.00 60.00    8.50     30.00 1.00   99.90
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dmesg | grep -i error
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ 10452.661212] blk_update_request: I/O error, dev nvme0n1, sector 768
[ 10452.661217] EXT4-fs error (device nvme0n1): ext4_find_entry:1463: inode #131072: comm kubelet: reading directory lblock 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look for high I/O wait, dropped packets, or NIC errors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Review Node Stability &amp;amp; Uptime&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sometimes the issue is churn; nodes going up/down too frequently due to reboots or cloud spot termination.&lt;/p&gt;

&lt;p&gt;Check uptime:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;uptime
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; 16:15:03 up 2 days, 2:44, 1 user, load average: 5.12, 4.98, 3.80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or with Prometheus:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;node_time_seconds - node_boot_time_seconds&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Frequent reboots suggest infrastructure problems or autoscaler misbehavior. If it’s spot nodes, review instance interruption rates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Correlate and Isolate&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this example case, node-2 is experiencing disk I/O congestion, confirmed by DiskPressure, pod evictions due to low disk, iostat metrics showing 99%+ utilization and 30ms I/O latency, and kernel logs showing read errors. This node is the root cause of pod disruptions and degraded application behavior. However, other factors can present similarly. Let’s say you find:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One node has 90%+ memory usage&lt;/li&gt;
&lt;li&gt;That node also shows disk IO spikes and network errors&lt;/li&gt;
&lt;li&gt;Most failing pods are running on that node&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When those signals converge on a single machine, that node is almost certainly your culprit. Node-level issues are often the hidden root of noisy, hard-to-trace application problems. Always include node health in your diagnostic workflow, even when app logs seem to tell a different story.&lt;/p&gt;

&lt;h4&gt;
  
  
  Pod &amp;amp; Deployment-Level Issues (RED Metrics)
&lt;/h4&gt;

&lt;p&gt;If node-level metrics look healthy but problems persist (some pods are slow, users are getting errors, latency seems off), it’s time to check what is wrong at the pod or deployment level. Here’s how to tackle it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spot the Symptoms&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Start by identifying which services or deployments are affected. Are users reporting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Slow API responses?&lt;/li&gt;
&lt;li&gt;Errors in requests?&lt;/li&gt;
&lt;li&gt;Timeouts?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Correlate with actual service/pod behavior using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pods -A
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAMESPACE     NAME                                READY   STATUS             RESTARTS   AGE
default       auth-api-7f8b45dd8f-abc12            0/1     CrashLoopBackOff   5          10m
default       auth-api-7f8b45dd8f-xyz89            0/1     CrashLoopBackOff   5          10m
default       payment-api-6f9c7f9b44-123qw         1/1     Running            0          20m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look for pods in CrashLoopBackOff, Pending, or Error states. Here, for example, the auth-api pods are failing, which means something is wrong with that deployment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check the Request Rate&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This tells you if the service is even receiving traffic, and whether it suddenly dropped.&lt;/p&gt;

&lt;p&gt;If you’re using Prometheus + instrumentation (e.g., HTTP handlers exporting metrics):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;rate(http_requests_total[5m])&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A sharp drop in traffic might mean the pod isn’t even reachable due to readiness/liveness issues or a misconfigured ingress.&lt;/p&gt;

&lt;p&gt;Also check the load balancer/ingress controller logs (e.g., NGINX, Istio) for clues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check the Error Rate&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This reveals if pods are throwing 5xx or 4xx errors, a sign of broken internal logic or downstream service failures.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;rate(http_requests_total{status=~"5.."}[5m])&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also inspect the pods:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl logs pod-name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error: Missing required environment variable DATABASE_URL
    at config.js:12:15
    at bootstrapApp (/app/index.js:34:5)
    ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look for exceptions, failed database calls, or panics. Here we see the pod is crashing due to missing DATABASE_URL, which might be a config issue during deployment.&lt;/p&gt;

&lt;p&gt;Use kubectl describe pod for events like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Failing readiness/liveness probes&lt;/li&gt;
&lt;li&gt;Container crashes&lt;/li&gt;
&lt;li&gt;Volume mount errors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Warning  Unhealthy  2m (x5 over 10m)   kubelet            Readiness probe failed: HTTP probe failed with statuscode: 500
  Warning  BackOff    2m (x10 over 10m)  kubelet            Back-off restarting failed container
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Check the Request Duration (Latency)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;High latency with no errors means something is &lt;em&gt;slow&lt;/em&gt;, not broken.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;
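&lt;p&gt;To build intuition for what &lt;em&gt;histogram_quantile&lt;/em&gt; computes, here is a rough Python sketch of the same idea: linear interpolation inside the first cumulative bucket that covers the requested quantile (simplified relative to Prometheus’s actual implementation):&lt;/p&gt;

```python
def histogram_quantile(q, buckets):
    """Rough sketch of PromQL's histogram_quantile: linearly
    interpolate inside the first bucket whose cumulative count
    covers the requested quantile.
    buckets: sorted (upper_bound_seconds, cumulative_count) pairs."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Illustrative request-duration buckets: 100 requests total,
# 90 of them completed in under 0.5s.
print(histogram_quantile(0.95, [(0.1, 60), (0.5, 90), (1.0, 100)]))  # 0.75
```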

&lt;p&gt;If request durations spike:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Check if dependent services (e.g., database, Redis) are under pressure&lt;/li&gt;
&lt;li&gt;Use tracing tools (e.g., Jaeger, OpenTelemetry) if set up&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Look at CPU throttling with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl top pod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME                             CPU(cores)   MEMORY(bytes)
auth-api-7f8b45dd8f-abc12        15m          128Mi
payment-api-6f9c7f9b44-123qw     80m          200Mi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the scenario we considered, there are no resource throttling or usage issues; the crash is logic-related, not pressure-related. You can also check CPU throttling in Prometheus:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;rate(container_cpu_cfs_throttled_seconds_total[5m])&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Correlate with Deployment Events&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sometimes your pods are healthy but something changed in the deployment process (bad rollout, config error).&lt;/p&gt;

&lt;p&gt;Check rollout history:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl rollout history deployment deployment-name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;deployment.apps/auth-api
REVISION  CHANGE-CAUSE
1         Initial deployment
2         Misconfigured env var for DATABASE_URL
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See if a new revision broke things. If yes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl rollout undo deployment auth-api
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;deployment.apps/auth-api rolled back
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also review deployment description for more information:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl describe deployment auth-api
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Name:                   auth-api
Namespace:              default
Replicas:               2 desired | 0 updated | 0 available | 2 unavailable
StrategyType:           RollingUpdate
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Progressing    False   ProgressDeadlineExceeded
  Available      False   MinimumReplicasUnavailable

Environment:
  DATABASE_URL:  &amp;lt;unset&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From the output, check:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Were all replicas successfully scheduled?&lt;/li&gt;
&lt;li&gt;Did resource limits or readiness probes cause issues?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Spot Trends in Replica Behavior&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you suspect scaling problems (e.g., not enough replicas to handle load):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;sum(kube_deployment_spec_replicas) by (deployment)&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;sum(kube_deployment_status_replicas_available) by (deployment)&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A mismatch between these indicates rollout issues, pod crashes, or scheduling failures.&lt;/p&gt;
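&lt;p&gt;Comparing the two series programmatically is straightforward. A minimal sketch (the function name and data shapes are ours, standing in for the results of the two queries above):&lt;/p&gt;

```python
def rollout_gaps(spec_replicas, available_replicas):
    """Compare desired vs. available replicas per deployment and
    return only the deployments with a shortfall.
    Inputs are dicts of deployment name to replica count, standing in
    for the kube_deployment_* query results above."""
    return {name: spec_replicas[name] - available_replicas.get(name, 0)
            for name in spec_replicas
            if spec_replicas[name] != available_replicas.get(name, 0)}

desired = {"auth-api": 2, "payment-api": 3}
available = {"auth-api": 0, "payment-api": 3}
print(rollout_gaps(desired, available))  # {'auth-api': 2}
```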

&lt;p&gt;&lt;strong&gt;Final Diagnosis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By following this flow, you’ll isolate whether your pods are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unavailable&lt;/strong&gt; (readiness or probe issues)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Throwing errors&lt;/strong&gt; (broken logic, bad config)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slow&lt;/strong&gt; (upstream delays, resource throttling)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Or unstable&lt;/strong&gt; (bad rollout, crashing containers)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Troubleshooting Application-Level Issues
&lt;/h4&gt;

&lt;p&gt;If pods are running fine, nodes are healthy, and there are no deployment issues, but users are still complaining, then something could be wrong in the app itself. At this stage the cluster looks fine, so it’s likely an internal app logic, dependency, or performance issue. Here’s how to troubleshoot it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trace the Symptoms from the Top&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What are users actually experiencing?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is a specific endpoint slow?&lt;/li&gt;
&lt;li&gt;Is authentication failing?&lt;/li&gt;
&lt;li&gt;Are pages timing out intermittently?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Start by querying RED metrics from your app’s own observability (assuming it’s instrumented with Prometheus, OpenTelemetry, etc.):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Request rate per endpoint:&lt;/strong&gt; &lt;em&gt;rate(http_requests_total{job="your-app"}[5m])&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error rate (e.g., 4xx/5xx):&lt;/strong&gt; &lt;em&gt;rate(http_requests_total{status=~"5.."}[5m])&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency distribution:&lt;/strong&gt; &lt;em&gt;histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job="your-app"}[5m]))&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This will quickly show which part of your app is misbehaving.&lt;/p&gt;
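&lt;p&gt;Those three signals can be condensed into a per-endpoint verdict. A hedged sketch with illustrative thresholds (the error budget and latency SLO values here are made up, not standards):&lt;/p&gt;

```python
def red_summary(requests_per_sec, errors_per_sec, p95_latency_s,
                error_budget=0.01, latency_slo_s=0.5):
    """Condense RED metrics for one endpoint into a verdict.
    Thresholds are illustrative defaults, not standard values."""
    error_ratio = errors_per_sec / requests_per_sec if requests_per_sec else 0
    issues = []
    if error_ratio > error_budget:
        issues.append("error rate above budget")
    if p95_latency_s > latency_slo_s:
        issues.append("p95 latency above SLO")
    return issues or ["healthy"]

print(red_summary(200, 12, 0.9))  # both thresholds breached
print(red_summary(200, 0, 0.1))   # ['healthy']
```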

&lt;p&gt;&lt;strong&gt;Use Traces to Follow the Journey&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If metrics are the “what”, traces are the “why.”&lt;/p&gt;

&lt;p&gt;Use tracing (Jaeger, Tempo, or OpenTelemetry backends) to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Trace slow or failed requests&lt;/li&gt;
&lt;li&gt;Identify downstream service delays (e.g., DB, external APIs)&lt;/li&gt;
&lt;li&gt;Measure time spent in each span&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Look for patterns like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Long DB query spans&lt;/li&gt;
&lt;li&gt;Retries or timeouts from third-party APIs&lt;/li&gt;
&lt;li&gt;Deadlocks or slow code paths&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Profile Resource-Intensive Paths&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sometimes, the issue is an internal performance bug like memory leaks, CPU spikes, or thread contention.&lt;/p&gt;

&lt;p&gt;Use profiling tools like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pyroscope, Parca, Go pprof, or Node.js Inspector&lt;/li&gt;
&lt;li&gt;Flame graphs to visualize CPU/memory hotspots&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Check Dependencies &amp;amp; DB Metrics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your app might be healthy, but its dependencies might not be.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Is the database under pressure:&lt;/strong&gt; &lt;em&gt;rate(mysql_global_status_threads_running[5m])&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Are Redis queries timing out:&lt;/strong&gt; &lt;em&gt;rate(redis_commands_duration_seconds_bucket[5m])&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Are queue workers backed up:&lt;/strong&gt; &lt;em&gt;sum(rabbitmq_queue_messages_ready)&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also watch for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Connection pool exhaustion&lt;/li&gt;
&lt;li&gt;Slow queries&lt;/li&gt;
&lt;li&gt;Locks or deadlocks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even subtle latency in DB or cache can bubble up as app slowdowns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;External Services or 3rd Party APIs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Check whether your app relies on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Payment gateways&lt;/li&gt;
&lt;li&gt;Auth providers (like OAuth)&lt;/li&gt;
&lt;li&gt;External APIs (e.g., geolocation, email, analytics)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use Prometheus metrics or custom app logs to track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Latency of external calls&lt;/li&gt;
&lt;li&gt;Error rates (timeouts, HTTP 503s)&lt;/li&gt;
&lt;li&gt;Retry storms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Add circuit breakers or timeouts to avoid cascading failures.&lt;/p&gt;
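&lt;p&gt;A circuit breaker can be as simple as a failure counter with a cooldown. A minimal sketch, not tied to any particular library’s API:&lt;/p&gt;

```python
import time

class CircuitBreaker:
    """Minimal circuit-breaker sketch: after max_failures consecutive
    failures, calls are rejected for cooldown seconds instead of
    hammering a struggling dependency."""

    def __init__(self, max_failures=3, cooldown=30.0):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            elapsed = time.monotonic() - self.opened_at
            if elapsed >= self.cooldown:
                # Half-open: allow one trial request through.
                self.opened_at = None
                self.failures = 0
            else:
                raise RuntimeError("circuit open, request rejected")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

&lt;p&gt;Real deployments usually reach for a battle-tested implementation (a service mesh policy or a resilience library) rather than rolling their own, but the state machine is the same.&lt;/p&gt;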

&lt;p&gt;&lt;strong&gt;Validate Configuration &amp;amp; Feature Flags&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sometimes the issue is &lt;strong&gt;human&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Was a feature flag turned on for everyone?&lt;/li&gt;
&lt;li&gt;Did a bad config rollout silently break behavior?&lt;/li&gt;
&lt;li&gt;Was a critical env var left empty?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Review:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl describe deployment your-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Name: your-app
Namespace: default
CreationTimestamp: Mon, 06 May 2025 14:21:52 +0000
Labels: app=your-app
Selector: app=your-app
Replicas: 3 desired | 3 updated | 3 total | 3 available | 0 unavailable
StrategyType: RollingUpdate
Conditions:
  Type Status Reason
  ---- ------ ------
  Available True MinimumReplicasAvailable
  Progressing True NewReplicaSetAvailable

Pod Template:
  Containers:
   your-app:
    Image: ghcr.io/your-org/your-app:2025.05.06
    Port: 8080/TCP
    Environment:
      FEATURE_BACKGROUND_REINDEXING: "true"
      DATABASE_URL: "postgres://db.svc.cluster.local"
    Mounts:
      /etc/config from config-volume (ro)
      /etc/secrets from secret-volume (ro)
Volumes:
  config-volume:
    ConfigMapName: app-config
  secret-volume:
    SecretName: db-secret
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check env vars, ConfigMaps, and secret mounts. Also audit Git or your config source of truth. In the example output above, all pods are healthy and the rollout was successful, but the environment variable FEATURE_BACKGROUND_REINDEXING is enabled, likely triggering background operations that were not meant for production and causing performance regressions.&lt;/p&gt;
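&lt;p&gt;Audits like this can be automated. A minimal sketch that flags feature-style env vars not approved for the environment (the FEATURE_ naming convention is an assumption borrowed from the example above):&lt;/p&gt;

```python
def unexpected_flags(env, allowed_flags):
    """Return feature-flag env vars that are enabled but not on the
    approved list. Assumes flags follow a FEATURE_ naming convention,
    as in the example deployment above."""
    return [k for k, v in env.items()
            if k.startswith("FEATURE_") and v == "true"
            and k not in allowed_flags]

env = {"FEATURE_BACKGROUND_REINDEXING": "true",
       "DATABASE_URL": "postgres://db.svc.cluster.local"}
print(unexpected_flags(env, allowed_flags=set()))
# ['FEATURE_BACKGROUND_REINDEXING']
```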

&lt;p&gt;&lt;strong&gt;Final Diagnosis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you’ve ruled out infrastructure and Kubernetes mechanics, your issue is almost certainly in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Business logic&lt;/li&gt;
&lt;li&gt;Misbehaving external systems&lt;/li&gt;
&lt;li&gt;Unoptimized code paths&lt;/li&gt;
&lt;li&gt;Bad configs or feature toggles&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With solid RED metrics, tracing, profiling, and dependency checks you’ll isolate the slowest or weakest part of the app lifecycle.&lt;/p&gt;

&lt;h3&gt;
  
  
  Common Challenges in Monitoring Kubernetes
&lt;/h3&gt;

&lt;p&gt;Monitoring a Kubernetes environment isn’t just about scraping some metrics and throwing them into dashboards. In real-world scenarios, especially large-scale, multi-team clusters, there are unique challenges that can cripple even the best monitoring setups. Here are some of the most common ones teams face:&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Metric Overload&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;With so many layers (clusters, nodes, control planes, pods, apps), it’s easy to end up with thousands of metrics. But more metrics does not equal better observability. Without a clear signal-to-noise ratio, teams get stuck chasing anomalies that don’t matter, while missing critical signals that do.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Inconsistent Metric Sources&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Kubernetes components expose metrics in different formats and via different tools (Prometheus, ELK/EFK Stack, etc). This fragmentation can lead to incomplete or duplicated data, and sometimes even conflicting insights, making root cause analysis harder.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Multi-Tenancy Complexity&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;In shared clusters, multiple teams deploy and monitor their own apps. Without clear namespacing, labeling, and role-based access, it becomes hard to isolate responsibility or debug performance issues without stepping on each other’s toes.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Scaling Problems&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;At smaller scales, you might get by with basic dashboards. But as your workloads grow, so do the cardinality of metrics, storage costs, and processing load on your observability stack. Without a scalable monitoring setup, you risk cluttered dashboards and missed alerts.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Monitoring the Monitoring System&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Ironically, one of the most overlooked gaps is keeping tabs on your observability stack itself. What happens if Prometheus crashes? Or if Alertmanager silently dies? Monitoring the monitor ensures you’re not blind when it matters the most.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Break-Glass Mechanisms&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Sometimes, no matter how well things are set up, you need to bypass the dashboards and go straight to logs, live debugging, or kubectl inspections. Having a documented “break-glass” process with emergency steps to dig deeper can save time during production outages.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to Overcome These Challenges With Best Practices
&lt;/h3&gt;

&lt;p&gt;While Kubernetes observability can feel overwhelming, a thoughtful strategy and the right tools can make all the difference.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Focus on High-Value Metrics&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Instead of tracking everything, prioritize Golden Signals, RED/USE metrics, and metrics tied to SLAs and SLOs. Create dashboards with intent, not clutter.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Standardize Your Metric Sources&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Use a centralized metrics pipeline, typically &lt;a href="https://www.cloudraft.io/blog/prometheus-best-practices" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt;, with exporters like kube-state-metrics, node-exporter, and custom app exporters. Stick to consistent naming conventions and labels to avoid confusion across teams.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Use Labels &amp;amp; Namespaces Effectively&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Organize metrics by namespace, team, or application, and apply proper labels to distinguish tenants. Use tools like Prometheus’ relabeling and Grafana’s variable filters to slice metrics cleanly per use case.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Design for Scale&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Enable metric retention policies, recording rules, and downsampling. Consider remote write to long-term storage (like &lt;a href="https://www.cloudraft.io/grafana-mimir-support" rel="noopener noreferrer"&gt;Grafana Mimir&lt;/a&gt;) for large environments. Test how your dashboards perform under load.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Monitor Your Monitoring&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Set up alerts for your observability stack (e.g., “Is Prometheus scraping?”, “Is Alertmanager up?”). Include basic health checks for Grafana, Prometheus, exporters, and data sources.&lt;/p&gt;
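&lt;p&gt;The meta-monitoring checks above can be expressed as ordinary Prometheus alerting rules; this is a minimal sketch, with thresholds chosen for illustration:&lt;/p&gt;

```yaml
groups:
  - name: meta-monitoring
    rules:
      # Fires when any scrape target has been down for 5 minutes
      - alert: TargetDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
      # Fires when Prometheus fails to deliver notifications to Alertmanager
      - alert: AlertmanagerNotificationsFailing
        expr: rate(prometheus_notifications_errors_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
```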

&lt;h4&gt;
  
  
  &lt;strong&gt;Establish “Break-Glass” Documents&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Have documented steps for when observability fails, like which logs to tail, which kubectl commands to run, or how to access emergency dashboards. Practice chaos drills so everyone knows what to do.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tools That Help You Monitor These Metrics
&lt;/h3&gt;

&lt;p&gt;Understanding what to monitor is only half the task; the other half is how you actually collect, store, and visualize that data in a scalable and insightful way. The Kubernetes ecosystem has a rich set of observability tools that make this easier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prometheus and Grafana&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.cloudraft.io/monitoring-with-prometheus" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt; is the de facto standard for scraping, storing, and querying time-series metrics in Kubernetes.&lt;/li&gt;
&lt;li&gt;Grafana lets you visualize those metrics and set up alerting.&lt;/li&gt;
&lt;li&gt;With exporters like &lt;code&gt;node-exporter&lt;/code&gt; and &lt;code&gt;kube-state-metrics&lt;/code&gt;, you can cover everything from node health to pod status and custom application metrics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Best for teams looking for full control, custom dashboards, and open-source extensibility.&lt;/p&gt;
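&lt;p&gt;As a sketch, wiring both exporters into Prometheus is a matter of two scrape jobs. The service DNS names below assume the exporters run in the &lt;code&gt;monitoring&lt;/code&gt; and &lt;code&gt;kube-system&lt;/code&gt; namespaces; ports 9100 and 8080 are the exporters' defaults.&lt;/p&gt;

```yaml
scrape_configs:
  # Node-level hardware and OS metrics
  - job_name: node-exporter
    static_configs:
      - targets: ['node-exporter.monitoring.svc:9100']
  # Cluster-object state: deployments, pods, nodes
  - job_name: kube-state-metrics
    static_configs:
      - targets: ['kube-state-metrics.kube-system.svc:8080']
```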

&lt;p&gt;&lt;strong&gt;kube-state-metrics&lt;/strong&gt; This is a service that listens to the Kubernetes API server and generates metrics about the state of Kubernetes objects like deployments, nodes, and pods. It complements Prometheus by exposing high-level cluster state metrics (e.g., number of ready pods, desired replicas, node conditions). Best for cluster-level insights and higher-order metrics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;External Monitoring Services (VictoriaMetrics, Jaeger, OpenTelemetry, etc.)&lt;/strong&gt; These open-source tools form a powerful observability stack for Kubernetes environments. &lt;a href="https://docs.victoriametrics.com/guides/k8s-monitoring-via-vm-cluster/" rel="noopener noreferrer"&gt;VictoriaMetrics&lt;/a&gt; handles efficient metric storage, &lt;a href="https://www.cloudraft.io/blog/open-telemetry-auto-instrumentation" rel="noopener noreferrer"&gt;OpenTelemetry&lt;/a&gt; standardizes tracing and metrics across services, and with Jaeger, engineers can monitor and troubleshoot distributed transactions. Together, they give you flexibility, cost savings, and full control over your monitoring pipeline, without vendor lock-in.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Gathering data is only one aspect of monitoring Kubernetes; gathering the appropriate data, so that prompt, well-informed decisions can be made, is another. Knowing which metrics to watch is your best defense, whether you’re a platform team scaling across clusters, a DevOps engineer optimizing performance, or a Site Reliability Engineer fighting a late-night outage.&lt;/p&gt;

&lt;p&gt;That is why having a well-defined &lt;a href="https://www.cloudraft.io/blog/guide-to-observability" rel="noopener noreferrer"&gt;observability strategy&lt;/a&gt;, one that cuts through clutter, highlights what is needed, and adapts as your architecture evolves, is no longer optional. Teams are increasingly turning to frameworks, tooling, and purpose-built observability solutions that support this shift toward proactive, insight-driven operations. At the end of the day, metrics are your map, but only if you’re reading the right signs. Focus on these key signals, and you’ll spend less time digging through data and more time solving real problems.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://www.cloudraft.io/blog/top-metrics-to-watch-in-kubernetes" rel="noopener noreferrer"&gt;&lt;em&gt;https://www.cloudraft.io&lt;/em&gt;&lt;/a&gt; &lt;em&gt;on May 13, 2025.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
