Vibhuti Sharma

Posted on Mar 26

Monitoring AWS Batch Jobs with CloudWatch Custom Metrics

#observ #aws #awsbatch #monitoring

AWS Batch is used for a variety of compute workloads, such as data processing pipelines, background jobs, and scheduled compute tasks. AWS provides many infrastructure-level metrics for Batch in CloudWatch, but there is a significant gap when it comes to job status monitoring. For example, the number of jobs that are RUNNABLE, RUNNING, FAILED, or SUCCEEDED is not available by default in CloudWatch. These counts are visible on the AWS Batch dashboard, but they do not exist in CloudWatch as metrics. This makes it difficult to answer operational questions such as:

Are jobs accumulating in a RUNNABLE state?
Are the jobs failing frequently?
Is the system keeping up with workload?

Without these metrics, building meaningful dashboards or alerts for Batch workloads becomes challenging.

In this blog post, we look at how to close this observability gap by exporting custom AWS Batch job status metrics into CloudWatch, where they can be consumed by any third-party observability tool. The post walks through a custom setup for exporting these metrics using EventBridge, a Lambda function, the Batch API, and CloudWatch custom metrics.

Which AWS Batch metrics does CloudWatch publish by default?

AWS Batch publishes a limited set of infrastructure metrics to CloudWatch under the ECS/ContainerInsights namespace. These metrics primarily describe compute environment capacity and resource utilization rather than the status of jobs.

Examples of metrics available by default include:

StorageReadBytes: number of bytes read from storage on the instance
NetworkTxBytes: number of bytes transmitted by the resource
CpuReserved: CPU units reserved by tasks in the resource
StorageWriteBytes: number of bytes written to storage in the resource
EphemeralStorageReserved: number of bytes reserved from ephemeral storage in the resource
TaskCount: number of tasks running in the cluster
MemoryReserved: memory that is reserved by tasks in the resource
EphemeralStorageUtilized: number of bytes used from ephemeral storage in the resource
NetworkRxBytes: number of bytes received by the resource
CpuUtilized: CPU units used by tasks in the resource
ServiceCount: number of services in the cluster
ContainerInstanceCount: number of EC2 instances running the Amazon ECS agent that are registered with a cluster
MemoryUtilized: memory being used by tasks in the resource

However, job status metrics are not published by default. These missing metrics include:

Number of RUNNABLE jobs
Number of RUNNING jobs
Number of FAILED jobs
Number of SUCCEEDED jobs
Number of SUBMITTED jobs

These metrics are critical for monitoring Batch workloads because they indicate system health and throughput.

For example:

A growing RUNNABLE job count may indicate insufficient compute capacity.
A spike in FAILED jobs may indicate application or infrastructure issues.

To obtain these metrics, we need to query the AWS Batch API and publish the results ourselves.

How to export AWS Batch job status metrics to CloudWatch

In this solution, a Lambda function periodically queries the AWS Batch API for job states and publishes the job count for each status as a custom CloudWatch metric.

Components used
The setup consists of four components:

EventBridge Rule: runs on a schedule (for example, every 5 minutes) and triggers a Lambda function.
Lambda Function: calls the AWS Batch API, retrieves job counts by status, aggregates the results, and publishes metrics to CloudWatch.
AWS Batch API: provides job information through API calls such as ListJobs, whose response contains a jobSummaryList.
CloudWatch Custom Metrics: stores the exported job status metrics and exposes them for dashboards and alerts.

Workflow
The process works as follows:

EventBridge triggers the Lambda function on a schedule.
Lambda queries AWS Batch for job counts in each status.
The job counts are aggregated.
Lambda publishes these counts as custom metrics to CloudWatch.
Grafana or another observability tool reads these metrics from CloudWatch.

This architecture is serverless, inexpensive, and easy to extend, and does not require changes to existing Batch workloads.

Cost considerations

The cost of this setup is minimal because it relies entirely on serverless services and lightweight API calls.

EventBridge: EventBridge scheduled rules cost a fraction of a cent per million invocations. With a schedule of every 5 minutes, the cost is negligible.
Lambda: The Lambda function only performs a small number of API calls and executes for a short duration. In most cases, this will remain well within the Lambda free tier.
CloudWatch Custom Metrics: CloudWatch custom metrics are the primary cost factor. CloudWatch charges per metric per month, and each unique combination of metric name and dimensions counts as a separate metric. Since the setup only publishes a small number of metrics per job queue (typically 4–6), the total cost remains low. For example, publishing one metric for each of the following statuses results in only five custom metrics per queue:

RUNNABLE
RUNNING
FAILED
SUCCEEDED
SUBMITTED

Overall, the monthly cost of this setup is typically very small compared to the operational visibility it provides.
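To make the cost reasoning concrete, here is a rough back-of-the-envelope sketch. The function name and its parameters are hypothetical, and the per-metric price is passed in by the caller because it varies by region and pricing tier. The 0.30 used below is only an illustrative assumption, not a quoted price, so check current CloudWatch pricing before relying on the result.

```python
def estimate_monthly_footprint(num_queues, statuses_per_queue, schedule_minutes,
                               price_per_metric_usd, days_in_month=30):
    """Rough estimate of custom metric count, Lambda invocations, and metric cost."""
    # Each (metric name, dimension set) pair is billed as its own custom metric,
    # so the count grows with the number of job queues.
    metric_count = num_queues * statuses_per_queue
    # One scheduled invocation every `schedule_minutes` minutes.
    invocations = (24 * 60 // schedule_minutes) * days_in_month
    # price_per_metric_usd is an assumption supplied by the caller.
    metric_cost = metric_count * price_per_metric_usd
    return {
        "metric_count": metric_count,
        "invocations": invocations,
        "metric_cost_usd": round(metric_cost, 2),
    }


# Illustrative numbers only: 3 queues, 5 statuses each, a 5-minute schedule
estimate = estimate_monthly_footprint(3, 5, 5, price_per_metric_usd=0.30)
print(estimate)
```

The point of the sketch is that the metric count scales with the number of queues, while the invocation count depends only on the schedule.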

Implementation

The implementation consists of four main steps:

Creating the Lambda function
Querying the Batch API
Publishing metrics to CloudWatch
Scheduling the function using EventBridge

Lambda function logic

The Lambda function performs the following actions:

Reads the list of job queues from its environment variables
Queries jobs by status
Counts jobs in each state
Publishes the counts to CloudWatch
Example Python implementation:

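import os
import boto3
import json
from datetime import datetime
import logging

# Module-level logger; Lambda forwards its output to CloudWatch Logs
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Create the clients once so warm invocations can reuse them
batch_client = boto3.client("batch")
cloudwatch = boto3.client("cloudwatch")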

def batch_metrics_exporter(event, context):
    try:
        job_queues = json.loads(os.environ["JOB_QUEUES"])
        compute_env_mapping = json.loads(os.environ["COMPUTE_ENV_MAPPING"])

        all_metrics = []

        for queue_name in job_queues:
            job_counts = get_job_counts_by_status(batch_client, queue_name)
            compute_env = compute_env_mapping.get(queue_name, 'unknown')
            timestamp = datetime.utcnow()

            for status, count in job_counts.items():
                all_metrics.append({
                    'MetricName': f'{status}Jobs',
                    'Dimensions': [
                        {'Name': 'JobQueue', 'Value': queue_name},
                        {'Name': 'ComputeEnvironment', 'Value': compute_env}
                    ],
                    'Value': count,
                    'Unit': 'Count',
                    'Timestamp': timestamp
                })

        batch_size = 20
        for i in range(0, len(all_metrics), batch_size):
            cloudwatch.put_metric_data(
                Namespace='<YOUR_NAMESPACE>',
                MetricData=all_metrics[i:i + batch_size]
            )

        return {
            'statusCode': 200,
            'body': json.dumps({
                'message': f"Successfully published {len(all_metrics)} metrics",
                'queues_processed': len(job_queues)
            })
        }

    except Exception as e:
        logger.error(f"Error in batch_metrics_exporter: {str(e)}")
        return {"statusCode": 500, "body": json.dumps({"error": str(e)})}

def get_job_counts_by_status(batch_client, job_queue_name):
    job_statuses = ['SUBMITTED', 'PENDING', 'RUNNABLE', 'STARTING', 'RUNNING', 'SUCCEEDED', 'FAILED']
    job_counts = {'Submitted': 0, 'Runnable': 0, 'Running': 0, 'Failed': 0, 'Succeeded': 0}

    try:
        for status in job_statuses:
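            # list_jobs returns results in pages, so walk every page to count all jobs
            paginator = batch_client.get_paginator('list_jobs')
            count = 0
            for page in paginator.paginate(jobQueue=job_queue_name, jobStatus=status):
                count += len(page.get('jobSummaryList', []))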

            if status in ['SUBMITTED', 'PENDING']:
                job_counts['Submitted'] += count
            elif status in ['RUNNABLE', 'STARTING']:
                job_counts['Runnable'] += count
            elif status == 'RUNNING':
                job_counts['Running'] += count
            elif status == 'FAILED':
                job_counts['Failed'] += count
            elif status == 'SUCCEEDED':
                job_counts['Succeeded'] += count

    except Exception as e:
        logger.error(f"Error getting job counts for queue {job_queue_name}: {str(e)}")

    return job_counts
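The handler above reads its configuration from two JSON-encoded environment variables, JOB_QUEUES and COMPUTE_ENV_MAPPING. As a minimal sketch of what those values might look like, the snippet below builds and parses them the same way the handler does. The queue and compute environment names are made-up placeholders, not names from a real setup.

```python
import json
import os

# Hypothetical queue and compute environment names, for illustration only
job_queues = ["etl-queue", "ml-training-queue"]
compute_env_mapping = {
    "etl-queue": "etl-spot-env",
    "ml-training-queue": "ml-gpu-env",
}

# Both variables hold JSON strings, matching the json.loads calls in the handler
os.environ["JOB_QUEUES"] = json.dumps(job_queues)
os.environ["COMPUTE_ENV_MAPPING"] = json.dumps(compute_env_mapping)

# Read them back the same way the handler does
parsed_queues = json.loads(os.environ["JOB_QUEUES"])
parsed_mapping = json.loads(os.environ["COMPUTE_ENV_MAPPING"])

for queue in parsed_queues:
    # Queues missing from the mapping fall back to 'unknown' in the handler
    print(queue, "->", parsed_mapping.get(queue, "unknown"))
```

In a deployed function, these two strings would be set in the Lambda configuration rather than in code.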

IAM permissions required for the Lambda function

The Lambda function requires permissions for Batch APIs and CloudWatch metrics.

Minimal IAM policy example:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": ["batch:DescribeJobQueues", "batch:ListJobs"],
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Action": "cloudwatch:PutMetricData",
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}

Setting up the EventBridge schedule
We can create an EventBridge rule with a schedule expression. The rate can be set according to how fresh the metrics need to be.

Example: rate(5 minutes)

Attach the Lambda function as the target. This ensures the metrics are refreshed periodically.
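The same setup can be scripted. The following is a minimal sketch using the boto3 calls put_rule, add_permission, and put_targets; the function name, rule name, statement ID, and target ID are hypothetical choices made for this example. The clients are passed in as arguments so the function can be exercised without AWS credentials.

```python
def schedule_exporter(events_client, lambda_client, function_arn,
                      rule_name="batch-metrics-exporter-schedule",
                      schedule_expression="rate(5 minutes)"):
    """Create a scheduled rule, allow it to invoke the function, and attach the target."""
    # Create (or update) the scheduled rule
    rule = events_client.put_rule(
        Name=rule_name,
        ScheduleExpression=schedule_expression,
        State="ENABLED",
    )
    # Allow EventBridge to invoke the Lambda function from this rule
    lambda_client.add_permission(
        FunctionName=function_arn,
        StatementId=f"{rule_name}-invoke",  # hypothetical statement ID
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule["RuleArn"],
    )
    # Attach the function as the rule's target
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "batch-metrics-exporter", "Arn": function_arn}],
    )
    return rule["RuleArn"]
```

With real clients this would be called as schedule_exporter(boto3.client("events"), boto3.client("lambda"), function_arn), where function_arn is the ARN of the exporter function. Note that add_permission fails if a statement with the same ID already exists, so a rerun would need to handle that case.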

How to verify AWS Batch metrics in CloudWatch

Once the Lambda function begins publishing metrics, they will appear in CloudWatch under the custom namespace used in the implementation.

Navigate to: CloudWatch → Metrics → your custom namespace

You should see metrics such as:

RunnableJobs
RunningJobs
FailedJobs
SucceededJobs

Each metric carries the JobQueue and ComputeEnvironment dimensions, allowing filtering per queue. These metrics can now be queried, visualized, or used for alerting.
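The metrics can also be read back programmatically. Below is a minimal sketch that queries one queue's runnable backlog through the CloudWatch GetMetricData API. The helper names are hypothetical. It assumes the metric name RunnableJobs and the JobQueue and ComputeEnvironment dimensions produced by the exporter code above, and takes the client as an argument.

```python
from datetime import datetime, timedelta, timezone


def build_runnable_query(namespace, queue_name, compute_env, period_seconds=300):
    """Build a GetMetricData query for the RunnableJobs metric of one queue."""
    return {
        "Id": "runnable",  # query IDs must start with a lowercase letter
        "MetricStat": {
            "Metric": {
                "Namespace": namespace,
                "MetricName": "RunnableJobs",
                # Both dimensions must match what the exporter published
                "Dimensions": [
                    {"Name": "JobQueue", "Value": queue_name},
                    {"Name": "ComputeEnvironment", "Value": compute_env},
                ],
            },
            "Period": period_seconds,
            "Stat": "Maximum",
        },
        "ReturnData": True,
    }


def fetch_runnable_backlog(cloudwatch_client, namespace, queue_name, compute_env, hours=1):
    """Return (timestamps, values) for the last `hours` hours."""
    end = datetime.now(timezone.utc)
    response = cloudwatch_client.get_metric_data(
        MetricDataQueries=[build_runnable_query(namespace, queue_name, compute_env)],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
    )
    result = response["MetricDataResults"][0]
    return result["Timestamps"], result["Values"]
```

Because CloudWatch treats each dimension combination as a distinct metric, a query that supplies only JobQueue would not match data published with both dimensions.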

Limitations

One of the limitations of the Batch API is that it returns point-in-time snapshots rather than time series data.

This means the metrics represent the number of jobs in each state at the moment the Lambda function runs, rather than a continuous stream of job events.

However, this limitation can be addressed using PromQL queries in observability systems such as Prometheus or Grafana.

For example:

Deriving job failure rates
Calculating trends in runnable job backlog
Detecting abnormal changes in job states
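The derivations listed above can also be sketched outside any query language. The example below works on consecutive snapshots shaped like the dictionaries returned by get_job_counts_by_status. The helper names and sample numbers are hypothetical, and the sketch assumes the Failed and Succeeded counts mostly grow between polls; negative deltas are clamped to zero in case older finished jobs drop out of the listing.

```python
def failure_rate_between(previous, current):
    """Estimate the failure rate from two consecutive status-count snapshots."""
    # Clamp negative deltas to zero in case finished-job counts shrink between polls
    new_failed = max(current["Failed"] - previous["Failed"], 0)
    new_succeeded = max(current["Succeeded"] - previous["Succeeded"], 0)
    finished = new_failed + new_succeeded
    if finished == 0:
        return None  # nothing finished in this interval
    return new_failed / finished


def backlog_trend(snapshots):
    """Change in the Runnable count between consecutive snapshots."""
    runnable = [s["Runnable"] for s in snapshots]
    return [later - earlier for earlier, later in zip(runnable, runnable[1:])]


# Made-up sample data: three polls of the same queue
snapshots = [
    {"Runnable": 4, "Failed": 10, "Succeeded": 90},
    {"Runnable": 9, "Failed": 12, "Succeeded": 96},
    {"Runnable": 15, "Failed": 12, "Succeeded": 96},
]

print(failure_rate_between(snapshots[0], snapshots[1]))  # 2 failed out of 8 finished
print(backlog_trend(snapshots))
```

A steadily positive backlog trend suggests jobs are arriving faster than the compute environment can start them.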

Another limitation is data delay, which depends on the EventBridge schedule. If the rule runs every 5 minutes, the metrics can lag by up to five minutes.

Reducing the schedule interval improves freshness but increases API usage.

When should you use this setup?

This approach is most useful when Batch workloads involve long-running or heavy compute jobs. In such environments, understanding job queue health is important for operational stability.

Examples include:

Data processing pipelines
Machine learning workloads
ETL systems
Background compute services

However, the setup may be less useful for very short-lived jobs that start and complete within seconds. In those cases, the scheduled polling approach may miss transient states.

Therefore, this solution is most effective when jobs run long enough to be captured within the scheduled polling interval.

Conclusion

AWS Batch provides strong compute orchestration capabilities but lacks native job-level observability metrics in CloudWatch. By combining EventBridge, Lambda, the Batch API, and CloudWatch custom metrics, it is possible to export job status metrics and integrate them into existing observability dashboards.

This setup provides visibility into queue backlog, job failures, and system throughput, enabling better operational monitoring of Batch workloads. In practice, this solution has proven useful for tracking job health and building meaningful dashboards around Batch-based workloads. With minimal infrastructure and low cost, it significantly improves observability for production Batch environments.

Originally published at https://www.cloudraft.io on March 24, 2026.
