Vibhuti Sharma

Monitoring AWS Batch Jobs with CloudWatch Custom Metrics

AWS Batch is used for a variety of compute workloads, such as data processing pipelines, background jobs, and scheduled compute tasks. AWS provides many infrastructure-level metrics for Batch in CloudWatch, but there is a significant gap when it comes to job status monitoring. For example, the number of jobs that are RUNNABLE, RUNNING, FAILED, or SUCCEEDED is not available by default in CloudWatch. These counts are visible on the AWS Batch dashboard, but they do not exist in CloudWatch as metrics. This makes it difficult to answer operational questions such as:

  • Are jobs accumulating in a RUNNABLE state?

  • Are the jobs failing frequently?

  • Is the system keeping up with workload?

Without these metrics, building meaningful dashboards or alerts for Batch workloads becomes challenging.

This post shows how to close this observability gap by exporting custom AWS Batch job status metrics into CloudWatch, where they can be consumed by any third-party observability tool. It walks through a custom setup that uses EventBridge, a Lambda function, the Batch API, and CloudWatch custom metrics.

Which AWS Batch metrics does CloudWatch publish by default?

AWS Batch publishes a limited set of infrastructure metrics to CloudWatch under the ECS/ContainerInsights namespace. These metrics primarily describe compute environment capacity and resource utilization, rather than the status of jobs.

Examples of metrics available by default include:

  • StorageReadBytes: number of bytes read from storage on the instance

  • NetworkTxBytes: number of bytes transmitted by the resource

  • CpuReserved: CPU units reserved by tasks in the resource

  • StorageWriteBytes: number of bytes written to storage in the resource

  • EphemeralStorageReserved: number of bytes reserved from ephemeral storage in the resource

  • TaskCount: number of tasks running in the cluster

  • MemoryReserved: memory that is reserved by tasks in the resource

  • EphemeralStorageUtilized: number of bytes used from ephemeral storage in the resource

  • NetworkRxBytes: number of bytes received by the resource

  • CpuUtilized: CPU units used by tasks in the resource

  • ServiceCount: number of services in the cluster

  • ContainerInstanceCount: number of EC2 instances running the Amazon ECS agent that are registered with a cluster

  • MemoryUtilized: memory being used by tasks in the resource

However, job status metrics are not published by default. These missing metrics include:

  • Number of RUNNABLE jobs
  • Number of RUNNING jobs
  • Number of FAILED jobs
  • Number of SUCCEEDED jobs
  • Number of SUBMITTED jobs

These metrics are critical for monitoring Batch workloads because they indicate system health and throughput.

For example:

  • A growing RUNNABLE job count may indicate insufficient compute capacity.
  • A spike in FAILED jobs may indicate application or infrastructure issues.

To obtain these metrics, we need to query the AWS Batch API and publish the results ourselves.

How to export AWS Batch job status metrics to CloudWatch

This solution periodically queries the AWS Batch API for job states and publishes the job counts for each status as custom CloudWatch metrics.

Components used
The setup consists of four components:

  1. EventBridge Rule
     • Runs on a schedule (for example every 5 minutes)
     • Triggers a Lambda function
  2. Lambda Function
     • Calls the AWS Batch API
     • Retrieves job counts by status
     • Aggregates the results
     • Publishes metrics to CloudWatch
  3. AWS Batch API
     • Provides job information through API calls such as ListJobs, whose response contains a jobSummaryList
  4. CloudWatch Custom Metrics
     • Stores the exported job status metrics
     • Exposes them for dashboards and alerts

Workflow
The process works as follows:

  1. EventBridge triggers the Lambda function on a schedule.
  2. Lambda queries AWS Batch for job counts in each status.
  3. The job counts are aggregated.
  4. Lambda publishes these counts as custom metrics to CloudWatch.
  5. Grafana or another observability tool reads these metrics from CloudWatch.

This architecture is serverless, inexpensive, and easy to extend, and does not require changes to existing Batch workloads.

Cost considerations

The cost of this setup is minimal because it relies entirely on serverless services and lightweight API calls.

  1. EventBridge: A schedule of every 5 minutes produces about 8,640 invocations in a 30-day month. At this volume, the EventBridge cost is negligible.
  2. Lambda: The Lambda function only performs a small number of API calls and executes for a short duration. In most cases, this will remain well within the Lambda free tier.
  3. CloudWatch Custom Metrics: CloudWatch custom metrics are the primary cost factor. CloudWatch charges per metric per month, and each unique combination of metric name and dimension values counts as a separate metric. The setup publishes only a small number of metrics (one per job status for each job queue), so the total cost remains low. For example, publishing metrics for the following statuses results in five custom metrics per queue:
     • RUNNABLE
     • RUNNING
     • FAILED
     • SUCCEEDED
     • SUBMITTED

Overall, the monthly cost of this setup is typically very small compared to the operational visibility it provides.
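The arithmetic behind these estimates can be sketched in a few lines. This is only a rough sizing aid: the helper names and the example of three queues are illustrative assumptions, and no prices are included, so take the per-metric and per-invocation rates from the current AWS pricing pages.

```python
# Back-of-the-envelope sizing for the exporter. No prices are hard-coded;
# the helper names and example inputs are illustrative assumptions.

STATUS_METRICS = ["Submitted", "Runnable", "Running", "Failed", "Succeeded"]


def invocations_per_month(interval_minutes: int, days: int = 30) -> int:
    """Number of scheduled Lambda invocations in a month of `days` days."""
    return (24 * 60 // interval_minutes) * days


def billable_metric_count(num_queues: int,
                          num_statuses: int = len(STATUS_METRICS)) -> int:
    """Each (metric name, dimension values) pair is a separate custom metric."""
    return num_queues * num_statuses


if __name__ == "__main__":
    print(invocations_per_month(5))   # 8640
    print(billable_metric_count(3))   # 15
```

Multiplying the metric count by the per-metric monthly rate for your region gives the dominant part of the bill.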

Implementation

The implementation consists of four main steps:

  1. Creating the Lambda function
  2. Querying the Batch API
  3. Publishing metrics to CloudWatch
  4. Scheduling the function using EventBridge

Lambda function logic

The Lambda function performs the following actions:

  • Reads the job queue names from its configuration
  • Queries jobs by status
  • Counts jobs in each state
  • Publishes the counts to CloudWatch

Example Python implementation:
import json
import logging
import os
from datetime import datetime, timezone

import boto3

# Lambda attaches a handler to the root logger; only the level needs setting.
logger = logging.getLogger()
logger.setLevel(logging.INFO)

batch_client = boto3.client("batch")
cloudwatch = boto3.client("cloudwatch")

def batch_metrics_exporter(event, context):
    try:
        # JOB_QUEUES: JSON list of queue names.
        # COMPUTE_ENV_MAPPING: JSON object mapping queue name to compute environment.
        job_queues = json.loads(os.environ["JOB_QUEUES"])
        compute_env_mapping = json.loads(os.environ["COMPUTE_ENV_MAPPING"])

        all_metrics = []

        for queue_name in job_queues:
            job_counts = get_job_counts_by_status(batch_client, queue_name)
            compute_env = compute_env_mapping.get(queue_name, 'unknown')
            timestamp = datetime.now(timezone.utc)

            # One metric per aggregated status, e.g. RunnableJobs, FailedJobs.
            for status, count in job_counts.items():
                all_metrics.append({
                    'MetricName': f'{status}Jobs',
                    'Dimensions': [
                        {'Name': 'JobQueue', 'Value': queue_name},
                        {'Name': 'ComputeEnvironment', 'Value': compute_env}
                    ],
                    'Value': count,
                    'Unit': 'Count',
                    'Timestamp': timestamp
                })

        # Publish in small chunks to keep each PutMetricData request small.
        batch_size = 20
        for i in range(0, len(all_metrics), batch_size):
            cloudwatch.put_metric_data(
                Namespace='<YOUR_NAMESPACE>',
                MetricData=all_metrics[i:i + batch_size]
            )

        return {
            'statusCode': 200,
            'body': json.dumps({
                'message': f"Successfully published {len(all_metrics)} metrics",
                'queues_processed': len(job_queues)
            })
        }

    except Exception as e:
        logger.error(f"Error in batch_metrics_exporter: {str(e)}")
        return {"statusCode": 500, "body": json.dumps({"error": str(e)})}

def get_job_counts_by_status(batch_client, job_queue_name):
    job_statuses = ['SUBMITTED', 'PENDING', 'RUNNABLE', 'STARTING', 'RUNNING', 'SUCCEEDED', 'FAILED']
    job_counts = {'Submitted': 0, 'Runnable': 0, 'Running': 0, 'Failed': 0, 'Succeeded': 0}

    # list_jobs returns results in pages, so paginate to count every job.
    paginator = batch_client.get_paginator('list_jobs')

    try:
        for status in job_statuses:
            count = 0
            for page in paginator.paginate(jobQueue=job_queue_name, jobStatus=status):
                count += len(page.get('jobSummaryList', []))

            # Fold the seven Batch states into the five published metrics.
            if status in ['SUBMITTED', 'PENDING']:
                job_counts['Submitted'] += count
            elif status in ['RUNNABLE', 'STARTING']:
                job_counts['Runnable'] += count
            elif status == 'RUNNING':
                job_counts['Running'] += count
            elif status == 'FAILED':
                job_counts['Failed'] += count
            elif status == 'SUCCEEDED':
                job_counts['Succeeded'] += count

    except Exception as e:
        # On failure, the counts gathered so far (possibly zeros) are still returned.
        logger.error(f"Error getting job counts for queue {job_queue_name}: {str(e)}")

    return job_counts
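The handler reads its configuration from two JSON-encoded environment variables, JOB_QUEUES and COMPUTE_ENV_MAPPING. A minimal sketch of producing those values follows; the queue and compute environment names are hypothetical placeholders, not names from the original setup.

```python
import json

# Hypothetical queue and compute environment names; replace with your own.
job_queues = ["etl-queue", "ml-training-queue"]
compute_env_mapping = {
    "etl-queue": "etl-spot-env",
    "ml-training-queue": "ml-gpu-env",
}

# These strings are what the Lambda environment variables should contain.
lambda_environment = {
    "JOB_QUEUES": json.dumps(job_queues),
    "COMPUTE_ENV_MAPPING": json.dumps(compute_env_mapping),
}

if __name__ == "__main__":
    for name, value in lambda_environment.items():
        print(f"{name}={value}")
```

Any queue missing from the mapping is reported with the ComputeEnvironment dimension set to 'unknown', as in the handler code.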

IAM permissions required for the Lambda function

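The Lambda function requires permissions for Batch APIs and CloudWatch metrics. The example function calls only ListJobs and PutMetricData; batch:DescribeJobQueues is needed only if you extend it to discover job queues dynamically.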

Minimal IAM policy example:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": ["batch:DescribeJobQueues", "batch:ListJobs"],
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Action": "cloudwatch:PutMetricData",
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}

Setting up the EventBridge schedule
Create an EventBridge rule with a schedule expression. Choose a rate that matches how fresh the metrics need to be.

Example: rate(5 minutes)

Attach the Lambda function as the target. This ensures the metrics are refreshed periodically.
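The same setup can be scripted. The following is a minimal sketch, assuming `events_client` and `lambda_client` are boto3 clients for "events" and "lambda"; the rule name, target ID, and statement ID are hypothetical, and the function ARN is a placeholder you supply.

```python
def schedule_exporter(events_client, lambda_client, function_arn,
                      rule_name="batch-metrics-exporter-schedule",
                      schedule="rate(5 minutes)"):
    """Create (or update) a scheduled rule that invokes the exporter Lambda.

    events_client and lambda_client are assumed to be boto3 clients for
    "events" and "lambda"; rule_name is a hypothetical name.
    """
    rule = events_client.put_rule(
        Name=rule_name,
        ScheduleExpression=schedule,
        State="ENABLED",
    )
    # Allow EventBridge to invoke the function, scoped to this rule.
    lambda_client.add_permission(
        FunctionName=function_arn,
        StatementId=f"{rule_name}-invoke",
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule["RuleArn"],
    )
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "batch-metrics-exporter", "Arn": function_arn}],
    )
    return rule["RuleArn"]

# Usage (requires AWS credentials; the ARN is a placeholder):
#   import boto3
#   schedule_exporter(boto3.client("events"), boto3.client("lambda"),
#                     "<YOUR_LAMBDA_FUNCTION_ARN>")
```

Note that add_permission fails if a statement with the same ID already exists, so a rerun of this sketch would need to handle that case.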

How to verify AWS Batch metrics in CloudWatch

Once the Lambda function begins publishing metrics, they will appear in CloudWatch under the custom namespace used in the implementation.

Navigate to: CloudWatch → Metrics →

You should see the metric names published by the Lambda function, such as:

  • RunnableJobs
  • RunningJobs
  • FailedJobs
  • SucceededJobs
  • SubmittedJobs

Each metric carries the JobQueue and ComputeEnvironment dimensions, allowing filtering per queue. These metrics can now be queried, visualized, or used for alerting.
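The check can also be done programmatically. The sketch below assumes `cloudwatch_client` is a boto3 CloudWatch client and `namespace` is the custom namespace used by the exporter; the function name is hypothetical. Newly published metrics can take several minutes to show up in ListMetrics results.

```python
def list_exported_metrics(cloudwatch_client, namespace):
    """Return sorted (metric name, job queue) pairs found in the namespace."""
    found = set()
    kwargs = {"Namespace": namespace}
    while True:
        response = cloudwatch_client.list_metrics(**kwargs)
        for metric in response.get("Metrics", []):
            dims = {d["Name"]: d["Value"] for d in metric.get("Dimensions", [])}
            found.add((metric["MetricName"], dims.get("JobQueue", "")))
        # Follow NextToken until every page has been read.
        token = response.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token
    return sorted(found)

# Usage (requires AWS credentials):
#   import boto3
#   for name, queue in list_exported_metrics(boto3.client("cloudwatch"),
#                                            "<YOUR_NAMESPACE>"):
#       print(name, queue)
```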

Limitations

One of the limitations of the Batch API is that it returns point-in-time snapshots rather than time series data.

This means the metrics represent the number of jobs in each state at the moment the Lambda function runs, rather than a continuous stream of job events.

However, this limitation can be partly addressed at query time, for example with PromQL queries in observability systems such as Prometheus or Grafana.

For example:

  • Deriving job failure rates
  • Calculating trends in runnable job backlog
  • Detecting abnormal changes in job states
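As one illustration of the first item, the sketch below turns successive FailedJobs snapshots into per-interval increases. It is written in plain Python rather than PromQL, the function name and sample data are hypothetical, and it assumes that a drop in the snapshot value means older jobs are no longer being returned rather than failures being undone.

```python
from typing import List, Tuple


def new_failures_per_interval(samples: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Turn successive FailedJobs snapshots into per-interval increases.

    `samples` is a time-ordered list of (timestamp, FailedJobs value).
    A drop in the snapshot value (for example when old jobs are no longer
    returned by the API) is treated as zero new failures.
    """
    deltas = []
    for (_, prev), (ts, curr) in zip(samples, samples[1:]):
        deltas.append((ts, max(curr - prev, 0)))
    return deltas


if __name__ == "__main__":
    snapshots = [("10:00", 2), ("10:05", 2), ("10:10", 5), ("10:15", 1)]
    print(new_failures_per_interval(snapshots))
    # [('10:05', 0), ('10:10', 3), ('10:15', 0)]
```

The same differencing idea applies to the runnable backlog, where a sustained positive trend suggests the compute environment is not keeping up.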

Another limitation is data delay, which depends on the EventBridge schedule. If the rule runs every 5 minutes, the metrics can be up to five minutes stale.

Reducing the schedule interval improves freshness but increases API usage.

When should you use this setup?

This approach is most useful when Batch workloads involve long-running or heavy compute jobs. In such environments, understanding job queue health is important for operational stability.

Examples include:

  • Data processing pipelines
  • Machine learning workloads
  • ETL systems
  • Background compute services

However, the setup may be less useful for very short-lived jobs that start and complete within seconds. In those cases, the scheduled polling approach may miss transient states.

Therefore, this solution is most effective when jobs run long enough to be captured within the scheduled polling interval.

Conclusion

AWS Batch provides strong compute orchestration capabilities but lacks native job-level observability metrics in CloudWatch. By combining EventBridge, Lambda, the Batch API, and CloudWatch custom metrics, it is possible to export job status metrics and integrate them into existing observability dashboards.

This setup provides visibility into queue backlog, job failures, and system throughput, enabling better operational monitoring of Batch workloads. In practice, this solution has proven useful for tracking job health and building meaningful dashboards around Batch-based workloads. With minimal infrastructure and low cost, it significantly improves observability for production Batch environments.

Originally published at https://www.cloudraft.io on March 24, 2026.
