Event-Driven Batch Processing on AWS: From Scheduled Tasks to Auto-Scaling Workloads

As DevOps engineers, we've all been there: running EC2 instances 24/7 to process batch jobs that only run a few hours per day. The meter keeps ticking whether you're processing millions of images or sitting idle at 3 AM. Your AWS bill arrives, and you wonder: "Why am I paying for servers that do nothing most of the time?"

I've seen this pattern repeatedly across startups, enterprises, and everything in between. The traditional approach to batch processing, always-on infrastructure, made sense in the data center era. But in the cloud? It's like leaving your car running in the parking lot all day because you might need it later.

In this comprehensive guide, I'll walk you through three production-ready patterns for implementing batch processing on AWS using containers. You'll learn when to use each pattern, how to implement auto-scaling based on queue depth, and how to reduce your batch processing costs by up to 74%. More importantly, you'll understand the architectural decisions behind each pattern and how to adapt them to your specific use cases.

All code examples are production-ready and available on GitHub, complete with Terraform infrastructure, monitoring configurations, and deployment guides. Let's dive in.

The Problem with Traditional Batch Processing

Let's start with a real scenario that mirrors what I've seen dozens of times. Imagine you're running a SaaS platform that processes user-uploaded documents: PDFs that need to be converted, analyzed, and indexed. Your current setup uses two t3.medium EC2 instances running constantly, costing about $73 per month.

But here's the reality of your workload:

  • 6 AM - 10 AM: Heavy processing (users uploading files before work)
  • 10 AM - 5 PM: Moderate, sporadic activity
  • 5 PM - 7 PM: Another spike (end-of-day uploads)
  • 7 PM - 6 AM: Nearly idle (maybe 5% utilization)

You're paying for 100% capacity but using it fully only about 25% of the time. The math is brutal: you're paying $55/month for idle servers doing nothing.

Traditional batch processing faces four major challenges:

1. Always-On Infrastructure Costs

In the on-premises world, this made sense: the servers were already purchased. But in the cloud, you're paying by the hour. Those idle servers are burning money while you sleep.

2. Manual Scaling is Reactive, Not Proactive

You get a sudden spike in uploads. Your servers max out. Users wait. You get paged. You manually spin up more instances. By the time they're ready, the spike has passed. Then you forget to turn them off and waste more money.

3. Poor Resource Utilization

Your jobs vary in size. Some take 30 seconds, others take 10 minutes. With fixed-size instances, you either over-provision (waste money) or under-provision (slow performance). There's no middle ground.

4. Operational Complexity

Managing servers means patching, monitoring, log rotation, security updates, and capacity planning. Every hour spent on infrastructure is an hour not spent on features that differentiate your product.

Enter event-driven batch processing with containers. Instead of always-on servers, you run tasks only when there's work to do. Instead of manual scaling, your infrastructure responds automatically to workload changes. Instead of managing servers, you define tasks and let AWS handle the rest.

Sounds too good to be true? Let me show you how it works.

Understanding Event-Driven Architecture

Before we dive into specific patterns, let's understand what "event-driven" means in this context.

Traditional batch processing is time-driven: "Run this job every day at 2 AM."

Event-driven batch processing is trigger-driven: "Run this job when something happens."

That "something" could be:

  • A message arriving in a queue
  • A file uploaded to S3
  • An API request
  • A schedule (yes, scheduled tasks can be event-driven too!)
  • A state change in your application

The key difference is responsiveness. Time-driven systems process on a fixed schedule regardless of actual need. Event-driven systems process in response to actual demand.

Comparison of time-driven vs event-driven batch processing flows

This shift changes everything:

  • Costs: Pay only when processing
  • Performance: Respond immediately to demand
  • Scalability: Automatically match capacity to workload
  • Simplicity: No capacity planning needed

Now let's explore three production-ready patterns that implement this approach.

Pattern 1: Scheduled Tasks with ECS and EventBridge

Let's start simple. Not every batch job needs to be event-driven in the reactive sense. Sometimes you genuinely need to run something on a schedule. But that doesn't mean you need servers running 24/7.

Architecture Overview

ECS Scheduled Tasks architecture showing EventBridge → ECS → S3/SNS

EventBridge (formerly CloudWatch Events) acts as your scheduler. At the specified time, it triggers an ECS task. The task runs your containerized application, completes its work, and terminates. You pay only for the compute time used, typically minutes rather than hours.

Real-World Use Case: Daily Sales Reports

Let me walk you through a concrete example I implemented for an e-commerce platform.

Requirement: Generate daily sales reports aggregating data from RDS, calculate metrics, create PDF reports, and upload to S3 for the executive team.

Previous Solution: A t3.small instance ($15/month) running cron, active 24/7.

New Solution:

  • Container with Python script + ReportLab for PDF generation (a skeleton is sketched below)
  • EventBridge rule: cron(0 6 * * ? *) (6 AM UTC daily)
  • Task spec: 0.5 vCPU, 1 GB memory
  • Average runtime: 4 minutes
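
For context, the container's entrypoint is a short Python script. Here's a minimal, hypothetical skeleton of what it does; the ReportLab layout and S3 key scheme are placeholders rather than the production code, and the S3_BUCKET/REPORT_TYPE environment variables mirror the task definition shown later:

import os
from datetime import date, timedelta

import boto3
from reportlab.pdfgen import canvas

S3_BUCKET = os.environ["S3_BUCKET"]              # set in the ECS task definition
REPORT_TYPE = os.environ.get("REPORT_TYPE", "daily-sales")

def build_report(path):
    """Render a one-page PDF summary (placeholder content)."""
    yesterday = date.today() - timedelta(days=1)
    pdf = canvas.Canvas(path)
    pdf.drawString(72, 800, f"{REPORT_TYPE} report for {yesterday.isoformat()}")
    # ... query RDS, compute metrics, draw tables and charts here ...
    pdf.save()

def main():
    output = "/tmp/report.pdf"
    build_report(output)
    key = f"reports/{REPORT_TYPE}/{date.today().isoformat()}.pdf"
    boto3.client("s3").upload_file(output, S3_BUCKET, key)
    print(f"Uploaded report to s3://{S3_BUCKET}/{key}")

if __name__ == "__main__":
    main()

The container exits when main() returns, which is what makes the pay-per-run math below work.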

Cost Breakdown:

Fargate vCPU: $0.04048 per vCPU per hour
Fargate Memory: $0.004445 per GB per hour

Daily cost:
(0.5 vCPU × $0.04048) + (1 GB × $0.004445) = $0.024685/hour
4 minutes = 0.067 hours
Cost per run = $0.0017

Monthly cost (30 days): $0.051
Add CloudWatch, S3, SNS: ~$1.50/month
Total: ~$1.55/month

Compare this to the $15/month for an always-on t3.small. That's a 90% cost reduction. Over a year, that's about $162 saved, enough to fund several other AWS services.

Implementation Deep Dive

Here's what the task definition looks like in Terraform:

resource "aws_ecs_task_definition" "daily_report" {
  family                   = "daily-sales-report"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "512"    # 0.5 vCPU
  memory                   = "1024"   # 1 GB

  container_definitions = jsonencode([{
    name  = "report-generator"
    image = "${aws_ecr_repository.reports.repository_url}:latest"

    environment = [
      {
        name  = "REPORT_TYPE"
        value = "daily-sales"
      },
      {
        name  = "S3_BUCKET"
        value = aws_s3_bucket.reports.id
      }
    ]

    logConfiguration = {
      logDriver = "awslogs"
      options = {
        "awslogs-group"         = "/ecs/daily-reports"
        "awslogs-region"        = "us-east-1"
        "awslogs-stream-prefix" = "report"
      }
    }
  }])
}

The EventBridge rule that triggers it:

resource "aws_cloudwatch_event_rule" "daily_report" {
  name                = "daily-sales-report-trigger"
  description         = "Trigger daily sales report at 6 AM UTC"
  schedule_expression = "cron(0 6 * * ? *)"
}

resource "aws_cloudwatch_event_target" "ecs_task" {
  rule     = aws_cloudwatch_event_rule.daily_report.name
  arn      = aws_ecs_cluster.reports.arn
  role_arn = aws_iam_role.eventbridge_ecs.arn

  ecs_target {
    task_count          = 1
    task_definition_arn = aws_ecs_task_definition.daily_report.arn
    launch_type         = "FARGATE"

    network_configuration {
      subnets          = aws_subnet.private[*].id
      security_groups  = [aws_security_group.ecs_tasks.id]
      assign_public_ip = false
    }
  }
}

When to Use This Pattern

Perfect for:

  • Daily/weekly/monthly reports
  • Scheduled backups
  • Database maintenance tasks
  • Periodic data cleanup
  • Regular data synchronization
  • Certificate renewals
  • Compliance checks

Not suitable for:

  • Variable workloads with unpredictable timing
  • Jobs triggered by external events
  • Processing queues of variable size
  • Workloads requiring high parallelism

Pro Tips:

  1. Use SNS to get notified of failures; silent failures are the worst
  2. Set appropriate task timeouts to prevent runaway jobs
  3. Use CloudWatch Logs Insights to track execution time trends
  4. Consider multiple tasks for different report types rather than one monolithic task

Pattern 2: Event-Driven Auto-Scaling with ECS and SQS

Now we're getting to the heart of event-driven batch processing. This is the pattern I recommend for most production workloads, and it's where the real magic happens.

Architecture Overview

Detailed event-driven architecture with SQS, ECS auto-scaling, DLQ, and monitoring

This architecture responds dynamically to workload. Messages arrive in SQS, Application Auto Scaling watches the queue depth, and ECS automatically spins up or down tasks to maintain your target backlog per task.

The Auto-Scaling Magic Explained

Let me break down exactly how auto-scaling works, because understanding this is crucial to optimizing your workload.

Application Auto Scaling uses a target tracking scaling policy. You define a target value for a metric, and AWS automatically adjusts capacity to maintain that target.

The metric we care about is "backlog per instance":

Backlog per Instance = Queue Messages / Running Tasks

Let's say you configure a target of 5 messages per task. Here's what happens in different scenarios:

Scenario 1: Steady State

  • Queue has 5 messages
  • 1 task running
  • Backlog = 5 / 1 = 5 (matches target)
  • No scaling action

Scenario 2: Sudden Spike

  • Queue jumps to 100 messages (someone uploaded a batch)
  • Still 1 task running
  • Backlog = 100 / 1 = 100 (way above target of 5)
  • Scale-out action triggered
  • Target tasks = 100 / 5 = 20
  • But you've set max_tasks = 10
  • Service scales to 10 tasks
  • New backlog = 100 / 10 = 10 (still above target, but best we can do)

Scenario 3: Processing Complete

  • Queue is empty
  • 10 tasks running
  • Backlog = 0 / 10 = 0 (below target)
  • Scale-in action triggered after cooldown
  • Service scales to min_tasks = 1

Scenario 4: Gradual Increase

  • Queue grows from 5 to 25 messages over 10 minutes
  • Service gradually scales from 1 to 5 tasks
  • Maintains target backlog throughout
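
Those scenarios all reduce to the same arithmetic. This isn't the exact control loop AWS runs internally, just a rough Python sketch of the task count that target tracking converges toward (the target, min, and max values match the Terraform configuration later in this section):

import math

def desired_task_count(queue_depth,
                       target_backlog=5,
                       min_tasks=1,
                       max_tasks=10):
    """Approximate the task count target tracking converges toward."""
    if queue_depth == 0:
        return min_tasks                              # scale in (after the cooldown)
    ideal = math.ceil(queue_depth / target_backlog)
    return max(min_tasks, min(max_tasks, ideal))

print(desired_task_count(5))     # Scenario 1: stays at 1 task
print(desired_task_count(100))   # Scenario 2: wants 20, capped at max_tasks = 10
print(desired_task_count(25))    # Scenario 4: settles at 5 tasks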

Cooldown Periods: The Unsung Heroes

Cooldown periods prevent "flapping": rapid scaling up and down that wastes money and causes instability.

target_tracking_scaling_policy_configuration {
  target_value       = 5
  scale_in_cooldown  = 300  # 5 minutes
  scale_out_cooldown = 60   # 1 minute
}

Why different cooldowns?

  • Scale-out is fast (60 seconds): When messages are backing up, you want to respond quickly
  • Scale-in is slow (300 seconds): Tasks are cheap when idle; better to keep them around briefly in case more work arrives

I learned this the hard way. Initially, I set both to 60 seconds. The result? Constant scaling churn. A batch of 50 messages would arrive, scale up to 10 tasks, finish in 2 minutes, immediately scale down, then another batch would arrive 30 seconds later and scale back up. Each scaling action takes time, so all that churn actually made processing slower.

With a 5-minute scale-in cooldown, the service stays at capacity for a bit after processing, ready for the next batch. This is especially important for "bursty" workloads with periodic spikes.

Real-World Implementation: Image Processing Pipeline

Let me share a real implementation I built for a photo sharing platform.

Requirements:

  • Process user-uploaded images (resize, optimize, generate thumbnails)
  • Variable workload: 100-1000 uploads per hour
  • Process each image in ~2 minutes
  • Store results in S3

Architecture:

  • S3 bucket for uploads (with event notifications)
  • Lambda function to create SQS messages from S3 events (sketched below)
  • SQS queue for processing jobs
  • ECS service with auto-scaling (1-10 tasks)
  • S3 bucket for processed images
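
The Lambda that bridges S3 events to the queue is only a few lines. A minimal sketch; the QUEUE_URL environment variable name is my choice (wired up via Terraform), and the event shape is the standard S3 notification format:

import json
import os

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = os.environ["QUEUE_URL"]  # assumed env var pointing at the processing queue

def handler(event, context):
    """Turn each S3 ObjectCreated record into an SQS job message."""
    records = event.get("Records", [])
    for record in records:
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({
                "bucket": record["s3"]["bucket"]["name"],
                "key": record["s3"]["object"]["key"],
            }),
        )
    return {"queued": len(records)}

Each message body carries just the bucket and key, which is exactly what the worker below expects.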

The Application Code:

import os
import io
import json

import boto3
from PIL import Image

sqs = boto3.client('sqs')
s3 = boto3.client('s3')

QUEUE_URL = os.environ['SQS_QUEUE_URL']
OUTPUT_BUCKET = os.environ['OUTPUT_BUCKET']

def process_image(bucket, key):
    """Download, process, and upload image"""
    # Download original
    response = s3.get_object(Bucket=bucket, Key=key)
    img = Image.open(io.BytesIO(response['Body'].read()))

    # Generate sizes
    sizes = {
        'thumbnail': (150, 150),
        'medium': (800, 800),
        'large': (1920, 1920)
    }

    results = []
    for size_name, dimensions in sizes.items():
        # Resize maintaining aspect ratio
        img_resized = img.copy()
        img_resized.thumbnail(dimensions, Image.LANCZOS)

        # Convert to bytes
        buffer = io.BytesIO()
        img_resized.save(buffer, format='JPEG', quality=85, optimize=True)
        buffer.seek(0)

        # Upload to S3
        output_key = f"processed/{size_name}/{key}"
        s3.put_object(
            Bucket=OUTPUT_BUCKET,
            Key=output_key,
            Body=buffer,
            ContentType='image/jpeg'
        )

        results.append({
            'size': size_name,
            'key': output_key,
            'dimensions': img_resized.size
        })

    return results

def process_messages():
    """Main processing loop"""
    messages_processed = 0
    max_messages = int(os.environ.get('MAX_MESSAGES', '10'))

    while messages_processed < max_messages:
        # Long polling for efficiency
        response = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=20,
            VisibilityTimeout=300  # 5 minutes
        )

        messages = response.get('Messages', [])
        if not messages:
            print("No messages available, exiting")
            break

        for message in messages:
            try:
                # Parse message
                body = json.loads(message['Body'])
                bucket = body['bucket']
                key = body['key']

                print(f"Processing {key}...")

                # Process image
                results = process_image(bucket, key)

                # Delete message on success
                sqs.delete_message(
                    QueueUrl=QUEUE_URL,
                    ReceiptHandle=message['ReceiptHandle']
                )

                messages_processed += 1
                print(f"Processed {key}: {results}")

            except Exception as e:
                print(f"Error processing message: {e}")
                # Message will return to queue after visibility timeout
                # After 3 failures, it goes to DLQ

    print(f"Processed {messages_processed} messages")

Key Design Decisions:

  1. Long Polling (WaitTimeSeconds=20): Reduces empty receives, lowers costs, and increases efficiency
  2. Visibility Timeout (300s): Set to 2.5× average processing time (2 minutes × 2.5 = 5 minutes)
  3. Max Messages Limit: Task processes up to 10 messages then exits, allowing fresh tasks to start
  4. Delete Only on Success: Messages remain in queue if processing fails, triggering retries
  5. Graceful Error Handling: Exceptions are logged but don't crash the task

The Infrastructure: Terraform Configuration

Here's the complete auto-scaling configuration:

# SQS Queue
resource "aws_sqs_queue" "image_processing" {
  name                       = "image-processing-queue"
  visibility_timeout_seconds = 300
  message_retention_seconds  = 86400  # 24 hours
  receive_wait_time_seconds  = 20     # Long polling

  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.dlq.arn
    maxReceiveCount     = 3
  })
}

# Dead Letter Queue
resource "aws_sqs_queue" "dlq" {
  name                      = "image-processing-dlq"
  message_retention_seconds = 1209600  # 14 days
}

# ECS Service
resource "aws_ecs_service" "image_processor" {
  name            = "image-processor"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.processor.arn
  launch_type     = "FARGATE"
  desired_count   = 1  # Start with 1, let auto-scaling adjust

  network_configuration {
    subnets          = aws_subnet.private[*].id
    security_groups  = [aws_security_group.ecs_tasks.id]
    assign_public_ip = false
  }
}

# Auto Scaling Target
resource "aws_appautoscaling_target" "ecs_target" {
  max_capacity       = 10
  min_capacity       = 1
  resource_id        = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.image_processor.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

# Auto Scaling Policy
resource "aws_appautoscaling_policy" "scale_on_sqs" {
  name               = "scale-on-sqs-depth"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.ecs_target.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs_target.scalable_dimension
  service_namespace  = aws_appautoscaling_target.ecs_target.service_namespace

  target_tracking_scaling_policy_configuration {
    target_value       = 5.0
    scale_in_cooldown  = 300
    scale_out_cooldown = 60

    customized_metric_specification {
      metrics {
        label = "Queue depth"
        id    = "m1"
        metric_stat {
          metric {
            namespace   = "AWS/SQS"
            metric_name = "ApproximateNumberOfMessagesVisible"
            dimensions {
              name  = "QueueName"
              value = aws_sqs_queue.image_processing.name
            }
          }
          stat = "Average"
        }
        return_data = false
      }

      metrics {
        label = "Running task count"
        id    = "m2"
        metric_stat {
          metric {
            namespace   = "ECS/ContainerInsights"
            metric_name = "RunningTaskCount"
            dimensions {
              name  = "ServiceName"
              value = aws_ecs_service.image_processor.name
            }
            dimensions {
              name  = "ClusterName"
              value = aws_ecs_cluster.main.name
            }
          }
          stat = "Average"
        }
        return_data = false
      }

      metrics {
        label       = "Backlog per instance"
        id          = "e1"
        expression  = "m1 / m2"
        return_data = true
      }
    }
  }
}

Testing and Validation

I tested this setup with realistic workloads. Here's what I observed:

Test 1: Burst Processing

  • Sent 100 messages to queue
  • Started with 1 task
  • Within 90 seconds, scaled to 10 tasks
  • Processed all 100 images in 3.5 minutes
  • Scaled back to 1 task after 5-minute cooldown
  • Total time from first message to completion: ~9 minutes

Test 2: Steady Flow

  • Sent messages at 5/minute rate
  • Service maintained 1 task (perfect for this rate)
  • No unnecessary scaling
  • Average processing latency: 2 minutes

Test 3: Variable Workload

  • Alternated between 2/minute and 20/minute
  • Service scaled between 1-4 tasks
  • Adapted smoothly to changes
  • No message backlog

CloudWatch graph showing task count responding to queue depth over time

Cost Analysis: Real Numbers

Let's calculate the actual costs for processing 10,000 images per day:

Assumptions:

  • Average processing time: 2 minutes per image
  • Task size: 0.5 vCPU, 1 GB memory
  • Images spread throughout the day (not all at once)
  • Average of 3-4 tasks running at peak, 1 at off-peak

Fargate Costs:

Per hour rates:
vCPU: $0.04048 per vCPU-hour
Memory: $0.004445 per GB-hour

Per task per hour:
(0.5 × $0.04048) + (1 × $0.004445) = $0.024685

Total processing time per day:
10,000 images × 2 minutes = 20,000 minutes = 333 hours

Daily cost: 333 hours × $0.024685 = $8.22
Monthly cost: $8.22 × 30 = $246.60

Wait, that seems high! But note that 333 is task-hours of compute, not wall-clock time: spread evenly across a 24-hour day it would mean 333 / 24 ≈ 14 tasks running concurrently, which is already above our max of 10. In practice the work clusters into peaks and quieter periods rather than saturating every task all day, so a more realistic profile looks like this:

More realistic calculation:
Peak hours (8 hours/day): 6 tasks average
Off-peak (16 hours/day): 1-2 tasks average

Peak: 8 hours × 6 tasks × $0.024685 = $1.18
Off-peak: 16 hours × 1.5 tasks × $0.024685 = $0.59
Daily total: $1.77
Monthly: $1.77 × 30 = $53.10

Additional AWS costs:

  • SQS requests: 10,000 messages/day × 30 = 300,000/month = $0.12
  • S3 storage (processed images): ~$5/month
  • CloudWatch Logs: ~$2/month
  • Data transfer: ~$1/month

Total monthly cost: ~$61

Compare this to running EC2 instances 24/7:

  • To handle peak load (6 images/minute), you'd need 3× t3.medium instances
  • Cost: 3 × $30 = $90/month (with reserved instances)
  • With on-demand: 3 × $30 × 1.4 = $126/month

Savings: $29/month (32% reduction) versus reserved instances, or $65/month (52% reduction) versus on-demand

But more importantly: automatic scaling means you never drop requests during spikes, and you never pay for idle capacity during slow periods.

Dead Letter Queue: Your Safety Net

The DLQ deserves special attention because it's your early warning system for problems.

When messages go to DLQ:

  • Image file is corrupted
  • S3 permissions issue
  • Application bug causing exception
  • Timeout (processing takes longer than visibility timeout)

DLQ Configuration:

resource "aws_sqs_queue" "dlq" {
  name                      = "image-processing-dlq"
  message_retention_seconds = 1209600  # 14 days - longer than main queue
}

# Alarm when ANY message appears in DLQ
resource "aws_cloudwatch_metric_alarm" "dlq_alarm" {
  alarm_name          = "image-processing-dlq-messages"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "ApproximateNumberOfMessagesVisible"
  namespace           = "AWS/SQS"
  period              = 60
  statistic           = "Average"
  threshold           = 0
  alarm_description   = "Alert when messages land in DLQ"
  alarm_actions       = [aws_sns_topic.alerts.arn]

  dimensions = {
    QueueName = aws_sqs_queue.dlq.name
  }
}

Investigating DLQ Messages:

# View messages in DLQ
aws sqs receive-message \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789/image-processing-dlq \
  --max-number-of-messages 10

# Common reasons and solutions:
# 1. Corrupted images → Add validation before processing
# 2. Permissions → Check IAM role has S3 access
# 3. Timeout → Increase visibility timeout or optimize processing
# 4. Bug → Fix code and redrive messages
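
For the corrupted-image case, a cheap guard is to validate the file before doing any heavy work, so bad uploads are rejected up front with a clear log line instead of failing deep inside the resize code. A small sketch using Pillow:

import io

from PIL import Image, UnidentifiedImageError

def is_valid_image(data):
    """Return True if the bytes decode as an image Pillow understands."""
    try:
        Image.open(io.BytesIO(data)).verify()   # checks integrity without a full decode
        return True
    except (UnidentifiedImageError, OSError):
        return False

# In process_image(), before resizing:
# raw = response['Body'].read()
# if not is_valid_image(raw):
#     raise ValueError(f"{key} is not a valid image")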

Redriving Messages:

Once you've fixed the issue, you can redrive messages from DLQ back to the main queue:

import boto3

sqs = boto3.client('sqs')

dlq_url = 'https://sqs.us-east-1.amazonaws.com/123456789/image-processing-dlq'
main_queue_url = 'https://sqs.us-east-1.amazonaws.com/123456789/image-processing-queue'

# Move messages back
while True:
    response = sqs.receive_message(
        QueueUrl=dlq_url,
        MaxNumberOfMessages=10
    )

    messages = response.get('Messages', [])
    if not messages:
        break

    for message in messages:
        # Send to main queue
        sqs.send_message(
            QueueUrl=main_queue_url,
            MessageBody=message['Body']
        )

        # Delete from DLQ
        sqs.delete_message(
            QueueUrl=dlq_url,
            ReceiptHandle=message['ReceiptHandle']
        )

    print(f"Redrove {len(messages)} messages")

When to Use This Pattern

Perfect for:

  • Image/video processing
  • Document conversion and analysis
  • ETL pipelines with variable data volumes
  • Batch notifications/emails
  • Data transformation jobs
  • File processing triggered by S3 uploads
  • Any workload with unpredictable timing or volume

Not suitable for:

  • Real-time processing (sub-second latency requirements)
  • Jobs requiring strict ordering (use FIFO queues, but note scaling limitations)
  • Extremely high-throughput workloads (>100k messages/minute may need different approach)

Pattern 3: Advanced Scheduling with EKS Jobs and CronJobs

Now let's explore the Kubernetes approach. This pattern is more complex but offers capabilities that the previous two don't.

Why Choose EKS for Batch Processing?

First, let's be clear: EKS adds complexity. You're managing Kubernetes, which has a steep learning curve. The control plane costs $73/month before you even run a single workload.

So why would you choose this?

  1. You're already running EKS for other workloads
  2. You need advanced scheduling (dependencies, DAGs, complex workflows)
  3. You want built-in parallel processing with workqueue pattern
  4. You need multi-tenancy with namespace isolation
  5. You want portability (run same workloads on-prem, other clouds)

If none of these apply, stick with Pattern 2. But if you're already in the Kubernetes ecosystem, let's make it work for batch processing.

Architecture Overview

EKS architecture with Jobs, CronJobs, IRSA, and AWS services

Kubernetes Jobs Explained

A Kubernetes Job creates one or more Pods and ensures they run to completion. Unlike Deployments (which restart failed Pods indefinitely), Jobs track successful completions and stop when done.

Simple Job Example:

apiVersion: batch/v1
kind: Job
metadata:
  name: data-processing-job
spec:
  completions: 1      # Run once successfully
  backoffLimit: 3     # Retry up to 3 times on failure

  template:
    spec:
      restartPolicy: Never  # Don't restart on failure (Job handles retries)

      containers:
      - name: processor
        image: myregistry/data-processor:latest
        env:
        - name: JOB_TYPE
          value: "daily-aggregation"

        resources:
          requests:
            cpu: "500m"
            memory: "1Gi"
          limits:
            cpu: "1000m"
            memory: "2Gi"

Parallel Processing:

Here's where Jobs shine. You can process multiple items in parallel:

apiVersion: batch/v1
kind: Job
metadata:
  name: parallel-data-job
spec:
  completions: 100      # Process 100 batches
  parallelism: 10       # Run 10 Pods at once
  backoffLimit: 5

  template:
    spec:
      restartPolicy: Never

      containers:
      - name: processor
        image: myregistry/data-processor:latest
        env:
        - name: BATCH_SIZE
          value: "1000"
        # Each Pod processes one batch of 1000 items
        # 10 Pods process 10 batches simultaneously
        # Total: 100 batches = 100,000 items

This creates 10 Pods initially. As each Pod completes, a new one starts until all 100 completions are reached. You process 100,000 items with 10 workers running in parallel automatically.

CronJobs: Scheduled Kubernetes Jobs

CronJobs are Jobs that run on a schedule:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-report
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  timeZone: "America/New_York"  # Kubernetes 1.25+

  # Job history
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3

  # Concurrency control
  concurrencyPolicy: Forbid  # Don't run if previous job still running

  # Miss tolerance
  startingDeadlineSeconds: 300  # If we miss the schedule by >5 minutes, skip

  jobTemplate:
    spec:
      backoffLimit: 2

      template:
        spec:
          restartPolicy: OnFailure

          containers:
          - name: report-generator
            image: myregistry/report-generator:latest
            env:
            - name: REPORT_TYPE
              value: "daily-summary"

Real-World Example: ETL Pipeline with Parallel Processing

Let me share an ETL implementation I built for a data analytics platform.

Requirements:

  • Extract data from 50 different data sources nightly
  • Transform and load into data warehouse
  • Each source takes 5-15 minutes to process
  • Total data: ~500 GB per night

ECS Approach Issues:

  • Would need 50 sequential runs (6-12 hours total) OR
  • Complex orchestration with Step Functions
  • Difficult to track which sources succeeded/failed

EKS Solution:

apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-etl-{{ .Values.date }}
spec:
  completions: 50          # One per data source
  parallelism: 10          # Process 10 sources simultaneously
  completionMode: Indexed  # Required so each Pod gets a unique completion index (0-49)
  backoffLimit: 2

  template:
    metadata:
      labels:
        app: etl-processor
        date: "{{ .Values.date }}"

    spec:
      serviceAccountName: etl-processor-sa  # IRSA for AWS access

      containers:
      - name: etl
        image: myregistry/etl-processor:v2.1.0

        env:
        - name: DATA_SOURCE
          valueFrom:
            configMapKeyRef:
              name: etl-sources
              key: source-$(JOB_COMPLETION_INDEX)

        - name: TARGET_DATABASE
          valueFrom:
            secretKeyRef:
              name: warehouse-credentials
              key: connection-string

        - name: JOB_COMPLETION_INDEX
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']

        resources:
          requests:
            cpu: "2000m"
            memory: "4Gi"
          limits:
            cpu: "4000m"
            memory: "8Gi"

        volumeMounts:
        - name: temp-data
          mountPath: /tmp/etl

      volumes:
      - name: temp-data
        emptyDir:
          sizeLimit: "20Gi"

      nodeSelector:
        workload-type: batch  # Use specific node pool

ConfigMap with Data Sources:

apiVersion: v1
kind: ConfigMap
metadata:
  name: etl-sources
data:
  source-0: "s3://data/source-crm/"
  source-1: "s3://data/source-analytics/"
  source-2: "s3://data/source-warehouse/"
  # ... 47 more sources

This setup processes 10 sources in parallel. As each completes, a new Pod starts for the next source. Total time: ~90 minutes instead of 6-12 hours.
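
One practical note: as far as I know, Kubernetes only expands $(VAR) references in value, command, and args fields, so the $(JOB_COMPLETION_INDEX) inside the configMapKeyRef above won't be substituted. A simpler approach is to mount the etl-sources ConfigMap as a volume and let each worker pick its own source from the JOB_COMPLETION_INDEX variable that Indexed Jobs inject automatically. A sketch of that worker (the /etc/etl-sources mount path is my choice):

import os

def run_etl(source_uri, target_dsn):
    """Extract, transform, and load a single source (details omitted)."""
    print(f"Processing {source_uri} -> warehouse")

def main():
    index = os.environ["JOB_COMPLETION_INDEX"]        # injected automatically for Indexed Jobs
    target_dsn = os.environ["TARGET_DATABASE"]
    # ConfigMap mounted as a volume: each key (source-0 ... source-49) becomes a file
    with open(f"/etc/etl-sources/source-{index}") as f:
        source_uri = f.read().strip()
    run_etl(source_uri, target_dsn)

if __name__ == "__main__":
    main()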

Benefits:

  • Built-in parallelism (no custom orchestration)
  • Easy to track completions (kubectl get pods)
  • Automatic retries on failure
  • Simple to add/remove sources (edit ConfigMap)

IRSA: Secure AWS Access from Kubernetes

One challenge with EKS is accessing AWS services securely. IRSA (IAM Roles for Service Accounts) solves this elegantly.

Without IRSA (bad):

  • Store AWS credentials in Secrets
  • Credentials could leak
  • Difficult rotation
  • Shared credentials across workloads

With IRSA (good):

  • Each Service Account gets a unique IAM role
  • Temporary credentials injected automatically
  • No long-lived credentials
  • Fine-grained permissions per workload

Setup:

# IAM Role
resource "aws_iam_role" "etl_processor" {
  name = "eks-etl-processor-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Principal = {
        Federated = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:oidc-provider/${local.oidc_provider}"
      }
      Action = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringEquals = {
          "${local.oidc_provider}:sub" = "system:serviceaccount:default:etl-processor-sa"
        }
      }
    }]
  })
}

# IAM Policy
resource "aws_iam_role_policy" "etl_processor" {
  role = aws_iam_role.etl_processor.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "s3:GetObject",
          "s3:ListBucket"
        ]
        Resource = [
          "arn:aws:s3:::data/*",
          "arn:aws:s3:::data"
        ]
      },
      {
        Effect = "Allow"
        Action = [
          "rds:DescribeDBInstances"
        ]
        Resource = "*"
      }
    ]
  })
}

Kubernetes Service Account:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: etl-processor-sa
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789:role/eks-etl-processor-role

Your Pods automatically get temporary AWS credentials with only the permissions they need.
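
The nice part is that application code doesn't change at all. With the annotated service account, the AWS SDKs pick up the injected web-identity credentials on their own; a minimal illustration (the bucket matches the IAM policy above):

import boto3

# No access keys anywhere: boto3 reads AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE,
# which EKS injects into the Pod, and assumes the role bound to etl-processor-sa.
s3 = boto3.client("s3")
for obj in s3.list_objects_v2(Bucket="data", Prefix="source-crm/").get("Contents", []):
    print(obj["Key"])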

Cost Comparison

EKS Costs (for ETL example):

EKS Control Plane: $73/month

Worker Nodes (for batch processing):
- Use managed node group with auto-scaling
- t3.xlarge instances (4 vCPU, 16 GB) for batch workloads
- On-demand: $0.1664/hour × 24 × 30 = $120/month per node
- Reserved (1-year): ~$75/month per node
- Spot: ~$50/month per node

For ETL running nightly (90 minutes/day):
- 2 spot instances: $100/month
- Actually running: 90 min/day = 45 hours/month
- Effective cost: 45/720 × $100 = $6.25 for compute
- Control plane: $73
- Total: ~$79/month

BUT: Control plane is shared across all EKS workloads
If you're already running EKS, incremental cost is just the nodes: ~$6-50/month depending on spot vs reserved

When EKS Makes Sense:

  • You're already paying for EKS control plane
  • You need to run multiple batch workloads (share the $73 cost)
  • Complex workflows benefit from Kubernetes orchestration

When ECS is Better:

  • Standalone batch workloads
  • Simpler requirements
  • Team isn't familiar with Kubernetes

Monitoring and Observability

Regardless of which pattern you choose, monitoring is crucial. Here's what I've learned about effective monitoring for batch workloads.

Essential Metrics

For ECS + SQS Pattern:

  1. Queue Depth (ApproximateNumberOfMessagesVisible)

    • Alert if > 1000 for more than 10 minutes (backlog building)
    • Graph to see daily patterns
  2. Queue Age (ApproximateAgeOfOldestMessage)

    • Alert if > 30 minutes (processing too slow)
    • Indicates scaling issues or stuck messages
  3. Task Count (RunningTaskCount vs DesiredCount)

    • Verify auto-scaling is working
    • Alert if desired != running for > 5 minutes (scaling issues)
  4. DLQ Depth (ApproximateNumberOfMessagesVisible on DLQ)

    • Alert immediately on any messages (critical failures)
  5. Error Rate (custom metric from application logs)

    • Track success vs failure rate
    • Alert if >5% failure rate (see the publishing sketch below)
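
The first four come straight from AWS; the error rate is one you publish yourself from the worker. A small sketch that emits the ApplicationErrors metric the alarm later in this section watches (the CustomApp/Batch namespace is chosen to match that alarm):

import boto3

cloudwatch = boto3.client("cloudwatch")

def record_error(count=1):
    """Emit a custom error counter that CloudWatch alarms can watch."""
    cloudwatch.put_metric_data(
        Namespace="CustomApp/Batch",
        MetricData=[{
            "MetricName": "ApplicationErrors",
            "Value": count,
            "Unit": "Count",
        }],
    )

# In the worker's exception handler:
# except Exception as e:
#     record_error()
#     ...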

CloudWatch Dashboard

Here's a CloudWatch Dashboard configuration I use:

{
  "widgets": [
    {
      "type": "metric",
      "properties": {
        "metrics": [
          [ "AWS/SQS", "ApproximateNumberOfMessagesVisible", { "stat": "Average" } ],
          [ ".", "ApproximateNumberOfMessagesNotVisible", { "stat": "Average" } ]
        ],
        "view": "timeSeries",
        "region": "us-east-1",
        "title": "Queue Depth",
        "period": 300,
        "yAxis": { "left": { "min": 0 } }
      }
    },
    {
      "type": "metric",
      "properties": {
        "metrics": [
          [ "ECS/ContainerInsights", "RunningTaskCount", { "stat": "Average" } ],
          [ ".", "DesiredTaskCount", { "stat": "Average" } ]
        ],
        "view": "timeSeries",
        "title": "ECS Tasks",
        "period": 60
      }
    }
  ]
}

CloudWatch Logs Insights Queries

Find All Errors:

fields @timestamp, @message
| filter @message like /ERROR/
| stats count(*) as error_count by bin(5m)
| sort error_count desc

Average Processing Time:

fields @timestamp, @message
| filter @message like /Processing completed/
| parse @message "duration: * seconds" as duration
| stats avg(duration) as avg_time, max(duration) as max_time, min(duration) as min_time

Top Error Messages:

fields @message
| filter @message like /ERROR/
| parse @message "*ERROR*: *" as prefix, error_msg
| stats count(*) as occurrences by error_msg
| sort occurrences desc
| limit 10

Throughput Analysis:

fields @timestamp
| filter @message like /Messages Processed/
| parse @message "Messages Processed: *" as count
| stats sum(count) as total_processed by bin(1h)
| sort @timestamp

CloudWatch Alarms

Critical Alarms (page immediately):

# DLQ has messages
resource "aws_cloudwatch_metric_alarm" "dlq_messages" {
  alarm_name          = "batch-dlq-critical"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "ApproximateNumberOfMessagesVisible"
  namespace           = "AWS/SQS"
  period              = 60
  statistic           = "Average"
  threshold           = 0
  alarm_description   = "Messages in DLQ - immediate investigation needed"
  alarm_actions       = [aws_sns_topic.critical_alerts.arn]
}

# All tasks failing
resource "aws_cloudwatch_metric_alarm" "no_running_tasks" {
  alarm_name          = "batch-no-tasks-running"
  comparison_operator = "LessThanThreshold"
  evaluation_periods  = 3
  metric_name         = "RunningTaskCount"
  namespace           = "ECS/ContainerInsights"
  period              = 300
  statistic           = "Average"
  threshold           = 1
  alarm_description   = "No tasks running but queue has messages"
}

Warning Alarms (investigate soon):

# Old messages in queue
resource "aws_cloudwatch_metric_alarm" "old_messages" {
  alarm_name          = "batch-old-messages"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "ApproximateAgeOfOldestMessage"
  namespace           = "AWS/SQS"
  period              = 300
  statistic           = "Maximum"
  threshold           = 1800  # 30 minutes
  alarm_description   = "Messages stuck in queue"
  alarm_actions       = [aws_sns_topic.warnings.arn]
}

# High error rate
resource "aws_cloudwatch_metric_alarm" "high_error_rate" {
  alarm_name          = "batch-high-errors"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "ApplicationErrors"
  namespace           = "CustomApp/Batch"
  period              = 300
  statistic           = "Sum"
  threshold           = 10
  alarm_description   = "Too many application errors"
}

Production Best Practices

After running these patterns in production for several clients, here are my battle-tested recommendations.

1. Design for Idempotency

This is the most important practice. Your jobs will be retried (network issues, task failures, scaling events), and you need to handle that gracefully.

Bad (non-idempotent):

def process_order(order_id):
    # Get order
    order = db.get_order(order_id)

    # Update inventory
    db.execute("UPDATE inventory SET quantity = quantity - 1")

    # Charge customer
    stripe.charge(order.customer_id, order.amount)

    # Mark as processed
    db.execute("UPDATE orders SET processed = true")

If this fails after charging but before marking as processed, a retry will charge twice!

Good (idempotent):

def process_order(order_id):
    # Check if already processed
    order = db.get_order(order_id)
    if order.processed:
        logger.info(f"Order {order_id} already processed, skipping")
        return

    # Use transaction with optimistic locking
    with db.transaction():
        # Lock row
        order = db.get_order_for_update(order_id)

        if order.processed:
            return  # Another task already processed it

        # Idempotent inventory update
        result = db.execute(
            "UPDATE inventory SET quantity = quantity - 1 WHERE quantity > 0"
        )
        if result.rows_affected == 0:
            raise OutOfStockError()

        # Idempotent payment (use idempotency key)
        stripe.charge(
            order.customer_id,
            order.amount,
            idempotency_key=f"order-{order_id}"
        )

        # Mark processed
        db.execute(
            "UPDATE orders SET processed = true WHERE id = ?",
            order_id
        )

2. Set Appropriate Timeouts

SQS Visibility Timeout should be:

  • Comfortably longer than your worst-case (p99) processing time, so in-flight messages aren't processed twice
  • Generous enough to account for retries and variability
  • Short enough that genuinely failed messages become visible again and are retried promptly

Example:

  • Average processing: 2 minutes
  • 99th percentile: 5 minutes
  • 6 × the p99 = 30 minutes? Too long: a failed message would sit invisible for half an hour before retrying.
  • Better: 2 × the p99 = 10 minutes (plenty of headroom without delaying retries)
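
If only a handful of jobs genuinely run long, you don't have to inflate the timeout for everyone: the consumer can extend it for just the message it's working on. A sketch of that heartbeat pattern using the SQS ChangeMessageVisibility call (the helper name and interval are my own):

import threading

import boto3

sqs = boto3.client("sqs")

def keep_invisible(queue_url, receipt_handle, stop, extend_secs=600):
    """Periodically push the visibility timeout out while processing continues."""
    while not stop.wait(extend_secs / 2):             # heartbeat at half the extension
        sqs.change_message_visibility(
            QueueUrl=queue_url,
            ReceiptHandle=receipt_handle,
            VisibilityTimeout=extend_secs,
        )

# Usage around a long-running job:
# stop = threading.Event()
# threading.Thread(target=keep_invisible,
#                  args=(QUEUE_URL, message['ReceiptHandle'], stop),
#                  daemon=True).start()
# try:
#     process_message(message)
# finally:
#     stop.set()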

ECS Task Timeout:

# In task definition
"stopTimeout": 120  # Seconds to wait for graceful shutdown

Application Timeout:

# In your code (the timeout decorator is not a built-in; see the sketch below)
import os

PROCESSING_TIMEOUT = int(os.environ.get('TIMEOUT', '300'))  # 5 minutes

@timeout(PROCESSING_TIMEOUT)
def process_message(message):
    # Your processing logic
    pass
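
The @timeout decorator above isn't a standard-library feature; here's one minimal way to implement it with SIGALRM (Unix-only and main-thread-only, which is fine inside a Fargate container):

import functools
import signal

def timeout(seconds):
    """Fail a function with TimeoutError if it runs longer than `seconds`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            def _raise(signum, frame):
                raise TimeoutError(f"{func.__name__} exceeded {seconds}s")
            old_handler = signal.signal(signal.SIGALRM, _raise)
            signal.alarm(seconds)
            try:
                return func(*args, **kwargs)
            finally:
                signal.alarm(0)                          # cancel the pending alarm
                signal.signal(signal.SIGALRM, old_handler)
        return wrapper
    return decorator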

3. Implement Structured Logging

Logs are your best friend when debugging. Make them useful.

Bad:

print("Processing started")
print("Error occurred")

Good:

import logging
import json
import traceback
from datetime import datetime

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def log_structured(level, message, **kwargs):
    log_entry = {
        'timestamp': datetime.utcnow().isoformat(),
        'level': level,
        'message': message,
        **kwargs
    }
    logger.log(getattr(logging, level), json.dumps(log_entry))

# Usage
log_structured('INFO', 'Processing started',
               message_id=message_id,
               job_type='image_processing',
               input_size=file_size)

log_structured('ERROR', 'Processing failed',
               message_id=message_id,
               error=str(e),
               traceback=traceback.format_exc())

Now your CloudWatch Logs Insights queries can easily filter and aggregate:

fields @timestamp, message, job_type, error
| filter level = "ERROR"
| stats count(*) by job_type, error

4. Right-Size Your Tasks

Don't over-provision. Monitor actual usage and adjust.

Monitor:

# Check the CPU/memory a task was launched with (compare against Container Insights utilization)
aws ecs describe-tasks \
  --cluster my-cluster \
  --tasks task-id \
  --include TAGS \
  --query 'tasks[0].containers[0].{CPU:cpu,Memory:memory}'

CloudWatch Container Insights shows detailed metrics:

  • CPU utilization
  • Memory utilization
  • Network I/O

If you're using 100 MB but allocated 1024 MB, you're wasting ~90% of memory costs.

Start conservatively:

  • 256 CPU, 512 MB memory
  • Monitor for a week
  • Adjust based on actual usage
  • Leave 20-30% headroom for spikes
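
To make that adjustment data-driven rather than guesswork, you can pull the Container Insights numbers programmatically. A sketch with boto3; the metric and dimension names are the standard ECS/ContainerInsights ones, while the cluster and service names are placeholders:

from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

def p95_usage(metric, cluster, service, days=7):
    """p95 of a Container Insights metric (e.g. CpuUtilized, MemoryUtilized) over `days`."""
    resp = cloudwatch.get_metric_statistics(
        Namespace="ECS/ContainerInsights",
        MetricName=metric,
        Dimensions=[
            {"Name": "ClusterName", "Value": cluster},
            {"Name": "ServiceName", "Value": service},
        ],
        StartTime=datetime.utcnow() - timedelta(days=days),
        EndTime=datetime.utcnow(),
        Period=3600,
        ExtendedStatistics=["p95"],
    )
    points = resp["Datapoints"]
    return max(dp["ExtendedStatistics"]["p95"] for dp in points) if points else 0.0

print("p95 memory (MB):", p95_usage("MemoryUtilized", "my-cluster", "image-processor"))
print("p95 CPU (units):", p95_usage("CpuUtilized", "my-cluster", "image-processor"))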

5. Use Spot for Non-Critical Workloads

Fargate Spot can save up to 70% on compute costs.

When to use Spot:

  • Processing can tolerate 2-minute interruption notice
  • You have retry logic (SQS handles this automatically)
  • Not time-critical (interruptions add latency)

Configuration:

resource "aws_ecs_service" "batch_processor" {
  capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = 100  # 100% Spot
    base              = 0
  }

  # Or mix Spot and On-Demand
  capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = 70
  }

  capacity_provider_strategy {
    capacity_provider = "FARGATE"
    weight            = 30
    base              = 1  # At least 1 On-Demand task
  }
}

My recommendation: Use 100% Spot for batch processing. The interruption rate is low (<5% in my experience), and retries handle it gracefully.

6. Implement Circuit Breakers

If your downstream service is down, don't keep hammering it.

import time

class CircuitBreakerOpen(Exception):
    """Raised when the breaker is open and calls are rejected."""
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.last_failure_time = None
        self.state = 'CLOSED'  # CLOSED, OPEN, HALF_OPEN

    def call(self, func, *args, **kwargs):
        if self.state == 'OPEN':
            if time.time() - self.last_failure_time > self.timeout:
                self.state = 'HALF_OPEN'
            else:
                raise CircuitBreakerOpen("Service unavailable")

        try:
            result = func(*args, **kwargs)
            if self.state == 'HALF_OPEN':
                self.state = 'CLOSED'
                self.failure_count = 0
            return result
        except Exception as e:
            self.failure_count += 1
            self.last_failure_time = time.time()

            if self.failure_count >= self.failure_threshold:
                self.state = 'OPEN'

            raise

# Usage
db_breaker = CircuitBreaker(failure_threshold=5, timeout=60)

def process_message(message):
    try:
        data = db_breaker.call(database.query, message['query'])
        # Process data
    except CircuitBreakerOpen:
        # Don't delete message, let it retry later
        logger.warning("Circuit breaker open, skipping message")
        return False

7. Monitor Your Costs

Set up AWS Cost Anomaly Detection:

resource "aws_ce_anomaly_monitor" "batch_processing" {
  name              = "batch-processing-monitor"
  monitor_type      = "DIMENSIONAL"
  monitor_dimension = "SERVICE"
}

resource "aws_ce_anomaly_subscription" "batch_processing" {
  name      = "batch-cost-anomalies"
  frequency = "DAILY"

  monitor_arn_list = [
    aws_ce_anomaly_monitor.batch_processing.arn
  ]

  subscriber {
    type    = "EMAIL"
    address = "devops@example.com"
  }

  threshold_expression {
    dimension {
      key           = "ANOMALY_TOTAL_IMPACT_ABSOLUTE"
      values        = ["100"]  # Alert if anomaly > $100
      match_options = ["GREATER_THAN_OR_EQUAL"]
    }
  }
}

This alerts you if costs suddenly spike (e.g., infinite loop creating tasks).

Real-World Results and Lessons Learned

Let me share some actual results from production deployments.

Case Study 1: E-Commerce Image Processing

Client: Mid-size e-commerce platform
Workload: Process product images (1000-5000/day)
Previous Setup: 2× t3.medium EC2 instances running 24/7

Implementation: ECS + SQS auto-scaling pattern

Results:

  • Cost: $73/month → $22/month (70% reduction)
  • Performance: Same average latency, better p99 (auto-scaling handles spikes)
  • Operational: Zero manual scaling interventions in 6 months
  • Incidents: 1 DLQ alert (corrupted image upload, not our bug)

Lesson Learned: Set the visibility timeout properly. We initially set it to 2 minutes (our average processing time), which led to duplicate processing during retries. Changing it to 6 minutes solved the problem.

Case Study 2: Data Analytics ETL

Client: B2B SaaS analytics platform
Workload: Nightly ETL from 50 data sources
Previous Setup: Custom scheduling on EC2, sequential processing (8-10 hours)

Implementation: EKS Jobs with parallel processing

Results:

  • Processing Time: 8-10 hours → 90 minutes (80% reduction)
  • Cost: $200/month (always-on) → $85/month including EKS control plane
  • Reliability: Built-in retries eliminated manual interventions
  • Developer Experience: Much easier to add new data sources

Lesson Learned: Kubernetes has a learning curve, but parallel Job processing is powerful. For teams already using EKS, this pattern is excellent. For teams new to Kubernetes, the complexity might not be worth it; ECS with Step Functions could work instead.

Case Study 3: Document Generation SaaS

Client: Document automation startup
Workload: Generate PDFs from templates (variable: 10-1000/hour)
Previous Setup: Always-on containers, manual scaling

Implementation: ECS + SQS auto-scaling

Results:

  • Cost: $90/month → $35/month (61% reduction)
  • Scaling: Handles traffic spikes automatically
  • Time to Market: Deployed in 1 day vs 2 weeks for custom solution

Lesson Learned: Start simple. We initially designed a complex workflow with Step Functions. Realized SQS + auto-scaling solved 95% of use cases. Deployed much faster with simpler architecture.

Getting Started: Your Action Plan

Ready to implement one of these patterns? Here's your step-by-step action plan.

Step 1: Choose Your Pattern

Use Pattern 1 (ECS Scheduled) if:

  • You need time-based execution only
  • Workload is predictable and consistent
  • You want the simplest possible setup

Use Pattern 2 (ECS + SQS) if:

  • Workload varies throughout the day/week
  • You need event-driven processing
  • You want automatic scaling
  • This is my recommendation for 80% of use cases

Use Pattern 3 (EKS Jobs) if:

  • You're already running EKS
  • You need complex parallel processing
  • You have Kubernetes expertise
  • You need advanced orchestration

Step 2: Set Up Your Development Environment

# Install required tools

# Install AWS CLI (v2)
sudo apt update
sudo apt install -y unzip curl

curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
aws --version


# Configure AWS CLI
aws configure

# Install Terraform (latest)
sudo apt update && sudo apt install -y gnupg software-properties-common curl

curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] \
https://apt.releases.hashicorp.com $(lsb_release -cs) main" | \
sudo tee /etc/apt/sources.list.d/hashicorp.list

sudo apt update && sudo apt install -y terraform
terraform -version

# Install kubectl (latest stable)
sudo apt update && sudo apt install -y curl apt-transport-https gnupg

curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.30/deb/Release.key | \
sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg

echo "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] \
https://pkgs.k8s.io/core:/stable:/v1.30/deb/ /" | \
sudo tee /etc/apt/sources.list.d/kubernetes.list

sudo apt update && sudo apt install -y kubectl
kubectl version --client


# Clone the repository
git clone https://github.com/rifkhan107/aws-batch-processing-containers.git
cd aws-batch-processing-containers

Step 3: Deploy Pattern 2 (Recommended Starting Point)

# Navigate to pattern directory
cd ecs-sqs-autoscaling

# Deploy infrastructure
cd terraform
terraform init
terraform apply

# Note the outputs
ECR_URL=$(terraform output -raw ecr_repository_url)
QUEUE_URL=$(terraform output -raw sqs_queue_url)

# Build and push Docker image
cd ../docker
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin $ECR_URL

docker build -t batch-processor .
docker tag batch-processor:latest $ECR_URL:latest
docker push $ECR_URL:latest

# Update ECS service to use new image
cd ../terraform
aws ecs update-service \
  --cluster $(terraform output -raw ecs_cluster_name) \
  --service $(terraform output -raw ecs_service_name) \
  --force-new-deployment

# Test with sample messages
LAMBDA_FUNCTION=$(terraform output -raw lambda_function_name)
aws lambda invoke \
  --function-name $LAMBDA_FUNCTION \
  --payload '{"num_messages": 20}' \
  response.json

# Watch it scale
watch -n 5 "aws ecs describe-services \
  --cluster $(terraform output -raw ecs_cluster_name) \
  --services $(terraform output -raw ecs_service_name) \
  --query 'services[0].[desiredCount,runningCount]'"

Step 4: Customize for Your Use Case

  1. Modify the application (docker/app.py):

    • Replace processing logic with your actual work
    • Adjust environment variables
    • Add any dependencies to requirements.txt
  2. Tune auto-scaling (terraform/variables.tf):

    • Adjust target_messages_per_task (default: 5)
    • Set min_tasks and max_tasks based on your needs
    • Modify task CPU and memory
  3. Set up monitoring (monitoring/):

    • Deploy CloudWatch Dashboard
    • Configure alarms
    • Set up SNS notifications

Step 5: Test Thoroughly

# Send test load
for i in {1..5}; do
  aws lambda invoke \
    --function-name $LAMBDA_FUNCTION \
    --payload '{"num_messages": 50}' \
    response-$i.json
done

# Monitor CloudWatch Logs
aws logs tail /ecs/ecs-sqs-batch-sqs-processor --follow

# Check queue depth
aws sqs get-queue-attributes \
  --queue-url $QUEUE_URL \
  --attribute-names ApproximateNumberOfMessagesVisible

# Verify results in S3
aws s3 ls s3://$(terraform output -raw s3_bucket_name)/results/ --recursive

Step 6: Deploy to Production

Pre-production checklist:

  • [ ] All alarms configured and tested
  • [ ] DLQ monitoring in place
  • [ ] Task timeout appropriate for your workload
  • [ ] IAM roles follow least-privilege
  • [ ] Costs estimated and approved
  • [ ] Team trained on troubleshooting
  • [ ] Runbook documented

Deployment:

  1. Deploy to staging first
  2. Run production-like load tests
  3. Monitor for 1 week
  4. Gradually route production traffic
  5. Monitor costs and performance

Conclusion

Event-driven batch processing with containers represents a fundamental shift from traditional always-on infrastructure. Instead of paying for capacity you might need, you pay for capacity you actually use. Instead of manual scaling, you get automatic responsiveness to demand.

The three patterns I've shared (scheduled tasks, SQS auto-scaling, and EKS Jobs) cover the vast majority of batch processing use cases. For most teams, I recommend starting with Pattern 2 (ECS + SQS auto-scaling). It offers the best balance of cost savings, automatic scaling, and operational simplicity.

Key takeaways:

  1. Event-driven saves money (50-90% in real deployments)
  2. Auto-scaling eliminates operational burden (no more manual capacity management)
  3. Start simple (Pattern 2 handles 80% of use cases)
  4. Monitor everything (especially your DLQ)
  5. Design for idempotency (retries will happen)

The complete code, infrastructure templates, and deployment guides are available on GitHub. Everything is production-ready and battle-tested.

What will you build with this? I'd love to hear about your batch processing challenges and how these patterns help solve them. Find me on Twitter, LinkedIn, or open an issue on GitHub.

Happy building, and may your batch jobs scale automatically! 🚀

