When I need to process a very large dataset in parallel on AWS without standing up a whole batch platform, AWS Step Functions Distributed Map is one of the first tools I reach for.
It gives me a clean orchestration layer for large-scale batch jobs while still letting me control concurrency, retries, failure thresholds, and output handling in a way that is production-friendly.
In this post, I will walk through how I approach Distributed Map for large-scale batch workloads, with a focus on practical best practices I would actually use in a real system.
When I choose Distributed Map (and when I do not)
I typically consider Distributed Map when one or more of these are true:
- My dataset is too large to comfortably pass around as state input
- I would exceed normal inline `Map` limits or create a huge execution history
- I need high concurrency across many items
- My input already lives in S3 (CSV, JSON, object lists, etc.)
- I want a serverless orchestration layer and clear visibility into the run
I usually do not start with Distributed Map if:
- The batch is small and an inline `Map` or a single Lambda is enough
- I need heavy distributed compute semantics (for example, Spark-style transformations)
- My processing requires very long-running per-item tasks that are better suited to another service design
The key is to treat Distributed Map as an orchestration primitive for massively parallel item or batch processing, not as a replacement for every data platform.
Architecture at a glance
Figure 1. Reference architecture for large-scale batch processing with Step Functions Distributed Map (S3 input, batched child workflows, ResultWriter, and observability).
Core best practices I use in production
1) Start with a lower MaxConcurrency and scale up deliberately
This is the biggest mistake I see: people let the workflow fan out too aggressively before validating downstream capacity.
Distributed Map can run a very high number of child executions in parallel (AWS documents up to 10,000 parallel child workflow executions in distributed mode), which is great, but your downstream systems (Lambda concurrency, databases, APIs, third-party services) are usually the real bottleneck.
What I do instead:
- Start with a conservative `MaxConcurrency`
- Load test with representative input
- Watch downstream latency, throttling, and error rates
- Increase concurrency gradually
I treat MaxConcurrency as a safety valve, not just a performance knob.
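Before turning that knob up, I like to have a back-of-envelope target. Here is a rough sketch of the sizing heuristic I use, based on Little's law; this is my own rule of thumb, not an AWS formula, and the numbers are illustrative:

```python
# Back-of-envelope sizing for MaxConcurrency, assuming the downstream
# dependency (DB, API, etc.) is the real bottleneck.
def suggested_max_concurrency(downstream_rps: float,
                              avg_task_seconds: float,
                              safety_factor: float = 0.7) -> int:
    """Little's law: in-flight work = arrival rate * time in system.

    If the downstream system tolerates `downstream_rps` requests per second
    and each child task holds a request for `avg_task_seconds`, running more
    than rps * seconds children in parallel just queues or throttles.
    """
    raw = downstream_rps * avg_task_seconds
    return max(1, int(raw * safety_factor))

# Example: a database that tolerates 500 writes/s, tasks taking ~1.2 s each
print(suggested_max_concurrency(500, 1.2))  # 420
```

If the suggested number is far below what the job needs for acceptable runtime, that is usually a signal to scale the downstream dependency first, not to raise `MaxConcurrency` and hope.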
2) Use ItemBatcher for throughput and cost efficiency
If each item is tiny and the per-item work is short, processing one item per child execution can create unnecessary overhead.
In those cases, I use ItemBatcher to group items into batches and process each batch in one child execution.
Why this helps:
- Fewer child executions
- Fewer state transitions
- Better Lambda efficiency (especially for small records)
- Easier downstream write batching
The tradeoff is that my worker logic must handle Items[] input and return a sensible batch-level result.
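For reference, the batching knobs live in the Map state's `ItemBatcher` block. A minimal fragment of the shape I mean (values are illustrative; `MaxInputBytesPerBatch` and `BatchInput` are optional fields):

```json
"ItemBatcher": {
  "MaxItemsPerBatch": 100,
  "MaxInputBytesPerBatch": 262144,
  "BatchInput": {
    "jobId.$": "$.jobId"
  }
}
```

`BatchInput` is handy for passing shared context (like a job ID) to every batch without repeating it per item.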
3) Keep child outputs compact and use S3 for detailed results
A common anti-pattern is returning large payloads from every child execution. That creates pressure on Step Functions payload limits and makes debugging harder.
My default pattern is:
- Child workflow returns a compact summary (counts, status, pointers)
- Worker writes detailed outputs/errors to S3 (if needed)
- Distributed Map uses `ResultWriter` to export consolidated execution results
This keeps the orchestration layer lightweight while preserving traceability.
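A sketch of the compact-summary shape I mean, with the S3 write injected as a callable so the logic stays testable. The function and key layout are mine; in Lambda the injected callable would typically wrap `boto3`'s `put_object`:

```python
import json

def summarize_batch(job_id, batch_index, successes, failures,
                    details, s3_put=None):
    """Return a compact, pointer-style result for the child execution
    and push full diagnostics to S3 via the injected writer."""
    key = f"results/{job_id}/batch-{batch_index}.json"
    if s3_put is not None:
        # Full per-item diagnostics go to S3, not into the state output
        s3_put(key, json.dumps(details))
    return {
        "status": "FAILED" if failures and not successes else "OK",
        "successes": successes,
        "failures": failures,
        "detailsKey": key,  # pointer, not payload
    }
```

The child execution output stays a few hundred bytes regardless of batch size, while the S3 key gives operators a direct path to the details.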
4) Define explicit failure tolerance (ToleratedFailure*) for noisy datasets
Large batch workloads often contain some bad records. If I expect occasional malformed rows or downstream rejections, failing the entire Map Run on the first bad batch is usually too strict.
I set one of these based on business tolerance:
- `ToleratedFailurePercentage`
- `ToleratedFailureCount`
This lets me absorb known-noisy inputs while still failing fast when quality degrades beyond an acceptable threshold.
Important design point: these thresholds apply to failed/timed-out child workflow executions, so I make sure my child workflow raises failures intentionally when a batch truly cannot be processed.
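Before picking a percentage, I work out what it means in absolute terms for the expected run size. A quick sketch of that arithmetic (a helper of my own, not an AWS API; my reading is that the run fails once failures exceed the threshold, so this is the absorbable count):

```python
import math

def max_tolerated_failures(total_children: int, tolerated_pct: float) -> int:
    """How many failed/timed-out child executions a Map Run can absorb
    before the run as a whole fails, given ToleratedFailurePercentage."""
    return math.floor(total_children * tolerated_pct / 100.0)

# 50,000 items batched 100 per child => 500 child executions.
# With ToleratedFailurePercentage = 2, up to 10 children may fail.
print(max_tolerated_failures(500, 2))  # 10
```

Doing this translation makes the conversation with the business concrete: "we will ship even if up to 10 batches (roughly 1,000 records) are bad" is much easier to sign off than "2%".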
5) Put retries and error handling inside the child workflow
I prefer retries at the point of failure (inside the ItemProcessor states), not only at the parent level.
For example:
- Retry transient Lambda/service errors
- Catch and classify non-retryable failures
- Return a clear batch summary when possible
- Fail the child execution when the batch should count toward failure tolerance
This gives me better control over what counts as a retryable blip versus a real batch failure.
6) Choose child execution type intentionally (EXPRESS vs STANDARD)
Distributed Map runs inside a Standard parent state machine, but each child workflow can be EXPRESS or STANDARD.
My practical rule of thumb:
- Use EXPRESS child workflows for high-throughput, shorter-running item/batch tasks
- Use STANDARD child workflows when I need longer-running orchestration or richer execution semantics
I decide this early because it affects cost, runtime behavior, and observability patterns.
7) Design the worker to be idempotent
At scale, retries and reprocessing happen. I assume they will happen.
That means my batch worker should be idempotent, for example by:
- Using deterministic item IDs
- Recording processed item keys/checkpoints
- Using conditional writes (where supported)
- Writing outputs with deterministic object keys (or versioned prefixes)
This makes retries safe and reduces cleanup work.
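A minimal sketch of the pattern, with an in-memory dict standing in for the real idempotency store; in production this would typically be a DynamoDB conditional `PutItem` (`attribute_not_exists`) or an S3 write to a deterministic key. The helper names are mine:

```python
import hashlib
import json

# In-memory stand-in for an idempotency store; swap for DynamoDB or S3.
_store: dict = {}

def deterministic_key(record: dict) -> str:
    # Canonical serialization so the same logical record always hashes the same
    raw = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

def write_once(record: dict) -> bool:
    """Return True if this call performed the write, False if a previous
    attempt (e.g. a retried batch) already did."""
    key = deterministic_key(record)
    if key in _store:
        return False  # duplicate delivery: safe no-op
    _store[key] = record
    return True

rec = {"orderId": "o-123", "amount": 42}
print(write_once(rec))  # True  (first attempt writes)
print(write_once(rec))  # False (retry is a no-op)
```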
8) Treat quotas and throttling as part of the design, not an afterthought
With large-scale fan-out, service quotas and API throttling become part of normal engineering.
I usually validate:
- Step Functions quotas relevant to Distributed Map and Map Runs
- `StartExecution` API throttling expectations
- Lambda concurrency and reserved concurrency
- Downstream service write capacity or request limits
This saves a lot of pain later.
Example state machine (Distributed Map with batching and ResultWriter)
Below is a JSONPath-based example that shows a production-style shape I like to start from.
What this example demonstrates:
- S3-backed input via `ItemReader`
- `ItemBatcher` to process multiple records per child execution
- Controlled parallelism via `MaxConcurrency`
- Failure tolerance via `ToleratedFailurePercentage`
- `ResultWriter` to S3
- Retries and a clear failure path inside the child workflow
```json
{
  "Comment": "Large-scale batch processing with Distributed Map",
  "StartAt": "RunDistributedBatch",
  "States": {
    "RunDistributedBatch": {
      "Type": "Map",
      "Label": "BatchIngestV1",
      "ItemReader": {
        "ReaderConfig": {
          "InputType": "CSV",
          "CSVHeaderLocation": "FIRST_ROW"
        },
        "Resource": "arn:aws:states:::s3:getObject",
        "Parameters": {
          "Bucket.$": "$.input.bucket",
          "Key.$": "$.input.key"
        }
      },
      "ItemSelector": {
        "jobId.$": "$.jobId",
        "record.$": "$$.Map.Item.Value"
      },
      "ItemBatcher": {
        "MaxItemsPerBatch": 100
      },
      "MaxConcurrency": 300,
      "ToleratedFailurePercentage": 2,
      "ItemProcessor": {
        "ProcessorConfig": {
          "Mode": "DISTRIBUTED",
          "ExecutionType": "EXPRESS"
        },
        "StartAt": "ProcessBatch",
        "States": {
          "ProcessBatch": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "OutputPath": "$.Payload",
            "Parameters": {
              "FunctionName": "arn:aws:lambda:ap-southeast-2:123456789012:function:batch-worker",
              "Payload.$": "$"
            },
            "Retry": [
              {
                "ErrorEquals": [
                  "Lambda.ServiceException",
                  "Lambda.AWSLambdaException",
                  "Lambda.SdkClientException",
                  "States.TaskFailed"
                ],
                "IntervalSeconds": 2,
                "BackoffRate": 2.0,
                "MaxAttempts": 3
              }
            ],
            "Catch": [
              {
                "ErrorEquals": ["States.ALL"],
                "ResultPath": "$.error",
                "Next": "MarkBatchFailed"
              }
            ],
            "End": true
          },
          "MarkBatchFailed": {
            "Type": "Fail",
            "Error": "BatchProcessingFailed",
            "Cause": "Batch worker failed after retries"
          }
        }
      },
      "ResultWriter": {
        "WriterConfig": {
          "Transformation": "COMPACT",
          "OutputType": "JSONL"
        },
        "Resource": "arn:aws:states:::s3:putObject",
        "Parameters": {
          "Bucket.$": "$.resultWriter.bucket",
          "Prefix.$": "$.resultWriter.prefix"
        }
      },
      "End": true
    }
  }
}
```
Notes on this definition (why I like this shape)
- I use `ItemReader` so the parent execution input stays small
- I use `ItemBatcher` to reduce overhead for tiny records
- I set `MaxConcurrency` explicitly so I do not accidentally overwhelm downstream systems
- I use a small but non-zero failure tolerance only if the workload can tolerate some bad batches
- I use `ResultWriter` so the Map Run does not need to return a giant in-memory result array
Example batch worker (Python Lambda)
This is a simplified worker pattern I use for batched processing. The main ideas are:
- Accept `Items[]` from `ItemBatcher`
- Process each item safely
- Return a compact summary
- Fail the batch only when appropriate (so failure tolerance behaves meaningfully)
```python
import json
import hashlib
from typing import Dict, Any, List

def _stable_item_id(item: Dict[str, Any]) -> str:
    raw = json.dumps(item, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

def handler(event: Dict[str, Any], context) -> Dict[str, Any]:
    # When ItemBatcher is used, Step Functions passes batched input as {"Items": [...]}
    items: List[Dict[str, Any]] = event.get("Items", [])

    successes = 0
    failures = 0
    failed_items = []

    for item_wrapper in items:
        # In this example, ItemSelector wrapped the original row into
        # {"jobId": ..., "record": ...}
        record = item_wrapper.get("record", {})
        item_id = _stable_item_id(record)
        try:
            # Example validation (replace with real business logic)
            if not record:
                raise ValueError("Empty record")
            # TODO: process the record (write to DynamoDB/S3/etc.)
            # Make downstream writes idempotent (conditional write / deterministic key)
            successes += 1
        except Exception as exc:
            failures += 1
            failed_items.append({
                "itemId": item_id,
                "error": str(exc)
            })

    total = len(items)
    failure_ratio = (failures / total) if total else 0.0

    # Business decision:
    # fail the child execution only when the batch quality is unacceptable.
    if total == 0 or failure_ratio > 0.20 or successes == 0:
        raise RuntimeError(
            f"Batch failed: successes={successes}, failures={failures}, total={total}"
        )

    return {
        "status": "PARTIAL_SUCCESS" if failures else "SUCCESS",
        "totalItems": total,
        "successes": successes,
        "failures": failures,
        "failedItemSample": failed_items[:10]
    }
```
Worker implementation tips I strongly recommend
- Keep the response small
- Write detailed failure artifacts to S3 if you need full diagnostics
- Use deterministic keys / conditional writes for idempotency
- Separate retryable vs non-retryable exceptions in your code (if possible)
- Emit metrics (success count, failure count, latency) from the worker
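One way to sketch the retryable/non-retryable separation in the worker. The class names are my own; the useful property (as I understand Lambda's Python error handling) is that the raised class name surfaces as the error name in Step Functions, so `ErrorEquals` rules in the state machine can match it directly:

```python
class RetryableError(Exception):
    """Transient failure (throttling, timeout): safe to retry."""

class PermanentError(Exception):
    """Bad data or logic error: retrying will not help."""

def classify(exc: Exception) -> Exception:
    # Crude substring classification; in real code, match on exception
    # types or service error codes instead of message text.
    transient_markers = ("Throttling", "Timeout", "TooManyRequests", "503")
    msg = str(exc)
    if any(marker in msg for marker in transient_markers):
        return RetryableError(msg)
    return PermanentError(msg)

print(type(classify(Exception("ThrottlingException from DynamoDB"))).__name__)  # RetryableError
print(type(classify(Exception("Invalid date format in row 17"))).__name__)      # PermanentError
```

With this in place, a state-machine `Retry` rule matching `"RetryableError"` retries only the transient blips, while `PermanentError` flows straight into the failure path.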
One retry gotcha I always call out (important)
I am very careful with adding a Retry block on the Distributed Map state itself.
Why? Because AWS documents that if you define retriers on the Distributed Map state, the retry policy applies to all child workflow executions started by that Map state, not just the one that failed. In practice, that can create a lot of duplicate work and surprise costs if you are not expecting it.
My default approach:
- Put retries inside the child workflow first (fine-grained retries)
- Use Map-level `Retry` only when I intentionally want to rerun the whole Map state (and accept the Map Run restarting)
- Make sure idempotency is in place before enabling broader retries
Observability and operations checklist (what I watch)
When I productionize this pattern, I monitor the workflow at three levels:
A) Parent state machine
- Execution failures/timeouts
- End-to-end duration
- Throttling indicators (if any)
B) Map Run / child workflow level
- Child execution success/failure trends
- Failure rate relative to my tolerated threshold
- Backlog behavior when concurrency is constrained
C) Worker and downstream systems
- Lambda concurrency and duration
- Error rate and retries
- Database/API throttling
- Write latency and saturation
I also make sure the team has a runbook for:
- What to do when failure tolerance is exceeded
- How to inspect Map Run details and failing child executions
- When and how to redrive after fixing data or IAM issues
Common mistakes I try to avoid
1) Leaving MaxConcurrency effectively unlimited
This can look great in testing and then melt a downstream dependency in production.
2) Returning huge outputs from child workflows
This adds payload pressure and makes the state machine harder to operate.
3) Using batch sizes without thinking about item size
Small records and large records behave very differently. Batch by both count and (if needed) input bytes based on your real data.
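A local sketch of what "batch by both dimensions" means, mirroring the semantics of `ItemBatcher`'s `MaxItemsPerBatch` / `MaxInputBytesPerBatch` caps; the helper and the sizes are illustrative:

```python
import json

def batches_by_count_and_bytes(items, max_items=100, max_bytes=262144):
    """Yield batches capped by both item count and serialized byte size,
    so one oversized record cannot blow past the payload budget."""
    batch, batch_bytes = [], 0
    for item in items:
        size = len(json.dumps(item).encode("utf-8"))
        if batch and (len(batch) >= max_items or batch_bytes + size > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(item)
        batch_bytes += size
    if batch:
        yield batch

# Alternating tiny and large records: count is the binding cap here
rows = [{"id": i, "payload": "x" * (10 if i % 2 else 5000)} for i in range(10)]
print([len(b) for b in batches_by_count_and_bytes(rows, max_items=4, max_bytes=12000)])  # [4, 4, 2]
```

Running this against a sample of real production records quickly shows which cap binds first, which is exactly the input you need before setting the `ItemBatcher` values.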
4) Treating all failures the same
Malformed data, throttling, and transient AWS service errors should not all be handled identically.
5) Skipping idempotency
At scale, retries are normal. Idempotency is what keeps retries safe.
A practical rollout strategy I use
If I am introducing Distributed Map into an existing system, I usually do it in phases:
- Prototype with a small sample dataset
- Enable explicit `MaxConcurrency`
- Add `ItemBatcher`
- Add `ResultWriter` and compact outputs
- Load test with realistic record sizes
- Tune retries, batch size, and failure thresholds
- Document redrive/runbook steps
This sequence helps me avoid premature complexity while still landing on a production-grade design.
Final thoughts
Distributed Map is one of those features that becomes incredibly powerful once I stop thinking of it as “just a bigger map loop” and start treating it as a batch orchestration framework with explicit controls.
The best results usually come from combining:
- S3-backed input
- Deliberate concurrency limits
- Batched child processing
- Compact outputs + S3 result export
- Clear failure semantics
- Good observability and runbooks
That combination gives me something that scales well and is still operationally sane.
References
- AWS Step Functions: Distributed Map (Developer Guide)
- AWS Step Functions: Map state (inline vs distributed)
- AWS Step Functions: ItemReader (Map)
- AWS Step Functions: ItemBatcher (Map)
- AWS Step Functions: ResultWriter (Map)
- AWS Step Functions: Viewing a Distributed Map Run execution
- AWS Step Functions: Monitoring metrics with CloudWatch
- AWS Step Functions: Redriving Map Runs
- AWS Step Functions: Service quotas
- AWS Step Functions Workshop: Large-scale data processing
