When I need to process a very large dataset in parallel on AWS without standing up a whole batch platform, AWS Step Functions Distributed Map is one of the first tools I reach for.
It gives me a clean orchestration layer for large-scale batch jobs while still letting me control concurrency, retries, failure thresholds, and output handling in a way that is production-friendly.
In this post, I will walk through how I approach Distributed Map for large-scale batch workloads, with a focus on practical best practices I would actually use in a real system.
When I choose Distributed Map (and when I do not)
I typically consider Distributed Map when one or more of these are true:
- My dataset is too large to comfortably pass around as state input
- I would exceed normal inline `Map` limits or create a huge execution history
- I need high concurrency across many items
- My input already lives in S3 (CSV, JSON, object lists, etc.)
- I want a serverless orchestration layer and clear visibility into the run
I usually do not start with Distributed Map if:
- The batch is small and an inline `Map` or a single Lambda is enough
- I need heavy distributed compute semantics (for example, Spark-style transformations)
- My processing requires very long-running per-item tasks that are better suited to another service design
The key is to treat Distributed Map as an orchestration primitive for massively parallel item or batch processing, not as a replacement for every data platform.
Architecture at a glance
Figure 1. Reference architecture for large-scale batch processing with Step Functions Distributed Map (S3 input, batched child workflows, ResultWriter, and observability).
Core best practices I use in production
1) Start with a lower MaxConcurrency and scale up deliberately
This is the biggest mistake I see: people let the workflow fan out too aggressively before validating downstream capacity.
Distributed Map can run a very high number of child executions in parallel (AWS documents up to 10,000 parallel child workflow executions in distributed mode), which is great, but your downstream systems (Lambda concurrency, databases, APIs, third-party services) are usually the real bottleneck.
What I do instead:
- Start with a conservative `MaxConcurrency`
- Load test with representative input
- Watch downstream latency, throttling, and error rates
- Increase concurrency gradually
I treat MaxConcurrency as a safety valve, not just a performance knob.
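Before turning that knob up, I like to have a back-of-envelope target. Here is a rough sketch of the sizing heuristic I use, based on Little's law; this is my own rule of thumb, not an AWS formula, and the numbers are illustrative:

```python
# Back-of-envelope sizing for MaxConcurrency, assuming the downstream
# dependency (DB, API, etc.) is the real bottleneck.
def suggested_max_concurrency(downstream_rps: float,
                              avg_task_seconds: float,
                              safety_factor: float = 0.7) -> int:
    """Little's law: in-flight work = arrival rate * time in system.

    If the downstream system tolerates `downstream_rps` requests per second
    and each child task holds a request for `avg_task_seconds`, running more
    than rps * seconds children in parallel just queues or throttles.
    """
    raw = downstream_rps * avg_task_seconds
    return max(1, int(raw * safety_factor))

# Example: a database that tolerates 500 writes/s, tasks taking ~1.2 s each
print(suggested_max_concurrency(500, 1.2))  # 420
```

If the suggested number is far below what the job needs for acceptable runtime, that is usually a signal to scale the downstream dependency first, not to raise `MaxConcurrency` and hope.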
2) Use ItemBatcher for throughput and cost efficiency
If each item is tiny and the per-item work is short, processing one item per child execution can create unnecessary overhead.
In those cases, I use ItemBatcher to group items into batches and process each batch in one child execution.
Why this helps:
- Fewer child executions
- Fewer state transitions
- Better Lambda efficiency (especially for small records)
- Easier downstream write batching
The tradeoff is that my worker logic must handle Items[] input and return a sensible batch-level result.
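For reference, the batching knobs live in the Map state's `ItemBatcher` block. A minimal fragment of the shape I mean (values are illustrative; `MaxInputBytesPerBatch` and `BatchInput` are optional fields):

```json
"ItemBatcher": {
  "MaxItemsPerBatch": 100,
  "MaxInputBytesPerBatch": 262144,
  "BatchInput": {
    "jobId.$": "$.jobId"
  }
}
```

`BatchInput` is handy for passing shared context (like a job ID) to every batch without repeating it per item.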
3) Keep child outputs compact and use S3 for detailed results
A common anti-pattern is returning large payloads from every child execution. That creates pressure on Step Functions payload limits and makes debugging harder.
My default pattern is:
- Child workflow returns a compact summary (counts, status, pointers)
- Worker writes detailed outputs/errors to S3 (if needed)
- Distributed Map uses `ResultWriter` to export consolidated execution results
This keeps the orchestration layer lightweight while preserving traceability.
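A sketch of the compact-summary shape I mean, with the S3 write injected as a callable so the logic stays testable. The function and key layout are mine; in Lambda the injected callable would typically wrap `boto3`'s `put_object`:

```python
import json

def summarize_batch(job_id, batch_index, successes, failures,
                    details, s3_put=None):
    """Return a compact, pointer-style result for the child execution
    and push full diagnostics to S3 via the injected writer."""
    key = f"results/{job_id}/batch-{batch_index}.json"
    if s3_put is not None:
        # Full per-item diagnostics go to S3, not into the state output
        s3_put(key, json.dumps(details))
    return {
        "status": "FAILED" if failures and not successes else "OK",
        "successes": successes,
        "failures": failures,
        "detailsKey": key,  # pointer, not payload
    }
```

The child execution output stays a few hundred bytes regardless of batch size, while the S3 key gives operators a direct path to the details.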
4) Define explicit failure tolerance (ToleratedFailure*) for noisy datasets
Large batch workloads often contain some bad records. If I expect occasional malformed rows or downstream rejections, failing the entire Map Run on the first bad batch is usually too strict.
I set one of these based on business tolerance:
- `ToleratedFailurePercentage`
- `ToleratedFailureCount`
This lets me absorb known-noisy inputs while still failing fast when quality degrades beyond an acceptable threshold.
Important design point: these thresholds apply to failed/timed-out child workflow executions, so I make sure my child workflow raises failures intentionally when a batch truly cannot be processed.
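Before picking a percentage, I work out what it means in absolute terms for the expected run size. A quick sketch of that arithmetic (a helper of my own, not an AWS API; my reading is that the run fails once failures exceed the threshold, so this is the absorbable count):

```python
import math

def max_tolerated_failures(total_children: int, tolerated_pct: float) -> int:
    """How many failed/timed-out child executions a Map Run can absorb
    before the run as a whole fails, given ToleratedFailurePercentage."""
    return math.floor(total_children * tolerated_pct / 100.0)

# 50,000 items batched 100 per child => 500 child executions.
# With ToleratedFailurePercentage = 2, up to 10 children may fail.
print(max_tolerated_failures(500, 2))  # 10
```

Doing this translation makes the conversation with the business concrete: "we will ship even if up to 10 batches (roughly 1,000 records) are bad" is much easier to sign off than "2%".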
5) Put retries and error handling inside the child workflow
I prefer retries at the point of failure (inside the ItemProcessor states), not only at the parent level.
For example:
- Retry transient Lambda/service errors
- Catch and classify non-retryable failures
- Return a clear batch summary when possible
- Fail the child execution when the batch should count toward failure tolerance
This gives me better control over what counts as a retryable blip versus a real batch failure.
6) Choose child execution type intentionally (EXPRESS vs STANDARD)
Distributed Map runs inside a Standard parent state machine, but each child workflow can be EXPRESS or STANDARD.
My practical rule of thumb:
- Use EXPRESS child workflows for high-throughput, shorter-running item/batch tasks
- Use STANDARD child workflows when I need longer-running orchestration or richer execution semantics
I decide this early because it affects cost, runtime behavior, and observability patterns.
7) Design the worker to be idempotent
At scale, retries and reprocessing happen. I assume they will happen.
That means my batch worker should be idempotent, for example by:
- Using deterministic item IDs
- Recording processed item keys/checkpoints
- Using conditional writes (where supported)
- Writing outputs with deterministic object keys (or versioned prefixes)
This makes retries safe and reduces cleanup work.
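A minimal sketch of the pattern, with an in-memory dict standing in for the real idempotency store; in production this would typically be a DynamoDB conditional `PutItem` (`attribute_not_exists`) or an S3 write to a deterministic key. The helper names are mine:

```python
import hashlib
import json

# In-memory stand-in for an idempotency store; swap for DynamoDB or S3.
_store: dict = {}

def deterministic_key(record: dict) -> str:
    # Canonical serialization so the same logical record always hashes the same
    raw = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

def write_once(record: dict) -> bool:
    """Return True if this call performed the write, False if a previous
    attempt (e.g. a retried batch) already did."""
    key = deterministic_key(record)
    if key in _store:
        return False  # duplicate delivery: safe no-op
    _store[key] = record
    return True

rec = {"orderId": "o-123", "amount": 42}
print(write_once(rec))  # True  (first attempt writes)
print(write_once(rec))  # False (retry is a no-op)
```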
8) Treat quotas and throttling as part of the design, not an afterthought
With large-scale fan-out, service quotas and API throttling become part of normal engineering.
I usually validate:
- Step Functions quotas relevant to Distributed Map and Map Runs
- `StartExecution` API throttling expectations
- Lambda concurrency and reserved concurrency
- Downstream service write capacity or request limits
This saves a lot of pain later.
Example state machine (Distributed Map with batching and ResultWriter)
Below is a JSONPath-based example that shows a production-style shape I like to start from.
What this example demonstrates:
- S3-backed input via `ItemReader`
- `ItemBatcher` to process multiple records per child execution
- Controlled parallelism via `MaxConcurrency`
- Failure tolerance via `ToleratedFailurePercentage`
- `ResultWriter` to S3
- Retries and a clear failure path inside the child workflow
```json
{
  "Comment": "Large-scale batch processing with Distributed Map",
  "StartAt": "RunDistributedBatch",
  "States": {
    "RunDistributedBatch": {
      "Type": "Map",
      "Label": "BatchIngestV1",
      "ItemReader": {
        "ReaderConfig": {
          "InputType": "CSV",
          "CSVHeaderLocation": "FIRST_ROW"
        },
        "Resource": "arn:aws:states:::s3:getObject",
        "Parameters": {
          "Bucket.$": "$.input.bucket",
          "Key.$": "$.input.key"
        }
      },
      "ItemSelector": {
        "jobId.$": "$.jobId",
        "record.$": "$$.Map.Item.Value"
      },
      "ItemBatcher": {
        "MaxItemsPerBatch": 100
      },
      "MaxConcurrency": 300,
      "ToleratedFailurePercentage": 2,
      "ItemProcessor": {
        "ProcessorConfig": {
          "Mode": "DISTRIBUTED",
          "ExecutionType": "EXPRESS"
        },
        "StartAt": "ProcessBatch",
        "States": {
          "ProcessBatch": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "OutputPath": "$.Payload",
            "Parameters": {
              "FunctionName": "arn:aws:lambda:ap-southeast-2:123456789012:function:batch-worker",
              "Payload.$": "$"
            },
            "Retry": [
              {
                "ErrorEquals": [
                  "Lambda.ServiceException",
                  "Lambda.AWSLambdaException",
                  "Lambda.SdkClientException",
                  "States.TaskFailed"
                ],
                "IntervalSeconds": 2,
                "BackoffRate": 2.0,
                "MaxAttempts": 3
              }
            ],
            "Catch": [
              {
                "ErrorEquals": ["States.ALL"],
                "ResultPath": "$.error",
                "Next": "MarkBatchFailed"
              }
            ],
            "End": true
          },
          "MarkBatchFailed": {
            "Type": "Fail",
            "Error": "BatchProcessingFailed",
            "Cause": "Batch worker failed after retries"
          }
        }
      },
      "ResultWriter": {
        "WriterConfig": {
          "Transformation": "COMPACT",
          "OutputType": "JSONL"
        },
        "Resource": "arn:aws:states:::s3:putObject",
        "Parameters": {
          "Bucket.$": "$.resultWriter.bucket",
          "Prefix.$": "$.resultWriter.prefix"
        }
      },
      "End": true
    }
  }
}
```
Notes on this definition (why I like this shape)
- I use `ItemReader` so the parent execution input stays small
- I use `ItemBatcher` to reduce overhead for tiny records
- I set `MaxConcurrency` explicitly so I do not accidentally overwhelm downstream systems
- I use a small but non-zero failure tolerance only if the workload can tolerate some bad batches
- I use `ResultWriter` so the Map Run does not need to return a giant in-memory result array
Example batch worker (Python Lambda)
This is a simplified worker pattern I use for batched processing. The main ideas are:
- Accept `Items[]` from `ItemBatcher`
- Process each item safely
- Return a compact summary
- Fail the batch only when appropriate (so failure tolerance behaves meaningfully)
```python
import json
import hashlib
from typing import Dict, Any, List

def _stable_item_id(item: Dict[str, Any]) -> str:
    raw = json.dumps(item, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

def handler(event: Dict[str, Any], context) -> Dict[str, Any]:
    # When ItemBatcher is used, Step Functions passes batched input as {"Items": [...]}
    items: List[Dict[str, Any]] = event.get("Items", [])

    successes = 0
    failures = 0
    failed_items = []

    for item_wrapper in items:
        # In this example, ItemSelector wrapped the original row into
        # {"jobId": ..., "record": ...}
        record = item_wrapper.get("record", {})
        item_id = _stable_item_id(record)
        try:
            # Example validation (replace with real business logic)
            if not record:
                raise ValueError("Empty record")
            # TODO: process the record (write to DynamoDB/S3/etc.)
            # Make downstream writes idempotent (conditional write / deterministic key)
            successes += 1
        except Exception as exc:
            failures += 1
            failed_items.append({
                "itemId": item_id,
                "error": str(exc)
            })

    total = len(items)
    failure_ratio = (failures / total) if total else 0.0

    # Business decision:
    # fail the child execution only when the batch quality is unacceptable.
    if total == 0 or failure_ratio > 0.20 or successes == 0:
        raise RuntimeError(
            f"Batch failed: successes={successes}, failures={failures}, total={total}"
        )

    return {
        "status": "PARTIAL_SUCCESS" if failures else "SUCCESS",
        "totalItems": total,
        "successes": successes,
        "failures": failures,
        "failedItemSample": failed_items[:10]
    }
```
Worker implementation tips I strongly recommend
- Keep the response small
- Write detailed failure artifacts to S3 if you need full diagnostics
- Use deterministic keys / conditional writes for idempotency
- Separate retryable vs non-retryable exceptions in your code (if possible)
- Emit metrics (success count, failure count, latency) from the worker
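One way to sketch the retryable/non-retryable separation in the worker. The class names are my own; the useful property (as I understand Lambda's Python error handling) is that the raised class name surfaces as the error name in Step Functions, so `ErrorEquals` rules in the state machine can match it directly:

```python
class RetryableError(Exception):
    """Transient failure (throttling, timeout): safe to retry."""

class PermanentError(Exception):
    """Bad data or logic error: retrying will not help."""

def classify(exc: Exception) -> Exception:
    # Crude substring classification; in real code, match on exception
    # types or service error codes instead of message text.
    transient_markers = ("Throttling", "Timeout", "TooManyRequests", "503")
    msg = str(exc)
    if any(marker in msg for marker in transient_markers):
        return RetryableError(msg)
    return PermanentError(msg)

print(type(classify(Exception("ThrottlingException from DynamoDB"))).__name__)  # RetryableError
print(type(classify(Exception("Invalid date format in row 17"))).__name__)      # PermanentError
```

With this in place, a state-machine `Retry` rule matching `"RetryableError"` retries only the transient blips, while `PermanentError` flows straight into the failure path.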
One retry gotcha I always call out (important)
I am very careful with adding a Retry block on the Distributed Map state itself.
Why? Because AWS documents that if you define retriers on the Distributed Map state, the retry policy applies to all child workflow executions started by that Map state, not just the one that failed. In practice, that can create a lot of duplicate work and surprise costs if you are not expecting it.
My default approach:
- Put retries inside the child workflow first (fine-grained retries)
- Use Map-level `Retry` only when I intentionally want to rerun the whole Map state (and accept the Map Run restarting)
- Make sure idempotency is in place before enabling broader retries
Observability and operations checklist (what I watch)
When I productionize this pattern, I monitor the workflow at three levels:
A) Parent state machine
- Execution failures/timeouts
- End-to-end duration
- Throttling indicators (if any)
B) Map Run / child workflow level
- Child execution success/failure trends
- Failure rate relative to my tolerated threshold
- Backlog behavior when concurrency is constrained
C) Worker and downstream systems
- Lambda concurrency and duration
- Error rate and retries
- Database/API throttling
- Write latency and saturation
I also make sure the team has a runbook for:
- What to do when failure tolerance is exceeded
- How to inspect Map Run details and failing child executions
- When and how to redrive after fixing data or IAM issues
Common mistakes I try to avoid
1) Leaving MaxConcurrency effectively unlimited
This can look great in testing and then melt a downstream dependency in production.
2) Returning huge outputs from child workflows
This adds payload pressure and makes the state machine harder to operate.
3) Using batch sizes without thinking about item size
Small records and large records behave very differently. Batch by both count and (if needed) input bytes based on your real data.
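A local sketch of what "batch by both dimensions" means, mirroring the semantics of `ItemBatcher`'s `MaxItemsPerBatch` / `MaxInputBytesPerBatch` caps; the helper and the sizes are illustrative:

```python
import json

def batches_by_count_and_bytes(items, max_items=100, max_bytes=262144):
    """Yield batches capped by both item count and serialized byte size,
    so one oversized record cannot blow past the payload budget."""
    batch, batch_bytes = [], 0
    for item in items:
        size = len(json.dumps(item).encode("utf-8"))
        if batch and (len(batch) >= max_items or batch_bytes + size > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(item)
        batch_bytes += size
    if batch:
        yield batch

# Alternating tiny and large records: count is the binding cap here
rows = [{"id": i, "payload": "x" * (10 if i % 2 else 5000)} for i in range(10)]
print([len(b) for b in batches_by_count_and_bytes(rows, max_items=4, max_bytes=12000)])  # [4, 4, 2]
```

Running this against a sample of real production records quickly shows which cap binds first, which is exactly the input you need before setting the `ItemBatcher` values.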
4) Treating all failures the same
Malformed data, throttling, and transient AWS service errors should not all be handled identically.
5) Skipping idempotency
At scale, retries are normal. Idempotency is what keeps retries safe.
A practical rollout strategy I use
If I am introducing Distributed Map into an existing system, I usually do it in phases:
- Prototype with a small sample dataset
- Enable explicit `MaxConcurrency`
- Add `ItemBatcher`
- Add `ResultWriter` and compact outputs
- Load test with realistic record sizes
- Tune retries, batch size, and failure thresholds
- Document redrive/runbook steps
This sequence helps me avoid premature complexity while still landing on a production-grade design.
Final thoughts
Distributed Map is one of those features that becomes incredibly powerful once I stop thinking of it as “just a bigger map loop” and start treating it as a batch orchestration framework with explicit controls.
The best results usually come from combining:
- S3-backed input
- Deliberate concurrency limits
- Batched child processing
- Compact outputs + S3 result export
- Clear failure semantics
- Good observability and runbooks
That combination gives me something that scales well and is still operationally sane.
References
- AWS Step Functions: Distributed Map (Developer Guide)
- AWS Step Functions: Map state (inline vs distributed)
- AWS Step Functions: ItemReader (Map)
- AWS Step Functions: ItemBatcher (Map)
- AWS Step Functions: ResultWriter (Map)
- AWS Step Functions: Viewing a Distributed Map Run execution
- AWS Step Functions: Monitoring metrics with CloudWatch
- AWS Step Functions: Redriving Map Runs
- AWS Step Functions: Service quotas
- AWS Step Functions Workshop: Large-scale data processing
