There is a point in many serverless platforms where a Step Functions workflow that once felt elegant starts to feel like a mini application platform of its own.
I have seen this happen in teams that are doing many things correctly: they standardized orchestration, they improved visibility, and they moved fragile glue logic out of Lambdas. Then six months later, the workflow has 100+ states, a maze of Choice branches, deeply nested payload transformations, and a deployment blast radius that makes everyone nervous.
This post is about recognizing workflow sprawl early and decomposing a Step Functions workflow into a more maintainable architecture without losing the benefits of orchestration.
I will cover:
- Signs of workflow sprawl
- Splitting by domain and subprocess boundaries
- Parent-child workflow patterns
- Contracting inputs and outputs
- Versioning workflows safely
- An end-to-end walkthrough with architecture and code
- Implementation discussion and migration guidance
I will use AWS Step Functions terminology throughout, but the architectural thinking applies broadly to workflow systems.
Why this matters
A large workflow is not automatically a bad workflow.
In fact, I often start with a single orchestration when I want to make the business process visible quickly. The problem is not “too many states” by itself. The problem is when a workflow stops reflecting a coherent business flow and instead becomes:
- a catch-all for multiple domains
- a deployment bottleneck
- a fragile contract hub
- a place where teams are afraid to change anything
At that point, I treat it like I would a code monolith that has outgrown its boundaries: decompose intentionally, not reactively.
What I mean by a "Step Function monolith"
For this post, a Step Functions workflow becomes a monolith when one state machine accumulates responsibilities that should be owned by separate domains or subprocesses.
Typical symptoms include:
- Order orchestration, payment rules, inventory logic, fraud checks, and notifications all embedded in one ASL definition
- Repeated transformation states to make one team's output fit another team's input
- Error handling branches duplicated across unrelated parts of the flow
- A single workflow release requiring coordination across multiple teams
This is not just a readability issue. It affects operability, testing, and change safety.
Signs of workflow sprawl
These are the patterns I look for during architecture reviews.
1) One workflow owns too many domains
If a single state machine is enforcing rules that belong to Payments, Inventory, Fraud, Fulfillment, and Notifications, it is likely doing too much.
A good orchestrator should coordinate domains, not absorb their internal logic.
2) The ASL definition becomes hard to reason about
Signs include:
- many long Choice chains
- repeated Pass/transform states just to reshape data
- large Catch and Retry blocks copied across multiple branches
- difficulty tracing the happy path from start to finish
If I need a map just to explain the workflow in a design review, decomposition is usually overdue.
3) Payloads become "workflow-shaped" instead of domain-shaped
A common smell is a giant state payload that keeps growing because every future step might need something.
Symptoms:
- many fields carried "just in case"
- internal step-specific fields leaking into later steps
- brittle JSONPath references across distant states
- accidental coupling to intermediate output shapes
This is often the strongest signal that input/output contracts need to be tightened.
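For contrast, here is an exaggerated, hypothetical example of a workflow-shaped payload; every field name below is invented to illustrate the smell, and the contracted envelope shown later in this post is the shape I aim for instead:
{
  "order": { "orderId": "ORD-100045", "items": ["..."] },
  "fraudInternalScoreBreakdown": "internal to the Fraud domain, leaked downstream",
  "pspRawAuthResponse": "provider-specific fields that distant states now reference by JSONPath",
  "shippingQuoteScratch": "carried 'just in case' a later step wants it",
  "stepSevenTempOutput": "only meaningful to one state, visible to all of them"
}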
4) Change blast radius is too large
If a small payment change forces re-testing the full order pipeline end-to-end, you are paying monolith tax in a serverless system.
I watch for:
- frequent merge conflicts in the same workflow definition
- unrelated teams blocking each other
- release windows for “workflow changes”
- fear of touching central error paths
5) Execution histories are huge and troubleshooting is slow
When executions become long and noisy, step histories are harder to navigate. Standard workflow executions also cap execution history at 25,000 events, so a sprawling workflow with heavy retries can approach hard platform limits. Even when the workflow is functionally correct, operator experience degrades.
This matters during incidents. The fastest diagnosis usually comes from clear orchestration boundaries and localized subprocess execution histories.
6) Reuse pressure leads to copy/paste orchestration
If teams are duplicating chunks of states for common subprocesses (for example, document validation, payment authorization, fraud scoring), that is a strong indicator those chunks should become child workflows.
7) Mixed execution profiles are forced into one workflow
Examples:
- a mostly synchronous checkout path mixed with long-running fulfillment polling
- high-throughput lightweight paths mixed with complex human approval steps
- latency-sensitive branches mixed with eventual-consistency branches
These often want different execution patterns, retry policies, and operational ownership.
Decomposition principles I use
When I decompose a Step Functions workflow, I do not split it by "number of states." I split it by architectural responsibility.
Principle 1: Keep the parent workflow focused on orchestration decisions
The parent should answer questions like:
- Which subprocess runs next?
- Should we continue or compensate?
- What is the overall status?
- Which events should be emitted?
It should not implement deep domain logic that belongs in a domain-owned subprocess.
Principle 2: Split by domain or stable subprocess boundary
Great candidates for child workflows are subprocesses that are:
- domain-owned (Payments, KYC, Inventory)
- reusable across multiple parent workflows
- likely to evolve independently
- complex enough to justify dedicated retries/error handling
- testable as a standalone business unit
Principle 3: Define explicit input and output contracts
Do not pass the entire parent state to every child.
Instead, define:
- a minimal child input contract
- a stable child output contract
- an error/failure contract (where applicable)
- version metadata in the contract or state machine aliasing strategy
This is the workflow equivalent of well-designed service APIs.
Principle 4: Decompose to reduce blast radius, not to maximize nesting
Nested workflows are powerful, but over-nesting can create its own complexity.
I avoid decomposition that creates:
- wrappers around trivial single-step tasks
- nested workflows with no clear ownership
- chains of parent -> child -> grandchild just for aesthetics
The goal is better changeability and operability, not "micro-workflows everywhere."
Principle 5: Preserve the business narrative
After decomposition, I still want to be able to explain the parent workflow in plain language.
For example:
Validate order -> Process payment -> Reserve inventory -> Create shipment -> Notify customer
If the parent becomes an opaque set of “InvokeChildX” states with no business story, the design needs refinement.
Parent-child workflow patterns
There is no single nesting pattern that fits every case. I typically use a small set of patterns and choose deliberately.
Pattern A: Synchronous child workflow (request/response style orchestration)
The parent waits for the child to finish and uses the output immediately.
Use when:
- the next parent decision depends on child output
- the subprocess is part of the critical path
- you want localized retries inside the child workflow
Examples:
- payment authorization
- fraud decision
- document validation
Pattern B: Asynchronous child workflow (fire and track)
The parent starts a child workflow and continues later based on an event, callback, or polling strategy.
Use when:
- the subprocess is long-running
- an external system controls timing
- human approval or batch windows are involved
Examples:
- fulfillment handoff
- partner settlement
- manual review
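For the callback variant of this pattern, Step Functions supports passing a task token when starting a child execution; the child (or a worker it hands off to) later completes the token with SendTaskSuccess. A minimal state fragment as a sketch, with a placeholder alias ARN, a day-long timeout as an example, and input fields following the envelope introduced in the contracts section below:
"StartManualReviewChild": {
  "Type": "Task",
  "Resource": "arn:aws:states:::states:startExecution.waitForTaskToken",
  "Parameters": {
    "StateMachineArn": "${ManualReviewWorkflowAliasArn}",
    "Input": {
      "taskToken.$": "$$.Task.Token",
      "meta.$": "$.reviewCall.meta",
      "request.$": "$.reviewCall.request"
    }
  },
  "TimeoutSeconds": 86400,
  "ResultPath": "$.reviewResult",
  "Next": "ReviewDecision"
}
The parent pauses at this state without consuming compute until the token is completed, which is exactly what you want for human approvals and batch windows.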
Pattern C: Parallel child workflows for independent branches
The parent starts independent subprocesses in parallel and joins after they complete.
Use when:
- tasks are independent and safe to run concurrently
- you want to reduce overall latency
- failures should be isolated per branch
Examples:
- fraud + tax calculation + personalization scoring (depending on domain semantics)
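A minimal fragment of this pattern using a Parallel state whose branches each invoke a child synchronously (placeholder alias ARNs; the branch outputs arrive as an array in $.parallelResults):
"RunIndependentChecks": {
  "Type": "Parallel",
  "Branches": [
    {
      "StartAt": "FraudChild",
      "States": {
        "FraudChild": {
          "Type": "Task",
          "Resource": "arn:aws:states:::states:startExecution.sync:2",
          "Parameters": {
            "StateMachineArn": "${FraudWorkflowAliasArn}",
            "Input": { "request.$": "$.fraudRequest" }
          },
          "End": true
        }
      }
    },
    {
      "StartAt": "TaxChild",
      "States": {
        "TaxChild": {
          "Type": "Task",
          "Resource": "arn:aws:states:::states:startExecution.sync:2",
          "Parameters": {
            "StateMachineArn": "${TaxWorkflowAliasArn}",
            "Input": { "request.$": "$.taxRequest" }
          },
          "End": true
        }
      }
    }
  ],
  "ResultPath": "$.parallelResults",
  "Next": "JoinDecision"
}
Note that an unhandled failure in one branch fails the whole Parallel state by default, so per-branch Catch blocks are worth adding when failures should stay isolated.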
Pattern D: Domain subprocess library
Create reusable child workflows that multiple parents can call.
Use when:
- you repeatedly implement the same orchestration chunk
- the subprocess is clearly owned by one team
- contract stability is good enough for reuse
Examples:
- identity verification
- payment capture
- notification fan-out preparation
Contracting inputs and outputs (the most important part)
In my experience, decomposition succeeds or fails based on contract discipline.
If I split a workflow but still pass the full parent payload into every child, I have only moved complexity around. I have not reduced coupling.
What a good child contract looks like
A child workflow contract should be:
- minimal: only fields the child needs
- explicit: named fields, stable structure
- typed: validated at boundaries
- versionable: compatible evolution plan
- auditable: includes correlation metadata
I usually use an envelope like this:
{
"meta": {
"correlationId": "corr-123",
"causationId": "exec-parent-abc",
"contractVersion": "1.0"
},
"request": {
"orderId": "ORD-100045",
"customerId": "CUST-9001",
"amount": 119.85,
"currency": "AUD",
"paymentMethodToken": "tok_123"
}
}
And I expect a child output like:
{
"meta": {
"correlationId": "corr-123",
"contractVersion": "1.0"
},
"result": {
"authorized": true,
"authorizationId": "auth_789",
"processorReference": "psp-456"
}
}
Contract boundaries I define explicitly
For each child workflow, I define:
- Input shape
- Success output shape
- Business failure output shape (if returned rather than thrown)
- Technical failure behavior (exception / failed execution)
- Timeout expectations
- Idempotency expectations
- Ownership and support team
This makes nested workflows composable, not just callable.
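To make the input shape enforceable rather than aspirational, I like a JSON Schema per contract version. Here is a sketch for the payment child input shown above (draft-07; the field list mirrors the envelope, and constraints like the amount bound are illustrative):
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "PaymentChildRequestV1",
  "type": "object",
  "required": ["meta", "request"],
  "properties": {
    "meta": {
      "type": "object",
      "required": ["correlationId", "contractVersion"],
      "properties": {
        "correlationId": { "type": "string" },
        "causationId": { "type": "string" },
        "contractVersion": { "const": "1.0" }
      }
    },
    "request": {
      "type": "object",
      "required": ["orderId", "customerId", "amount", "currency", "paymentMethodToken"],
      "properties": {
        "orderId": { "type": "string" },
        "customerId": { "type": "string" },
        "amount": { "type": "number", "exclusiveMinimum": 0 },
        "currency": { "type": "string", "minLength": 3, "maxLength": 3 },
        "paymentMethodToken": { "type": "string" }
      }
    }
  }
}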
Keep transformation logic close to the boundary
If the parent needs to adapt a parent model into a child request, I do that immediately before the child call. I do not let “temporary shape conversion” leak across the rest of the workflow.
Likewise, I normalize child output once after return, then continue with a clean parent-level model.
Versioning workflows safely
Workflow decomposition increases the number of deployable units. That is good for blast radius, but it also means you need a safe versioning strategy.
My rule: version the workflow and the contract
I treat these as separate concerns:
- Workflow version: the ASL implementation/version/alias of the child state machine
- Contract version: the input/output schema version the parent and child agree on
Sometimes a workflow changes without changing the contract. Sometimes a contract changes while the business purpose remains the same. I do not force those to be the same version number.
Safe versioning practices I use
1) Invoke child workflows through aliases
The parent should usually call a child alias ARN (for example, :PROD) rather than the unqualified state machine ARN, which always resolves to the latest definition.
This gives me a stable target I can move during deployment rollouts and rollbacks.
2) Use immutable workflow versions behind aliases
For production workflows, I want immutable versions behind aliases so I can answer:
- Which version processed this execution?
- Can I rollback without redefining the workflow?
- Can I shift traffic gradually?
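Alias routing is what makes the gradual-shift question answerable in configuration rather than code. Here is a sketch of a weighted PROD alias mid-rollout (region, account, and version numbers are placeholders; Step Functions aliases route between at most two versions, and weights must sum to 100):
{
  "name": "PROD",
  "routingConfiguration": [
    {
      "stateMachineVersionArn": "arn:aws:states:ap-southeast-2:111122223333:stateMachine:PaymentWorkflow:10",
      "weight": 90
    },
    {
      "stateMachineVersionArn": "arn:aws:states:ap-southeast-2:111122223333:stateMachine:PaymentWorkflow:11",
      "weight": 10
    }
  ]
}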
3) Keep contract compatibility during rollout windows
If Parent v3 is rolling out while Child Payments:PROD shifts from v10 to v11, I want a compatibility window where both versions honor the same contract or the parent chooses a matching alias (PAYMENTS_V1, PAYMENTS_V2).
4) Prefer additive contract changes
Safer changes:
- add optional output fields
- add optional input fields
- add new reason codes without changing existing semantics (with care)
Riskier changes:
- renaming fields
- changing meaning of status codes
- changing failure behavior from “return business failure” to “throw”
- changing data types
5) Test parent-child compatibility explicitly
I maintain fixtures and contract tests for parent-child integration, especially around:
- missing optional fields
- unexpected extra fields
- business failure responses
- timeout and retry behavior
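As a sketch of what those tests look like, here is a minimal Jest-style contract test; the import path is illustrative, and the fixture is inlined for brevity (in practice I capture fixtures from sandboxed child executions):
// contract-tests/payment-child.contract.test.ts
import { PaymentChildSuccessV1 } from "workflow-contracts/payment"; // hypothetical package

// Inline fixture matching the child success envelope shown earlier.
const successFixture: PaymentChildSuccessV1 = {
  meta: { correlationId: "corr-123", contractVersion: "1.0" },
  result: { authorized: true, authorizationId: "auth_789", processorReference: "psp-456" }
};

test("success output carries exactly the paths the parent reads", () => {
  expect(typeof successFixture.result.authorized).toBe("boolean");
  expect(successFixture.result.authorizationId).toBeTruthy();
  expect(successFixture.meta.contractVersion).toBe("1.0");
});

test("extra fields added by a newer child version do not break the reader", () => {
  const withExtras = {
    ...successFixture,
    result: { ...successFixture.result, newOptionalField: "x" }
  };
  expect(withExtras.result.authorized).toBe(true);
});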
Reference architecture
End-to-end walkthrough: decomposing an Order Processing workflow
I will use a realistic example because this is where the trade-offs become visible.
The original monolithic workflow (before)
We start with one large OrderProcessing state machine that does all of this:
- validate order
- fraud check
- authorize payment
- reserve inventory
- create shipment request
- send notifications
- persist status updates
- handle retries and compensation for multiple domains
It works, but over time:
- Payments team changes create merge conflicts with Fulfillment changes
- The workflow definition is difficult to review
- Troubleshooting a failed shipment step requires scrolling through unrelated payment/fraud logic
- Reusable subprocesses (payments, notifications) are duplicated elsewhere
The decomposed target architecture (after)
I split the design into:
Parent workflow: OrderOrchestrator
- coordinates the overall business flow
- invokes child workflows
- makes continuation/compensation decisions
- emits parent-level events/status transitions
Child workflows
- PaymentProcessingWorkflow
- InventoryReservationWorkflow
- FulfillmentSubmissionWorkflow
- CustomerNotificationWorkflow (optional, often event-driven instead)
Each child workflow owns:
- local retries
- domain-specific branching
- domain telemetry
- domain-specific error normalization
Why this split works
This decomposition aligns with domain boundaries and independent change cadence:
- Payments evolves frequently due to PSP integration and fraud strategy
- Inventory may change due to warehouse logic
- Fulfillment is often async and externally coupled
- Notifications are loosely coupled and may be event-driven
The parent remains readable and focused on business progression.
Architecture and flow (walkthrough narrative)
Here is the end-to-end flow in the decomposed design.
1) API receives CreateOrder request
The API layer validates basic request shape, stamps a correlation ID, and starts the parent OrderOrchestrator workflow (or publishes a command that triggers it, depending on your system style).
2) Parent workflow performs lightweight order validation
The parent may perform only orchestration-level checks (for example, required presence checks if not already done), then constructs a contracted input for the payment child workflow.
3) Parent invokes PaymentProcessingWorkflow as a synchronous child
The parent waits for payment output because the next step depends on authorization success.
The child workflow:
- performs fraud/risk checks (if owned by Payments)
- authorizes payment with PSP
- normalizes provider-specific responses
- returns a stable result contract
The parent receives only what it needs, not the child’s full internal state.
4) Parent invokes InventoryReservationWorkflow
If payment is authorized, the parent calls inventory reservation as another synchronous child and receives a normalized reservation result.
5) Parent branches based on combined business outcomes
The parent now makes a high-level decision:
- continue to fulfillment
- compensate payment if inventory failed
- reject order
- send to manual review
This is exactly where a parent orchestrator adds value.
6) Parent starts FulfillmentSubmissionWorkflow
This may be synchronous or asynchronous depending on downstream fulfillment systems.
If asynchronous:
- the parent may start the child and persist a pending status
- later completion may resume a follow-up workflow or emit events that drive downstream steps
7) Notifications and analytics are triggered
I often prefer event-driven notification/analytics fan-out instead of keeping them in the critical path. If kept as a child workflow, I keep the contract minimal and failure policy explicit (for example, notification failure should not fail order creation).
8) Parent publishes final order status and completes
The parent emits a domain event (for example, OrderAccepted, OrderPendingFulfillment, or OrderRejected) and completes with a stable external result.
Implementation discussion
Now I will show concrete examples of how I implement this pattern.
Parent workflow (ASL) using nested child workflows
This example uses Step Functions service integration to start child workflows and wait for results. I use startExecution.sync:2 because it returns child output as JSON rather than a JSON-encoded string, which makes downstream data handling cleaner.
{
"Comment": "Order orchestrator parent workflow",
"StartAt": "BuildPaymentRequest",
"States": {
"BuildPaymentRequest": {
"Type": "Pass",
"Parameters": {
"meta": {
"correlationId.$": "$.meta.correlationId",
"causationId.$": "$$.Execution.Id",
"contractVersion": "1.0"
},
"request": {
"orderId.$": "$.order.orderId",
"customerId.$": "$.order.customerId",
"amount.$": "$.order.totalAmount",
"currency.$": "$.order.currency",
"paymentMethodToken.$": "$.order.paymentMethodToken"
}
},
"ResultPath": "$.paymentCall",
"Next": "InvokePaymentChild"
},
"InvokePaymentChild": {
"Type": "Task",
"Resource": "arn:aws:states:::states:startExecution.sync:2",
"Parameters": {
"StateMachineArn": "${PaymentWorkflowAliasArn}",
"Input": {
"meta.$": "$.paymentCall.meta",
"request.$": "$.paymentCall.request",
"AWS_STEP_FUNCTIONS_STARTED_BY_EXECUTION_ID.$": "$$.Execution.Id"
}
},
"ResultPath": "$.paymentExecution",
"Retry": [
{
"ErrorEquals": ["StepFunctions.ExecutionLimitExceeded"],
"IntervalSeconds": 2,
"BackoffRate": 2,
"MaxAttempts": 3
}
],
"Next": "NormalizePaymentResult"
},
"NormalizePaymentResult": {
"Type": "Pass",
"Parameters": {
"authorized.$": "$.paymentExecution.Output.result.authorized",
"authorizationId.$": "$.paymentExecution.Output.result.authorizationId",
"processorReference.$": "$.paymentExecution.Output.result.processorReference"
},
"ResultPath": "$.payment",
"Next": "PaymentDecision"
},
"PaymentDecision": {
"Type": "Choice",
"Choices": [
{
"Variable": "$.payment.authorized",
"BooleanEquals": true,
"Next": "BuildInventoryRequest"
}
],
"Default": "RejectOrder"
},
"BuildInventoryRequest": {
"Type": "Pass",
"Parameters": {
"meta": {
"correlationId.$": "$.meta.correlationId",
"causationId.$": "$$.Execution.Id",
"contractVersion": "1.0"
},
"request": {
"orderId.$": "$.order.orderId",
"items.$": "$.order.items",
"warehousePreference.$": "$.order.warehousePreference"
}
},
"ResultPath": "$.inventoryCall",
"Next": "InvokeInventoryChild"
},
"InvokeInventoryChild": {
"Type": "Task",
"Resource": "arn:aws:states:::states:startExecution.sync:2",
"Parameters": {
"StateMachineArn": "${InventoryWorkflowAliasArn}",
"Input": {
"meta.$": "$.inventoryCall.meta",
"request.$": "$.inventoryCall.request",
"AWS_STEP_FUNCTIONS_STARTED_BY_EXECUTION_ID.$": "$$.Execution.Id"
}
},
"ResultPath": "$.inventoryExecution",
"Next": "InventoryDecision"
},
"InventoryDecision": {
"Type": "Choice",
"Choices": [
{
"Variable": "$.inventoryExecution.Output.result.reserved",
"BooleanEquals": true,
"Next": "StartFulfillmentChild"
}
],
"Default": "CompensatePayment"
},
"StartFulfillmentChild": {
"Type": "Task",
"Resource": "arn:aws:states:::states:startExecution",
"Parameters": {
"StateMachineArn": "${FulfillmentWorkflowAliasArn}",
"Input": {
"meta": {
"correlationId.$": "$.meta.correlationId",
"causationId.$": "$$.Execution.Id",
"contractVersion": "1.0"
},
"request": {
"orderId.$": "$.order.orderId",
"reservationId.$": "$.inventoryExecution.Output.result.reservationId",
"deliveryAddress.$": "$.order.deliveryAddress"
},
"AWS_STEP_FUNCTIONS_STARTED_BY_EXECUTION_ID.$": "$$.Execution.Id"
}
},
"ResultPath": "$.fulfillmentStart",
"Next": "CompleteAccepted"
},
"CompensatePayment": {
"Type": "Task",
"Resource": "arn:aws:states:::states:startExecution.sync:2",
"Parameters": {
"StateMachineArn": "${PaymentCompensationWorkflowAliasArn}",
"Input": {
"meta": {
"correlationId.$": "$.meta.correlationId",
"causationId.$": "$$.Execution.Id",
"contractVersion": "1.0"
},
"request": {
"orderId.$": "$.order.orderId",
"authorizationId.$": "$.payment.authorizationId"
},
"AWS_STEP_FUNCTIONS_STARTED_BY_EXECUTION_ID.$": "$$.Execution.Id"
}
},
"ResultPath": "$.paymentCompensation",
"Next": "RejectOrder"
},
"RejectOrder": {
"Type": "Succeed"
},
"CompleteAccepted": {
"Type": "Succeed"
}
}
}
Why this parent is easier to maintain
The parent workflow now:
- focuses on sequencing and business decisions
- calls domain-owned child workflows through aliases
- passes minimal, explicit contracts
- can evolve orchestration without rewriting domain subprocess internals
That is the kind of decomposition I want.
Child workflow example: PaymentProcessingWorkflow
I keep the child focused and domain-owned. This example is simplified, but it shows the pattern.
{
"Comment": "Payment processing child workflow",
"StartAt": "ValidateContract",
"States": {
"ValidateContract": {
"Type": "Choice",
"Choices": [
{
"And": [
{ "Variable": "$.meta.contractVersion", "StringEquals": "1.0" },
{ "Variable": "$.request.orderId", "IsPresent": true },
{ "Variable": "$.request.amount", "IsPresent": true },
{ "Variable": "$.request.paymentMethodToken", "IsPresent": true }
],
"Next": "AuthorizePayment"
}
],
"Default": "ContractError"
},
"AuthorizePayment": {
"Type": "Task",
"Resource": "arn:aws:states:::lambda:invoke",
"Parameters": {
"FunctionName": "${AuthorizePaymentFnArn}",
"Payload.$": "$"
},
"ResultSelector": {
"result.$": "$.Payload"
},
"ResultPath": "$.auth",
"Retry": [
{
"ErrorEquals": ["Lambda.ServiceException", "Lambda.SdkClientException", "States.TaskFailed"],
"IntervalSeconds": 2,
"BackoffRate": 2,
"MaxAttempts": 3
}
],
"Next": "BuildSuccessResponse"
},
"BuildSuccessResponse": {
"Type": "Pass",
"Parameters": {
"meta": {
"correlationId.$": "$.meta.correlationId",
"contractVersion": "1.0"
},
"result": {
"authorized.$": "$.auth.result.authorized",
"authorizationId.$": "$.auth.result.authorizationId",
"processorReference.$": "$.auth.result.processorReference"
}
},
"End": true
},
"ContractError": {
"Type": "Fail",
"Error": "ContractValidationError",
"Cause": "Invalid child workflow input contract"
}
}
}
Design choice I recommend
Notice that the child returns a normalized result contract, not raw PSP payloads. This prevents the parent from becoming coupled to provider-specific fields and keeps domain ownership intact.
TypeScript contract definitions (shared library)
I typically create a small shared library for workflow contracts (or generate types from JSON Schema/OpenAPI where appropriate).
// packages/workflow-contracts/src/payment.ts
export interface WorkflowMeta {
correlationId: string;
causationId?: string;
contractVersion: "1.0" | "1.1";
}
export interface PaymentChildRequestV1 {
meta: WorkflowMeta & { contractVersion: "1.0" };
request: {
orderId: string;
customerId: string;
amount: number;
currency: string;
paymentMethodToken: string;
};
}
export interface PaymentChildSuccessV1 {
meta: {
correlationId: string;
contractVersion: "1.0";
};
result: {
authorized: boolean;
authorizationId: string;
processorReference: string;
};
}
export interface PaymentChildBusinessFailureV1 {
meta: {
correlationId: string;
contractVersion: "1.0";
};
result: {
authorized: false;
reasonCode: "RISK_REJECTED" | "INSUFFICIENT_FUNDS" | "PROCESSOR_DECLINED";
processorReference?: string;
};
}
This type layer does not replace runtime validation, but it dramatically improves correctness in parent-child integration code and tests.
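As a sketch of that runtime validation at the child boundary, assuming ajv and a schema packaged alongside the types (import paths are illustrative):
// Validates raw child input against the v1 schema before any domain logic runs.
import Ajv from "ajv";
import paymentRequestSchemaV1 from "workflow-contracts/schemas/payment-request-v1.json";
import { PaymentChildRequestV1 } from "workflow-contracts/payment";

const ajv = new Ajv({ allErrors: true });
const validateRequest = ajv.compile<PaymentChildRequestV1>(paymentRequestSchemaV1);

export function parsePaymentRequest(input: unknown): PaymentChildRequestV1 {
  if (!validateRequest(input)) {
    // A stable error name lets the workflow route this to its ContractError handling.
    const error = new Error(ajv.errorsText(validateRequest.errors));
    error.name = "ContractValidationError";
    throw error;
  }
  return input;
}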
CDK wiring example (parent and child aliases)
This example shows the shape of how I wire aliases and pass alias ARNs to the parent workflow.
import * as cdk from "aws-cdk-lib";
import * as sfn from "aws-cdk-lib/aws-stepfunctions";
import { Construct } from "constructs";
export class OrderWorkflowsStack extends cdk.Stack {
constructor(scope: Construct, id: string, props?: cdk.StackProps) {
super(scope, id, props);
// Assume these are already defined with actual definitions
const paymentChild = new sfn.StateMachine(this, "PaymentWorkflow", {
definitionBody: sfn.DefinitionBody.fromString('{"StartAt":"Done","States":{"Done":{"Type":"Succeed"}}}')
});
const inventoryChild = new sfn.StateMachine(this, "InventoryWorkflow", {
definitionBody: sfn.DefinitionBody.fromString('{"StartAt":"Done","States":{"Done":{"Type":"Succeed"}}}')
});
// Publish immutable versions (illustrative)
const paymentVersion = new sfn.CfnStateMachineVersion(this, "PaymentWorkflowVersion", {
stateMachineArn: paymentChild.stateMachineArn
});
new sfn.CfnStateMachineAlias(this, "PaymentWorkflowProdAlias", {
name: "PROD",
routingConfiguration: [
{
stateMachineVersionArn: paymentVersion.attrStateMachineVersionArn,
weight: 100
}
]
});
const inventoryVersion = new sfn.CfnStateMachineVersion(this, "InventoryWorkflowVersion", {
stateMachineArn: inventoryChild.stateMachineArn
});
new sfn.CfnStateMachineAlias(this, "InventoryWorkflowProdAlias", {
name: "PROD",
routingConfiguration: [
{
stateMachineVersionArn: inventoryVersion.attrStateMachineVersionArn,
weight: 100
}
]
});
// Parent definition would consume these alias ARNs (via substitutions/templating)
new cdk.CfnOutput(this, "PaymentWorkflowAliasArn", {
value: `${paymentChild.stateMachineArn}:PROD`
});
new cdk.CfnOutput(this, "InventoryWorkflowAliasArn", {
value: `${inventoryChild.stateMachineArn}:PROD`
});
// In production, ensure the parent role has least-privilege for nested calls.
}
}
What I pay attention to in deployment pipelines
For child workflows, I want CI/CD to support:
- contract tests
- workflow unit/integration tests
- publish new version
- move alias gradually (canary/linear where appropriate)
- rollback alias quickly if needed
This is where decomposition pays off operationally. I can deploy a Payment child workflow change without touching the Inventory child or the parent orchestrator if the contract remains stable.
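Here is a sketch of the publish-and-shift step using the AWS SDK for JavaScript v3 (error handling and the bake time between weight shifts are omitted; ARNs are placeholders):
import {
  SFNClient,
  PublishStateMachineVersionCommand,
  UpdateStateMachineAliasCommand
} from "@aws-sdk/client-sfn";

const sfn = new SFNClient({});

export async function rolloutChildWorkflow(stateMachineArn: string, aliasArn: string): Promise<void> {
  // 1) Publish an immutable version from the current definition.
  const { stateMachineVersionArn } = await sfn.send(
    new PublishStateMachineVersionCommand({ stateMachineArn })
  );

  // 2) Point the alias at the new version. For canary/linear rollouts,
  //    use two weighted entries and shift weight over several calls.
  await sfn.send(
    new UpdateStateMachineAliasCommand({
      stateMachineAliasArn: aliasArn,
      routingConfiguration: [
        { stateMachineVersionArn: stateMachineVersionArn!, weight: 100 }
      ]
    })
  );
}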
IAM and permissions for nested workflows (important operational detail)
Nested workflows are straightforward conceptually, but the IAM details matter.
When the parent waits synchronously for a child, the integration requires more than states:StartExecution alone: the parent execution role also needs states:DescribeExecution and states:StopExecution on the child's executions, plus access to the EventBridge managed rule the .sync integration uses to learn about completion. I always validate the parent execution role permissions for nested patterns during deployment and in pre-prod tests, because missing permissions can lead to confusing delays or stuck behavior.
I also scope permissions narrowly to the child workflows the parent is actually allowed to call. Decomposition should improve boundaries, not weaken them.
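As a sketch, the parent execution role for the synchronous pattern usually looks something like this (placeholder region and account; the second StartExecution resource entry covers alias- and version-qualified ARNs):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "states:StartExecution",
      "Resource": [
        "arn:aws:states:ap-southeast-2:111122223333:stateMachine:PaymentWorkflow",
        "arn:aws:states:ap-southeast-2:111122223333:stateMachine:PaymentWorkflow:*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": ["states:DescribeExecution", "states:StopExecution"],
      "Resource": "arn:aws:states:ap-southeast-2:111122223333:execution:PaymentWorkflow:*"
    },
    {
      "Effect": "Allow",
      "Action": ["events:PutTargets", "events:PutRule", "events:DescribeRule"],
      "Resource": "arn:aws:events:ap-southeast-2:111122223333:rule/StepFunctionsGetEventsForStepFunctionsExecutionRule"
    }
  ]
}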
Observability after decomposition
A common concern is that decomposition makes tracing harder because the work is spread across multiple executions.
In practice, I have found the opposite to be true when I propagate correlation metadata correctly.
What I propagate into every child
- correlationId
- causationId (usually the parent execution ID)
- contract version
- domain entity ID (for example, orderId)
What I log in each child
- child workflow name and alias/version (where possible)
- start/end timestamps
- business outcome
- retry counts / terminal error classification
This makes it much easier to answer:
- Which child failed?
- Was it a contract issue or domain issue?
- Which version of the child handled the request?
- Did rollback change the outcome?
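Here is a sketch of the structured completion log I have each child emit (field names are illustrative; how you resolve the alias/version depends on your setup):
{
  "level": "INFO",
  "message": "child workflow completed",
  "workflow": "PaymentProcessingWorkflow",
  "aliasVersion": "PROD/11",
  "correlationId": "corr-123",
  "causationId": "exec-parent-abc",
  "contractVersion": "1.0",
  "orderId": "ORD-100045",
  "outcome": "AUTHORIZED",
  "retryCount": 1,
  "durationMs": 1840
}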
How to split by domain and subprocess in practice
When teams ask me “where exactly should we split?”, I usually run a quick decomposition workshop with these prompts:
Prompt 1: Which parts change for different business reasons?
If payment changes because of PSP behavior and inventory changes because of warehouse logic, those belong in different subprocesses.
Prompt 2: Which parts require different failure semantics?
If notification failure should not fail order acceptance, that is a strong candidate for decoupling from the parent critical path.
Prompt 3: Which parts are reusable?
If onboarding, checkout, and subscription renewal all need the same payment authorization flow, that is a candidate child workflow.
Prompt 4: Which parts have different owners/on-call teams?
Team boundaries are not the only factor, but they matter operationally. A child workflow with clear ownership improves support and release confidence.
Prompt 5: Which parts make the parent harder to read than the business process itself?
That is usually the part I extract first.
Migration strategy: from one monolith workflow to decomposed workflows safely
I do not recommend a big-bang rewrite. I prefer incremental extraction.
Step 1: Identify one extraction candidate
Pick a subprocess with clear boundaries (for example, Payments).
Step 2: Define the contract before extracting
Write:
- child input schema/type
- child output schema/type
- failure behavior
- timeouts and retries
Step 3: Extract the logic into a child workflow
Keep behavior equivalent first. Avoid redesigning everything in the same change.
Step 4: Update parent to call child via alias
Use a stable alias (for example, PROD) so future child changes do not require parent definition changes.
Step 5: Add compatibility and regression tests
Test:
- happy path
- business failure path
- timeout/retry path
- malformed contract path
Step 6: Repeat for the next extraction
After 1-2 successful extractions, teams usually become much more comfortable with the pattern.
What not to do
I have seen a few anti-patterns appear during decomposition efforts.
Anti-pattern 1: "Micro-workflow everything"
Creating a child workflow for every tiny step adds ceremony without improving maintainability.
Anti-pattern 2: Passing the entire parent payload into every child
This preserves hidden coupling and makes contracts meaningless.
Anti-pattern 3: Parent depends on child internals
If the parent reads deeply nested provider-specific details returned by a child, you have recreated coupling through outputs.
Anti-pattern 4: No versioning strategy
Without aliases/versions and contract discipline, decomposition can increase operational risk instead of reducing it.
Anti-pattern 5: Decomposition without ownership
If nobody owns a child workflow end-to-end, incidents become harder, not easier.
Final thoughts
A Step Functions workflow becoming “too large” is not the real problem. The real problem is when workflow boundaries stop matching business and domain boundaries.
When that happens, decomposition is not about making the diagram prettier. It is about restoring:
- change safety
- testability
- ownership
- observability
- architectural clarity
The pattern I keep coming back to is simple:
- Parent workflow for orchestration decisions and business progression
- Child workflows for domain-owned subprocesses
- Explicit contracts for inputs/outputs
- Versioned deployments via immutable versions + aliases
- Strong observability metadata across execution boundaries
That is how I keep Step Functions as an orchestration asset, rather than letting it become a serverless monolith.