There is a point in many serverless platforms where a Step Functions workflow that once felt elegant starts to feel like a mini application platform of its own.
I have seen this happen in teams that are doing many things correctly: they standardized orchestration, they improved visibility, and they moved fragile glue logic out of Lambdas. Then six months later, the workflow has 100+ states, a maze of Choice branches, deeply nested payload transformations, and a deployment blast radius that makes everyone nervous.
This post is about recognizing workflow sprawl early and decomposing a Step Functions workflow into a more maintainable architecture without losing the benefits of orchestration.
I will cover:
- Signs of workflow sprawl
- Splitting by domain and subprocess boundaries
- Parent-child workflow patterns
- Contracting inputs and outputs
- Versioning workflows safely
- An end-to-end walkthrough with architecture and code
- Implementation discussion and migration guidance
I will use AWS Step Functions terminology throughout, but the architectural thinking applies broadly to workflow systems.
Why this matters
A large workflow is not automatically a bad workflow.
In fact, I often start with a single orchestration when I want to make the business process visible quickly. The problem is not “too many states” by itself. The problem is when a workflow stops reflecting a coherent business flow and instead becomes:
- a catch-all for multiple domains
- a deployment bottleneck
- a fragile contract hub
- a place where teams are afraid to change anything
At that point, I treat it like I would a code monolith that has outgrown its boundaries: decompose intentionally, not reactively.
What I mean by a "Step Function monolith"
For this post, a Step Functions workflow becomes a monolith when one state machine accumulates responsibilities that should be owned by separate domains or subprocesses.
Typical symptoms include:
- Order orchestration, payment rules, inventory logic, fraud checks, and notifications all embedded in one ASL definition
- Repeated transformation states to make one team's output fit another team's input
- Error handling branches duplicated across unrelated parts of the flow
- A single workflow release requiring coordination across multiple teams
This is not just a readability issue. It affects operability, testing, and change safety.
Signs of workflow sprawl
These are the patterns I look for during architecture reviews.
1) One workflow owns too many domains
If a single state machine is enforcing rules that belong to Payments, Inventory, Fraud, Fulfillment, and Notifications, it is likely doing too much.
A good orchestrator should coordinate domains, not absorb their internal logic.
2) The ASL definition becomes hard to reason about
Signs include:
- many long Choice chains
- repeated Pass/transform states just to reshape data
- large Catch and Retry blocks copied across multiple branches
- difficulty tracing the happy path from start to finish
If I need a map just to explain the workflow in a design review, decomposition is usually overdue.
3) Payloads become "workflow-shaped" instead of domain-shaped
A common smell is a giant state payload that keeps growing because every future step might need something.
Symptoms:
- many fields carried "just in case"
- internal step-specific fields leaking into later steps
- brittle JSONPath references across distant states
- accidental coupling to intermediate output shapes
This is often the strongest signal that input/output contracts need to be tightened.
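For contrast, here is an exaggerated, hypothetical example of a workflow-shaped payload; every field name below is invented to illustrate the smell, and the contracted envelope shown later in this post is the shape I aim for instead:
{
  "order": { "orderId": "ORD-100045", "items": ["..."] },
  "fraudInternalScoreBreakdown": "internal to the Fraud domain, leaked downstream",
  "pspRawAuthResponse": "provider-specific fields that distant states now reference by JSONPath",
  "shippingQuoteScratch": "carried 'just in case' a later step wants it",
  "stepSevenTempOutput": "only meaningful to one state, visible to all of them"
}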
4) Change blast radius is too large
If a small payment change forces re-testing the full order pipeline end-to-end, you are paying monolith tax in a serverless system.
I watch for:
- frequent merge conflicts in the same workflow definition
- unrelated teams blocking each other
- release windows for “workflow changes”
- fear of touching central error paths
5) Execution histories are huge and troubleshooting is slow
When executions become long and noisy, step histories are harder to navigate. Standard workflow executions also cap execution history at 25,000 events, so a sprawling workflow with heavy retries can approach hard platform limits. Even when the workflow is functionally correct, operator experience degrades.
This matters during incidents. The fastest diagnosis usually comes from clear orchestration boundaries and localized subprocess execution histories.
6) Reuse pressure leads to copy/paste orchestration
If teams are duplicating chunks of states for common subprocesses (for example, document validation, payment authorization, fraud scoring), that is a strong indicator those chunks should become child workflows.
7) Mixed execution profiles are forced into one workflow
Examples:
- a mostly synchronous checkout path mixed with long-running fulfillment polling
- high-throughput lightweight paths mixed with complex human approval steps
- latency-sensitive branches mixed with eventual-consistency branches
These often want different execution patterns, retry policies, and operational ownership.
Decomposition principles I use
When I decompose a Step Functions workflow, I do not split it by "number of states." I split it by architectural responsibility.
Principle 1: Keep the parent workflow focused on orchestration decisions
The parent should answer questions like:
- Which subprocess runs next?
- Should we continue or compensate?
- What is the overall status?
- Which events should be emitted?
It should not implement deep domain logic that belongs in a domain-owned subprocess.
Principle 2: Split by domain or stable subprocess boundary
Great candidates for child workflows are subprocesses that are:
- domain-owned (Payments, KYC, Inventory)
- reusable across multiple parent workflows
- likely to evolve independently
- complex enough to justify dedicated retries/error handling
- testable as a standalone business unit
Principle 3: Define explicit input and output contracts
Do not pass the entire parent state to every child.
Instead, define:
- a minimal child input contract
- a stable child output contract
- an error/failure contract (where applicable)
- version metadata in the contract or state machine aliasing strategy
This is the workflow equivalent of well-designed service APIs.
Principle 4: Decompose to reduce blast radius, not to maximize nesting
Nested workflows are powerful, but over-nesting can create its own complexity.
I avoid decomposition that creates:
- wrappers around trivial single-step tasks
- nested workflows with no clear ownership
- chains of parent -> child -> grandchild just for aesthetics
The goal is better changeability and operability, not "micro-workflows everywhere."
Principle 5: Preserve the business narrative
After decomposition, I still want to be able to explain the parent workflow in plain language.
For example:
Validate order -> Process payment -> Reserve inventory -> Create shipment -> Notify customer
If the parent becomes an opaque set of “InvokeChildX” states with no business story, the design needs refinement.
Parent-child workflow patterns
There is no single nesting pattern that fits every case. I typically use a small set of patterns and choose deliberately.
Pattern A: Synchronous child workflow (request/response style orchestration)
The parent waits for the child to finish and uses the output immediately.
Use when:
- the next parent decision depends on child output
- the subprocess is part of the critical path
- you want localized retries inside the child workflow
Examples:
- payment authorization
- fraud decision
- document validation
Pattern B: Asynchronous child workflow (fire and track)
The parent starts a child workflow and continues later based on an event, callback, or polling strategy.
Use when:
- the subprocess is long-running
- an external system controls timing
- human approval or batch windows are involved
Examples:
- fulfillment handoff
- partner settlement
- manual review
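For the callback variant of this pattern, Step Functions supports passing a task token when starting a child execution; the child (or a worker it hands off to) later completes the token with SendTaskSuccess. A minimal state fragment as a sketch, with a placeholder alias ARN, a day-long timeout as an example, and input fields following the envelope introduced in the contracts section below:
"StartManualReviewChild": {
  "Type": "Task",
  "Resource": "arn:aws:states:::states:startExecution.waitForTaskToken",
  "Parameters": {
    "StateMachineArn": "${ManualReviewWorkflowAliasArn}",
    "Input": {
      "taskToken.$": "$$.Task.Token",
      "meta.$": "$.reviewCall.meta",
      "request.$": "$.reviewCall.request"
    }
  },
  "TimeoutSeconds": 86400,
  "ResultPath": "$.reviewResult",
  "Next": "ReviewDecision"
}
The parent pauses at this state without consuming compute until the token is completed, which is exactly what you want for human approvals and batch windows.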
Pattern C: Parallel child workflows for independent branches
The parent starts independent subprocesses in parallel and joins after they complete.
Use when:
- tasks are independent and safe to run concurrently
- you want to reduce overall latency
- failures should be isolated per branch
Examples:
- fraud + tax calculation + personalization scoring (depending on domain semantics)
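A minimal fragment of this pattern using a Parallel state whose branches each invoke a child synchronously (placeholder alias ARNs; the branch outputs arrive as an array in $.parallelResults):
"RunIndependentChecks": {
  "Type": "Parallel",
  "Branches": [
    {
      "StartAt": "FraudChild",
      "States": {
        "FraudChild": {
          "Type": "Task",
          "Resource": "arn:aws:states:::states:startExecution.sync:2",
          "Parameters": {
            "StateMachineArn": "${FraudWorkflowAliasArn}",
            "Input": { "request.$": "$.fraudRequest" }
          },
          "End": true
        }
      }
    },
    {
      "StartAt": "TaxChild",
      "States": {
        "TaxChild": {
          "Type": "Task",
          "Resource": "arn:aws:states:::states:startExecution.sync:2",
          "Parameters": {
            "StateMachineArn": "${TaxWorkflowAliasArn}",
            "Input": { "request.$": "$.taxRequest" }
          },
          "End": true
        }
      }
    }
  ],
  "ResultPath": "$.parallelResults",
  "Next": "JoinDecision"
}
Note that an unhandled failure in one branch fails the whole Parallel state by default, so per-branch Catch blocks are worth adding when failures should stay isolated.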
Pattern D: Domain subprocess library
Create reusable child workflows that multiple parents can call.
Use when:
- you repeatedly implement the same orchestration chunk
- the subprocess is clearly owned by one team
- contract stability is good enough for reuse
Examples:
- identity verification
- payment capture
- notification fan-out preparation
Contracting inputs and outputs (the most important part)
In my experience, decomposition succeeds or fails based on contract discipline.
If I split a workflow but still pass the full parent payload into every child, I have only moved complexity around. I have not reduced coupling.
What a good child contract looks like
A child workflow contract should be:
- minimal: only fields the child needs
- explicit: named fields, stable structure
- typed: validated at boundaries
- versionable: compatible evolution plan
- auditable: includes correlation metadata
I usually use an envelope like this:
{
"meta": {
"correlationId": "corr-123",
"causationId": "exec-parent-abc",
"contractVersion": "1.0"
},
"request": {
"orderId": "ORD-100045",
"customerId": "CUST-9001",
"amount": 119.85,
"currency": "AUD",
"paymentMethodToken": "tok_123"
}
}
And I expect a child output like:
{
"meta": {
"correlationId": "corr-123",
"contractVersion": "1.0"
},
"result": {
"authorized": true,
"authorizationId": "auth_789",
"processorReference": "psp-456"
}
}
Contract boundaries I define explicitly
For each child workflow, I define:
- Input shape
- Success output shape
- Business failure output shape (if returned rather than thrown)
- Technical failure behavior (exception / failed execution)
- Timeout expectations
- Idempotency expectations
- Ownership and support team
This makes nested workflows composable, not just callable.
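To make the input shape enforceable rather than aspirational, I like a JSON Schema per contract version. Here is a sketch for the payment child input shown above (draft-07; the field list mirrors the envelope, and constraints like the amount bound are illustrative):
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "PaymentChildRequestV1",
  "type": "object",
  "required": ["meta", "request"],
  "properties": {
    "meta": {
      "type": "object",
      "required": ["correlationId", "contractVersion"],
      "properties": {
        "correlationId": { "type": "string" },
        "causationId": { "type": "string" },
        "contractVersion": { "const": "1.0" }
      }
    },
    "request": {
      "type": "object",
      "required": ["orderId", "customerId", "amount", "currency", "paymentMethodToken"],
      "properties": {
        "orderId": { "type": "string" },
        "customerId": { "type": "string" },
        "amount": { "type": "number", "exclusiveMinimum": 0 },
        "currency": { "type": "string", "minLength": 3, "maxLength": 3 },
        "paymentMethodToken": { "type": "string" }
      }
    }
  }
}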
Keep transformation logic close to the boundary
If the parent needs to adapt a parent model into a child request, I do that immediately before the child call. I do not let “temporary shape conversion” leak across the rest of the workflow.
Likewise, I normalize child output once after return, then continue with a clean parent-level model.
Versioning workflows safely
Workflow decomposition increases the number of deployable units. That is good for blast radius, but it also means you need a safe versioning strategy.
My rule: version the workflow and the contract
I treat these as separate concerns:
- Workflow version: the ASL implementation/version/alias of the child state machine
- Contract version: the input/output schema version the parent and child agree on
Sometimes a workflow changes without changing the contract. Sometimes a contract changes while the business purpose remains the same. I do not force those to be the same version number.
Safe versioning practices I use
1) Invoke child workflows through aliases
The parent should usually call a child alias ARN (for example, :PROD) rather than the unqualified state machine ARN, which always resolves to the latest definition.
This gives me a stable target I can move during deployment rollouts and rollbacks.
2) Use immutable workflow versions behind aliases
For production workflows, I want immutable versions behind aliases so I can answer:
- Which version processed this execution?
- Can I rollback without redefining the workflow?
- Can I shift traffic gradually?
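Alias routing is what makes the gradual-shift question answerable in configuration rather than code. Here is a sketch of a weighted PROD alias mid-rollout (region, account, and version numbers are placeholders; Step Functions aliases route between at most two versions, and weights must sum to 100):
{
  "name": "PROD",
  "routingConfiguration": [
    {
      "stateMachineVersionArn": "arn:aws:states:ap-southeast-2:111122223333:stateMachine:PaymentWorkflow:10",
      "weight": 90
    },
    {
      "stateMachineVersionArn": "arn:aws:states:ap-southeast-2:111122223333:stateMachine:PaymentWorkflow:11",
      "weight": 10
    }
  ]
}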
3) Keep contract compatibility during rollout windows
If Parent v3 is rolling out while Child Payments:PROD shifts from v10 to v11, I want a compatibility window where both versions honor the same contract or the parent chooses a matching alias (PAYMENTS_V1, PAYMENTS_V2).
4) Prefer additive contract changes
Safer changes:
- add optional output fields
- add optional input fields
- add new reason codes without changing existing semantics (with care)
Riskier changes:
- renaming fields
- changing meaning of status codes
- changing failure behavior from “return business failure” to “throw”
- changing data types
5) Test parent-child compatibility explicitly
I maintain fixtures and contract tests for parent-child integration, especially around:
- missing optional fields
- unexpected extra fields
- business failure responses
- timeout and retry behavior
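As a sketch of what those tests look like, here is a minimal Jest-style contract test; the import path is illustrative, and the fixture is inlined for brevity (in practice I capture fixtures from sandboxed child executions):
// contract-tests/payment-child.contract.test.ts
import { PaymentChildSuccessV1 } from "workflow-contracts/payment"; // hypothetical package

// Inline fixture matching the child success envelope shown earlier.
const successFixture: PaymentChildSuccessV1 = {
  meta: { correlationId: "corr-123", contractVersion: "1.0" },
  result: { authorized: true, authorizationId: "auth_789", processorReference: "psp-456" }
};

test("success output carries exactly the paths the parent reads", () => {
  expect(typeof successFixture.result.authorized).toBe("boolean");
  expect(successFixture.result.authorizationId).toBeTruthy();
  expect(successFixture.meta.contractVersion).toBe("1.0");
});

test("extra fields added by a newer child version do not break the reader", () => {
  const withExtras = {
    ...successFixture,
    result: { ...successFixture.result, newOptionalField: "x" }
  };
  expect(withExtras.result.authorized).toBe(true);
});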
Reference architecture
End-to-end walkthrough: decomposing an Order Processing workflow
I will use a realistic example because this is where the trade-offs become visible.
The original monolithic workflow (before)
We start with one large OrderProcessing state machine that does all of this:
- validate order
- fraud check
- authorize payment
- reserve inventory
- create shipment request
- send notifications
- persist status updates
- handle retries and compensation for multiple domains
It works, but over time:
- Payments team changes create merge conflicts with Fulfillment changes
- The workflow definition is difficult to review
- Troubleshooting a failed shipment step requires scrolling through unrelated payment/fraud logic
- Reusable subprocesses (payments, notifications) are duplicated elsewhere
The decomposed target architecture (after)
I split the design into:
Parent workflow: OrderOrchestrator
- coordinates the overall business flow
- invokes child workflows
- makes continuation/compensation decisions
- emits parent-level events/status transitions
Child workflows
- PaymentProcessingWorkflow
- InventoryReservationWorkflow
- FulfillmentSubmissionWorkflow
- CustomerNotificationWorkflow (optional, often event-driven instead)
Each child workflow owns:
- local retries
- domain-specific branching
- domain telemetry
- domain-specific error normalization
Why this split works
This decomposition aligns with domain boundaries and independent change cadence:
- Payments evolves frequently due to PSP integration and fraud strategy
- Inventory may change due to warehouse logic
- Fulfillment is often async and externally coupled
- Notifications are loosely coupled and may be event-driven
The parent remains readable and focused on business progression.
Architecture and flow (walkthrough narrative)
Here is the end-to-end flow in the decomposed design.
1) API receives CreateOrder request
The API layer validates basic request shape, stamps a correlation ID, and starts the parent OrderOrchestrator workflow (or publishes a command that triggers it, depending on your system style).
2) Parent workflow performs lightweight order validation
The parent may perform only orchestration-level checks (for example, required presence checks if not already done), then constructs a contracted input for the payment child workflow.
3) Parent invokes PaymentProcessingWorkflow as a synchronous child
The parent waits for payment output because the next step depends on authorization success.
The child workflow:
- performs fraud/risk checks (if owned by Payments)
- authorizes payment with PSP
- normalizes provider-specific responses
- returns a stable result contract
The parent receives only what it needs, not the child’s full internal state.
4) Parent invokes InventoryReservationWorkflow
If payment is authorized, the parent calls inventory reservation as another synchronous child and receives a normalized reservation result.
5) Parent branches based on combined business outcomes
The parent now makes a high-level decision:
- continue to fulfillment
- compensate payment if inventory failed
- reject order
- send to manual review
This is exactly where a parent orchestrator adds value.
6) Parent starts FulfillmentSubmissionWorkflow
This may be synchronous or asynchronous depending on downstream fulfillment systems.
If asynchronous:
- the parent may start the child and persist a pending status
- later completion may resume a follow-up workflow or emit events that drive downstream steps
7) Notifications and analytics are triggered
I often prefer event-driven notification/analytics fan-out instead of keeping them in the critical path. If kept as a child workflow, I keep the contract minimal and failure policy explicit (for example, notification failure should not fail order creation).
8) Parent publishes final order status and completes
The parent emits a domain event (for example, OrderAccepted, OrderPendingFulfillment, or OrderRejected) and completes with a stable external result.
Implementation discussion
Now I will show concrete examples of how I implement this pattern.
Parent workflow (ASL) using nested child workflows
This example uses Step Functions service integration to start child workflows and wait for results. I use startExecution.sync:2 because it returns child output as JSON rather than a JSON-encoded string, which makes downstream data handling cleaner.
{
"Comment": "Order orchestrator parent workflow",
"StartAt": "BuildPaymentRequest",
"States": {
"BuildPaymentRequest": {
"Type": "Pass",
"Parameters": {
"meta": {
"correlationId.$": "$.meta.correlationId",
"causationId.$": "$$.Execution.Id",
"contractVersion": "1.0"
},
"request": {
"orderId.$": "$.order.orderId",
"customerId.$": "$.order.customerId",
"amount.$": "$.order.totalAmount",
"currency.$": "$.order.currency",
"paymentMethodToken.$": "$.order.paymentMethodToken"
}
},
"ResultPath": "$.paymentCall",
"Next": "InvokePaymentChild"
},
"InvokePaymentChild": {
"Type": "Task",
"Resource": "arn:aws:states:::states:startExecution.sync:2",
"Parameters": {
"StateMachineArn": "${PaymentWorkflowAliasArn}",
"Input": {
"meta.$": "$.paymentCall.meta",
"request.$": "$.paymentCall.request",
"AWS_STEP_FUNCTIONS_STARTED_BY_EXECUTION_ID.$": "$$.Execution.Id"
}
},
"ResultPath": "$.paymentExecution",
"Retry": [
{
"ErrorEquals": ["StepFunctions.ExecutionLimitExceeded"],
"IntervalSeconds": 2,
"BackoffRate": 2,
"MaxAttempts": 3
}
],
"Next": "NormalizePaymentResult"
},
"NormalizePaymentResult": {
"Type": "Pass",
"Parameters": {
"authorized.$": "$.paymentExecution.Output.result.authorized",
"authorizationId.$": "$.paymentExecution.Output.result.authorizationId",
"processorReference.$": "$.paymentExecution.Output.result.processorReference"
},
"ResultPath": "$.payment",
"Next": "PaymentDecision"
},
"PaymentDecision": {
"Type": "Choice",
"Choices": [
{
"Variable": "$.payment.authorized",
"BooleanEquals": true,
"Next": "BuildInventoryRequest"
}
],
"Default": "RejectOrder"
},
"BuildInventoryRequest": {
"Type": "Pass",
"Parameters": {
"meta": {
"correlationId.$": "$.meta.correlationId",
"causationId.$": "$$.Execution.Id",
"contractVersion": "1.0"
},
"request": {
"orderId.$": "$.order.orderId",
"items.$": "$.order.items",
"warehousePreference.$": "$.order.warehousePreference"
}
},
"ResultPath": "$.inventoryCall",
"Next": "InvokeInventoryChild"
},
"InvokeInventoryChild": {
"Type": "Task",
"Resource": "arn:aws:states:::states:startExecution.sync:2",
"Parameters": {
"StateMachineArn": "${InventoryWorkflowAliasArn}",
"Input": {
"meta.$": "$.inventoryCall.meta",
"request.$": "$.inventoryCall.request",
"AWS_STEP_FUNCTIONS_STARTED_BY_EXECUTION_ID.$": "$$.Execution.Id"
}
},
"ResultPath": "$.inventoryExecution",
"Next": "InventoryDecision"
},
"InventoryDecision": {
"Type": "Choice",
"Choices": [
{
"Variable": "$.inventoryExecution.Output.result.reserved",
"BooleanEquals": true,
"Next": "StartFulfillmentChild"
}
],
"Default": "CompensatePayment"
},
"StartFulfillmentChild": {
"Type": "Task",
"Resource": "arn:aws:states:::states:startExecution",
"Parameters": {
"StateMachineArn": "${FulfillmentWorkflowAliasArn}",
"Input": {
"meta": {
"correlationId.$": "$.meta.correlationId",
"causationId.$": "$$.Execution.Id",
"contractVersion": "1.0"
},
"request": {
"orderId.$": "$.order.orderId",
"reservationId.$": "$.inventoryExecution.Output.result.reservationId",
"deliveryAddress.$": "$.order.deliveryAddress"
},
"AWS_STEP_FUNCTIONS_STARTED_BY_EXECUTION_ID.$": "$$.Execution.Id"
}
},
"ResultPath": "$.fulfillmentStart",
"Next": "CompleteAccepted"
},
"CompensatePayment": {
"Type": "Task",
"Resource": "arn:aws:states:::states:startExecution.sync:2",
"Parameters": {
"StateMachineArn": "${PaymentCompensationWorkflowAliasArn}",
"Input": {
"meta": {
"correlationId.$": "$.meta.correlationId",
"causationId.$": "$$.Execution.Id",
"contractVersion": "1.0"
},
"request": {
"orderId.$": "$.order.orderId",
"authorizationId.$": "$.payment.authorizationId"
},
"AWS_STEP_FUNCTIONS_STARTED_BY_EXECUTION_ID.$": "$$.Execution.Id"
}
},
"ResultPath": "$.paymentCompensation",
"Next": "RejectOrder"
},
"RejectOrder": {
"Type": "Succeed"
},
"CompleteAccepted": {
"Type": "Succeed"
}
}
}
Why this parent is easier to maintain
The parent workflow now:
- focuses on sequencing and business decisions
- calls domain-owned child workflows through aliases
- passes minimal, explicit contracts
- can evolve orchestration without rewriting domain subprocess internals
That is the kind of decomposition I want.
Child workflow example: PaymentProcessingWorkflow
I keep the child focused and domain-owned. This example is simplified, but it shows the pattern.
{
"Comment": "Payment processing child workflow",
"StartAt": "ValidateContract",
"States": {
"ValidateContract": {
"Type": "Choice",
"Choices": [
{
"And": [
{ "Variable": "$.meta.contractVersion", "StringEquals": "1.0" },
{ "Variable": "$.request.orderId", "IsPresent": true },
{ "Variable": "$.request.amount", "IsPresent": true },
{ "Variable": "$.request.paymentMethodToken", "IsPresent": true }
],
"Next": "AuthorizePayment"
}
],
"Default": "ContractError"
},
"AuthorizePayment": {
"Type": "Task",
"Resource": "arn:aws:states:::lambda:invoke",
"Parameters": {
"FunctionName": "${AuthorizePaymentFnArn}",
"Payload.$": "$"
},
"ResultSelector": {
"result.$": "$.Payload"
},
"ResultPath": "$.auth",
"Retry": [
{
"ErrorEquals": ["Lambda.ServiceException", "Lambda.SdkClientException", "States.TaskFailed"],
"IntervalSeconds": 2,
"BackoffRate": 2,
"MaxAttempts": 3
}
],
"Next": "BuildSuccessResponse"
},
"BuildSuccessResponse": {
"Type": "Pass",
"Parameters": {
"meta": {
"correlationId.$": "$.meta.correlationId",
"contractVersion": "1.0"
},
"result": {
"authorized.$": "$.auth.result.authorized",
"authorizationId.$": "$.auth.result.authorizationId",
"processorReference.$": "$.auth.result.processorReference"
}
},
"End": true
},
"ContractError": {
"Type": "Fail",
"Error": "ContractValidationError",
"Cause": "Invalid child workflow input contract"
}
}
}
Design choice I recommend
Notice that the child returns a normalized result contract, not raw PSP payloads. This prevents the parent from becoming coupled to provider-specific fields and keeps domain ownership intact.
TypeScript contract definitions (shared library)
I typically create a small shared library for workflow contracts (or generate types from JSON Schema/OpenAPI where appropriate).
// packages/workflow-contracts/src/payment.ts
export interface WorkflowMeta {
correlationId: string;
causationId?: string;
contractVersion: "1.0" | "1.1";
}
export interface PaymentChildRequestV1 {
meta: WorkflowMeta & { contractVersion: "1.0" };
request: {
orderId: string;
customerId: string;
amount: number;
currency: string;
paymentMethodToken: string;
};
}
export interface PaymentChildSuccessV1 {
meta: {
correlationId: string;
contractVersion: "1.0";
};
result: {
authorized: boolean;
authorizationId: string;
processorReference: string;
};
}
export interface PaymentChildBusinessFailureV1 {
meta: {
correlationId: string;
contractVersion: "1.0";
};
result: {
authorized: false;
reasonCode: "RISK_REJECTED" | "INSUFFICIENT_FUNDS" | "PROCESSOR_DECLINED";
processorReference?: string;
};
}
This type layer does not replace runtime validation, but it dramatically improves correctness in parent-child integration code and tests.
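As a sketch of that runtime validation at the child boundary, assuming ajv and a schema packaged alongside the types (import paths are illustrative):
// Validates raw child input against the v1 schema before any domain logic runs.
import Ajv from "ajv";
import paymentRequestSchemaV1 from "workflow-contracts/schemas/payment-request-v1.json";
import { PaymentChildRequestV1 } from "workflow-contracts/payment";

const ajv = new Ajv({ allErrors: true });
const validateRequest = ajv.compile<PaymentChildRequestV1>(paymentRequestSchemaV1);

export function parsePaymentRequest(input: unknown): PaymentChildRequestV1 {
  if (!validateRequest(input)) {
    // A stable error name lets the workflow route this to its ContractError handling.
    const error = new Error(ajv.errorsText(validateRequest.errors));
    error.name = "ContractValidationError";
    throw error;
  }
  return input;
}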
CDK wiring example (parent and child aliases)
This example shows the shape of how I wire aliases and pass alias ARNs to the parent workflow.
import * as cdk from "aws-cdk-lib";
import * as sfn from "aws-cdk-lib/aws-stepfunctions";
import { Construct } from "constructs";
export class OrderWorkflowsStack extends cdk.Stack {
constructor(scope: Construct, id: string, props?: cdk.StackProps) {
super(scope, id, props);
// Assume these are already defined with actual definitions
const paymentChild = new sfn.StateMachine(this, "PaymentWorkflow", {
definitionBody: sfn.DefinitionBody.fromString('{"StartAt":"Done","States":{"Done":{"Type":"Succeed"}}}')
});
const inventoryChild = new sfn.StateMachine(this, "InventoryWorkflow", {
definitionBody: sfn.DefinitionBody.fromString('{"StartAt":"Done","States":{"Done":{"Type":"Succeed"}}}')
});
// Publish immutable versions (illustrative)
const paymentVersion = new sfn.CfnStateMachineVersion(this, "PaymentWorkflowVersion", {
stateMachineArn: paymentChild.stateMachineArn
});
new sfn.CfnStateMachineAlias(this, "PaymentWorkflowProdAlias", {
name: "PROD",
routingConfiguration: [
{
stateMachineVersionArn: paymentVersion.attrStateMachineVersionArn,
weight: 100
}
]
});
const inventoryVersion = new sfn.CfnStateMachineVersion(this, "InventoryWorkflowVersion", {
stateMachineArn: inventoryChild.stateMachineArn
});
new sfn.CfnStateMachineAlias(this, "InventoryWorkflowProdAlias", {
name: "PROD",
routingConfiguration: [
{
stateMachineVersionArn: inventoryVersion.attrStateMachineVersionArn,
weight: 100
}
]
});
// Parent definition would consume these alias ARNs (via substitutions/templating)
new cdk.CfnOutput(this, "PaymentWorkflowAliasArn", {
value: `${paymentChild.stateMachineArn}:PROD`
});
new cdk.CfnOutput(this, "InventoryWorkflowAliasArn", {
value: `${inventoryChild.stateMachineArn}:PROD`
});
// In production, ensure the parent role has least-privilege for nested calls.
}
}
What I pay attention to in deployment pipelines
For child workflows, I want CI/CD to support:
- contract tests
- workflow unit/integration tests
- publish new version
- move alias gradually (canary/linear where appropriate)
- rollback alias quickly if needed
This is where decomposition pays off operationally. I can deploy a Payment child workflow change without touching the Inventory child or the parent orchestrator if the contract remains stable.
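Here is a sketch of the publish-and-shift step using the AWS SDK for JavaScript v3 (error handling and the bake time between weight shifts are omitted; ARNs are placeholders):
import {
  SFNClient,
  PublishStateMachineVersionCommand,
  UpdateStateMachineAliasCommand
} from "@aws-sdk/client-sfn";

const sfn = new SFNClient({});

export async function rolloutChildWorkflow(stateMachineArn: string, aliasArn: string): Promise<void> {
  // 1) Publish an immutable version from the current definition.
  const { stateMachineVersionArn } = await sfn.send(
    new PublishStateMachineVersionCommand({ stateMachineArn })
  );

  // 2) Point the alias at the new version. For canary/linear rollouts,
  //    use two weighted entries and shift weight over several calls.
  await sfn.send(
    new UpdateStateMachineAliasCommand({
      stateMachineAliasArn: aliasArn,
      routingConfiguration: [
        { stateMachineVersionArn: stateMachineVersionArn!, weight: 100 }
      ]
    })
  );
}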
IAM and permissions for nested workflows (important operational detail)
Nested workflows are straightforward conceptually, but the IAM details matter.
When the parent waits synchronously for a child, the integration requires more than states:StartExecution alone: the parent execution role also needs states:DescribeExecution and states:StopExecution on the child's executions, plus access to the EventBridge managed rule the .sync integration uses to learn about completion. I always validate the parent execution role permissions for nested patterns during deployment and in pre-prod tests, because missing permissions can lead to confusing delays or stuck behavior.
I also scope permissions narrowly to the child workflows the parent is actually allowed to call. Decomposition should improve boundaries, not weaken them.
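As a sketch, the parent execution role for the synchronous pattern usually looks something like this (placeholder region and account; the second StartExecution resource entry covers alias- and version-qualified ARNs):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "states:StartExecution",
      "Resource": [
        "arn:aws:states:ap-southeast-2:111122223333:stateMachine:PaymentWorkflow",
        "arn:aws:states:ap-southeast-2:111122223333:stateMachine:PaymentWorkflow:*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": ["states:DescribeExecution", "states:StopExecution"],
      "Resource": "arn:aws:states:ap-southeast-2:111122223333:execution:PaymentWorkflow:*"
    },
    {
      "Effect": "Allow",
      "Action": ["events:PutTargets", "events:PutRule", "events:DescribeRule"],
      "Resource": "arn:aws:events:ap-southeast-2:111122223333:rule/StepFunctionsGetEventsForStepFunctionsExecutionRule"
    }
  ]
}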
Observability after decomposition
A common concern is that decomposition makes tracing harder because the work is spread across multiple executions.
In practice, I have found the opposite to be true when I propagate correlation metadata correctly.
What I propagate into every child
- correlationId
- causationId (usually the parent execution ID)
- contract version
- domain entity ID (for example, orderId)
What I log in each child
- child workflow name and alias/version (where possible)
- start/end timestamps
- business outcome
- retry counts / terminal error classification
This makes it much easier to answer:
- Which child failed?
- Was it a contract issue or domain issue?
- Which version of the child handled the request?
- Did rollback change the outcome?
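Here is a sketch of the structured completion log I have each child emit (field names are illustrative; how you resolve the alias/version depends on your setup):
{
  "level": "INFO",
  "message": "child workflow completed",
  "workflow": "PaymentProcessingWorkflow",
  "aliasVersion": "PROD/11",
  "correlationId": "corr-123",
  "causationId": "exec-parent-abc",
  "contractVersion": "1.0",
  "orderId": "ORD-100045",
  "outcome": "AUTHORIZED",
  "retryCount": 1,
  "durationMs": 1840
}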
How to split by domain and subprocess in practice
When teams ask me “where exactly should we split?”, I usually run a quick decomposition workshop with these prompts:
Prompt 1: Which parts change for different business reasons?
If payment changes because of PSP behavior and inventory changes because of warehouse logic, those belong in different subprocesses.
Prompt 2: Which parts require different failure semantics?
If notification failure should not fail order acceptance, that is a strong candidate for decoupling from the parent critical path.
Prompt 3: Which parts are reusable?
If onboarding, checkout, and subscription renewal all need the same payment authorization flow, that is a candidate child workflow.
Prompt 4: Which parts have different owners/on-call teams?
Team boundaries are not the only factor, but they matter operationally. A child workflow with clear ownership improves support and release confidence.
Prompt 5: Which parts make the parent harder to read than the business process itself?
That is usually the part I extract first.
Migration strategy: from one monolith workflow to decomposed workflows safely
I do not recommend a big-bang rewrite. I prefer incremental extraction.
Step 1: Identify one extraction candidate
Pick a subprocess with clear boundaries (for example, Payments).
Step 2: Define the contract before extracting
Write:
- child input schema/type
- child output schema/type
- failure behavior
- timeouts and retries
Step 3: Extract the logic into a child workflow
Keep behavior equivalent first. Avoid redesigning everything in the same change.
Step 4: Update parent to call child via alias
Use a stable alias (for example, PROD) so future child changes do not require parent definition changes.
Step 5: Add compatibility and regression tests
Test:
- happy path
- business failure path
- timeout/retry path
- malformed contract path
Step 6: Repeat for the next extraction
After 1-2 successful extractions, teams usually become much more comfortable with the pattern.
What not to do
I have seen a few anti-patterns appear during decomposition efforts.
Anti-pattern 1: "Micro-workflow everything"
Creating a child workflow for every tiny step adds ceremony without improving maintainability.
Anti-pattern 2: Passing the entire parent payload into every child
This preserves hidden coupling and makes contracts meaningless.
Anti-pattern 3: Parent depends on child internals
If the parent reads deeply nested provider-specific details returned by a child, you have recreated coupling through outputs.
Anti-pattern 4: No versioning strategy
Without aliases/versions and contract discipline, decomposition can increase operational risk instead of reducing it.
Anti-pattern 5: Decomposition without ownership
If nobody owns a child workflow end-to-end, incidents become harder, not easier.
Final thoughts
A Step Functions workflow becoming “too large” is not the real problem. The real problem is when workflow boundaries stop matching business and domain boundaries.
When that happens, decomposition is not about making the diagram prettier. It is about restoring:
- change safety
- testability
- ownership
- observability
- architectural clarity
The pattern I keep coming back to is simple:
- Parent workflow for orchestration decisions and business progression
- Child workflows for domain-owned subprocesses
- Explicit contracts for inputs/outputs
- Versioned deployments via immutable versions + aliases
- Strong observability metadata across execution boundaries
That is how I keep Step Functions as an orchestration asset, rather than letting it become a serverless monolith.