Dhananjay Lakkawar

Posted on Jun 10

The AI "Pause Button": Human-in-the-Loop Workflows with AWS Step Functions

#ai #automation #aws #tutorial


Introduction:

There is a terrifying moment in every AI startup's lifecycle.

It is the moment the engineering team realizes that giving an LLM the ability
to draft emails is vastly different from giving an LLM the permission to
execute code, drop database tables, or refund customer credit cards.

Founders and CTOs usually face a false dichotomy:

Fully Autonomous: Give the AI agent root access and pray it doesn't hallucinate a massive financial error.
Powerless AI: Strip the agent of its execution capabilities, reducing it back to a glorified read-only chatbot.

There is a third path. You can build powerful, action-oriented AI agents
without risking your infrastructure or bank account.

The secret is inserting a deterministic "Pause Button" into your
non-deterministic AI workflows.

Why Does Traditional Compute Fail for AI Pause Patterns?

The instinct is to reach for what you know: a long-running Python script on
EC2, a Kubernetes pod with a blocking while loop, a Fargate container that
holds workflow state in memory and waits for a webhook.

All of these approaches technically work. They also burn money at rest.

When an AI workflow is waiting for a human to click a Slack button, the server
sits idle — consuming memory and CPU, billing you by the second. If your
manager takes three days to check their messages, you pay for 72 hours of idle
compute. AWS Lambda is no better for this use case: it has a hard
15-minute execution timeout, so it literally cannot wait for a human.

The solution is to stop conflating compute with state. Your workflow state
does not need to live in a running process. It can live in AWS Step Functions.

AWS Step Functions Standard Workflows can hold execution state for up to
365 days and bill per state transition rather than for execution duration. A
workflow paused for 72 hours waiting for human input costs $0.00 in idle compute.

How Does .waitForTaskToken Actually Work?

AWS Step Functions has a native integration called .waitForTaskToken. When an
execution reaches a state configured with this resource suffix, three things
happen in sequence:

Step Functions generates a unique cryptographic Task Token for that execution instance.
It passes the token to a downstream service (Lambda, SNS, SQS) in the payload you define in the state's Parameters.
The execution pauses completely, with no compute of yours running, and waits for something external to call back with that exact token. The only clock still running is the state's own timeout, if you set one.

The execution resumes only when you call one of two AWS SDK methods:

sfn.send_task_success(taskToken=token, output=...) — resumes the workflow and passes data forward.
sfn.send_task_failure(taskToken=token, error=..., cause=...) — routes execution to a failure/catch branch.

That callback can come from anywhere that can make an HTTP request: a Slack
button, an email link, a mobile app, an internal admin dashboard. Step
Functions doesn't care about the source — only the token.
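
To make the semantics concrete, here is a toy, in-memory model of the callback pattern in plain Python. It is not the Step Functions API: the PausedExecutions class is invented for illustration, and it mirrors only the behaviour described above (a token is issued, the execution is parked, and only a call carrying that exact token resumes it, once).

```python
import secrets


class PausedExecutions:
    """Toy stand-in for the service side of the callback pattern."""

    def __init__(self):
        self._waiting = {}   # token -> execution name
        self.finished = {}   # execution name -> (status, detail)

    def pause(self, execution_name):
        # Issue an unguessable token and park the execution under it.
        token = secrets.token_urlsafe(32)
        self._waiting[token] = execution_name
        return token

    def send_task_success(self, token, output):
        name = self._take(token)
        self.finished[name] = ("SUCCEEDED", output)

    def send_task_failure(self, token, error, cause):
        name = self._take(token)
        self.finished[name] = ("FAILED", f"{error}: {cause}")

    def _take(self, token):
        # A token resolves exactly one paused task, exactly once.
        if token not in self._waiting:
            raise KeyError("unknown or already-used task token")
        return self._waiting.pop(token)


engine = PausedExecutions()
token = engine.pause("refund-1234")
# ... hours or days pass; nothing is running in the meantime ...
engine.send_task_success(token, {"decision": "approved"})
print(engine.finished["refund-1234"])
```

The model also shows why the token has to be protected: whoever presents it first decides the outcome, and nothing else about the caller is checked.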

The Architecture: Four Deterministic Steps

Here is the exact structure for a Refund Agent that requires human approval
before executing a financial transaction.

Step	Component	What Happens	Compute Running?
1. Intercept	AI Agent + Step Functions	Agent outputs `{"action": "issue_refund", "amount": 500, "risk_level": "HIGH"}`. Step Functions evaluates the risk classification and routes HIGH-risk actions to the wait state.	Briefly
2. Token Generation	Step Functions + Lambda	Step Functions generates a task token. Lambda stores the token in DynamoDB, formats a Slack approval message, and POSTs to the webhook. Lambda then terminates.	Lambda spins down
3. Human Review	Step Functions (paused)	Workflow is frozen. Zero Lambda invocations. Zero containers. Execution state is persisted by AWS across multiple Availability Zones.	No
4. Resumption	API Gateway + Lambda + Step Functions	Human clicks Approve in Slack. API Gateway triggers the callback Lambda. Lambda calls `send_task_success`. Step Functions wakes up and proceeds to execute the Stripe refund.	Briefly
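
Step 1 only works if the agent's output is well-formed and its risk_level matches exactly what the workflow routes on. One way to reduce that dependency is to validate and normalise the output before starting the execution. The sketch below is plain Python; validate_agent_action, the action allow-list, and the fail-closed rule (anything not clearly LOW is treated as HIGH) are assumptions for illustration, not part of the original design.

```python
import json

ALLOWED_ACTIONS = {"issue_refund"}     # hypothetical allow-list
KNOWN_RISK_LEVELS = {"LOW", "HIGH"}    # assumed vocabulary


def validate_agent_action(raw: str) -> dict:
    """Parse and normalise the agent's JSON before starting the workflow."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("agent output must be a JSON object")

    action = data.get("action")
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"action not allowed: {action!r}")

    amount = data.get("amount")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, (int, float)) or amount <= 0:
        raise ValueError("amount must be a positive number")

    risk = str(data.get("risk_level", "")).strip().upper()
    if risk not in KNOWN_RISK_LEVELS:
        risk = "HIGH"   # fail closed: unknown risk goes to a human

    return {
        "action": action,
        "amount": amount,
        "risk_level": risk,
        # Supply a default so a "$.reason" path always resolves downstream.
        "reason": str(data.get("reason", "not provided")),
    }


print(validate_agent_action(
    '{"action": "issue_refund", "amount": 500, "risk_level": "high"}'
))
```

Here a lowercase "high" is normalised to "HIGH", so a casing slip by the model does not quietly bypass the approval gate.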

The State Machine Definition

The .waitForTaskToken resource suffix instructs Step Functions to pause and
wait for a callback. $$.Task.Token is a path into the execution's context
object (not an intrinsic function); it injects the generated token into the
downstream Lambda's event payload.

{
  "Comment": "AI Refund Agent with Human Approval Gate",
  "StartAt": "EvaluateRisk",
  "States": {
    "EvaluateRisk": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.risk_level",
          "StringEquals": "HIGH",
          "Next": "WaitForHumanApproval"
        }
      ],
      "Default": "ExecuteAction"
    },
    "WaitForHumanApproval": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
      "Parameters": {
        "FunctionName": "notify-approver",
        "Payload": {
          "task_token.$": "$$.Task.Token",
          "action.$":     "$.action",
          "amount.$":     "$.amount",
          "reason.$":     "$.reason"
        }
      },
      "TimeoutSeconds": 172800,
      "ResultPath": "$.approval",
      "Catch": [
        {
          "ErrorEquals": ["States.Timeout"],
          "ResultPath": "$.error",
          "Next": "AutoDeny"
        }
      ],
      "Next": "ExecuteAction"
    },
    "ExecuteAction": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "execute-refund",
        "Payload.$": "$"
      },
      "End": true
    },
    "AutoDeny": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "notify-denial",
        "Payload.$": "$"
      },
      "End": true
    }
  }
}
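
One thing worth checking in a definition like this is what the state after the approval gate actually receives. By default, a Task state's result replaces its input, so the output sent with the callback would become the entire input of ExecuteAction unless a ResultPath is set on the waiting state. The toy function below mimics that rule for simple "$.name" paths only. It illustrates the data flow and is not Step Functions code; apply_result_path is a made-up name.

```python
def apply_result_path(state_input: dict, result: dict, result_path: str = "$") -> dict:
    """Mimic how a Task result is combined with the state input.

    Supports only "$" (replace everything) and "$.name" (attach under a key).
    """
    if result_path == "$":
        return result
    if not result_path.startswith("$.") or "." in result_path[2:]:
        raise ValueError("this sketch handles only '$' and '$.name'")
    merged = dict(state_input)
    merged[result_path[2:]] = result
    return merged


refund = {"action": "issue_refund", "amount": 500, "reason": "duplicate charge"}
callback_output = {"decision": "approved"}

# Default behaviour: the refund details are gone.
print(apply_result_path(refund, callback_output))

# With a ResultPath: the refund details survive, plus the decision.
print(apply_result_path(refund, callback_output, "$.approval"))
```

With "$.approval", the refund Lambda still sees the action and amount alongside the reviewer's decision.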

Lambda 1: The Notifier

This function receives the task token, stores it in DynamoDB (never in a URL
parameter), sends the Slack message, and terminates. The Step Function remains
paused after this function exits — there is no process keeping it alive.

# notify_approver.py
import json, os, time, uuid
import boto3, urllib3

dynamodb = boto3.resource('dynamodb')
table    = dynamodb.Table(os.environ['SESSIONS_TABLE'])

def lambda_handler(event, context):
    session_id = str(uuid.uuid4())

    # Store token in DynamoDB — never expose it directly in Slack URLs.
    # Slack message URLs land in server logs, browser history, and
    # CloudFront access logs. A session_id reference is safe; the raw
    # token is not.
    table.put_item(Item={
        'session_id': session_id,
        'task_token': event['task_token'],
        'ttl':        int(time.time()) + 172800   # matches TimeoutSeconds
    })

    callback = os.environ['CALLBACK_URL']
    payload  = {
        "text": (
            f":rotating_light: *AI Action Approval Required*\n"
            f"*Action:* {event['action']} — ${event['amount']}\n"
            f"*Reason:* {event['reason']}"
        ),
        "attachments": [{
            "actions": [
                {"type": "button", "text": "Approve", "style": "primary",
                 "url": f"{callback}?session={session_id}&decision=approve"},
                {"type": "button", "text": "Deny",    "style": "danger",
                 "url": f"{callback}?session={session_id}&decision=deny"}
            ]
        }]
    }

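    resp = urllib3.PoolManager().request(
        'POST', os.environ['SLACK_WEBHOOK_URL'],
        body=json.dumps(payload),
        headers={'Content-Type': 'application/json'}
    )
    # Fail loudly if Slack rejected the message, rather than leaving the
    # workflow waiting for an approval request nobody received.
    if resp.status >= 300:
        raise RuntimeError(f"Slack webhook returned HTTP {resp.status}")
    # Lambda exits here. Compute drops to zero. Step Function waits.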

Lambda 2: The Callback

Called by API Gateway when the human clicks Approve or Deny. Fetches the real
token from DynamoDB using the session reference, then calls the Step Functions
SDK to resume or fail the paused execution.

# process_callback.py
import json, os
import boto3

sfn      = boto3.client('stepfunctions')
dynamodb = boto3.resource('dynamodb')
table    = dynamodb.Table(os.environ['SESSIONS_TABLE'])

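def lambda_handler(event, context):
    # queryStringParameters is None when the request has no query string.
    params     = event.get('queryStringParameters') or {}
    session_id = params.get('session')
    decision   = params.get('decision')
    if not session_id or decision not in ('approve', 'deny'):
        return {"statusCode": 400, "body": json.dumps({"status": "bad_request"})}

    item = table.get_item(Key={'session_id': session_id}).get('Item')
    if item is None:
        # Unknown session, or the TTL has already removed the record.
        return {"statusCode": 404, "body": json.dumps({"status": "unknown_session"})}
    task_token = item['task_token']

    try:
        if decision == 'approve':
            sfn.send_task_success(
                taskToken=task_token,
                output=json.dumps({"decision": "approved"})
            )
        else:
            sfn.send_task_failure(
                taskToken=task_token,
                error="HumanDenied",
                cause="Reviewer denied the action via Slack."
            )
    except (sfn.exceptions.TaskTimedOut, sfn.exceptions.TaskDoesNotExist):
        # The task already timed out or has already been answered.
        return {"statusCode": 410, "body": json.dumps({"status": "expired"})}

    return {"statusCode": 200, "body": json.dumps({"status": "recorded"})}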

What Does It Actually Cost to Wait?

The scenario: your application processes 10,000 high-risk AI decisions per
month. Average human review time is 4 hours.

	Always-On Fargate	Step Functions (.waitForTaskToken)
Idle wait cost	Billed continuously	$0.00
Compute-hours at scale	~40,000 hrs/month	0 hrs
Estimated monthly cost	~$1,500	$1.35
State durability	Lost if container OOMs or AZ fails	Multi-AZ, managed by AWS
Max pause duration	Unlimited (billing accumulates)	365 days
Failure mode	Container crash silently drops state	`States.Timeout` routes to catch branch

The $1.35 figure breaks down directly from the Step Functions pricing page:

State transitions: Standard Workflows charge $0.025 per 1,000 transitions. A basic approval workflow uses ~5 transitions per execution: (10,000 × 5) / 1,000 × $0.025 = $1.25
API Gateway + Lambda callback: ~$0.10 at this volume.
Idle wait time: $0.00.
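
The same arithmetic as a small Python sketch, so the numbers can be re-run for a different volume. The function names are illustrative, the prices are the ones quoted above, and the ~$0.10 callback cost is carried over as a flat figure rather than computed. Check current pricing for your region before relying on the result.

```python
def idle_compute_hours(executions: int, avg_wait_hours: float) -> float:
    """Hours an always-on worker would spend waiting, one wait per execution."""
    return executions * avg_wait_hours


def step_functions_monthly_cost(executions: int,
                                transitions_per_execution: int = 5,
                                price_per_1000_transitions: float = 0.025,
                                callback_cost: float = 0.10) -> float:
    """Transition charges plus the flat callback estimate; waiting adds nothing."""
    transition_cost = (executions * transitions_per_execution / 1000
                       * price_per_1000_transitions)
    return round(transition_cost + callback_cost, 2)


print(idle_compute_hours(10_000, 4))        # 40000
print(step_functions_monthly_cost(10_000))  # 1.35
```

Note that the average review time does not appear in the Step Functions cost at all, which is the point of the pattern.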

Decoupling execution state from compute reduces infrastructure cost by
99.9% at 10,000 decisions/month — while adding multi-AZ state durability
that a single Fargate container cannot match. If an Availability Zone fails
while your workflow is paused, Step Functions maintains the execution state
across the region.

What Are the Production Tradeoffs?

This pattern has three sharp edges. Know them before you go live.

1. The Task Token Is a Security Credential

The Task Token is the literal key to your workflow. Anyone holding it can call
send_task_success and resume your agent's execution chain — including
executing the refund, the code deployment, or the database migration you were
trying to gate.

What to do:

Never pass the raw token through a URL query parameter. URLs land in CloudFront access logs, Nginx logs, browser history, and Slack's own message storage.
Store the token in DynamoDB with a TTL. Pass only a short-lived session_id through the URL. Your callback Lambda retrieves the actual token from DynamoDB after the request arrives.
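For stronger guarantees: put authentication in front of the callback route, for example an IAM-authenticated API Gateway endpoint. Link buttons like the ones in this example open the URL in the reviewer's browser, so the request does not come from Slack; if you switch to Slack interactive buttons, where Slack itself POSTs to your endpoint, also validate the Slack request signature in your callback Lambda.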
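
As one possible hardening step on top of the session reference, the callback link can carry an expiry and an HMAC signature, so that a guessed or tampered session value is rejected before any DynamoDB lookup. This is a sketch using only the Python standard library. The function names, the extra URL parameters it implies (an expiry and a signature), and the idea of loading the secret from a secrets store are assumptions, not part of the original design.

```python
import hashlib
import hmac
import time
from typing import Optional


def sign_link(secret: bytes, session_id: str, decision: str, expires_at: int) -> str:
    """Return a hex HMAC-SHA256 over the fields that appear in the URL."""
    message = f"{session_id}|{decision}|{expires_at}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_link(secret: bytes, session_id: str, decision: str,
                expires_at: int, signature: str,
                now: Optional[float] = None) -> bool:
    """Check expiry first, then compare signatures in constant time."""
    now = time.time() if now is None else now
    if now > expires_at:
        return False
    expected = sign_link(secret, session_id, decision, expires_at)
    return hmac.compare_digest(expected, signature)


secret = b"demo-secret"                 # placeholder; load from a secrets store
issued_at = 1_700_000_000
expires_at = issued_at + 172800         # same 48-hour window as the workflow
sig = sign_link(secret, "abc", "approve", expires_at)

print(verify_link(secret, "abc", "approve", expires_at, sig, now=issued_at))   # True
print(verify_link(secret, "abc", "deny", expires_at, sig, now=issued_at))      # False
print(verify_link(secret, "abc", "approve", expires_at, sig, now=expires_at + 1))  # False
```

Because the decision is part of the signed message, an approve link cannot be edited into a deny link or the reverse. A signature proves only that your system issued the link, not who clicked it, so it complements authentication rather than replacing it.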

2. You Must Use Standard Workflows

Step Functions Express Workflows are designed for high-throughput IoT and
event-processing pipelines. They have a hard 5-minute execution limit and do
not support the .waitForTaskToken callback pattern.

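If you try to configure this on an Express Workflow, expect the definition to
be rejected when the state machine is created or updated, because the
integration pattern is not supported there. Even if it were accepted, the
5-minute limit would end the execution long before most humans respond: your
reviewer would approve the action for a workflow that has already moved on
without them.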
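
A cheap guard is to check the definition in CI before it is deployed. The sketch below walks a state machine definition (already loaded as a Python dict) and flags any use of the callback suffix when the intended workflow type is Express. Both function names are hypothetical, and the check is purely structural: it looks at Resource strings and nothing else.

```python
def uses_callback_pattern(node) -> bool:
    """Return True if any Resource in the definition ends with .waitForTaskToken."""
    if isinstance(node, dict):
        resource = node.get("Resource")
        if isinstance(resource, str) and resource.endswith(".waitForTaskToken"):
            return True
        # Recurse so states nested in Parallel or Map branches are covered too.
        return any(uses_callback_pattern(value) for value in node.values())
    if isinstance(node, list):
        return any(uses_callback_pattern(value) for value in node)
    return False


def check_workflow_type(definition: dict, workflow_type: str) -> None:
    """Raise if a callback-pattern definition is about to be deployed as Express."""
    if workflow_type.upper() == "EXPRESS" and uses_callback_pattern(definition):
        raise ValueError("callback pattern requires a STANDARD workflow")


definition = {
    "StartAt": "WaitForHumanApproval",
    "States": {
        "WaitForHumanApproval": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
            "End": True,
        }
    },
}

check_workflow_type(definition, "STANDARD")   # passes silently
print(uses_callback_pattern(definition))      # True
```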

3. Always Configure TimeoutSeconds

Without a timeout, a pending execution sits in your AWS console until the
365-day platform ceiling triggers. The example above uses 172800 seconds
(48 hours), which matches the DynamoDB TTL on the stored token.

When the timeout fires, States.Timeout is thrown. The Catch block routes
to an AutoDeny state that notifies the customer their request was
automatically declined. A stale, hanging workflow is always worse than a
deterministic denial message.

Build Fast. Govern Carefully.

You do not have to choose between moving fast with AI and protecting your
business.

By treating Large Language Models as non-deterministic action generators
and AWS Step Functions as the deterministic execution gate, you get both:

The AI drafts operational plans and classifies its own risk level.
Humans retain final execution authority over irreversible actions.
The infrastructure between those two moments costs essentially nothing.

The false dichotomy between "autonomous AI" and "powerless chatbot" is an
infrastructure problem disguised as a philosophical one. The
.waitForTaskToken pattern resolves it in under 100 lines of code.

Build the agent. But keep your finger on the pause button.

How is your team handling high-risk LLM executions? Are you using Step
Functions, a custom approval queue, or something else entirely? Drop it in the comments.
