Martin Oehlert

Posted on Jul 3

Human Interaction and External Events: Approval Workflows

#azure #dotnet #azurefunctions #serverless

An expense report needs a manager's sign-off before the reimbursement goes out, and that manager might click approve in five minutes or come back from a trip in five days. The question Parts 1 and 2 never had to answer is how the workflow waits that long: a chain or a fan-out runs to completion in seconds, but an approval step has to suspend on a person and stay suspended without holding a thread or billing you for the idle time in between. Durable Functions answers the waiting half with a single await that costs no compute while it is parked. The half most guides skip is the other one: the runtime hands you an instance ID when the workflow starts, and finding that one paused instance again when the approval finally arrives is your code's job, not the platform's.

WaitForExternalEvent pattern

The mechanic at the center of every approval workflow is one line: await context.WaitForExternalEvent<T>(eventName). It suspends the orchestration until something outside the function raises an event with that name, and the typed payload it returns carries whatever the approver decided.

Every code sample below is from the companion sample (isolated worker, .NET 10). Here is the expense-approval orchestrator: it parks on the wait, then hands the decision to an activity that settles or rejects the report.

using Microsoft.Azure.Functions.Worker;
using Microsoft.DurableTask;

public record ExpenseReport(string ReportId, string Employee, decimal Amount, string Category);
public record ApprovalDecision(string DecisionId, bool Approved, string Approver, string? Note);
public record SettlementInput(ExpenseReport Report, ApprovalDecision Decision);

public static class ExpenseApprovalOrchestrator
{
    [Function(nameof(ExpenseApprovalOrchestrator))]
    public static async Task<string> Run(
        [OrchestrationTrigger] TaskOrchestrationContext context)
    {
        var report = context.GetInput<ExpenseReport>()!;

        // Suspends here until an "ApprovalDecision" event is raised for this instance.
        // No thread is held and no compute is billed while the orchestration waits.
        ApprovalDecision decision =
            await context.WaitForExternalEvent<ApprovalDecision>("ApprovalDecision");

        return await context.CallActivityAsync<string>(
            nameof(SettleExpenseActivity), new SettlementInput(report, decision));
    }
}

The WaitForExternalEvent<ApprovalDecision>("ApprovalDecision") call is the whole pause. When the orchestrator reaches it, the runtime checkpoints the instance and unloads it; execution does not resume until an event named ApprovalDecision is raised against this instance ID, at which point the raised JSON is deserialized into an ApprovalDecision and the await returns it. The first argument is the event name, the contract both ends agree on, and matching is case-insensitive: a wait on "ApprovalDecision" is satisfied by an event raised as "approvaldecision". The type argument is the payload shape; if the raised JSON cannot be converted to it, the wait throws rather than returning a half-filled object.

The activity is an ordinary function that acts on the decision.

public static class SettleExpenseActivity
{
    [Function(nameof(SettleExpenseActivity))]
    public static string Run([ActivityTrigger] SettlementInput input)
    {
        var (report, decision) = input;
        if (!decision.Approved)
            return $"Report {report.ReportId} rejected by {decision.Approver}.";

        // Real work belongs here: queue the reimbursement, post to the ledger, notify the employee.
        return $"Report {report.ReportId} approved by {decision.Approver}; {report.Amount:C} scheduled for payment.";
    }
}

What makes this safe to wait on for days is that the suspension costs no compute. Once the orchestrator yields at the wait, there is no thread blocked, no instance kept warm, nothing to bill. On the Consumption and Flex Consumption plans you pay for actual execution, not idle wait time, so an approval parked for a week incurs no compute charge until the event arrives. The wait is also replay-safe in exactly the sense Part 1 set up: the pending event is part of the orchestration's durable state, so the worker can be stopped, scaled in, or recycled, and the instance is reawakened when the event shows up. An event that arrives early is not a problem either; if it is raised before the orchestrator has reached the wait, it is buffered in the instance state and dispatched the moment the wait is reached, so a fast approver who beats the orchestration to the wait line does not lose their decision.

One honest gotcha to design for up front: event delivery on the Azure Storage backend is at-least-once, so a restart or scale event can deliver the same approval twice. That is why the payload here carries a DecisionId. If the downstream activity is not naturally idempotent, dedupe on that ID so a duplicate delivery does not pay the same expense report twice. (The MSSQL provider consumes events transactionally and does not produce duplicates, but coding for at-least-once keeps the orchestrator portable across backends.)
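One way to picture that dedupe is a guard that runs the payment side effect at most once per DecisionId. This is a minimal sketch and not part of the companion sample: SettlementLedger and SettleOnce are hypothetical names, and the in-memory dictionary is only a stand-in for a durable store with a unique key on the decision ID, since activities can run on different workers and an in-process cache would not survive that.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: remembers the outcome of each decision by its ID.
// The dictionary stands in for a durable table with a unique key on DecisionId.
public sealed class SettlementLedger
{
    private readonly Dictionary<string, string> outcomes = new();
    private readonly object gate = new();

    // Runs `settle` only the first time a decision ID is seen; a duplicate
    // delivery gets the outcome recorded by the first call instead.
    public string SettleOnce(string decisionId, Func<string> settle)
    {
        lock (gate)
        {
            if (outcomes.TryGetValue(decisionId, out string? existing))
                return existing;

            string outcome = settle();
            outcomes[decisionId] = outcome;
            return outcome;
        }
    }
}
```

Inside SettleExpenseActivity the payment call would be wrapped as ledger.SettleOnce(decision.DecisionId, () => ...), so a second delivery of the same decision returns the recorded result without paying again.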

This wait is indefinite: nothing here ever gives up. A real approval workflow needs a deadline so a report that no one ever touches does not sit parked forever, and bounding the wait with a durable timer is the next section.

Approval endpoint design

The orchestrator only knows how to wait. Two HTTP endpoints surround it: one to start the workflow and one to deliver the approver's answer. The start endpoint is the async HTTP pattern from Part 2: schedule the orchestration and hand back a status URL without blocking.

using Microsoft.Azure.Functions.Worker;
using Microsoft.Azure.Functions.Worker.Http;
using Microsoft.DurableTask.Client;

public static class StartExpenseApprovalClient
{
    [Function(nameof(StartExpenseApprovalClient))]
    public static async Task<HttpResponseData> Run(
        [HttpTrigger(AuthorizationLevel.Function, "post", Route = "expenses")]
            HttpRequestData req,
        [DurableClient] DurableTaskClient client)
    {
        ExpenseReport report = await req.ReadFromJsonAsync<ExpenseReport>()
            ?? throw new InvalidOperationException("Expense report body is required.");

        string instanceId = await client.ScheduleNewOrchestrationInstanceAsync(
            nameof(ExpenseApprovalOrchestrator), report);

        // 202 + management URLs; the orchestration is scheduled and parks on its wait once it runs.
        return await client.CreateCheckStatusResponseAsync(req, instanceId);
    }
}

ScheduleNewOrchestrationInstanceAsync enqueues the orchestration and returns its instance ID without waiting for it to run, and CreateCheckStatusResponseAsync builds the HTTP 202 response: a Location header pointing at the status-query endpoint and a JSON body of management URLs for the instance. The orchestration may still be Pending when the caller receives that 202; once a worker picks it up, it runs straight to its WaitForExternalEvent and parks there. Hold on to that returned instanceId; it is the only handle that reaches the parked instance, and the approval endpoint is useless without it.

The approval endpoint is the other half. It reads the manager's decision and raises the event the orchestrator is blocked on.

public record ApprovalRequest(bool Approved, string Approver, string? Note);

public static class SubmitApprovalClient
{
    [Function(nameof(SubmitApprovalClient))]
    public static async Task<HttpResponseData> Run(
        [HttpTrigger(AuthorizationLevel.Function, "post", Route = "expenses/{instanceId}/decision")]
            HttpRequestData req,
        string instanceId,
        [DurableClient] DurableTaskClient client)
    {
        ApprovalRequest body = await req.ReadFromJsonAsync<ApprovalRequest>()
            ?? throw new InvalidOperationException("Approval body is required.");

        var decision = new ApprovalDecision(
            DecisionId: Guid.NewGuid().ToString("N"),
            Approved: body.Approved,
            Approver: body.Approver,
            Note: body.Note);

        // Raises the event the orchestrator is waiting on. Returns when the event is
        // enqueued, not when the orchestrator has consumed it.
        await client.RaiseEventAsync(instanceId, "ApprovalDecision", decision);

        return await client.CreateCheckStatusResponseAsync(req, instanceId);
    }
}

RaiseEventAsync(instanceId, "ApprovalDecision", decision) is the mirror image of the wait. The event name has to match the orchestrator's wait name (again, case-insensitively), and the decision object is JSON-serialized and deserialized into the orchestrator's ApprovalDecision on the other side. The returned task completes when the event is enqueued, not when the orchestrator wakes up and consumes it, so a 202 here means "your decision is on its way," not "the report is settled." The caller polls the status URL to see the workflow reach Completed. Generating the DecisionId server-side is what makes the earlier dedupe work: it gives every raised decision a stable identity even if the platform delivers it twice.

There are two ways to raise this event, and the choice is the reason this endpoint exists at all. The 202 body from the start call already contains a sendEventPostUri, the built-in raise-event API: a caller can POST the decision straight to .../instances/{instanceId}/raiseEvent/ApprovalDecision with no code from you. It is the quickest path, and it returns useful status codes (404 for an unknown instance, 410 for one that already finished). What it does not give you is a place to put your own concerns. The custom endpoint above exists so you can own the route, authentication, validation, and audit: check that this approver is allowed to sign off this report, record who decided and when, reject a malformed body before it ever reaches the orchestration. If none of that matters for your case, the built-in webhook is less to maintain.

The honest gotcha lives in the failure mode of the SDK call. A bare RaiseEventAsync to a completed or non-existent instance is silently discarded: no exception, no error, nothing. Raise ApprovalDecision against a stale or mistyped instance ID and the call returns happily while the event evaporates, and the approver sees a success they did not get. If you need to tell an approver that the workflow they are signing off no longer exists, pre-check with GetInstanceAsync and inspect the runtime status before raising, or use the built-in HTTP API and surface its 404 and 410 to the caller. The silent path is convenient until the instance ID is wrong.
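The pre-check can be sketched as a small guard that maps the instance's state onto the same status codes the built-in API returns. ApprovalGuard and RejectionFor are hypothetical names introduced for illustration; GetInstanceAsync and the runtime status values are the ones this article already relies on from Microsoft.DurableTask.Client. Treat it as narrowing the silent-discard window rather than closing it, because the instance can still finish between the check and the raise.

```csharp
using System.Net;
using Microsoft.DurableTask.Client;

// Hypothetical helper: decides whether raising an event still makes sense.
public static class ApprovalGuard
{
    // Returns null when the instance can still receive the event, otherwise
    // the HTTP status the approval endpoint should send back.
    public static HttpStatusCode? RejectionFor(OrchestrationMetadata? instance) => instance switch
    {
        null => HttpStatusCode.NotFound,                    // unknown or mistyped instance ID
        { RuntimeStatus: OrchestrationRuntimeStatus.Completed
                      or OrchestrationRuntimeStatus.Failed
                      or OrchestrationRuntimeStatus.Terminated } => HttpStatusCode.Gone,
        _ => null                                           // Pending, Running, Suspended
    };
}

// Usage inside SubmitApprovalClient.Run, before RaiseEventAsync:
//
//   OrchestrationMetadata? instance = await client.GetInstanceAsync(instanceId);
//   if (ApprovalGuard.RejectionFor(instance) is HttpStatusCode rejection)
//       return req.CreateResponse(rejection);
```

Mirroring 404 and 410 keeps the custom endpoint's contract the same as the built-in webhook's, so callers do not need two sets of error handling.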

Which raises the question this endpoint quietly assumes away: it takes instanceId from the route as if the caller already knows it. Where that ID comes from, how the approval link in the manager's email ends up carrying the right one, and how you avoid losing it, is its own problem, and the section on instance ID storage is where it gets solved.

Timeout and escalation

A wait that never gives up is a workflow you cannot operate. The previous section left the orchestration parked on WaitForExternalEvent with no exit; a report that no manager ever touches stays Running until someone terminates it by hand. The fix is to race the event against a durable timer and let whichever finishes first decide the outcome.

Start the timer and the wait, then await Task.WhenAny on the pair. The orchestration wakes on the first of the two to complete.

public static class ExpenseApprovalOrchestrator
{
    [Function(nameof(ExpenseApprovalOrchestrator))]
    public static async Task<string> Run(
        [OrchestrationTrigger] TaskOrchestrationContext context)
    {
        var report = context.GetInput<ExpenseReport>()!;

        // Deadline comes off the orchestration clock, NOT DateTime.UtcNow. Part 1's
        // determinism rule: every replay must compute the same instant, and
        // CurrentUtcDateTime is frozen to the original execution time on replay.
        DateTime deadline = context.CurrentUtcDateTime.AddDays(3);

        using var cts = new CancellationTokenSource();
        Task<ApprovalDecision> approvalTask =
            context.WaitForExternalEvent<ApprovalDecision>("ApprovalDecision");
        Task timeoutTask = context.CreateTimer(deadline, cts.Token);

        Task winner = await Task.WhenAny(approvalTask, timeoutTask);

        if (winner == approvalTask)
        {
            // The approval landed first. Cancel the timer before moving on.
            cts.Cancel();
            return await context.CallActivityAsync<string>(
                nameof(SettleExpenseActivity),
                new SettlementInput(report, approvalTask.Result));
        }

        // The timer won: nobody decided in time. Fail closed by auto-rejecting.
        var timedOut = new ApprovalDecision(
            DecisionId: $"timeout-{report.ReportId}",
            Approved: false,
            Approver: "system (timeout)",
            Note: $"No decision by {deadline:o}.");
        return await context.CallActivityAsync<string>(
            nameof(SettleExpenseActivity), new SettlementInput(report, timedOut));
    }
}

Two details carry the weight here. The first is the deadline source. context.CurrentUtcDateTime is the replay-safe clock from Part 1: on the first execution it is the real time, and on every replay after a checkpoint it returns that same original value, so AddDays(3) resolves to one fixed instant no matter how many times the orchestrator re-runs. Use DateTime.UtcNow instead and the deadline drifts forward on every replay, which breaks determinism and can move the timer past where it should have fired.

The second is the cts.Cancel() on the approval branch, and it is easy to read as optional cleanup when it is not. CreateTimer registers a durable timer in the instance state, and the framework will not let an orchestration reach Completed while a timer it created is still outstanding. Skip the cancel and your approved report settles its activity, then sits in Running for the rest of the three days until the abandoned timer finally fires. Cancelling the token does not abort anything in flight; it tells the runtime to drop the pending timer so the orchestrator can finish now. The using on the CancellationTokenSource disposes it when the method exits.

Fail-closed is a choice, not a rule. Auto-rejecting on timeout is the conservative default: an expense nobody approved should not be paid. The richer variant is to escalate rather than reject. Instead of returning, the timeout branch notifies a second approver (in the expense case, the manager's manager) and waits again with a fresh timer.

// Escalation variant for the timeout branch: re-notify, then wait once more.
await context.CallActivityAsync(nameof(NotifyEscalationApproverActivity), report);

// approvalTask from the first round is still pending, so race it again rather than
// registering a second waiter for the same event name behind it.
using var escalationCts = new CancellationTokenSource();
Task escalationTimeout =
    context.CreateTimer(context.CurrentUtcDateTime.AddDays(2), escalationCts.Token);

if (await Task.WhenAny(approvalTask, escalationTimeout) == approvalTask)
{
    escalationCts.Cancel();
    return await context.CallActivityAsync<string>(
        nameof(SettleExpenseActivity),
        new SettlementInput(report, approvalTask.Result));
}
// Still nothing after the second window: now fail closed.

Each escalation round is the same race with a new deadline and a new CancellationTokenSource, so the same two rules apply every time: derive the deadline from CurrentUtcDateTime, cancel the timer when the event wins. You can wrap the round in a loop to escalate up a chain of approvers, but give it a hard ceiling; an unbounded escalation loop is the indefinite wait you just removed, wearing a different hat.
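A bounded escalation loop might look like the sketch below. It is an illustration under assumptions, not the companion sample's code: BoundedEscalation and RunAsync are hypothetical names, and the timer and notification are passed in as delegates so the loop itself stays plain C#. The ceiling is the length of the windows list, so the loop cannot run more rounds than you wrote down.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: races one pending decision against a fixed list of windows.
public static class BoundedEscalation
{
    // Returns the decision, or null once every window has expired.
    public static async Task<T?> RunAsync<T>(
        Task<T> decision,
        IReadOnlyList<TimeSpan> windows,                     // hard ceiling: one round per entry
        Func<int, Task> notifyRound,                         // e.g. wraps context.CallActivityAsync
        Func<TimeSpan, CancellationToken, Task> startTimer)  // e.g. wraps context.CreateTimer
        where T : class
    {
        for (int round = 0; round < windows.Count; round++)
        {
            await notifyRound(round);

            using var cts = new CancellationTokenSource();
            Task timeout = startTimer(windows[round], cts.Token);

            if (await Task.WhenAny(decision, timeout) == decision)
            {
                cts.Cancel();            // drop the pending timer so the caller can finish
                return await decision;
            }
        }
        return null;                     // ceiling reached: the caller fails closed
    }
}

// Usage inside the orchestrator (NotifyApproverActivity is a hypothetical activity):
//
//   ApprovalDecision? decision = await BoundedEscalation.RunAsync(
//       context.WaitForExternalEvent<ApprovalDecision>("ApprovalDecision"),
//       new[] { TimeSpan.FromDays(3), TimeSpan.FromDays(2) },
//       round => context.CallActivityAsync(nameof(NotifyApproverActivity), round),
//       (window, ct) => context.CreateTimer(context.CurrentUtcDateTime.Add(window), ct));
```

Both rules from above survive the refactor: the deadline still comes from CurrentUtcDateTime inside the timer delegate, and the timer is still cancelled whenever the event wins.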

Instance ID storage patterns

The approval endpoint took instanceId straight from its route, as if the caller already had it. The start endpoint, meanwhile, returned that ID inside a 202 and then dropped it. So when the manager opens an email three days later and clicks approve, what fills in the {instanceId} segment of expenses/{instanceId}/decision? This is the half the intro flagged: Durable Functions hands you the instance ID at start time and keeps no index from a business entity to its instance ID. Mapping report R-2048 back to the orchestration that is waiting on it is your code's job, and there are three ways to do it.

Option 1: an external store keyed by the business entity. Write a row at start time (ReportId to instanceId) into whatever database or lookup service you already run, then read it back in the approval endpoint. This is the production default. It is authoritative and queryable, it supports many runs mapping to one entity and full history, and it survives instance purging. The cost is that you now own a second piece of state: an extra write on start, an extra read on approval, and the consistency between that row and the orchestration is yours to keep (make the write idempotent or transactional with the start so you cannot end up with a row pointing at an instance that never scheduled, or an instance with no row).
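The shape of option 1 can be sketched as a two-method index. Everything here is hypothetical: IApprovalIndex, InMemoryApprovalIndex, and both method names are invented for illustration, and the in-memory dictionary only stands in for whatever database you already run. The save is written as an upsert so that retrying the start call is idempotent.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical contract for the business-key-to-instance-ID index.
public interface IApprovalIndex
{
    Task SaveAsync(string reportId, string instanceId);
    Task<string?> FindInstanceIdAsync(string reportId);
}

// In-memory stand-in; a real implementation would write to a durable table.
public sealed class InMemoryApprovalIndex : IApprovalIndex
{
    private readonly ConcurrentDictionary<string, string> map = new(StringComparer.Ordinal);

    // Upsert: saving the same pair twice leaves the index unchanged.
    public Task SaveAsync(string reportId, string instanceId)
    {
        map[reportId] = instanceId;
        return Task.CompletedTask;
    }

    // Returns null when no orchestration was ever registered for the report.
    public Task<string?> FindInstanceIdAsync(string reportId) =>
        Task.FromResult<string?>(map.TryGetValue(reportId, out string? id) ? id : null);
}
```

With an index like this, the start endpoint would call SaveAsync right after scheduling, and the approval route could take a report ID instead of an instance ID and resolve it with FindInstanceIdAsync before raising the event.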

Option 2: make the instance ID the business key, and store nothing. ScheduleNewOrchestrationInstanceAsync lets you supply the ID instead of taking an autogenerated GUID, via StartOrchestrationOptions.InstanceId. If the ID is the report key, the approval endpoint reconstructs it from the route with no lookup at all.

string instanceId = await client.ScheduleNewOrchestrationInstanceAsync(
    nameof(ExpenseApprovalOrchestrator),
    report,
    new StartOrchestrationOptions { InstanceId = $"expense-{report.ReportId}" });

Now the email link is just expenses/expense-{ReportId}/decision, built from data you already have, and there is no table to keep in sync. The constraints are real, though, and the runtime enforces them unevenly across storage providers, so honor them regardless: the ID must be unique within the task hub, 1 to 100 characters, must not start with @, and must not contain /, \, #, ?, or control characters. Raw GUIDs are fine; emails and file paths usually need encoding first. The mapping is strictly one-to-one, so this fits short, naturally unique, single-run keys and not much else. A report ID like R-2048, which becomes the instance ID expense-R-2048, qualifies; a customer who can file many reports does not.
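Because enforcement varies by provider, it can help to check those rules yourself before scheduling. The helper below is a sketch that encodes only the constraints listed above; InstanceIdRules, IsValid, and ForReport are hypothetical names, and the expense- prefix is the one used in this article's example.

```csharp
using System;
using System.Linq;

// Hypothetical helper: applies the instance ID rules before an ID is used.
public static class InstanceIdRules
{
    private static readonly char[] Forbidden = { '/', '\\', '#', '?' };

    // True when the ID is 1 to 100 characters, does not start with '@', and
    // contains none of the forbidden or control characters.
    public static bool IsValid(string? id) =>
        !string.IsNullOrEmpty(id)
        && id.Length <= 100
        && id[0] != '@'
        && id.IndexOfAny(Forbidden) < 0
        && !id.Any(char.IsControl);

    // Builds the derived instance ID for a report, or throws if the report ID
    // would produce an ID that breaks the rules.
    public static string ForReport(string reportId)
    {
        string id = $"expense-{reportId}";
        if (!IsValid(id))
            throw new ArgumentException(
                $"'{reportId}' cannot be used in an instance ID.", nameof(reportId));
        return id;
    }
}
```

Calling ForReport in both the start endpoint and the code that builds the email link keeps the two sides deriving the same ID, which is the whole point of option 2. Uniqueness within the task hub is not something a local check can verify.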

Option 2 also inherits a gotcha worth stating plainly: scheduling an instance ID that already exists is not a safe atomic create-if-absent. The documented pattern is check-then-start (call GetInstanceAsync, inspect RuntimeStatus, and start only if the instance is missing or in a terminal state), and Microsoft flags a concurrency race even then: two requests for the same key can both pass the check and both report success while only one orchestration actually runs. If a duplicate submit must never double-schedule, you need a lock outside Durable Functions, which starts to erode the "store nothing" advantage.

Option 3: a Durable Entity as a registry. Entities are supported in the .NET isolated worker (the "not in isolated" caveat you may have read refers to in-orchestration critical sections, not entities), their operations run serially so there is no intra-entity race, and the index stays inside Durable Functions instead of a separate database. Treat this as viable but unproven: no official guidance endorses an entity as the business-key index, entities favor durability over latency so client reads can be stale, and routing every lookup through one hot registry entity serializes all traffic into a throughput bottleneck. Reach for it only if you have a specific reason to avoid an external store.

One temptation to rule out: the query APIs are not a reverse lookup. GetInstanceAsync needs the ID you are trying to find. GetAllInstancesAsync(OrchestrationQuery) filters on runtime status, creation-time range, task hub, and instance ID prefix, with no predicate for a business key that is not already part of the ID, so finding R-2048 that way means scanning every instance client-side, an O(n) walk that degrades as history grows and breaks once completed instances are purged. Custom status (SetCustomStatus) is for surfacing progress and caps at 16 KB; tags are queryable only in the scheduler dashboard, not from code. None of them is a point lookup. A business key to instance ID mapping always comes back to an app-owned index (option 1) or a derivable ID (option 2).

For the expense workflow the call is short. If a report ID is already a clean single-run key, option 2 removes a whole moving part: the email link encodes the ID and there is nothing to persist or reconcile. The moment you need many approvals per report, audit history, or a key that does not survive the ID character rules, option 1's extra row pays for itself.

Wrapping up

The pause is the easy half. WaitForExternalEvent suspends for days on a single await and bills you nothing while it waits, and a CreateTimer race keeps that wait from becoming a leak. The half that decides whether this works in production is the one most walkthroughs skip: the instance ID is a handle the platform hands you once and never indexes, so reuniting a parked workflow with the human who finally answers it is a design decision you make on purpose, at start time. Get that wrong and the approver clicks a link that raises an event into the void.

So make the call deliberately on your next approval workflow: do you persist the instance ID in an external table keyed by the business entity, or derive it from the business key so there is nothing to store?
