TL;DR: Orchestrators own control flow and must stay deterministic: no I/O, no clocks, no randomness. Activities own the real work. Entities own durable state. Mix those roles up (say, an axios.get() call inside the orchestrator body, the exact bug that taught me all this) and replay will quietly corrupt your workflow the moment it sees concurrent load.
What's in this article
- Why I Kept Confusing These Three Forms (Until I Got Burned)
- Quick Setup So We're on the Same Page
- Form 1: The Orchestrator Function — The One with the Weird Rules
- Form 2: The Activity Function — Where the Real Work Happens
- Form 3: The Entity Function — Stateful Actors That Actually Persist
- Comparing the Three Forms Side by Side
- A Real Pattern: Combining All Three in One Flow
- The Rough Edges I Hit That the Docs Glossed Over
- When to NOT Use Durable Functions
Why I Kept Confusing These Three Forms (Until I Got Burned)
The bug that finally made this click for me: I had an orchestrator function doing a direct HTTP call using axios.get() inside the orchestrator body. Not inside an activity — inside the orchestrator itself. Local testing? Flawless. Deployed it, threw some load at it, and watched the whole Durable Functions worker grind to a halt. Timeouts everywhere, history replay going sideways, function instances piling up. The runtime wasn't crashing — it was doing exactly what it was supposed to do. I was the one breaking the rules.
The thing that trips up most developers early on is assuming these three forms are just organizational patterns — like you could put your HTTP call in the orchestrator, you'd just be doing it "wrong" stylistically. That's not how it works. The runtime actively enforces boundaries. Orchestrators get replayed from history on every await. If you do real I/O in there — a network call, a database query, reading a file, even Date.now() — you'll get different results on replay than you did on first execution. The framework can't guarantee idempotency, your state drifts, and you get the kind of failure that only shows up under concurrent load, which is precisely when you need it to be reliable.
Here's how I now think about ownership, because that framing made it finally stick:
- Orchestrator function — owns control flow only. It decides what runs, in what order, with what inputs. It reads history, not the world. No I/O, no randomness, no side effects. Think of it as a pure state machine that the runtime can rewind and replay at any time.
- Activity function — owns actual work. HTTP calls, database writes, sending emails, calling third-party APIs — this is where all of that lives. Activities run exactly once per call (with retry semantics you configure), and they can block as long as they need to.
- Entity function — owns durable state. This is the one most people discover last and then wonder how they ever lived without it. An entity is a persistent object with identity — like a counter, a shopping cart, a rate limiter — that survives across function invocations without you manually reading and writing to storage.
The gotcha with entities that burned me a second time: entity operations run one at a time against a given entity, with state checkpointed after each operation. Entities aren't replayed the way orchestrators are, but slow or flaky I/O inside an operation handler still hurts: it blocks every queued operation behind it, and a crash mid-call leaves you guessing whether the side effect actually happened. If an entity needs to call an API to update its state, that call belongs in an activity that the orchestrator schedules, with the result passed into the entity as a signal or operation input. Once I mapped that out in a diagram for my team, the "why is my entity state corrupted" tickets stopped coming in.
If you want to sanity-check your orchestrator code before running it, the fastest heuristic I use is: could this line return a different value if you ran it twice? If yes, it belongs in an activity. That catches Math.random(), new Date(), fetch(), database reads — basically anything that touches the outside world or produces non-deterministic output.
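To make the heuristic concrete, here's a minimal sketch (C#, isolated worker model) of the same three lines written the unsafe way and the replay-safe way; the "GetUser" activity and the User type are illustrative:
// UNSAFE in an orchestrator: each of these changes value on replay
var requestId = Guid.NewGuid();
var startedAt = DateTime.UtcNow;
var user = await httpClient.GetFromJsonAsync<User>("/users/42");
// REPLAY-SAFE: deterministic context equivalents, with the I/O pushed into an activity
var safeRequestId = context.NewGuid();
var safeStartedAt = context.CurrentUtcDateTime;
var safeUser = await context.CallActivityAsync<User>("GetUser", "42");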
Quick Setup So We're on the Same Page
The thing that burned me first was mixing up the NuGet packages between the isolated worker model and the in-process model. They look almost identical in search results, the error messages when you use the wrong one say things like IDurableOrchestrationContext could not be found without explaining why, and Stack Overflow answers from 2021 will confidently point you at the wrong package. Stick with isolated worker — it's where Microsoft is putting new features, and it's what .NET 8 assumes you want.
Here's the exact install sequence I use on a clean machine. The Core Tools version matters — v4 is required for .NET 8 isolated support:
# Install Azure Functions Core Tools v4 globally
npm install -g azure-functions-core-tools@4 --unsafe-perm true
# Scaffold a new isolated worker project
func init MyDurableApp --worker-runtime dotnet-isolated
cd MyDurableApp
# Add the RIGHT package — note "Worker" in the name, not "WebJobs"
dotnet add package Microsoft.Azure.Functions.Worker.Extensions.DurableTask
# If you accidentally added the wrong one, remove it
dotnet remove package Microsoft.Azure.WebJobs.Extensions.DurableTask
For local storage, Azurite is the current replacement for the old Azure Storage Emulator (which was Windows-only and is now deprecated). Run it via npm: npm install -g azurite then azurite --silent --location ./azurite-data in a separate terminal. Your local.settings.json needs exactly this to point at it:
{
  "IsEncrypted": false,
  "Values": {
    "AzureWebJobsStorage": "UseDevelopmentStorage=true",
    "FUNCTIONS_WORKER_RUNTIME": "dotnet-isolated"
  }
}
The isolated vs in-process distinction is the gotcha nobody explains upfront. In-process means your function code runs inside the Functions host process — tighter coupling, faster cold start, but it's locked to whatever .NET version the host supports and Microsoft has signaled it's on maintenance mode. Isolated means a separate .NET worker process communicates with the host over gRPC — slightly more startup overhead, but you get full .NET 8 features, proper dependency injection, and the new SDK. The NuGet package names encode this difference: Microsoft.Azure.Functions.Worker.Extensions.DurableTask for isolated vs Microsoft.Azure.WebJobs.Extensions.DurableTask for in-process. If your project has both in its .csproj you'll get bizarre runtime conflicts with no clear error message pointing at the duplication.
After scaffolding, your .csproj should look roughly like this before you add anything else:
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <TargetFramework>net8.0</TargetFramework>
    <OutputType>Exe</OutputType>
    <AzureFunctionsVersion>v4</AzureFunctionsVersion>
    <RootNamespace>MyDurableApp</RootNamespace>
  </PropertyGroup>
  <ItemGroup>
    <PackageReference Include="Microsoft.Azure.Functions.Worker" Version="1.21.0" />
    <PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.DurableTask" Version="1.1.3" />
    <PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.Http" Version="3.1.0" />
    <PackageReference Include="Microsoft.Azure.Functions.Worker.Sdk" Version="1.16.4" />
  </ItemGroup>
</Project>
One last thing: func start will silently succeed even if Azurite isn't running, then explode on the first actual orchestration trigger with a storage connection error. Always start Azurite first, confirm it's listening on ports 10000–10002, then start the Functions host. Running func start --verbose at least gives you the storage connection attempt in the logs so you can see exactly when it fails.
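A small guard I keep in a dev script, sketched for a Unix shell with nc available, so the host never starts before Azurite is listening:
# Start Azurite, wait until all three storage ports answer, then start the host
azurite --silent --location ./azurite-data &
for port in 10000 10001 10002; do
  until nc -z 127.0.0.1 "$port"; do sleep 0.5; done
done
func start --verbose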
Form 1: The Orchestrator Function — The One with the Weird Rules
The thing that catches almost everyone off guard the first time they write an orchestrator function is that the code runs more than once. Not because of a bug. By design. The Durable Functions runtime replays your orchestrator from the beginning every time it wakes up from a timer, an activity result, or an external event. That single fact explains every weird constraint the orchestrator has, and once it clicks, the rules stop feeling arbitrary.
The orchestrator's job is to define the shape of your workflow — what runs in what order, what runs in parallel, what waits. It calls activity functions (the ones that do actual work), coordinates fan-out/fan-in patterns, sets timers, and handles the results. What it does not do is the work itself. That distinction is the whole design.
Here's a real multi-step order processing orchestrator, written against the isolated-worker API we just set up. This is close to production code I've shipped:
[FunctionName("OrderOrchestrator")]
public static async Task RunOrchestrator(
[OrchestrationTrigger] IDurableOrchestrationContext context)
{
var input = context.GetInput<OrderInput>();
// Each CallActivityAsync replays during history reconstruction
var validationResult = await context.CallActivityAsync<string>("ValidateOrder", input);
if (validationResult != "OK")
throw new InvalidOperationException($"Validation failed: {validationResult}");
// Fan-out: kick off inventory + payment checks in parallel
var inventoryTask = context.CallActivityAsync<bool>("CheckInventory", input);
var paymentTask = context.CallActivityAsync<bool>("AuthorizePayment", input);
await Task.WhenAll(inventoryTask, paymentTask);
// Timer that doesn't eat a thread — uses durable timer underneath
await context.CreateTimer(
context.CurrentUtcDateTime.AddMinutes(5),
CancellationToken.None
);
await context.CallActivityAsync("FulfillOrder", input);
}
The replay mechanism works like this: every time the orchestrator wakes up, the runtime re-executes your function top to bottom, but it replays completed activity calls from a stored history table instead of actually running them again. The activities themselves don't re-execute. Your orchestrator code does. This is why if you stick a Console.WriteLine or a logger call directly in the orchestrator body, you'll see it fire once per replay — which can be a dozen times for a long-running workflow. The output isn't garbage, it's evidence the replay is working. But it'll confuse you badly until you understand what's happening.
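This is also why the SDK ships a replay-safe logger. A minimal sketch for the isolated model: TaskOrchestrationContext.CreateReplaySafeLogger wraps your logger and suppresses calls that happen during replay, so these two lines drop straight into the orchestrator above.
// Emits once per real execution, stays quiet during history replay
ILogger log = context.CreateReplaySafeLogger("OrderOrchestrator");
log.LogInformation("Starting order workflow {InstanceId}", context.InstanceId);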
What actually breaks the replay is any non-deterministic call. The runtime is comparing what your code does now against what it recorded in history. If you call DateTime.Now, you get a different value on replay than the original execution. If you call Guid.NewGuid(), same problem — different value, history mismatch, corrupted state. If you fire an HttpClient directly from the orchestrator, you're making a real network call on every replay, which is both wrong and potentially expensive. The banned list in practice:
- DateTime.Now / DateTime.UtcNow — use context.CurrentUtcDateTime
- Guid.NewGuid() — use context.NewGuid()
- Any direct HttpClient, database call, or file I/O — push it into an activity
- Thread.Sleep or raw Task.Delay — use context.CreateTimer, which survives host restarts
- Static mutable state — it persists across replays unpredictably
The context.CurrentUtcDateTime substitution looks trivially minor and I've seen senior devs skip it thinking it won't matter for their use case. It will. The first time your orchestrator replays after a host restart and your timer logic is comparing a freshly-generated DateTime.UtcNow against a stored checkpoint timestamp from three hours ago, the workflow takes a path it never should have taken. The fix is one token swap, but diagnosing why your workflow silently skipped a fulfillment step at 3am is not a fun morning. Use the context equivalents for everything time-related, every time, no exceptions.
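To see why the one-token swap matters, here's a sketch of that failure mode; checkpoint stands in for a timestamp your workflow stored earlier, and "EscalateOrder" is an illustrative activity:
// WRONG: a fresh clock on every replay, so the comparison can flip after a restart
if (DateTime.UtcNow - checkpoint > TimeSpan.FromHours(1))
    await context.CallActivityAsync("EscalateOrder", input);
// RIGHT: replayed from history, identical value every time this line executes
if (context.CurrentUtcDateTime - checkpoint > TimeSpan.FromHours(1))
    await context.CallActivityAsync("EscalateOrder", input);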
Form 2: The Activity Function — Where the Real Work Happens
The most important mental shift when working with Durable Functions is understanding that orchestrator functions are intentionally lobotomized — no I/O, no randomness, no DateTime.Now. All of that gets pushed into activity functions. This separation isn't a limitation, it's the architecture. Your orchestrator is a replay-safe state machine; your activity is where you actually do stuff.
Activity functions have zero restrictions. Call a Postgres database, hit a third-party REST API, write a file to blob storage, send an email — it all lives here. I've had teams fight this design because it felt like unnecessary indirection, but the payoff shows up when your orchestrator crashes mid-execution and replays cleanly from checkpoint, re-running only the activities that haven't completed yet. That wouldn't work if orchestrators were doing I/O themselves.
Here's a realistic activity that calls an external API and returns a strongly-typed result. This is the actual pattern I use in production, not a toy example:
// Triggered by the orchestrator via context.CallActivityAsync
[Function("FetchUserProfile")]
public static async Task<UserProfile> FetchUserProfile(
    [ActivityTrigger] string userId,
    FunctionContext executionContext)
{
    var log = executionContext.GetLogger("FetchUserProfile");
    // HttpClient should be injected via DI in real apps — created inline for brevity
    using var http = new HttpClient();
    http.BaseAddress = new Uri("https://api.yourservice.com");
    http.DefaultRequestHeaders.Add("Authorization", "Bearer " + Environment.GetEnvironmentVariable("API_KEY"));
    var response = await http.GetAsync($"/users/{userId}");
    response.EnsureSuccessStatusCode();
    // System.Text.Json extension method; the ?? guards against an empty body
    var profile = await response.Content.ReadFromJsonAsync<UserProfile>()
        ?? throw new InvalidOperationException($"Empty response for user {userId}");
    log.LogInformation("Fetched profile for {UserId}, tier={Tier}", userId, profile.Tier);
    return profile; // This gets serialized to JSON and written to the history table
}
// Calling side in the orchestrator — this is where retry config lives
var retryOptions = TaskOptions.FromRetryPolicy(new RetryPolicy(
    maxNumberOfAttempts: 4,                      // counts the first attempt: 1 original + 3 retries
    firstRetryInterval: TimeSpan.FromSeconds(5),
    backoffCoefficient: 2.0));                   // exponential backoff: 5s, 10s, 20s
var profile = await context.CallActivityAsync<UserProfile>(
    "FetchUserProfile",
    userId,
    retryOptions);
The retry config is more nuanced than the docs make it look. maxNumberOfAttempts counts the first attempt — so setting it to 4 means 1 original + 3 retries. I've seen devs set it to 1 thinking they'd get a retry, then wonder why failed activities blew up immediately. Also: a plain RetryPolicy retries every failure; distinguishing transient errors from permanent ones takes a custom retry handler, shown below. A 404 from your API isn't worth retrying 4 times; a 503 is. Wire that up from day one or you'll waste money on pointless retries.
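Here's what that filtering looks like in the isolated model. A sketch using TaskOptions.FromRetryHandler, where the handler returns true to retry and false to fail fast; the exception types are the same transient ones from the example above:
// Retry transient failures up to 3 more times; anything else fails immediately
var options = TaskOptions.FromRetryHandler(retry =>
    retry.LastAttemptNumber < 4 &&
    (retry.LastFailure.IsCausedBy<HttpRequestException>() ||
     retry.LastFailure.IsCausedBy<TimeoutException>()));
var profile = await context.CallActivityAsync<UserProfile>(
    "FetchUserProfile", userId, options);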
The gotcha that burned me at 2am: activity inputs and outputs must round-trip through JSON cleanly. The Durable Functions runtime serializes everything through Newtonsoft.Json (or System.Text.Json depending on your config) before writing to the history table. Pass a Stream, a CancellationToken, a delegate, or anything with circular references and you'll get a JsonException at runtime, not compile time. I've also been bitten by passing a class with a DateTimeOffset field that serialized fine but deserialized to UTC when the original was local time. The rule: treat your activity input/output types the same way you'd treat a DTO going over a REST API. Keep them flat, simple, and explicitly test serialization round-trips in your unit tests.
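The round-trip test costs almost nothing to write. A minimal sketch with xUnit and System.Text.Json, assuming UserProfile is a plain DTO with settable properties; swap in Newtonsoft.Json if that's what your app is configured to use:
using System.Text.Json;
using Xunit;

public class ActivityDtoTests
{
    [Fact]
    public void UserProfile_round_trips_through_json()
    {
        var original = new UserProfile { Tier = "gold" };
        var json = JsonSerializer.Serialize(original);
        var restored = JsonSerializer.Deserialize<UserProfile>(json);
        Assert.Equal(original.Tier, restored!.Tier);
    }
}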
Granularity matters more than most tutorials admit. Every activity invocation creates a queue message, a history table entry, and a round-trip through Azure Storage. If you design an orchestrator that calls 500 tiny activities — one per row in a CSV, say — you're paying that overhead 500 times. I batch aggressively: instead of one activity per user record, I'll pass a List<string> of up to 100 IDs and process them inside a single activity. The flip side is that a coarse-grained activity that processes 500 records in one shot loses all retry granularity — if it fails on record 499, you retry from the start. Find the sweet spot based on your actual failure rate and acceptable replay cost. For most HTTP-heavy workflows, batches of 10–50 items is where I land.
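The batching itself is one LINQ call. A sketch of the fan-out side, assuming .NET 6+ for Enumerable.Chunk and a hypothetical ProcessUserBatch activity that returns how many records it handled:
// Fan out over batches of 50 IDs instead of one activity per user
var batchTasks = userIds
    .Chunk(50)
    .Select(batch => context.CallActivityAsync<int>("ProcessUserBatch", batch))
    .ToList();
var processedCounts = await Task.WhenAll(batchTasks);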
Form 3: The Entity Function — Stateful Actors That Actually Persist
The thing that trips most people up is reaching for an orchestration when they actually need an entity. Orchestrations are great for workflows — sequences of steps with retries and timer-based delays. But if you need to track a count, maintain a user's session state, or accumulate items over time across many unrelated events, starting a new orchestration per update is the wrong shape. You end up with dozens of orchestration instances, each holding a tiny slice of mutable state, with no clean way to query or mutate that state from the outside without abusing the external events API.
The mental model that finally made entities click for me: imagine a Redis key whose value has methods. You get myCounter:Add, myCounter:Reset, myCounter:Get — and the runtime serializes all calls to that key automatically, persists the state to Azure Table Storage between calls, and lets you address any entity by a string ID from anywhere in your system. No locks, no race conditions, no boilerplate. The entity ID is just new EntityInstanceId("Counter", "user-42") and all calls to that ID run serially by design.
Here's a real counter entity using the function-based syntax first, then the class-based one so you can see why I reach for class-based on anything non-trivial:
// Function-based — fine for toy examples, gets ugly fast
[Function(nameof(Counter))]
public static Task Counter([EntityTrigger] TaskEntityDispatcher dispatcher)
{
    return dispatcher.DispatchAsync(operation =>
    {
        if (operation.State.GetState(typeof(int)) is not int state)
        {
            state = 0;
        }
        switch (operation.Name.ToLowerInvariant())
        {
            case "add":
                int amount = operation.GetInput<int>();
                operation.State.SetState(state + amount);
                break;
            case "reset":
                operation.State.SetState(0);
                break;
            case "get":
                return new(state); // ValueTask<object?> carrying the result
        }
        return default;
    });
}
// Class-based — this is what you should actually use
public class Counter : TaskEntity<int>
{
    // State is the typed backing field — no GetState/SetState ceremony
    public void Add(int amount) => State += amount;
    public void Reset() => State = 0;
    public int Get() => State;

    [Function(nameof(Counter))]
    public static Task Run([EntityTrigger] TaskEntityDispatcher dispatcher)
        => dispatcher.DispatchAsync<Counter>();
}
The class-based approach comes with the newer Durable Task SDK — in the isolated worker model, the TaskEntity<T> base class ships in Microsoft.Azure.Functions.Worker.Extensions.DurableTask (1.1 and later). The dispatch model uses reflection to route operation names to method names, so your Add method handles the "add" operation automatically. No switch statement, no manual state serialization, and you get full IntelliSense. For anything beyond two or three operations the function-based syntax becomes a maintenance problem.
Calling entities from an orchestrator is where you choose your consistency guarantee. SignalEntityAsync is fire-and-forget — the orchestrator moves on immediately without waiting for the entity to process the call. CallEntityAsync awaits the result and blocks the orchestrator until the entity responds. Use signal for writes where you don't need confirmation, call for reads or writes that gate later logic:
[FunctionName("OrderOrchestrator")]
public static async Task RunOrchestrator(
[OrchestrationTrigger] IDurableOrchestrationContext context)
{
var entityId = new EntityId(nameof(Counter), "order-items");
// Fire-and-forget — add an item, don't wait
context.SignalEntity(entityId, nameof(Counter.Add), 1);
// Awaitable read — block until we get the current count back
int currentCount = await context.CallEntityAsync<int>(entityId, nameof(Counter.Get));
if (currentCount >= 10)
await context.CallActivityAsync("TriggerBulkShipment", null);
}
The serialization behavior is the thing that surprised me most, and it's actually why entities are reliable. Every call to a given entity ID is queued and processed one at a time — there's no concurrency within a single entity. So if 50 orchestrators all signal Counter/user-42 simultaneously, those 50 operations execute sequentially in arrival order. You will never get a torn write. The trade-off is that you can't do parallel fan-out inside an entity the way you'd fan out activity functions in an orchestrator. If you need that, the entity shouldn't be doing the fan-out — the orchestrator should, and it signals the entity at the end to record the result. Once I stopped thinking of entities as mini-orchestrators and started treating them as consistent state stores with a method interface, the design patterns became obvious.
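Entities are also reachable from outside any orchestration, which is what "direct signal" means in the comparison table below. A sketch of signaling the counter from a plain HTTP-triggered function in the isolated model; the route and auth level are illustrative:
[Function("AddToCounter")]
public static async Task<HttpResponseData> AddToCounter(
    [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequestData req,
    [DurableClient] DurableTaskClient client)
{
    var entityId = new EntityInstanceId(nameof(Counter), "order-items");
    // Queued like any other signal and processed serially by the entity
    await client.Entities.SignalEntityAsync(entityId, nameof(Counter.Add), 1);
    return req.CreateResponse(HttpStatusCode.Accepted);
}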
Comparing the Three Forms Side by Side
The thing that finally made these three click for me was stopping thinking about them as "types of functions" and starting thinking about them as roles in a system. Each one has exactly one job, and the constraints aren't arbitrary — they fall out directly from what that job requires.
Here's the side-by-side view I keep mentally loaded when I'm designing a new workflow:
Form | Can do I/O | Persistent state | Determinism required | Triggered by
-------------|------------|------------------|----------------------|---------------------------
Orchestrator | No | No (replayed) | YES — strictly | Client functions (HTTP / queue / timer)
Activity | Yes | No (stateless) | No constraint | Orchestrator only
Entity | Yes | YES — durable | No constraint | Orchestrator, client, or another entity
The determinism constraint on orchestrators is where most people hit their first real bug. The runtime replays your orchestrator function from history every time it resumes — so if you call DateTime.Now, do a database read, or generate a random number directly inside an orchestrator, you'll get different results on replay than on the original run. The runtime detects this and your workflow either silently corrupts or throws a non-determinism error. All of that work has to live in an activity instead, even if it feels silly to wrap a single Guid.NewGuid() call in its own activity function.
Activities are the workhorses — they do literally everything that touches the outside world. HTTP calls, database writes, sending emails, calling third-party APIs. They're stateless in the sense that you can't rely on any in-memory state persisting between calls, but they have zero determinism constraints. If an activity fails, the orchestrator retries it based on a retry policy you configure, and each retry is a clean invocation. The trade-off is that activities are the unit of retry, so if your activity does five things and the fifth one fails, you redo all five on retry. Keep them granular.
// Granular activity — retries are cheap and safe
[Function("ChargeCard")]
public async Task<string> ChargeCard([ActivityTrigger] PaymentRequest req)
{
    // idempotency key prevents double-charging on retry
    return await _stripe.ChargeAsync(req.Amount, req.IdempotencyKey);
}
// Avoid bundling unrelated side effects — retry becomes expensive
// BAD: ChargeCard + SendReceipt + UpdateLedger all in one activity
Entities are the one that people underuse. The mental model is a tiny stateful actor — think a shopping cart, a counter, an approval status — where you need to read and mutate state across multiple orchestrator runs without spinning up a full database roundtrip every time. Entities get their own durable storage slot keyed by an entity ID, and they can receive signals from orchestrators or from external callers directly. The "no determinism constraint" part matters here too: because entity state is explicitly checkpointed, you don't have the replay problem that orchestrators have.
My actual decision rule, which I've used on every Durable Functions design in the past three years: coordinating logic → orchestrator, doing real work → activity, remembering something across time → entity. If I catch myself wanting to put an HttpClient call in an orchestrator, that's an activity. If I catch myself wanting to store approval state in a database that my orchestrator polls, that's an entity. The lines are cleaner than they look in the docs — you just have to map your intuition to the right bucket first.
A Real Pattern: Combining All Three in One Flow
The thing that finally made all three durable function forms click for me wasn't a toy example — it was a document processing pipeline. You have a batch of incoming PDFs, each needs OCR, validation, and classification, and you need a single source of truth that says "17 of 42 documents processed successfully." That's not solvable cleanly with just an orchestrator, and it's not solvable with just an entity. You need all three working together, and the wiring between them is where most tutorials stop short.
Here's the actual shape of it: an orchestrator fans out work to activity functions (one per document), then after each completes, it signals a counter entity to record the result. The entity holds the running tally, survives crashes, and can be queried independently of the orchestration. The key insight is that the orchestrator doesn't own the completion count — the entity does. That separation lets you query progress without polling the orchestration history, which can get enormous on large batches.
// Orchestrator — fans out activities, signals entity after each
const df = require("durable-functions");

df.app.orchestration("documentBatchOrchestrator", function* (context) {
    const documents = context.df.getInput(); // e.g. ["doc1.pdf", "doc2.pdf", ...]
    const entityId = new df.EntityId("DocumentCounter", "batch-" + context.df.instanceId);
    // Fan out: one activity per document, no yielding yet
    const tasks = documents.map((docId) =>
        context.df.callActivity("processDocument", { docId })
    );
    // Fan in: durable Tasks aren't Promises (no .catch), and Task.all faults if
    // any task throws — so processDocument reports failure in its return value
    const results = yield context.df.Task.all(tasks);
    // Signal the entity for each result — fire and forget, no yield needed
    for (const result of results) {
        const signal = result.failed ? "incrementFailed" : "incrementSuccess";
        context.df.signalEntity(entityId, signal);
    }
    // Read final counts back before finishing
    const summary = yield context.df.callEntity(entityId, "get");
    return summary;
});
// Activity — the actual document work. It catches its own errors so a single
// bad document can't fault the orchestrator's Task.all
df.app.activity("processDocument", {
    handler: async (input) => {
        const { docId } = input;
        try {
            // OCR, validation, classification — whatever your pipeline needs
            await runOcrPipeline(docId);
            return { docId, status: "ok" };
        } catch (err) {
            return { docId, failed: true, error: err.message };
        }
    },
});
// Entity — durable counter with named operations
df.app.entity("DocumentCounter", function (context) {
    const state = context.df.getState(() => ({ success: 0, failed: 0 }));
    switch (context.df.operationName) {
        case "incrementSuccess":
            state.success += 1;
            break;
        case "incrementFailed":
            state.failed += 1;
            break;
        case "get":
            context.df.return(state);
            break;
    }
    context.df.setState(state);
});
One thing that surprised me: signalEntity is truly fire-and-forget from the orchestrator's perspective. The orchestrator doesn't yield on it, which means you're not blocking fan-in completion waiting for entity writes. The Durable Framework queues those signals and processes them serially against the entity. That serial guarantee is exactly why entities work as counters — you can't get a race condition even if 50 activity completions arrive at once.
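Because the entity owns the tally, you can check progress from outside while the batch is still running. The management API exposes entity state too; a sketch, since the exact query-string requirements (task hub, connection, system key) vary by setup and are usually optional locally:
# Read the counter entity's current state for a running batch
GET http://localhost:7071/runtime/webhooks/durabletask/entities/DocumentCounter/batch-{instanceId}

# Typical response: the raw entity state
{ "success": 14, "failed": 3 }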
For debugging locally, the management HTTP API is your first stop. After you start the function host with func start, hit this to see the full orchestration state including custom status and history:
# Replace {instanceId} with the ID returned when you started the orchestration
GET http://localhost:7071/runtime/webhooks/durabletask/instances/{instanceId}?showHistory=true&showHistoryOutput=true
# Typical response shape (truncated)
{
  "name": "documentBatchOrchestrator",
  "instanceId": "abc123",
  "runtimeStatus": "Running",
  "input": ["doc1.pdf", "doc2.pdf"],
  "customStatus": null,
  "output": null,
  "createdTime": "2025-01-15T10:23:00Z",
  "lastUpdatedTime": "2025-01-15T10:23:04Z",
  "historyEvents": [
    { "EventType": "TaskCompleted", "Name": "processDocument", ... },
    ...
  ]
}
The history array grows with every event — for a 100-document batch that's hundreds of entries. Parsing that JSON by eye gets old fast. The Durable Functions Monitor VS Code extension (durablefunctionsmonitor.durablefunctionsmonitor in the marketplace) connects directly to your local storage emulator or Azure Storage account and renders the orchestration as a visual DAG. You can see exactly which activities completed, which are in-flight, and whether any entity signals are queued — without writing a single Kusto query or storage table scan. I reach for it immediately when an orchestration stalls because it makes the execution graph obvious in a way that the raw JSON history never will be.
The Rough Edges I Hit That the Docs Glossed Over
The history table bloat problem is the one that will sneak up on you weeks after go-live. Every activity call, every timer, every external event gets a row in Azure Table Storage in the Instances and History tables. A moderately complex orchestration with 50 steps running thousands of times a day will generate millions of rows fast. Azure Table Storage doesn't auto-expire rows. I found this out the hard way when query latency on the orchestrator status endpoint climbed from milliseconds to seconds — the history table had grown to 4+ million rows with zero cleanup configured.
Fix it with scheduled purging. Nothing expires history for you: not the storage account, not the runtime. Call the purge API from a timer-triggered cleanup function so terminal instances older than your retention window get deleted:
// Cleanup function — runs nightly via timer trigger
[Function("PurgeOldInstances")]
public static async Task Run(
    [TimerTrigger("0 0 2 * * *")] TimerInfo timer,
    [DurableClient] DurableTaskClient client)
{
    var cutoff = DateTimeOffset.UtcNow.AddDays(-7);
    // purge all terminal instances older than 7 days
    await client.PurgeAllInstancesAsync(new PurgeInstancesFilter(
        DateTimeOffset.MinValue,
        cutoff,
        new[]
        {
            OrchestrationRuntimeStatus.Completed,
            OrchestrationRuntimeStatus.Failed,
            OrchestrationRuntimeStatus.Terminated
        }));
}
Entity contention hit me on a fan-out pattern where 500 concurrent orchestrators were all signaling the same entity ID to update a shared counter. The entity executes operations serially by design — that's what gives you consistency — but it means you've just created a single-threaded queue with 500 items in it. The backlog grows faster than it drains. The fix is coarser-grained partitioning: instead of one Counter@global entity, create Counter@userId or Counter@batchId-0 through Counter@batchId-N shards and aggregate lazily. You trade real-time accuracy for throughput, which is almost always the right call.
// Shard by user instead of hammering one global entity
var shardId = $"user-{userId}";
var entityId = new EntityInstanceId(nameof(CounterEntity), shardId);
await ctx.Entities.SignalEntityAsync(entityId, "add", incrementValue);
The NonDeterministicOrchestrationException on deploys is brutal because it surfaces at runtime, not at compile time, and it affects every in-flight instance simultaneously. The constraint is real: an orchestrator function must replay its full history deterministically every time it wakes up. If you deployed code that adds a new activity call between two existing ones, the replay diverges from the stored history and the runtime throws. Your options in priority order: drain the queue before deploying (set the function app to read-only, wait for all running instances to complete), use ContinueAsNew to checkpoint long-running orchestrations into fresh instances, or implement explicit versioning with a schema-version field on the orchestration input that gates the new code path:
// Version gate — v1 path for old instances, v2 for new ones
var input = ctx.GetInput<WorkflowInput>(); // WorkflowInput is whatever your orchestration takes
if (input.SchemaVersion >= 2)
{
    await ctx.CallActivityAsync("NewActivityStep", input);
}
// old instances that predate v2 skip this branch entirely
// their replay stays deterministic against the original history
The Azurite issue is the most tedious because it only bites you during test runs, usually right before you're trying to demo something. Entity state from a previous test run persists across Azurite restarts if you just kill and restart the process — the JSON-backed state files survive (Azurite persists to LokiJS JSON files, not a real database). The only clean reset is deleting the __azurite_db_blob__.json, __azurite_db_queue__.json, and __azurite_db_table__.json files from the folder you pointed --location at. I now have a one-liner in my package.json dev scripts:
# nuke azurite state and restart clean — run before each test suite
rm -rf ./.azurite && mkdir -p ./.azurite \
  && azurite --silent --location ./.azurite &
The --silent flag also suppresses the noisy request logs that pollute your test output. Without it you'll see hundreds of table storage poll requests scrolling past while your actual test assertions are somewhere in the middle. One more thing: if you're on Windows and using the Azurite VS Code extension rather than the CLI, the state files live under %USERPROFILE%\.azurite by default — the extension's "Clean" button in the sidebar doesn't always flush entity state completely. Delete the files manually.
When to NOT Use Durable Functions
The replay model is the gotcha that bites everyone. I've watched teams adopt Durable Functions expecting "just orchestrated Azure Functions" and then spend three painful days debugging why their logger was printing the same line six times, or why a timestamp captured inside an orchestrator was returning a value from six hours ago. The mental model shift required is significant — orchestrator functions re-execute from the beginning on every await, so any non-deterministic code (DateTime.Now, Guid.NewGuid(), random numbers) will produce different values on replay and corrupt your history. That's not a caveat buried in a footnote; it's a fundamental constraint that has caused real production incidents.
If your API endpoint needs to respond in under 200ms, Durable Functions is the wrong abstraction. The queue-based execution model means your orchestrator starts, writes to Azure Storage queues, and your activity functions get scheduled via polling. The storage provider polls with an exponential backoff that maxes out at 30 seconds by default (maxQueuePollingInterval) — yes, you can tune it down, but you're still adding hops that a direct function call doesn't have. I've seen teams build what they thought was a fast checkout flow using Durable orchestration and end up with 3–8 second response times that had nothing to do with their business logic. Use Durable Functions for workflows you can hand off asynchronously; never put them in the synchronous path of a user-facing request.
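If you're stuck with a latency-sensitive durable workflow anyway, the polling backoff is tunable in host.json. A sketch; tightening it raises your storage transaction count, so it's a money-for-latency trade:
{
  "extensions": {
    "durableTask": {
      "storageProvider": {
        "maxQueuePollingInterval": "00:00:02"
      }
    }
  }
}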
For straightforward scheduled work — process a batch at midnight, send a weekly digest, clean up stale records — plain timer-triggered Azure Functions are genuinely simpler and cheaper. You don't need orchestration history, you don't need the Durable Task Framework overhead, and you don't need to think about replay. The moment you start writing something like:
// DON'T do this for simple scheduled tasks
[Function("WeeklyReport")]
public async Task Run([OrchestrationTrigger] TaskOrchestrationContext ctx)
{
    await ctx.CallActivityAsync("GenerateReport");
}

// DO this instead
[Function("WeeklyReport")]
public async Task Run([TimerTrigger("0 0 9 * * MON")] TimerInfo timer)
{
    await _reportService.GenerateAsync();
}
...the second version has fewer moving parts, one fewer queue hop, no storage table writes for history, and any junior dev can debug it without understanding the Durable replay model.
Entity functions are the hidden performance trap. They look like a clean solution for shared mutable state — a counter, a session, a rate limiter. But each signal or call to an entity goes through the storage queue, and entities with high write throughput will saturate that queue fast. If you're updating the same entity more than a few hundred times per minute, you'll start seeing queue depth build up and latency balloon. The Durable storage provider (backed by Azure Storage Tables and Queues) isn't designed for this pattern. For anything resembling a hot counter or a per-user rate limiter under real traffic, reach for Azure Cache for Redis with its atomic increment operations, or model the data in Cosmos DB with its conflict resolution — both are built for that access pattern in ways Durable Entities simply aren't.
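For comparison, here's the hot-counter path on Redis. A sketch assuming StackExchange.Redis and a placeholder connection string; INCR is atomic on the server, so there's no queue hop and no entity serialization in the way:
using StackExchange.Redis;

// Reuse one multiplexer per process; it's designed to be shared
var redis = await ConnectionMultiplexer.ConnectAsync("localhost:6379");
var db = redis.GetDatabase();

// Atomic increment in a single round trip; returns the new value
long count = await db.StringIncrementAsync("ratelimit:user-42");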
The team-readiness issue is the one most engineering managers wave away, and it ends up costing a sprint. The replay model, the constraint on non-determinism, the difference between orchestrator/activity/entity function scopes, the fact that you can't just throw a try/catch around an await ctx.CallActivityAsync() and expect it to behave like normal async code — none of this is hard once you've built something with it. But the first time through, developers make assumptions that seem completely reasonable and are completely wrong. Budget at minimum a week of genuine exploration time before putting this in a production critical path, and make sure at least one person on the team has read through the code constraints documentation before writing a single orchestrator.