DEV Community: Martin Oehlert

Preparing for Migration: Decoupling Your Function Logic

Martin Oehlert — Fri, 22 May 2026 05:48:43 +0000

When does pushing logic out of the Function method actually pay off? The day the Consumption-plan batch runs past 10 minutes, the host kills the worker, and the same queue message is delivered again from the top. The trigger binding turns out to be a hosting contract, not a programming model: a queue trigger and a BackgroundService loop are two different shapes for the same job. Once the workload sits behind an injected service, swapping one shape for the other becomes a Program.cs change instead of a rewrite, and tests stop needing the Functions host.

The pattern shows up most clearly in a working sample. The MigrationDemo folder from Part 5 ships one workload (Settlement.Core) and three hosts (Functions, App Service, Container App) that all call into it. Everything below is grounded in that code.

The trigger as a thin controller

The Function method does three things and nothing else:

Receive the trigger payload. For queue, event hub, and Cosmos triggers the binding deserializes it for you. HTTP triggers do their own ReadFromJsonAsync.
Call an injected service with the deserialized command and a CancellationToken.
Return or forward the result.

Everything beyond that list (validation past the framework checks, persistence, downstream calls, business rules) belongs to a service. The rule is not "make the Function shorter", it is "make the Function disposable": the same workload has to be callable from somewhere that is not a trigger binding.

Before: the trigger doing too much

A naïve settlement function pulls a batch off the queue and processes it inline. Reading raw configuration, branching on the gateway response, sending dead-letter messages, all in the Run method:

public sealed class SettlementFunction(
    QueueClient deadLetterQueue,
    ILogger<SettlementFunction> logger)
{
    [Function(nameof(SettlementFunction))]
    public async Task Run(
        [QueueTrigger("settlement-batches", Connection = "AzureWebJobsStorage")] string body,
        CancellationToken cancellationToken)
    {
        var batch = JsonSerializer.Deserialize<SettlementBatch>(body)
            ?? throw new InvalidOperationException("Malformed batch.");

        var delayMs = int.Parse(Environment.GetEnvironmentVariable("PER_PAYMENT_DELAY_MS") ?? "50");
        var failureRate = double.Parse(Environment.GetEnvironmentVariable("FAILURE_RATE") ?? "0.02");

        var settled = 0;
        var failed = 0;

        foreach (var payment in batch.Payments)
        {
            cancellationToken.ThrowIfCancellationRequested();
            await Task.Delay(delayMs, cancellationToken);

            var hash = (uint)payment.PaymentId.GetHashCode(StringComparison.Ordinal);
            var accepted = ((hash & 0xFFFF) / 65536.0) >= failureRate;

            if (accepted)
            {
                settled++;
            }
            else
            {
                failed++;
                logger.LogWarning("Rejected {PaymentId}", payment.PaymentId);
                await deadLetterQueue.SendMessageAsync(
                    JsonSerializer.Serialize(new { payment.PaymentId, reason = "GATEWAY_DECLINED" }),
                    cancellationToken: cancellationToken);
            }
        }

        logger.LogInformation(
            "Batch {BatchId}: settled={Settled}, failed={Failed}",
            batch.BatchId, settled, failed);
    }
}

Three problems are not visible in the file, but they are visible the first time something goes wrong:

The settlement loop only runs inside the Functions worker. A Consumption-plan batch that runs past the 10-minute maximum timeout (the default is 5 minutes) gets killed, and the same message comes back from scratch.
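The per-payment delay and the failure-rate threshold are parsed from raw environment variables inside Run, so a malformed value only fails once a batch is already in flight. Nothing in the loop is testable without spinning up the host.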
Replacing the dead-letter queue with Service Bus is a method rewrite. The branching, the serialization, and the SDK call are tangled at the call site.

After: the contract in code

Push the loop into IPaymentSettler and the trigger collapses to its three jobs. From Settlement.FunctionApp/SettlementFunction.cs:

public sealed class SettlementFunction(
    IPaymentSettler settler,
    ILogger<SettlementFunction> logger)
{
    [Function(nameof(SettlementFunction))]
    public async Task Run(
        [QueueTrigger("settlement-batches", Connection = "AzureWebJobsStorage")] SettlementBatch batch,
        CancellationToken cancellationToken)
    {
        logger.LogInformation(
            "Function host received batch {BatchId} ({Count} payments)",
            batch.BatchId, batch.Payments.Count);

        var result = await settler.SettleAsync(batch, progress: null, cancellationToken);

        logger.LogInformation(
            "Function host completed batch {BatchId}: settled={Settled}, failed={Failed}",
            batch.BatchId, result.Settled, result.Failed);
    }
}

Five statements. The QueueTrigger binding deserializes the message into a SettlementBatch POCO before Run is called; Microsoft Learn confirms this in the queue trigger usage notes. The body of the workload sits behind IPaymentSettler.SettleAsync(SettlementBatch, IProgress<SettlementProgress>?, CancellationToken). The CancellationToken propagates through.

What does not belong, with evidence from the sample

Each rule below would show up as code in SettlementFunction.cs if it had been violated. None of them is.

Business rules. No validation, no per-payment branching, no calculation in Run. The accept/reject branch and the rejection log live in PaymentSettler.SettleAsync.
Direct Azure SDK calls. SettlementFunction.cs has no using Azure.Storage.* and never constructs a QueueClient or BlobClient. The trigger reaches the downstream payments network through IPaymentSettler, which depends on ISettlementGateway, which speaks the domain.
FunctionContext leakage. Neither Run nor IPaymentSettler declares FunctionContext. That type ships in Microsoft.Azure.Functions.Worker; if a service took it as a parameter, that service would not compile in the App Service or Container App project.
Environment-variable reads. No Environment.GetEnvironmentVariable anywhere in Settlement.Core. Configuration is IOptions<SettlementOptions> and IOptions<PaymentSettlerOptions>, bound identically across all three hosts.
Sink-specific logging. SettlementFunction takes ILogger<SettlementFunction>. No TelemetryClient, no FunctionContext.GetLogger.

Why thinness matters when the host changes

Three things in the after-snippet are bound to Microsoft.Azure.Functions.Worker: the [Function] attribute, the [QueueTrigger] attribute, and the dispatch contract that says the worker calls Run. When the host changes, those three things get rewritten. Whatever sits below the settler.SettleAsync(...) call site survives.

The MigrationDemo sample makes that claim measurable. The same IPaymentSettler.SettleAsync call appears in three host projects:

Settlement.FunctionApp/SettlementFunction.cs: queue trigger, five statements.
Settlement.AppService/Services/SettlementWorker.cs: a BackgroundService polling the same queue, with an IProgress<SettlementProgress> reporter feeding a /status endpoint.
Settlement.ContainerApp/SettlementWorker.cs: the same polling loop with no HTTP host.

Identical call signature, identical service, identical cancellation propagation. The diff between the three hosts is the activation surface and the composition root. The workload itself does not move.

Abstracting Azure SDK dependencies

The first temptation when "separating logic" is to wrap every SDK type. IBlobStore around BlobClient. IQueueClientWrapper around QueueClient. IServiceBusSenderWrapper around ServiceBusSender. Each wrapper has the same surface as the SDK, just with the namespace renamed. The indirection doubles, the testable surface does not change, and now there are two places to update when the SDK adds a method.

MigrationDemo cuts in a different place. It abstracts the workload shape, not the transport:

public interface ISettlementGateway
{
    Task<SettlementResponse> SubmitAsync(
        Payment payment,
        CancellationToken cancellationToken);
}

public interface IPaymentSettler
{
    Task<SettlementProgress> SettleAsync(
        SettlementBatch batch,
        IProgress<SettlementProgress>? progress,
        CancellationToken cancellationToken);
}

Both interfaces speak the domain. Payment and SettlementBatch are records in Settlement.Core/Models/. The downstream payments network is an ISettlementGateway; the demo wires FakeSettlementGateway (deterministic, fast, no network), and production swaps in a typed HttpClient against the real provider. The workload entrypoint is IPaymentSettler. One method each. No BlobClient, no QueueClient, no ServiceBusSender anywhere in the file.

Infrastructure clients live above `Settlement.Core`

The SDK clients still exist; they just live above the workload library. The App Service and Container App hosts both drain the same queue, and both register QueueServiceClient via Microsoft.Extensions.Azure. From Settlement.AppService/Program.cs:

builder.Services.AddAzureClients(clientBuilder =>
{
    if (!string.IsNullOrWhiteSpace(queueConnection))
    {
        clientBuilder.AddQueueServiceClient(queueConnection);
    }
    else if (!string.IsNullOrWhiteSpace(queueServiceUri))
    {
        clientBuilder.AddQueueServiceClient(new Uri(queueServiceUri));
        clientBuilder.UseCredential(new DefaultAzureCredential());
    }
    else
    {
        throw new InvalidOperationException(
            "Queue:ConnectionString or Queue:ServiceUri must be configured.");
    }
});

Two branches, one shared registration. The ConnectionString branch is for local development against Azurite; the ServiceUri branch uses DefaultAzureCredential so the production host authenticates with managed identity. The choice is config-driven, so the composition root does not change between dev and prod.

QueueServiceClient is the per-account singleton. The per-queue QueueClient is derived from it:

builder.Services.AddSingleton(sp =>
{
    var service = sp.GetRequiredService<QueueServiceClient>();
    var name = sp.GetRequiredService<IOptions<QueueOptions>>().Value.QueueName;
    return service.GetQueueClient(name);
});

SettlementWorker keeps its existing QueueClient constructor parameter; only the wiring above changed. The Container App host has the same block, with one config object difference.

The Functions host does not call `AddAzureClients`

Settlement.FunctionApp/Program.cs skips this whole section. It does not need to. The [QueueTrigger("settlement-batches", Connection = "AzureWebJobsStorage")] attribute reads the storage connection from the host's AzureWebJobsStorage setting and manages the client itself. To run the Function host on managed identity instead of a connection string, set AzureWebJobsStorage__queueServiceUri on the app's configuration and grant the function app's identity the Storage Queue Data Contributor role on the storage account. Microsoft Learn documents the suffix pattern in identity-based connections for Functions. That is a hosting change, not a code change.

Keep the asymmetry in mind when reading the three Program.cs files side by side. The App Service and Container App hosts manage the queue client themselves because their BackgroundService is the activation surface. The Function host hands that job to the binding.

Building portable service classes

A service class is portable when three rules hold. Each one rules out a specific failure I have watched bite during migration.

Rule 1: zero references to `Microsoft.Azure.Functions.*` in the project file

The portability claim has to survive grep. In MigrationDemo:

$ grep -r "Microsoft.Azure.Functions" Settlement.Core/
(no matches)

$ dotnet list Settlement.Core/Settlement.Core.csproj package --include-transitive | grep -i Functions
(no matches)

Settlement.Core.csproj declares three direct references, all Microsoft.Extensions.*: DI abstractions, logging abstractions, options. If anything in that list grows a transitive Microsoft.Azure.Functions.* dependency, the library is no longer host-agnostic and the migration story stops working. Make this check part of CI so it does not regress quietly.

Rule 2: configuration via `IOptions<T>` with validated binding

PaymentSettler reads its only knob through PaymentSettlerOptions:

public sealed class PaymentSettlerOptions
{
    public const string SectionName = "PaymentSettler";

    [Range(1, 100_000)]
    public int ProgressReportInterval { get; init; } = 1;
}

Every host binds the same options the same way:

builder.Services
    .AddOptions<PaymentSettlerOptions>()
    .Bind(builder.Configuration.GetSection(PaymentSettlerOptions.SectionName))
    .ValidateDataAnnotations()
    .ValidateOnStart();

What changes between hosts is the source of the value, not the shape:

Functions: PaymentSettler__ProgressReportInterval as an app setting, or Values:PaymentSettler:ProgressReportInterval in local.settings.json. The double underscore is the cross-platform environment-variable hierarchy delimiter that maps to : in the configuration tree.
App Service: nested JSON in appsettings.Development.json ({ "PaymentSettler": { "ProgressReportInterval": 50 } }) or the same env-var shape in the portal.
Container App: nested JSON plus the option to inject PaymentSettler__ProgressReportInterval at runtime, often via a Container Apps secret.

IConfiguration collapses all three sources into one binding call. PaymentSettler only ever sees IOptions<PaymentSettlerOptions>. Validation runs on startup, so a missing or out-of-range value fails the host before the first message is processed instead of crashing one batch in.
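To see what "out-of-range fails the host" means in practice, the data-annotation check can be run by hand. This is a minimal sketch, assuming that ValidateDataAnnotations applies the same System.ComponentModel.DataAnnotations rules; OptionsCheck is a hypothetical helper for illustration, not part of the sample.

```csharp
using System.Collections.Generic;
using System.ComponentModel.DataAnnotations;

// Same shape as the options class shown above.
public sealed class PaymentSettlerOptions
{
    public const string SectionName = "PaymentSettler";

    [Range(1, 100_000)]
    public int ProgressReportInterval { get; init; } = 1;
}

// Hypothetical helper: runs the [Range] rule without any host or DI container.
public static class OptionsCheck
{
    // Returns the validation error messages; an empty list means the options pass.
    public static IReadOnlyList<string> Validate(PaymentSettlerOptions options)
    {
        var results = new List<ValidationResult>();
        Validator.TryValidateObject(
            options,
            new ValidationContext(options),
            results,
            validateAllProperties: true);
        return results.ConvertAll(r => r.ErrorMessage ?? string.Empty);
    }
}
```

Calling OptionsCheck.Validate with ProgressReportInterval = 0 returns one error and with 50 returns none. With ValidateOnStart in place, that error would surface as a startup failure rather than as an exception in the middle of a batch.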

Rule 3: logging via `ILogger<T>`

PaymentSettler takes ILogger<PaymentSettler> in its primary constructor. Nothing else. No TelemetryClient, no FunctionContext.GetLogger, no Console.WriteLine. The sink (Application Insights, OpenTelemetry, console) is wired in each host's Program.cs; the workload never sees which one is registered.

The moment a service starts using TelemetryClient directly, it has a hard dependency on the Application Insights SDK. The Container App host might not register that. The migration story turns from "swap the host" into "audit every logging call site", which is exactly the rewrite the decoupling work was supposed to avoid.

The `BackgroundService` lifetime trap

Settlement.Core/Services/ServiceCollectionExtensions.cs registers IPaymentSettler as a singleton:

public static IServiceCollection AddSettlementCore(this IServiceCollection services)
{
    services.TryAddSingleton<IPaymentSettler, PaymentSettler>();
    return services;
}

That works because PaymentSettler is stateless past its IOptions<> snapshot. The minute a real workload grows a scoped dependency (a DbContext, a tenant-bound HttpClient), the lifetime story diverges across hosts:

The isolated Functions worker creates a scope per invocation. A scoped DbContext gets a fresh instance per message.
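ASP.NET creates a scope per HTTP request, but a BackgroundService is registered as a singleton and resolved from the root provider. Injecting a scoped service into its constructor throws on startup when scope validation is on (the default in the Development environment); with validation off, the dependency is silently captured for the life of the process. Microsoft Learn flags the constraint in the hosted-services guidance.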
The Container App host hits the same root-scope rule because its activation surface is also a BackgroundService.

The fix is to inject IServiceScopeFactory into the BackgroundService and call CreateScope() per message, then resolve the scoped dependency from the scope. The Function host does not need that wrapping. Flag this before the first scoped dependency lands, or the App Service and Container App hosts will diverge from Functions in a way that only shows up at runtime.
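The scope-per-message fix can be sketched roughly as follows. ISettlementStore, ScopedSettlementWorker, and ReceiveNextBatchIdAsync are hypothetical names for illustration; the sample's SettlementWorker does not need this yet because PaymentSettler is a stateless singleton.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

// Hypothetical scoped dependency, standing in for a DbContext-backed store.
public interface ISettlementStore
{
    Task RecordAsync(string batchId, CancellationToken cancellationToken);
}

public sealed class ScopedSettlementWorker(IServiceScopeFactory scopeFactory) : BackgroundService
{
    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            var batchId = await ReceiveNextBatchIdAsync(stoppingToken);

            // One scope per message: scoped services are created here and
            // disposed at the end of the iteration, which mirrors the
            // per-invocation scope the isolated Functions worker provides.
            await using var scope = scopeFactory.CreateAsyncScope();
            var store = scope.ServiceProvider.GetRequiredService<ISettlementStore>();
            await store.RecordAsync(batchId, stoppingToken);
        }
    }

    // Placeholder for the real queue receive; it only simulates waiting for a message.
    private static async Task<string> ReceiveNextBatchIdAsync(CancellationToken cancellationToken)
    {
        await Task.Delay(TimeSpan.FromSeconds(1), cancellationToken);
        return Guid.NewGuid().ToString("N");
    }
}
```

The worker itself stays a singleton and only ever holds the IServiceScopeFactory, which is safe to resolve from the root provider. The scoped registration (AddScoped for the store) and AddHostedService for the worker stay in the host's Program.cs.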

Testing without the Functions runtime

A unit test that spins up the Functions host to verify a discount calculation takes four seconds to start, and goes red every time the worker SDK ships a new version. The discount calculation has nothing to do with the trigger. Push it behind IPaymentSettler and the test becomes plain xUnit construction against the service class.

PaymentSettler takes three constructor parameters: the domain's ISettlementGateway, plus IOptions<PaymentSettlerOptions> and ILogger<PaymentSettler> from Microsoft.Extensions.*. The unit test supplies them by hand:

[Fact]
public async Task SettleAsync_with_all_accepting_gateway_reports_full_settlement()
{
    var batch = new SettlementBatch(
        BatchId: "test-1",
        CutoffUtc: DateTimeOffset.UtcNow,
        Payments: Enumerable.Range(0, 10)
            .Select(i => new Payment($"p-{i}", 100m, "EUR"))
            .ToList());

    var settler = new PaymentSettler(
        gateway: new AlwaysAcceptingGateway(),
        options: Options.Create(new PaymentSettlerOptions { ProgressReportInterval = 1 }),
        logger: NullLogger<PaymentSettler>.Instance);

    var result = await settler.SettleAsync(batch, progress: null, CancellationToken.None);

    Assert.Equal(10, result.Settled);
    Assert.Equal(0, result.Failed);
}

private sealed class AlwaysAcceptingGateway : ISettlementGateway
{
    public Task<SettlementResponse> SubmitAsync(Payment payment, CancellationToken ct) =>
        Task.FromResult(new SettlementResponse(payment.PaymentId, Accepted: true, ReasonCode: null));
}

No TestHostBuilder, no local.settings.json, no func start. The test runs in milliseconds. The hand-rolled AlwaysAcceptingGateway is the lever that makes the assertion deterministic: the FakeSettlementGateway shipped with the sample uses a hash-and-threshold check that is great for reproducible demos and inconvenient when the question is "what happens when this specific batch is all accepted?". Microsoft's Azure SDK unit-testing guide shows both options for Settlement.Core-style services: a hand-rolled test double (AlwaysAcceptingGateway here, which implements the interface directly rather than subclassing a client) or a mocking library like Moq or NSubstitute. Pick the mocking library when the assertion is on interactions; pick the hand-rolled double when the assertion is on behaviour.
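The same lever covers the rejection path. Below is a sketch of a second hand-rolled double that rejects only the payment IDs it is given. The record and interface declarations are stand-ins so the snippet compiles on its own; the property names Amount and Currency and the first parameter name of SettlementResponse are assumptions, and the real types live in Settlement.Core.

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Stand-ins mirroring the shapes used in the test above (names partly assumed).
public sealed record Payment(string PaymentId, decimal Amount, string Currency);
public sealed record SettlementResponse(string PaymentId, bool Accepted, string? ReasonCode);

public interface ISettlementGateway
{
    Task<SettlementResponse> SubmitAsync(Payment payment, CancellationToken cancellationToken);
}

// Hypothetical test double: rejects exactly the listed payment IDs and
// accepts everything else, so the expected counts are known up front.
public sealed class SelectivelyRejectingGateway(IReadOnlySet<string> rejectedIds) : ISettlementGateway
{
    public Task<SettlementResponse> SubmitAsync(Payment payment, CancellationToken cancellationToken)
    {
        var rejected = rejectedIds.Contains(payment.PaymentId);
        return Task.FromResult(new SettlementResponse(
            payment.PaymentId,
            Accepted: !rejected,
            ReasonCode: rejected ? "GATEWAY_DECLINED" : null));
    }
}
```

Passing new HashSet<string> { "p-3", "p-7" } with the ten-payment batch from the test above should yield Settled = 8 and Failed = 2, since SettleAsync counts one outcome per gateway response.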

Integration tests at the same boundary

The integration suite swaps the fake for the real implementation (a typed HttpClient against the payments network) and runs the same SettleAsync call. The trigger is still not in the picture; dotnet test walks the workload directly. For the App Service and Container App variants, the queue surface drains through QueueClient against Azurite. Azurite supports Blob and Queue in GA; the Table emulator is in preview, which is only worth flagging if the suite grows a Table dependency.

The Functions variant has no first-party in-process test host for the isolated worker. The integration story there is "run func start once per build and stimulate the queue." Microsoft documents one in-process Functions test fixture, Microsoft.DurableTask.InProcessTestHost, in the Durable Functions unit-testing guide, and it only covers Durable orchestrations. For a plain queue or HTTP trigger that does no replay, the trigger gets one smoke test per build; everything else is xUnit at the service.

The smoke test that earns its keep

One end-to-end test per build, run against func start, confirms the binding wires up. Its job is to catch host-configuration mistakes such as "the queue name in the [QueueTrigger] attribute does not match the queue the producer writes to" or a missing connection setting. It does not assert business logic; the unit suite already does that in milliseconds.

Before and after: a refactoring walkthrough

The trigger before-and-after is above. The two remaining pieces of the diff are the service IPaymentSettler resolves to and a second host that consumes the same service. Reading both alongside the Function makes the migration claim specific: there is exactly one place the workload lives, and the hosts compete to be the cheapest way to call it.

PaymentSettler.SettleAsync is what IPaymentSettler resolves to. The full method body fits on one screen, in Settlement.Core/Services/PaymentSettler.cs:

public async Task<SettlementProgress> SettleAsync(
    SettlementBatch batch,
    IProgress<SettlementProgress>? progress,
    CancellationToken cancellationToken)
{
    var settled = 0;
    var failed = 0;
    var total = batch.Payments.Count;
    var processed = 0;
    var interval = _options.ProgressReportInterval;

    logger.LogInformation(
        "Settling batch {BatchId}: total={Total}, cutoff={CutoffUtc:O}",
        batch.BatchId, total, batch.CutoffUtc);

    foreach (var payment in batch.Payments)
    {
        cancellationToken.ThrowIfCancellationRequested();

        var response = await gateway.SubmitAsync(payment, cancellationToken);

        if (response.Accepted)
        {
            settled++;
        }
        else
        {
            failed++;
            logger.LogWarning(
                "Settlement rejected for {PaymentId}: {ReasonCode}",
                payment.PaymentId, response.ReasonCode);
        }

        processed++;
        if (processed % interval == 0 || processed == total)
        {
            progress?.Report(new SettlementProgress(settled, failed, total));
        }
    }

    var final = new SettlementProgress(settled, failed, total);
    logger.LogInformation(
        "Batch {BatchId} settled: settled={Settled}, failed={Failed}",
        batch.BatchId, final.Settled, final.Failed);
    return final;
}

Constructor dependencies: ISettlementGateway, IOptions<PaymentSettlerOptions>, ILogger<PaymentSettler>. No Microsoft.Azure.Functions.*. No Azure.Storage.*. The host's identity (queue trigger vs BackgroundService vs anything else) sits above the abstraction, where the workload cannot see it.

The App Service host is one of the hosts above it. Settlement.AppService/Services/SettlementWorker.cs extends BackgroundService, drains the same queue the Function reads, and dispatches each message through ProcessAsync:

private async Task ProcessAsync(QueueMessage message, CancellationToken cancellationToken)
{
    SettlementBatch? batch;
    try
    {
        batch = JsonSerializer.Deserialize<SettlementBatch>(message.Body.ToString());
    }
    catch (JsonException ex)
    {
        logger.LogError(ex, "Discarding malformed message {MessageId}", message.MessageId);
        await queueClient.DeleteMessageAsync(message.MessageId, message.PopReceipt, cancellationToken);
        return;
    }

    if (batch is null)
    {
        await queueClient.DeleteMessageAsync(message.MessageId, message.PopReceipt, cancellationToken);
        return;
    }

    var progress = new Progress<SettlementProgress>(p => status.Update(batch.BatchId, p));
    try
    {
        var result = await settler.SettleAsync(batch, progress, cancellationToken);
        status.Complete(batch.BatchId, result);
        await queueClient.DeleteMessageAsync(message.MessageId, message.PopReceipt, cancellationToken);
    }
    catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
    {
        throw;
    }
    catch (Exception ex)
    {
        logger.LogError(
            ex,
            "Settlement of batch {BatchId} failed; message will reappear after visibility timeout",
            batch.BatchId);
    }
}

The dispatch line that matters is await settler.SettleAsync(batch, progress, cancellationToken);. Same IPaymentSettler, same three-argument signature, same cancellation propagation as the Function. The App Service variant passes a real IProgress<SettlementProgress> (the worker keeps a live /status endpoint backed by SettlementWorkerStatus); the Function and Container App variants pass null. That is the entire workload diff between the three hosts.
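The loop that feeds ProcessAsync is not shown above. A minimal sketch of what such a polling loop could look like with the Azure.Storage.Queues client follows. QueuePollingWorker, the idle delay, and the visibility timeout are assumptions for illustration; the sample's SettlementWorker may batch, back off, or log differently.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Azure.Storage.Queues;
using Azure.Storage.Queues.Models;
using Microsoft.Extensions.Hosting;

// Hypothetical base class: owns the receive loop, leaves dispatch to the subclass.
public abstract class QueuePollingWorker(QueueClient queueClient) : BackgroundService
{
    // Assumed values; tune both to the workload.
    private static readonly TimeSpan IdleDelay = TimeSpan.FromSeconds(2);
    private static readonly TimeSpan VisibilityTimeout = TimeSpan.FromMinutes(30);

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        await queueClient.CreateIfNotExistsAsync(cancellationToken: stoppingToken);

        while (!stoppingToken.IsCancellationRequested)
        {
            // The message stays invisible to other consumers for VisibilityTimeout;
            // if it is not deleted by then, it reappears and is retried.
            var response = await queueClient.ReceiveMessagesAsync(
                maxMessages: 1,
                visibilityTimeout: VisibilityTimeout,
                cancellationToken: stoppingToken);

            if (response.Value.Length == 0)
            {
                // Empty queue: wait briefly instead of polling in a tight loop.
                await Task.Delay(IdleDelay, stoppingToken);
                continue;
            }

            foreach (var message in response.Value)
            {
                await ProcessAsync(message, stoppingToken);
            }
        }
    }

    // Deserialize, call the workload, and delete the message on success.
    protected abstract Task ProcessAsync(QueueMessage message, CancellationToken cancellationToken);
}
```

The visibility timeout takes over the role the binding's settings play in the Function host: it has to exceed the longest expected batch, otherwise a second consumer picks up a message that is still being settled.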

What does differ between hosts is composition root and activation surface. The Function App's Program.cs is shorter because the binding owns the queue client; the App Service and Container App hosts register QueueServiceClient via AddAzureClients and derive QueueClient from it. The activation surface is [QueueTrigger] for the Function, BackgroundService.ExecuteAsync for the other two. Everything below those lines, AddSettlementCore(), the gateway registration, the options binding, the logger pipeline, is shared verbatim. The host moves; the workload does not.

Series wrap-up: where to go from here

Series 2 set out to answer one question: when does Azure Functions stop paying its freight, and what do you do about it? The six articles trace the ladder.

Running Azure Functions in Docker and Docker Pitfalls I Hit packaged the Function App into a container so the runtime moved with the code. Scaling Azure Functions made the cost ceilings visible by walking the Consumption, Premium, and Dedicated plans against real workloads. Structuring Complex Function Apps reorganised the project structure so refactoring stayed tractable as the function count grew. When Azure Functions Fight Back enumerated the four signals that justify a move: timeout walls, sprawl, coupling patterns, cost crossover. This article closes the loop by making the move mechanical. Once the trigger is a thin controller and the workload sits behind IPaymentSettler, the diff between Function App, App Service, and Container App is a Program.cs change.

Three places to go from here:

Series 3 (forthcoming) integrates .NET Aspire into the Functions workflow: AppHost orchestration, Service Bus, Storage, and Redis as Aspire resources, and azd deployment to Container Apps. The decoupled Settlement.Core library carries straight across; only the composition root learns about Aspire.
The MigrationDemo sample is the working reference: one workload, three hosts, identical behaviour. Clone it, swap the connection string for a service URI, and the production-credential path lights up against managed identity.
Microsoft Learn migration guidance covers the surface this article does not: moving from in-process to isolated worker (still relevant as the decoupling that makes any further migration tractable) and the broader App Service migration overview when the destination is not Functions at all.

The decoupling work earns its keep the first time a queue handler runs past 10 minutes and the answer is "register a BackgroundService against the same queue" instead of "rewrite the workload".

Closing question

Which Azure SDK type was hardest for you to push behind an interface: BlobClient, ServiceBusSender, or something else? Reply with the type name.

Azure Functions Beyond the Basics
Continues from Azure Functions for .NET Developers (Parts 1-9)

Part 1: Running Azure Functions in Docker: Why and How
Part 2: Docker Pitfalls I Hit (And How to Avoid Them)
Part 3: Scaling Azure Functions: Consumption vs Premium vs Dedicated
Part 4: Structuring Complex Function Apps: Project Organization
Part 5: When Azure Functions Fight Back: Signs You've Outgrown Them
Part 6: Preparing for Migration: Decoupling Your Function Logic (this article)

When Azure Functions Fight Back: Signs You've Outgrown Them

Martin Oehlert — Fri, 15 May 2026 09:41:53 +0000

Your queue handler hit the 10-minute Consumption ceiling last week. You restructured it to checkpoint, and the next month-end batch still creeps over. The question now is not how to wring one more workaround out of Functions. It is when the next workaround stops being cheaper than moving the job onto a different host. Four signals push the answer past "still cheaper": performance walls, complexity sprawl, coupling patterns the platform makes worse, and a cost crossover that arrives sooner than most teams plan for.

Performance walls you will hit

Four limits decide how far Functions can carry the workload: how long any one invocation can run, how much memory it can use, how many sockets it can hold open, and what its file system actually persists.

Execution timeout per plan

The numbers from the hosting plan timeout reference:

Cross-cutting cap: HTTP triggers that do not respond within 230 seconds are cut off by the Azure Load Balancer with HTTP 502 regardless of functionTimeout. The function keeps running but cannot return a response. Sources: HTTP trigger limits, Web request times out in App Service. For longer work, Microsoft points at the Durable Functions async HTTP pattern.

The host.json functionTimeout reference describes what happens at the cap: "When an execution exceeds this duration, a timeout error occurs and the language worker process restarts." The worker is killed, and in-flight invocations on it are lost. What the trigger does next is per binding:

Service Bus: PeekLock with autoComplete = true. On host crash the lock expires, and on next visibility the message reappears with DeliveryCount incremented. After MaxDeliveryCount (default 10) it lands in <queue>/$deadletterqueue (Service Bus dead-letter queues).
Storage Queue: visibility timeout per message. On host crash the storage-default 10-minute timeout takes over. After 5 failed attempts the message moves to <queue>-poison (Storage queue trigger: Peek lock).
Blob trigger: same five-attempt default, with failed blobs landing in webjobs-blobtrigger-poison (Poison blobs).
HTTP: caller already saw the 502 at 230 s. No automatic retry.
Timer: per Timer trigger: Retry behavior, "the timer trigger doesn't retry after a function fails."

A subtle one to flag: only Cosmos DB, Event Hubs, Kafka, and Timer support host-level retry policies ([FixedDelayRetry], [ExponentialBackoffRetry]). For Service Bus and Storage Queue, the binding's native retry semantics are the only mechanism (Azure Functions error handling and retries). Decorating a Service Bus trigger with [ExponentialBackoffRetry] does nothing.

The fix that lets you stay (when it can)

Most "we hit the timeout" stories are workloads that can be split into chunks with a checkpoint between them. A queue-triggered batch that processes 50,000 items in 50 ms each runs 42 minutes. The same handler that processes 500 at a time and writes a cursor blob between chunks survives any number of restarts. From the companion sample:

[Function(nameof(DrainBatchFunction))]
public async Task Run(
    [QueueTrigger("batches", Connection = "AzureWebJobsStorage")] BatchCommand command,
    CancellationToken cancellationToken)
{
    await cursors.CreateIfNotExistsAsync(cancellationToken: cancellationToken);

    var checkpoint = cursors.GetBlobClient($"batch-{command.BatchId}.cursor");
    var lastCommitted = await ReadCursorAsync(checkpoint, cancellationToken);

    await foreach (var chunk in source.ChunksAfterAsync(
                       command.BatchId, lastCommitted, _chunkSize, cancellationToken))
    {
        await ProcessAsync(chunk, cancellationToken);

        // Commit the cursor *before* the next chunk; if the worker is killed
        // immediately after this upload, the next attempt resumes here.
        await checkpoint.UploadAsync(
            BinaryData.FromString(chunk.LastItemId.ToString(CultureInfo.InvariantCulture)),
            overwrite: true,
            cancellationToken);
    }
}

The shape is what matters: chunk, process, commit cursor, repeat. Each chunk must be small enough that its worst case (500 items at 50 ms = 25 s) finishes well inside the timeout. The cursor write is the moment durability shifts. On retry, ReadCursorAsync resumes at the last committed item instead of restarting at item 1.
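ReadCursorAsync is not shown in the excerpt. One plausible shape, assuming item IDs are positive integers written as invariant-culture text (as the upload call suggests), is sketched below. CursorReader and the "0 means start from the beginning" convention are assumptions, not the companion sample's code.

```csharp
using System.Globalization;
using System.Threading;
using System.Threading.Tasks;
using Azure;
using Azure.Storage.Blobs;

public static class CursorReader
{
    // Hypothetical helper: returns the last committed item ID,
    // or 0 when no cursor blob has been written for this batch yet.
    public static async Task<long> ReadCursorAsync(
        BlobClient checkpoint,
        CancellationToken cancellationToken)
    {
        try
        {
            var download = await checkpoint.DownloadContentAsync(cancellationToken);
            var text = download.Value.Content.ToString();
            return long.Parse(text, CultureInfo.InvariantCulture);
        }
        catch (RequestFailedException ex) when (ex.Status == 404)
        {
            // First attempt for this batch: nothing committed so far.
            return 0;
        }
    }
}
```

Catching the 404 rather than calling ExistsAsync first keeps the read to a single round trip, and it treats "no cursor yet" as an ordinary starting state instead of an error.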

This pattern keeps you on Functions. When the single chunk itself runs longer than the timeout (a 12-minute database query, a multi-gigabyte file copy, a 30-minute model inference), the workload has outgrown the platform. No chunk size helps.

Memory ceilings

Per-instance memory from the service limits table:

Two surprises in that list. Flex Consumption has a 512 MB SKU. Most teams reading the marketing page assume Flex starts where Premium does. And from the trigger's perspective, OOM and timeout look the same: the OS terminates the worker, the host restarts, and the per-binding retry semantics from the previous section decide whether the input is replayed or dropped. The closest official reference is the host health monitor, which can recycle the host preemptively when performance counters stay above 80% within the health-check window.

The blob trigger has a documented memory amplifier worth knowing. Microsoft's Blob trigger: Memory usage and concurrency warns that "the runtime must load the entire blob into memory more than one time during processing" if you bind to a non-streaming type, and concurrency multiplies the effect. Bind to Stream for anything past a few MB.

SNAT ports and connection exhaustion

Two distinct outbound limits, easy to confuse.

SNAT port budget per instance. From Troubleshoot intermittent outbound connection errors: "Each instance on Azure App service is initially given a preallocated number of 128 SNAT ports." Ports apply to the same destination tuple (address + port) and are reclaimed by the load balancer four minutes after the connection closes. Microsoft's recommendation is to keep usage under 100 outbound connections per unique remote endpoint per instance.

Total outbound TCP connections per instance. Consumption is capped at 600 active (1,200 total) per instance, and the runtime logs Host thresholds exceeded: Connections at the limit. Flex, Premium, and Dedicated are listed as "unbounded" (Dedicated still subject to App Service worker-size caps).

The fastest way to walk into both limits is the canonical Functions anti-pattern: new HttpClient() inside a function body. Each invocation creates a new socket pool, and sockets sit in TIME_WAIT after disposal, compounded by the four-minute SNAT reclaim. At any reasonable RPS, the 128-port budget for a destination host is exhausted, and the function sees intermittent connect failures or SocketException. The AZF0002 analyzer flags the call site at build time.

The recommended fix is IHttpClientFactory, registered once in Program.cs. From the companion sample:

var builder = FunctionsApplication.CreateBuilder(args);
builder.ConfigureFunctionsWebApplication();

builder.Services
    .AddOptions<PaymentsOptions>()
    .Bind(builder.Configuration.GetSection(PaymentsOptions.SectionName))
    .ValidateDataAnnotations()
    .ValidateOnStart();

builder.Services
    .AddHttpClient<IPaymentsApi, PaymentsApi>((sp, client) =>
    {
        var options = sp.GetRequiredService<IOptions<PaymentsOptions>>().Value;
        client.BaseAddress = new Uri(options.BaseAddress);
        client.Timeout = TimeSpan.FromSeconds(options.TimeoutSeconds);
    })
    .AddStandardResilienceHandler();

The factory caches HttpMessageHandler instances (default lifetime 2 min per HttpClient lifetime management) and rotates them so DNS changes are picked up. The AddStandardResilienceHandler call layers in retry, circuit breaker, and timeout policies from Microsoft.Extensions.Http.Resilience without an extra DI dance. The function takes IPaymentsApi on its primary constructor and never calls new HttpClient() again.

One caveat from HttpClient guidelines: Recommended use: the factory shares a CookieContainer across pooled handlers. If your client depends on cookies, prefer a singleton with an explicit SocketsHttpHandler that sets PooledConnectionLifetime instead.
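For the cookie-dependent case, a minimal sketch of the singleton alternative looks like this. CookieAwareHttp is a hypothetical name, and the two-minute PooledConnectionLifetime is an assumption chosen to mirror the factory's default handler lifetime. Tune it to how quickly your DNS records change.

```csharp
using System;
using System.Net.Http;

// Hypothetical holder class; not part of the companion sample.
public static class CookieAwareHttp
{
    public static SocketsHttpHandler CreateHandler() => new SocketsHttpHandler
    {
        // Recycle pooled connections periodically so DNS changes are picked up,
        // which is the job the factory's handler rotation would otherwise do.
        PooledConnectionLifetime = TimeSpan.FromMinutes(2),
    };

    // One long-lived client for the whole process. The handler owns a single
    // CookieContainer, so cookies are not shared across unrelated pooled handlers.
    public static readonly HttpClient Client = new HttpClient(CreateHandler());
}
```

In Program.cs the instance could be registered with builder.Services.AddSingleton(CookieAwareHttp.Client), so functions still receive it through the constructor rather than reaching for a static.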

Database connections do not get the same fix

ADO.NET pools per process, and Functions runs one process per instance. Microsoft's SqlClient guidance is direct: "ADO.NET implements connection pooling by default. But because you can still run out of connections, you should optimize connections to the database."

The math is the trap. Each Consumption instance carries its own pool (default Max Pool Size = 100), and the platform can spin up to 200 Consumption instances (service limits). At full scale-out: 200 instances * 100 connections = 20,000 potential connections against the database. Premium is capped at 100 instances, Flex at 1,000. Most managed databases fall over well below that.

The fix lives at the database, not in Functions. Lower the pool size per instance, cap the application's max scale-out, or front the database with a connection pooler (PgBouncer, RDS Proxy equivalents on Azure Database for PostgreSQL Flexible Server).
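The first two options come down to the same division. A rough sketch, assuming a hypothetical PoolBudget helper and a database connection limit you look up for your own tier (the 2,000 used in the note after the code is illustrative only):

```csharp
using System;

// Hypothetical helper (not part of the companion sample).
public static class PoolBudget
{
    // Connections the database could see if every instance fills its pool.
    public static long WorstCaseConnections(int maxInstances, int maxPoolSize) =>
        (long)maxInstances * maxPoolSize;

    // Largest per-instance Max Pool Size that keeps full scale-out under the limit.
    // Returns at least 1: when the limit is below the instance count, no pool size
    // fits and the scale-out cap has to come down instead.
    public static int MaxPoolSizeFor(int databaseConnectionLimit, int maxInstances)
    {
        if (maxInstances <= 0)
            throw new ArgumentOutOfRangeException(nameof(maxInstances));

        return Math.Max(1, databaseConnectionLimit / maxInstances);
    }
}
```

PoolBudget.WorstCaseConnections(200, 100) reproduces the 20,000 figure above. For a database that accepts 2,000 connections, MaxPoolSizeFor(2000, 200) gives a Max Pool Size of 10 per instance, which goes into the connection string.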

Local file system

Three behaviours worth distinguishing.

Per-instance ephemeral temp. From Operating system functionality in App Service: %SystemDrive%\local is reserved for temporary local storage, "not persistent across app restarts." Plan-by-plan capacity ranges from 0.5 GB (Consumption) to 21-140 GB (Premium / Dedicated). Scale-out does not just reset temp. It makes the data invisible: instance A writes /tmp/cache.json, and instance B never sees it.

Persisted shares. Premium and Dedicated can mount Azure Files (SMB or NFS). Flex Consumption supports SMB and read-only Blobs but not NFS (Choose a file access strategy). SMB cold-start latency is documented at 200-500 ms on first execution.

Read-only file systems. When WEBSITE_RUN_FROM_PACKAGE is set, "the wwwroot folder is read-only and you receive an error if you write files to this directory" (Run from package: General considerations). The same page is explicit: "Don't add the WEBSITE_RUN_FROM_PACKAGE app setting to apps on the Flex Consumption plan." Flex's deployment model treats the package as read-only by design. Container Apps follows container-image semantics: image layers are read-only.

The Durable Functions provider doc gives the blunt warning: "Storing payloads to local disks is not recommended, since on-disk state isn't guaranteed to be available" (Azure Storage representation in a task hub). Anything you want to read on a different instance, or after a restart, belongs in Blob, Cosmos, or another external store.

Complexity signals that indicate a problem

Performance walls fail the workload outright. Complexity signals fail it slowly: the function count keeps climbing, the orchestration diagram keeps growing legends, and at some point you cannot describe the system without a whiteboard. None of the limits below are timeouts. They are the shape of the architecture telling you it has outgrown the deployment unit.

Durable Functions sub-orchestration sprawl

Microsoft lists three legitimate uses for CallSubOrchestratorAsync (Sub-orchestrations: When to use):

Compose reusable workflow building blocks shared across parents.
Fan out parallel instances of the same orchestrator and wait for all.
Organise a large orchestration into named, testable pieces.

If a sub-orchestration is not doing one of those three jobs, it is decoration, and the cost compounds. The same docs add a hard constraint: "Sub-orchestrations must be defined in the same app as the parent orchestration." The cross-app workaround is the HTTP 202 polling pattern, a different programming model with no parent/child semantics, no shared retry policy, and no automatic exception propagation. The reason is the task hub model: "If multiple apps use the same task hub, they compete for messages, which can result in undefined behavior, including orchestrations getting unexpectedly stuck."

The decision worth surfacing in your design review: task hub partition count is immutable after creation. Default 4, max 16. From Performance and scale: Partition count: "You can't change the partition count after you create a task hub. Set it high enough to meet expected scale-out requirements." Most teams discover this when their orchestrator throughput plateaus and they reach for the dial that does not exist.

Sprawl signals inside one app:

Parent depth of 3+ levels where the grandchild does almost no work.
Sub-orchestrators that wrap a single activity call.
Parents fanning out to sub-orchestrators that themselves fan out: one instance ID can spawn dozens of children, each consuming control queue capacity.

Orchestrator code constraints

Orchestrators replay from history every time a new event arrives. The runtime requires deterministic code: the same input must produce the same call sequence on every replay. From Orchestrator function code constraints, the following must not appear inside an orchestrator body:

DateTime.Now / DateTime.UtcNow (use context.CurrentUtcDateTime).
Guid.NewGuid() (use context.NewGuid(), which returns Type 5 UUIDs derived from instance ID).
Bindings, including the orchestration client and entity client bindings. I/O lives in activities.
Static variables and environment variables.
Direct network or HTTP calls.
Task.Run, Task.Delay, Thread.Sleep, HttpClient.SendAsync. Use context.CreateTimer for delays.

The clean rule is one sentence: orchestrators schedule, activities act. Anything that reads time, calls the network, or generates a random value belongs in an activity. The framework throws NonDeterministicOrchestrationException sometimes, but the docs are explicit: "this detection behavior won't catch all violations, and you shouldn't depend on it." Violations ship and break weeks later.

Shared external state as bottleneck

Account-level ceilings from Standard storage account scalability targets:

20,000 RPS per general-purpose v2 account in most regions, and 40,000 RPS in higher-tier regions.
Hitting any of these returns HTTP 503 Server Busy or HTTP 500 Operation Timeout.

Per-service ceilings (the ones that bite first):

Storage Queue: account 20,000 msg/s, single queue tops out at 2,000 msg/s (Queue Storage scalability, Data partitioning strategies).
Storage Table: 20,000 entities/s account-wide, 2,000 entities/s per partition (Table Storage scalability). Throttling shows as PercentThrottlingError. Date-as-partition-key is the canonical anti-pattern.
Cosmos DB hot partition: hard ceiling of 10,000 RU/s per logical partition (Partitioning and horizontal scaling), regardless of total provisioned throughput. Splitting does not help when the key is genuinely hot (Redistribute throughput). The fix is a different partition key.

Durable Functions sits on top of those numbers. Every task hub creates two Azure Tables (<hub>History, <hub>Instances), one work-item queue, one control queue per partition, and blob containers for leases and large messages (Azure Storage representation in a task hub). When several apps point at the same storage account, every one fights for the same per-partition 2,000 entities/s on the History and Instances tables. Microsoft's reliability guidance is direct: "use a separate storage account for each function app. This aspect is especially true with Durable Functions and Event Hubs triggered functions" (Best practices for reliable Azure Functions).

Service Bus session locks

The architectural smell is two or more functions in the same function app triggering on the same session-enabled queue. Sessions hold an exclusive lock per session ID (Message sessions): only one receiver at a time, per-session FIFO, default MaxDeliveryCount = 10. The Functions session defaults depend on the extension version (Service Bus host.json settings): the 4.x extension defaults sessionHandlerOptions.maxConcurrentSessions to 2000, while 5.x defaults to maxConcurrentSessions = 8 with maxConcurrentCallsPerSession = 1. maxConcurrentCalls = 16 is the per-instance default for queues and subscriptions without sessions, not a per-session figure. Thread starvation at high MaxConcurrentSessions is documented in the troubleshooting guide.

Two functions in the same app are not parallelising work over those sessions. They are competing for session locks. SessionLockLost shows up when the lock expires before renewal, the partition rebalances, or the AMQP link is idle for 10 minutes (Service Bus messaging exceptions: SessionLockLost). The fix is one consumer per session-enabled queue per app, period.

Function count vs feature count

Microsoft does not publish a number for "too many functions in one app", but the Function organization best practices describe the failure mode plainly:

Each function that you create has a memory footprint. While this footprint is usually small, having too many functions within a function app can lead to slower startup of your app on new instances.

Connection strings and other credentials stored in application settings gives all of the functions in the function app the same set of permissions in the associated resource. Consider minimizing the number of functions with access to specific credentials by moving functions that don't use those credentials to a separate function app.

A self-test you can run in 10 minutes:

Can you name the feature each function delivers without opening the code?
Does explaining one feature require drawing a diagram of 5+ function boundaries?
Is your function-to-feature ratio above 3:1? (50 functions for 8 features is just over 6:1.)
Do all functions share the same connection strings whether they use them or not?
Do load profiles inside the app diverge? (Chatty queue trigger next to memory-heavy report function.)
Does shipping one function redeploy 49 others?
Does cold start time grow with every release?
Does the same Durable task hub serve more than one feature area?

Three or more is the smell. Six or more is "this should have been split a quarter ago." The W19 split-apps approach (Part 4 of this series, Structuring Complex Function Apps) handles the first wave of this. If splitting still leaves you fighting the platform, the article you are reading is the second wave.

Coupling patterns that fight the serverless model

Some of the worst Functions deployments do not look bad on any one screen. The handlers are clean, every function is short, the metrics are healthy. The damage is in the topology: the way the functions wire to each other multiplies cost, slows change, or quietly burns money in a loop. Four shapes show up over and over.

Sequential chains (the service with three methods)

Function A writes a message. Function B is triggered by it. Function C is triggered by B's message. Every input traverses the same three hops, with no branching and no fan-out. It is a workflow, not a serverless decomposition.

Microsoft's Pipes and Filters pattern is explicit on when not to use it: "the processing steps performed by an application aren't independent, or they have to be performed together as part of a single transaction." A three-step lockstep chain meets that condition. The same page points at Compute Resource Consolidation for the consolidation: "You can group filters that should scale together in the same process."

Cost per hop is five charges:

Source-function invocation (per-execution + GB-second).
Queue write (one transaction).
Queue read by the next function (another transaction).
Serialize + deserialize (CPU on both sides).
New consumer invocation (per-execution + GB-second again).

A three-function chain triples that. The corrected shapes are documented: collapse to one Functions invocation that calls the three operations as private methods (single trigger, single execution, no inter-function queues), or move to Durable Functions function chaining if the steps are genuinely separate but coordinated. The orchestrator keeps durable state and tracks the choreography explicitly.

Shared database schemas across Function Apps

Microsoft's Data considerations for microservices is unambiguous:

Two services shouldn't share a data store. Each service manages its own private data store, and other services can't access it directly.

Services can safely share the same physical database server. Problems occur when services share the same schema, or they read and write to the same set of database tables.

Two Function Apps on the same Azure SQL server with separate schemas is fine. Two Function Apps writing the same Orders table is the antipattern.

The cost surfaces at migration time. Additive changes (new column, new table) work. Destructive changes (rename, drop, type change, NOT NULL on a populated column) require all N apps to agree on a release order: forward-compat code shipped in advance to every app, or a coordinated cutover that removes the ability of any one app to deploy independently. Two-phase migrations (expand-and-contract) become the default. The blue/green compatibility window is the intersection of every app's deploy windows. Each schema version must be readable and writable by every prior version of every consumer that might still be running (multitenant antipatterns).

If three apps share an Orders table and one ships weekly, one biweekly, one monthly, the slowest cadence sets the floor. A hosted service that owns the schema collapses N consumers to 1, and migrations stop being a coordination problem. The trade is what you give up: per-function scaling and trigger-binding ergonomics. The Saga pattern: Context and problem acknowledges the trade explicitly.

Circular queue dependencies and poison loops

Two defaults to cite side by side, because readers conflate them:

Storage Queue maxDequeueCount = 5. After 5 failures the message goes to <originalqueue>-poison (Storage queue host.json settings).
Service Bus MaxDeliveryCount = 10. After 10 attempts the message goes to the DLQ at <queue path>/$deadletterqueue (Service Bus dead-letter queues). "There's no automatic cleanup of the DLQ. Messages remain in the DLQ until you explicitly retrieve them."

Functions handles the settlement automatically: "By default, the runtime calls Complete on the message if the function finishes successfully, or calls Abandon if the function fails" (Service Bus trigger: PeekLock). The two antipatterns that subvert this safety net:

Retry queue that re-enqueues to the input queue. Handler catches the exception, writes the message back to the input queue with a transient-failure tag, returns success. The runtime never sees a failure, dequeueCount resets each round trip, and the message lives forever. The MessageId changes because the application is publishing a new message each time, so log correlation by ID misses it.
Dead-letter handler that re-triggers the original function. A second function with a trigger on the poison/DLQ subqueue picks up failed messages and calls back into the original function's logic (or worse, writes back to the original input queue). Result: input -> processing -> poison -> handler -> input ad infinitum. Service Bus eventually surfaces QuotaExceeded (messaging exceptions). Storage Queues do not fail nearly as loudly. The loop just costs money until somebody notices the bill.
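Both loops share one root cause: the retry count lives in the broker, and republishing resets it. If a handler really must re-enqueue, one way to keep the loop bounded is to carry the attempt count inside the message. This is a sketch of that idea under stated assumptions, not something the companion sample does: RetryEnvelope and RetryPolicy are hypothetical, and the cap of 5 simply mirrors the Storage Queue default.

```csharp
#nullable enable
using System;

// Hypothetical message envelope; the attempt count travels with the payload,
// so it survives a republish that resets the broker's own dequeue counter.
public sealed record RetryEnvelope(string Payload, int Attempt);

public static class RetryPolicy
{
    public const int MaxAttempts = 5; // mirrors the Storage Queue maxDequeueCount default

    // Returns the envelope to re-enqueue after a failure, or null when the
    // message should be parked for a human instead of going around again.
    public static RetryEnvelope? NextAttempt(RetryEnvelope current)
    {
        var next = current.Attempt + 1;
        return next >= MaxAttempts ? null : current with { Attempt = next };
    }
}
```

A message that starts at Attempt = 0 is processed at most five times. When NextAttempt returns null, the handler should let the failure surface, or write to a parking queue that nothing triggers on, which breaks the cycle described above.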

Microsoft names the failure shape directly. The Choreography pattern carries the warning: "There's a risk of cyclic dependency between saga participants because they have to consume each other's commands."

Detecting poison loops in Application Insights

The End-to-end transaction details view shows a Gantt chart of every server-side telemetry event for a correlated operation_Id across all instrumented components. For a Functions chain, each invocation shows up as a request span with the queue-dependency calls between them, all under the same operation. N+1 invocations of the same function name under the same operation_Id is the loop signature. The Application Map makes the cycle visible at the topology level: a node with a self-edge or a tight A->B->A cycle.

The detection workflow is two clicks: Failures view, drill into a sample exception, the End-to-end transaction view opens with the trace tree expanded. If you cannot tell at a glance whether a transaction is one logical request or three loops of the same one, instrument before you refactor.

Configuration drift

Splitting one Function App into N copies the configuration N times. Each app has its own connection strings, API keys, storage keys in App Settings. Rotating a secret means N updates. Missing one leaves a stale credential in production until the next deploy.

Key Vault references move the storage out of App Settings and into a vault. The syntax (Use Key Vault references as app settings) takes one of two forms:

@Microsoft.KeyVault(SecretUri=https://myvault.vault.azure.net/secrets/mysecret)
@Microsoft.KeyVault(VaultName=myvault;SecretName=mysecret)

The failure mode the same page documents: "If a reference isn't resolved properly, the reference string is used instead." Real production failure pattern: the function tries to authenticate with the literal string @Microsoft.KeyVault(...), gets a 401, and the operator stares at App Settings that look correct in the portal. The WEBSITE_KEYVAULT_REFERENCES env var holds resolution status for every reference, and the portal exposes a Key Vault Application Settings Diagnostics detector. Both are worth knowing before the first incident.
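Because an unresolved reference arrives as an ordinary string, the app can check for it at startup instead of waiting for the 401. A minimal sketch, assuming a hypothetical KeyVaultReferenceGuard helper. The only platform fact it relies on is the documented @Microsoft.KeyVault( prefix.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical startup guard (not part of the companion sample).
public static class KeyVaultReferenceGuard
{
    private const string Prefix = "@Microsoft.KeyVault(";

    // Returns the names of settings whose value is still the literal reference
    // string, meaning the platform did not resolve it to the secret.
    public static IReadOnlyList<string> FindUnresolved(
        IEnumerable<KeyValuePair<string, string?>> settings)
    {
        var unresolved = new List<string>();
        foreach (var (key, value) in settings)
        {
            if (value is not null &&
                value.TrimStart().StartsWith(Prefix, StringComparison.OrdinalIgnoreCase))
            {
                unresolved.Add(key);
            }
        }
        return unresolved;
    }
}
```

Calling FindUnresolved(builder.Configuration.AsEnumerable()) in Program.cs and throwing when the list is non-empty turns a confusing downstream 401 into a startup failure that names the setting.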

The "partially adopted" antipattern is the worst version: some apps reference @Microsoft.KeyVault(...), others have raw values because the migration was incomplete. Rotating the secret in the vault updates the references and leaves the raw-value apps stale. Configuration looks correct in the portal until the next failure surface, usually a 401 from a downstream service hours after rotation.

Azure App Configuration collapses N App Settings stores into one. "Spreading configuration settings across these components can lead to hard-to-troubleshoot errors during an application deployment. Use App Configuration to store all the settings for your application and secure their accesses in one place." App Configuration handles non-secret config and holds Key Vault references for secret values. Key Vault stays the secret store. The trade is one runtime dependency, N role assignments, and refresh semantics that are opt-in (without dynamic refresh, settings are read once at startup). For two-app workloads it might not pay. For ten-app workloads it almost always does (App Configuration: high resiliency).

The cost crossover point

"Functions is too expensive at scale" and "Functions is the cheap option" are both true. They describe different points on the same curve. The crossover happens earlier than most teams plan for, and the worked example below makes it concrete.

A worked example: 10 RPS at 200 ms / 256 MB

Assumptions: 10 RPS sustained for 30 days, 200 ms execution, 256 MB memory. East US 2. Single subscription with the per-month free grants applied. All numbers verified against the Azure Retail Prices API on 2026-05-08.

Volumes:

Executions: 10 * 3,600 * 24 * 30 = 25,920,000
GB-seconds: 0.25 * 0.2 * 25,920,000 = 1,296,000

The Flex number is the trap. Flex bills a 1-second minimum per execution then rounds to 100 ms above that (Flex billing). At 200 ms / 256 MB, every invocation bills as if it ran 1 second, which is 5x the GB-seconds and pushes the bill 9x above legacy Consumption. Teams switching from Consumption to Flex "for the per-function scaling" walk into this and call the platform expensive. Verify on real workload before committing.

The crossover with EP1 is the other number worth keeping in your head. Setting Consumption_total = EP1_floor and solving for RPS at 200 ms / 256 MB:

Cost ≈ 2.592 * R - 6.60 = 145.93
R ≈ 58.8 RPS sustained

At the 200 ms / 256 MB shape, Consumption ties EP1 around 60 RPS sustained. Below that, Consumption wins on cost. Above that, EP1 starts winning and the Premium-only reasons (always-on, VNet integration, longer timeouts) compound the case. The crossover shifts with execution length. Plugging other durations into the same formula: at 1 s / 256 MB it falls to about 14 RPS, and at 50 ms, billed at Consumption's 100 ms minimum, it rises to roughly 98 RPS. Pick the shape that matches your workload before you quote a number.
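To rerun the crossover for your own shape, the formula can be sketched in a few lines. The rates below are assumptions back-derived from the 2.592 and 6.60 coefficients above ($0.20 per million executions, $0.000016 per GB-second, a free grant of 1 million executions and 400,000 GB-seconds). ConsumptionCost is a hypothetical helper. The model is only valid once both meters exceed their free grants, and it ignores the rounding of memory to 128 MB steps.

```csharp
using System;

// Hypothetical cost model; check the rates against current pricing before relying on it.
public static class ConsumptionCost
{
    private const double PricePerMillionExecutions = 0.20; // assumed rate
    private const double PricePerGbSecond = 0.000016;      // assumed rate
    private const double FreeExecutions = 1_000_000;       // assumed monthly grant
    private const double FreeGbSeconds = 400_000;          // assumed monthly grant
    private const double MinBilledSeconds = 0.1;           // Consumption bills at least 100 ms
    private const double SecondsPerMonth = 3_600 * 24 * 30;

    // Monthly cost of sustaining 1 RPS, before the free grant.
    public static double CostPerRps(double executionSeconds, double memoryGb) =>
        SecondsPerMonth * (PricePerMillionExecutions / 1_000_000
            + Math.Max(executionSeconds, MinBilledSeconds) * memoryGb * PricePerGbSecond);

    // Dollar value of the monthly free grant.
    public static double FreeGrantValue =>
        FreeExecutions / 1_000_000 * PricePerMillionExecutions
        + FreeGbSeconds * PricePerGbSecond;

    public static double MonthlyCost(double rps, double executionSeconds, double memoryGb) =>
        Math.Max(0, rps * CostPerRps(executionSeconds, memoryGb) - FreeGrantValue);

    // Sustained RPS at which Consumption costs the same as a fixed monthly floor.
    public static double CrossoverRps(double fixedMonthlyFloor, double executionSeconds, double memoryGb) =>
        (fixedMonthlyFloor + FreeGrantValue) / CostPerRps(executionSeconds, memoryGb);
}
```

ConsumptionCost.CrossoverRps(145.93, 0.2, 0.25) evaluates to about 58.8, matching the figure above, and MonthlyCost(10, 0.2, 0.25) comes to about 19.32 for the 10 RPS example.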

AKS gets one sentence: if you need Kubernetes primitives, you are no longer comparing against Functions, and the comparison belongs in a different article.

The hidden bill in App Insights

App Insights bills through Log Analytics (App Insights billing) at $2.76 per GB ingested above the 5 GB free grant per workspace. For most apps the default adaptive sampling at 5 events/second per host keeps ingestion under the grant (Sampling in Application Insights). The failure mode is operational: an engineer disables sampling to chase a bug, forgets to re-enable, and the next month's bill arrives. At 100 RPS unsampled with 10 KB telemetry events, ingestion is 86 GB/day, which is 2.6 TB/month, which is roughly $7,180/month at the PAYG rate.

The mitigation is a daily cap, set per workspace. The portal default for a workspace-based App Insights resource is 100 GB/day, but resources created via Visual Studio default to 32.3 MB/day (Daily cap). Whichever number you pick, set it before someone disables sampling.

The other hidden bill: the storage account

Every Function App requires a general-purpose storage account (Storage considerations). With one or two apps, storage is rounding error. With thirty apps after a W19-style split, it becomes a line item:

Queue triggers: every poll is a Class 2 op, roughly one per second when idle. Idle alone is 86,400 polls/day * 30 / 10,000 * $0.004 ≈ $1/month per queue. Real processing adds put + get + delete = 3 ops per message.
Durable task hubs: orchestration and history tables grow into millions of rows with replays, plus control queues, work-item queues, and instance tables. A busy task hub easily reaches $20-50/month before the orchestrator's compute cost.
Internal runtime traffic: lease blobs for the scale controller, host locks. Negligible per app, multiplied by N apps it shows up.

Microsoft's own guidance is to put each app on its own storage account, especially for Durable Functions and Event Hubs triggers. That doubles or triples your storage line item, and it is still the right call.

People-time

The cost crossover argument rarely decides the migration. People-time almost always does.

Debugging across N function apps requires correlated query plumbing in App Insights plus a mental model of which app owns which trigger.
Each function app is its own deployment unit, so coordinated releases need a release pipeline that understands ordering and rollback.
An on-call alert that fires "queue X is backing up" requires the operator to know which app owns queue X, which trigger, and which version is deployed.
Configuration drift compounds with app count, as the previous section already showed.

Cognitive load scales super-linearly with app count. It shows up as slower MTTR, not as a Functions bill line item, and that is exactly why it stays invisible until the team is exhausted.

Making the decision: stay, refactor, or migrate

Most "outgrowing Functions" complaints are organisational, not technical. Apply the W19 split-apps approach first, then revisit. The four concrete signals that say you can stay: no timeout pressure (longest job under half the plan limit), no memory pressure (peak under 60% of the instance ceiling), SNAT and connection issues fixed by IHttpClientFactory, function-to-feature ratio under 3:1. If all four hold, the platform is not the problem.

Refactor within Functions

This is W19 territory: split into multiple apps, extract a shared library, isolate triggers by scaling profile. If you are on Consumption, the next step before Premium is Flex Consumption (per-function scaling, longer timeouts, larger SKUs). The caveat from the cost section bears repeating: Flex's 1-second billing floor punishes sub-second functions. Verify on real workload before committing.

For workloads that bump the timeout but can be split, the checkpoint pattern from the Performance walls section keeps you on Functions: chunk, process, commit cursor, repeat. That keeps the trigger ergonomics, the deployment unit, and the scale controller. The cost is one cursor blob per active batch.

Extract specific functions

The middle ground. The function app stays, and one or two functions move out, usually because they hit a wall the rest of the app does not.

Long-running jobs. Hosted service in App Service, or a Container Apps job. Same business logic, no per-invocation cap. The companion sample shows the same payment-settlement workload as a BackgroundService next to its Function App original. The diff is the hosting wrapper, not the algorithm:

public sealed class SettlementWorker(
    QueueClient queueClient,
    IPaymentSettler settler,
    IOptions<QueueOptions> queueOptions,
    SettlementWorkerStatus status,
    ILogger<SettlementWorker> logger) : BackgroundService
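{
    // The loop below reads its settings from _options, so unwrap the injected IOptions once.
    private readonly QueueOptions _options = queueOptions.Value;
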
    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        await queueClient.CreateIfNotExistsAsync(cancellationToken: stoppingToken);

        while (!stoppingToken.IsCancellationRequested)
        {
            var response = await queueClient.ReceiveMessagesAsync(
                maxMessages: _options.MaxBatchMessages,
                visibilityTimeout: TimeSpan.FromSeconds(_options.VisibilityTimeoutSeconds),
                cancellationToken: stoppingToken);

            if (response.Value.Length == 0)
            {
                await Task.Delay(_options.IdlePollingDelayMs, stoppingToken);
                continue;
            }

            foreach (var message in response.Value)
            {
                await ProcessAsync(message, stoppingToken);
            }
        }
    }
}

Same IPaymentSettler from the shared library, same queue, no per-invocation timeout. The trade is paying for an always-on worker (App Service) or accepting cold-start on first replica scale-out (Container Apps with KEDA).

Stateful workflows hitting Durable limits. Cross-app sub-orchestration is impossible, and shared task hub contention is real. Logic Apps Standard or a hosted workflow engine (Temporal, Elsa, Conductor) trades the binding ergonomics for a workflow surface that scales the way Durable does not.
CPU- or memory-bound work above EP3. Container Apps with the Dedicated workload profile, or AKS if Kubernetes is already a platform decision in your org.

Full migration

Rare. Criteria: timeout + memory + connection + cost crossover all present, all blocking, and the W19 organisational fixes have been applied without relief. The migration path is contract-first: extract HTTP triggers into thin wrappers, push business logic into testable libraries (a *.Core project consumed by both the Function App and the destination host), then move binding by binding.

The companion sample lays this out concretely. Settlement.Core is consumed unchanged by Settlement.FunctionApp (the timeout-wall starting point), Settlement.AppService (always-on, adds HTTP endpoints), and Settlement.ContainerApp (KEDA-scaled, scale-to-zero, no web host). The diffs are pure hosting concerns. The algorithm and the contract are identical. Reading the three side by side makes the migration question stop being abstract: you are pricing one wrapper against another, not rewriting the workload.

Decision matrix

The cost crossover at the bottom of the table is the one most teams reach for first. The decision rarely turns on it. It turns on people-time and on whether the abstraction is fighting your design or supporting it. When the abstraction is supporting the design, the right answer is almost always "stay and clean up the wiring."

Wrap-up

The four signals from the opening map to the four sections you just read. Timeouts, memory, sockets, and file system are the platform telling you the workload is too big. Sub-orchestration sprawl, session-lock contention, and high function-to-feature ratios are your team telling you the deployment unit is too big. Sequential chains, shared schemas, and poison loops are the topology making both worse. The cost crossover decides whether the right move is a different plan, a different host, or a different problem statement. None of the signals on its own says "migrate." Two of them at once says "look harder." Three of them blocking says "the abstraction is no longer paying its freight."

When you last looked at outgrowing Functions, did you stay and refactor, or did you extract the workload to a different host?

Azure Functions Beyond the Basics
Continues from Azure Functions for .NET Developers (Parts 1-9)

Part 1: Running Azure Functions in Docker: Why and How

Part 2: Docker Pitfalls I Hit (And How to Avoid Them)

Part 3: Scaling Azure Functions: Consumption vs Premium vs Dedicated

Part 4: Structuring Complex Function Apps: Project Organization

Part 5: When Azure Functions Fight Back: Signs You've Outgrown Them (this article)

Structuring Complex Function Apps: Project Organization

Martin Oehlert — Fri, 08 May 2026 05:52:21 +0000

Your project is past 15 functions. The next one needs different host.json concurrency than the rest, and a connection string nobody else in the app should see. Do you split into a second Function App, or change the values and live with the cross-talk? The answer turns on four constraints, none of which is cold start, even though cold start is the reason most teams give first.

When one Function App is too many

Microsoft puts the soft cap at 100 event-based triggers per app. Past that, the scale controller silently stops looking: "When your app has more than 100 event-based triggers, scale decisions are made based on only the first 100 triggers that execute."

Long before you hit 100, four constraints start to bite:

Scale. Every trigger in the app shares one scaling decision (with one Flex Consumption exception below).
Deploy cadence. One PR redeploys every function in the project.
Blast radius. Every function reads every connection string in app settings.
host.json scope. One concurrency, timeout, and retry setting for the whole app.

The official guidance punts past that: "It's hard to say how many functions should be in a single app, which depends on your particular workload." That's the honest answer, but the four constraints above are what you're actually weighing when you get to "particular workload."

Monolith vs multiple Function Apps

Each of the four constraints has a doc-citable behaviour behind it.

Scale. On Consumption and Premium, the app is the scale unit: "all functions within a function app that share resources in an instance are scaled at the same time." One spike on a Service Bus trigger pulls every HTTP endpoint along for the ride, even if those endpoints are idle.

Flex Consumption is the exception. It scales per trigger group: HTTP triggers share one set of instances, Service Bus / Event Hubs / Storage share another, Durable Functions share another. On Flex, splitting an app with one Service Bus trigger and one HTTP trigger gets you almost nothing the platform doesn't already give you. On Consumption or Premium, splitting is the only way to get that behaviour at all.

The scale-out clock matters too: "For HTTP triggers, new instances are allocated, at most, once per second. For non-HTTP triggers, new instances are allocated, at most, once every 30 seconds." A monolith cannot speed that up. Splitting can give a latency-sensitive HTTP path its own scale clock independent of a slow-scaling queue trigger sitting next to it.

Deploy cadence. "All functions in your local project are deployed together as a set of files." That's fine when one team owns one repo, ships everything together, and a bad deploy on Function A doesn't block fixing Function B. The day any of those stops being true, the monolith is in your way. Independent slot swaps, canary releases, and rollback of a single function all require that function to live in its own app.

Blast radius. Every function in the app reads every connection string and every Key Vault reference in app settings. Microsoft writes this as a security practice, not a sizing one: "Connection strings and other credentials stored in application settings gives all of the functions in the function app the same set of permissions in the associated resource. Consider minimizing the number of functions with access to specific credentials by moving functions that don't use those credentials to a separate function app." A single high-privilege connection string contaminates the whole app. The mitigation is a separate Function App with its own managed identity.

host.json scope. Settings in host.json apply to every function in the app within an instance. The worked example: "if you had a function app with two HTTP functions and maxConcurrentRequests set to 25, a request to either HTTP trigger would count towards the shared 25 concurrent requests." When two triggers need different concurrency budgets, you pick the looser one and accept the cross-talk, or you split the app. There is no third option.

Cold start: how much does function count actually cost?

The docs warn that "having too many functions within a function app can lead to slower startup of your app on new instances," but Microsoft publishes no numbers, so the warning sits as a vibe. I wanted to see when it starts to bite.

Three .NET 10 isolated worker apps, identical except for function count: 5, 15, and 30 minimal HTTP endpoints. No DI, no shared state, no external dependencies. For each app I spawn func start from a clean build, poll the first endpoint until it returns a 200, record wall-clock time, then kill and repeat ten times. Median across the ten runs:

Functions	Median	p10	p90	Δ vs 5-fn baseline
5	1528 ms	1481 ms	1673 ms	+0 ms
15	1552 ms	1536 ms	1728 ms	+24 ms
30	1548 ms	1542 ms	1561 ms	+20 ms

The delta is noise. Going from 5 functions to 30 cost 20 ms median on my machine, well inside the ~200 ms run-to-run variance on the same project. At this scale, function count is not where your cold-start budget goes.

That changes the case for splitting. If you're splitting a 30-function app because of cold start, the data isn't with you. The reasons that hold up are different: scaling (one trigger spike pulls everyone with it on Consumption and Premium), deployment cadence (one PR redeploys all of it), blast radius (every function in the app can read every connection string and every Key Vault reference in app settings), and host.json scope (one concurrency / timeout setting for the lot). Cold start is the argument that sounds intuitive and turns out not to land.

The absolute ~1.5 s number above includes Core Tools overhead, .NET runtime startup, and host metadata loading. Don't extrapolate it to Azure platform cold start. That's a different constant on top. The delta column is what scales with function count, and on this machine it's noise.

The methodology, the three apps, and a script you can run to reproduce or extend the measurement (more iterations, more functions, your machine, your runtime) live in ColdStartBenchmark/ in the companion repo.
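For orientation, the measurement loop described above has roughly the following shape. This is a minimal sketch, not the script from the companion repo: the class name, the polling interval, the time limit, and the nearest-rank percentile helper are my assumptions, and it presumes func is an executable on your PATH. It also skips the clean build that precedes each run.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public static class ColdStartProbe
{
    // Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
    public static double Percentile(double[] samples, int p)
    {
        if (samples.Length == 0) throw new ArgumentException("No samples.", nameof(samples));
        double[] sorted = samples.OrderBy(x => x).ToArray();
        int rank = (int)Math.Ceiling(p * sorted.Length / 100.0); // 1-based rank
        return sorted[Math.Clamp(rank, 1, sorted.Length) - 1];
    }

    // Spawns `func start` in projectDir, polls url until the first 200,
    // and returns the elapsed wall-clock time in milliseconds.
    public static async Task<double> MeasureOnceAsync(string projectDir, string url, TimeSpan limit)
    {
        using var http = new HttpClient { Timeout = TimeSpan.FromSeconds(2) };
        var startInfo = new ProcessStartInfo("func", "start")
        {
            WorkingDirectory = projectDir,
            UseShellExecute = false,
        };

        var clock = Stopwatch.StartNew();
        using Process host = Process.Start(startInfo)
            ?? throw new InvalidOperationException("Could not start func.");
        try
        {
            while (clock.Elapsed < limit)
            {
                try
                {
                    using HttpResponseMessage response = await http.GetAsync(url);
                    if (response.StatusCode == HttpStatusCode.OK)
                    {
                        return clock.Elapsed.TotalMilliseconds;
                    }
                }
                catch (HttpRequestException) { }  // host is not listening yet
                catch (TaskCanceledException) { } // a single poll timed out
                await Task.Delay(25);
            }
            throw new TimeoutException($"No 200 from {url} within {limit}.");
        }
        finally
        {
            host.Kill(entireProcessTree: true); // stop the host before the next run
            host.WaitForExit();
        }
    }
}
```

Call MeasureOnceAsync ten times per app, collect the results into an array, and feed it to Percentile with 10, 50, and 90. With ten samples, nearest-rank returns the fifth-smallest value as the median, which is a simplification of the usual even-count average.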

Sharing code without copy-paste

Two Function Apps in one solution want the same Order record, the same OrderValidator, the same IOrderStore abstraction. Project reference is the default. You reach for an internal NuGet package only when project reference stops being enough.

<!-- OrderProcessor.Core/OrderProcessor.Core.csproj -->
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <RootNamespace>OrderProcessor.Core</RootNamespace>
  </PropertyGroup>
  <ItemGroup>
    <PackageReference Include="Microsoft.Extensions.DependencyInjection.Abstractions" />
    <PackageReference Include="Microsoft.Extensions.Logging.Abstractions" />
    <PackageReference Include="Microsoft.Extensions.Options" />
  </ItemGroup>
</Project>

Two things are missing on purpose:

No TargetFramework because the solution sets it centrally via Directory.Build.props. Whichever TFM the Function Apps use, this library matches. Microsoft's library guidance says: "DO start with including a net8.0 target or later for new libraries." If every consumer is .NET 10 isolated worker, target net10.0 and skip the netstandard2.0 ceremony.
No Microsoft.Azure.Functions.Worker.* packages. The library has zero dependency on the Functions SDK. An ASP.NET Core API or a console app could consume it without dragging in the worker host.

The second rule is the one that bites. The moment you put a [QueueTrigger] attribute or a [ServiceBusOutput] binding on a class in the shared library, you've forced Microsoft.Azure.Functions.Worker.Extensions.Storage.Queues (or the Service Bus equivalent) onto every consumer. A non-Functions consumer can no longer use the library without dragging in the worker SDK.

Trigger and binding attributes belong with the function class, in the Function App project. Models, validators, business services, and DI extensions belong in the shared library. The Functions/ folder in each app holds the trigger code. The shared library holds everything else.

<!-- OrderProcessor.Http/OrderProcessor.Http.csproj -->
<ItemGroup>
  <FrameworkReference Include="Microsoft.AspNetCore.App" />
  <PackageReference Include="Microsoft.Azure.Functions.Worker" />
  <PackageReference Include="Microsoft.Azure.Functions.Worker.Sdk" />
  <PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.Http.AspNetCore" />
</ItemGroup>
<ItemGroup>
  <ProjectReference Include="..\OrderProcessor.Core\OrderProcessor.Core.csproj" />
</ItemGroup>

The HTTP app pulls in Http.AspNetCore. The Queue app pulls in Storage.Queues instead. Both reference Core for shared code.
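To make "shared code" concrete, here is a minimal sketch of what can live in Core without touching the Functions SDK. The Order shape mirrors the constructor call the HTTP function uses later in this article; the property types, the ValidationResult type, the extra OrderStatus member, and the validation rules are assumptions for illustration, not the companion repo's code.

```csharp
using System;

// Would live in OrderProcessor.Core/Models/ (hypothetical shapes).
public enum OrderStatus { Pending, Completed }

public sealed record Order(string OrderId, string CustomerId, decimal Amount, OrderStatus Status);

// Hypothetical result type exposing the IsValid / Error members the trigger code reads.
public readonly record struct ValidationResult(bool IsValid, string Error)
{
    public static ValidationResult Ok() => new(true, string.Empty);
    public static ValidationResult Fail(string error) => new(false, error);
}

// No trigger attributes and no Functions SDK types: any host can call this.
public sealed class OrderValidator
{
    public ValidationResult Validate(Order order)
    {
        if (string.IsNullOrWhiteSpace(order.OrderId))
        {
            return ValidationResult.Fail("OrderId is required.");
        }
        if (order.Amount <= 0m)
        {
            return ValidationResult.Fail("Amount must be positive.");
        }
        return ValidationResult.Ok();
    }
}
```

Everything here compiles against the base class library alone, which is the property the "no Worker packages" rule protects: a console app or an ASP.NET Core API can reference it as easily as a Function App.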

The static-client exception

The "no statics in shared libraries" rule has a documented exception: connection-bearing clients. Microsoft's connection management guidance is explicit: "Do create a single, static client that every function invocation can use. Consider creating a single, static client in a shared helper class if different functions use the same service."

The rule is more precise than "no statics": no static application state that assumes single-instance execution. HttpClient, BlobServiceClient, CosmosClient, ServiceBusClient are intended to be shared statics. Counters, caches, and "current user" fields are not.
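The distinction fits in a few lines. A minimal sketch, assuming a hypothetical SharedClients helper and an equally hypothetical counter to show the anti-pattern; in a DI-based app you would more likely register the client as a singleton, which gives the same one-per-process lifetime.

```csharp
using System;
using System.Net.Http;

// Hypothetical helper class; the name is an assumption for illustration.
public static class SharedClients
{
    // Connection-bearing client: created once per process and reused by every invocation.
    public static readonly HttpClient Http = new HttpClient
    {
        Timeout = TimeSpan.FromSeconds(30),
    };
}

public static class OrderCounters
{
    // Static application state: every scaled-out instance holds its own copy,
    // so this number never reflects the app-wide total. Avoid.
    public static int OrdersProcessed;
}
```

The client is safe to share because it carries connections, not business state. The counter is wrong the moment the app scales past one instance.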

When project reference stops scaling

Project references work until you have more than one solution. The moment a second repo wants OrderProcessor.Core, you either git-submodule it (don't), copy-paste it (also don't), or publish it as an internal NuGet package and ship versioned releases. Azure Artifacts gives you an org-scoped feed for that, with 2 GiB free. The cost is the version-drift problem you didn't have before: now Function App A can sit on Core 1.4 while Function App B is on Core 1.7.

The default for a single solution is project reference. Switch to internal NuGet when (a) two repos consume the library, and (b) you actually need to ship them on different cadences. Anything before that is process where you needed a project reference.

Dependency injection with Keyed Services

Two Function Apps want different storage backends. The HTTP app writes through SQL because the read model needs strong consistency. The Queue app reads from Cosmos for bulk reprocessing. Both consume IOrderStore. .NET 8 added Keyed Services so you don't have to invent a factory, a sentinel type, or a Func<string, IOrderStore> per backend.

namespace OrderProcessor.Core.Stores;

public interface IOrderStore
{
    Task<Order?> GetAsync(string orderId, CancellationToken cancellationToken);
    Task SaveAsync(Order order, CancellationToken cancellationToken);
}

public sealed class SqlOrderStore(ILogger<SqlOrderStore> logger) : IOrderStore
{
    // SQL implementation
}

public sealed class CosmosOrderStore(ILogger<CosmosOrderStore> logger) : IOrderStore
{
    // Cosmos implementation
}

Both implementations register against the same interface, distinguished by a key:

public static class OrderStoreKeys
{
    public const string Sql = "sql";
    public const string Cosmos = "cosmos";
}

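// OrderProcessor.Core/Services/ServiceCollectionExtensions.cs
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.DependencyInjection.Extensions; // TryAdd* and TryAddKeyed*
using OrderProcessor.Core.Stores;

namespace OrderProcessor.Core.Services;

public static class ServiceCollectionExtensions
{
    public static IServiceCollection AddOrderServices(this IServiceCollection services)
    {
        services.TryAddSingleton<OrderValidator>();

        // Keyed defaults: TryAdd* leaves any registration the consumer made earlier untouched.
        services.TryAddKeyedSingleton<IOrderStore, SqlOrderStore>(OrderStoreKeys.Sql);
        services.TryAddKeyedSingleton<IOrderStore, CosmosOrderStore>(OrderStoreKeys.Cosmos);

        return services;
    }
}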

Three details in the snippet are load-bearing:

The keys are const string fields on a static class, not bare string literals at the call site. Stringly-typed keys are the failure mode: a typo in [FromKeyedServices("sqll")] throws InvalidOperationException at resolve time, not compile time. A typo in OrderStoreKeys.Sql doesn't compile. The service key parameter is typed object? and accepts anything that implements Equals correctly, so enums or typed records work too (records only through GetRequiredKeyedService and friends, because attribute arguments must be compile-time constants). const string constants are the cheapest mitigation.
TryAddKeyed* rather than AddKeyed*. Every AddKeyed* call adds a new descriptor, and GetKeyedService<T> returns the last registration, silently shadowing earlier ones. Library code that registers default keyed implementations should use TryAddKeyed* so a consumer can override without ending up with two (IOrderStore, "sql") registrations and the silent second-wins behaviour.
App-specific wiring stays in Program.cs. AddOrderServices is shared across both apps. If OrderProcessor.Http needs an HttpClient and OrderProcessor.Queue needs a ServiceBusClient, those go in their respective Program.cs files, not in the shared library.
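The first two details are easy to verify in a console app that references Microsoft.Extensions.DependencyInjection. A minimal sketch: the StoreKey enum, IStore, and the two store classes are hypothetical stand-ins, not types from the companion repo.

```csharp
using System;
using System.Linq;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.DependencyInjection.Extensions;

var services = new ServiceCollection();

// Library default registered with TryAddKeyed*.
services.TryAddKeyedSingleton<IStore, DefaultStore>(StoreKey.Sql);
// A second TryAddKeyed* for the same (type, key) pair is a no-op.
services.TryAddKeyedSingleton<IStore, OtherStore>(StoreKey.Sql);

// AddKeyed* always appends a descriptor, even when the pair already exists.
services.AddKeyedSingleton<IStore, DefaultStore>(StoreKey.Cosmos);
services.AddKeyedSingleton<IStore, OtherStore>(StoreKey.Cosmos);

using ServiceProvider provider = services.BuildServiceProvider();

// The first TryAddKeyed* registration survives.
Console.WriteLine(provider.GetRequiredKeyedService<IStore>(StoreKey.Sql).GetType().Name);    // DefaultStore
// The last AddKeyed* registration shadows the earlier one.
Console.WriteLine(provider.GetRequiredKeyedService<IStore>(StoreKey.Cosmos).GetType().Name); // OtherStore
// Both descriptors are still in the container.
Console.WriteLine(provider.GetKeyedServices<IStore>(StoreKey.Cosmos).Count());               // 2

// Enum keys work because the key parameter is object? and enum equality is well defined.
public enum StoreKey { Sql, Cosmos }
public interface IStore { }
public sealed class DefaultStore : IStore { }
public sealed class OtherStore : IStore { }
```

The shadowed descriptor never goes away. It just stops being what a single-service resolve returns, which is why the duplicate is easy to miss.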

Program.cs pulls it together:

// OrderProcessor.Http/Program.cs
using Microsoft.Azure.Functions.Worker.Builder;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using OrderProcessor.Core.Configuration;
using OrderProcessor.Core.Services;

var builder = FunctionsApplication.CreateBuilder(args);

builder.ConfigureFunctionsWebApplication();

builder.Services.Configure<OrderProcessingOptions>(
    builder.Configuration.GetSection("OrderProcessing"));

builder.Services.AddOrderServices();

builder.Build().Run();

FunctionsApplication.CreateBuilder is the recommended builder for new isolated worker projects. It loads appsettings.json automatically, where new HostBuilder() does not. The Configure<OrderProcessingOptions> line binds an options class for any per-function tunables (retry counts, batch sizes, timeouts) that you'd rather change in app settings than in code. The function injects IOptions<OrderProcessingOptions> and reads .Value. Wednesday's tip walks through IOptions<T> vs IOptionsSnapshot<T> for the cases where you need values to refresh at runtime.
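A minimal sketch of the consuming side, assuming a hypothetical shape for OrderProcessingOptions (the property names and defaults below are mine, not the companion repo's) and a hypothetical OrderBatchPlanner class standing in for whatever reads the options:

```csharp
using System;
using Microsoft.Extensions.Options;

// Hypothetical options shape: property names are assumptions for illustration.
public sealed class OrderProcessingOptions
{
    public int MaxRetries { get; set; } = 3;
    public int BatchSize { get; set; } = 16;
}

// Any class resolved from DI can take IOptions<T>; a function class works the same way.
public sealed class OrderBatchPlanner(IOptions<OrderProcessingOptions> options)
{
    private readonly OrderProcessingOptions _options = options.Value;

    // Number of batches needed for the pending orders, rounding up.
    public int BatchCount(int pendingOrders)
    {
        if (_options.BatchSize <= 0)
        {
            throw new InvalidOperationException("BatchSize must be positive.");
        }
        return (pendingOrders + _options.BatchSize - 1) / _options.BatchSize;
    }
}
```

In a test, Options.Create(new OrderProcessingOptions { BatchSize = 10 }) builds the IOptions<T> wrapper without a host. In Azure, an app setting named OrderProcessing__BatchSize would bind to the same property, because the double underscore maps to the section separator.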

Resolving the right key

The function class takes its IOrderStore directly on the primary constructor parameter:

// OrderProcessor.Http/Functions/CreateOrderFunction.cs
public sealed class CreateOrderFunction(
    ILogger<CreateOrderFunction> logger,
    OrderValidator validator,
    [FromKeyedServices(OrderStoreKeys.Sql)] IOrderStore primaryStore)
{
    [Function(nameof(CreateOrder))]
    public async Task<IActionResult> CreateOrder(
        [HttpTrigger(AuthorizationLevel.Function, "post", Route = "orders")] HttpRequest req,
        [FromBody] CreateOrderRequest request,
        CancellationToken cancellationToken)
    {
        var order = new Order(request.OrderId, request.CustomerId, request.Amount, OrderStatus.Pending);

        var validation = validator.Validate(order);
        if (!validation.IsValid)
        {
            return new BadRequestObjectResult(new { error = validation.Error });
        }

        await primaryStore.SaveAsync(order, cancellationToken);

        return new CreatedResult($"/api/orders/{order.OrderId}", order);
    }
}

[FromKeyedServices] is declared with AttributeUsage(AttributeTargets.Parameter), which is exactly where primary constructors put their parameters. No backing fields to assign, no field-name shuffle to make the key visible to the function method.

A second function in the same app can resolve a different key:

public sealed class GetOrderFunction(
    ILogger<GetOrderFunction> logger,
    [FromKeyedServices(OrderStoreKeys.Cosmos)] IOrderStore readStore)
{
    [Function(nameof(GetOrder))]
    public async Task<IActionResult> GetOrder(
        [HttpTrigger(AuthorizationLevel.Function, "get", Route = "orders/{orderId}")] HttpRequest req,
        string orderId,
        CancellationToken cancellationToken)
    {
        Order? order = await readStore.GetAsync(orderId, cancellationToken);
        return order is null ? new NotFoundResult() : new OkObjectResult(order);
    }
}

CreateOrderFunction writes through SQL. GetOrderFunction reads from Cosmos. Same Function App, same IOrderStore interface, two implementations resolved by key.

Two breaking changes worth knowing

.NET 9 changed missing-key behaviour. Before .NET 9, [FromKeyedServices("xyz")] would silently fall back to an unkeyed IService registration if no "xyz" key existed. As of .NET 9, it throws InvalidOperationException at resolve time. That's an improvement (typos no longer succeed quietly) and another reason to keep keys as compile-time constants.

.NET 10 changed KeyedService.AnyKey semantics. Two related changes: GetKeyedService(provider, type, KeyedService.AnyKey) now throws instead of resolving an arbitrary registration, and GetKeyedServices(provider, type, KeyedService.AnyKey) no longer returns services that were themselves registered against AnyKey. If you have library code using AnyKey for "give me whatever's registered", check it before the .NET 10 upgrade.

.NET 10 also added FromKeyedServicesAttribute.LookupMode so a keyed service's transitive dependencies can inherit the parent's key automatically. Useful for keyed graphs (a KeyedDataProcessor whose inner KeyedConnection should match the outer key), but skip it on the first pass. Explicit keys are easier to read.

The full working code lives in ProjectOrganizationDemo/ in the companion repo: shared Core library, two Function Apps, both keyed implementations, and a local.settings.json.example for each app.

Folder structure that holds up at 30 functions

Microsoft's official .NET isolated-worker sample groups files by trigger:

samples/FunctionApp/
├── FunctionApp.csproj
├── Program.cs
├── host.json
├── local.settings.json
├── HttpTriggerSimple/
├── HttpTriggerWithBlobInput/
├── HttpTriggerWithCancellation/
├── HttpTriggerWithDependencyInjection/
├── HttpTriggerWithMultipleOutputBindings/
└── QueueTrigger/

That's the actual tree of Azure/azure-functions-dotnet-worker/samples/FunctionApp. It's a per-trigger demo: one folder per function feature, no shared layers. It works for the sample because each folder is self-contained.

It stops working at fifteen functions. By then a Services/, a Models/, and a couple of cross-cutting concerns have accumulated, and the per-trigger folder layout has nowhere natural to put them. You either bolt Services/ on next to the trigger folders, or you switch to a layered convention.

The convention this article uses (and the one the companion sample uses) is:

OrderProcessor.Http/
├── OrderProcessor.Http.csproj
├── Program.cs
├── host.json
├── local.settings.json
├── Functions/                # trigger classes only
│   ├── CreateOrderFunction.cs
│   └── GetOrderFunction.cs
├── Models/                   # app-specific request/response types
│   └── CreateOrderRequest.cs
└── Infrastructure/           # middleware, options classes, app-specific clients

Microsoft does not document this layout for the C# isolated worker; it's a community convention. The closest official endorsement is the Node.js v4 model docs, which explicitly recommend a src/functions/ subfolder. The .NET isolated-worker docs are silent on subfolder shape.

Three rules I follow:

Functions/ holds trigger classes only. Every file in Functions/ has a [Function] attribute. If a class has no trigger, it doesn't go here.
App-specific code is local. Reusable code goes to Core. CreateOrderRequest is a DTO only the HTTP app sees, so it lives in OrderProcessor.Http/Models/. Order, the domain record both apps share, lives in OrderProcessor.Core/Models/.
Infrastructure/ is for cross-cutting wiring, not business logic. Middleware, options classes, custom logging filters, retry policy setup. The test is "if I deleted this folder, would my domain logic break?" If yes, it belongs in Services/ or Core. If no, Infrastructure/.

host.json and local.settings.json stay in the project root. The deployment payload is explicit that they sit next to the executable, peer to your code files. There is no clever way to move them.

When the project also runs in Aspire

If the Function App is part of a .NET Aspire orchestration, OpenTelemetry, health checks, and resilience defaults move to a separate *.ServiceDefaults project. The Aspire docs are explicit: "If your project is part of an Aspire orchestration, it uses OpenTelemetry for monitoring instead. Don't enable direct Application Insights integration within Aspire projects." When you go Aspire, the Infrastructure/ folder mostly empties into ServiceDefaults, and Program.cs becomes one extra builder.AddServiceDefaults() call.

When to split, when to keep together

The four constraints from the opening, plus one rule from the scalability guide, turn into a checklist. Split when one or more of these flip from "no" to "yes":

Independent scale needs. Two triggers in the same app share scaling on Consumption and Premium. If a queue trigger scales to 50 instances during a backlog burn-down, the HTTP endpoint comes along, even when nobody is calling it.
Independent deploy cadence. A team that wants to canary the orders API without redeploying the inventory worker. A risky change to one function that shouldn't block hotfixing another. Per-slot deployments per app.
Different host.json. Two HTTP endpoints, one needs maxConcurrentRequests: 25, the other 200. There is no way to set this per function inside one app.
Credential boundary. A function reads a high-privilege Cosmos connection string. Every other function in the app inherits the same read access. The mitigation is a separate app with its own managed identity.
Test code mixed with prod. The scalability guide says it directly: "If you're using a function app in production, don't add test-related functions and resources to it." Memory is shared inside the app. So is everything else.

Keep together when:

Shared state. A function that pre-warms a cache another function reads. They depend on co-location to be cheap.
Single team, single repo, related triggers. Small surface, no cross-team friction, the functions evolve together. The overhead of a second app pays for nothing.
Low traffic. The app handles three requests a minute. Splitting trades one infrastructure unit for two with no operational gain.

Two reasons that sound good and are on neither list:

Cold start. The benchmark above shows function count is not where your cold-start budget goes at 5-30 functions per app. If you're splitting for cold start, the data isn't with you.
"It feels too big." Aesthetic discomfort with a 20-function project is not a constraint. Pick a doc-citable reason from the list above, or accept the discomfort.

Wrap-up

The split-or-keep decision belongs to your scaling, deploy, credential, and host.json constraints. The size of the project is a symptom, not a reason. A 30-function monolith with one team, one deploy cadence, and one credential set is fine. Two functions with conflicting host.json settings are not.

Keyed Services on a primary constructor is the cleanest way to handle "two implementations of one interface" once you're inside a single app. A pure shared library (zero Microsoft.Azure.Functions.Worker.* references) is what keeps those abstractions reusable across multiple Function Apps without dragging the worker SDK into every consumer.

When you last split a Function App, what was the line that forced it: a host.json setting that needed two different values, or a credential that shouldn't be visible to every function in the project?

Scaling Azure Functions: Consumption vs Premium vs Dedicated

Martin Oehlert — Fri, 01 May 2026 03:45:15 +0000

Azure Functions Beyond the Basics
Continues from Azure Functions for .NET Developers (Parts 1-9)

Part 3: Scaling Azure Functions: Consumption vs Premium vs Dedicated (you are here)

Your Consumption plan function works fine in dev. Then production traffic arrives, the app scales to zero during a quiet period, and the next request takes 6.8 seconds. The question that follows is always the same: do you switch to Premium at $146/month, or is there something between free-with-cold-starts and always-warm-but-always-billing? Azure Functions has five hosting options now (Consumption, Flex Consumption, Premium, Dedicated, and Container Apps), each with a different billing model and a different answer to that question. This article covers the four App Service-based plans; Container Apps is a different deployment model aimed at containerized microservices. All code samples are in the companion repo.

Consumption: true serverless, true cold starts

Microsoft now labels the Consumption plan as "legacy" in its hosting docs and is directing new serverless workloads to Flex Consumption. But Consumption is still where most Functions apps start, and where many should stay. You deploy your code, the platform handles the rest. No servers to manage, no capacity to plan. You pay only when your functions execute.

How the scale controller works

The scale controller monitors event rates for each trigger type and decides how many instances to run. Since runtime v4.19.0, it uses target-based scaling by default. The formula is one division:

desired instances = event source length / target executions per instance

What "event source length" means depends on the trigger. For Storage Queues, it's queue length. For Service Bus, active message count. For Event Hubs, unprocessed events per partition. For Cosmos DB, pending changes in the change feed. The controller reads these signals and adjusts instance count accordingly.
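The division is simple enough to sketch. Two assumptions of mine that the formula above does not state: the result is rounded up to a whole instance, and it is capped at the plan's maximum instance count. The target value of 16 in the usage note is a hypothetical number, not a documented default.

```csharp
using System;

public static class TargetBasedScaling
{
    // desired instances = event source length / target executions per instance,
    // rounded up (assumption) and capped at the plan's instance limit (assumption).
    public static int DesiredInstances(long eventSourceLength, int targetExecutionsPerInstance, int maxInstances)
    {
        if (targetExecutionsPerInstance <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(targetExecutionsPerInstance));
        }
        if (eventSourceLength <= 0)
        {
            return 0; // nothing queued, nothing to scale for
        }

        long desired = (eventSourceLength + targetExecutionsPerInstance - 1) / targetExecutionsPerInstance;
        return (int)Math.Min(desired, maxInstances);
    }
}
```

With a queue length of 1,000 and a target of 16 executions per instance, DesiredInstances(1000, 16, 200) returns 63. A backlog of 100,000 messages with the same target asks for 6,250 instances and gets capped at 200.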

The controller adds up to four instances at a time. HTTP triggers get new instances at most once per second. Non-HTTP triggers scale at most once every 30 seconds. This is fast enough for gradual traffic ramps but won't help with sudden spikes from zero.

Instance limits and billing

Each Consumption instance gets 1.5 GB of memory and one CPU core. The maximum instance count is 200 on Windows and 100 on Linux (with a 500-instance-per-subscription-per-hour rate limit on Linux).

Billing has two components:

Executions: $0.20 per million, with 1,000,000 free per month
Execution time: $0.000016 per GB-second, with 400,000 GB-seconds free per month

Memory is rounded up to the nearest 128 MB bucket. Execution time is rounded up to the nearest millisecond, with a minimum billable unit of 128 MB x 100 ms. For a function that runs a few thousand times a day at under a second each, you'll stay well inside the free grant.
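To see where the free grant ends, the two meters can be folded into one estimate. A rough sketch using the rates quoted above; it works from an average duration and a single memory figure, which is a simplification of per-execution metering, and the class and method names are mine.

```csharp
using System;

public static class ConsumptionBilling
{
    private const decimal PricePerMillionExecutions = 0.20m;
    private const decimal PricePerGbSecond = 0.000016m;
    private const decimal FreeExecutions = 1_000_000m;
    private const decimal FreeGbSeconds = 400_000m;

    // Estimated monthly bill in USD for one subscription's worth of executions.
    public static decimal MonthlyCost(long executions, int averageDurationMs, int observedMemoryMb)
    {
        // Memory rounds up to the next 128 MB bucket; duration has a 100 ms floor.
        int billedMemoryMb = Math.Max(128, (observedMemoryMb + 127) / 128 * 128);
        int billedDurationMs = Math.Max(100, averageDurationMs);

        decimal gbSeconds = executions * (billedDurationMs / 1000m) * (billedMemoryMb / 1024m);
        decimal billableGbSeconds = Math.Max(0m, gbSeconds - FreeGbSeconds);
        decimal billableExecutions = Math.Max(0m, executions - FreeExecutions);

        return billableGbSeconds * PricePerGbSecond
             + billableExecutions / 1_000_000m * PricePerMillionExecutions;
    }
}
```

Three data points: 150,000 executions a month at 800 ms and 256 MB costs nothing. Three million executions at 500 ms and 200 MB (billed as 256 MB) stays inside the GB-second grant and costs $0.40 for the extra two million executions. Ten million executions at one second and 512 MB comes to $75.40.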

Cold start reality on .NET

After roughly 20 minutes of inactivity, the Consumption plan scales to zero. The next request waits for the platform to provision a fresh instance and start your application from scratch.

On .NET isolated worker, that cold start typically lands between 2 and 7 seconds. Heavy DI registrations push it past 10. The in-process model was faster, but Microsoft is retiring it in November 2026.

For timer triggers, queue processors, and other background work, a few seconds of cold start is invisible. For HTTP endpoints that a user is waiting on, it's a problem.

What Consumption can't do

The hard constraints that push teams to other plans:

No VNet integration. If your function needs to reach resources inside a virtual network, Consumption is off the table.
10-minute execution timeout. The default is 5 minutes, configurable to 10. Long-running orchestrations or batch jobs need a different plan.
No per-function scaling. All functions in the app scale together. A chatty timer trigger can cause the platform to allocate instances that your HTTP trigger didn't need.
600 active outbound connections per instance. Hit this with parallel HTTP calls to external APIs and requests start failing.

Linux Consumption is retiring September 30, 2028. Microsoft is directing all new Linux serverless workloads to Flex Consumption. If you're starting a new project on Linux, skip Consumption entirely.

Flex Consumption: the middle ground

Flex Consumption is the plan Microsoft now recommends for new serverless workloads. It addresses the two biggest Consumption limitations: no VNet support and no way to reduce cold starts without jumping to a $146/month Premium plan.

The plan scales to zero like Consumption, but adds always-ready instances that you can configure to stay warm. It supports VNet integration out of the box. And it scales to 1,000 instances instead of Consumption's 200.

Always-ready instances vs on-demand

By default, Flex Consumption behaves like regular Consumption: zero instances when idle, on-demand instances when events arrive. The difference is you can configure always-ready instances that stay running regardless of traffic.

Always-ready instances are assigned to scale groups:

http: all HTTP and SignalR triggers
durable: orchestration, activity, and entity triggers
blob: Event Grid-based blob triggers
function:<FUNCTION_NAME>: a specific function

Setting always-ready to 2 for the http group keeps two instances permanently running for HTTP functions. Those handle traffic first. If demand exceeds their capacity, the platform adds on-demand instances on top.

az functionapp scale config always-ready set \
  --resource-group my-rg \
  --name my-func-app \
  --settings http=2

On-demand instances scale to zero when idle. Always-ready instances are billed continuously whether they're executing functions or not. If you enable zone redundancy, the minimum is 2 always-ready instances per group.

Billing: per-second, not per-execution

Flex Consumption bills differently from Consumption. Instead of per-execution pricing with sampled memory, you choose a fixed instance size upfront and pay per GB-second of active execution time:

On-demand rates are $0.000026 per GB-second and $0.40 per million executions. The monthly free grant is smaller than Consumption: 250,000 executions and 100,000 GB-seconds (compared to Consumption's 1,000,000 and 400,000).

Always-ready instances have a separate billing structure with no free grant. The baseline (idle) rate is $0.000004 per GB-second, roughly 6.5x cheaper than the on-demand execution rate. When always-ready instances are actively executing, the execution time rate is $0.000016 per GB-second (the same as Consumption's rate, and cheaper than on-demand).

The minimum billable execution is 1,000 ms (1 second). After that, billing rounds up to the nearest 100 ms. This is less granular than Consumption's per-millisecond rounding, so very fast functions (under 100 ms) cost relatively more on Flex.

Each instance also gets an extra 272 MB platform buffer that isn't billed. This is memory the Functions host and worker process use, not your function code.
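The same kind of estimate works for Flex. A rough sketch using only the rates quoted above: it takes total active instance-seconds as an input instead of modelling the 1,000 ms minimum and 100 ms rounding, it ignores the execution-time rate for always-ready instances, and the names are mine.

```csharp
using System;

public static class FlexBilling
{
    private const decimal OnDemandPerGbSecond = 0.000026m;
    private const decimal OnDemandPerMillionExecutions = 0.40m;
    private const decimal FreeExecutions = 250_000m;
    private const decimal FreeGbSeconds = 100_000m;
    private const decimal AlwaysReadyBaselinePerGbSecond = 0.000004m;

    // On-demand: the instance size is fixed up front, so memory is not sampled.
    public static decimal OnDemandMonthlyCost(long executions, decimal activeSeconds, int instanceMemoryMb)
    {
        decimal gbSeconds = activeSeconds * (instanceMemoryMb / 1024m);
        decimal billableGbSeconds = Math.Max(0m, gbSeconds - FreeGbSeconds);
        decimal billableExecutions = Math.Max(0m, executions - FreeExecutions);

        return billableGbSeconds * OnDemandPerGbSecond
             + billableExecutions / 1_000_000m * OnDemandPerMillionExecutions;
    }

    // Always-ready baseline: billed for every second of the month, no free grant.
    public static decimal AlwaysReadyBaselineMonthlyCost(int instances, int instanceMemoryMb, int daysInMonth)
    {
        decimal secondsInMonth = daysInMonth * 24m * 60m * 60m;
        return instances * (instanceMemoryMb / 1024m) * secondsInMonth * AlwaysReadyBaselinePerGbSecond;
    }
}
```

Two always-ready instances at 2,048 MB over a 30-day month cost about $41.47 before they execute anything. One million on-demand executions that keep 2,048 MB instances active for 500,000 seconds in total come to $23.70.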

Scale behavior

Flex Consumption scales per function by trigger type. HTTP and SignalR triggers scale together. Durable Functions triggers scale together. Blob triggers (Event Grid source) scale together. Everything else scales independently per function. This fixes a real problem from Consumption, where a noisy timer trigger could cause unnecessary instance allocation for your HTTP functions.

Maximum instances: 1,000 (default limit is 100, configurable via CLI). All Flex Consumption apps in a subscription and region share a regional quota of 250 cores by default. The formula: instances x cores per instance (0.25 for 512 MB, 1 for 2,048 MB, 2 for 4,096 MB). One app running 1,000 instances at 512 MB consumes the entire quota (1,000 x 0.25 = 250 cores). You can request an increase through Azure support, but plan for this limit when running multiple Flex apps in the same region.
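The quota arithmetic is worth running before you add a second or third Flex app to a region. A small sketch using the core weights listed above; the class and method names are mine.

```csharp
using System;

public static class FlexQuota
{
    // Core weight per instance size, as listed above.
    public static decimal CoresPerInstance(int instanceMemoryMb) => instanceMemoryMb switch
    {
        512 => 0.25m,
        2048 => 1m,
        4096 => 2m,
        _ => throw new ArgumentOutOfRangeException(
            nameof(instanceMemoryMb), instanceMemoryMb, "Expected 512, 2048, or 4096."),
    };

    // Cores one app consumes at a given instance count.
    public static decimal CoresUsed(int instances, int instanceMemoryMb) =>
        instances * CoresPerInstance(instanceMemoryMb);

    // Largest instance count a single app could reach inside the regional quota.
    public static int MaxInstancesWithinQuota(int instanceMemoryMb, decimal quotaCores = 250m) =>
        (int)Math.Floor(quotaCores / CoresPerInstance(instanceMemoryMb));
}
```

At 2,048 MB the default quota allows 250 instances across the whole region, and at 4,096 MB it allows 125. Three apps that each scale to 100 instances at 2,048 MB need 300 cores, which is over the default.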

The constraints to know about

Flex Consumption comes with real limitations:

One app per plan. Consumption and Premium let you put up to 100 function apps on one plan. Flex is one-to-one.
No deployment slots. Rolling updates are in public preview as an alternative (zero-downtime deployments without slot swaps), but if your deployment strategy depends on slot swaps, this is a blocker today.
Linux only. No Windows support.
Isolated worker only. The C# in-process model is not supported.
App init timeout: 30 seconds. If your startup code takes longer, the instance fails to initialize. This is not configurable.
Blob trigger uses Event Grid only. The polling-based blob trigger is not available.

Flex Consumption also supports Azure Files storage mounts, letting you mount SMB shares as local directories. This is useful for large binaries, ML models, or shared reference data that you don't want to package in your deployment.

The Linux-only constraint is less of an issue than it sounds. Linux is where .NET Functions performance is best, and the in-process model (the main reason teams stayed on Windows) is being retired anyway.

VNet integration offers the same capabilities as Premium: support for private endpoints on storage accounts, Key Vault references over VNet, and native virtual network triggers for non-HTTP event sources. The one difference is subnet delegation: Flex Consumption delegates the subnet to Microsoft.App/environments, where Premium uses Microsoft.Web/serverFarms.

Premium: warm instances, guaranteed

Premium (Elastic Premium) is the plan teams reach for when cold starts become unacceptable. It keeps at least one instance running at all times, so your functions never start from zero. That guarantee comes with a price floor: even with zero traffic, you're billed for that minimum instance.

What you get for $146/month

Billing is per-second based on vCPU-seconds and GB-seconds allocated across instances. No per-execution charge. The EP1 cost breaks down to ~$116.80/vCPU/month + ~$8.32/GB/month at pay-as-you-go rates in US regions. Savings plans (1-year or 3-year commitments) offer roughly 17% off.
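
The $146 headline figure can be reproduced from those two rates. A rough sketch, assuming an EP1 instance is 1 vCPU with 3.5 GB of memory (the instance size is not stated above, so treat it as an assumption) and that exactly one instance runs for the whole month; ep_monthly_cost is a hypothetical helper name:

```shell
#!/bin/sh
# ep_monthly_cost VCPUS MEMORY_GB
# Monthly cost of one always-on instance at the approximate
# pay-as-you-go rates quoted in the text.
ep_monthly_cost() {
    awk -v v="$1" -v g="$2" 'BEGIN { printf "%.2f\n", v * 116.80 + g * 8.32 }'
}

ep_monthly_cost 1 3.5    # prints 145.92
```

Every additional instance the plan scales out to is billed at the same rates for the time it runs, so the figure is a floor, not a cap.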

There is no free grant on Premium. From the moment your plan exists, the meter is running.

Pre-warmed instances and elastic scale

Premium uses two layers to eliminate cold starts:

Always-ready instances run continuously, regardless of load. You configure how many per app, up to 20. These are billed 24/7, executing or not. If you have multiple function apps on the same Premium plan, the plan's minimum instance count equals the highest always-ready count among all apps.

Prewarmed buffer instances sit behind the always-ready pool. The default is 1. When all active instances are handling traffic, the prewarmed instance swaps to active and the platform immediately provisions a new buffer instance to take its place. This means scale-out events get a warm instance instead of a cold one.

You can define a warmup trigger that runs during the prewarming window. This is where you force-initialize lazy dependencies, open database connections, and prime HTTP connection pools before the instance receives real traffic:

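using System;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;

public class Warmup
{
    private readonly HttpClient _httpClient;
    private readonly Lazy<ExpensiveAnalyticsClient> _analytics;

    public Warmup(HttpClient httpClient, Lazy<ExpensiveAnalyticsClient> analytics)
    {
        _httpClient = httpClient;
        _analytics = analytics;
    }

    [Function("Warmup")]
    public async Task Run([WarmupTrigger] object warmupContext)
    {
        // Force construction of the lazy dependency before real traffic arrives.
        _ = _analytics.Value;

        // Await the request so the connection is established before the
        // function returns, and dispose the response so the connection goes
        // back to the pool. The relative URI relies on the HttpClient having
        // a BaseAddress configured at registration.
        using var response = await _httpClient.GetAsync(
            "/health", HttpCompletionOption.ResponseHeadersRead);
    }
}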

The warmup trigger only fires during scale-out, not on restarts or deployments. It's available on Premium and Flex Consumption, not on the Consumption plan.

Elastic scale can burst up to 100 instances on Windows and 20-100 on Linux depending on region. Scaling beyond the minimum is best-effort: the platform allocates instances as fast as it can, but rapid spikes can outpace the prewarmed buffer. When that happens, you get cold starts even on Premium.

VNet and other features

VNet integration is supported but not automatic. You configure it at creation time or after, using regional VNet integration with a dedicated subnet (at least 100 available IPs). Private endpoints for inbound traffic are fully supported: you can create a private IP in your VNet and restrict all public access.

Non-HTTP triggers from VNet-secured resources (Service Bus with private endpoints, for example) require enabling Runtime Scale Monitoring. Without it, the scale controller can't read the event source metrics to decide when to scale.

Other features that set Premium apart:

Execution timeout: 30 minutes default, configurable to unbounded. Consumption caps at 10 minutes.
Deployment slots: 3 (including production). Consumption gets 2, Flex gets 0.
Apps per plan: up to 100 function apps on a single Premium plan, sharing the VM pool.
Custom Linux container images are supported.

When Premium is the wrong call

The most common mistake is jumping to Premium from Consumption solely because of cold starts, without evaluating the alternatives.

If VNet was your only reason, Flex Consumption now gives you VNet integration with scale-to-zero pricing. No need to pay $146/month for network access.

If your workload is sporadic (a few hundred invocations a day), the math doesn't work. That function costs pennies on Consumption. On Premium EP1, it costs $146/month regardless of usage. The cold start tax has to be genuinely painful to justify that gap.

And watch the SKU names. EP1 is Elastic Premium. P1v2 is a Dedicated App Service plan. They behave completely differently: EP1 scales dynamically based on event volume, P1v2 gives you a fixed VM that you scale manually. If your Terraform or Bicep has sku = "P1v2" and you expected autoscaling, check again.

Dedicated: fixed compute, fixed bill

The Dedicated plan runs your functions on a standard App Service plan. Same infrastructure, same pricing, same scaling model as a web app. Multiple function apps and web apps can share the same plan.

This is the plan you pick when you already have App Service infrastructure and want to add functions without creating a separate billing line item.

Pricing and compute

These are Windows pay-as-you-go prices for US East. Linux is cheaper (roughly 40-50% less for P-series tiers). P1v2 is a previous-generation SKU; Microsoft recommends P1v3 for new deployments.

Billing is hourly, prorated to the second, per scaled-out instance. Reserved instances (1-year or 3-year) can save up to 55% on Linux. The cost is fixed: you pay the same whether your functions execute zero times or a million times per day.

Scaling: you manage it

There is no event-driven scaling on Dedicated. The scale controller that powers Consumption and Premium does not apply here.

Your options:

Manual scale-out: set the instance count in the portal or via CLI
Rule-based autoscale (Standard tier and above): trigger scale-out based on CPU percentage, memory usage, or a schedule

Autoscale on App Service is slower than Premium's elastic scale. It reacts to sustained load patterns, not individual event bursts. App Service also has a newer "automatic scaling" feature for HTTP-based traffic, but it's not supported when Functions apps are in the plan.

Maximum instances: 10-30 per plan, or 100 in an App Service Environment (ASE).

Always On must be enabled in the App Service configuration. Without it, the Functions runtime goes idle after a period of inactivity. Unlike Consumption's scale-to-zero (which the platform manages), an idle Dedicated plan just means your functions silently stop processing. You're still billed for the compute.

When Dedicated fits

Dedicated makes sense in specific circumstances:

You already have an underutilized App Service plan. Adding functions to existing compute costs nothing extra. The plan is already paid for.
You run mixed workloads. A web app and a set of background processing functions on the same plan, sharing resources.
You need deployment slots. Up to 20, far more than Premium's 3.
Predictable billing matters more than efficiency. Some finance teams prefer a fixed monthly line item over variable serverless costs.

The downside is resource contention. If your web app and function app share an S1 instance and the web app spikes, your function throughput drops. There's no isolation within the plan.

Cold start mitigation: what to try first

If you're staying on Consumption or Flex Consumption, cold starts are part of the deal. The strategies below are ordered by impact, highest first. Not all of them apply to every plan.

1. ReadyToRun compilation

The single highest-impact change for .NET cold starts on Consumption and Flex Consumption. Two lines in your .csproj:

<PropertyGroup>
    <PublishReadyToRun>true</PublishReadyToRun>
    <RuntimeIdentifier>linux-x64</RuntimeIdentifier>
</PropertyGroup>

ReadyToRun pre-compiles your assemblies to native code. The JIT compiler still runs for hot paths at runtime, but the initial load skips the bulk of compilation overhead. In practice, this cuts cold start time roughly in half.

The trade-off: your deployment package grows 2-3x because the assemblies contain both the native precompiled code and the original IL. For a typical Functions app, that's still well under the 1 GB deployment limit.

2. Placeholder optimization for .NET isolated

The Functions platform can pre-provision a worker process before your app code loads. Enable it with an app setting:

WEBSITE_USE_PLACEHOLDER_DOTNETISOLATED=1

This requires .NET 6+, a 64-bit process, and the latest Azure Functions SDK versions. The placeholder worker starts the .NET runtime and gets the IPC channel ready while your code is still being loaded, shaving off part of the startup sequence.

Combine this with ReadyToRun for the best result on Consumption.

3. Trim your DI registrations

Every service you register in Program.cs adds to startup time. On a warm instance this is negligible. On a cold start, it compounds.

Register HTTP clients and SDK clients as singletons so they're constructed once and reused. Wrap expensive dependencies in Lazy<T> so they're only built when a function actually needs them:

builder.Services.AddSingleton(_ => new HttpClient(new SocketsHttpHandler
{
    PooledConnectionLifetime = TimeSpan.FromMinutes(2)
})
{
    BaseAddress = new Uri("https://api.example.com")
});

builder.Services.AddSingleton<Lazy<ExpensiveAnalyticsClient>>(sp =>
    new Lazy<ExpensiveAnalyticsClient>(() =>
    {
        var logger = sp.GetRequiredService<ILogger<ExpensiveAnalyticsClient>>();
        return new ExpensiveAnalyticsClient(logger);
    }));

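Registering one HttpClient as a singleton avoids socket exhaustion, because connections are pooled and reused instead of being opened per request. The PooledConnectionLifetime on SocketsHttpHandler then recycles pooled connections every two minutes, so DNS changes are picked up without disposing the HttpClient instance. IHttpClientFactory solves the same pair of problems; this approach avoids resolving a client from the factory on every call inside a singleton service.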

Fewer functions per app also helps. Each function adds discovery and registration overhead at startup.

4. Warmup trigger (Premium and Flex Consumption only)

On plans that support prewarmed instances, the warmup trigger lets you run initialization code before the instance takes real traffic. Force-construct your lazy dependencies, open database connections, and send a throwaway HTTP request to prime the connection pool. See the Premium section above for the code.

The warmup trigger only fires during scale-out. It does not fire on restarts, deployments, or slot swaps. One per app, and the function must be named Warmup (case-insensitive).

What works where

Not every strategy applies to every plan:

On Dedicated with Always On enabled, cold start is largely a non-issue because instances stay running. On Premium, the always-ready and prewarmed instances handle most of it. ReadyToRun and DI trimming matter most on the serverless plans where instances start from scratch.

Choosing a plan: the decision matrix

Which plan for which workload

Consumption if your traffic is sporadic, you don't need VNet access, and your users can tolerate a few seconds of cold start. Timer triggers, low-volume queue processors, webhook receivers that aren't latency-sensitive. If your bill on Consumption is under $10/month, there's no reason to move.

Flex Consumption if you need VNet integration or more than 200 instances, but still want scale-to-zero pricing. Evaluate this before jumping to Premium. The always-ready instances give you a dial between pure serverless and always-warm, and you pay only for what you configure. The constraints (one app per plan, no deployment slots, Linux only) are the deciding factors.

Premium EP1 if your HTTP endpoints are latency-sensitive and cold starts are genuinely costing you users or revenue. Also the right choice for functions that run continuously or need more than 10 minutes of execution time. If you're running multiple function apps, a shared Premium plan can amortize the $146/month minimum across them.

Dedicated if you already have an App Service plan with spare capacity, need more than 3 deployment slots, or your finance team requires a fixed monthly line item. Don't create a Dedicated plan specifically for Functions unless you have a concrete reason: the lack of event-driven scaling makes it the least "serverless" option.

The mistake to avoid

The most common path is: start on Consumption, hit cold start problems in production, jump straight to Premium at $146/month. Flex Consumption sits between them and didn't exist when many teams made that decision. If you're evaluating today, Flex Consumption with 1-2 always-ready instances gives you warm starts with scale-to-zero pricing for on-demand instances. Test it before committing to Premium's minimum.

Are you running Consumption or Premium in production right now?

Azure Functions Beyond the Basics
Continues from Azure Functions for .NET Developers (Parts 1-9)

Part 1: Running Azure Functions in Docker: Why and How

Part 2: Docker Pitfalls I Hit (And How to Avoid Them)

Part 3: Scaling Azure Functions: Consumption vs Premium vs Dedicated (this article)

Docker Pitfalls I Hit (And How to Avoid Them)

Martin Oehlert — Fri, 24 Apr 2026 05:39:32 +0000

Your Dockerfile builds, your container starts, and your triggers never fire. The Functions host logs "no functions found" or the container sits idle, processing nothing. The gap between a working image and a working function app is entirely configuration. The runtime needs specific environment variables, the build must publish to the exact path the host expects, and Azurite connections behave differently inside a container network than on localhost. Four walls, four fixes. All code samples are in the companion repo.

Pitfall 1: Environment Variables That Vanish

Your container starts, the Functions host initializes, and the logs show this:

[2026-04-20T08:12:03Z] No job functions found. Try making your job classes and methods public.
[2026-04-20T08:12:03Z] If you're using binding extensions (e.g. Azure Storage, ServiceBus, Timers, etc.)
[2026-04-20T08:12:03Z] make sure you've called the registration method for the extension(s)
[2026-04-20T08:12:03Z] in your startup code
[2026-04-20T08:12:03Z] 0 functions loaded

You check your code. The classes are public. The methods are public. The bindings are registered. Everything runs fine with func start on your machine.

The error is misleading. The real cause: FUNCTIONS_WORKER_RUNTIME is not set.

Why the file you trusted does not exist here

local.settings.json is a dev-time convenience. The Azure Functions Core Tools reads it when you run func start locally. Inside a container, that file is never loaded. The container runtime reads OS environment variables only, and if FUNCTIONS_WORKER_RUNTIME is missing, the host cannot determine which language worker to start. It discovers zero functions and prints an error that sends you looking at your code instead of your configuration.

AzureWebJobsStorage is the second variable that catches people. Without it, you get a different failure:

Value cannot be null. (Parameter 'connectionString')

Or worse, no error at all. HTTP triggers still work because they do not require storage. You test with an HTTP endpoint, everything responds, you deploy, and your queue triggers silently never fire. The host needs a storage connection to manage leases, checkpoints, and timer schedules for every non-HTTP trigger type.

If you set FUNCTIONS_WORKER_RUNTIME to dotnet instead of dotnet-isolated, the host raises AZFD0013: the configured runtime does not match the worker runtime metadata in your published artifacts. Another error that points away from the actual one-word fix.

The second trap: `.env` files that silently mangle values

Azure Storage connection strings are long. If your .env file wraps them across lines, Docker Compose silently truncates or corrupts the value:

# Broken: line-wrapped connection string
AzureWebJobsStorage=DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;
  AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;
  BlobEndpoint=http://azurite:10000/devstoreaccount1;
  QueueEndpoint=http://azurite:10001/devstoreaccount1;

# Working: entire value on one line
AzureWebJobsStorage=DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;BlobEndpoint=http://azurite:10000/devstoreaccount1;QueueEndpoint=http://azurite:10001/devstoreaccount1;

No warning, no parse error. The value just stops at the first newline.
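
Because nothing warns you, a small check in CI or a pre-commit hook can catch wrapped values early. A minimal sketch, assuming a plain KEY=VALUE file without export prefixes or quoted multi-line values; find_wrapped_env_lines is a hypothetical helper, not part of Docker Compose:

```shell
#!/bin/sh
# find_wrapped_env_lines FILE
# Prints "LINE_NUMBER: text" for every line that is not blank, not a comment,
# and not a KEY=VALUE assignment starting in column one. Such lines are
# usually the tail of a wrapped value. Exit status is 1 if any were found.
find_wrapped_env_lines() {
    awk '
        /^[[:space:]]*$/ { next }              # blank line
        /^[[:space:]]*#/ { next }              # comment
        /^[A-Za-z_][A-Za-z0-9_]*=/ { next }    # KEY=VALUE
        { printf "%d: %s\n", NR, $0; found = 1 }
        END { exit found ? 1 : 0 }
    ' "$1"
}

# Demo against a throwaway file shaped like the broken example above.
tmp=$(mktemp)
printf '%s\n' '# comment' 'A=first-part;' '  B=second-part;' > "$tmp"
find_wrapped_env_lines "$tmp" || echo "wrapped value detected"
rm -f "$tmp"
```

The check only catches indented continuation lines like the ones in the broken example. A continuation that starts in column one with something that looks like KEY=VALUE still passes, so it complements a visual review and does not replace one.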

The fix: separate constants from secrets

Bake values that never change per environment into your Dockerfile:

ENV AzureWebJobsScriptRoot=/home/site/wwwroot
ENV FUNCTIONS_WORKER_RUNTIME=dotnet-isolated

Pass everything else through your Compose file or deployment config:

services:
  functions:
    build: .
    environment:
      - AzureWebJobsStorage=${AzureWebJobsStorage}
      - APPLICATIONINSIGHTS_CONNECTION_STRING=${APPLICATIONINSIGHTS_CONNECTION_STRING}
    env_file:
      - .env

Connection strings and instrumentation keys stay out of the image. They come from .env locally and from app settings or Key Vault references in production.

One more if you are deploying custom containers to App Service specifically: set WEBSITES_ENABLE_APP_SERVICE_STORAGE=false. The default (true) mounts persistent storage over /home, which overwrites your published function code at startup. This does not apply to Container Apps, only App Service (GitHub issue #642).

Pitfall 2: Azurite and Docker Networking

Most tutorials tell you to set AzureWebJobsStorage to UseDevelopmentStorage=true and move on. That shorthand expands to a full connection string pointing at localhost:

DefaultEndpointsProtocol=http;
AccountName=devstoreaccount1;
AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;
BlobEndpoint=http://127.0.0.1:10000/devstoreaccount1;
QueueEndpoint=http://127.0.0.1:10001/devstoreaccount1;
TableEndpoint=http://127.0.0.1:10002/devstoreaccount1;

See those 127.0.0.1 addresses? When Azurite runs on your machine, that works fine. Inside Docker, it breaks.

Each container runs in its own network namespace. 127.0.0.1 inside the functions container refers to the functions container itself, not Azurite. Your function tries to reach storage on its own loopback interface, finds nothing listening, and fails silently or throws a connection error depending on the trigger type.

Docker Compose creates a shared bridge network where each service name resolves to the corresponding container's IP. So the fix is to spell out the full connection string with the Compose service name replacing 127.0.0.1:

DefaultEndpointsProtocol=http;
AccountName=devstoreaccount1;
AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;
BlobEndpoint=http://azurite:10000/devstoreaccount1;
QueueEndpoint=http://azurite:10001/devstoreaccount1;
TableEndpoint=http://azurite:10002/devstoreaccount1;

azurite here is whatever you named the service in your docker-compose.yml. DNS resolution happens automatically on the Compose network.

But DNS resolving correctly is not enough. By default, Azurite binds to 127.0.0.1 inside its own container, which means it only accepts connections from itself. You need to pass --blobHost 0.0.0.0 --queueHost 0.0.0.0 --tableHost 0.0.0.0 so Azurite listens on all interfaces. Without this, the functions container resolves azurite to the right IP, opens a TCP connection, and gets "Connection refused."

This pitfall hides well because HTTP triggers don't need storage. You build a function app, add an HTTP trigger, test it in Docker, everything works. Then you add a queue trigger and it silently does nothing: no errors in the console, no messages processed, no indication that storage is unreachable. The function host quietly skips triggers it can't initialize.

A quick connectivity check saves you the debugging:

docker compose exec functions curl -s http://azurite:10000

If Azurite is reachable and bound correctly, you get back a short XML or text response. If you get "Connection refused," check the bind flags. If you get a DNS error, check your service name.

Part 1 already showed the working Compose file with these settings in place. That is why each piece is there.

Pitfall 3: Debugging a Silent Container

Your container starts, the health check passes, but nothing happens. No HTTP responses, no queue processing, no timer triggers. The logs show the host booting and then silence. This is the most common failure mode with Azure Functions in Docker, and it has six distinct causes. Work through them in order.

Step 1: Check if the host found your functions. Run docker logs <container> and look for the function discovery block near startup:

Host initialized (348ms)
Found the following functions:
  ProcessOrder: timerTrigger
  SubmitOrder: httpTrigger

If you see "Host initialized" but zero functions listed, your AzureWebJobsScriptRoot is wrong or your Dockerfile's WORKDIR does not point to /home/site/wwwroot. The host scans that directory for compiled function metadata. If it points somewhere else, it finds nothing and starts successfully with nothing to run. This is the root cause in #642 and #980.

Step 2: Check storage connectivity. If your functions are listed but triggers never fire, the problem is almost always storage. Look for this error in the logs:

The Azure Storage connection string named 'Storage' does not exist.

Timer triggers and queue triggers need a valid AzureWebJobsStorage connection to coordinate leases and checkpoints. HTTP triggers work without storage, so a container that responds to HTTP but ignores everything else is a storage configuration problem. Verify your environment variables:

docker compose exec functions env | grep -E 'FUNCTIONS|AzureWebJobs'

This surfaces FUNCTIONS_WORKER_RUNTIME, AzureWebJobsStorage, and any other Functions-specific configuration in the running container.
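
If you would rather not print connection strings to a terminal, the same check can be scripted so that it reports only what is missing. A minimal sketch meant to run inside the container; the function name and the list of required variables are assumptions drawn from this article, not an official list:

```shell
#!/bin/sh
# check_functions_env
# Returns non-zero and names each required variable that is unset or empty.
# Values are never printed, so secrets stay out of logs.
check_functions_env() {
    status=0
    for name in FUNCTIONS_WORKER_RUNTIME AzureWebJobsStorage AzureWebJobsScriptRoot; do
        # printenv prints nothing when the variable is not in the environment.
        if [ -z "$(printenv "$name")" ]; then
            echo "missing: $name" >&2
            status=1
        fi
    done
    runtime=$(printenv FUNCTIONS_WORKER_RUNTIME || true)
    if [ -n "$runtime" ] && [ "$runtime" != "dotnet-isolated" ]; then
        echo "FUNCTIONS_WORKER_RUNTIME is '$runtime', expected dotnet-isolated" >&2
        status=1
    fi
    return "$status"
}

# Typical use at the top of a custom entrypoint or a smoke test:
# check_functions_env || exit 1
```

It checks presence only. A variable that is set but points at an unreachable storage endpoint still passes, which is what the connectivity check in Pitfall 2 covers.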

Step 3: Inspect the container filesystem. When functions still do not appear after fixing the script root, the published output may not be where you think it is. Check directly:

docker compose exec functions ls /home/site/wwwroot

You should see your .dll files, host.json, functions.metadata, and worker.config.json (the isolated worker publishes a single functions.metadata file, not per-function function.json files). A wrong COPY --from=build path in a multi-stage Dockerfile is the most common cause: the build stage publishes to /app/publish but the COPY reads from /app/out, and the container starts with an empty wwwroot.

Step 4: Check for assembly conflicts. If the host discovers your functions but the worker crashes on invocation, look for FileNotFoundException referencing assemblies like System.Memory.Data. This happens when in-process WebJobs SDK packages ship inside an isolated worker image. The host and worker expect different assembly versions, and the loader fails silently until a trigger actually fires. Pin your NuGet package versions to match the host's expectations. See #1221 for the specific version matrix.

Step 5: Attach a debugger. When the logs tell you nothing useful, attach directly. VS Code's pipeTransport configuration or Rider's Docker attach both work. The critical detail: the Functions host and the isolated worker are separate .NET processes. The host is the parent process managing triggers; your code runs in the worker. Attach to the worker PID, not the host PID. If you attach to the host, you will see trigger infrastructure but none of your breakpoints will hit.

Step 6: Watch for broken image tags. Sometimes your container worked yesterday and fails today with no code changes. Base image tag updates can silently break functions. Tag 4.33.2 broke function discovery for days before anyone traced it back to the image itself (#1068). Always pin specific version tags in your Dockerfile. Never use :latest for the Functions base image in production.
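
Pinning is easy to enforce mechanically. A rough sketch of a lint step, assuming a conventional Dockerfile with one FROM instruction per line; find_unpinned_images is a hypothetical helper written for this article:

```shell
#!/bin/sh
# find_unpinned_images DOCKERFILE
# Prints "LINE_NUMBER: text" for FROM lines whose image has no tag or uses
# :latest. References to earlier build stages, scratch, and digest pins are
# accepted. Exit status is 1 if any unpinned image was found.
find_unpinned_images() {
    awk '
        toupper($1) == "FROM" {
            i = 2
            while (i <= NF && $i ~ "^--") i++      # skip flags such as --platform=...
            image = $i
            pinned = (image == "scratch") ||
                     (image in stages) ||
                     (image ~ "@sha256:") ||
                     (image ~ ":[^/:]+$" && image !~ ":latest$")
            if (!pinned) { printf "%d: %s\n", NR, $0; found = 1 }
            # Remember the stage name so later "FROM <stage>" lines are not flagged.
            if (toupper($(i + 1)) == "AS") stages[$(i + 2)] = 1
        }
        END { exit found ? 1 : 0 }
    ' "$1"
}

# Typical use in CI:
# find_unpinned_images Dockerfile || exit 1
```

The check cannot tell a floating tag such as a bare major version from a fully pinned one. It only catches a missing tag and :latest, so the choice of how specific the tag is remains a review decision.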

Other Known Issues

A few problems fall outside the decision tree but will bite you eventually:

No graceful shutdown. The default entrypoint start.sh runs as PID 1 and does not forward SIGTERM to child processes. Your container gets SIGKILL after the orchestrator's grace period expires, which means in-flight executions are terminated without cleanup. This has been open for five years (#404). Workaround: use dumb-init or a custom entrypoint that traps signals.

Non-root containers break startup. The Functions host needs write access to /azure-functions-host at startup. Running the container as a non-root user fails unless you fix directory permissions in your Dockerfile (#424).

Development environment restart loops. Setting AZURE_FUNCTIONS_ENVIRONMENT=Development can trigger the host to restart repeatedly as it watches for file changes that never settle (#1207). Use Production or Staging in Docker unless you specifically need development-mode diagnostics.

Pitfall 4: Image Size and Cold Start

The default Azure Functions base image is 800-900 MB. Add your application code, NuGet packages, and assets, and you're over 1 GB before your first request arrives (#236).

The -slim tags can paradoxically be larger than the regular tags (#1230). Always verify with docker images.

Old extension bundles (v2 and v3) still ship inside the v4 images, wasting roughly 429 MB on code your app will never execute (#880).

Every optimization here is measurable. Start with docker images and track the delta.

.dockerignore

Without a .dockerignore, COPY . . sends your entire working directory to the Docker daemon, including .git/ history and local.settings.json (which contains connection strings and keys).

bin/
obj/
.git/
.vs/
.vscode/
local.settings.json
node_modules/
*.user
Dockerfile

This alone can cut your build context by hundreds of megabytes and prevent secrets from leaking into image layers.
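
A missing entry is easy to overlook once the file exists. A minimal sketch of a guard you could run before docker build, assuming the entries are written exactly as in the list above; check_dockerignore and its set of must-have entries are illustrative choices:

```shell
#!/bin/sh
# check_dockerignore [FILE]
# Verifies that the entries most likely to bloat the build context or leak
# secrets appear as exact lines in the ignore file (default: .dockerignore).
check_dockerignore() {
    file=${1:-.dockerignore}
    status=0
    for entry in bin/ obj/ .git/ local.settings.json; do
        # -x matches whole lines, -F treats the entry as a literal string.
        if ! grep -qxF -- "$entry" "$file" 2>/dev/null; then
            echo "not excluded: $entry" >&2
            status=1
        fi
    done
    return "$status"
}

# Typical use before a build:
# check_dockerignore && docker build -t my-functions .
```

Equivalent patterns written differently (bin without the trailing slash, or **/bin/) are reported as missing, so adjust the list to match the spelling your project uses.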

Layer ordering for cache hits

The order of your COPY instructions determines whether Docker can reuse cached layers. Copy the project file first, restore, then copy everything else:

COPY MyFunctionApp.csproj .
RUN dotnet restore

COPY . .
RUN dotnet publish -c Release -o /home/site/wwwroot

When only source code changes, the restore layer stays cached:

Step 3/7 : RUN dotnet restore
 ---> Using cache
 ---> 4a8b2c1d3e5f
Step 4/7 : COPY . .
 ---> 9f1e2d3c4b5a

That "Using cache" line saves 30-120 seconds per build depending on your package count. Without this ordering, every code change re-downloads every NuGet package.

ReadyToRun compilation

Add the PublishReadyToRun flag to pre-compile IL to native code, reducing JIT time at startup:

RUN dotnet publish -c Release -o /home/site/wwwroot \
    -p:PublishReadyToRun=true

This increases image size slightly but cuts cold start latency by front-loading compilation to build time instead of request time.

Trimming (PublishTrimmed=true) is the more aggressive option. It strips unused assemblies and can dramatically reduce image size. But the Functions runtime uses reflection to discover your function endpoints, and the trimmer can remove types it considers unreachable. If your functions disappear after trimming, that's why. Use trimming only if you're willing to maintain trim annotations and test thoroughly.

Cold start: the numbers that matter

On Azure Container Apps, image pull time dominates cold start because the platform scales to zero:

Image size	Pull time	Total cold start
~480 MB	~20s	~25-30s
~140 MB	~7s	~12-15s

That 13-second pull difference hits every scale-from-zero event. On Functions Premium with always-ready instances, the image is cached on warm infrastructure, so size matters less for latency. It still matters for deployment speed and registry costs.

CVE accumulation

Base images are only rebuilt monthly, so vulnerabilities accumulate between rebuilds (#1185). A multi-stage build where you copy your published output onto a fresh OS base gives you control over patching:

FROM mcr.microsoft.com/dotnet/sdk:8.0 AS build
# ... build steps ...

FROM mcr.microsoft.com/azure-functions/dotnet-isolated:4-dotnet-isolated8.0
COPY --from=build /home/site/wwwroot /home/site/wwwroot

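Run docker images after applying these changes. A starting point of 1 GB+ dropping to 300-400 MB is typical when you combine layer optimization and a proper .dockerignore with an image that no longer carries dead extension bundles. ReadyToRun is the exception in this list: it trades a somewhat larger image for faster startup, so measure its effect separately.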

Pre-Deploy Checklist

Save yourself a repeat debugging session. Run through this before every container deployment.

[ ] FUNCTIONS_WORKER_RUNTIME set to dotnet-isolated in your container environment, not inherited from local.settings.json
[ ] AzureWebJobsStorage uses explicit endpoint strings with Docker service names instead of UseDevelopmentStorage=true
[ ] Connection strings are single-line in .env files with no line-wrapping
[ ] docker logs confirms all expected functions discovered at startup
[ ] AzureWebJobsScriptRoot points to /home/site/wwwroot (verify if using a custom base image)
[ ] .dockerignore excludes bin/, obj/, .git/, and local.settings.json
[ ] NuGet restore layer cached separately from the source code copy step
[ ] Base image tag pinned to a specific version, not :latest
[ ] Azurite bound to 0.0.0.0 in your Compose configuration
[ ] Image tested with docker compose up locally before pushing to any registry

Which of these four pitfalls cost you the most debugging time: environment variables, Azurite networking, silent startup failures, or image size?

Azure Functions Beyond the Basics

Part 1: Running Azure Functions in Docker: Why and How

Part 2: Docker Pitfalls I Hit (And How to Avoid Them) (this article)

From AZ-204 to AI-200: What Changed and Why It Matters

Martin Oehlert — Fri, 17 Apr 2026 21:18:34 +0000

A comparison of the AZ-204 skill outline with the AI-200 course structure shows that roughly 60% of AZ-204 carries forward, 25% is dropped entirely, and about 30% of AI-200 is net-new content that AZ-204 never touched. Which side of that split you land on determines whether this transition is a week of review or a month of study. The gap is lopsided enough that you cannot assume existing knowledge transfers cleanly.

What Is AI-200?

Full name: AI-200: Azure AI Cloud Developer Associate

Format: Multiple choice, case studies, and scenario-based questions. Based on standard Microsoft exam format: approximately 40-60 questions, 100-minute window, passing score around 700/1000.

Timeline:

Beta exam: April 2026
General availability: July 2026 (estimated)
AZ-204 retirement: July 31, 2026

Course: The AI-200T00 instructor-led training course maps to seven learning paths that define the exam scope:

Container Hosting: ACR, App Service containers, Container Apps, AKS
Cosmos DB: NoSQL API with vector search and AI integration
PostgreSQL Vector Search: pgvector, HNSW indexes, hybrid search
Azure Managed Redis: data operations, event messaging, vector storage
Backend Services: Service Bus, Event Grid, Azure Functions
Secrets and Configuration: Key Vault, managed identities, App Configuration
Observability: OpenTelemetry, Azure Monitor logs and metrics

What Carried Forward, What Got Dropped, What's New

Carried forward (~60% of AZ-204)

Most of the backend services survive: Azure Functions (triggers, bindings, Durable Functions), Service Bus, Event Grid, Key Vault, App Configuration, and managed identities all carry over. Several topics are expanded rather than simply retained. Container Apps now gets deeper coverage of KEDA scaling and Dapr integration. Cosmos DB adds vector search on top of the existing NoSQL API. Container Registry picks up ACR Tasks. And managed identities extend to AKS workload identity, which matters because AKS is one of the largest new additions. If you already hold AZ-204, this 60% is review, not new study.

Dropped (~25% of AZ-204)

Comparing the AZ-204 study guide against the AI-200 course outline, seven topics are gone entirely: Blob Storage SDK, MSAL/Identity Platform, Microsoft Graph, SAS tokens, API Management, Event Hubs, and Azure Container Instances. Microsoft removed the CRUD-oriented cloud app topics that do not serve AI workloads. You will not be tested on generating SAS tokens or calling Graph endpoints. If you spent weeks on MSAL token flows for AZ-204, that knowledge still applies to real projects, but it will not appear on AI-200.

Brand new (~30% of AI-200)

Based on the AI-200T00 course structure, AKS spans three modules and accounts for an estimated 20-25 exam questions. You need cluster creation with az aks create, workload deployment with kubectl, ACR integration via --attach-acr, and scaling with HPA and cluster autoscaler. Configuration covers ConfigMaps, Secrets, Key Vault CSI Driver, persistent storage (Azure Disk for RWO, Azure Files for RWX), taints and tolerations, and the difference between resource requests and limits. Monitoring adds Container Insights, KQL queries for pod status and events, managed Prometheus, and alerting on OOMKills and resource exhaustion. This is the single largest new topic by question count.

PostgreSQL with pgvector is where AI-200 tests your understanding of vector databases, covering an estimated 8-11 questions across three modules. The foundation is Flexible Server provisioning with Entra ID auth and PgBouncer connection pooling. From there, you enable the pgvector extension for vector storage and work with distance operators: L2 (<->), cosine (<=>), and inner product (<#>). Batch embedding pipelines use Azure OpenAI to generate vectors at scale. Index optimization is where it gets specific: IVFFlat (partition-based, best under 100K vectors with frequent updates) versus HNSW (graph-based, best above 500K static vectors). Hybrid search combines vector similarity with metadata filters using standard WHERE clauses.
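To make the three operators concrete, the sketch below computes the value each one would return for a pair of two-dimensional vectors. It uses awk instead of a database, and the vectors and the function name are made up for illustration. It leans on two pgvector conventions: `<=>` returns cosine distance (1 minus cosine similarity), and `<#>` returns the negative inner product, so a smaller result always means a closer match.

```bash
# Compute the values pgvector's distance operators would return for two
# comma-separated vectors of equal length. Illustrative only: no database.
pgvector_distances() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, x, ","); split(b, y, ",")
    for (i = 1; i <= n; i++) {
      dot += x[i] * y[i]                # inner product
      sq  += (x[i] - y[i]) ^ 2          # squared Euclidean distance
      na  += x[i] ^ 2; nb += y[i] ^ 2   # squared norms
    }
    printf "l2=%.6f\n", sqrt(sq)                               # <->
    printf "cosine=%.6f\n", 1 - dot / (sqrt(na) * sqrt(nb))    # <=>
    printf "neg_inner_product=%.6f\n", -dot                    # <#>
  }'
}

pgvector_distances "3,4" "4,3"
```

Both vectors have length 5, so the cosine distance is 1 - 24/25 = 0.04 while the L2 distance is the square root of 2. The sign flip on the inner product is what lets all three operators share the same ORDER BY ... ASC pattern.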

Azure Managed Redis replaces the narrow "Azure Cache for Redis" coverage from AZ-204 with a broader scope across an estimated 7-12 questions. Five core data types (strings, hashes, lists, sets, sorted sets) and caching patterns (cache-aside, write-through, write-behind) form the baseline. The exam also tests event messaging: Pub/Sub for fire-and-forget broadcasting versus Streams for durable at-least-once delivery with consumer groups (XREADGROUP, XACK). On the Enterprise tier, RediSearch enables vector similarity search using FLAT and HNSW indexes combined with tag, numeric, and text filters.

OpenTelemetry rounds out the new content with an estimated 5-7 questions. The Azure Monitor OpenTelemetry Distro provides a one-line setup via UseAzureMonitor(), replacing the proprietary Application Insights SDK. Custom spans use ActivitySource, custom metrics use Meter instruments, and W3C TraceContext propagation handles distributed trace correlation across services. Sampling strategies control telemetry volume and cost, which is the kind of production concern the exam now prioritizes.

Three Shifts Worth Understanding

The topic changes above are not random. They reflect a different definition of what an Azure developer does.

Vector databases replace blob storage

If your application retrieves context from a knowledge base before passing it to a language model, you are building a RAG pipeline, and the retrieval layer runs on one of three backends the exam now tests.

Cosmos DB supports vector search for globally distributed workloads. PostgreSQL with pgvector handles complex hybrid queries where you combine vector similarity with metadata filters in standard WHERE clauses. Redis provides low-latency vector retrieval on the Enterprise tier using FLAT and HNSW indexes. Each backend has different index types (IVFFlat, HNSW, FLAT), different distance operators, and different tradeoffs around dataset size and query complexity.

AZ-204 treated storage as a CRUD problem: upload blobs, set access tiers, generate SAS tokens. AI-200 treats storage as a search problem, and the skill gap between "call a PUT endpoint" and "choose the right index type for 500K embeddings" is not small.

AKS moves from infrastructure to developer concern

AI workloads need GPU-enabled nodes isolated from general compute, custom operators, and fine-grained resource limits. Container Apps cannot give you any of that. AI-200 assigns three full modules to AKS: deployment, configuration, and monitoring.

The exam expects you to select between Azure Disk (RWO) and Azure Files (RWX) storage classes, integrate secrets through the Key Vault CSI Driver, and manage node pools with taints and tolerations. Monitoring means writing KQL queries against Container Insights to diagnose pod failures and resource exhaustion. AZ-204 kept you at the Container Apps level, where Kubernetes was an implementation detail you never touched. That abstraction no longer holds when your inference service needs a dedicated A100 node pool with specific resource requests and limits.

OpenTelemetry replaces proprietary instrumentation

Your tracing code now works the same whether telemetry flows to Azure Monitor, Jaeger, or Datadog. The Application Insights SDK locked you into Microsoft's instrumentation API, Microsoft's backend, and Microsoft's query tools. AI-200 replaces that instrumentation layer with OpenTelemetry, the CNCF-backed open standard.

The Azure Monitor OpenTelemetry Distro makes setup a one-liner with UseAzureMonitor(), but the exam goes deeper. Custom instrumentation means creating spans with ActivitySource and recording metrics with Meter instruments. Distributed trace correlation relies on W3C TraceContext headers propagated across service boundaries. Sampling configuration controls telemetry volume, which directly affects cost at scale. Azure Monitor still serves as the analysis backend; what changed is the instrumentation contract.
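The W3C TraceContext header is easier to reason about once you have taken one apart. The sketch below is a minimal shell parser for the traceparent header, not part of any SDK: the function name is hypothetical, and the sample value is the example from the W3C Trace Context specification. It also skips some spec rules, such as rejecting an all-zero trace ID.

```bash
# Split a W3C traceparent header into its fields and report the sampled flag.
# Layout: version(2 hex)-trace-id(32 hex)-parent-id(16 hex)-trace-flags(2 hex)
parse_traceparent() {
  # Reject anything that does not match the fixed-width hex layout.
  printf '%s\n' "$1" |
    grep -Eq '^[0-9a-f]{2}-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$' || return 1

  # Split on "-" into positional parameters: $1 version ... $4 flags.
  old_ifs=$IFS; IFS=-
  set -- $1
  IFS=$old_ifs

  # Bit 0 of trace-flags marks the trace as sampled.
  echo "trace_id=$2 parent_id=$3 sampled=$(( 0x$4 & 1 ))"
}

# Sample header value from the W3C Trace Context specification.
parse_traceparent "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
```

Every service in a call chain keeps the same trace ID and replaces the parent ID with its own span ID, which is what lets the backend stitch the spans into one trace. The sampled bit is how an upstream sampling decision travels downstream.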

What This Means for Your Study Plan

If you already hold AZ-204: roughly 60% carries forward. Your knowledge of Azure Functions, Service Bus, Event Grid, Key Vault, managed identities, and Cosmos DB basics is still valid. The gap areas are AKS (the largest single investment if you have not worked with Kubernetes), PostgreSQL with pgvector, Azure Managed Redis vector storage and Streams, and OpenTelemetry custom instrumentation. Budget 3-4 weeks of focused study on those new topics, then 1 week reviewing carried-over material to make sure nothing has shifted in scope.

If you are currently studying for AZ-204: you have a decision to make before July 31, 2026. If you are close to passing, finish it: the credential stays valid through its full renewal cycle. If you are early in your studies, pivot to AI-200 now and skip the dropped topics entirely. There is no reason to invest time in the Blob Storage SDK, MSAL, Microsoft Graph, API Management, or Event Hubs when those topics will not appear on the replacement exam.

If you are starting fresh: go directly to AI-200. The AI-200T00 course structure and Microsoft Learn paths give you everything you need; AZ-204 material adds no value at this point.

The 8-week plan below assumes you are starting from scratch or pivoting from early AZ-204 study:

The AI-200T00 course and the Microsoft Learn paths aligned to each domain are the primary resources once they publish alongside the beta exam. Are you finishing AZ-204 before July or pivoting to AI-200 now?

Running Azure Functions in Docker: Why and How

Martin Oehlert — Fri, 17 Apr 2026 05:23:51 +0000

Azure Functions Beyond the Basics
Continues from Azure Functions for .NET Developers (Parts 1-9)

When zip-deploy stops fitting

Your Azure Function needs to generate PDF invoices, so you add Puppeteer to your project. It works fine on your machine, but after a zip-deploy it fails because the Consumption plan doesn't have the Chromium dependencies installed. The function throws a cryptic error about missing shared libraries, and you're stuck choosing between a workaround that limits your architecture or a deployment model that gives you full control over the OS.

Most Azure Functions never hit this wall. Zip-deploy and run-from-package handle the majority of workloads well: your code and dependencies get packaged, uploaded, and run on Microsoft's managed infrastructure. You don't think about the OS, the runtime image, or what's installed underneath. That's the point, and it's a good default.

Containerizing a Function adds real operational cost. You own the base image, the patching cycle, the registry, and the build pipeline. If zip-deploy already works, containerizing your Function adds overhead with no payoff.

But there are specific problems where Docker earns that overhead back.

Native dependencies are the most common trigger. FFmpeg for media processing, Puppeteer or Playwright for headless browser work, libgdiplus for image manipulation: these require OS-level packages that the default Azure Functions host doesn't include. A custom Docker image lets you install exactly what the function needs.

Reproducible builds across environments matter when your team needs the same OS, the same SDK version, and the same native tooling from local dev through staging to production. A Dockerfile pins all of it in version control.

Running Functions alongside other containers is the third case. If you're already deploying to Azure Container Apps or AKS, packaging your Function as a container lets it sit next to your APIs, workers, and sidecars in the same orchestration layer. One deployment model, one scaling configuration, one set of infrastructure to manage.

If one of those three problems is yours, the container tax is worth paying.

The Dockerfile: multi-stage build for .NET 10

Start with the complete Dockerfile, then walk through what each stage does.

FROM --platform=linux/amd64 mcr.microsoft.com/dotnet/sdk:10.0 AS build
WORKDIR /src

COPY *.csproj .
RUN dotnet restore

COPY . .
RUN dotnet publish -c Release -o /app/publish

FROM --platform=linux/amd64 mcr.microsoft.com/azure-functions/dotnet-isolated:4-dotnet-isolated10.0 AS runtime
WORKDIR /home/site/wwwroot
COPY --from=build /app/publish .

Twelve lines. That's the whole thing.

The --platform=linux/amd64 flag on both FROM lines pins the image architecture. The Azure Functions base images only ship for linux/amd64, so without this flag, builds on Apple Silicon pull the wrong manifest and fail. Pinning the platform makes the Dockerfile work identically on Intel and ARM machines.

The build stage uses the .NET 10 SDK image to compile your project. The COPY *.csproj then dotnet restore pattern caches NuGet packages in a Docker layer, so subsequent builds skip the restore unless your dependencies change. The dotnet publish step compiles your code and produces a deployment-ready output in /app/publish.

The runtime stage switches to the Azure Functions base image. This image ships with the Functions host process, the dotnet-isolated worker runtime, and the three environment variables your app needs:

AzureWebJobsScriptRoot=/home/site/wwwroot
FUNCTIONS_WORKER_RUNTIME=dotnet-isolated
AzureFunctionsJobHost__Logging__Console__IsEnabled=true

You don't need to set any of these yourself. The base image handles it. Your only job is to place the published output at /home/site/wwwroot, which is why WORKDIR must point there. Get that path wrong and the Functions host starts but finds zero functions.

The final COPY --from=build pulls the compiled output from the build stage into the runtime image, keeping the SDK and all intermediate build artifacts out of your production container.

What you should know before building

Pin your SDK version in global.json. The base image 4-dotnet-isolated10.0 bundles a specific .NET 10 runtime. If your local SDK rolls ahead of what the image ships, you can hit subtle mismatches at runtime. Pinning keeps builds deterministic across laptops, CI, and the image:

{
  "sdk": {
    "version": "10.0.201",
    "rollForward": "latestPatch"
  }
}

Package version floor for .NET 10. The isolated worker packages below 2.x don't target .NET 10. You need at minimum:

Microsoft.Azure.Functions.Worker 2.50.0 or later
Microsoft.Azure.Functions.Worker.Sdk 2.0.5 or later

If you're upgrading an existing project from .NET 8, bumping just the TargetFramework without updating these two packages is the most common failure mode.
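A quick pre-flight check can catch that failure mode before the first build. The sketch below compares version strings against the two floors listed above. The function names and the "installed" versions are made up for illustration, and the comparison relies on the -V (version sort) flag of GNU sort.

```bash
# Succeed when version $1 is greater than or equal to version $2.
# Relies on the -V (version sort) flag of GNU sort.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Report whether package $1 at version $2 meets the floor $3.
check_floor() {
  if version_ge "$2" "$3"; then
    echo "$1 $2: ok"
  else
    echo "$1 $2: below $3, upgrade before retargeting"
  fi
}

# The installed versions here are hypothetical inputs; in a real project
# they would come from the .csproj or from `dotnet list package`.
check_floor Microsoft.Azure.Functions.Worker     1.23.0 2.50.0
check_floor Microsoft.Azure.Functions.Worker.Sdk 2.0.5  2.0.5
```

A version sort matters here because a plain string comparison would rank 2.5.0 above 2.50.0.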

.NET 10 doesn't run on the Linux Consumption plan. This is a hard platform constraint, not a preview gap. If your current app runs on Linux Consumption and you want .NET 10, you need to migrate to the Flex Consumption plan first. Premium, ACA, and AKS (covered later) all support .NET 10 without this restriction.

The Functions host runs on .NET 8 internally. Even in the dotnet-isolated10.0 image, the host process itself targets .NET 8. Your worker process runs on .NET 10. This is expected behavior for the isolated model, not a bug: the two processes communicate over gRPC, so the runtime versions are independent.

These images are linux/amd64 only. If you're on Apple Silicon, Docker Desktop runs them under Rosetta or QEMU emulation. Builds work fine. Performance is noticeably slower than native ARM execution, so keep local integration test suites short.

No slim variant exists for .NET 10 yet. The base image is Ubuntu-based (the .NET 10 container images moved from Debian to Ubuntu), and the full image weighs roughly 1.5 GB. A Mariner-based or distroless option may come later, but as of April 2026, this is what ships.

You own base image updates. Microsoft publishes monthly security patches to the base images, but unlike managed Functions deployments, custom containers do not auto-update. You pull the latest tag, rebuild, and redeploy. Set a calendar reminder or wire it into your CI pipeline. The official docs are explicit about this: maintaining your container is your responsibility.
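Wired into CI, the pull-rebuild-redeploy cycle can be as small as the sketch below. The registry path and the date-based tag are assumptions, and the script defaults to a dry run that only prints the commands, so it is safe to try without Docker installed.

```bash
# Rebuild against the newest base image and push the result.
# IMAGE is a hypothetical registry path; substitute your own.
IMAGE=${IMAGE:-myregistry.azurecr.io/my-functions}
TAG=${TAG:-$(date +%Y%m%d)}

# Print the command when DRY_RUN=1 (the default here), otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi
}

rebuild_image() {
  # --pull makes Docker check the registry for a newer base image
  # instead of reusing whatever copy is cached locally.
  run docker build --pull --platform linux/amd64 -t "$IMAGE:$TAG" .
  run docker push "$IMAGE:$TAG"
}

rebuild_image
```

Run it with DRY_RUN=0 on a monthly schedule, then redeploy the new tag to whichever host you use.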

Local development with Docker Compose and Azurite

Your function app needs storage. Timer triggers use it for lease management, queue triggers read from it directly, and durable functions store orchestration state there. In production that's an Azure Storage account. Locally, you need Azurite, Microsoft's storage emulator, running alongside your function container.

Here's the full Docker Compose file:

services:
  azurite:
    image: mcr.microsoft.com/azure-storage/azurite
    command: >-
      azurite
      --blobHost 0.0.0.0
      --queueHost 0.0.0.0
      --tableHost 0.0.0.0
      --loose
      --skipApiVersionCheck
    ports:
      - "10000:10000"
      - "10001:10001"
      - "10002:10002"
    volumes:
      - azurite-data:/data
    healthcheck:
      test: nc -z 127.0.0.1 10000
      interval: 3s
      retries: 5
      start_period: 5s

  functions:
    build: .
    ports:
      - "8080:80"
    environment:
      - AzureWebJobsStorage=DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;BlobEndpoint=http://azurite:10000/devstoreaccount1;QueueEndpoint=http://azurite:10001/devstoreaccount1;TableEndpoint=http://azurite:10002/devstoreaccount1
    depends_on:
      azurite:
        condition: service_healthy

volumes:
  azurite-data:

The --blobHost 0.0.0.0 flags (and their queue/table equivalents) tell Azurite to listen on all network interfaces, not just localhost. Without them, your function container can't reach Azurite across the Docker network. --loose relaxes strict API validation. --skipApiVersionCheck prevents version mismatch errors when the Functions runtime targets a newer Storage API than Azurite supports.

The named volume azurite-data keeps your storage data intact between docker compose down and docker compose up. Queue messages, blob uploads, table entities: all survive restarts. Drop the volume only when you want a clean slate (docker compose down -v).

The health check deserves attention. Without it, Docker starts both containers simultaneously. Your function app boots in seconds, tries to connect to Azurite, and fails because Azurite hasn't finished initializing. The nc -z 127.0.0.1 10000 check confirms Azurite is actually accepting connections before the function container starts.

Now for the part that will cost you an hour if you don't know about it.

Your first instinct for the storage connection string will be UseDevelopmentStorage=true. That's what every Azure Functions tutorial uses, and it works fine when Azurite runs on your host machine. Inside Docker, it breaks. The shorthand expands to endpoints pointing at 127.0.0.1, which inside the function container means "myself," not "the Azurite container next door."

The fix is the explicit connection string you see in the Compose file above. The critical difference: every endpoint URL uses azurite as the hostname (the Compose service name) instead of 127.0.0.1. Docker's internal DNS resolves azurite to the correct container IP automatically. The account name and key are Azurite's well-known development credentials, the same ones UseDevelopmentStorage=true uses under the hood.
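Since a single stray 127.0.0.1 is enough to break the setup, a small lint can confirm that every endpoint targets the Compose service name. The function below is a hypothetical helper, not part of any Azure tooling, and the connection string in the example uses a placeholder key and a deliberately broken queue endpoint.

```bash
# Fail if any endpoint in a storage connection string points at a host other
# than the expected one. A quick pre-flight check, not a full parser.
check_storage_endpoints() {
  conn=$1 host=$2 rc=0
  for key in BlobEndpoint QueueEndpoint TableEndpoint; do
    # Pull the value for this key out of the semicolon-separated string.
    value=$(printf '%s\n' "$conn" | tr ';' '\n' | sed -n "s/^$key=//p")
    case $value in
      http://"$host":*) echo "$key ok" ;;
      *) echo "$key does not target $host: ${value:-<missing>}"; rc=1 ;;
    esac
  done
  return $rc
}

# Example with one endpoint still pointing at localhost (placeholder key).
check_storage_endpoints \
  "DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;AccountKey=PLACEHOLDER;BlobEndpoint=http://azurite:10000/devstoreaccount1;QueueEndpoint=http://127.0.0.1:10001/devstoreaccount1;TableEndpoint=http://azurite:10002/devstoreaccount1" \
  azurite || echo "fix the connection string before starting the stack"
```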

One practical tip: that connection string is long and ugly. Don't try to split it across multiple lines in your Compose file or inject it from a .env file with line breaks. YAML will quietly mangle it. Keep it on a single line, or use a .env file with the entire value on one line and reference it with ${AzureWebJobsStorage} in your Compose file.

Run docker compose up --build and you should see Azurite report all three services listening, followed by your function app discovering its triggers. If the function container restarts in a loop, check the connection string first. Nine times out of ten, that's the problem.

Debugging in containers: VS Code and Rider

Add a debug stage to your Dockerfile that installs the .NET debugger:

FROM build AS debug
RUN dotnet tool install --tool-path /tools dotnet-dump
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl unzip procps \
    && curl -sSL https://aka.ms/getvsdbgsh | bash /dev/stdin -v latest -l /vsdbg \
    && apt-get clean && rm -rf /var/lib/apt/lists/*

ENV DOTNET_USE_POLLING_FILE_WATCHER=1
ENTRYPOINT ["dotnet", "run", "--project", "/src/HttpTriggerDemo"]

The DOTNET_USE_POLLING_FILE_WATCHER environment variable is needed because file-change notifications (inotify) often don't propagate across Docker bind mounts, particularly from macOS and Windows hosts. Without it, file change detection can silently fail.

VS Code with pipeTransport

Point your launch.json at the container using pipeTransport instead of opening a debug port:

{
  "name": "Attach to Docker",
  "type": "coreclr",
  "request": "attach",
  "processId": "${command:pickRemoteProcess}",
  "pipeTransport": {
    "pipeProgram": "docker",
    "pipeArgs": ["exec", "-i", "my-functions-debug"],
    "debuggerPath": "/vsdbg/vsdbg",
    "pipeCwd": "${workspaceFolder}"
  },
  "sourceFileMap": {
    "/src": "${workspaceFolder}"
  }
}

pipeTransport sends debug commands through docker exec, so you never expose a debug port. The sourceFileMap entry maps the container's /src path back to your workspace so breakpoints resolve correctly. Start the container, hit F5 in VS Code, pick the dotnet process, and you're attached.

Rider

Rider handles most of this automatically. Open Run > Attach to Process, select the Docker tab, and pick your container. Rider installs its own debug agent on first attach. If you use Docker Compose, Rider also supports a native Docker Compose run configuration that builds, starts, and attaches in one step.

Docker Compose debug profile

Separate your debug configuration using a Compose profile so it doesn't interfere with production builds:

services:
  functions-debug:
    build:
      context: .
      target: debug
    volumes:
      - ./src:/src
    environment:
      - DOTNET_USE_POLLING_FILE_WATCHER=1
      - AzureWebJobsStorage=DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;BlobEndpoint=http://azurite:10000/devstoreaccount1;QueueEndpoint=http://azurite:10001/devstoreaccount1;TableEndpoint=http://azurite:10002/devstoreaccount1
    depends_on:
      - azurite
    profiles: [debug]

Run it with docker compose --profile debug up. The target: debug directive tells Compose to stop at your debug stage, which includes the SDK and vsdbg but skips the final runtime stage.

Hot reload: set expectations

Using dotnet watch to wrap func start inside the container works, but every code change triggers a full restart. Expect 4-6 second cycles. That's usable for occasional debugging sessions, not for rapid iteration.

The pragmatic split: run func start on your host machine for day-to-day development. Keep Azurite and any dependencies (Service Bus emulator, CosmosDB emulator) in Docker. Reserve full-container debugging for integration testing or reproducing environment-specific issues. You get fast inner-loop feedback without giving up the production-parity benefits of containerized dependencies.

Where to deploy: ACA vs Premium vs AKS

You have a containerized Function. Now you need somewhere to run it. Three options exist, and each makes a different trade-off between operational control and managed convenience.

Azure Container Apps (ACA)

ACA is the recommended default for containerized Azure Functions. The platform reads your Function triggers and configures KEDA scaling rules automatically, so you never write ScaledObject YAML yourself.

Deploy with the Azure CLI:

az containerapp create \
  --name my-functions \
  --resource-group my-rg \
  --environment my-env \
  --image myregistry.azurecr.io/my-functions:latest \
  --registry-server myregistry.azurecr.io \
  --ingress external --target-port 80 \
  --kind functionapp \
  --min-replicas 0 \
  --max-replicas 30

Set --min-replicas 0 and your app scales to zero when idle, meaning zero compute cost during quiet periods.

Pricing follows the Container Apps model. On the Consumption plan, you pay per vCPU-second and GiB-second, with a monthly free grant of 180,000 vCPU-seconds and 360,000 GiB-seconds per subscription. For a Function that processes a few thousand events per day and idles overnight, you could land under $5/month. Dedicated workload profiles are available if you need guaranteed compute or GPU access, billed per instance rather than per resource consumed.
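A back-of-the-envelope estimate shows how the free grant plays out. The sketch below subtracts the grant from the consumed vCPU-seconds and GiB-seconds and prices the remainder. The two per-second rates are assumed example values, not figures from this article, and the calculation ignores request charges and any reduced idle rate, so treat the result as a rough floor.

```bash
# Estimate a monthly Container Apps Consumption bill after the free grant.
# Args: vCPU per replica, GiB per replica, active seconds per month.
# The RATE_* values are assumed example prices; look up the current
# figures for your region before relying on the result.
RATE_VCPU_SECOND=0.000024
RATE_GIB_SECOND=0.000003
FREE_VCPU_SECONDS=180000
FREE_GIB_SECONDS=360000

aca_monthly_cost() {
  awk -v cpu="$1" -v mem="$2" -v secs="$3" \
      -v rate_cpu="$RATE_VCPU_SECOND" -v rate_mem="$RATE_GIB_SECOND" \
      -v free_cpu="$FREE_VCPU_SECONDS" -v free_mem="$FREE_GIB_SECONDS" 'BEGIN {
    cpu_billable = cpu * secs - free_cpu; if (cpu_billable < 0) cpu_billable = 0
    mem_billable = mem * secs - free_mem; if (mem_billable < 0) mem_billable = 0
    printf "%.2f\n", cpu_billable * rate_cpu + mem_billable * rate_mem
  }'
}

# One replica with 0.5 vCPU and 1 GiB, active four hours a day for 30 days.
aca_monthly_cost 0.5 1 $(( 4 * 3600 * 30 ))
```

With these inputs the replica consumes 216,000 vCPU-seconds and 432,000 GiB-seconds, so only 36,000 and 72,000 are billable after the grant.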

Cold start is the main gotcha. When your app scales from zero, the platform needs to pull the container image, provision resources, and start the Functions host. For a typical .NET isolated Function, teams commonly report 5-15 seconds on the first request after an idle period (Microsoft doesn't publish official cold start numbers). You can eliminate this by setting --min-replicas 1, but that means you pay for at least one instance around the clock. Keeping your container image small (pin to a specific tag, avoid unnecessary layers) helps reduce cold start time.

What ACA does not support: deployment slots, Functions access keys via the portal, and Functions proxies. If you rely on staging slots for zero-downtime swaps, you'll need to use ACA's built-in blue-green deployment with traffic splitting instead.

Azure Functions Premium Plan

The Premium plan (Elastic Premium, SKUs starting with EP) is the original way to run custom containers in Azure Functions. It predates ACA and still has one killer feature: always-ready instances with prewarmed buffers.

az functionapp plan create \
  --resource-group my-rg \
  --name my-premium-plan \
  --location eastus \
  --sku EP1 \
  --is-linux

az functionapp create \
  --resource-group my-rg \
  --plan my-premium-plan \
  --name my-functions \
  --storage-account mystorageacct \
  --functions-version 4 \
  --deployment-container-image-name myregistry.azurecr.io/my-functions:latest

Three SKU sizes are available:

SKU	vCPU	Memory
EP1	1	3.5 GB
EP2	2	7 GB
EP3	4	14 GB

The billing model is the critical difference from ACA. Premium plan charges per core-second and memory across all allocated instances, with no execution charge. At least one instance must always be running. An EP1 instance running 24/7 costs roughly $155-175/month (varies by region). You cannot scale to zero. That always-on instance is the price you pay for eliminating cold starts entirely.

Where the Premium plan shines is latency-sensitive HTTP traffic. When load spikes, prewarmed instances are already initialized and waiting. No container pull, no cold start. For Functions that must respond in under 200ms consistently, this matters.

Watch out for the SKU naming confusion. EP1 is Elastic Premium (dynamic scaling). P1V2 is a Dedicated App Service plan (no dynamic scaling). Pick the wrong one and you'll pay more for less flexibility.

Maximum scale-out is up to 100 instances. The default maximumElasticWorkerCount in ARM templates is 20, so you may need to raise that limit explicitly.

AKS with KEDA

If your team already operates a Kubernetes cluster, running Functions there avoids introducing a new compute platform. You install KEDA as an AKS add-on, deploy your Function container as a standard Kubernetes deployment, and KEDA handles scaling based on event triggers.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-functions-scaler
spec:
  scaleTargetRef:
    name: my-functions
  minReplicaCount: 0
  maxReplicaCount: 50
  triggers:
    - type: azure-servicebus
      metadata:
        queueName: orders
        messageCount: "5"
      authenticationRef:
        name: servicebus-auth

KEDA supports these Azure Functions triggers directly: Azure Storage Queues, Azure Service Bus, Azure Event Hubs / IoT Hubs, Apache Kafka, and RabbitMQ. HTTP triggers work, but KEDA does not manage them directly; on AKS you scale them through the Horizontal Pod Autoscaler or the KEDA HTTP add-on instead.
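To get a feel for what messageCount does, the sketch below approximates the replica count a queue-length trigger asks for: the backlog divided by the per-replica target, rounded up and clamped to the configured bounds. This is a simplification under stated assumptions. The real Horizontal Pod Autoscaler also applies tolerances and stabilization windows, and the function name is hypothetical.

```bash
# Approximate the replica count for a queue-length trigger:
# ceil(queue_length / messageCount), clamped to [min, max].
desired_replicas() {
  queue_length=$1 message_count=$2 min=$3 max=$4
  # Integer ceiling division.
  replicas=$(( (queue_length + message_count - 1) / message_count ))
  if [ "$replicas" -lt "$min" ]; then replicas=$min; fi
  if [ "$replicas" -gt "$max" ]; then replicas=$max; fi
  echo "$replicas"
}

# Values from the ScaledObject above: messageCount 5, min 0, max 50.
for backlog in 0 23 1000; do
  echo "backlog=$backlog replicas=$(desired_replicas "$backlog" 5 0 50)"
done
```

An empty queue scales to zero, 23 messages ask for 5 replicas, and a backlog of 1,000 is capped at maxReplicaCount.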

This is the only option that is community-supported, not Microsoft-supported. The docs are explicit: "Best-effort support is provided by contributors and from the community." If something breaks at 2am, you're opening a GitHub issue, not filing a support ticket. You also own the full Kubernetes stack: node pools, networking, RBAC, upgrades, monitoring.

Cost depends entirely on your cluster. If you're already paying for AKS nodes, adding a Function container is effectively free at the compute layer. If you'd be spinning up a new cluster just for Functions, the minimum AKS cost (one node with a Standard_D2s_v3 VM) starts around $70/month before you've deployed anything. KEDA itself is free and runs as a lightweight deployment in your cluster.

Cold start on AKS matches whatever your cluster can provision. With KEDA's scale-to-zero, a cold start involves scheduling a pod, pulling the image (if not cached), and starting the container. On a warm cluster with cached images, that's 3-10 seconds. On a cluster that needs to scale up a node, it could be 2-4 minutes.

Trade-offs at a glance

The decision tree is short. If you don't already run Kubernetes, don't start now for a single Function app. If your Function handles latency-sensitive HTTP requests and cold starts are unacceptable, use the Premium plan and accept the always-on cost. For everything else, ACA with the Consumption plan gives you scale-to-zero, automatic KEDA configuration, and the lowest operational overhead.

When Docker adds value

The deployment choice assumes the container was worth building in the first place. Every custom container you ship is infrastructure you now own: a registry to manage, a base image to patch monthly, a CI pipeline stage that didn't exist before. Zip-deploy skips all of that. Microsoft patches the managed host, and you never think about it.

That trade-off only flips when the managed host can't do what your function requires. Puppeteer needs Chromium installed at the OS level. Your compliance team mandates identical images from laptop to production. Your platform team already runs everything on AKS and adding a second deployment model would create more problems than it solves. Those are real constraints, not preferences.

The setup cost is lower than it looks. Twelve lines of Dockerfile, a Compose file with Azurite, and one container image that deploys to ACA, Premium, or AKS without changes. The ongoing cost is the part that matters: monthly base image pulls, rebuild-and-redeploy cycles, and one more thing to monitor.

Is your function broken without OS-level control, or would zip-deploy work fine if you tried it first?

Production Realities: When Azure Functions Stops Being Serverless

Martin Oehlert — Fri, 10 Apr 2026 05:38:43 +0000

The Enterprise Reality Check

At what point does an Azure Functions deployment stop being serverless and start being managed compute with a monthly bill? The shift happens not in one decision but in a sequence of reasonable ones: a VNet requirement from the security team, then private endpoints for the storage account, then an API gateway because the function is public-facing.

Your function works on Consumption. Zero cost at idle, automatic scaling, no infrastructure to think about. Then the security review lands. VNet integration is mandatory. Consumption doesn't support VNets, so you move to Flex Consumption. Private endpoints? Flex handles those too. You're still paying close to nothing at idle.

Then the surrounding infrastructure arrives. API Management adds $147 to $700. WAF protection adds $333. Each requirement passes its own cost-benefit test. None of them would make you question the architecture on their own. But the total floor lands somewhere between $530 and $1,080 per month, and the function plan itself is the smallest line item on the invoice.

The serverless pitch from Part 1 of this series was real. It just applies to a narrower set of workloads than most teams expect when they start. Once you're past that boundary, the question isn't whether to pay more. It's whether the Functions abstraction is still worth paying for, or whether Container Apps or App Service would give you the same outcome with less friction.

VNet: The Requirement That Changes Everything

Most enterprise environments mandate VNet integration for anything touching internal databases, key vaults behind private endpoints, or services that shouldn't be exposed to the public internet. The Consumption plan doesn't support VNet. That single requirement forces you into a different hosting plan, and each plan carries different pricing and operational constraints.

These are your options (East US pricing, April 2026):

Plan	VNet	Scale to Zero	Min Idle Cost/mo
Consumption	No	Yes	$0
Flex Consumption (on-demand)	Yes	Yes	$0
Flex Consumption (1 always-ready, 2 GB)	Yes	No	~$21
Premium EP1	Yes	No	~$146
Premium EP2	Yes	No	~$291
Dedicated S1/P1v3	Yes	No	varies
Container Apps (1 replica, 0.25 vCPU)	Yes	Yes	~$10

The jump from Consumption to Premium EP1 is $146/month for a single function app sitting idle. That's the cost of VNet access before your code processes a single request. Premium EP2 doubles it. These aren't theoretical numbers: they're the minimum monthly charges while your function waits for traffic.

Flex Consumption went GA in November 2024, and Microsoft now positions it as the recommended path for apps that need dynamic scaling with VNet support. In on-demand mode, Flex preserves the scale-to-zero model that made Consumption attractive. It also skips the Azure Files dependency (the shared file system Premium uses for deployment artifacts and runtime state). If your security team mandates private networking for storage, that saves roughly $30/month on private endpoint costs you'd otherwise pay on Premium. Under the hood, Flex uses shared gateways (up to 27 shared gateway IPs) instead of dedicated VNet-injected workers. That's how it keeps costs lower.

If you need guaranteed warm instances to avoid cold starts, Flex's always-ready configuration starts at about $21/month for one instance with 2 GB memory. That's still a fraction of Premium EP1.

But Flex has real constraints. Before you commit to it, compare what you're giving up:

Constraint	Flex Consumption	Premium
OS	Linux only	Windows + Linux
Apps per plan	One	Multiple
Deployment slots	Not supported	Up to 3
In-process .NET	Not supported	Supported
App init timeout	Fixed 30s	No limit
NFS file shares	No	Yes
Regional availability	Limited	Broad

The one-app-per-plan limitation is easy to overlook. You can't consolidate multiple function apps onto a single Flex plan the way you would with Premium. For teams running five or ten function apps, Premium's ability to share a single plan across all of them can actually cost less per app than running each on its own Flex instance.
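The crossover point is easy to work out from the idle-floor prices in the first table. The sketch below compares one shared Premium EP1 plan against one Flex always-ready instance per app. It assumes every app needs a warm instance and that all of them fit on a single EP1 instance, and it ignores execution charges, so it only shows where the idle floors cross.

```bash
# Idle-floor prices per month, taken from the plan table (East US).
PREMIUM_EP1=146
FLEX_ALWAYS_READY=21

# Print which option has the lower idle floor for a given number of apps.
cheaper_plan() {
  flex_total=$(( $1 * FLEX_ALWAYS_READY ))
  if [ "$flex_total" -gt "$PREMIUM_EP1" ]; then
    echo "apps=$1 flex=$flex_total premium=$PREMIUM_EP1 cheaper=premium"
  else
    echo "apps=$1 flex=$flex_total premium=$PREMIUM_EP1 cheaper=flex"
  fi
}

for apps in 1 5 7 10; do
  cheaper_plan "$apps"
done
```

On these numbers the shared Premium plan starts winning at seven apps. If the apps can tolerate cold starts, Flex in on-demand mode idles at $0 and the comparison never tips.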

The 30-second app init timeout is fixed. Not configurable. If your function app loads large dependency injection containers and connects to multiple databases at startup, 30 seconds may not be enough. Premium has no startup timeout limit, so heavy initialization is never a problem there.

If your codebase uses in-process .NET (the older hosting model where your function runs inside the Functions host process), Flex doesn't support it. You'd need to migrate to the isolated worker model first, which is its own project.

If you need Windows, deployment slots, or in-process .NET: Premium is your only option. If you're on Linux with the isolated worker model and can live with one app per plan, Flex Consumption gives you VNet support without abandoning scale-to-zero.

One more thing worth knowing: Linux Consumption is on a deprecation path. No new features after September 2025, with retirement scheduled for September 2028. Microsoft is pushing new workloads toward Flex Consumption, and the deprecation timeline makes that push harder to ignore.

The Plan Escalation Path

You start on Consumption. Your function triggers on an HTTP request, processes a message, writes to Cosmos DB. It costs nothing when idle. The serverless model, doing what it's supposed to do.

Then the requirements start arriving, one at a time.

VNet integration

Your security team requires all compute to run inside a virtual network. Consumption doesn't support VNet integration, so you move to Flex Consumption (on-demand only). This still scales to zero. You're still paying nothing at idle. No problem.

Running total: $0/mo idle

Private endpoints

Next review: inbound traffic to your function app must go through a private endpoint, and the backing storage accounts need private endpoints too.

Flex Consumption supports inbound private endpoints. You don't need to leave Flex for this. Your function app keeps scale-to-zero, and the private endpoint adds ~$7/month.

Flex also needs private endpoints for its backing storage accounts: Blob, Queue, and Table. Three endpoints, not four, because Flex has no Azure Files dependency. That's ~$22/mo for storage endpoints.

One deployment gotcha worth knowing: combining VNet integration with inbound private endpoints on Flex can cause deployment timeouts at the Kudu RemoveWorkersStep. The current workaround is temporarily removing the private endpoint during deployments, then re-adding it. Not ideal for automated pipelines, and worth factoring into your CI/CD design.

Running total: ~$29/mo (Flex on-demand)

The fork: when Premium becomes unavoidable

Most teams can stay on Flex through the VNet and private endpoint requirements. But Flex has constraints that force some teams onto Premium EP1:

Windows hosting required
Deployment slots for blue-green deployments
In-process .NET (not yet migrated to isolated worker)
Multiple function apps sharing a single plan
App init exceeding 30 seconds (Flex's hard timeout)

If any of these apply, EP1 gives you 1 vCPU and 3.5 GB of memory. The math: 1 vCPU at $116.80 plus 3.5 GB at $8.322 per GB = ~$146/mo. It runs 24/7 whether your function executes or not. Storage private endpoints on Premium cost ~$30/mo (four endpoints, including Azure Files).
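The EP1 arithmetic can be checked in a few lines. This is only a sketch using the unit prices quoted above; they are approximate list prices and vary by region and over time.

```csharp
using System;
using System.Globalization;

// Unit prices as quoted in the text (monthly, approximate).
const decimal VcpuPerMonth = 116.80m;
const decimal MemoryPerGbPerMonth = 8.322m;
const decimal PremiumStorageEndpoints = 30m; // four private endpoints, ~$30/mo per the text

// EP1 = 1 vCPU + 3.5 GB of memory, billed around the clock.
decimal ep1Compute = 1m * VcpuPerMonth + 3.5m * MemoryPerGbPerMonth;
decimal premiumTotal = ep1Compute + PremiumStorageEndpoints;

Console.WriteLine(ep1Compute.ToString("F2", CultureInfo.InvariantCulture));   // 145.93
Console.WriteLine(premiumTotal.ToString("F2", CultureInfo.InvariantCulture)); // 175.93
```

Rounded, those are the ~$146 and ~$176 figures used in the running totals.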

Running total if forced to Premium: ~$176/mo

Cold starts

On Flex, cold starts are still possible when scaling from zero. If your workload needs guaranteed warm instances, Flex's always-ready configuration starts at ~$21/month for one instance with 2 GB memory. On Premium, cold starts are a non-issue: EP1 keeps at least one instance warm by default.

Running total: ~$50/mo (Flex + always-ready) or ~$176/mo (Premium)

API Management

Your API needs rate limiting and a developer portal. You add Azure API Management. The pricing depends on what your organization needs:

APIM Basic (classic): ~$147/mo, no VNet integration, 99.95% SLA
APIM Standard v2: ~$700/mo, partial VNet support (backend only), 99.95% SLA
APIM Premium (classic): ~$2,795/mo, full VNet integration, 99.99% SLA

Most teams start with Basic and accept the VNet gap. Some compliance requirements force Standard v2 or higher.

Running total: ~$197/mo (Flex + Basic) or ~$323/mo (Premium + Basic)

WAF protection

Compliance also wants a Web Application Firewall in front of your API. You deploy Application Gateway WAF_v2.

The breakdown: $0.443/hour for 730 hours ($323), plus at least one capacity unit ($10.50), plus a public IP ($3.65). That's ~$333-335/mo.

Application Gateway v1 retires April 28, 2026, so WAF_v2 is the only option going forward.

Running total: ~$530/mo (Flex floor) or ~$656/mo (Premium floor)

The full picture

Consumption ($0 idle)
  + VNet requirement
  → Flex on-demand: still $0 idle

  + private endpoints (inbound + storage)
  → Flex: ~$29/mo (1 inbound PE + 3 storage PEs)
  → Premium (if forced by constraints): ~$176/mo (EP1 + 4 storage PEs)

  + cold start elimination
  → Flex always-ready: ~$21/mo
  → Premium: included (always-on)

  + API Management
  → APIM Basic: ~$147/mo
  → OR APIM Standard v2: ~$700/mo

  + WAF protection
  → Application Gateway WAF_v2: ~$333/mo

  Flex path:
  = ~$530/mo floor (Flex + always-ready + PEs + APIM Basic + WAF)
  = ~$1,083/mo ceiling (Flex + always-ready + PEs + APIM Standard v2 + WAF)

  Premium path (forced by constraints):
  = ~$656/mo floor (EP1 + PEs + APIM Basic + WAF)
  = ~$1,209/mo ceiling (EP1 + PEs + APIM Standard v2 + WAF)

Every one of these requirements is reasonable on its own. Your security team isn't wrong to ask for VNet integration. Private endpoints are a real protection. APIM and WAF exist because APIs need them.

The function plan itself is the smallest factor. On Flex, your compute cost at idle is $0 to $50/month. On Premium, it's $146 to $176. Either way, APIM and WAF add $480 to $1,033 on top. Those two services dominate the bill regardless of which Functions plan you choose.
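If you want to rerun these totals with your own quotes, a short sketch like the following is enough. The figures are the approximate monthly numbers from the sections above; the grouping into "base" costs is an assumption made for illustration, and the Flex base includes the always-ready instance in both the floor and the ceiling.

```csharp
using System;
using System.Linq;

// Approximate monthly figures taken from the sections above.
decimal[] flexBase = { 7m, 22m, 21m }; // inbound PE, 3 storage PEs, 1 always-ready instance
decimal[] premiumBase = { 146m, 30m }; // EP1, 4 storage PEs
const decimal ApimBasic = 147m;
const decimal ApimStandardV2 = 700m;
const decimal WafV2 = 333m;

decimal Total(decimal[] planCosts, decimal apim) => planCosts.Sum() + apim + WafV2;

Console.WriteLine($"Flex:    {Total(flexBase, ApimBasic)} to {Total(flexBase, ApimStandardV2)}");       // 530 to 1083
Console.WriteLine($"Premium: {Total(premiumBase, ApimBasic)} to {Total(premiumBase, ApimStandardV2)}"); // 656 to 1209
Console.WriteLine($"APIM + WAF alone: {ApimBasic + WafV2} to {ApimStandardV2 + WafV2}");                // 480 to 1033
```

Swapping in a different APIM tier or dropping the always-ready instance is a one-line change, which makes it easy to see how little the plan choice moves the total.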

Build and Deploy Friction

The cost story from the plan escalation section is the monthly bill. The deployment story is the engineering time you spend before your code even runs.

You cannot convert a function app from one plan to another in place. Moving from Consumption to Flex Consumption, or from Premium to Flex, means creating a new function app, redeploying your code, and deleting the old one. There is no az functionapp update --sku FC1. Microsoft's own migration guide recommends running both apps in parallel during a transition period, then cutting over. For production workloads, that's a blue-green deployment you didn't plan for.

The app settings cleanup

Flex Consumption deprecates roughly 20 app settings and site properties that other plans rely on. If you copy your existing configuration to a new Flex app without cleaning it up, the deployment fails or the app behaves unpredictably.

These settings must be removed:

# Deployment (handled by functionAppConfig.deployment.storage on Flex)
WEBSITE_RUN_FROM_PACKAGE

# Azure Files (Flex has no Azure Files dependency)
WEBSITE_CONTENTAZUREFILECONNECTIONSTRING
WEBSITE_CONTENTSHARE
WEBSITE_SKIP_CONTENTSHARE_VALIDATION

# Networking (inherited from the integrated VNet on Flex)
WEBSITE_CONTENTOVERVNET
WEBSITE_VNET_ROUTE_ALL
WEBSITE_DNS_SERVER

# Runtime (managed via functionAppConfig.runtime on Flex)
FUNCTIONS_EXTENSION_VERSION
FUNCTIONS_WORKER_RUNTIME

# Scaling (renamed in functionAppConfig.scaleAndConcurrency)
WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT

The most dangerous one is WEBSITE_RUN_FROM_PACKAGE. On Consumption and Premium, this setting controls how your code gets deployed. On Flex, it must not exist. Flex uses functionAppConfig.deployment.storage to point at a blob container instead of Azure Files. If WEBSITE_RUN_FROM_PACKAGE is still present, the deployment silently uses the wrong mechanism.

Site properties change too. alwaysOn must be false on Flex (it's invalid), but true on Premium and Dedicated. functionsRuntimeScaleMonitoringEnabled is unnecessary on Flex because scale monitoring is built in, but forgetting to remove it won't break anything. ARM template properties like linuxFxVersion, containerSize, and isReserved are all replaced by the functionAppConfig section.
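One way to avoid copying stale settings by hand is to filter an exported settings list before applying it to the new app. The sketch below is a minimal illustration, not a migration tool: the deny-list contains only the settings named above (the full list is longer), and the `existing` dictionary is a hypothetical stand-in for whatever export of the old app's settings you have.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Settings listed above as invalid on Flex Consumption (not the complete list).
var removeOnFlex = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
{
    "WEBSITE_RUN_FROM_PACKAGE",
    "WEBSITE_CONTENTAZUREFILECONNECTIONSTRING",
    "WEBSITE_CONTENTSHARE",
    "WEBSITE_SKIP_CONTENTSHARE_VALIDATION",
    "WEBSITE_CONTENTOVERVNET",
    "WEBSITE_VNET_ROUTE_ALL",
    "WEBSITE_DNS_SERVER",
    "FUNCTIONS_EXTENSION_VERSION",
    "FUNCTIONS_WORKER_RUNTIME",
    "WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT",
};

// Hypothetical export of the old app's settings; names and values are examples only.
var existing = new Dictionary<string, string>
{
    ["WEBSITE_RUN_FROM_PACKAGE"] = "1",
    ["FUNCTIONS_WORKER_RUNTIME"] = "dotnet-isolated",
    ["OrdersQueue__queueServiceUri"] = "https://example.queue.core.windows.net",
};

// Keep everything that is still valid on Flex.
var forFlex = existing
    .Where(setting => !removeOnFlex.Contains(setting.Key))
    .ToDictionary(setting => setting.Key, setting => setting.Value);

// Report what was dropped so the change is visible in a pipeline log.
var dropped = existing.Keys
    .Where(key => removeOnFlex.Contains(key))
    .OrderBy(key => key)
    .ToList();

Console.WriteLine($"Kept:    {string.Join(", ", forFlex.Keys)}");
Console.WriteLine($"Dropped: {string.Join(", ", dropped)}");
```

Logging the dropped keys matters more than the filtering itself: it gives you a record of which behaviors now have to be expressed through functionAppConfig instead.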

Infrastructure as Code breaks across plans

Your Terraform and Bicep templates don't just need new property values. They need different resources entirely.

In Terraform, azurerm_linux_function_app does not work with the Flex Consumption SKU. Attempting to provision it with an FC1 service plan fails. You need azurerm_function_app_flex_consumption, a separate resource introduced in AzureRM provider v4.21.0:

# Consumption / Premium: this resource
resource "azurerm_linux_function_app" "func" {
  service_plan_id = azurerm_service_plan.plan.id
  # ...
}

# Flex Consumption: different resource, different schema
resource "azurerm_function_app_flex_consumption" "func" {
  service_plan_id = azurerm_service_plan.plan.id

  site_config {}

  storage_container_type     = "blobContainer"
  storage_container_endpoint = "${azurerm_storage_account.sa.primary_blob_endpoint}${azurerm_storage_container.deploy.name}"
  # ...
}

The Flex resource requires a blob storage container for deployments (no Azure Files), supports maximum_instance_count and instance_memory_mb properties that don't exist on the standard resource, and has its own quirks. As of early 2026, you still need to set AzureWebJobsStorage to an empty string as a workaround when using managed identity authentication, then use AzureWebJobsStorage__accountName for the actual connection.

In Bicep, the SKU values map to different tiers:

Plan	SKU Name	SKU Tier
Consumption	Y1	Dynamic
Flex Consumption	FC1	FlexConsumption
Premium	EP1/EP2/EP3	ElasticPremium

The FC1 plan also requires reserved: true and a functionAppConfig section that replaces most of the properties you'd normally set as app settings. That's a structural rewrite of your deployment template, not a property change.

CI/CD pipeline adjustments

GitHub Actions requires the sku parameter in azure/functions-action when deploying to Flex Consumption with a publish profile:

- uses: Azure/functions-action@v1
  with:
    app-name: ${{ env.FUNCTION_APP_NAME }}
    package: ${{ env.PACKAGE_PATH }}
    publish-profile: ${{ secrets.PUBLISH_PROFILE }}
    sku: 'flexconsumption'
    remote-build: 'true'

Without sku: 'flexconsumption', the action deploys using the standard Consumption mechanism, which fails silently or produces a broken deployment. With OIDC or service principal authentication, the action can auto-detect the SKU, but publish profile deployments need it explicitly.

The scm-do-build-during-deployment and enable-oryx-build flags that you might have in your existing workflow are also wrong for Flex. Flex always performs an Oryx build during remote deployment. Setting those flags manually can interfere with the process.

Private endpoints break GitHub-hosted runners

If your function app runs on Premium with private endpoints enabled, the SCM/Kudu site is not publicly reachable. GitHub-hosted runners cannot connect to it. Your deployment fails with Failed to fetch Kudu App Settings (CODE: 404) or a 401, and the error message gives you almost no indication that networking is the problem.

Your options:

Self-hosted runner inside the VNet: works, but now you're maintaining a VM ($50-100/month) to deploy a "serverless" function
GitHub-hosted runners with Azure private networking: GitHub can inject a runner NIC directly into your VNet subnet, giving hosted runners private access without self-hosted infrastructure. Requires a GitHub Team or Enterprise Cloud plan and larger runners (2-64 vCPU, per-minute billing). Supported in 25 Azure regions as of early 2026, but notably not West Europe.
Deploy via ARM using a service principal: bypasses SCM entirely, pushes configuration through the Azure Resource Manager API
Stage to blob storage: upload your package to a storage account the function app can reach, then trigger deployment from there

On Premium, you also need WEBSITE_SKIP_CONTENTSHARE_VALIDATION=1 in your ARM or Bicep templates when the backing storage account has a firewall or private endpoints. Without it, the ARM deployment fails during content share validation because the deployment engine can't reach the storage account through the private endpoint.

The compound effect

Any one of these issues is a half-day fix. The compound effect is what costs real engineering time: you change plans, which changes your Terraform resources, which changes your app settings, which changes your GitHub Actions workflow, which breaks because of private endpoint networking. Each layer has its own failure mode, its own error messages, and its own documentation scattered across different Microsoft Learn pages.

The real cost of plan migration shows up in the sprint consumed by infrastructure work, not in the Azure bill.

When Serverless Stops Making Sense

At some point, the friction outweighs the abstraction. If you're paying $530/month or more, fighting plan migrations in Terraform, and managing deployment workarounds because private endpoints interfere with your CI pipeline, you should be asking: is the Functions hosting model still earning its keep?

Signs it's time to look elsewhere:

Your total infrastructure cost exceeds what the "serverless" label saves you in operational effort
You're spending more time working around platform constraints than building features
Your build and deploy pipeline is already as complex as it would be with containers
Your team needs operational control (sidecars, traffic splitting, custom health probes) that Functions doesn't expose

The two alternatives worth evaluating are Azure Container Apps and App Service. They solve different problems.

Azure Container Apps: the container-native path

Azure Container Apps (ACA) can host the Functions runtime directly. The v2 model, which Microsoft recommends for all new deployments, creates a single Microsoft.App resource with kind=functionapp. No hidden proxy resources, no dual-resource management. Your function app is a container app with Functions triggers and bindings wired in.

The resource definition looks like a normal container app deployment:

az containerapp create \
  --name my-func-app \
  --resource-group rg-prod \
  --environment my-aca-env \
  --image myregistry.azurecr.io/my-func:latest \
  --kind functionapp \
  --min-replicas 0 \
  --max-replicas 10

KEDA (Kubernetes-based Event Driven Autoscaling) handles scaling. The Functions runtime automatically configures KEDA scale rules based on your triggers. You don't write KEDA definitions yourself; the platform infers them from your bindings. HTTP, Service Bus, Event Hubs, Queue Storage, and other triggers all map to KEDA scalers behind the scenes, and your app can scale to zero when idle.

What you gain over Premium Functions:

Cost: a single replica at 0.25 vCPU / 512 MB idles at roughly $10/month on the Consumption workload profile. Compare that to Premium EP1 at ~$146/month. With scale-to-zero, idle cost drops to $0.
Sidecar containers: run log forwarders, auth proxies, or Dapr sidecars alongside your function app in the same pod
Dapr integration: pub/sub, state management, and service invocation without managing the infrastructure
Traffic splitting via revisions: route a percentage of traffic to a new version before promoting it, something Functions deployment slots can't do with the same granularity
GPU support: if you're running inference workloads alongside event-driven functions, ACA supports GPU-backed workload profiles

What you give up:

Containerization is mandatory. There is no code-only deployment path. You build a Docker image, push it to a registry, and deploy from there. If your team has no container experience, this is a real adoption cost.
No built-in continuous deployment from the Functions tooling. You wire up GitHub Actions or Azure Pipelines yourself.
Inbound Private Endpoints through the Functions networking layer are not available. The Functions networking features table on Microsoft's docs shows a blank cell for "Inbound Private Endpoints" under the Container Apps column. ACA itself supports private endpoints at the environment level (workload profiles environments only), but the Functions-specific private endpoint feature does not carry over. If your compliance requirements specifically mandate Functions inbound private endpoints, Flex Consumption or Premium are your options.

The environment-level private endpoint on ACA carries additional charges through the Dedicated Plan Management fee. Budget roughly $67-70/month for this capability, which applies at the environment level regardless of how many apps you run inside it.

App Service: the predictable option

App Service doesn't get much attention in the serverless conversation, but it's worth considering if your workload has predictable traffic and you've already left the scale-to-zero model behind. On Premium EP1, you're paying ~$146/month for a single function app that never scales to zero anyway. An App Service P1v3 plan gives you 2 vCPUs and 8 GB of memory. It supports multiple apps on the same plan, full deployment slots (up to 5 on Standard, 20 on Premium), and no cold starts. The pricing is comparable, and you get operational features that Functions on Premium doesn't match.

App Service won't give you KEDA-based event scaling or scale-to-zero. It's a fixed-compute model. But if you're already paying for always-on compute through Premium Functions, the question is whether the Functions event-driven abstractions justify the constraints that come with them.

Picking the right exit

The choice depends on what pushed you away from Functions in the first place:

If your main frustration is cost, Container Apps on the Consumption workload profile gives you scale-to-zero with VNet support at a fraction of Premium pricing. You keep the Functions programming model, triggers, and bindings.

If your frustration is operational control, Container Apps gives you sidecars, revisions, and a container runtime you can customize. The trade-off is containerization overhead and the requirement to build and manage container images.

Teams frustrated by complexity for a steady workload often find that App Service is the right fit. It strips away the serverless machinery and gives you a predictable compute environment with mature deployment tooling.

None of these are a universal upgrade. Each one trades a Functions limitation for a different set of constraints. The point is to make that trade consciously, not to discover it after migration.

Making an Honest Choice

Three plans, three different products.

Consumption is genuine serverless. Your code runs, you pay for execution time, it scales to zero when idle. If your workload is public-facing, doesn't need VNet access, and won't face a security review that mandates private networking, Consumption is the right plan. It does exactly what the marketing says. The catch: Linux Consumption entered restricted feature mode in September 2025, with full retirement scheduled for September 2028. New workloads shouldn't start here.

Flex Consumption is serverless with enterprise networking. It went GA in November 2024 and it's the plan Microsoft recommends for new dynamic-scale workloads in 2026. You get VNet integration, inbound private endpoints, scale-to-zero, and no Azure Files dependency. The constraints are real (Linux only, one app per plan, no deployment slots, 30-second init timeout), but for teams that can work within them, Flex keeps the serverless economics intact while passing a security review.

Premium is managed compute with event-driven scaling. It is not serverless. You're paying for always-on instances whether traffic arrives or not. Premium exists because some requirements (Windows, deployment slots, in-process .NET, multiple apps per plan) have no other home. If you're on Premium, own that decision. Budget for it as compute, not as serverless with extra features.

The distinction matters most at the moment you least want to think about it: during the security review, when someone asks why your function app can't reach a private endpoint. Know which plan you're actually buying before that conversation starts. Migrating between plans means deleting and recreating the function app, updating Terraform resources, and rewriting deployment pipelines. It's not a configuration change. It's a project.

Has a security review pushed you from Consumption to Premium, or did you start on Premium from day one?

Azure Functions for .NET Developers: Series

Part 1: Why Azure Functions? Serverless for .NET Developers

Part 2: Your First Azure Function: HTTP Triggers Step-by-Step

Part 3: Beyond HTTP: Timer, Queue, and Blob Triggers

Part 4: Local Development Setup: Tools, Debugging, and Hot Reload

Part 5: Understanding the Isolated Worker Model

Part 6: Configuration Done Right: Settings, Secrets, and Key Vault

Part 7: Testing Azure Functions: Unit, Integration, and Local

Part 8: Deploying to Azure: CI/CD with GitHub Actions

Part 9: Azure Functions Observability: From Blind Spots to Production Clarity

Bonus: Production Realities: When Serverless Stops Being Serverless (this article)

Azure Functions Observability: From Blind Spots to Production Clarity

Martin Oehlert — Fri, 03 Apr 2026 06:25:08 +0000

Your function works locally, passes all tests, and deploys without errors. But how do you know it's healthy at 2am when a queue-triggered function silently drops messages? With a traditional web app on a VM, you'd SSH in, check logs, inspect process health. Serverless strips all of that away.

The observability gap in serverless is real, and it's structural. Your function runs inside an ephemeral container that spins up on demand, processes an event, and disappears. There's no persistent server to monitor, no process to attach a debugger to, no /var/log to tail. When your function app scales to zero between invocations, even continuous metric collection breaks down: there is literally nothing running to emit telemetry.

And when it scales from zero to fifty concurrent instances under load, correlating a single failed request across that distributed execution becomes a different problem entirely. Traditional APM tools assume long-lived processes with stable identities. Serverless functions violate every one of those assumptions.

Application Insights fills that gap. When connected to your function app (via a connection string, not the deprecated instrumentation key), it automatically captures request telemetry for every function execution, tracks dependencies like HTTP calls and database queries, collects host-level performance counters, and aggregates invocation metrics you can query from the portal or through code.

On top of that, it gives you structured log queries with KQL (Kusto Query Language), an application map that visualizes how your function calls downstream services, distributed tracing that follows a single request across multiple functions and dependencies, and alerting rules that can page you before your users notice something is wrong.

The examples below use the classic Application Insights SDK with the isolated worker model, which is what most production .NET function apps run today. The companion repository at azure-functions-samples has working examples of both the classic SDK (HttpTriggerDemo) and OpenTelemetry (EventHubDemo).

Setting Up Application Insights

Creating the Resource

You can create an Application Insights resource through the Azure Portal (search "Application Insights" and click Create) or provision it with Bicep alongside your function app:

resource appInsights 'Microsoft.Insights/components@2020-02-02' = {
  name: 'appi-orders-prod'
  location: location
  kind: 'web'
  properties: {
    Application_Type: 'web'
    WorkspaceResourceId: logAnalyticsWorkspace.id
  }
}

The resource gives you a connection string, which looks like InstrumentationKey=<guid>;IngestionEndpoint=https://region.in.applicationinsights.azure.com/. Use this, not the instrumentation key alone. Microsoft deprecated standalone instrumentation key ingestion in March 2025, and connection strings are required for sovereign clouds, regional endpoints, and Entra ID-authenticated ingestion. Store the value in your function app's APPLICATIONINSIGHTS_CONNECTION_STRING application setting, and Azure picks it up automatically.
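A quick way to catch a pasted bare instrumentation key is to check the shape of the setting before you deploy. The sketch below is a hypothetical guard, not part of any SDK: it splits a placeholder value in the key=value;key=value form shown above and reports which parts are present.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Placeholder value in the documented shape. Not a real key.
string connectionString =
    "InstrumentationKey=00000000-0000-0000-0000-000000000000;" +
    "IngestionEndpoint=https://region.in.applicationinsights.azure.com/";

// Split into key/value pairs; anything without an '=' is ignored.
Dictionary<string, string> parts = connectionString
    .Split(';', StringSplitOptions.RemoveEmptyEntries)
    .Select(pair => pair.Split('=', 2))
    .Where(pair => pair.Length == 2)
    .ToDictionary(pair => pair[0].Trim(), pair => pair[1].Trim(), StringComparer.OrdinalIgnoreCase);

bool hasKey = parts.ContainsKey("InstrumentationKey");
bool hasEndpoint = parts.ContainsKey("IngestionEndpoint");

Console.WriteLine(hasKey
    ? "Value is in connection string form."
    : "Value does not look like a connection string; a bare key may have been pasted.");
Console.WriteLine(hasEndpoint
    ? $"Ingestion endpoint: {parts["IngestionEndpoint"]}"
    : "No IngestionEndpoint found; check that the full connection string was copied.");
```

A bare GUID produces an empty dictionary here, so both checks fail and the problem surfaces before the app starts dropping telemetry.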

NuGet Packages

The isolated worker model needs two packages:

dotnet add package Microsoft.ApplicationInsights.WorkerService --version 2.22.0
dotnet add package Microsoft.Azure.Functions.Worker.ApplicationInsights --version 2.0.0

Pin Microsoft.ApplicationInsights.WorkerService to 2.22.0. Version 3.0.0 migrated to OpenTelemetry internally and broke the ITelemetryInitializer interface that Microsoft.Azure.Functions.Worker.ApplicationInsights depends on. The result is a TypeLoadException at startup:

System.TypeLoadException: Could not load type
'Microsoft.ApplicationInsights.Extensibility.ITelemetryInitializer'
from assembly 'Microsoft.ApplicationInsights, Version=3.0.0.1'

This affects every .NET version (not just .NET 10). Until the Functions worker package ships a compatible update, stay on 2.22.0. Add a comment in your .csproj so the next person who runs dotnet outdated doesn't blindly upgrade:

<!-- Do NOT upgrade WorkerService to 3.x: it breaks the Functions worker. Check github.com/Azure/azure-functions-dotnet-worker/issues/3322 before upgrading either package. -->
<PackageReference Include="Microsoft.ApplicationInsights.WorkerService" Version="2.22.0" />
<PackageReference Include="Microsoft.Azure.Functions.Worker.ApplicationInsights" Version="2.0.0" />

The second package, Microsoft.Azure.Functions.Worker.ApplicationInsights, is what connects your dependency telemetry (HTTP calls, SQL queries, queue operations) back to the parent function invocation. Without it, correlation breaks.

Program.cs Configuration

Two method calls handle the setup:

using Microsoft.Azure.Functions.Worker;
using Microsoft.Azure.Functions.Worker.Builder;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

var builder = FunctionsApplication.CreateBuilder(args);

builder.Services
    .AddApplicationInsightsTelemetryWorkerService()
    .ConfigureFunctionsApplicationInsights();

builder.Build().Run();

AddApplicationInsightsTelemetryWorkerService() registers the Application Insights SDK for worker-style apps (background services, Functions). ConfigureFunctionsApplicationInsights() hooks into the Functions runtime's activity pipeline so that incoming triggers and outbound calls produce the right request and dependency telemetry.

See this in context in the HttpTriggerDemo Program.cs.

One catch: the SDK registers a default logging filter that suppresses everything below Warning. If you leave it in place, your ILogger.LogInformation() calls never reach Application Insights. Remove it explicitly:

using Microsoft.Azure.Functions.Worker;
using Microsoft.Azure.Functions.Worker.Builder;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

var builder = FunctionsApplication.CreateBuilder(args);

builder.Services
    .AddApplicationInsightsTelemetryWorkerService()
    .ConfigureFunctionsApplicationInsights();

builder.Logging.Services.Configure<LoggerFilterOptions>(options =>
{
    LoggerFilterRule? defaultRule = options.Rules.FirstOrDefault(
        rule => rule.ProviderName ==
            "Microsoft.Extensions.Logging.ApplicationInsights.ApplicationInsightsLoggerProvider");

    if (defaultRule is not null)
    {
        options.Rules.Remove(defaultRule);
    }
});

builder.Build().Run();

Alternatively, manage log levels through appsettings.json (loaded automatically by FunctionsApplication.CreateBuilder):

{
  "Logging": {
    "LogLevel": {
      "Default": "Information",
      "Microsoft": "Warning"
    },
    "ApplicationInsights": {
      "LogLevel": {
        "Default": "Information"
      }
    }
  }
}

What Gets Auto-Collected vs. What You Add Manually

Once that's in place, the SDK and the Functions runtime collect telemetry automatically, with no extra code:

Gotcha: Two Log Pipelines, Two Configurations

The isolated worker model runs your code in a separate process from the Functions host. This means host.json controls logging for the host process (trigger dispatch, scaling decisions, extension lifecycle), while your Program.cs or appsettings.json controls logging for the worker process (your function code, your dependencies, your ILogger calls). If you set "Default": "Information" in host.json but never configure your worker, your application logs still default to Warning only. You have to configure both sides, and they use different files.

Logging Best Practices

Structured Logging with ILogger

Structured logging writes log entries as key-value pairs instead of flat strings. In Application Insights, those keys become columns in the customDimensions property of the traces table, which means you can filter and aggregate by them in KQL without parsing text.

The ILogger API supports this through message templates: named placeholders wrapped in curly braces, filled by positional arguments.

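using System.Diagnostics;
using Microsoft.Azure.Functions.Worker;
using Microsoft.Extensions.Logging;

public class ProcessOrderFunction(ILogger<ProcessOrderFunction> logger)
{
    [Function(nameof(ProcessOrderFunction))]
    public async Task Run(
        [QueueTrigger("orders")] OrderMessage message)
    {
        // Named placeholders become customDimensions keys in Application Insights.
        logger.LogInformation(
            "Order received: {OrderId} from customer {CustomerId}, total {OrderTotal}",
            message.OrderId,
            message.CustomerId,
            message.Total);

        // Time the processing step so the duration is queryable as ElapsedMs.
        var stopwatch = Stopwatch.StartNew();
        await ProcessAsync(message); // application-specific processing, not shown here
        stopwatch.Stop();

        logger.LogInformation(
            "Order {OrderId} processed successfully in {ElapsedMs}ms",
            message.OrderId,
            stopwatch.ElapsedMilliseconds);
    }
}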

Two things to watch here. First, use PascalCase for placeholder names ({OrderId}, not {orderId} or {order_id}). Application Insights stores these as customDimensions keys, and PascalCase matches the convention for the rest of the telemetry schema. Second, never use string interpolation ($"Order {orderId}") in log calls. Interpolated strings defeat structured logging entirely: the provider receives a pre-formatted string with no queryable fields, and the string is built even when the log level is disabled.
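You can see the difference without Application Insights at all. The sketch below uses a hypothetical CapturingLogger (a test double written for this example, not part of any SDK) that records which field names reach the provider. It assumes a console project with nullable reference types enabled and a recent Microsoft.Extensions.Logging.Abstractions package, which a Functions project already pulls in transitively.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Extensions.Logging;

var logger = new CapturingLogger();
string orderId = "ORD-20260330-1847";

// Message template: the provider receives OrderId as a named field.
logger.LogInformation("Order received: {OrderId}", orderId);

// Interpolated string: the provider receives finished text and nothing else.
logger.LogInformation($"Order received: {orderId}");

foreach (string[] keys in logger.CapturedKeys)
{
    Console.WriteLine(string.Join(", ", keys));
}
// Prints:
// OrderId, {OriginalFormat}
// {OriginalFormat}

// Hypothetical test double: records the field names of each log entry.
sealed class CapturingLogger : ILogger
{
    public List<string[]> CapturedKeys { get; } = new();

    public IDisposable? BeginScope<TState>(TState state) where TState : notnull => null;

    public bool IsEnabled(LogLevel logLevel) => true;

    public void Log<TState>(LogLevel logLevel, EventId eventId, TState state,
        Exception? exception, Func<TState, Exception?, string> formatter)
    {
        // The built-in extension methods pass the template fields as key-value pairs.
        string[] keys = state is IEnumerable<KeyValuePair<string, object?>> fields
            ? fields.Select(field => field.Key).ToArray()
            : Array.Empty<string>();
        CapturedKeys.Add(keys);
    }
}
```

The first call carries an OrderId field a provider can turn into a customDimensions key; the second carries only the text itself.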

In the Application Insights traces table, a query like this pulls all logs for a specific order:

traces
| where customDimensions.OrderId == "ORD-20260330-1847"
| project timestamp, message, customDimensions.CustomerId, customDimensions.ElapsedMs
| order by timestamp asc

High-Performance Logging with Source Generators

For functions processing thousands of messages per second (high-throughput queue or Event Hub triggers), the standard LogInformation extension methods have measurable overhead: they box value-type arguments, allocate a params object[] on every call, and parse the message template at runtime.

The [LoggerMessage] source generator eliminates all three costs by generating strongly typed methods at compile time:

using Microsoft.Extensions.Logging;

public static partial class OrderLogs
{
    [LoggerMessage(
        EventId = 1001,
        Level = LogLevel.Information,
        Message = "Order {OrderId} received from {CustomerId}, total {OrderTotal}")]
    public static partial void OrderReceived(
        ILogger logger, string orderId, string customerId, decimal orderTotal);

    [LoggerMessage(
        EventId = 1002,
        Level = LogLevel.Warning,
        Message = "Order {OrderId} retry attempt {RetryCount}")]
    public static partial void OrderRetrying(
        ILogger logger, string orderId, int retryCount);
}

Call these directly: OrderLogs.OrderReceived(logger, message.OrderId, message.CustomerId, message.Total). The generated code checks the log level before it boxes or formats anything, so a call costs next to nothing when Information-level logging is disabled in production. Use this pattern on hot paths; for functions running a few times per minute, the standard extension methods are fine. The companion repo has source generator examples in both HttpTriggerDemo/Logging/OrderLogs.cs and EventHubDemo/Logging/SensorLogs.cs.

Correlation with BeginScope

Individual log lines tell you what happened; BeginScope ties related entries into a single operation. When you wrap your function body in a scope, every log entry inside it automatically inherits the scope's properties as customDimensions in Application Insights.

The critical detail: you must pass a Dictionary<string, object> to BeginScope for the keys to appear as individual customDimensions columns. A plain string or a message template with arguments produces a single formatted string in the scope, which is much harder to query.

[Function(nameof(ProcessOrderFunction))]
public async Task Run(
    [QueueTrigger("orders")] OrderMessage message)
{
    using var scope = logger.BeginScope(new Dictionary<string, object>
    {
        ["OrderId"] = message.OrderId,
        ["CustomerId"] = message.CustomerId,
        ["TenantId"] = message.TenantId
    });

    logger.LogInformation("Validating order");
    await ValidateAsync(message);

    logger.LogInformation("Charging payment");
    await ChargePaymentAsync(message);

    logger.LogInformation("Order complete");
}

Every log line inside that using block now carries OrderId, CustomerId, and TenantId in its customDimensions, even the ones from ValidateAsync and ChargePaymentAsync. The scope flows with the async call context, so this holds as long as those methods log through a logger created by the same logger factory while the scope is open. This is how you trace a complete business operation across multiple internal methods without threading correlation IDs through every method signature.
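
Scope properties become useful once you aggregate on them. As a sketch, assuming the TenantId key from the scope above reaches customDimensions unchanged, this groups warnings and errors by tenant (the Problems column name is illustrative):

```kusto
traces
| where timestamp > ago(24h)
| where severityLevel >= 2 // 2 = Warning, 3 = Error, 4 = Critical
| extend TenantId = tostring(customDimensions.TenantId)
| where isnotempty(TenantId)
| summarize Problems = count() by TenantId, severityLevel
| order by Problems desc
```

Rows with an empty TenantId are dropped here on purpose; if most of your traces fall into that bucket, the scope is probably not wrapping the code that logs.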

Log Levels for Production

The host.json logLevel section controls which categories reach Application Insights. Two categories are easy to misconfigure, and getting them wrong silently breaks your monitoring dashboards.

{
  "logging": {
    "logLevel": {
      "default": "Warning",
      "Function": "Information",
      "Host.Results": "Information",
      "Host.Aggregator": "Trace"
    }
  }
}

Host.Results feeds the requests table. If you raise this above Information, successful function executions stop appearing in the Application Insights Performance and Failures blades, and the Function Monitor tab in the portal goes blank. You lose your primary visibility into whether functions are running at all.

Host.Aggregator feeds the customMetrics table with aggregated counts and durations. Set it to Trace so the runtime writes every batch. If you raise this to Warning or higher, the function overview dashboard in the portal loses its success rate and duration charts.

The default: Warning baseline keeps noise low for framework categories (Microsoft.*, Worker, System.*) while Function: Information ensures your own function logs reach Application Insights.
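
One way to confirm that both host categories are getting through is to check that the two tables they feed are still receiving rows. A rough check, assuming the function app has run at least once in the last hour (SourceTable, Rows, and Latest are illustrative names):

```kusto
union withsource=SourceTable requests, customMetrics
| where timestamp > ago(1h)
| summarize Rows = count(), Latest = max(timestamp) by SourceTable
```

If requests is missing from the result, look at Host.Results; if customMetrics is missing, look at Host.Aggregator.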

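When you need to change log levels without redeploying, override any host.json value through app settings. The setting name starts with AzureFunctionsJobHost, and every level of the host.json path, including the dot inside a category name such as Function.ProcessOrder, is joined with a double underscore:

AzureFunctionsJobHost__logging__logLevel__Function__ProcessOrder = Debug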

This takes effect on the next function host restart (which happens automatically when you update an app setting) and lets you temporarily increase verbosity for a single function without touching host.json.

Sampling Configuration

Application Insights enables adaptive sampling by default, targeting 20 telemetry items per second per host. At low volume, you won't notice. At scale, sampling can silently discard traces, dependencies, and custom events before they reach your workspace.

The recommended production configuration excludes Request and Exception from sampling, so you never lose function execution records or error details:

{
  "logging": {
    "applicationInsights": {
      "samplingSettings": {
        "isEnabled": true,
        "maxTelemetryItemsPerSecond": 20,
        "excludedTypes": "Request;Exception"
      }
    }
  }
}

To check whether sampling is actively dropping data, run this KQL query. Any row where TelemetrySavedPercentage is below 100 means that telemetry type is being sampled:

union traces, dependencies, requests
| where timestamp > ago(1d)
| summarize
    TelemetrySavedPercentage = round(100.0 / avg(itemCount), 1),
    TelemetryDroppedPercentage = round(100.0 - 100.0 / avg(itemCount), 1)
    by bin(timestamp, 1h), itemType
| where TelemetrySavedPercentage < 100
| order by timestamp desc

The itemCount field on each telemetry item tells you how many similar items it represents. An itemCount of 5 means Application Insights kept one item and estimated it represents five. If your traces show 30% dropped, either raise maxTelemetryItemsPerSecond or add Trace to excludedTypes for the categories that matter most to your debugging workflow. Watch your ingestion costs, though: excluding too many types from sampling at high volume can push you past your daily data cap quickly.
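
The same field lets you recover approximate true volumes from sampled data: sum itemCount instead of counting rows. A small sketch (the StoredRows, EstimatedInvocations, and SamplingRatio names are illustrative):

```kusto
requests
| where timestamp > ago(1d)
| summarize StoredRows = count(), EstimatedInvocations = sum(itemCount) by name
// A ratio of 1 means nothing was sampled away for that function
| extend SamplingRatio = round(1.0 * EstimatedInvocations / StoredRows, 2)
| order by EstimatedInvocations desc
```

Any dashboard or alert that uses count() on a sampled table undercounts by roughly that ratio.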

Reading Traces and Metrics

Once telemetry is flowing into Application Insights, you need to know where to look and what to ask. The portal gives you three entry points: Transaction Search for hunting specific executions, Log queries (KQL) for anything that requires aggregation or correlation, and the Application Map for a visual snapshot of your function app's dependencies.

Transaction Search

Transaction Search is the fastest way to find what happened to a specific function execution. Open it from the left nav in your Application Insights resource, or use the shortcut from the Investigate section of the overview blade.

The filters that matter most for Azure Functions:

Operation name: the function name as registered in the runtime (e.g., ProcessOrderFunction). Filter here when you want all executions of a specific function in a time window.
Result code: for HTTP triggers, this is the HTTP status code (200, 500, etc.). For non-HTTP triggers (queue, timer, blob), 0 means success and 1 means failure. Combine with operation name to pull only failed runs.
Time range: narrow this first, before adding other filters. Application Insights searches can time out on broad time ranges at high volume.

Click any result to open the End-to-end transaction view. This is where distributed tracing pays off: you'll see the full execution timeline as a Gantt chart, with your function's request at the top and every outbound dependency (HTTP calls to payment APIs, SQL queries, Service Bus operations) shown as child spans with their durations. If a queue-triggered ProcessOrderFunction call failed at 2am, this view tells you whether the failure was in your code, in a downstream HTTP call, or in a database query.

One limitation: Transaction Search shows individual telemetry items, not aggregated data. If you want to know "how many orders failed between 1am and 3am and which customer IDs were affected", you need KQL.
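
That question can be sketched in KQL as follows. It assumes the function logs OrderId and CustomerId through a scope or message template, and the date is a placeholder (timestamps are in UTC); the traces window is padded by five minutes because log lines can land slightly before or after the request record:

```kusto
requests
| where timestamp between (datetime(2026-03-30 01:00) .. datetime(2026-03-30 03:00))
| where name == "ProcessOrderFunction" and success == false
| project operation_Id
| join kind=inner (
    traces
    | where timestamp between (datetime(2026-03-30 00:55) .. datetime(2026-03-30 03:05))
    | extend OrderId = tostring(customDimensions.OrderId),
             CustomerId = tostring(customDimensions.CustomerId)
    | where isnotempty(OrderId)
    | distinct operation_Id, OrderId, CustomerId
) on operation_Id
// dcount avoids double-counting an order that failed on several retries
| summarize FailedOrders = dcount(OrderId) by CustomerId
| order by FailedOrders desc
```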

KQL Essentials for Azure Functions

All four tables you'll use most (requests, dependencies, traces, exceptions) share an operation_Id column. That ID is the distributed trace ID that ties every log line, dependency call, and exception back to the single function invocation that produced them.

Finding slow executions

The requests table records every function invocation. duration is in milliseconds.

requests
| where timestamp > ago(24h)
| where name == "ProcessOrderFunction"
| where duration > 5000
| project timestamp, id, duration, resultCode, operation_Id
| order by duration desc

This gives you the slowest ProcessOrderFunction executions in the last 24 hours. Swap > 5000 for whatever your SLA threshold is. The operation_Id in each row is your entry point into the full trace for that execution.

Tracing a single request end-to-end

You have an operation_Id from a failed execution (from Transaction Search, from an alert, or from a support ticket). This query reconstructs everything that happened:

let traceId = "abc123def456";
union requests, dependencies, traces, exceptions
| where operation_Id == traceId
| project timestamp, itemType, name, message, duration, success, resultCode, customDimensions
| order by timestamp asc

The union across all four tables is deliberate. A single function execution produces rows in multiple tables: a requests row for the invocation itself, dependencies rows for every outbound call, traces rows for your ILogger calls, and an exceptions row if something threw. The itemType column tells you which table each row came from.

If you set up BeginScope with OrderId and CustomerId as described in the logging section, those values appear in customDimensions on every trace row. You can also work backwards from a business ID when you don't have an operation_Id:

traces
| where timestamp > ago(24h)
| where customDimensions.OrderId == "ORD-20260330-1847"
| project operation_Id
| distinct operation_Id

Take that operation_Id and feed it into the union query above.

Counting failures by function name over time

requests
| where timestamp > ago(7d)
| where success == false
| summarize FailureCount = count() by bin(timestamp, 1h), name
| order by timestamp desc, FailureCount desc

This surfaces which functions fail most, and whether failures cluster at specific times (a sign of a dependency being unhealthy during a maintenance window, or a batch job hitting a resource limit).
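
Raw counts favour high-volume functions, so a failure rate is often the fairer comparison. A variant of the query above (the Total, Failed, and FailureRatePercent names are illustrative):

```kusto
requests
| where timestamp > ago(7d)
| summarize Total = count(), Failed = countif(success == false) by name
| extend FailureRatePercent = round(100.0 * Failed / Total, 2)
| order by FailureRatePercent desc
```

A function with 3 failures out of 10 runs ranks above one with 50 failures out of 100,000, which is usually the order you want to investigate them in.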

Finding dependency bottlenecks

Your function may be fast; a downstream service may not be.

dependencies
| where timestamp > ago(24h)
| where cloud_RoleName == "your-function-app-name"
| summarize
    CallCount = count(),
    P50 = percentile(duration, 50),
    P95 = percentile(duration, 95),
    P99 = percentile(duration, 99),
    FailureRate = round(100.0 * countif(success == false) / count(), 1)
    by target, type
| order by P95 desc

Replace "your-function-app-name" with the value in your function app's Application Insights configuration (it defaults to the function app name). The target column shows the external endpoint or database, and type shows the dependency kind (HTTP, SQL, Azure Service Bus, etc.). A high P95 with a low FailureRate means the dependency is slow but not failing outright: the kind of problem that shows up as user-visible latency before it shows up as errors.

One gotcha with KQL in the portal: queries run against a Log Analytics workspace, and there's a default query scope. If you open KQL from the Application Insights blade, you're automatically scoped to that resource's workspace. If you open it from a general Log Analytics workspace, you need to add | where cloud_RoleName == "your-function-app-name" to every query, or you'll get results mixed across all resources in the workspace.

Application Map

The Application Map (left nav, under Investigate) renders your function app as a node and every dependency it calls as connected nodes. Each connection shows call volume, average duration, and failure rate. Nodes turn yellow when failure rates exceed roughly 20-30% and red above 50% (the thresholds aren't configurable).

For a ProcessOrderFunction that calls a payment API and writes to SQL, you'd see three nodes: your function app in the centre, the payment API to one side, and the SQL database to the other. The lines between them show call counts and average latency. If the payment API node is yellow, that's your first place to look during an incident.

The map is useful for a quick health check and for onboarding new team members, but it has limits. It aggregates across all functions in the app, so if you have ten functions and one is hammering a slow dependency, the map shows the aggregate. It also doesn't distinguish between functions calling the same dependency: if both ProcessOrderFunction and RefundOrderFunction call the same SQL database, the database shows one aggregated node. For function-level dependency analysis, go back to the KQL query above.
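
To split the aggregate by function, group dependencies on operation_Name. This sketch assumes correlation is working, so each dependency row carries the name of the function invocation that made the call:

```kusto
dependencies
| where timestamp > ago(24h)
| summarize
    Calls = count(),
    P95 = percentile(duration, 95),
    Failures = countif(success == false)
    by operation_Name, target, type
| order by P95 desc
```

If ProcessOrderFunction and RefundOrderFunction share a SQL database, they show up here as two separate rows with their own latency and failure figures.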

Alerts

Alert rules in Application Insights let you define a condition and trigger an action group (email, Teams webhook, PagerDuty, etc.) when the condition is met. You configure them under the Alerts section of your Application Insights resource.

To create a failure rate alert for ProcessOrderFunction:

Select Create > Alert rule.
Set the signal to Custom log search.

Use this KQL as the condition:

requests
| where name == "ProcessOrderFunction"
| where timestamp > ago(5m)
| summarize
    Total = count(),
    Failed = countif(success == false)
| extend FailureRate = round(100.0 * Failed / Total, 1)
| where FailureRate > 5

Set evaluation frequency to every 1 minute, and the lookback window to 5 minutes.
Configure the threshold: alert when the query returns any rows (meaning FailureRate > 5 for that window).

Action groups are the notification mechanism. One action group can send email, post to a Teams incoming webhook, and call an Azure Automation runbook simultaneously. Define your on-call action group once, then reuse it across all alert rules.

A few practical notes on alert tuning:

Start with a 5-minute window and a 5% threshold, then tighten after you've seen a few weeks of baseline data. Alerting on 1-minute windows at 1% failure rate on a low-volume function produces a lot of noise for transient errors.
The requests table has a 1-2 minute ingestion delay under normal conditions and up to 5 minutes during ingestion spikes. A 5-minute lookback window accounts for this. A 1-minute window can miss failures entirely if ingestion is delayed.
For queue-triggered functions, complement failure rate alerts with a queue depth alert on the source queue (configured through Azure Monitor metrics, not Application Insights). A growing queue combined with low invocation count means your function is failing at startup, before it even executes: a scenario that produces no requests rows and won't trigger a failure rate alert.
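
The "no invocations at all" case from the last note can also be caught with a log alert that fires when a function goes silent. A minimal sketch, assuming ProcessOrderFunction normally runs at least once every 30 minutes (the window is a guess you should tune to your traffic):

```kusto
requests
| where timestamp > ago(30m)
| where name == "ProcessOrderFunction"
// summarize without a by-clause returns a single row even when no rows match
| summarize Invocations = count()
| where Invocations == 0
```

Configure the rule to alert when the query returns any rows. This will also fire during genuinely quiet periods, so it works best paired with the queue depth alert rather than on its own.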

Common Issues and Fixes

"My logs aren't showing up in Application Insights"

It comes down to one of four things, and you can rule them out in order.

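Check the connection string first. Open your function app in the portal, go to Configuration, and confirm APPLICATIONINSIGHTS_CONNECTION_STRING is set and points to the right resource. If it's missing, nothing reaches Application Insights at all. If it holds only an instrumentation key (the InstrumentationKey=<guid> format without an IngestionEndpoint), telemetry goes to the default global endpoint, which may not be the one your resource expects; copy the full connection string from the Application Insights resource instead.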

Check the worker log level config. As covered in the two-pipeline gotcha above: host.json controls the host process, but your Program.cs or appsettings.json controls the worker. If you haven't explicitly configured the worker's ApplicationInsights log level, the SDK default applies: Warning only. Add this to your appsettings.json:

{
  "Logging": {
    "ApplicationInsights": {
      "LogLevel": {
        "Default": "Information"
      }
    }
  }
}

Or remove the default filter rule entirely in Program.cs, as shown in the setup section.

Check sampling. If requests appear but traces for specific functions don't, sampling may be discarding them. Run the KQL query from the sampling section to see which telemetry types are being dropped. Add Trace to excludedTypes in host.json if you need full trace fidelity for a critical function.

Check the Function log level in host.json. If Function is set to Warning or higher, LogInformation calls from your function code never leave the host. Set it to Information to restore them.
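
A quick way to tell a level filter apart from the other three causes is to look at which severities are arriving at all. A sketch (the Lines and Latest names are illustrative):

```kusto
traces
| where timestamp > ago(30m)
// severityLevel: 0 = Verbose, 1 = Information, 2 = Warning, 3 = Error, 4 = Critical
| summarize Lines = count(), Latest = max(timestamp) by cloud_RoleName, severityLevel
| order by cloud_RoleName asc, severityLevel asc
```

No rows at all points to the connection string. Rows only at level 2 and above point to a log level filter in the worker or in host.json. Level 1 rows that are present but sparse point to sampling.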

"Dependencies are missing from the Application Map"

When your Application Map shows your function app as an isolated node with no outbound edges, check for a missing ConfigureFunctionsApplicationInsights() call in Program.cs.

AddApplicationInsightsTelemetryWorkerService() registers the SDK. ConfigureFunctionsApplicationInsights() is what connects the Functions runtime's ActivitySource to that SDK so outbound HTTP, SQL, and Azure SDK calls produce dependency telemetry with the correct operation IDs. Without it, dependencies are either not tracked at all or tracked with broken correlation (they appear in the dependencies table but don't link back to the parent request).

builder.Services
    .AddApplicationInsightsTelemetryWorkerService()
    .ConfigureFunctionsApplicationInsights(); // Required for dependency tracking
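
To see whether you are in the "tracked but not linked" case, look for dependency rows whose operation_Id has no matching request. A sketch, with a wider window on the requests side so long-running invocations are not mistaken for orphans:

```kusto
dependencies
| where timestamp > ago(1h)
| join kind=leftanti (
    requests
    | where timestamp > ago(2h)
    | distinct operation_Id
) on operation_Id
| summarize Orphaned = count() by target, type
| order by Orphaned desc
```

Some orphans are expected, for example calls made during startup outside any invocation. A large share of orphaned rows for a dependency your functions call on every run suggests correlation is broken.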

If both calls are present and you're still missing HTTP dependencies: check how HttpClient is registered. The Application Insights SDK instruments HttpClient via IHttpClientFactory. If you're creating HttpClient instances with new HttpClient() directly instead of injecting an IHttpClientFactory-managed instance, those calls bypass the instrumentation entirely.

// Not tracked
private readonly HttpClient _client = new HttpClient();

// Tracked (inject IHttpClientFactory via primary constructor)
public class MyFunction(IHttpClientFactory httpClientFactory)
{
    private readonly HttpClient _client = httpClientFactory.CreateClient();
}

// Program.cs: register IHttpClientFactory so it can be injected
builder.Services.AddHttpClient();

"I see duplicate telemetry for every request"

In the isolated worker model, both the host process and the worker process can emit telemetry for the same function invocation. When both are sending to the same Application Insights resource, you get duplicate requests entries, inflated counts, and misleading failure rates.

This is controlled by the telemetryMode setting at the root level of host.json (not inside logging). When the setting is absent, the host uses its standard Application Insights output, which allows both sides to emit. Setting it to OpenTelemetry resolves the duplication, but note that when you do, the logging.applicationInsights section of host.json no longer applies:

{
  "version": "2.0",
  "telemetryMode": "OpenTelemetry"
}

Alternatively, suppress host-side request telemetry while keeping your worker-side telemetry by raising Host.Results above Information in host.json's logLevel section. The tradeoff: this also removes successful execution records from the portal's Function Monitor tab. Use telemetryMode when you want clean deduplication without losing host-side visibility.

To confirm duplication before changing anything, run this in KQL:

requests
| where timestamp > ago(1h)
| summarize count() by operation_Id
| where count_ > 1
| order by count_ desc

Any operation_Id appearing more than once is a duplicated invocation.
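
To check that the duplicates really come from two emitters rather than from retries, break the rows down by sdkVersion. This is a sketch: the exact sdkVersion strings depend on your host and SDK versions, so treat the values as something to inspect rather than match on:

```kusto
requests
| where timestamp > ago(1h)
| summarize Rows = count() by operation_Id, sdkVersion
| summarize Emitters = make_set(sdkVersion), Rows = sum(Rows) by operation_Id
// More than one distinct sdkVersion means two pipelines reported the same invocation
| where array_length(Emitters) > 1
| order by Rows desc
```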

"Cold start latency spikes in my metrics"

Cold starts produce latency spikes that look identical to slow execution in your metrics. Before investigating application code, confirm whether a spike is a cold start or an actual regression.

A cold start request carries a specific pattern: high latency on the first invocation from a given instance, with subsequent requests from the same instance running at normal duration. The cloud_RoleInstance dimension on each request record identifies the instance.

requests
| where timestamp > ago(24h)
| where name == "ProcessOrderFunction"
| summarize
    first_request = min(timestamp),
    p50 = percentile(duration, 50),
    p99 = percentile(duration, 99),
    request_count = count()
    by cloud_RoleInstance
| extend is_cold_start_instance = (request_count <= 3)
| order by first_request desc

Instances flagged by is_cold_start_instance (three or fewer requests in the window) are most likely fresh scale-out instances, and their durations are not representative of your steady-state performance. Filter them out when computing your SLA metrics, applying the same time range and function filter inside the subquery so the per-instance counts match the outer query:

requests
| where timestamp > ago(24h)
| where name == "ProcessOrderFunction"
| join kind=inner (
    requests
    | where timestamp > ago(24h)
    | where name == "ProcessOrderFunction"
    | summarize request_count = count() by cloud_RoleInstance
    | where request_count > 3
) on cloud_RoleInstance
| summarize p50 = percentile(duration, 50), p99 = percentile(duration, 99)

If the spike appears in warm instances, you have a real slowdown. If it's limited to fresh instances appearing after a scale-out event, it's cold start behavior. The two require different responses: cold starts call for pre-warming strategies or Consumption to Premium plan migration; actual slowdowns point to profiling.

"Alerts fire but I can't find the failing requests"

You set up an alert on exception count, it fires, you open the Failures blade, and the requests that caused the exceptions are gone. Sampling is discarding the evidence.

By default, Exception telemetry is sampled alongside everything else. When the SDK keeps one exception and estimates it represents five, the other four are discarded permanently. Your alert fires because the metric aggregation (which runs before sampling discards anything) saw all five. Your query returns only the one that survived.

The fix is to exclude Exception from sampling in host.json:

{
  "logging": {
    "applicationInsights": {
      "samplingSettings": {
        "isEnabled": true,
        "maxTelemetryItemsPerSecond": 20,
        "excludedTypes": "Request;Exception"
      }
    }
  }
}

Adding Request to excludedTypes ensures the parent request record is also always kept, so you can correlate the exception back to its invocation through the operation_Id. Without both, you may find the exception but not the request that caused it.
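
To measure how much evidence is going missing before and after the change, compare stored exception rows with the estimated total and list the ones whose parent request did not survive. A sketch (the column names are illustrative):

```kusto
exceptions
| where timestamp > ago(24h)
| join kind=leftanti (
    requests
    | where timestamp > ago(25h)
    | distinct operation_Id
) on operation_Id
// itemCount > 1 means this row stands in for several sampled-away exceptions
| summarize OrphanedExceptions = count(), EstimatedTotal = sum(itemCount) by type, bin(timestamp, 1h)
| order by timestamp desc
```

Once both types are excluded from sampling, OrphanedExceptions should drop towards zero for new data and EstimatedTotal should equal the stored row count.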

If the alert is on a custom metric rather than exceptions, check whether customMetrics is being sampled. Custom metrics emitted through TelemetryClient.GetMetric() are not affected by sampling (they're pre-aggregated in the SDK before sending). Custom events emitted with TelemetryClient.TrackEvent() are sampled, and alerts based on custom event counts can suffer the same problem. Add Event to excludedTypes if that's your signal source.

OpenTelemetry Alternative

The classic Application Insights SDK works well if your entire stack lives in Azure. But if you need to send telemetry to Grafana, Datadog, Jaeger, or any other backend alongside (or instead of) Azure Monitor, you're duplicating instrumentation code for each target. OpenTelemetry solves this at the protocol level: one set of instrumentation, one exporter interface, multiple backends.

OpenTelemetry is a CNCF project that defines a vendor-neutral API and wire format (OTLP) for traces, metrics, and logs. The same instrumentation code that sends data to Application Insights can also send to Zipkin or Collector pipelines with a config change, not a code rewrite.

The Setup

Microsoft publishes the Microsoft.Azure.Functions.Worker.OpenTelemetry package for this purpose, paired with the Azure Monitor exporter:

dotnet add package Microsoft.Azure.Functions.Worker.OpenTelemetry
dotnet add package OpenTelemetry.Extensions.Hosting
dotnet add package Azure.Monitor.OpenTelemetry.Exporter

First, enable OpenTelemetry output from the Functions host by adding "telemetryMode": "OpenTelemetry" at the root of your host.json (the same setting described in the duplicate telemetry section). Then the Program.cs registration replaces the classic SDK calls:

using Azure.Monitor.OpenTelemetry.Exporter;
using Microsoft.Azure.Functions.Worker;
using Microsoft.Azure.Functions.Worker.Builder;
using Microsoft.Azure.Functions.Worker.OpenTelemetry;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

var builder = FunctionsApplication.CreateBuilder(args);

builder.Services
    .AddOpenTelemetry()
    .UseFunctionsWorkerDefaults()
    .UseAzureMonitorExporter(); // reads APPLICATIONINSIGHTS_CONNECTION_STRING automatically

builder.Build().Run();

UseFunctionsWorkerDefaults() hooks into the Functions runtime's ActivitySource for proper distributed trace correlation (the OpenTelemetry equivalent of ConfigureFunctionsApplicationInsights() from the classic SDK). Without it, dependency telemetry won't correlate back to the parent function invocation. See the full setup in EventHubDemo/Program.cs.

The connection string is still required; UseAzureMonitorExporter() reads APPLICATIONINSIGHTS_CONNECTION_STRING from the environment the same way the classic SDK does. If you also want to export to a second backend (requires the additional OpenTelemetry.Exporter.OpenTelemetryProtocol package), register it separately:

builder.Services
    .AddOpenTelemetry()
    .UseFunctionsWorkerDefaults()
    .UseAzureMonitorExporter();

builder.Services
    .AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .AddOtlpExporter(otlp =>
        {
            otlp.Endpoint = new Uri("http://localhost:4317");
        }));

UseAzureMonitorExporter() is a cross-cutting registration that configures all signals at once. Chaining signal-specific exporters like AddOtlpExporter after it in the same builder can throw a NotSupportedException. Separate AddOpenTelemetry() calls avoid the conflict.

What You Gain

W3C Trace Context is the default propagation format, which means your distributed traces correlate correctly with other OpenTelemetry-instrumented services regardless of what backend they report to. With the classic SDK you get this too, but only within the Application Insights ecosystem; outside it, the format diverges.

You also get multi-backend export: Azure Monitor for your ops team, a Grafana stack for your platform team, and a local Collector for local debugging, all from the same process. And if you ever migrate off Azure Monitor entirely, you replace one exporter registration, not every TelemetryClient call in your codebase.

What You Lose Today

The OpenTelemetry path is not at feature parity with the classic SDK for Functions specifically.

Live Metrics (the real-time stream at monitor.azure.com) does not work with the distro. It relies on a proprietary push mechanism in the classic SDK that has no OpenTelemetry equivalent yet.

Snapshot Debugger is unavailable. It's a classic SDK feature with no OTLP counterpart.

Auto-collection gaps: some dependency types that the classic SDK instruments automatically (certain Azure SDK operations, Service Bus settlement calls) may not be captured out of the box, depending on which OpenTelemetry instrumentation libraries you've added. You may need to register the relevant sources explicitly, for example AddSource("Azure.*") for Azure SDK clients, or add the matching instrumentation packages.

Documentation: the distro's documentation for Functions scenarios specifically is thin. Most samples target ASP.NET Core web apps; you'll spend time adapting them and testing whether auto-collection works for your trigger types.

When to Choose Which

Use the classic SDK if:

your entire workload runs on Azure and you have no multi-vendor requirements
you need Live Metrics or Snapshot Debugger
you want the richest out-of-the-box experience with the Application Insights portal today

Use OpenTelemetry if:

you're sending telemetry to multiple backends, or planning to
the rest of your services are already OpenTelemetry-instrumented and you need consistent trace propagation across the board
you're building something that might not always live in Azure

If you're greenfield on a purely Azure stack, the classic SDK is less configuration for the same result right now. If you're instrumenting a heterogeneous system or building for portability, OpenTelemetry's overhead is worth it; you pay once at setup and gain the flexibility when requirements change.

What's Next

This is Part 9 and the final article in the core series. Over nine weeks, this series went from "what is serverless" to querying production telemetry in KQL. If you followed along and built something, you now have a function app with HTTP and queue triggers, proper configuration with Key Vault, a CI/CD pipeline through GitHub Actions, and Application Insights wired up for structured logging, distributed tracing, and alerting. That covers the full lifecycle: build, test, deploy, monitor.

The companion repository at azure-functions-samples has working code for every article in the series. Clone it, break things, wire up your own alerts.

Next week is a bonus article outside the core series: production cost realities on the Consumption plan, and the signals that tell you it's time to move to Flex Consumption or Premium. If you've ever wondered why your monthly bill looked nothing like the pricing calculator, that one is for you.

When you first wired up monitoring on a production function app, which alert did you set up first: failure rate or latency?

Azure Functions for .NET Developers: Series

Part 1: Why Azure Functions? Serverless for .NET Developers

Part 2: Your First Azure Function: HTTP Triggers Step-by-Step

Part 3: Beyond HTTP: Timer, Queue, and Blob Triggers

Part 4: Local Development Setup: Tools, Debugging, and Hot Reload

Part 5: Understanding the Isolated Worker Model

Part 6: Configuration Done Right: Settings, Secrets, and Key Vault

Part 7: Testing Azure Functions: Unit, Integration, and Local

Part 8: Deploying to Azure: CI/CD with GitHub Actions

Part 9: Azure Functions Observability: From Blind Spots to Production Clarity (this article)

Bonus: Production Realities: When Serverless Stops Being Serverless

Deploying to Azure: CI/CD with GitHub Actions

Martin Oehlert — Fri, 27 Mar 2026 06:36:06 +0000

Introduction: from local to production

Local tooling hides four things you have to own in production: packaging, authentication, configuration injection, and rollback. func start handles all of them silently; a CI/CD pipeline does not, and the decisions you make about each one compound quickly.

The gap is easy to miss. Your local environment reads from local.settings.json, authenticates with your personal identity, and recovers from bad deploys by letting you just restart. Azure does none of that for you. You need a packaging step, a way to authenticate from a pipeline without storing secrets, a strategy for injecting environment-specific configuration, and some mechanism for rolling back when a deploy breaks something.

This article covers two stages of that journey. First, manual deployment using the Azure CLI and the Functions Core Tools: useful for quick validation and understanding what the automated pipeline will do under the hood. Then a GitHub Actions workflow with two jobs, OIDC authentication (no stored credentials in your repository), deployment slots for zero-downtime releases, and configuration management that keeps secrets out of your pipeline definition entirely.

Manual deployment options

Before wiring up a full CI/CD pipeline, understand what actually happens when code reaches Azure. Manual deployment gives you that visibility, and it remains useful long after you've automated everything: for one-off hotfixes, for validating a packaging issue, or for deploying to a scratch environment without spinning up a workflow run.

func azure functionapp publish

The Core Tools command is the closest thing to a one-stop deploy:

func azure functionapp publish <APP_NAME>

Under the hood, it runs dotnet build --output bin/publish, creates a .zip archive (filtered by your .funcignore), uploads the archive via the Kudu ZipDeploy API (or One Deploy for Flex Consumption plans), and then syncs triggers and restarts the host. By default it also sets WEBSITE_RUN_FROM_PACKAGE=1 on the app, covered in the next subsection.

Flags you'll reach for regularly:

# Skip the local build — useful when you've already built in CI
func azure functionapp publish <APP_NAME> --no-build

# Deploy to a staging slot instead of production
func azure functionapp publish <APP_NAME> --slot staging

# Push local.settings.json values to app settings (prompts for confirmation)
func azure functionapp publish <APP_NAME> --publish-local-settings -i

# Verify what files will be included before committing to a deploy
func azure functionapp publish <APP_NAME> --list-included-files

Run --list-included-files at least once per project. If your archive includes bin/ debug artifacts, test assemblies, or secrets you meant to .funcignore, you want to catch that before it's sitting on a production host.

A minimal .funcignore for a .NET project:

*.csproj
*.sln
.git/
.vscode/
local.settings.json
test/

local.settings.json is the most important exclusion: it often contains connection strings and keys meant for local development only.

Azure CLI: two commands, two APIs

The Azure CLI gives you two distinct options, and picking the wrong one for your plan type will fail silently or throw a confusing error.

# Kudu ZipDeploy — works for Consumption, Premium, and Dedicated plans
az functionapp deployment source config-zip \
  -g <RESOURCE_GROUP> -n <APP_NAME> --src ./publish.zip

# One Deploy API — required for Flex Consumption, also valid elsewhere
az functionapp deploy \
  -g <RESOURCE_GROUP> -n <APP_NAME> --src-path ./publish.zip --type zip

The older config-zip command talks directly to Kudu and does no building; you're responsible for providing a publish-ready zip. It does not support Flex Consumption, the newer serverless plan that bypasses Kudu entirely. If you're on Flex Consumption, az functionapp deploy is the only CLI path that works. It also gives you --clean to remove files not in the archive and --async to return immediately without polling for completion.

A rule of thumb: if you're writing a deploy script that needs to work across plan types, use az functionapp deploy. If you're on a legacy plan and config-zip already exists in your runbooks, it's fine to leave it.
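
The rule of thumb can be captured in a small helper that a deploy script might call. This is a minimal sketch in Python, not code from the sample repository: the function name build_deploy_command, the legacy_runbook switch, and the plan labels are assumptions made for illustration. It only assembles the argument lists for the two commands shown above and does not run them.

```python
# Hypothetical helper: chooses between the two Azure CLI zip-deploy commands.
# The plan labels ("flex", "premium", ...) are illustrative, not Azure identifiers.

def build_deploy_command(plan, resource_group, app_name, zip_path, legacy_runbook=False):
    """Return the az CLI argument list for deploying zip_path to app_name."""
    if legacy_runbook:
        if plan == "flex":
            # config-zip goes through Kudu, which Flex Consumption bypasses.
            raise ValueError("config-zip does not support Flex Consumption")
        return ["az", "functionapp", "deployment", "source", "config-zip",
                "-g", resource_group, "-n", app_name, "--src", zip_path]
    # One Deploy is valid on every plan type, so it is the default path.
    return ["az", "functionapp", "deploy",
            "-g", resource_group, "-n", app_name,
            "--src-path", zip_path, "--type", "zip"]
```

The returned list can be handed to subprocess.run(command, check=True) in a real script. Keeping the choice in one function means a later move to Flex Consumption fails loudly in the script rather than halfway through a deployment.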

Run-From-Package and why it matters

When WEBSITE_RUN_FROM_PACKAGE=1 is set, Azure mounts your zip archive as a read-only filesystem at wwwroot rather than extracting files into it. This is the default behavior when you publish with Core Tools, and it has real production benefits: deployment is atomic (the old package stays mounted until the new one is ready), file-copy locking errors disappear, and cold start times improve because the runtime reads directly from the zip.

The constraints: wwwroot becomes read-only (portal-based editing no longer works), the archive has a 1 GB limit, and you should not set this value on Flex Consumption plans, which manage packages differently.

Which method to use

For anything beyond a one-off fix or an afternoon prototype, these manual commands are the foundation you'll extract into a pipeline. Knowing what each one does makes the GitHub Actions steps in the next section easier to reason about when something goes wrong.

GitHub Actions workflow setup

The pieces fit together like this: the build job produces a single artifact, and the deploy job authenticates via OIDC, pushes to a staging slot, and swaps it into production.

The complete workflow is below. Read through it first; the walkthrough after explains the decisions behind each piece.

name: Deploy Azure Functions (.NET 10)

on:
  push:
    branches: [main]

jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-dotnet@v4
        with:
          dotnet-version: '10.0.x'
          cache: true

      - run: dotnet restore --locked-mode

      - run: >
          dotnet publish src/MyFunctionApp
          --configuration Release
          --output ./output
          --runtime linux-x64
          --self-contained true

      - uses: actions/upload-artifact@v4
        with:
          name: function-app
          path: ./output
          retention-days: 3

  deploy:
    runs-on: ubuntu-latest
    needs: build
    environment: production
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: actions/download-artifact@v4
        with:
          name: function-app
          path: ./output

      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

      - uses: Azure/functions-action@v1
        with:
          app-name: ${{ vars.FUNCTION_APP_NAME }}
          slot-name: staging
          package: ./output

      - name: Swap staging to production
        uses: azure/cli@v2
        with:
          inlineScript: |
            az functionapp deployment slot swap \
              --name ${{ vars.FUNCTION_APP_NAME }} \
              --resource-group ${{ vars.RESOURCE_GROUP }} \
              --slot staging \
              --target-slot production

This is the same function app from Parts 1 through 7. The complete source is in the azure-functions-samples repository. Every push to main builds it, deploys to a staging slot, and swaps to production. No secrets stored, no manual steps, and a rollback is one swap away.

If your plan doesn't support slots (Consumption with only one slot available, or Flex Consumption), remove the slot-name parameter and the swap step. The functions-action will deploy directly to production.

Why two jobs instead of one

The split between build and deploy exists for two reasons.

First, the artifact produced by build is reusable. If you add a staging environment later, the deploy job can run twice against the same artifact without rebuilding. Build once, deploy to as many environments as you need.

Second, the id-token: write permission required for OIDC authentication (covered in the next section) is scoped to the deploy job only. If you set it at the workflow level, every job gets that elevated permission. Keeping it on the deploy job limits the blast radius if something goes wrong.

The build job

actions/checkout@v4 pulls your code. actions/setup-dotnet@v4 installs the SDK, and the cache: true option tells it to cache the NuGet global packages folder between runs.

That cache only works if your project has a lock file. Add this to your .csproj:

<RestorePackagesWithLockFile>true</RestorePackagesWithLockFile>

Then commit the generated packages.lock.json. Without it, cache: true has nothing to hash (so every run misses the cache), and --locked-mode silently regenerates a new lock file instead of validating against a committed one. With both in place, clean builds skip the network entirely for packages that haven't changed.

The publish step is where .NET 10 requires extra care:

dotnet publish src/MyFunctionApp \
  --configuration Release \
  --output ./output \
  --runtime linux-x64 \
  --self-contained true

--self-contained true is required for .NET 10. The Azure Functions v4 host runs on .NET 8. If you publish a framework-dependent app targeting .NET 10, the host cannot find the .NET 10 runtime and the deployment fails with exit code 150 (0x96). A self-contained publish bundles the runtime with your app, so the host's .NET version becomes irrelevant.

actions/upload-artifact@v4 takes the ./output folder and makes it available to downstream jobs. The name value (function-app) is how the deploy job will refer to it.

The deploy job

needs: build means this job waits for the build to succeed before starting. environment: production ties the job to a GitHub environment, which lets you add required reviewers or protection rules before any deployment proceeds.

actions/download-artifact@v4 retrieves the artifact by the same name used during upload and places it in ./output.

azure/login@v2 handles authentication using OIDC; the specifics of how to configure this are in the next section. This step must come before functions-action, and the three secrets (AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_SUBSCRIPTION_ID) must be set in your repository or environment settings.

Azure/functions-action@v1 does the actual deployment. Two parameters are required: app-name (the name of your Function App in Azure) and package (the path to your artifact). An optional slot-name parameter targets a deployment slot if you are using them.

The deployment method the action uses depends on your hosting plan. Flex Consumption plans use One Deploy; all other plans use Zip Deploy. The action picks this automatically based on your app's plan type, so you do not need to configure it explicitly.

OIDC authentication (no stored secrets)

The workflow above uses three secrets: AZURE_CLIENT_ID, AZURE_TENANT_ID, and AZURE_SUBSCRIPTION_ID. None of them are actual credentials. That's the point of OIDC.

Why not publish profiles or service principal secrets?

Publish profiles are XML files containing deployment credentials baked into the Function App. They work, but they create problems at scale: they can't be scoped to a branch or environment, they don't expire on a schedule, and if one leaks, anyone with the file can deploy to your app until you manually reset it.

Service principal secrets are better (they support expiration and RBAC scoping), but you still have a secret stored in GitHub that needs rotating every 6-24 months. Miss a rotation and the next deploy fails at the login step.

OIDC eliminates stored credentials entirely. GitHub mints a short-lived token for each workflow run, Azure validates that token against a federated credential you configure once, and nothing secret ever sits in your repository settings.

How it works

1. Your workflow requests an OIDC token from GitHub's token service
2. The azure/login action sends that token to Microsoft Entra ID
3. Entra validates the token's issuer (token.actions.githubusercontent.com), audience, and subject claim (which encodes the repo, branch, and environment)
4. If the claims match your federated credential configuration, Entra issues an Azure access token
5. The access token is used for the deployment, then expires

The subject claim is what makes this granular. You can restrict a credential to only work from a specific environment (repo:your-org/your-repo:environment:production), a specific branch, or even pull requests. A token minted from a feature branch won't match a credential scoped to the production environment.
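
To make the matching concrete, here is a minimal sketch in Python of how the subject string is composed and compared. The helper names expected_subject and credential_accepts are made up for this illustration. The real comparison happens inside Entra, so treat this as a mental model, not its implementation.

```python
# Illustrative model of GitHub OIDC subject claims; helper names are hypothetical.

def expected_subject(repo, environment=None, branch=None, pull_request=False):
    """Build the subject claim GitHub would put in the token for a workflow run."""
    if environment is not None:
        # A job that declares an environment gets an environment-scoped subject.
        return f"repo:{repo}:environment:{environment}"
    if pull_request:
        return f"repo:{repo}:pull_request"
    if branch is not None:
        return f"repo:{repo}:ref:refs/heads/{branch}"
    raise ValueError("need an environment, a branch, or pull_request=True")


def credential_accepts(credential_subject, token_subject):
    """A federated credential matches only when the subjects are identical."""
    return credential_subject == token_subject


production_credential = "repo:your-org/your-repo:environment:production"
from_production_job = expected_subject("your-org/your-repo", environment="production")
from_feature_branch = expected_subject("your-org/your-repo", branch="feature/new-checkout")
```

Here credential_accepts(production_credential, from_production_job) is true, while the token minted from the feature branch is rejected because its subject ends in :ref:refs/heads/feature/new-checkout.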

Setup steps

1. Create an Entra app registration with a service principal:

az ad app create --display-name "github-deploy-my-func-app"
az ad sp create --id <APP_ID>

2. Assign the Website Contributor role scoped to the resource group containing your Function App:

az role assignment create \
  --assignee <APP_ID> \
  --role "Website Contributor" \
  --scope /subscriptions/<SUB_ID>/resourceGroups/<RG_NAME>

Website Contributor is enough for deploying code. Contributor works too but grants more access than the pipeline needs.

3. Configure a federated identity credential. Save the JSON below as credential.json, then pass the file to the CLI:

{
  "name": "github-actions-production",
  "issuer": "https://token.actions.githubusercontent.com",
  "subject": "repo:your-org/your-repo:environment:production",
  "audiences": ["api://AzureADTokenExchange"]
}

az ad app federated-credential create \
  --id <APP_ID> \
  --parameters @credential.json

The subject field must match exactly. If your deploy job uses environment: production, the subject must end with :environment:production. If you deploy from a branch without an environment, use :ref:refs/heads/main instead.

4. Store the IDs as GitHub environment secrets:

Go to your repository Settings > Environments > production > Environment secrets, and add:

AZURE_CLIENT_ID: the Application (client) ID from your app registration
AZURE_TENANT_ID: your Entra tenant ID
AZURE_SUBSCRIPTION_ID: the subscription containing your Function App

These are identifiers, not credentials. Even if someone reads them, they can't authenticate without a valid OIDC token from your specific repository and environment.

Workflow permissions

The deploy job needs id-token: write to mint the OIDC token:

deploy:
  permissions:
    id-token: write
    contents: read

Set this on the deploy job only, not at the workflow level. The build job doesn't need token-minting permissions.

One gotcha

The Azure/functions-action supports two authentication methods: publish-profile and the azure/login action. They are mutually exclusive. If you pass a publish-profile parameter while also using azure/login, the action uses the publish profile and ignores your OIDC session. Remove the publish-profile parameter entirely when switching to OIDC.

Deployment slots and zero-downtime releases

Deploying directly to production means every release has a moment where either the old code or the new code is partially running. Deployment slots give you a staging URL to validate before any production traffic sees the new version, and an instant rollback if something goes wrong.

What each plan supports

If you're on Flex Consumption, skip to the rolling updates section below.

The blue-green pattern

Deploy to a staging slot, verify it works, then swap staging into production.

1. Deploy to staging: your CI/CD pipeline targets the staging slot instead of production
2. Validate: hit the staging URL (your-func-app-staging.azurewebsites.net) with smoke tests or manual checks
3. Swap: Azure switches the routing so staging serves production traffic
4. Rollback if needed: swap again to revert (the old production code is now in the staging slot)

The swap itself takes seconds. Your users see either the old version or the new version, never a half-deployed state.
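
The validate step can be as small as one HTTP check that gates the swap. Below is a minimal sketch in Python under a few assumptions: the /api/health route is hypothetical (use whatever endpoint your app exposes), the hostname follows the app-slot.azurewebsites.net pattern shown above, and the function names are invented for this example.

```python
# Hypothetical smoke-test gate for the staging slot; not part of the sample repo.
import urllib.error
import urllib.request


def slot_url(app_name, slot, path="/api/health"):
    """Build the slot URL using the <app>-<slot>.azurewebsites.net pattern."""
    return f"https://{app_name}-{slot}.azurewebsites.net{path}"


def http_status(url, timeout=10):
    """Return the HTTP status code for url; connection failures still raise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code


def ready_to_swap(app_name, slot="staging", fetch_status=http_status):
    """True only when the slot's health endpoint answers 200."""
    return fetch_status(slot_url(app_name, slot)) == 200
```

The fetch_status parameter exists so the gate can be exercised without network access, for example ready_to_swap("my-func-app", fetch_status=lambda url: 503) returns False. In a pipeline, a script like this would run between the deploy step and the swap step and exit non-zero when the check fails.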

What swaps and what stays

This trips people up. During a swap, code and most settings travel together from staging to production. But some things are pinned to the slot:

Travels with code (gets swapped): general app settings (unless marked sticky), connection strings (unless marked sticky), handler mappings, public certificates.

Stays with the slot: publishing endpoints, custom domains, TLS/SSL certificates, scale settings, IP restrictions, Always On, FUNCTIONS_EXTENSION_VERSION (sticky by default).
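
A small simulation helps when reasoning about which values end up where. This Python sketch models only app settings, with slots as plain dictionaries. The function name swap_settings and the Environment__Name setting are invented for the example; it shows the rule, not how the platform implements it.

```python
# Toy model of a slot swap: non-sticky settings travel, sticky settings stay put.

def swap_settings(source, target, sticky):
    """Return (new_source, new_target) app settings after swapping two slots."""
    def after_swap(incoming, resident):
        # Everything not marked sticky arrives from the other slot...
        result = {k: v for k, v in incoming.items() if k not in sticky}
        # ...while sticky settings remain with the slot they were set on.
        result.update({k: v for k, v in resident.items() if k in sticky})
        return result

    return after_swap(target, source), after_swap(source, target)


staging = {"FeatureFlags__NewCheckout": "true", "Environment__Name": "staging"}
production = {"FeatureFlags__NewCheckout": "false", "Environment__Name": "production"}

new_staging, new_production = swap_settings(
    staging, production, sticky={"Environment__Name"})
```

After the swap, new_production carries FeatureFlags__NewCheckout=true from staging but keeps Environment__Name=production, and new_staging holds the previous production flag value, which is what makes swapping back a rollback.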

Sticky settings that cause problems

Two settings deserve special attention:

FUNCTIONS_EXTENSION_VERSION is sticky by default. If your staging slot runs ~4 and production also runs ~4, this is invisible. But if you ever need to change the version, the stickiness means the setting won't swap with the code. To make it travel with the swap, set WEBSITE_OVERRIDE_STICKY_EXTENSION_VERSIONS=0 on all slots.

WEBSITE_CONTENTSHARE is auto-generated per slot and should never be set manually. Each slot needs its own content share to avoid file locking conflicts. If you see deployment failures mentioning "cannot access file," check whether slots are sharing this value.

Deploy-to-slot and swap in GitHub Actions

Add slot-name to the deploy step, then swap using the Azure CLI:

- uses: Azure/functions-action@v1
  with:
    app-name: 'my-func-app'
    slot-name: staging
    package: ./output

- name: Swap staging to production
  uses: azure/cli@v2
  with:
    inlineScript: |
      az functionapp deployment slot swap \
        --name my-func-app \
        --resource-group my-rg \
        --slot staging \
        --target-slot production

Swap gotchas

Watch for these:

Running functions are terminated during a swap. There is no graceful drain. If you have long-running executions, they will be killed. For timer or queue triggers, the runtime will pick up incomplete work after the swap, but HTTP requests in flight will fail.
Warm-up matters. After a swap, the new production instances need to initialize. Set WEBSITE_SWAP_WARMUP_PING_PATH to an endpoint (like a health check) that forces initialization before traffic arrives.
Keep app names under 32 characters. Longer names can cause host ID collisions between slots, leading to unexpected behavior.

Flex Consumption alternative: rolling updates

Flex Consumption doesn't support slots, but it offers rolling updates as an alternative. With siteUpdateStrategy.type set to RollingUpdate, Azure replaces instances in batches rather than all at once, giving in-progress executions a 60-minute grace period to complete.

The trade-off: there's no separate staging URL for validation, no way to split traffic between versions, and rollback means redeploying the previous version rather than an instant swap.

Environment configuration in pipelines

A deployment pipeline needs to put the right configuration in the right environment without leaking secrets into workflow files. GitHub Environments, the secrets hierarchy, and Key Vault references each handle a piece of this.

GitHub Environments

Environments are configured under your repository's Settings > Environments. Each environment can have:

Required reviewers (up to 6 people or teams; one of them must approve before the deploy job runs)
Wait timers (a delay before deployment proceeds, useful for change windows)
Deployment branches (restrict which branches can target this environment)

In the workflow, environment: production on a job ties it to that environment's rules. The job will pause and wait for approval if reviewers are configured.

Secrets hierarchy

GitHub secrets exist at three levels: organization, repository, and environment.

When the same secret name exists at multiple levels, environment wins over repository, which wins over organization. This means you can set AZURE_CLIENT_ID at the environment level with different values for development, staging, and production, each pointing to a different service principal scoped to its own resource group.
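
The precedence rule is easy to express as a lookup. The Python sketch below is only a model of the behavior described above, with made-up names and values; GitHub resolves secrets for you, so nothing like this appears in a workflow.

```python
# Model of GitHub's secret precedence: environment > repository > organization.

def resolve_secret(name, organization, repository, environment):
    """Return the value a job would see for a secret defined at several levels."""
    for scope in (environment, repository, organization):
        if name in scope:
            return scope[name]
    raise KeyError(name)


organization = {"AZURE_TENANT_ID": "tenant-shared"}
repository = {"AZURE_CLIENT_ID": "client-default"}
production = {"AZURE_CLIENT_ID": "client-production"}

client_id = resolve_secret("AZURE_CLIENT_ID", organization, repository, production)
tenant_id = resolve_secret("AZURE_TENANT_ID", organization, repository, production)
```

Here client_id resolves to the production value because the environment level wins, while tenant_id falls through to the organization level because no narrower scope defines it.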

Setting app configuration during deployment

Your Function App needs configuration values beyond what's in the code. The most direct approach is the Azure CLI:

- name: Configure app settings
  uses: azure/cli@v2
  with:
    inlineScript: |
      az functionapp config appsettings set \
        --name ${{ vars.FUNCTION_APP_NAME }} \
        --resource-group ${{ vars.RESOURCE_GROUP }} \
        --settings \
          "ServiceBus__Connection=${{ secrets.SERVICEBUS_CONNECTION }}" \
          "FeatureFlags__NewCheckout=true"

Use vars (GitHub Variables) for non-sensitive configuration and secrets for anything you wouldn't put in a log file.

One warning if you manage settings through Bicep or ARM templates instead: the ARM API replaces all app settings on each deployment. If your template omits a setting that exists on the app, that setting gets deleted. The CLI's appsettings set command merges instead, which is safer for incremental updates.
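
The difference between the two behaviors, and a cheap guard against the destructive one, can be sketched with dictionaries. This is an illustrative Python model with invented function names, not a call into any Azure SDK.

```python
# Replace vs. merge semantics for app settings, modeled with plain dicts.

def apply_template(current, declared):
    """ARM/Bicep-style deployment: the declared settings become the full set."""
    return dict(declared)


def apply_cli_set(current, updates):
    """CLI-style appsettings set: updates are layered over what already exists."""
    return {**current, **updates}


def settings_a_template_would_delete(current, declared):
    """Names present on the app but missing from the template."""
    return sorted(set(current) - set(declared))


current = {"ServiceBus__Connection": "existing-value", "FeatureFlags__NewCheckout": "false"}
declared = {"FeatureFlags__NewCheckout": "true"}
```

With these inputs, apply_template drops ServiceBus__Connection while apply_cli_set keeps it. Running a check like settings_a_template_would_delete against the live settings before a template deployment is one way to catch an omission before it reaches the app.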

Multi-environment workflow

The build-once-deploy-many pattern chains environments with approval gates:

jobs:
  build:
    runs-on: ubuntu-latest
    # ... build steps from earlier ...

  deploy-dev:
    needs: build
    environment: development
    runs-on: ubuntu-latest
    steps:
      # download artifact, azure/login, functions-action
      # (same structure, different secrets per environment)

  deploy-staging:
    needs: deploy-dev
    environment: staging
    runs-on: ubuntu-latest
    steps:
      # deploy to staging slot, run smoke tests

  deploy-production:
    needs: deploy-staging
    environment: production  # approval gate triggers here
    runs-on: ubuntu-latest
    steps:
      # swap staging to production

The same artifact flows through all three environments. The only things that change are the secrets (different AZURE_CLIENT_ID per environment, each scoped to its own resource group) and the deployment target.

Key Vault integration

This ties back to Part 6 (Configuration Done Right). Instead of passing secret values through your pipeline, store them in Key Vault and reference them in app settings:

ServiceBus__Connection=@Microsoft.KeyVault(VaultName=my-kv;SecretName=servicebus-conn)

Your pipeline sets the reference, not the secret value. The Function App's managed identity resolves the actual value at runtime using the Key Vault Secrets User role. The pipeline never sees the secret, and rotating it in Key Vault takes effect without redeployment.

If you use deployment slots, mark Key Vault references as slot settings when different environments need different secrets (e.g., staging points to a staging Key Vault, production to a production Key Vault).
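
If a script generates these settings per environment, building the reference string in one place avoids typos in the syntax. A minimal Python sketch follows; the function name and the vault names in the mapping are assumptions for illustration, and the output format is the VaultName/SecretName form shown above.

```python
# Hypothetical helper that formats Key Vault references for app settings.

def key_vault_reference(vault_name, secret_name):
    """Format an app setting value that points at a Key Vault secret."""
    return f"@Microsoft.KeyVault(VaultName={vault_name};SecretName={secret_name})"


# Illustrative vault names: one vault per environment, same secret name in each.
vaults = {"staging": "my-kv-staging", "production": "my-kv"}

settings_by_environment = {
    environment: {"ServiceBus__Connection": key_vault_reference(vault, "servicebus-conn")}
    for environment, vault in vaults.items()
}
```

The pipeline would then pass settings_by_environment["production"] to the app settings step, so the only thing it ever handles is the reference and never the secret itself.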

What goes where

Closing

Eight articles, one function app, and a pipeline that deploys itself. If something in your own setup doesn't match what's here, the series navigation links every piece: from the first HTTP trigger through testing to this deployment workflow.

Do you deploy straight to production or use a staging slot? What made you choose one over the other?

Azure Functions for .NET Developers: Series

Part 1: Why Azure Functions? Serverless for .NET Developers

Part 2: Your First Azure Function: HTTP Triggers Step-by-Step

Part 3: Beyond HTTP: Timer, Queue, and Blob Triggers

Part 4: Local Development Setup: Tools, Debugging, and Hot Reload

Part 5: Understanding the Isolated Worker Model

Part 6: Configuration Done Right: Settings, Secrets, and Key Vault

Part 7: Testing Azure Functions: Unit, Integration, and Local

Part 8: Deploying to Azure: CI/CD with GitHub Actions (this article)

Part 9: Azure Functions Observability: From Blind Spots to Production Clarity

Bonus: Production Realities: When Serverless Stops Being Serverless

Figuring out what actually needs a real Azure connection vs. what you can just test with a plain class… that’s where most testing headaches start. I wrote up how I handle it: unit tests, Testcontainers + Azurite, and full func start pipelines.

Martin Oehlert — Sat, 21 Mar 2026 06:16:23 +0000

Testing Azure Functions: Unit, Integration, and Local

Martin Oehlert — Fri, 20 Mar 2026 07:17:37 +0000

Where do you draw the line between what needs a full Azure connection and what can be tested with a plain class instantiation? The isolated worker model makes the answer concrete: the function class is just wiring. Everything testable lives in a service class that knows nothing about Azure.

Most testing pain comes from not drawing that line early enough.

The design decision that makes testing possible

Consider a function that does its own work:

public class OrderFunction(ILogger<OrderFunction> logger, SqlConnection db)
{
    [Function("CreateOrder")]
    public async Task<IActionResult> CreateOrder(
        [HttpTrigger(AuthorizationLevel.Anonymous, "post", Route = "orders")] HttpRequest req,
        [FromBody] CreateOrderRequest order)
    {
        if (order.Quantity <= 0)
            return new BadRequestObjectResult("Quantity must be greater than zero");

        var orderId = "ORD-" + Guid.NewGuid().ToString("N")[..8];

        await db.ExecuteAsync(
            "INSERT INTO Orders (OrderId, ProductId, Quantity) VALUES (@OrderId, @ProductId, @Quantity)",
            new { orderId, order.ProductId, order.Quantity });

        logger.LogInformation("Created order {OrderId}", orderId);
        return new CreatedResult($"/orders/{orderId}", new { orderId, order.ProductId, order.Quantity });
    }
}

To unit test this, you need a real SqlConnection. That means a real database, which means either a running SQL Server, Testcontainers, or a brittle in-memory substitute. Every test becomes an infrastructure test, even for something as simple as verifying that a zero quantity returns 400.

The fix is to move the logic into a service class, leaving the function with nothing to do except call the service and map the result to a response:

public class OrderFunction(IOrderService orderService)
{
    [Function("CreateOrder")]
    public async Task<IActionResult> CreateOrder(
        [HttpTrigger(AuthorizationLevel.Anonymous, "post", Route = "orders")] HttpRequest req,
        [FromBody] CreateOrderRequest order)
    {
        var result = await orderService.CreateOrderAsync(order);

        if (!result.IsSuccess)
            return new BadRequestObjectResult(result.Error);

        return new CreatedResult($"/orders/{result.Order!.OrderId}", result.Order);
    }
}

The function class is now three lines of routing logic. IOrderService is a plain interface with no Azure types, no infrastructure, nothing that requires a running host to instantiate.

This gives you two separate test targets. The service holds the logic and gets fast, isolated unit tests with no framework setup. The function class holds the routing and gets a thin layer of tests that verify the HTTP response shapes. Each layer can be tested on its own terms.

Unit testing the service layer

The service has one dependency worth injecting for tests: IOrderRepository. Here's the full service:

public class OrderService(ILogger<OrderService> logger, IOrderRepository repository) : IOrderService
{
    public async Task<OrderResult> CreateOrderAsync(CreateOrderRequest request)
    {
        if (request.Quantity <= 0)
            return OrderResult.Failure("Quantity must be greater than zero");

        var order = new Order(
            OrderId: "ORD-" + Guid.NewGuid().ToString("N")[..8],
            ProductId: request.ProductId,
            Quantity: request.Quantity);

        await repository.SaveAsync(order);

        logger.LogInformation("Created order {OrderId} for {ProductId}", order.OrderId, order.ProductId);

        return OrderResult.Success(order);
    }
}

To test it, you need xUnit and NSubstitute. The .csproj is minimal:

<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <IsTestProject>true</IsTestProject>
    <!-- Test method names use underscores by convention (MethodName_Condition_Expected) -->
    <NoWarn>$(NoWarn);CA1707</NoWarn>
  </PropertyGroup>
  <ItemGroup>
    <FrameworkReference Include="Microsoft.AspNetCore.App" />
    <PackageReference Include="Microsoft.NET.Test.Sdk" />
    <PackageReference Include="NSubstitute" />
    <PackageReference Include="xunit" />
    <PackageReference Include="xunit.runner.visualstudio">
      <PrivateAssets>all</PrivateAssets>
      <IncludeAssets>runtime; build; native; contentfiles; analyzers</IncludeAssets>
    </PackageReference>
  </ItemGroup>
  <ItemGroup>
    <ProjectReference Include="../HttpTriggerDemo/HttpTriggerDemo.csproj" />
  </ItemGroup>
</Project>

The tests themselves need no Azure infrastructure:

public class OrderServiceTests
{
    private readonly IOrderRepository _repository = Substitute.For<IOrderRepository>();
    private readonly OrderService _service;

    public OrderServiceTests()
    {
        _service = new OrderService(NullLogger<OrderService>.Instance, _repository);
    }

    [Fact]
    public async Task CreateOrderAsync_WithValidRequest_ReturnsSuccess()
    {
        var request = new CreateOrderRequest("WIDGET-42", 3);

        var result = await _service.CreateOrderAsync(request);

        Assert.True(result.IsSuccess);
        Assert.NotNull(result.Order);
        Assert.Equal("WIDGET-42", result.Order.ProductId);
        Assert.Equal(3, result.Order.Quantity);
    }

    [Fact]
    public async Task CreateOrderAsync_WithValidRequest_SavesOrderToRepository()
    {
        var request = new CreateOrderRequest("WIDGET-42", 3);

        await _service.CreateOrderAsync(request);

        await _repository.Received(1).SaveAsync(Arg.Is<Order>(o =>
            o.ProductId == "WIDGET-42" && o.Quantity == 3));
    }

    [Theory]
    [InlineData(0)]
    [InlineData(-1)]
    [InlineData(-100)]
    public async Task CreateOrderAsync_WithInvalidQuantity_ReturnsFailure(int quantity)
    {
        var request = new CreateOrderRequest("WIDGET-42", quantity);

        var result = await _service.CreateOrderAsync(request);

        Assert.False(result.IsSuccess);
        Assert.NotNull(result.Error);
    }

    [Theory]
    [InlineData(0)]
    [InlineData(-1)]
    public async Task CreateOrderAsync_WithInvalidQuantity_DoesNotCallRepository(int quantity)
    {
        var request = new CreateOrderRequest("WIDGET-42", quantity);

        await _service.CreateOrderAsync(request);

        await _repository.DidNotReceive().SaveAsync(Arg.Any<Order>());
    }
}

NullLogger<T>.Instance is the right choice for service tests. You are testing behavior, not logging output. Using Substitute.For<ILogger<T>>() to verify that specific log messages were emitted is a fragile approach: log messages are implementation details that change often and aren't part of the service contract. Save NSubstitute for dependencies whose behavior actually matters to the test outcome, like IOrderRepository.

[Theory] + [InlineData] handles the validation branches without duplicating the test body. Each InlineData value runs as a separate test in the output, so you get a clear signal on exactly which inputs fail. The two [Theory] blocks above run 3 + 2 = 5 test cases from a handful of lines.

Received() and DidNotReceive() are NSubstitute's call-count assertions. The second [Fact] verifies the repository was called with the right data; the second [Theory] verifies it was never called when validation fails. Together they cover both the happy path and the guard clause.

Unit testing the function class

When you use ConfigureFunctionsWebApplication() (the ASP.NET Core integration mode), the function's HttpRequest is a standard ASP.NET Core HttpRequest. That means you can construct a DefaultHttpContext in tests and pass context.Request directly to the function method, with no Functions runtime involved:

public class OrderFunctionTests
{
    private readonly IOrderService _orderService = Substitute.For<IOrderService>();
    private readonly OrderFunction _function;

    public OrderFunctionTests()
    {
        _function = new OrderFunction(_orderService);
    }

    [Fact]
    public async Task CreateOrder_WhenServiceSucceeds_Returns201Created()
    {
        var request = new CreateOrderRequest("WIDGET-42", 3);
        var order = new Order("ORD-ABCD1234", "WIDGET-42", 3);
        _orderService.CreateOrderAsync(request).Returns(OrderResult.Success(order));

        var result = await _function.CreateOrder(new DefaultHttpContext().Request, request);

        var created = Assert.IsType<CreatedResult>(result);
        Assert.Equal("/orders/ORD-ABCD1234", created.Location);
        Assert.Equal(order, created.Value);
    }

    [Fact]
    public async Task CreateOrder_WhenServiceFails_Returns400BadRequest()
    {
        var request = new CreateOrderRequest("WIDGET-42", -1);
        _orderService.CreateOrderAsync(request)
            .Returns(OrderResult.Failure("Quantity must be greater than zero"));

        var result = await _function.CreateOrder(new DefaultHttpContext().Request, request);

        var bad = Assert.IsType<BadRequestObjectResult>(result);
        Assert.Equal("Quantity must be greater than zero", bad.Value);
    }
}

Notice what these tests do not cover: the [HttpTrigger] attribute, binding resolution, middleware, or anything the Functions host owns. That's intentional. The function's responsibility is to map a service result to an HTTP response. Two tests cover both outcome branches. Anything beyond that is integration territory.

The [HttpTrigger] and [FromBody] attributes are metadata for the runtime. They don't execute during a direct method call, so there's nothing to test or mock.

Timer triggers

Timer functions follow the same pattern. TimerInfo is a concrete class from the Functions SDK with settable properties, so you construct it directly:

public class CleanupFunctionTests
{
    private readonly CleanupFunction _function =
        new(NullLogger<CleanupFunction>.Instance);

    [Fact]
    public async Task Run_WhenOnSchedule_CompletesWithoutError()
    {
        var timer = new TimerInfo { IsPastDue = false };

        await _function.Run(timer);

        // No exception thrown = the function handled the timer correctly.
        // Timer functions have no return value — the observable outcome is
        // either successful completion or an exception.
    }

    [Fact]
    public async Task Run_WhenPastDue_StillCompletes()
    {
        var timer = new TimerInfo { IsPastDue = true };

        await _function.Run(timer);
    }
}

Timer function tests are often this minimal. The function's behavior on IsPastDue = true is to log a warning; there's no meaningful return value to assert on. What you're verifying is that the function reaches completion without throwing, and that the IsPastDue branch doesn't break anything. If your cleanup function does real work (deleting records, archiving blobs), that work lives in a service and gets tested through the service tests, not through the function.

Integration testing with Testcontainers

Unit tests get you 80% of the way. They don't verify that DI registrations are correct, that your database schema matches your queries, or that a real TableClient call actually persists what you think it does. Those gaps come down to two separate problems: the composition root and the data layer.

Verifying the composition root

The first failure mode is silent: a service registration is missing, and the host fails to activate OrderFunction at runtime while every unit test passes. A composition test catches this without Docker:

public class HostIntegrationTests : IAsyncLifetime
{
    private IHost _host = null!;

    public async Task InitializeAsync()
    {
        _host = new HostBuilder()
            .ConfigureFunctionsWebApplication()
            .ConfigureServices(services =>
            {
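                // Apply the same service registrations as Program.cs here. This listing
                // omits them; without them the resolution test below has nothing to find.

                // WorkerHostedService opens a gRPC channel to the Functions host.
                // That host doesn't exist in tests, so remove it or the test hangs.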
                var worker = services.FirstOrDefault(s =>
                    s.ImplementationType?.Name == "WorkerHostedService");
                if (worker is not null)
                    services.Remove(worker);
            })
            .Build();

        await _host.StartAsync();
    }

    [Fact]
    public void IOrderService_ResolvesFromDi()
    {
        var service = _host.Services.GetService<IOrderService>();
        Assert.NotNull(service);
    }

    public async Task DisposeAsync()
    {
        await _host.StopAsync();
        _host.Dispose(); // release the service provider along with the host
    }
}

WebApplicationFactory<Program> fails with the Azure Functions isolated worker. The model uses gRPC for host-worker communication, and the factory hits a channel URI parsing error when no Functions host is running. The HostBuilder approach mirrors Program.cs, with the gRPC worker service stripped. This test doesn't call any function logic; it only verifies that the container builds and resolves the services your functions depend on.

Testing the data layer

InMemoryOrderRepository lets unit tests run fast, but it tells you nothing about whether your actual persistence works. A production implementation using Azure Table Storage looks like this:

public class TableStorageOrderRepository(TableClient tableClient) : IOrderRepository
{
    public async Task SaveAsync(Order order)
    {
        var entity = new TableEntity(order.ProductId, order.OrderId)
        {
            ["Quantity"] = order.Quantity
        };
        await tableClient.AddEntityAsync(entity);
    }
}

The integration test spins up Azurite in Docker via Testcontainers:

public class TableStorageOrderRepositoryTests : IAsyncLifetime
{
    private readonly AzuriteContainer _azurite = new AzuriteBuilder().Build();

    public async Task InitializeAsync() => await _azurite.StartAsync();

    [Fact]
    public async Task SaveAsync_WithValidOrder_PersistsToTableStorage()
    {
        var client = new TableClient(_azurite.GetConnectionString(), "orders");
        await client.CreateIfNotExistsAsync();

        var repository = new TableStorageOrderRepository(client);
        var order = new Order("ORD-TEST01", "WIDGET-42", 3);

        await repository.SaveAsync(order);

        var entity = await client.GetEntityAsync<TableEntity>(
            order.ProductId, order.OrderId);
        Assert.Equal(3, entity.Value["Quantity"]);
    }

    public async Task DisposeAsync() => await _azurite.DisposeAsync();
}

Add one package to the test project:

<PackageReference Include="Testcontainers.Azurite" />

Each test run gets a fresh container. No ports to reserve, no cleanup between runs; Testcontainers handles port allocation for parallel CI execution automatically.

The same pattern covers blob and queue operations. For relational databases, Testcontainers.MsSql and Testcontainers.PostgreSql provide the same lifecycle wrapper for SQL Server and Postgres.

Local E2E testing with `func start` + Azurite

Logic tests cover OrderService. Repository tests cover TableStorageOrderRepository. Neither covers what happens when the Functions host receives an HTTP request, routes it through middleware, deserializes the body, calls the function, and returns a response.

For that, the host must be running. The approach is to start Azurite and func start together in the test fixture:

public class FunctionsE2ETests : IAsyncLifetime
{
    private readonly AzuriteContainer _azurite = new AzuriteBuilder().Build();
    private Process? _funcProcess;
    private readonly HttpClient _client = new();

    public async Task InitializeAsync()
    {
        await _azurite.StartAsync();

        _funcProcess = Process.Start(new ProcessStartInfo
        {
            FileName = "func",
            Arguments = "start --port 7071",
            WorkingDirectory = Path.GetFullPath("../../../HttpTriggerDemo"),
            EnvironmentVariables =
            {
                ["AzureWebJobsStorage"] = _azurite.GetConnectionString(),
                ["FUNCTIONS_WORKER_RUNTIME"] = "dotnet-isolated"
            },
            RedirectStandardOutput = true,
            UseShellExecute = false
        });

        await WaitForHostReady(_funcProcess, TimeSpan.FromSeconds(30));
    }

    [Fact]
    public async Task CreateOrder_WithValidRequest_Returns201()
    {
        var response = await _client.PostAsJsonAsync(
            "http://localhost:7071/api/orders",
            new CreateOrderRequest("WIDGET-42", 3));

        Assert.Equal(HttpStatusCode.Created, response.StatusCode);
    }

    private static async Task WaitForHostReady(Process process, TimeSpan timeout)
    {
        var ready = new TaskCompletionSource<bool>();
        process.OutputDataReceived += (_, e) =>
        {
            if (e.Data?.Contains("Host started") == true)
                ready.TrySetResult(true);
        };
        process.BeginOutputReadLine();
        await ready.Task.WaitAsync(timeout);
    }

    public async Task DisposeAsync()
    {
        _funcProcess?.Kill(entireProcessTree: true);
        await _azurite.DisposeAsync();
        _client.Dispose();
    }
}

func must be on the PATH. CI pipelines need npm install -g azure-functions-core-tools@4 before these tests run: it's a test infrastructure dependency that bites if you assume it's there.

Kill(entireProcessTree: true) matters on Windows. func start spawns child processes; killing just the parent leaves orphaned processes holding port 7071, which causes every subsequent E2E test run in that session to hang on startup.

WaitForHostReady listens to the redirected stdout stream for "Host started". Startup takes 3-10 seconds depending on cold JIT and machine speed. Set the timeout conservatively: a flaky timeout is harder to debug than a slow test.

Put these tests in a separate project with [Trait("Category", "E2E")] and exclude them from the fast inner development loop. They're most useful in CI as a gate before deployment, not as daily feedback during development.
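
As a rough sketch of that split, assuming xUnit and the standard dotnet test filter syntax: the class name below is hypothetical, and the trait key and value only need to match whatever the filter uses.

```csharp
using System.Threading.Tasks;
using Xunit;

// Hypothetical E2E test class; a class-level trait applies to every test in it.
[Trait("Category", "E2E")]
public class OrderApiE2ETests
{
    [Fact]
    public Task CreateOrder_WithValidRequest_Returns201()
    {
        // Test body elided; see the FunctionsE2ETests fixture above.
        return Task.CompletedTask;
    }
}

// Inner loop, skipping the slow tests:
//   dotnet test --filter "Category!=E2E"
// CI gate before deployment:
//   dotnet test --filter "Category=E2E"
```

Putting the trait on the class rather than on each method keeps a new E2E test from slipping into the fast run by accident.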

Testing an event-driven function

HTTP triggers test cleanly: call the function directly, inspect the return value. Event Hub triggers are different. The function receives a batch of EventData, deserializes each message, and delegates to a service; the trigger binding itself is provided by the runtime. That runtime can run locally.

The scenario here is an IoT pipeline: devices publish sensor readings to an Event Hub, and a function consumes the batch, validates each reading, and writes to Cosmos DB.

The function

The function stays thin. Deserialize the batch, call the service, nothing else:

public class SensorReadingFunction(
    ILogger<SensorReadingFunction> logger,
    ISensorProcessor processor)
{
    [Function(nameof(SensorReadingFunction))]
    public async Task Run(
        [EventHubTrigger("sensor-readings", Connection = "EventHubConnection")]
        EventData[] events)
    {
        logger.LogInformation("Processing batch of {Count} events", events.Length);

        foreach (var eventData in events)
        {
            var reading = JsonSerializer.Deserialize<SensorReading>(eventData.Body.Span);
            if (reading is null) continue;

            await processor.ProcessAsync(reading);
        }
    }
}

The EventData[] parameter receives the batch. The function doesn't know or care how many partitions the hub has, how messages were routed, or what retry policy applies: that's the runtime's job.

Unit testing the function

Construct EventData directly with a JSON body and call Run(). No containers, no emulator:

[Fact]
public async Task Run_WithBatchOfThreeEvents_CallsProcessorThreeTimes()
{
    var processor = Substitute.For<ISensorProcessor>();
    var function = new SensorReadingFunction(NullLogger<SensorReadingFunction>.Instance, processor);

    EventData[] events =
    [
        CreateEventData(new SensorReading("device-01", 22.5, 60.0, DateTimeOffset.UtcNow)),
        CreateEventData(new SensorReading("device-02", 25.0, 55.0, DateTimeOffset.UtcNow)),
        CreateEventData(new SensorReading("device-03", 18.3, 72.0, DateTimeOffset.UtcNow)),
    ];

    await function.Run(events);

    await processor.Received(3).ProcessAsync(Arg.Any<SensorReading>());
}

private static EventData CreateEventData(SensorReading reading)
    => new(JsonSerializer.SerializeToUtf8Bytes(reading));

This catches deserialization bugs and verifies the batch loop without starting anything.

Full trigger integration test

Unit tests verify the dispatch and deserialization logic in isolation. The full pipeline test goes further: a real message flows from Event Hubs through the function and into Cosmos DB.

The fixture starts three containers in parallel (Azurite for the Functions runtime, the Event Hubs emulator, and the Cosmos DB emulator), then launches func start as a child process wired to all three. The full source is in SensorPipelineFixture.cs. The container declarations and process wiring both require non-obvious configuration.

Container configuration:

// Use latest: Core Tools 4.8 sends an API version that Azurite 3.28.0 (the Testcontainers default) rejects.
private readonly AzuriteContainer _azurite = new AzuriteBuilder()
    .WithImage("mcr.microsoft.com/azure-storage/azurite:latest")
    .Build();

// WithPortBinding pins the host port so localhost:8081 resolves from the func child process.
private readonly CosmosDbContainer _cosmos = new CosmosDbBuilder()
    .WithImage("mcr.microsoft.com/cosmosdb/linux/azure-cosmos-emulator:vnext-preview")
    .WithPortBinding(8081, 8081)
    .WithWaitStrategy(Wait.ForUnixContainer()
        .UntilMessageIsLogged("Gateway=OK, Explorer=OK"))
    .Build();

private readonly EventHubsContainer _eventHubs;

public SensorPipelineFixture()
{
    _eventHubs = new EventHubsBuilder()
        .WithAcceptLicenseAgreement(true)
        .WithConfigurationBuilder(EventHubsServiceConfiguration.Create()
            .WithEntity("sensor-readings", 2, []))
        .WithWaitStrategy(Wait.ForUnixContainer()
            .UntilMessageIsLogged("Emulator Service is Successfully Up!"))
        .Build();
}

Testcontainers pins azurite:3.28.0 as its default. Azure Functions Core Tools 4.8 sends API version 2024-08-04; Azurite 3.28.0 rejects that version with a 400. Overriding the image with the latest tag resolves it.

Both the Event Hubs emulator and the Cosmos vnext-preview image are distroless: no shell, no /bin/sh. The default Testcontainers port-check wait strategy execs /bin/sh inside the container to verify the port is listening. On a distroless image, that exec fails and the strategy hangs indefinitely. UntilMessageIsLogged() watches the container's stdout stream directly, bypassing the shell dependency.

The Cosmos emulator returns its own internal address in the account metadata it sends back to clients. The test-process CosmosClient receives localhost:8081 as the endpoint and follows it there. WithPortBinding(8081, 8081) ensures that host port is pinned, so the func child process (which constructs its own CosmosClient) lands on the same address.

WithResourceMapping mounts a JSON configuration file into the Event Hubs emulator container, but it doesn't set the ServiceConfiguration property the builder reads at Build() time. The build throws at runtime. WithConfigurationBuilder uses the fluent API to set ServiceConfiguration directly, and the configuration is validated at build time.

Process wiring:

var cosmosPort = _cosmos.GetMappedPublicPort(8081);
var cosmosKey = _cosmos.GetConnectionString()
    .Split(';').First(p => p.StartsWith("AccountKey=", StringComparison.Ordinal))
    .Substring("AccountKey=".Length);

CosmosClient = new CosmosClient(
    _cosmos.GetConnectionString(),
    new CosmosClientOptions
    {
        ConnectionMode = ConnectionMode.Gateway,
        HttpClientFactory = () => new HttpClient(new CosmosEmulatorHandler(cosmosPort)),
        SerializerOptions = new CosmosSerializationOptions
        {
            PropertyNamingPolicy = CosmosPropertyNamingPolicy.CamelCase
        }
    });

startInfo.Environment["CosmosDbConnection"] =
    $"AccountEndpoint=http://localhost:{cosmosPort}/;AccountKey={cosmosKey}";

The func child process constructs its own CosmosClient from the CosmosDbConnection environment variable; it can't share the test process's HttpClient handler across the process boundary. Passing AccountEndpoint=http://localhost:{port}/ with an explicitly extracted key gives the child process a direct HTTP connection to the emulator without needing the handler.

CosmosEmulatorHandler is an HttpMessageHandler that rewrites outgoing requests from the emulator's self-reported internal hostname to localhost:{cosmosPort}. Without it, the SDK follows the internal address the emulator returns in its account metadata and misses the container.
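
The article does not show the handler itself, so the following is only a minimal sketch of what such a rewrite could look like, not the sample's actual implementation. The RewriteToLocalhost helper and the optional inner-handler parameter are assumptions added for illustration and testability.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch: redirects every outgoing request to localhost:{port},
// whatever host the emulator advertised in its account metadata.
public sealed class CosmosEmulatorHandler : DelegatingHandler
{
    private readonly int _port;

    public CosmosEmulatorHandler(int port, HttpMessageHandler? inner = null)
        : base(inner ?? new HttpClientHandler())
    {
        _port = port;
    }

    // Keeps scheme, path, and query; swaps only host and port.
    public static Uri RewriteToLocalhost(Uri original, int port) =>
        new UriBuilder(original) { Host = "localhost", Port = port }.Uri;

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        if (request.RequestUri is { } uri)
        {
            request.RequestUri = RewriteToLocalhost(uri, _port);
        }

        return base.SendAsync(request, cancellationToken);
    }
}
```

Because the rewrite happens below the SDK, the CosmosClient keeps using whatever endpoint it was told about, and only the transport is pointed at the mapped port.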

The full fixture also implements WaitForFunctionsHostAsync (polls localhost:7071/admin/host/status until the host responds) and DisposeAsync (kills the process tree and disposes all three containers). Both are in the full source.
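
For readers without the full source at hand, a readiness probe along those lines might look like the sketch below. The class name, parameters, and 500 ms default interval are assumptions; only the idea of polling the status endpoint until it responds comes from the fixture description.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not the sample's implementation.
public static class FunctionsHostProbe
{
    public static async Task WaitForFunctionsHostAsync(
        HttpClient client, Uri statusUri, TimeSpan timeout, TimeSpan? pollInterval = null)
    {
        var delay = pollInterval ?? TimeSpan.FromMilliseconds(500);
        using var cts = new CancellationTokenSource(timeout);

        try
        {
            while (true)
            {
                try
                {
                    using var response = await client.GetAsync(statusUri, cts.Token);
                    if (response.IsSuccessStatusCode) return;
                }
                catch (HttpRequestException)
                {
                    // Host is not listening yet; try again after the delay.
                }

                await Task.Delay(delay, cts.Token);
            }
        }
        catch (OperationCanceledException)
        {
            throw new TimeoutException(
                $"Functions host did not respond at {statusUri} within {timeout}.");
        }
    }
}
```

A fixture would call it once after starting the process, for example with new Uri("http://localhost:7071/admin/host/status") and a timeout generous enough for a cold start.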

The test publishes a batch and polls Cosmos DB until the document appears:

[Collection(SensorPipelineFixture.Name)]
public class SensorPipelineTests(SensorPipelineFixture fixture)
{
    [Fact]
    public async Task PublishedEvent_WithValidReading_AppearsInCosmosDb()
    {
        var reading = new SensorReading(
            DeviceId: $"device-{Guid.NewGuid():N}",
            Temperature: 23.4,
            Humidity: 58.0,
            Timestamp: DateTimeOffset.UtcNow);

        var batch = await fixture.ProducerClient.CreateBatchAsync();
        batch.TryAdd(new EventData(JsonSerializer.SerializeToUtf8Bytes(reading)));
        await fixture.ProducerClient.SendAsync(batch);

        var container = fixture.CosmosClient.GetContainer("SensorData", "readings");
        var deadline = DateTime.UtcNow.AddSeconds(30);
        List<dynamic> results = [];

        while (DateTime.UtcNow < deadline)
        {
            var query = container.GetItemQueryIterator<dynamic>(
                $"SELECT * FROM c WHERE c.deviceId = '{reading.DeviceId}'");
            results.Clear();
            while (query.HasMoreResults)
                results.AddRange(await query.ReadNextAsync());

            if (results.Count > 0) break;
            await Task.Delay(500);
        }

        Assert.Single(results);
        Assert.Equal(23.4, (double)results[0].temperature, precision: 1);
    }
}

The fixture takes 60–90 seconds to start. Share it across test classes through the [Collection] attribute shown above, and keep it out of the unit-test run in CI with a [Trait] attribute and a test filter.
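
The deadline loop in the test above tends to get repeated in every pipeline test. One way to factor it out is sketched below; the Poll class and its parameters are hypothetical and not part of the sample.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper: runs a probe until the predicate passes or the deadline expires.
public static class Poll
{
    public static async Task<T> UntilAsync<T>(
        Func<Task<T>> probe, Func<T, bool> isDone, TimeSpan timeout, TimeSpan interval)
    {
        var deadline = DateTime.UtcNow + timeout;

        while (true)
        {
            var result = await probe();

            // Return the last result on timeout too, so the caller's assertion
            // reports what was actually observed instead of a bare timeout.
            if (isDone(result) || DateTime.UtcNow >= deadline) return result;

            await Task.Delay(interval);
        }
    }
}
```

The test would pass a lambda that runs the Cosmos query as the probe and results => results.Count > 0 as the predicate, then assert on the returned list as before.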

Add the packages:

<PackageReference Include="Testcontainers.EventHubs" />
<PackageReference Include="Testcontainers.CosmosDb" />
<PackageReference Include="Testcontainers.Azurite" />
<PackageReference Include="Azure.Messaging.EventHubs" />
<PackageReference Include="Newtonsoft.Json" />

Cosmos SDK v3 requires Newtonsoft.Json at runtime via an internal dependency. Omitting it produces a FileNotFoundException at startup with no message connecting it to Cosmos.

What can't be emulated locally

Azurite and func start cover wiring and trigger dispatch. Some behaviors only emerge in Azure.

Cold starts. Local tests keep the host warm throughout the run. Consumption plan cold starts in Azure hit 500ms-2s for .NET depending on deployment size. If your SLA depends on p99 latency, that gap only shows in production traffic — local tests give you no signal on it.

Managed identity credential resolution. DefaultAzureCredential falls through a chain of credential sources. Locally it uses developer machine credentials or environment variables. In Azure it uses the Managed Identity endpoint. A misconfigured client ID or missing role assignment won't surface until the function runs with a real identity attached. The local credential chain doesn't exercise the same code path.

Scale-out behavior. func start runs one worker. Azure scales to N workers based on trigger backlog. Race conditions, partition contention, and shared-state bugs appear only under concurrent load across multiple instances. No local setup replicates this.

KEDA-based scaling decisions. Event Hub and Service Bus triggers scale based on message lag, but the scaling decisions come from the infrastructure, not the worker process. There's no local equivalent for how Azure routes partitions across workers as instances scale up.

The useful takeaway: unit tests and integration tests give fast, reliable feedback on logic and wiring. They don't give confidence about latency under cold conditions, behavior at scale, or cloud-managed auth. Build those signals from production observability (Application Insights, structured logs, alert rules), not from test infrastructure.

Patterns that cause pain

A few mistakes appear repeatedly in Azure Functions test suites.

Asserting on log messages. Substitute.For<ILogger<T>>() lets you verify that specific log calls were made. Don't. Log messages are implementation details: they change wording, get split into multiple calls, or get removed during refactoring. When they do, your test breaks without any behavior change. Use NullLogger<T>.Instance for services and only substitute loggers when logging output is the actual behavior under test (which is almost never).

Reaching into the runtime from unit tests. [HttpTrigger], [FromBody], and [QueueTrigger] are metadata for the runtime to read. They don't execute during a direct method call. Trying to test that binding attributes are present, or that the runtime would route correctly, puts you in the business of testing the Functions SDK rather than your code. The routing table lives in the host config; your job is to test what happens once the host calls your method.

Using constructors for container lifecycle. xUnit constructs the test class synchronously, but StartAsync() is asynchronous. Starting an AzuriteContainer from a constructor means blocking on that task, which ties up the thread and can make tests hang silently. Always use IAsyncLifetime: InitializeAsync for startup, DisposeAsync for teardown.

Testing the service layer twice. Once you have thorough OrderServiceTests, the function-level tests (OrderFunctionTests) should only cover the HTTP response mapping: does a successful result return 201, does a failure return 400. Repeating the validation and business logic assertions at the function level creates duplicate coverage that breaks together whenever the service contract changes.

Choosing your testing strategy

Layer	Approach	Infrastructure needed
Service logic	Unit test	None
Function routing	Unit test	None
DI wiring + middleware	HostBuilder trick	None (gRPC stripped)
Data layer round-trips	Testcontainers (SQL/Postgres/Azurite)	Docker
Trigger dispatch	`func start` + Azurite	Core Tools + Docker
Full pipeline	Testcontainers emulators + `func start`	Core Tools + Docker

Start from the top and stop as soon as the tests cover the risk you're managing. For most business logic, unit tests against the service layer are enough. The function class tests take a few minutes to write and cover the HTTP response shapes. Integration and E2E tests are worth the infrastructure cost only when you need to verify wiring, real database behavior, or trigger dispatch.

Do you unit test your function class directly, or do you treat the service layer as the boundary and skip function-level tests entirely?

Azure Functions for .NET Developers: Series

Part 1: Why Azure Functions? Serverless for .NET Developers

Part 2: Your First Azure Function: HTTP Triggers Step-by-Step

Part 3: Beyond HTTP: Timer, Queue, and Blob Triggers

Part 4: Local Development Setup: Tools, Debugging, and Hot Reload

Part 5: Understanding the Isolated Worker Model

Part 6: Configuration Done Right: Settings, Secrets, and Key Vault

Part 7: Testing Azure Functions: Unit, Integration, and Local (this article)

Part 8: Deploying to Azure: CI/CD with GitHub Actions

Part 9: Azure Functions Observability: From Blind Spots to Production Clarity

Bonus: Production Realities: When Serverless Stops Being Serverless
