Martin Oehlert

When Azure Functions Fight Back: Signs You've Outgrown Them

Your queue handler hit the 10-minute Consumption ceiling last week. You restructured it to checkpoint, and the next month-end batch still creeps over. The question now is not how to wring one more workaround out of Functions. It is when the next workaround stops being cheaper than moving the job onto a different host. Four signals push the answer past "still cheaper": performance walls, complexity sprawl, coupling patterns the platform makes worse, and a cost crossover that arrives sooner than most teams plan for.

Performance walls you will hit

Four limits decide how far Functions can carry the workload: how long any one invocation can run, how much memory it can use, how many sockets it can hold open, and what its file system actually persists.

Execution timeout per plan

The numbers from the hosting plan timeout reference:

(Table: execution timeout per plan.)

Cross-cutting cap: HTTP triggers that do not respond within 230 seconds are cut off by the Azure Load Balancer with HTTP 502 regardless of functionTimeout. The function keeps running but cannot return a response. Sources: HTTP trigger limits, Web request times out in App Service. For longer work, Microsoft points at the Durable Functions async HTTP pattern.

host.json: functionTimeout describes what happens at the cap: "When an execution exceeds this duration, a timeout error occurs and the language worker process restarts." The worker is killed, and in-flight invocations on it are lost. What the trigger does next is per binding:

  • Service Bus: PeekLock with autoComplete = true. On host crash the lock expires, and on next visibility the message reappears with DeliveryCount incremented. After MaxDeliveryCount (default 10) it lands in <queue>/$deadletterqueue (Service Bus dead-letter queues).
  • Storage Queue: visibility timeout per message. On host crash the storage-default 10-minute timeout takes over. After 5 failed attempts the message moves to <queue>-poison (Storage queue trigger: Peek lock).
  • Blob trigger: same five-attempt default, with failed blobs landing in webjobs-blobtrigger-poison (Poison blobs).
  • HTTP: caller already saw the 502 at 230 s. No automatic retry.
  • Timer: per Timer trigger: Retry behavior, "the timer trigger doesn't retry after a function fails."

A subtle one to flag: only Cosmos DB, Event Hubs, Kafka, and Timer support host-level retry policies ([FixedDelayRetry], [ExponentialBackoffRetry]). For Service Bus and Storage Queue, the binding's native retry semantics are the only mechanism (Azure Functions error handling and retries). Decorating a Service Bus trigger with [ExponentialBackoffRetry] does nothing.
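
To make the distinction concrete, a minimal sketch for the isolated worker model (function name and helper are illustrative): the retry attribute is honoured on a timer trigger, and the identical attribute on a Service Bus trigger does nothing.

[Function(nameof(NightlyRollup))]
[FixedDelayRetry(3, "00:00:30")] // honoured: Timer is one of the triggers that supports retry policies
public async Task NightlyRollup(
    [TimerTrigger("0 0 2 * * *")] TimerInfo timer,
    CancellationToken cancellationToken)
{
    // RunRollupAsync is a hypothetical helper; a throw here is retried up to 3 times.
    await RunRollupAsync(cancellationToken);
}

// The same [FixedDelayRetry] on a Service Bus trigger is silently ignored;
// only the queue's MaxDeliveryCount and dead-lettering apply.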

The fix that lets you stay (when it can)

Most "we hit the timeout" stories are workloads that can be split into chunks with a checkpoint between them. A queue-triggered batch that processes 50,000 items in 50 ms each runs 42 minutes. The same handler that processes 500 at a time and writes a cursor blob between chunks survives any number of restarts. From the companion sample:

[Function(nameof(DrainBatchFunction))]
public async Task Run(
    [QueueTrigger("batches", Connection = "AzureWebJobsStorage")] BatchCommand command,
    CancellationToken cancellationToken)
{
    await cursors.CreateIfNotExistsAsync(cancellationToken: cancellationToken);

    var checkpoint = cursors.GetBlobClient($"batch-{command.BatchId}.cursor");
    var lastCommitted = await ReadCursorAsync(checkpoint, cancellationToken);

    await foreach (var chunk in source.ChunksAfterAsync(
                       command.BatchId, lastCommitted, _chunkSize, cancellationToken))
    {
        await ProcessAsync(chunk, cancellationToken);

        // Commit the cursor *before* the next chunk; if the worker is killed
        // immediately after this upload, the next attempt resumes here.
        await checkpoint.UploadAsync(
            BinaryData.FromString(chunk.LastItemId.ToString(CultureInfo.InvariantCulture)),
            overwrite: true,
            cancellationToken);
    }
}

The shape is what matters: chunk, process, commit cursor, repeat. Each chunk must be small enough that the worst-case batch (500 items at 50 ms = 25 s) finishes well inside the timeout. The cursor write is the moment durability shifts. On retry, ReadCursorAsync resumes at the last committed item instead of restarting at item 1.

This pattern keeps you on Functions. When the single chunk itself runs longer than the timeout (a 12-minute database query, a multi-gigabyte file copy, a 30-minute model inference), the workload has outgrown the platform. No chunk size helps.

Memory ceilings

Per-instance memory from the service limits table:

(Table: memory per instance, per plan.)

Two surprises in that list. Flex Consumption has a 512 MB SKU. Most teams reading the marketing page assume Flex starts where Premium does. And from the trigger's perspective, OOM and timeout look the same: the OS terminates the worker, the host restarts, and the per-binding retry semantics from the previous section decide whether the input is replayed or dropped. The closest official reference is the host health monitor, which can recycle the host preemptively when performance counters stay above 80% within the health-check window.

The blob trigger has a documented memory amplifier worth knowing. Microsoft's Blob trigger: Memory usage and concurrency warns that "the runtime must load the entire blob into memory more than one time during processing" if you bind to a non-streaming type, and concurrency multiplies the effect. Bind to Stream for anything past a few MB.
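
A minimal sketch of the streaming bind, assuming the isolated worker model with a Blobs extension version that supports SDK-type bindings (container name and helper are illustrative):

[Function(nameof(ImportLargeFile))]
public async Task ImportLargeFile(
    [BlobTrigger("imports/{name}", Connection = "AzureWebJobsStorage")] Stream input,
    string name,
    CancellationToken cancellationToken)
{
    // Read line by line instead of materialising the whole blob as byte[] or string.
    using var reader = new StreamReader(input);
    while (await reader.ReadLineAsync() is { } line)
    {
        await ProcessLineAsync(name, line, cancellationToken); // hypothetical helper
    }
}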

SNAT ports and connection exhaustion

Two distinct outbound limits, easy to confuse.

SNAT port budget per instance. From Troubleshoot intermittent outbound connection errors: "Each instance on Azure App service is initially given a preallocated number of 128 SNAT ports." Ports apply to the same destination tuple (address + port) and are reclaimed by the load balancer four minutes after the connection closes. Microsoft's recommendation is to keep usage under 100 outbound connections per unique remote endpoint per instance.

Total outbound TCP connections per instance. Consumption is capped at 600 active (1,200 total) per instance, and the runtime logs Host thresholds exceeded: Connections at the limit. Flex, Premium, and Dedicated are listed as "unbounded" (Dedicated still subject to App Service worker-size caps).

The fastest way to walk into both limits is the canonical Functions anti-pattern: new HttpClient() inside a function body. Each invocation creates a new socket pool, and sockets sit in TIME_WAIT after disposal, compounded by the four-minute SNAT reclaim. At any reasonable RPS, the 128-port budget for a destination host is exhausted, and the function sees intermittent connect failures or SocketException. The AZF0002 analyzer flags the call site at build time.

The recommended fix is IHttpClientFactory, registered once in Program.cs. From the companion sample:

var builder = FunctionsApplication.CreateBuilder(args);
builder.ConfigureFunctionsWebApplication();

builder.Services
    .AddOptions<PaymentsOptions>()
    .Bind(builder.Configuration.GetSection(PaymentsOptions.SectionName))
    .ValidateDataAnnotations()
    .ValidateOnStart();

builder.Services
    .AddHttpClient<IPaymentsApi, PaymentsApi>((sp, client) =>
    {
        var options = sp.GetRequiredService<IOptions<PaymentsOptions>>().Value;
        client.BaseAddress = new Uri(options.BaseAddress);
        client.Timeout = TimeSpan.FromSeconds(options.TimeoutSeconds);
    })
    .AddStandardResilienceHandler();

The factory caches HttpMessageHandler instances (default lifetime 2 min per HttpClient lifetime management) and rotates them so DNS changes are picked up. The AddStandardResilienceHandler call layers in retry, circuit breaker, and timeout policies from Microsoft.Extensions.Http.Resilience without an extra DI dance. The function takes IPaymentsApi on its primary constructor and never calls new HttpClient() again.

One caveat from HttpClient guidelines: Recommended use: the factory shares a CookieContainer across pooled handlers. If your client depends on cookies, prefer a singleton with an explicit SocketsHttpHandler that sets PooledConnectionLifetime instead.
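
A minimal sketch of that alternative, assuming a cookie-dependent downstream (the endpoint is a placeholder): one long-lived client, with PooledConnectionLifetime keeping DNS rotation intact.

builder.Services.AddSingleton(sp => new HttpClient(new SocketsHttpHandler
{
    // Recycle pooled connections so DNS changes are still picked up.
    PooledConnectionLifetime = TimeSpan.FromMinutes(2),
    UseCookies = true,
    CookieContainer = new CookieContainer()
})
{
    BaseAddress = new Uri("https://partner.example.com")
});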

Database connections do not get the same fix

ADO.NET pools per process, and Functions runs one process per instance. Microsoft's SqlClient guidance is direct: "ADO.NET implements connection pooling by default. But because you can still run out of connections, you should optimize connections to the database."

The math is the trap. Each Consumption instance carries its own pool (default Max Pool Size = 100), and the platform can spin up to 200 Consumption instances (service limits). At full scale-out: 200 instances * 100 connections = 20,000 potential connections against the database. Premium is capped at 100 instances, Flex at 1,000. Most managed databases fall over well below that.

The fix lives at the database, not in Functions. Lower the pool size per instance, cap the application's maximum scale-out, or front the database with a connection pooler (for example, the built-in PgBouncer on Azure Database for PostgreSQL Flexible Server).
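
A minimal sketch of the first two levers, with illustrative numbers: shrink the per-instance pool in the connection string, and treat pool size times maximum scale-out as the worst case the database has to absorb.

// Per-instance pool cap (Microsoft.Data.SqlClient). Worst case against the database
// becomes maxScaleOutInstances * MaxPoolSize instead of * 100.
var csb = new SqlConnectionStringBuilder(rawConnectionString) // rawConnectionString: illustrative
{
    MaxPoolSize = 20,
    MinPoolSize = 0,
    ConnectTimeout = 15
};
await using var connection = new SqlConnection(csb.ConnectionString);
await connection.OpenAsync(cancellationToken);

Capping scale-out itself is a site setting (functionAppScaleLimit), not code, so it belongs in your infrastructure templates rather than in the function body.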

Local file system

Three behaviours worth distinguishing.

Per-instance ephemeral temp. From Operating system functionality in App Service: %SystemDrive%\local is reserved for temporary local storage, "not persistent across app restarts." Plan-by-plan capacity ranges from 0.5 GB (Consumption) to 21-140 GB (Premium / Dedicated). Scale-out does not just reset temp. It makes the data invisible: instance A writes /tmp/cache.json, and instance B never sees it.

Persisted shares. Premium and Dedicated can mount Azure Files (SMB or NFS). Flex Consumption supports SMB and read-only Blobs but not NFS (Choose a file access strategy). SMB cold-start latency is documented at 200-500 ms on first execution.

Read-only file systems. When WEBSITE_RUN_FROM_PACKAGE is set, "the wwwroot folder is read-only and you receive an error if you write files to this directory" (Run from package: General considerations). The same page is explicit: "Don't add the WEBSITE_RUN_FROM_PACKAGE app setting to apps on the Flex Consumption plan." Flex's deployment model treats the package as read-only by design. Container Apps follows container-image semantics: image layers are read-only.

The Durable Functions provider doc gives the blunt warning: "Storing payloads to local disks is not recommended, since on-disk state isn't guaranteed to be available" (Azure Storage representation in a task hub). Anything you want to read on a different instance, or after a restart, belongs in Blob, Cosmos, or another external store.
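
A minimal sketch of the split, with illustrative names: scratch output goes to the per-instance temp path, and anything another instance or the next restart must see goes to Blob.

// Scratch file: fast, local, gone after a restart, invisible to other instances.
var scratchPath = Path.Combine(Path.GetTempPath(), $"report-{reportId}.tmp");
await File.WriteAllTextAsync(scratchPath, intermediateCsv, cancellationToken);

// Durable copy: readable from any instance and after any restart.
var blob = reportsContainer.GetBlobClient($"reports/{reportId}.csv");
await blob.UploadAsync(BinaryData.FromString(intermediateCsv), overwrite: true, cancellationToken);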

Complexity signals that indicate a problem

Performance walls fail the workload outright. Complexity signals fail it slowly: the function count keeps climbing, the orchestration diagram keeps growing legends, and at some point you cannot describe the system without a whiteboard. None of the limits below are timeouts. They are the shape of the architecture telling you it has outgrown the deployment unit.

Durable Functions sub-orchestration sprawl

Microsoft lists three legitimate uses for CallSubOrchestratorAsync (Sub-orchestrations: When to use):

  1. Compose reusable workflow building blocks shared across parents.
  2. Fan out parallel instances of the same orchestrator and wait for all.
  3. Organise a large orchestration into named, testable pieces.

If a sub-orchestration is not doing one of those three jobs, it is decoration, and the cost compounds. The same docs add a hard constraint: "Sub-orchestrations must be defined in the same app as the parent orchestration." The cross-app workaround is the HTTP 202 polling pattern, a different programming model with no parent/child semantics, no shared retry policy, and no automatic exception propagation. The reason is the task hub model: "If multiple apps use the same task hub, they compete for messages, which can result in undefined behavior, including orchestrations getting unexpectedly stuck."

The decision worth surfacing in your design review: task hub partition count is immutable after creation. Default 4, max 16. From Performance and scale: Partition count: "You can't change the partition count after you create a task hub. Set it high enough to meet expected scale-out requirements." Most teams discover this when their orchestrator throughput plateaus and they reach for the dial that does not exist.
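
A minimal host.json sketch, assuming the default Azure Storage backend (hub name and count are illustrative); the value only applies to task hubs created after it is set.

{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "hubName": "OrdersHub",
      "storageProvider": {
        "partitionCount": 8
      }
    }
  }
}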

Sprawl signals inside one app:

  • Parent depth of 3+ levels where the grandchild does almost no work.
  • Sub-orchestrators that wrap a single activity call.
  • Parents fanning out to sub-orchestrators that themselves fan out: one instance ID can spawn dozens of children, each consuming control queue capacity.

Orchestrator code constraints

Orchestrators replay from history every time a new event arrives. The runtime requires deterministic code: the same input must produce the same call sequence on every replay. From Orchestrator function code constraints, the following must not appear inside an orchestrator body:

  • DateTime.Now / DateTime.UtcNow (use context.CurrentUtcDateTime).
  • Guid.NewGuid() (use context.NewGuid(), which returns Type 5 UUIDs derived from instance ID).
  • Bindings, including the orchestration client and entity client bindings. I/O lives in activities.
  • Static variables and environment variables.
  • Direct network or HTTP calls.
  • Task.Run, Task.Delay, Thread.Sleep, HttpClient.SendAsync. Use context.CreateTimer for delays.

The clean rule is one sentence: orchestrators schedule, activities act. Anything that reads time, calls the network, or generates a random value belongs in an activity. The framework detects some violations and throws NonDeterministicOrchestrationException, but the docs are explicit: "this detection behavior won't catch all violations, and you shouldn't depend on it." Violations ship and break weeks later.
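
A minimal orchestrator sketch in that shape (Durable Functions, isolated worker; activity names are illustrative): time and GUIDs come from the context, delays are durable timers, and everything else is scheduled as an activity.

[Function(nameof(SettlementOrchestrator))]
public static async Task SettlementOrchestrator(
    [OrchestrationTrigger] TaskOrchestrationContext context)
{
    var correlationId = context.NewGuid();        // not Guid.NewGuid()
    var startedAt = context.CurrentUtcDateTime;   // not DateTime.UtcNow

    await context.CallActivityAsync("ReserveFunds", correlationId);

    // Replay-safe delay: a durable timer, not Task.Delay.
    await context.CreateTimer(startedAt.AddMinutes(5), CancellationToken.None);

    await context.CallActivityAsync("CaptureFunds", correlationId);
}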

Shared external state as bottleneck

Account-level ceilings from Standard storage account scalability targets:

  • 20,000 RPS per general-purpose v2 account in most regions, and 40,000 RPS in higher-tier regions.
  • Hitting any of these returns HTTP 503 Server Busy or HTTP 500 Operation Timeout.

Per-service ceilings bite first: roughly 2,000 messages per second for a single queue and 2,000 entities per second for a single table partition, well below the account-level numbers above.

Durable Functions sits on top of those numbers. Every task hub creates two Azure Tables (<hub>History, <hub>Instances), one work-item queue, one control queue per partition, and blob containers for leases and large messages (Azure Storage representation in a task hub). When several apps point at the same storage account, every one fights for the same per-partition 2,000 entities/s on the History and Instances tables. Microsoft's reliability guidance is direct: "use a separate storage account for each function app. This aspect is especially true with Durable Functions and Event Hubs triggered functions" (Best practices for reliable Azure Functions).

Service Bus session locks

The architectural smell is two or more functions in the same function app triggering on the same session-enabled queue. Sessions hold an exclusive lock per session ID (Message sessions): only one receiver at a time, per-session FIFO, default MaxDeliveryCount = 10. The Functions session host defaults are aggressive (Service Bus host.json settings): maxConcurrentSessions = 2000, maxConcurrentCalls = 16 per session. Thread starvation at high MaxConcurrentSessions is documented in the troubleshooting guide.

Two functions in the same app are not parallelising work over those sessions. They are competing for session locks. SessionLockLost shows up when the lock expires before renewal, the partition rebalances, or the AMQP link is idle for 10 minutes (Service Bus messaging exceptions: SessionLockLost). The fix is one consumer per session-enabled queue per app, period.

Function count vs feature count

Microsoft does not publish a number for "too many functions in one app", but the Function organization best practices describe the failure mode plainly:

Each function that you create has a memory footprint. While this footprint is usually small, having too many functions within a function app can lead to slower startup of your app on new instances.

Connection strings and other credentials stored in application settings gives all of the functions in the function app the same set of permissions in the associated resource. Consider minimizing the number of functions with access to specific credentials by moving functions that don't use those credentials to a separate function app.

A self-test you can run in 10 minutes:

  1. Can you name the feature each function delivers without opening the code?
  2. Does explaining one feature require drawing a diagram of 5+ function boundaries?
  3. Is your function-to-feature ratio above 3:1? (50 functions for 8 features is more than 6:1.)
  4. Do all functions share the same connection strings whether they use them or not?
  5. Do load profiles inside the app diverge? (Chatty queue trigger next to memory-heavy report function.)
  6. Does shipping one function redeploy 49 others?
  7. Does cold start time grow with every release?
  8. Does the same Durable task hub serve more than one feature area?

Three or more is the smell. Six or more is "this should have been split a quarter ago." The W19 split-apps approach handles the first wave of this. If splitting still leaves you fighting the platform, the article you are reading is the second wave.

Coupling patterns that fight the serverless model

Some of the worst Functions deployments do not look bad on any one screen. The handlers are clean, every function is short, the metrics are healthy. The damage is in the topology: the way the functions wire to each other multiplies cost, slows change, or quietly burns money in a loop. Four shapes show up over and over.

Sequential chains (the service with three methods)

Function A writes a message. Function B is triggered by it. Function C is triggered by B's message. Every input traverses the same three hops, with no branching and no fan-out. It is a workflow, not a serverless decomposition.

Microsoft's Pipes and Filters pattern is explicit on when not to use it: "the processing steps performed by an application aren't independent, or they have to be performed together as part of a single transaction." A three-step lockstep chain meets that condition. The same page points at the Compute Resource Consolidation pattern for the fix: "You can group filters that should scale together in the same process."

Cost per hop is five charges:

  1. Source-function invocation (per-execution + GB-second).
  2. Queue write (one transaction).
  3. Queue read by the next function (another transaction).
  4. Serialize + deserialize (CPU on both sides).
  5. New consumer invocation (per-execution + GB-second again).

A three-function chain triples that. The corrected shapes are documented: collapse to one Functions invocation that calls the three operations as private methods (single trigger, single execution, no inter-function queues), or move to Durable Functions function chaining if the steps are genuinely separate but coordinated. The orchestrator keeps durable state and tracks the choreography explicitly.
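
A minimal sketch of the consolidated shape (queue, message type, and the three private steps are illustrative): one trigger, one execution, no inter-function queues.

[Function(nameof(HandleOrder))]
public async Task HandleOrder(
    [ServiceBusTrigger("orders", Connection = "ServiceBus")] OrderMessage message,
    CancellationToken cancellationToken)
{
    // The former functions A, B, and C become private methods on the same class.
    var validated = await ValidateAsync(message, cancellationToken);
    var enriched = await EnrichAsync(validated, cancellationToken);
    await PersistAsync(enriched, cancellationToken);
}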

Shared database schemas across Function Apps

Microsoft's Data considerations for microservices is unambiguous:

Two services shouldn't share a data store. Each service manages its own private data store, and other services can't access it directly.

Services can safely share the same physical database server. Problems occur when services share the same schema, or they read and write to the same set of database tables.

Two Function Apps on the same Azure SQL server with separate schemas is fine. Two Function Apps writing the same Orders table is the antipattern.

The cost surfaces at migration time. Additive changes (new column, new table) work. Destructive changes (rename, drop, type change, NOT NULL on a populated column) require all N apps to agree on a release order: forward-compat code shipped in advance to every app, or a coordinated cutover that removes the ability of any one app to deploy independently. Two-phase migrations (expand-and-contract) become the default. The blue/green compatibility window is the intersection of every app's deploy windows. Each schema version must be readable and writable by every prior version of every consumer that might still be running (multitenant antipatterns).

If three apps share an Orders table and one ships weekly, one biweekly, one monthly, the slowest cadence sets the floor. A hosted service that owns the schema collapses N consumers to 1, and migrations stop being a coordination problem. The trade is what you give up: per-function scaling and trigger-binding ergonomics. The Saga pattern: Context and problem acknowledges the trade explicitly.

Circular queue dependencies and poison loops

Two defaults to cite side by side, because readers conflate them:

  • Storage Queue maxDequeueCount = 5. After 5 failures the message goes to <originalqueue>-poison (Storage queue host.json settings).
  • Service Bus MaxDeliveryCount = 10. After 10 attempts the message goes to the DLQ at <queue path>/$deadletterqueue (Service Bus dead-letter queues). "There's no automatic cleanup of the DLQ. Messages remain in the DLQ until you explicitly retrieve them."

Functions handles the settlement automatically: "By default, the runtime calls Complete on the message if the function finishes successfully, or calls Abandon if the function fails" (Service Bus trigger: PeekLock). The two antipatterns that subvert this safety net:

  1. Retry queue that re-enqueues to the input queue. Handler catches the exception, writes the message back to the input queue with a transient-failure tag, returns success. The runtime never sees a failure, dequeueCount resets each round trip, and the message lives forever. The MessageId changes because the application is publishing a new message each time, so log correlation by ID misses it.
  2. Dead-letter handler that re-triggers the original function. A second function with a trigger on the poison/DLQ subqueue picks up failed messages and calls back into the original function's logic (or worse, writes back to the original input queue). Result: input -> processing -> poison -> handler -> input ad infinitum. Service Bus eventually surfaces QuotaExceeded (messaging exceptions). Storage Queues do not fail nearly as loudly. The loop just costs money until somebody notices the bill.
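
Both antipatterns swallow the failure the runtime is watching for. A minimal sketch of the handler shape that keeps the safety net intact (queue name and settler are illustrative):

[Function(nameof(ProcessPayment))]
public async Task ProcessPayment(
    [ServiceBusTrigger("payments", Connection = "ServiceBus")] PaymentMessage message,
    CancellationToken cancellationToken)
{
    // No catch-and-re-enqueue. A throw here makes the runtime Abandon the message,
    // DeliveryCount increments on the same MessageId, and after MaxDeliveryCount
    // attempts the broker dead-letters it. The loop never forms, and the DLQ is
    // drained by a person or a tool, not by a trigger that feeds the input queue.
    await settler.SettleAsync(message, cancellationToken);
}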

Microsoft names the failure shape directly. The Choreography pattern carries the warning: "There's a risk of cyclic dependency between saga participants because they have to consume each other's commands."

Detecting poison loops in Application Insights

The End-to-end transaction details view shows a Gantt chart of every server-side telemetry event for a correlated operation_Id across all instrumented components. For a Functions chain, each invocation shows up as a request span with the queue-dependency calls between them, all under the same operation. N+1 invocations of the same function name under the same operation_Id is the loop signature. The Application Map makes the cycle visible at the topology level: a node with a self-edge or a tight A->B->A cycle.

The detection workflow is two clicks: Failures view, drill into a sample exception, the End-to-end transaction view opens with the trace tree expanded. If you cannot tell at a glance whether a transaction is one logical request or three loops of the same one, instrument before you refactor.

Configuration drift

Splitting one Function App into N copies the configuration N times. Each app has its own connection strings, API keys, storage keys in App Settings. Rotating a secret means N updates. Missing one leaves a stale credential in production until the next deploy.

Key Vault references move the storage out of App Settings and into a vault. The syntax (Use Key Vault references as app settings) takes one of two forms:

@Microsoft.KeyVault(SecretUri=https://myvault.vault.azure.net/secrets/mysecret)
@Microsoft.KeyVault(VaultName=myvault;SecretName=mysecret)

The failure mode the same page documents: "If a reference isn't resolved properly, the reference string is used instead." Real production failure pattern: the function tries to authenticate with the literal string @Microsoft.KeyVault(...), gets a 401, and the operator stares at App Settings that look correct in the portal. The WEBSITE_KEYVAULT_REFERENCES env var holds resolution status for every reference, and the portal exposes a Key Vault Application Settings Diagnostics detector. Both are worth knowing before the first incident.
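
A cheap guard worth adding to Program.cs, sketched here with an illustrative setting name: if the value still looks like the reference syntax, resolution failed, so fail fast at startup instead of throwing 401s at a downstream service.

var apiKey = builder.Configuration["Payments:ApiKey"];
if (string.IsNullOrEmpty(apiKey) ||
    apiKey.StartsWith("@Microsoft.KeyVault", StringComparison.OrdinalIgnoreCase))
{
    throw new InvalidOperationException(
        "Payments:ApiKey was not resolved from Key Vault; " +
        "check the reference syntax and the app's managed identity permissions.");
}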

The "partially adopted" antipattern is the worst version: some apps reference @Microsoft.KeyVault(...), others have raw values because the migration was incomplete. Rotating the secret in the vault updates the references and leaves the raw-value apps stale. Configuration looks correct in the portal until the next failure surface, usually a 401 from a downstream service hours after rotation.

Azure App Configuration collapses N App Settings stores into one. "Spreading configuration settings across these components can lead to hard-to-troubleshoot errors during an application deployment. Use App Configuration to store all the settings for your application and secure their accesses in one place." App Configuration handles non-secret config and holds Key Vault references for secret values. Key Vault stays the secret store. The trade is one runtime dependency, N role assignments, and refresh semantics that are opt-in (without dynamic refresh, settings are read once at startup). For two-app workloads it might not pay. For ten-app workloads it almost always does (App Configuration: high resiliency).
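
A minimal wiring sketch, assuming the Microsoft.Extensions.Configuration.AzureAppConfiguration and Azure.Identity packages (endpoint and key filter are illustrative): App Configuration supplies the settings, and its Key Vault references resolve with the same managed identity.

builder.Configuration.AddAzureAppConfiguration(options =>
{
    var credential = new DefaultAzureCredential();
    options.Connect(new Uri("https://shared-config.azconfig.io"), credential)
           .Select("Payments:*")
           .ConfigureKeyVault(kv => kv.SetCredential(credential));
});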

The cost crossover point

"Functions is too expensive at scale" and "Functions is the cheap option" are both true. They describe different points on the same curve. The crossover happens earlier than most teams plan for, and the worked example below makes it concrete.

A worked example: 10 RPS at 200 ms / 256 MB

Assumptions: 10 RPS sustained for 30 days, 200 ms execution, 256 MB memory. East US 2. Single subscription with the per-month free grants applied. All numbers verified against the Azure Retail Prices API on 2026-05-08.

Volumes:

  • Executions: 10 * 3,600 * 24 * 30 = 25,920,000
  • GB-seconds: 0.25 * 0.2 * 25,920,000 = 1,296,000

(Table: monthly cost per host at 10 RPS / 200 ms / 256 MB.)

The Flex number is the trap. Flex bills a minimum of 1 second per execution and rounds up in 100 ms increments above that (Flex billing). At 200 ms / 256 MB, every invocation bills as if it ran a full second, which is 5x the GB-seconds and pushes the bill 9x above legacy Consumption. Teams switching from Consumption to Flex "for the per-function scaling" walk into this and call the platform expensive. Verify on real workload before committing.

The crossover with EP1 is the other number worth keeping in your head. Setting Consumption_total = EP1_floor and solving for RPS at 200 ms / 256 MB, where 2.592 is the combined per-execution plus GB-second charge per sustained RPS per month at this shape and 6.60 is the monthly free grant:

Cost ≈ 2.592 * R - 6.60 = 145.93
R ≈ 58.8 RPS sustained

At the 200 ms / 256 MB shape, Consumption ties EP1 around 60 RPS sustained. Below that, Consumption wins on cost. Above that, EP1 starts winning and the Premium-only reasons (always-on, VNet integration, longer timeouts) compound the case. The crossover shifts with execution length: at 1 s / 256 MB it falls to about 12 RPS, and at 50 ms it pushes past 240 RPS. Pick the shape that matches your workload before you quote a number.

AKS gets one sentence: if you need Kubernetes primitives, you are no longer comparing against Functions, and the comparison belongs in a different article.

The hidden bill in App Insights

App Insights bills through Log Analytics (App Insights billing) at $2.76 per GB ingested above the 5 GB free grant per workspace. For most apps the default adaptive sampling at 5 events/second per host keeps ingestion under the grant (Sampling in Application Insights). The failure mode is operational: an engineer disables sampling to chase a bug, forgets to re-enable, and the next month's bill arrives. At 100 RPS unsampled with 10 KB telemetry events, ingestion is 86 GB/day, which is 2.6 TB/month, which is roughly $7,180/month at the PAYG rate.
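
If sampling does get touched during an investigation, pinning it in host.json keeps the change visible in the repo instead of in a portal toggle. A minimal sketch of the default-ish shape (exclusion list is illustrative):

{
  "version": "2.0",
  "logging": {
    "applicationInsights": {
      "samplingSettings": {
        "isEnabled": true,
        "maxTelemetryItemsPerSecond": 5,
        "excludedTypes": "Exception"
      }
    }
  }
}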

The mitigation is a daily cap, set per workspace. The portal default for a workspace-based App Insights resource is 100 GB/day, but resources created via Visual Studio default to 32.3 MB/day (Daily cap). Whichever number you pick, set it before someone disables sampling.

The other hidden bill: the storage account

Every Function App requires a general-purpose storage account (Storage considerations). With one or two apps, storage is rounding error. With thirty apps after a W19-style split, it becomes a line item:

  • Queue triggers: every poll is a Class 2 op, roughly one per second when idle. Idle alone is 86,400 polls/day * 30 / 10,000 * $0.004 ≈ $1/month per queue. Real processing adds put + get + delete = 3 ops per message.
  • Durable task hubs: orchestration and history tables grow into millions of rows with replays, plus control queues, work-item queues, and instance tables. A busy task hub easily reaches $20-50/month before the orchestrator's compute cost.
  • Internal runtime traffic: lease blobs for the scale controller, host locks. Negligible per app, multiplied by N apps it shows up.

Microsoft's own guidance is to put each app on its own storage account, especially for Durable Functions and Event Hubs triggers. That doubles or triples your storage line item, and it is still the right call.

People-time

The cost crossover argument rarely decides the migration. People-time almost always does.

  • Debugging across N function apps requires correlated query plumbing in App Insights plus a mental model of which app owns which trigger.
  • Each function app is its own deployment unit, so coordinated releases need a release pipeline that understands ordering and rollback.
  • An on-call alert that fires "queue X is backing up" requires the operator to know which app owns queue X, which trigger, and which version is deployed.
  • Configuration drift compounds with app count, as the previous section already showed.

Cognitive load scales super-linearly with app count. It shows up as slower MTTR, not as a Functions bill line item, and that is exactly why it stays invisible until the team is exhausted.

Making the decision: stay, refactor, or migrate

Most "outgrowing Functions" complaints are organisational, not technical. Apply the W19 split-apps approach first, then revisit. The four concrete signals that say you can stay: no timeout pressure (longest job under half the plan limit), no memory pressure (peak under 60% of the instance ceiling), SNAT and connection issues fixed by IHttpClientFactory, function-to-feature ratio under 3:1. If all four hold, the platform is not the problem.

Refactor within Functions

This is W19 territory: split into multiple apps, extract a shared library, isolate triggers by scaling profile. If you are on Consumption, the next step before Premium is Flex Consumption (per-function scaling, longer timeouts, larger SKUs). The caveat from the cost section bears repeating: Flex's 1-second billing floor punishes sub-second functions. Verify on real workload before committing.

For workloads that bump the timeout but can be split, the checkpoint pattern from the Performance walls section keeps you on Functions: chunk, process, commit cursor, repeat. That keeps the trigger ergonomics, the deployment unit, and the scale controller. The cost is one cursor blob per active batch.

Extract specific functions

The middle ground. The function app stays, and one or two functions move out, usually because they hit a wall the rest of the app does not.

  • Long-running jobs. Hosted service in App Service, or a Container Apps job. Same business logic, no per-invocation cap. The companion sample shows the same payment-settlement workload as a BackgroundService next to its Function App original. The diff is the hosting wrapper, not the algorithm:

    public sealed class SettlementWorker(
        QueueClient queueClient,
        IPaymentSettler settler,
        IOptions<QueueOptions> queueOptions,
        SettlementWorkerStatus status,
        ILogger<SettlementWorker> logger) : BackgroundService
    {
        private readonly QueueOptions _options = queueOptions.Value;

        protected override async Task ExecuteAsync(CancellationToken stoppingToken)
        {
            await queueClient.CreateIfNotExistsAsync(cancellationToken: stoppingToken);
    
            while (!stoppingToken.IsCancellationRequested)
            {
                var response = await queueClient.ReceiveMessagesAsync(
                    maxMessages: _options.MaxBatchMessages,
                    visibilityTimeout: TimeSpan.FromSeconds(_options.VisibilityTimeoutSeconds),
                    cancellationToken: stoppingToken);
    
                if (response.Value.Length == 0)
                {
                    await Task.Delay(_options.IdlePollingDelayMs, stoppingToken);
                    continue;
                }
    
                foreach (var message in response.Value)
                {
                    await ProcessAsync(message, stoppingToken);
                }
            }
        }
    }
    

    Same IPaymentSettler from the shared library, same queue, no per-invocation timeout. The trade is paying for an always-on worker (App Service) or accepting cold-start on first replica scale-out (Container Apps with KEDA).

  • Stateful workflows hitting Durable limits. Cross-app sub-orchestration is impossible, and shared task hub contention is real. Logic Apps Standard or a hosted workflow engine (Temporal, Elsa, Conductor) trades the binding ergonomics for a workflow surface that scales the way Durable does not.

  • CPU- or memory-bound work above EP3. Container Apps with the Dedicated workload profile, or AKS if Kubernetes is already a platform decision in your org.

Full migration

Rare. Criteria: timeout + memory + connection + cost crossover all present, all blocking, and the W19 organisational fixes have been applied without relief. The migration path is contract-first: extract HTTP triggers into thin wrappers, push business logic into testable libraries (a *.Core project consumed by both the Function App and the destination host), then move binding by binding.

The companion sample lays this out concretely. Settlement.Core is consumed unchanged by Settlement.FunctionApp (the timeout-wall starting point), Settlement.AppService (always-on, adds HTTP endpoints), and Settlement.ContainerApp (KEDA-scaled, scale-to-zero, no web host). The diffs are pure hosting concerns. The algorithm and the contract are identical. Reading the three side by side makes the migration question stop being abstract: you are pricing one wrapper against another, not rewriting the workload.

Decision matrix

(Table: outgrowing Functions, signal-by-signal decision matrix.)

The cost crossover at the bottom of the table is the one most teams reach for first. The decision rarely turns on it. It turns on people-time and on whether the abstraction is fighting your design or supporting it. When the abstraction is supporting the design, the right answer is almost always "stay and clean up the wiring."

Wrap-up

The four signals from the opening map to the four sections you just read. Timeouts, memory, sockets, and file system are the platform telling you the workload is too big. Sub-orchestration sprawl, session-lock contention, and high function-to-feature ratios are your team telling you the deployment unit is too big. Sequential chains, shared schemas, and poison loops are the topology making both worse. The cost crossover decides whether the right move is a different plan, a different host, or a different problem statement. None of the signals on its own says "migrate." Two of them at once says "look harder." Three of them blocking says "the abstraction is no longer paying its freight."

When you last looked at outgrowing Functions, did you stay and refactor, or did you extract the workload to a different host?

Azure Functions Beyond the Basics
Continues from Azure Functions for .NET Developers (Parts 1-9)

