DEV Community

Martin Oehlert

Azure Functions Observability: From Blind Spots to Production Clarity

Azure Functions for .NET Developers: Series


Your function works locally, passes all tests, and deploys without errors. But how do you know it's healthy at 2am when a queue-triggered function silently drops messages? With a traditional web app on a VM, you'd SSH in, check logs, inspect process health. Serverless strips all of that away.

The observability gap in serverless is real, and it's structural. Your function runs inside an ephemeral container that spins up on demand, processes an event, and disappears. There's no persistent server to monitor, no process to attach a debugger to, no /var/log to tail. When your function app scales to zero between invocations, even continuous metric collection breaks down: there is literally nothing running to emit telemetry.

And when it scales from zero to fifty concurrent instances under load, correlating a single failed request across that distributed execution becomes a different problem entirely. Traditional APM tools assume long-lived processes with stable identities. Serverless functions violate every one of those assumptions.

Application Insights fills that gap. When connected to your function app (via a connection string, not the deprecated instrumentation key), it automatically captures request telemetry for every function execution, tracks dependencies like HTTP calls and database queries, collects host-level performance counters, and aggregates invocation metrics you can query from the portal or through code.

On top of that, it gives you structured log queries with KQL (Kusto Query Language), an application map that visualizes how your function calls downstream services, distributed tracing that follows a single request across multiple functions and dependencies, and alerting rules that can page you before your users notice something is wrong.

The examples below use the classic Application Insights SDK with the isolated worker model, which is what most production .NET function apps run today. The companion repository at azure-functions-samples has working examples of both the classic SDK (HttpTriggerDemo) and OpenTelemetry (EventHubDemo).

Setting Up Application Insights

Creating the Resource

You can create an Application Insights resource through the Azure Portal (search "Application Insights" and click Create) or provision it with Bicep alongside your function app:

resource appInsights 'Microsoft.Insights/components@2020-02-02' = {
  name: 'appi-orders-prod'
  location: location
  kind: 'web'
  properties: {
    Application_Type: 'web'
    WorkspaceResourceId: logAnalyticsWorkspace.id
  }
}

The resource gives you a connection string, which looks like InstrumentationKey=<guid>;IngestionEndpoint=https://region.in.applicationinsights.azure.com/. Use this, not the instrumentation key alone. Microsoft deprecated standalone instrumentation key ingestion in March 2025, and connection strings are required for sovereign clouds, regional endpoints, and Entra ID-authenticated ingestion. Store the value in your function app's APPLICATIONINSIGHTS_CONNECTION_STRING application setting, and Azure picks it up automatically.
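For local development, the same setting goes in local.settings.json rather than Azure app settings. A sketch with placeholder values (the GUID and endpoint are dummies):

```json
{
  "IsEncrypted": false,
  "Values": {
    "AzureWebJobsStorage": "UseDevelopmentStorage=true",
    "FUNCTIONS_WORKER_RUNTIME": "dotnet-isolated",
    "APPLICATIONINSIGHTS_CONNECTION_STRING": "InstrumentationKey=00000000-0000-0000-0000-000000000000;IngestionEndpoint=https://westeurope-1.in.applicationinsights.azure.com/"
  }
}
```

Remember that local.settings.json is not deployed; the deployed app reads the value from its application settings.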

NuGet Packages

The isolated worker model needs two packages:

dotnet add package Microsoft.ApplicationInsights.WorkerService --version 2.22.0
dotnet add package Microsoft.Azure.Functions.Worker.ApplicationInsights --version 2.0.0

Pin Microsoft.ApplicationInsights.WorkerService to 2.22.0. Version 3.0.0 migrated to OpenTelemetry internally and broke the ITelemetryInitializer interface that Microsoft.Azure.Functions.Worker.ApplicationInsights depends on. The result is a TypeLoadException at startup:

System.TypeLoadException: Could not load type
'Microsoft.ApplicationInsights.Extensibility.ITelemetryInitializer'
from assembly 'Microsoft.ApplicationInsights, Version=3.0.0.1'

This affects every .NET version (not just .NET 10). Until the Functions worker package ships a compatible update, stay on 2.22.0. Add a comment in your .csproj so the next person who runs dotnet outdated doesn't blindly upgrade:

<!-- Do NOT upgrade to 3.x: breaks Functions worker. See github.com/Azure/azure-functions-dotnet-worker/issues/3322 -->
<PackageReference Include="Microsoft.ApplicationInsights.WorkerService" Version="2.22.0" />
<PackageReference Include="Microsoft.Azure.Functions.Worker.ApplicationInsights" Version="2.0.0" />
<!-- Check https://github.com/Azure/azure-functions-dotnet-worker/issues/3322 before upgrading either package -->

The second package, Microsoft.Azure.Functions.Worker.ApplicationInsights, is what connects your dependency telemetry (HTTP calls, SQL queries, queue operations) back to the parent function invocation. Without it, correlation breaks.

Program.cs Configuration

Two method calls handle the setup:

using Microsoft.Azure.Functions.Worker;
using Microsoft.Azure.Functions.Worker.Builder;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

var builder = FunctionsApplication.CreateBuilder(args);

builder.Services
    .AddApplicationInsightsTelemetryWorkerService()
    .ConfigureFunctionsApplicationInsights();

builder.Build().Run();

AddApplicationInsightsTelemetryWorkerService() registers the Application Insights SDK for worker-style apps (background services, Functions). ConfigureFunctionsApplicationInsights() hooks into the Functions runtime's activity pipeline so that incoming triggers and outbound calls produce the right request and dependency telemetry.

See this in context in the HttpTriggerDemo Program.cs.

One catch: the SDK registers a default logging filter that suppresses everything below Warning. If you leave it in place, your ILogger.LogInformation() calls never reach Application Insights. Remove it explicitly:

using System.Linq;
using Microsoft.Azure.Functions.Worker;
using Microsoft.Azure.Functions.Worker.Builder;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

var builder = FunctionsApplication.CreateBuilder(args);

builder.Services
    .AddApplicationInsightsTelemetryWorkerService()
    .ConfigureFunctionsApplicationInsights();

builder.Logging.Services.Configure<LoggerFilterOptions>(options =>
{
    LoggerFilterRule? defaultRule = options.Rules.FirstOrDefault(
        rule => rule.ProviderName ==
            "Microsoft.Extensions.Logging.ApplicationInsights.ApplicationInsightsLoggerProvider");

    if (defaultRule is not null)
    {
        options.Rules.Remove(defaultRule);
    }
});

builder.Build().Run();

Alternatively, manage log levels through appsettings.json (loaded automatically by FunctionsApplication.CreateBuilder):

{
  "Logging": {
    "LogLevel": {
      "Default": "Information",
      "Microsoft": "Warning"
    },
    "ApplicationInsights": {
      "LogLevel": {
        "Default": "Information"
      }
    }
  }
}

What Gets Auto-Collected vs. What You Add Manually

Once that's in place, the SDK and the Functions runtime collect telemetry automatically, with no extra code:

[Image: auto-collected vs. manually added telemetry]
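For the manually added side, you can inject TelemetryClient, which AddApplicationInsightsTelemetryWorkerService() registers in DI, and emit custom events and metrics yourself. A minimal sketch; the function, queue name, event name, and metric name are illustrative:

```csharp
using System.Collections.Generic;
using Microsoft.ApplicationInsights;
using Microsoft.Azure.Functions.Worker;

public class ShipOrderFunction(TelemetryClient telemetryClient)
{
    [Function(nameof(ShipOrderFunction))]
    public void Run([QueueTrigger("shipments")] string orderId)
    {
        // Custom event: one row in the customEvents table, queryable by its properties
        telemetryClient.TrackEvent("OrderShipped",
            new Dictionary<string, string> { ["OrderId"] = orderId });

        // Custom metric: pre-aggregated in the SDK before sending; lands in customMetrics
        telemetryClient.GetMetric("OrdersShipped").TrackValue(1);
    }
}
```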

Gotcha: Two Log Pipelines, Two Configurations

The isolated worker model runs your code in a separate process from the Functions host. This means host.json controls logging for the host process (trigger dispatch, scaling decisions, extension lifecycle), while your Program.cs or appsettings.json controls logging for the worker process (your function code, your dependencies, your ILogger calls). If you set "Default": "Information" in host.json but never configure your worker, your application logs still default to Warning only. You have to configure both sides, and they use different files.

Logging Best Practices

Structured Logging with ILogger

Structured logging writes log entries as key-value pairs instead of flat strings. In Application Insights, those keys become columns in the customDimensions property of the traces table, which means you can filter and aggregate by them in KQL without parsing text.

The ILogger API supports this through message templates: named placeholders wrapped in curly braces, filled by positional arguments.

public class ProcessOrderFunction(ILogger<ProcessOrderFunction> logger)
{
    [Function(nameof(ProcessOrderFunction))]
    public async Task Run(
        [QueueTrigger("orders")] OrderMessage message)
    {
        logger.LogInformation(
            "Order received: {OrderId} from customer {CustomerId}, total {OrderTotal}",
            message.OrderId,
            message.CustomerId,
            message.Total);

        var stopwatch = Stopwatch.StartNew();
        await ProcessAsync(message);
        stopwatch.Stop();

        logger.LogInformation(
            "Order {OrderId} processed successfully in {ElapsedMs}ms",
            message.OrderId,
            stopwatch.ElapsedMilliseconds);
    }
}

Two things to watch here. First, use PascalCase for placeholder names ({OrderId}, not {orderId} or {order_id}). Application Insights stores these as customDimensions keys, and PascalCase matches the convention for the rest of the telemetry schema. Second, never use string interpolation ($"Order {orderId}") in log calls. Interpolated strings defeat structured logging entirely: the provider receives a pre-formatted string with no queryable fields, and the arguments are evaluated even when the log level is disabled.
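The difference in two lines, using the same logger and message from the example above:

```csharp
// Bad: interpolated string. No queryable fields, and the arguments
// are evaluated even when Information-level logging is disabled.
logger.LogInformation($"Order {message.OrderId} received");

// Good: message template. OrderId becomes a customDimensions key,
// and evaluation is skipped when the level is disabled.
logger.LogInformation("Order {OrderId} received", message.OrderId);
```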

In the Application Insights traces table, a query like this pulls all logs for a specific order:

traces
| where customDimensions.OrderId == "ORD-20260330-1847"
| project timestamp, message, customDimensions.CustomerId, customDimensions.ElapsedMs
| order by timestamp asc

High-Performance Logging with Source Generators

For functions processing thousands of messages per second (high-throughput queue or Event Hub triggers), the standard LogInformation extension methods have measurable overhead: they box value-type arguments, allocate a params object[] on every call, and parse the message template at runtime.

The [LoggerMessage] source generator eliminates all three costs by generating strongly typed methods at compile time:

public static partial class OrderLogs
{
    [LoggerMessage(
        EventId = 1001,
        Level = LogLevel.Information,
        Message = "Order {OrderId} received from {CustomerId}, total {OrderTotal}")]
    public static partial void OrderReceived(
        ILogger logger, string orderId, string customerId, decimal orderTotal);

    [LoggerMessage(
        EventId = 1002,
        Level = LogLevel.Warning,
        Message = "Order {OrderId} retry attempt {RetryCount}")]
    public static partial void OrderRetrying(
        ILogger logger, string orderId, int retryCount);
}

Call these directly: OrderLogs.OrderReceived(logger, message.OrderId, message.CustomerId, message.Total). The generated code includes a log-level check before evaluating any arguments, so you pay zero cost when Information-level logging is disabled in production. Use this pattern on hot paths; for functions running a few times per minute, the standard extension methods are fine. The companion repo has source generator examples in both HttpTriggerDemo/Logging/OrderLogs.cs and EventHubDemo/Logging/SensorLogs.cs.

Correlation with BeginScope

Individual log lines tell you what happened; BeginScope ties related entries into a single operation. When you wrap your function body in a scope, every log entry inside it automatically inherits the scope's properties as customDimensions in Application Insights.

The critical detail: you must pass a Dictionary<string, object> to BeginScope for the keys to appear as individual customDimensions columns. A plain string or a message template with arguments produces a single formatted string in the scope, which is much harder to query.

[Function(nameof(ProcessOrderFunction))]
public async Task Run(
    [QueueTrigger("orders")] OrderMessage message)
{
    using var scope = logger.BeginScope(new Dictionary<string, object>
    {
        ["OrderId"] = message.OrderId,
        ["CustomerId"] = message.CustomerId,
        ["TenantId"] = message.TenantId
    });

    logger.LogInformation("Validating order");
    await ValidateAsync(message);

    logger.LogInformation("Charging payment");
    await ChargePaymentAsync(message);

    logger.LogInformation("Order complete");
}

Every log line inside that using block now carries OrderId, CustomerId, and TenantId in its customDimensions, even the ones from ValidateAsync and ChargePaymentAsync (assuming they use the same ILogger instance). This is how you trace a complete business operation across multiple internal methods without threading correlation IDs through every method signature.

Log Levels for Production

The host.json logLevel section controls which categories reach Application Insights. Two categories are easy to misconfigure, and getting them wrong silently breaks your monitoring dashboards.

{
  "logging": {
    "logLevel": {
      "default": "Warning",
      "Function": "Information",
      "Host.Results": "Information",
      "Host.Aggregator": "Trace"
    }
  }
}

Host.Results feeds the requests table. If you raise this above Information, successful function executions stop appearing in the Application Insights Performance and Failures blades, and the Function Monitor tab in the portal goes blank. You lose your primary visibility into whether functions are running at all.

Host.Aggregator feeds the customMetrics table with aggregated counts and durations. Set it to Trace so the runtime writes every batch. If you raise this to Warning or higher, the function overview dashboard in the portal loses its success rate and duration charts.

The default: Warning baseline keeps noise low for framework categories (Microsoft.*, Worker, System.*) while Function: Information ensures your own function logs reach Application Insights.

When you need to change log levels without redeploying, override any host.json value through app settings. Prefix the path with AzureFunctionsJobHost and join each level of JSON nesting with double underscores (a dot inside a category name, like Function.ProcessOrder, stays a dot):

AzureFunctionsJobHost__logging__logLevel__Function.ProcessOrder = Debug

This takes effect on the next function host restart (which happens automatically when you update an app setting) and lets you temporarily increase verbosity for a single function without touching host.json.
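With the Azure CLI, that override can be applied and later removed without opening the portal. A sketch; the app and resource group names are placeholders:

```shell
# Raise ProcessOrder's log level to Debug for the current incident
az functionapp config appsettings set \
  --name func-orders-prod --resource-group rg-orders \
  --settings "AzureFunctionsJobHost__logging__logLevel__Function.ProcessOrder=Debug"

# Remove the override once you're done debugging
az functionapp config appsettings delete \
  --name func-orders-prod --resource-group rg-orders \
  --setting-names "AzureFunctionsJobHost__logging__logLevel__Function.ProcessOrder"
```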

Sampling Configuration

Application Insights enables adaptive sampling by default, targeting 20 telemetry items per second per host. At low volume, you won't notice. At scale, sampling can silently discard traces, dependencies, and custom events before they reach your workspace.

The recommended production configuration excludes Request and Exception from sampling, so you never lose function execution records or error details:

{
  "logging": {
    "applicationInsights": {
      "samplingSettings": {
        "isEnabled": true,
        "maxTelemetryItemsPerSecond": 20,
        "excludedTypes": "Request;Exception"
      }
    }
  }
}

To check whether sampling is actively dropping data, run this KQL query. Any row where TelemetrySavedPercentage is below 100 means that telemetry type is being sampled:

union traces, dependencies, requests
| where timestamp > ago(1d)
| summarize
    TelemetrySavedPercentage = round(100.0 / avg(itemCount), 1),
    TelemetryDroppedPercentage = round(100.0 - 100.0 / avg(itemCount), 1)
    by bin(timestamp, 1h), itemType
| where TelemetrySavedPercentage < 100
| order by timestamp desc

The itemCount field on each telemetry item tells you how many similar items it represents. An itemCount of 5 means Application Insights kept one item and estimated it represents five. If your traces show 30% dropped, either raise maxTelemetryItemsPerSecond or add Trace to excludedTypes for the categories that matter most to your debugging workflow. Watch your ingestion costs, though: excluding too many types from sampling at high volume can push you past your daily data cap quickly.
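itemCount is also how you correct counts in your own queries: count() returns only the rows that survived sampling, while sum(itemCount) estimates the pre-sampling volume.

```kql
requests
| where timestamp > ago(1d)
| summarize SampledRows = count(), EstimatedActual = sum(itemCount) by name
| order by EstimatedActual desc
```

A large gap between the two columns for a given function is another quick signal that sampling is active.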

Reading Traces and Metrics

Once telemetry is flowing into Application Insights, you need to know where to look and what to ask. The portal gives you three entry points: Transaction Search for hunting specific executions, Log queries (KQL) for anything that requires aggregation or correlation, and the Application Map for a visual snapshot of your function app's dependencies.

Transaction Search

Transaction Search is the fastest way to find what happened to a specific function execution. Open it from the left nav in your Application Insights resource, or use the shortcut from the Investigate section of the overview blade.

The filters that matter most for Azure Functions:

  • Operation name: the function name as registered in the runtime (e.g., ProcessOrderFunction). Filter here when you want all executions of a specific function in a time window.
  • Result code: for HTTP triggers, this is the HTTP status code (200, 500, etc.). For non-HTTP triggers (queue, timer, blob), 0 means success and 1 means failure. Combine with operation name to pull only failed runs.
  • Time range: narrow this first, before adding other filters. Application Insights searches can time out on broad time ranges at high volume.

Click any result to open the End-to-end transaction view. This is where distributed tracing pays off: you'll see the full execution timeline as a Gantt chart, with your function's request at the top and every outbound dependency (HTTP calls to payment APIs, SQL queries, Service Bus operations) shown as child spans with their durations. If a queue-triggered ProcessOrderFunction call failed at 2am, this view tells you whether the failure was in your code, in a downstream HTTP call, or in a database query.

One limitation: Transaction Search shows individual telemetry items, not aggregated data. If you want to know "how many orders failed between 1am and 3am and which customer IDs were affected", you need KQL.

KQL Essentials for Azure Functions

All four tables you'll use most (requests, dependencies, traces, exceptions) share an operation_Id column. That ID is the distributed trace ID that ties every log line, dependency call, and exception back to the single function invocation that produced them.

Finding slow executions

The requests table records every function invocation. duration is in milliseconds.

requests
| where timestamp > ago(24h)
| where name == "ProcessOrderFunction"
| where duration > 5000
| project timestamp, id, duration, resultCode, operation_Id
| order by duration desc

This gives you the slowest ProcessOrderFunction executions in the last 24 hours. Swap > 5000 for whatever your SLA threshold is. The operation_Id in each row is your entry point into the full trace for that execution.

Tracing a single request end-to-end

You have an operation_Id from a failed execution (from Transaction Search, from an alert, or from a support ticket). This query reconstructs everything that happened:

let traceId = "abc123def456";
union requests, dependencies, traces, exceptions
| where operation_Id == traceId
| project timestamp, itemType, name, message, duration, success, resultCode, customDimensions
| order by timestamp asc

The union across all four tables is deliberate. A single function execution produces rows in multiple tables: a requests row for the invocation itself, dependencies rows for every outbound call, traces rows for your ILogger calls, and an exceptions row if something threw. The itemType column tells you which table each row came from.

If you set up BeginScope with OrderId and CustomerId as described in the logging section, those values appear in customDimensions on every trace row. You can also work backwards from a business ID when you don't have an operation_Id:

traces
| where timestamp > ago(24h)
| where customDimensions.OrderId == "ORD-20260330-1847"
| project operation_Id
| distinct operation_Id

Take that operation_Id and feed it into the union query above.

Counting failures by function name over time

requests
| where timestamp > ago(7d)
| where success == false
| summarize FailureCount = count() by bin(timestamp, 1h), name
| order by timestamp desc, FailureCount desc

This surfaces which functions fail most, and whether failures cluster at specific times (a sign of a dependency being unhealthy during a maintenance window, or a batch job hitting a resource limit).

Finding dependency bottlenecks

Your function may be fast; a downstream service may not be.

dependencies
| where timestamp > ago(24h)
| where cloud_RoleName == "your-function-app-name"
| summarize
    CallCount = count(),
    P50 = percentile(duration, 50),
    P95 = percentile(duration, 95),
    P99 = percentile(duration, 99),
    FailureRate = round(100.0 * countif(success == false) / count(), 1)
    by target, type
| order by P95 desc

Replace "your-function-app-name" with the value in your function app's Application Insights configuration (it defaults to the function app name). The target column shows the external endpoint or database, and type shows the dependency kind (HTTP, SQL, Azure Service Bus, etc.). A high P95 with a low FailureRate means the dependency is slow but not failing outright: the kind of problem that shows up as user-visible latency before it shows up as errors.

One gotcha with KQL in the portal: queries run against a Log Analytics workspace, and there's a default query scope. If you open KQL from the Application Insights blade, you're automatically scoped to that resource's workspace. If you open it from a general Log Analytics workspace, you need to add | where cloud_RoleName == "your-function-app-name" to every query, or you'll get results mixed across all resources in the workspace.

Application Map

The Application Map (left nav, under Investigate) renders your function app as a node and every dependency it calls as connected nodes. Each connection shows call volume, average duration, and failure rate. Nodes turn yellow when failure rates exceed roughly 20-30% and red above 50% (the thresholds aren't configurable).

For a ProcessOrderFunction that calls a payment API and writes to SQL, you'd see three nodes: your function app in the centre, the payment API to one side, and the SQL database to the other. The lines between them show call counts and P95 latency. If the payment API node is yellow, that's your first place to look during an incident.

The map is useful for a quick health check and for onboarding new team members, but it has limits. It aggregates across all functions in the app, so if you have ten functions and one is hammering a slow dependency, the map shows the aggregate. It also doesn't distinguish between functions calling the same dependency: if both ProcessOrderFunction and RefundOrderFunction call the same SQL database, the database shows one aggregated node. For function-level dependency analysis, go back to the KQL query above.
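That per-function breakdown is one summarize away: operation_Name on a dependency row is the function whose invocation made the call.

```kql
dependencies
| where timestamp > ago(24h)
| summarize
    CallCount = count(),
    P95 = percentile(duration, 95),
    FailureRate = round(100.0 * countif(success == false) / count(), 1)
    by operation_Name, target, type
| order by P95 desc
```

This splits the aggregated SQL node from the map into one row per calling function, so you can see whether ProcessOrderFunction or RefundOrderFunction is the one hammering the database.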

Alerts

Alert rules in Application Insights let you define a condition and trigger an action group (email, Teams webhook, PagerDuty, etc.) when the condition is met. You configure them under the Alerts section of your Application Insights resource.

To create a failure rate alert for ProcessOrderFunction:

  1. Select Create > Alert rule.
  2. Set the signal to Custom log search.
  3. Use this KQL as the condition:

    requests
    | where name == "ProcessOrderFunction"
    | where timestamp > ago(5m)
    | summarize
        Total = count(),
        Failed = countif(success == false)
    | extend FailureRate = round(100.0 * Failed / Total, 1)
    | where FailureRate > 5
    
  4. Set evaluation frequency to every 1 minute, and the lookback window to 5 minutes.

  5. Configure the threshold: alert when the query returns any rows (meaning FailureRate > 5 for that window).

Action groups are the notification mechanism. One action group can send email, post to a Teams incoming webhook, and call an Azure Automation runbook simultaneously. Define your on-call action group once, then reuse it across all alert rules.

A few practical notes on alert tuning:

  • Start with a 5-minute window and a 5% threshold, then tighten after you've seen a few weeks of baseline data. Alerting on 1-minute windows at 1% failure rate on a low-volume function produces a lot of noise for transient errors.
  • The requests table has a 1-2 minute ingestion delay under normal conditions and up to 5 minutes during ingestion spikes. A 5-minute lookback window accounts for this. A 1-minute window can miss failures entirely if ingestion is delayed.
  • For queue-triggered functions, complement failure rate alerts with a queue depth alert on the source queue (configured through Azure Monitor metrics, not Application Insights). A growing queue combined with low invocation count means your function is failing at startup, before it even executes: a scenario that produces no requests rows and won't trigger a failure rate alert.

Common Issues and Fixes

"My logs aren't showing up in Application Insights"

It comes down to one of four things, and you can rule them out in order.

Check the connection string first. Open your function app in the portal, go to Configuration, and confirm APPLICATIONINSIGHTS_CONNECTION_STRING is set and points to the right resource. If it's missing or set to an instrumentation key only (the InstrumentationKey=<guid> format without an IngestionEndpoint), nothing reaches Application Insights at all.

Check the worker log level config. As covered in the two-pipeline gotcha above: host.json controls the host process, but your Program.cs or appsettings.json controls the worker. If you haven't explicitly configured the worker's ApplicationInsights log level, the SDK default applies: Warning only. Add this to your appsettings.json:

{
  "Logging": {
    "ApplicationInsights": {
      "LogLevel": {
        "Default": "Information"
      }
    }
  }
}

Or remove the default filter rule entirely in Program.cs, as shown in the setup section.

Check sampling. If requests appear but traces for specific functions don't, sampling may be discarding them. Run the KQL query from the sampling section to see which telemetry types are being dropped. Add Trace to excludedTypes in host.json if you need full trace fidelity for a critical function.

Check the Function log level in host.json. If Function is set to Warning or higher, LogInformation calls from your function code never leave the host. Set it to Information to restore them.

"Dependencies are missing from the Application Map"

When your Application Map shows your function app as an isolated node with no outbound edges, check for a missing ConfigureFunctionsApplicationInsights() call in Program.cs.

AddApplicationInsightsTelemetryWorkerService() registers the SDK. ConfigureFunctionsApplicationInsights() is what connects the Functions runtime's ActivitySource to that SDK so outbound HTTP, SQL, and Azure SDK calls produce dependency telemetry with the correct operation IDs. Without it, dependencies are either not tracked at all or tracked with broken correlation (they appear in the dependencies table but don't link back to the parent request).

builder.Services
    .AddApplicationInsightsTelemetryWorkerService()
    .ConfigureFunctionsApplicationInsights(); // Required for dependency tracking

If both calls are present and you're still missing HTTP dependencies: check how HttpClient is registered. The Application Insights SDK instruments HttpClient via IHttpClientFactory. If you're creating HttpClient instances with new HttpClient() directly instead of injecting an IHttpClientFactory-managed instance, those calls bypass the instrumentation entirely.

// Not tracked
private readonly HttpClient _client = new HttpClient();

// Tracked (inject IHttpClientFactory via primary constructor)
public class MyFunction(IHttpClientFactory httpClientFactory)
{
    private readonly HttpClient _client = httpClientFactory.CreateClient();
}

Register it in Program.cs:

builder.Services.AddHttpClient();

"I see duplicate telemetry for every request"

In the isolated worker model, both the host process and the worker process can emit telemetry for the same function invocation. When both are sending to the same Application Insights resource, you get duplicate requests entries, inflated counts, and misleading failure rates.

This is controlled by the telemetryMode setting at the root level of host.json (not inside logging). When it's unset, both sides are allowed to emit. Setting it to OpenTelemetry resolves the duplication, but note that when you do, the logging.applicationInsights section of host.json no longer applies:

{
  "version": "2.0",
  "telemetryMode": "OpenTelemetry"
}

Alternatively, suppress host-side request telemetry while keeping your worker-side telemetry by raising Host.Results above Information in host.json's logLevel section. The tradeoff: this also removes successful execution records from the portal's Function Monitor tab. Use telemetryMode when you want clean deduplication without losing host-side visibility.

To confirm duplication before changing anything, run this in KQL:

requests
| where timestamp > ago(1h)
| summarize count() by operation_Id
| where count_ > 1
| order by count_ desc

Any operation_Id appearing more than once is a duplicated invocation.

"Cold start latency spikes in my metrics"

Cold starts produce latency spikes that look identical to slow execution in your metrics. Before investigating application code, confirm whether a spike is a cold start or an actual regression.

A cold start request carries a specific pattern: high latency on the first invocation from a given instance, with subsequent requests from the same instance running at normal duration. The cloud_RoleInstance dimension on each request record identifies the instance.

requests
| where timestamp > ago(24h)
| where name == "ProcessOrderFunction"
| summarize
    first_request = min(timestamp),
    p50 = percentile(duration, 50),
    p99 = percentile(duration, 99),
    request_count = count()
    by cloud_RoleInstance
| extend is_cold_start_instance = (request_count <= 3)
| order by first_request desc

Instances where request_count is 1 or 2 are almost certainly fresh scale-out instances, and their durations are not representative of your steady-state performance. Filter them out when computing your SLA metrics:

requests
| where timestamp > ago(24h)
| where name == "ProcessOrderFunction"
| join kind=inner (
    requests
    | summarize request_count = count() by cloud_RoleInstance
    | where request_count > 5
) on cloud_RoleInstance
| summarize p50 = percentile(duration, 50), p99 = percentile(duration, 99)

If the spike appears in warm instances, you have a real slowdown. If it's limited to fresh instances appearing after a scale-out event, it's cold start behavior. The two require different responses: cold starts call for pre-warming strategies or Consumption to Premium plan migration; actual slowdowns point to profiling.

"Alerts fire but I can't find the failing requests"

You set up an alert on exception count, it fires, you open the Failures blade, and the requests that caused the exceptions are gone. Sampling is discarding the evidence.

By default, Exception telemetry is sampled alongside everything else. When the SDK keeps one exception and estimates it represents five, the other four are discarded permanently. Your alert fires because the metric aggregation (which runs before sampling discards anything) saw all five. Your query returns only the one that survived.

The fix is to exclude Exception from sampling in host.json:

{
  "logging": {
    "applicationInsights": {
      "samplingSettings": {
        "isEnabled": true,
        "maxTelemetryItemsPerSecond": 20,
        "excludedTypes": "Request;Exception"
      }
    }
  }
}

Adding Request to excludedTypes ensures the parent request record is also always kept, so you can correlate the exception back to its invocation through the operation_Id. Without both, you may find the exception but not the request that caused it.
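With both types excluded from sampling, the correlation becomes a straightforward join on operation_Id. A sketch, assuming the default Application Insights schema:

```kusto
exceptions
| where timestamp > ago(1h)
| join kind=leftouter (
    requests
    | project operation_Id, request_name = name, duration, resultCode
) on operation_Id
| project timestamp, problemId, outerMessage, request_name, duration, resultCode
```

A row with an empty request_name means the exception survived sampling but its parent request did not — exactly the situation excludedTypes is meant to prevent.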

If the alert is on a custom metric rather than exceptions, check whether customMetrics is being sampled. Custom metrics emitted through TelemetryClient.GetMetric() are not affected by sampling (they're pre-aggregated in the SDK before sending). Custom events emitted with TelemetryClient.TrackEvent() are sampled, and alerts based on custom event counts can suffer the same problem. Add Event to excludedTypes if that's your signal source.
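You can also check directly whether a signal source is being sampled: stored records carry an itemCount column giving the number of original items each record represents, so any value above 1 means sampling is active. A quick sanity check against custom events:

```kusto
customEvents
| where timestamp > ago(1h)
| summarize
    stored_records = count(),
    represented_items = sum(itemCount),
    sampled_records = countif(itemCount > 1)
```

If represented_items exceeds stored_records, sampling is thinning that table, and count-based alerts will not match what your queries return.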

OpenTelemetry Alternative

The classic Application Insights SDK works well if your entire stack lives in Azure. But if you need to send telemetry to Grafana, Datadog, Jaeger, or any other backend alongside (or instead of) Azure Monitor, you're duplicating instrumentation code for each target. OpenTelemetry solves this at the protocol level: one set of instrumentation, one exporter interface, multiple backends.

OpenTelemetry is a CNCF project that defines a vendor-neutral API and wire format (OTLP) for traces, metrics, and logs. The same instrumentation code that sends data to Application Insights can also send to Zipkin or Collector pipelines with a config change, not a code rewrite.

The Setup

Microsoft publishes the Microsoft.Azure.Functions.Worker.OpenTelemetry package for this purpose, paired with the Azure Monitor exporter:

dotnet add package Microsoft.Azure.Functions.Worker.OpenTelemetry
dotnet add package OpenTelemetry.Extensions.Hosting
dotnet add package Azure.Monitor.OpenTelemetry.Exporter

First, enable OpenTelemetry output from the Functions host by adding "telemetryMode": "OpenTelemetry" at the root of your host.json (the same setting described in the duplicate telemetry section). Then the Program.cs registration replaces the classic SDK calls:

using Azure.Monitor.OpenTelemetry.Exporter;
using Microsoft.Azure.Functions.Worker;
using Microsoft.Azure.Functions.Worker.Builder;
using Microsoft.Azure.Functions.Worker.OpenTelemetry;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

var builder = FunctionsApplication.CreateBuilder(args);

builder.Services
    .AddOpenTelemetry()
    .UseFunctionsWorkerDefaults()
    .UseAzureMonitorExporter(); // reads APPLICATIONINSIGHTS_CONNECTION_STRING automatically

builder.Build().Run();

UseFunctionsWorkerDefaults() hooks into the Functions runtime's ActivitySource for proper distributed trace correlation (the OpenTelemetry equivalent of ConfigureFunctionsApplicationInsights() from the classic SDK). Without it, dependency telemetry won't correlate back to the parent function invocation. See the full setup in EventHubDemo/Program.cs.
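For reference, the host.json change from the first step is a one-line addition at the root of the file. This is a minimal fragment; keep your existing settings, such as the logging section, alongside it:

```json
{
  "version": "2.0",
  "telemetryMode": "OpenTelemetry"
}
```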

The connection string is still required; UseAzureMonitorExporter() reads APPLICATIONINSIGHTS_CONNECTION_STRING from the environment the same way the classic SDK does. If you also want to export to a second backend (requires the additional OpenTelemetry.Exporter.OpenTelemetryProtocol package), register it separately:

builder.Services
    .AddOpenTelemetry()
    .UseFunctionsWorkerDefaults()
    .UseAzureMonitorExporter();

builder.Services
    .AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .AddOtlpExporter(otlp =>
        {
            otlp.Endpoint = new Uri("http://localhost:4317");
        }));

UseAzureMonitorExporter() is a cross-cutting registration that configures all signals at once. Chaining signal-specific exporters like AddOtlpExporter after it in the same builder can throw a NotSupportedException. Separate AddOpenTelemetry() calls avoid the conflict.

What You Gain

W3C Trace Context is the default propagation format, which means your distributed traces correlate correctly with other OpenTelemetry-instrumented services regardless of what backend they report to. With the classic SDK you get this too, but only within the Application Insights ecosystem; outside it, the format diverges.
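Concretely, W3C Trace Context propagates a single traceparent HTTP header in the format version-traceid-parentid-traceflags; the example value here is the one used in the W3C specification:

```
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
```

Any OpenTelemetry-instrumented service that parses this header reconstructs the same trace, regardless of which backend it reports to; the classic SDK's legacy hierarchical Request-Id format is only understood inside the Application Insights ecosystem.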

You also get multi-backend export: Azure Monitor for your ops team, a Grafana stack for your platform team, and a local Collector for local debugging, all from the same process. And if you ever migrate off Azure Monitor entirely, you replace one exporter registration, not every TelemetryClient call in your codebase.

What You Lose Today

The OpenTelemetry path is not at feature parity with the classic SDK for Functions specifically.

Live Metrics (the real-time stream in the Application Insights blade of the Azure portal) does not work with the distro. It relies on a proprietary push mechanism in the classic SDK that has no OpenTelemetry equivalent yet.

Snapshot Debugger is unavailable. It's a classic SDK feature with no OTLP counterpart.

Auto-collection gaps: some dependency types that the classic SDK instruments automatically (certain Azure SDK operations, Service Bus settlement calls) may not be captured out of the box, depending on which OpenTelemetry instrumentation libraries you've added. You may need to add AddAzureClientsInstrumentation() or equivalent packages explicitly.

Documentation: the distro's documentation for Functions scenarios specifically is thin. Most samples target ASP.NET Core web apps; you'll spend time adapting them and testing whether auto-collection works for your trigger types.

When to Choose Which

Use the classic SDK if:

  • your entire workload runs on Azure and you have no multi-vendor requirements
  • you need Live Metrics or Snapshot Debugger
  • you want the richest out-of-the-box experience with the Application Insights portal today

Use OpenTelemetry if:

  • you're sending telemetry to multiple backends, or planning to
  • the rest of your services are already OpenTelemetry-instrumented and you need consistent trace propagation across the board
  • you're building something that might not always live in Azure

If you're greenfield on a purely Azure stack, the classic SDK is less configuration for the same result right now. If you're instrumenting a heterogeneous system or building for portability, OpenTelemetry's overhead is worth it; you pay once at setup and gain the flexibility when requirements change.

What's Next

This is Part 9 and the final article in the core series. Over nine weeks, this series went from "what is serverless" to querying production telemetry in KQL. If you followed along and built something, you now have a function app with HTTP and queue triggers, proper configuration with Key Vault, a CI/CD pipeline through GitHub Actions, and Application Insights wired up for structured logging, distributed tracing, and alerting. That covers the full lifecycle: build, test, deploy, monitor.

The companion repository at azure-functions-samples has working code for every article in the series. Clone it, break things, wire up your own alerts.

Next week is a bonus article outside the core series: production cost realities on the Consumption plan, and the signals that tell you it's time to move to Flex Consumption or Premium. If you've ever wondered why your monthly bill looked nothing like the pricing calculator, that one is for you.

When you first wired up monitoring on a production function app, which alert did you set up first: failure rate or latency?

