Blackthorn Vision

Posted on May 18

Azure OpenAI + Semantic Kernel in a .NET SaaS: What Breaks in Production and How to Fix It

#dotnet #azure #kernel #openai

Adding Azure OpenAI and Semantic Kernel to a .NET SaaS product is straightforward in a demo environment. The integration works, responses stream cleanly, the Semantic Kernel plugin system handles function calling elegantly, and the team ships a compelling proof of concept in a few weeks. Then the feature reaches production users, and the problems that staging never surfaced start appearing: latency spikes on concurrent requests, token costs that are 3 to 5 times the estimate, 429 rate limit errors under load, and an observability gap that makes it impossible to diagnose which component is responsible when something goes wrong.

At Blackthorn Vision, a Microsoft Solutions Partner specializing in .NET modernization and Azure AI integration, we have built Azure OpenAI and Semantic Kernel integrations into several enterprise .NET SaaS platforms. The failure modes below are not edge cases. They are the patterns that appear predictably when an integration moves from controlled demo conditions to real user load, and each one has a reliable fix.

The latency problem nobody plans for

A .NET SaaS application built on synchronous request handling is not a natural host for LLM calls, and the mismatch shows up immediately in production. Azure OpenAI API calls for GPT-4-class models return responses in 5 to 30 seconds depending on prompt length, output length, and current service load. Microsoft's own latency guidance notes that response time scales with output token count, because generation is an iterative sequential process, one token at a time.

In many enterprise ASP.NET deployments, request timeouts are configured at several layers at once: the application itself, IIS, a reverse proxy, Application Gateway, and the client. Most of these defaults were set long before LLM calls were a consideration. When a 20-second generation meets a shorter legacy timeout, the result is a silent failure that is difficult to diagnose, because it surfaces as a generic timeout error rather than an AI-specific problem.

The fix has two parts. First, streaming: Semantic Kernel supports streaming responses via InvokeStreamingAsync (for a kernel function) and InvokePromptStreamingAsync (for an inline prompt), which begin returning tokens as soon as the model starts generating rather than waiting for the complete response. This does not reduce total generation time, but it avoids most client-side and idle-connection timeouts because data keeps flowing while the model generates, and it produces a substantially better user experience because the interface responds immediately.

// Instead of waiting for the full response:
// var result = await kernel.InvokePromptAsync(prompt);

// Stream tokens as they arrive.
// responseStream is assumed to be a StreamWriter over the HTTP response body.
await foreach (var chunk in kernel.InvokePromptStreamingAsync(prompt))
{
    await responseStream.WriteAsync(chunk.ToString());
    await responseStream.FlushAsync(); // push each chunk to the client immediately
}

Second, the hosting environment needs explicit review of timeout settings at every layer between the user and the model: application-level HttpClient timeouts, IIS request timeouts, Application Gateway idle timeout, and any load balancer configuration that sits in the request path.
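At the application layer, one way to make the timeout explicit is to give every LLM call its own deadline and to report a deadline breach as an AI-specific error. The following is a minimal sketch rather than a prescribed implementation: LlmDeadline and RunAsync are hypothetical names, and the budget you pass in is your own choice.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: applies one explicit deadline to an LLM call and also
// cancels when the caller's token (for example HttpContext.RequestAborted) fires.
public static class LlmDeadline
{
    public static async Task<T> RunAsync<T>(
        Func<CancellationToken, Task<T>> call,
        TimeSpan budget,
        CancellationToken requestAborted = default)
    {
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(requestAborted);
        cts.CancelAfter(budget);
        try
        {
            return await call(cts.Token).ConfigureAwait(false);
        }
        catch (OperationCanceledException)
            when (cts.IsCancellationRequested && !requestAborted.IsCancellationRequested)
        {
            // The deadline fired, not the client: surface an AI-specific error
            // instead of a generic cancellation.
            throw new TimeoutException(
                $"LLM call exceeded its {budget.TotalSeconds:0}s budget.");
        }
    }
}
```

Choosing a budget slightly shorter than the shortest proxy or gateway timeout in the request path means the application fails first, with an error message that names the real cause.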

Token cost in production versus the estimate

Azure OpenAI pricing looks predictable until the first real production bill arrives. The pricing calculator shows input and output token rates, which are real, but production deployments consistently cost significantly more than those rates suggest for three reasons that the calculator does not account for.

On many Azure OpenAI models, output tokens are priced higher than input tokens, which means long generated responses often become the real cost driver rather than the prompts themselves. A well-structured prompt for a summarization or analysis task might send a moderate number of input tokens and receive a substantially larger number of output tokens. Most cost estimates based on the Azure pricing calculator undercount this because teams tend to model around their prompt size rather than their expected response size.
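The arithmetic is simple enough to sketch. The rates below are hypothetical placeholders, not current Azure prices, and TokenRates and CostEstimator are illustrative names.

```csharp
using System;

// Hypothetical per-million-token rates; substitute the current prices
// for your model and region.
public sealed record TokenRates(decimal InputPerMillion, decimal OutputPerMillion);

public static class CostEstimator
{
    // Cost of one request: input and output tokens are billed at different rates.
    public static decimal EstimateRequestCost(int inputTokens, int outputTokens, TokenRates rates) =>
        inputTokens / 1_000_000m * rates.InputPerMillion +
        outputTokens / 1_000_000m * rates.OutputPerMillion;
}
```

With assumed rates of 2.50 and 10.00 per million tokens, a request with 1,500 input tokens and 4,000 output tokens costs 0.04375, of which 0.04 comes from the output side. An estimate built only on prompt size would miss more than 90 percent of that figure.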

Retry overhead adds meaningfully to costs in applications that handle 429 responses by immediately retrying without proper backoff. Microsoft's quota documentation specifies that when requests exceed the token rate limit, the API returns a 429 with a Retry-After header indicating how long to wait. Applications that ignore this header and retry immediately increase request pressure on an already-throttled deployment, can prolong the throttling window, and risk additional costs when partial or repeated generations occur.

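Context window management is the third cost driver that staging environments do not reveal. Semantic Kernel's chat history mechanism accumulates conversation turns in memory and sends the entire history with each request. In a multi-turn copilot feature, a conversation that reaches 20 exchanges will send the full 20-turn history as input on turn 21. If each exchange adds roughly t tokens, turn n sends about n × t input tokens, so a conversation of N turns consumes about t × N(N+1)/2 input tokens in total. Without a strategy for truncating or summarizing older context, the input per request grows linearly and the cumulative cost of a conversation grows roughly quadratically with its length.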

A practical approach is to count tokens locally before sending each request, using SharpToken (a .NET port of OpenAI's tiktoken library), and trim the history when it exceeds a defined budget:

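using System.Linq;
using SharpToken;

const int MaxContextTokens = 8000; // example budget; tune per model and feature

var encoding = GptEncoding.GetEncodingForModel("gpt-4o");

// history is assumed to be an existing ChatHistory. In Semantic Kernel 1.x it is
// itself a list of messages, so it is indexed directly.
// Counting content only is approximate: per-message overhead is not included.
var totalTokens = history
    .Sum(m => encoding.Encode(m.Content ?? "").Count);

// Trim oldest turns if over budget (keep system prompt + recent context).
// Assumes the system prompt is at index 0.
while (totalTokens > MaxContextTokens && history.Count > 2)
{
    var removed = history[1]; // skip system prompt at [0]
    history.RemoveAt(1);
    totalTokens -= encoding.Encode(removed.Content ?? "").Count;
}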

This prevents token costs from growing quadratically with conversation length, and it catches the problem before the request is sent rather than after the bill arrives.
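To see how much a budget changes the total, the growth can be simulated with plain arithmetic. This is a simplified model under stated assumptions: every turn adds the same number of tokens, the system prompt is ignored, and ContextGrowth is a hypothetical name.

```csharp
using System;

public static class ContextGrowth
{
    // Total input tokens over a conversation in which every turn adds
    // tokensPerTurn tokens and the accumulated history is resent each time.
    // An optional budget caps the history sent per request, as trimming would.
    public static long CumulativeInputTokens(int turns, int tokensPerTurn, int? budget = null)
    {
        long total = 0;
        for (int turn = 1; turn <= turns; turn++)
        {
            long sent = (long)turn * tokensPerTurn;
            if (budget.HasValue) sent = Math.Min(sent, budget.Value);
            total += sent;
        }
        return total;
    }
}
```

Under these assumptions, 20 turns at 300 tokens each send 63,000 input tokens in total without trimming, and 34,300 with a 2,000-token budget. The gap widens with every additional turn, because the untrimmed total grows with the square of the turn count while the trimmed total grows linearly once the budget is reached.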

Rate limits under concurrent load

Azure OpenAI enforces limits on both tokens per minute (TPM) and requests per minute (RPM) for each deployment. In a multi-tenant SaaS application, multiple users triggering AI features simultaneously will exceed these limits more quickly than single-user testing reveals, and the resulting 429 errors produce a poor experience if the application does not handle them gracefully.

The problems we see most consistently at Blackthorn Vision in enterprise .NET SaaS integrations are:

Single-deployment architectures where all AI traffic goes to one Azure OpenAI deployment. When that deployment hits its TPM limit, all AI features for all users fail simultaneously. The fix is to provision multiple deployments across Azure regions and implement client-side load balancing that distributes requests and falls back to alternative deployments when one returns a 429.
No per-tenant throttling at the application layer. Without application-level rate limiting, a single high-volume tenant can exhaust the shared Azure OpenAI quota for all other tenants. Implementing per-tenant request quotas at the application layer before requests reach Azure OpenAI prevents this failure mode.
Synchronous retry logic that blocks the request thread during the backoff period, for example with Thread.Sleep or by calling .Wait() on a task. This consumes ASP.NET thread pool resources and degrades overall application performance during the period when rate limits are being hit. Awaiting Task.Delay with a CancellationToken for retry backoff keeps threads free during the wait and lets an abandoned request stop retrying.
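The first two fixes can be sketched without any Azure dependency. The code below is an illustration under assumptions, not a production design: TenantRequestLimiter is a simple in-memory fixed-window counter, DeploymentRouter does priority failover rather than true load balancing, and all type names, deployment names, and limits are hypothetical. A multi-instance deployment would need shared state, for example in a distributed cache.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

// Hypothetical fixed-window limiter: each tenant gets maxRequests per window.
public sealed class TenantRequestLimiter
{
    private readonly int _maxRequests;
    private readonly TimeSpan _window;
    private readonly Func<DateTimeOffset> _clock;
    private readonly ConcurrentDictionary<string, (DateTimeOffset Start, int Count)> _windows = new();

    public TenantRequestLimiter(int maxRequests, TimeSpan window, Func<DateTimeOffset>? clock = null)
    {
        _maxRequests = maxRequests;
        _window = window;
        _clock = clock ?? (() => DateTimeOffset.UtcNow);
    }

    // Returns true when the tenant may call Azure OpenAI now; false means the
    // caller should reject the request instead of spending shared quota.
    public bool TryAcquire(string tenantId)
    {
        var now = _clock();
        var updated = _windows.AddOrUpdate(
            tenantId,
            _ => (now, 1),
            (_, current) => now - current.Start >= _window
                ? (now, 1)                                // window expired: start a new one
                : (current.Start, current.Count + 1));    // same window: count the request
        return updated.Count <= _maxRequests;
    }
}

// Hypothetical router: prefers deployments in list order and skips any
// deployment that reported a 429 until its Retry-After period has passed.
public sealed class DeploymentRouter
{
    private readonly IReadOnlyList<string> _deployments;
    private readonly ConcurrentDictionary<string, DateTimeOffset> _throttledUntil = new();

    public DeploymentRouter(IReadOnlyList<string> deployments) => _deployments = deployments;

    public void ReportThrottled(string deployment, TimeSpan retryAfter, DateTimeOffset now) =>
        _throttledUntil[deployment] = now + retryAfter;

    // Returns null when every deployment is currently throttled.
    public string? PickAvailable(DateTimeOffset now)
    {
        foreach (var deployment in _deployments)
        {
            if (!_throttledUntil.TryGetValue(deployment, out var until) || until <= now)
                return deployment;
        }
        return null;
    }
}
```

A request handler would call TryAcquire with the tenant ID first, then PickAvailable to choose a deployment, and call ReportThrottled with the Retry-After value whenever a deployment answers with a 429.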

The observability gap

The hardest production problems to diagnose in an Azure OpenAI integration are the ones that do not produce obvious errors. A request that takes 25 seconds instead of the expected 8 seconds is not failing, but it is degrading user experience significantly. A Semantic Kernel plugin that calls a business logic function and receives an unexpected null value may produce a plausible-looking but incorrect AI response. Without structured logging that captures prompt inputs, token counts, latency, and function call results at each step, these problems are nearly impossible to diagnose systematically.

Semantic Kernel emits logs, metrics, and traces through the standard .NET ILogger, Meter, and ActivitySource APIs, under source names that begin with Microsoft.SemanticKernel, so an OpenTelemetry exporter can collect them. Microsoft's Agent Framework, which merges Semantic Kernel and AutoGen into a unified production SDK released in October 2025, ships with built-in OpenTelemetry integration as a first-class feature. For existing Semantic Kernel integrations, the minimum observability setup that makes production problems diagnosable involves:

Logging prompt templates and rendered prompts (with PII scrubbing) so that unexpected model behavior can be traced to specific inputs.
Capturing token usage per request, broken down by input and output, and attributing it to the feature and tenant that generated the call.
Recording function call results from Semantic Kernel plugins, including failures and unexpected return values, so that incorrect AI outputs can be traced to specific function invocations.
Setting up Azure Monitor alerts for token usage spikes, sustained 429 error rates, and p95 latency thresholds that indicate problems before users report them.
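Token usage and latency attribution can be recorded with the metrics API built into .NET, which OpenTelemetry exporters can pick up. This is a sketch under assumptions: the meter name, instrument names, and tag names are illustrative, and the token counts are assumed to come from the usage data returned with each response.

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

// Hypothetical instrumentation; the meter, instrument, and tag names are illustrative.
public static class AiTelemetry
{
    public static readonly Meter Meter = new("MyApp.AI");

    private static readonly Counter<long> InputTokens =
        Meter.CreateCounter<long>("ai.tokens.input");
    private static readonly Counter<long> OutputTokens =
        Meter.CreateCounter<long>("ai.tokens.output");
    private static readonly Histogram<double> LatencyMs =
        Meter.CreateHistogram<double>("ai.request.duration", unit: "ms");

    // Records one completed model call, attributed to a tenant and a feature.
    public static void RecordCall(
        string tenantId, string feature, long inputTokens, long outputTokens, TimeSpan elapsed)
    {
        var tags = new TagList { { "tenant", tenantId }, { "feature", feature } };
        InputTokens.Add(inputTokens, tags);
        OutputTokens.Add(outputTokens, tags);
        LatencyMs.Record(elapsed.TotalMilliseconds, tags);
    }
}
```

Tagging by tenant makes per-tenant cost reports possible, but on platforms with many tenants it also creates high-cardinality metrics. In that case it may be preferable to put the tenant ID on traces or logs and keep only the feature tag on metrics.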

Without this infrastructure, teams spend days diagnosing problems that could be resolved in hours with the right logging in place.

Data security and keeping data inside your Azure tenant

Enterprise .NET SaaS applications handling sensitive customer data need to ensure that data does not flow outside the customer's Azure tenant boundary during AI processing. This is not guaranteed automatically by using Azure OpenAI: it requires deliberate configuration.

The safer enterprise architecture for regulated industries keeps AI traffic private through Azure networking controls, uses Azure OpenAI resources governed by the customer's Azure subscription rather than shared endpoints, and avoids public endpoint exposure. In practice this means three things: deploying the Azure OpenAI resources within the customer's own Azure subscription, restricting network access via Private Endpoints to the customer's virtual network, and authenticating with a Managed Identity through Microsoft Entra ID (formerly Azure AD) rather than with API keys that could be extracted and reused outside the intended context.
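On the Semantic Kernel side, keyless authentication is a small change in how the chat completion service is registered. The fragment below is a configuration sketch, assuming the Microsoft.SemanticKernel and Azure.Identity packages are referenced. The deployment name and endpoint are placeholders, and the Private Endpoint itself is configured in Azure, not in code.

```csharp
using Azure.Identity;
using Microsoft.SemanticKernel;

var builder = Kernel.CreateBuilder();

// Placeholder deployment name and endpoint; no API key appears anywhere.
// DefaultAzureCredential resolves to the Managed Identity when running in Azure
// and to developer credentials on a local machine.
builder.AddAzureOpenAIChatCompletion(
    "chat-deployment",
    "https://my-resource.openai.azure.com/",
    new DefaultAzureCredential());

var kernel = builder.Build();
```

The identity also needs a role assignment on the Azure OpenAI resource that allows inference calls. Without it, requests fail with an authorization error even though authentication succeeds.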

For Semantic Kernel RAG implementations that use Azure AI Search as a vector store, the same network isolation applies: the search resource should be on the same private virtual network as the Azure OpenAI deployment, with no public endpoint exposure. This architecture is more complex to configure than the default setup, but it is the appropriate baseline for enterprise SaaS platforms handling sensitive customer data in regulated industries.

Semantic Kernel plugin failures in production

Semantic Kernel's plugin system, which allows the AI model to call C# functions as tools during inference, behaves differently under production conditions than in controlled testing. The model makes function calling decisions based on semantic descriptions of what functions do, and those decisions are probabilistic. Under certain input conditions, a model may call the wrong function, call a function with incorrect argument values, or invoke a function multiple times when once was intended.

In a demo environment with a small set of test prompts, these issues rarely surface. In a production SaaS with diverse user inputs, they appear regularly. The fixes are:

Write function descriptions that are unambiguous about what the function does and when it should not be called. Vague descriptions produce inconsistent function selection.
Add validation to every plugin function that checks argument values before executing business logic. Arguments originate as model-generated text that Semantic Kernel converts to the declared parameter types, so a call that should carry a valid integer may instead arrive with an empty string, an out-of-range value, or an unexpected format.
Implement idempotency for any plugin function that has side effects (writes to a database, sends an email, creates a record). If the model calls the function twice due to a planning loop, the second call should produce the same result as the first without duplicating the action.
Log all function invocations, arguments, and return values, and set up alerts for functions called with invalid arguments. This is the only way to discover unexpected model behavior before it produces visible user-facing errors.
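Validation and idempotency can be combined in the plugin function itself. The sketch below uses plain C# so that it stands alone. In a Semantic Kernel plugin, the method would also carry the [KernelFunction] and [Description] attributes. RefundPlugin and its members are hypothetical, the idempotency key is assumed to be generated by the application for each user action rather than by the model, and the in-memory dictionary stands in for a durable store.

```csharp
using System;
using System.Collections.Concurrent;
using System.Globalization;
using System.Threading;

// Hypothetical plugin class; all names are illustrative.
public sealed class RefundPlugin
{
    // Lazy<T> ensures the side effect runs at most once per key, even when
    // two calls with the same key race each other.
    private readonly ConcurrentDictionary<string, Lazy<string>> _completed = new();
    private int _refundsIssued;

    public int RefundsIssued => _refundsIssued;

    public string IssueRefund(string orderId, string amount, string idempotencyKey)
    {
        // Validate before touching business logic; return an error the model can read.
        if (string.IsNullOrWhiteSpace(orderId) || string.IsNullOrWhiteSpace(idempotencyKey))
            return "error: orderId and idempotencyKey are required";
        if (!decimal.TryParse(amount, NumberStyles.Number, CultureInfo.InvariantCulture, out var value)
            || value <= 0)
            return "error: amount must be a positive number";

        // Same key, same result: a repeated call returns the stored outcome
        // without repeating the side effect.
        return _completed.GetOrAdd(idempotencyKey, _ => new Lazy<string>(() =>
        {
            Interlocked.Increment(ref _refundsIssued); // stands in for the real side effect
            return $"refunded {value.ToString("0.00", CultureInfo.InvariantCulture)} for order {orderId}";
        })).Value;
    }
}
```

Returning a descriptive error string instead of throwing gives the model a chance to correct its arguments on the next call. Whether that is preferable to failing the request is a design choice that depends on the feature.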

Prompt injection in enterprise plugins deserves specific attention. When plugin functions accept user-supplied text as arguments, a malicious or poorly formatted input can include instructions that attempt to redirect the model's behavior, such as telling it to ignore previous instructions or call a different function. The practical mitigation is to treat all user-supplied content passed into plugin function arguments as untrusted input: validate it against expected patterns, do not pass raw user text directly into subsequent prompts without sanitization, and use negative constraints in your system prompt that explicitly prohibit the model from following instructions embedded in user content.
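Two of these mitigations can be sketched in a few lines: allow-list validation for structured arguments, and explicit delimiting of free text before it is placed into a later prompt. The order ID format, the delimiter strings, and the UntrustedInput class are assumptions made for illustration. Delimiting reduces the chance that embedded instructions are followed, but it does not eliminate it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UntrustedInput
{
    // Hypothetical allow-list: order IDs are assumed to look like "ORD-123456".
    // \z anchors at the true end of the string, so a trailing newline is rejected.
    private static readonly Regex OrderIdPattern =
        new(@"^ORD-[0-9]{6}\z", RegexOptions.CultureInvariant);

    public static bool IsValidOrderId(string? value) =>
        value is not null && OrderIdPattern.IsMatch(value);

    // Truncates free text and wraps it in explicit delimiters so that a later
    // prompt can refer to it as data. Delimiter look-alikes inside the text
    // are broken up so the user cannot close the block early.
    public static string AsQuotedData(string text, int maxLength = 2000)
    {
        var trimmed = text.Length > maxLength ? text[..maxLength] : text;
        var neutral = trimmed.Replace("<<<", "< < <").Replace(">>>", "> > >");
        return $"<<<USER_TEXT\n{neutral}\n>>>";
    }
}
```

The system prompt can then state that anything between the delimiters is user data and must never be treated as instructions.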

For retry logic across all the failure modes above, Polly integrates cleanly with the HttpClient that Semantic Kernel uses internally:

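using System;
using System.Net;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;
using Polly;
using Polly.Extensions.Http;

// logger is assumed to be an ILogger available in the enclosing scope.
var retryPolicy = HttpPolicyExtensions
    .HandleTransientHttpError() // network failures, 5xx, and 408 responses
    .OrResult(r => r.StatusCode == HttpStatusCode.TooManyRequests)
    .WaitAndRetryAsync(
        retryCount: 4,
        sleepDurationProvider: (attempt, response, _) =>
        {
            // Respect Retry-After header if present
            var retryAfter = response?.Result?.Headers.RetryAfter?.Delta;
            // Otherwise back off exponentially: 2, 4, 8, 16 seconds
            return retryAfter ?? TimeSpan.FromSeconds(Math.Pow(2, attempt));
        },
        onRetryAsync: (_, timespan, attempt, _) =>
        {
            logger.LogWarning("Azure OpenAI throttled. Retry {Attempt} in {Delay}s",
                attempt, timespan.TotalSeconds);
            return Task.CompletedTask;
        });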

This respects the Retry-After header when Azure OpenAI returns it, falls back to exponential backoff when it does not, and logs each retry so that throttling patterns are visible in Application Insights before they become user-facing incidents.
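The Retry-After header can carry either a number of seconds or an absolute date, and an uncapped exponential fallback can grow very large. A small helper that handles both header forms and caps the result might look like the sketch below. RetryDelay is a hypothetical name, and the cap is a parameter you choose.

```csharp
using System;
using System.Net.Http;

public static class RetryDelay
{
    // Computes how long to wait before the next attempt:
    // 1. Retry-After as a delta, 2. Retry-After as a date,
    // 3. exponential backoff (2^attempt seconds). The result is clamped to [0, maxDelay].
    public static TimeSpan From(
        HttpResponseMessage? response, int attempt, DateTimeOffset now, TimeSpan maxDelay)
    {
        var retryAfter = response?.Headers.RetryAfter;

        TimeSpan delay;
        if (retryAfter?.Delta is TimeSpan delta)
            delay = delta;
        else if (retryAfter?.Date is DateTimeOffset date)
            delay = date - now;
        else
            delay = TimeSpan.FromSeconds(Math.Pow(2, attempt));

        if (delay < TimeSpan.Zero) delay = TimeSpan.Zero;
        return delay > maxDelay ? maxDelay : delay;
    }
}
```

Passing the current time in as a parameter keeps the helper deterministic, which makes the date-based branch straightforward to unit test.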

Who this engagement model fits

Blackthorn Vision is brought in when enterprise .NET SaaS teams need to add Azure OpenAI or Semantic Kernel features to a production platform and need a partner who has solved the specific production problems that staging environments do not reveal. Most of the integrations we work on at Blackthorn Vision involve platforms that are already serving customers and cannot afford the kind of production incidents that result from AI features that are production-ready only in demos.

This makes Blackthorn Vision relevant for CTOs and engineering leaders searching for companies with real Azure OpenAI and Semantic Kernel experience in .NET, particularly for enterprise applications where data security, cost control, and production reliability are non-negotiable. Verified client feedback on these engagements is available on the Blackthorn Vision Clutch profile.

If you are evaluating partners for an Azure OpenAI integration into an existing .NET SaaS product, the most useful question to ask is whether they have handled rate limiting, context window management, and plugin validation at scale, because those are the problems that determine whether the feature stays in production or gets rolled back. Blackthorn Vision's AI integration approach and case studies cover both the architecture and the operational details that make the difference between a demo and a shipped feature.
