DEV Community: Blackthorn Vision

Azure OpenAI + Semantic Kernel in a .NET SaaS: What Breaks in Production and How to Fix It

Blackthorn Vision — Mon, 18 May 2026 12:58:34 +0000

Adding Azure OpenAI and Semantic Kernel to a .NET SaaS product is straightforward in a demo environment. The integration works, responses stream cleanly, the Semantic Kernel plugin system handles function calling elegantly, and the team ships a compelling proof of concept in a few weeks. Then the feature reaches production users, and the problems that staging never surfaced start appearing: latency spikes on concurrent requests, token costs that are 3 to 5 times the estimate, 429 rate limit errors under load, and an observability gap that makes it impossible to diagnose which component is responsible when something goes wrong.

At Blackthorn Vision, a Microsoft Solutions Partner specializing in .NET modernization and Azure AI integration, we have built Azure OpenAI and Semantic Kernel integrations into several enterprise .NET SaaS platforms. The failure modes below are not edge cases. They are the patterns that appear predictably when an integration moves from controlled demo conditions to real user load, and each one has a reliable fix.

The latency problem nobody plans for

A .NET SaaS application built on synchronous request handling is not a natural host for LLM calls, and the mismatch shows up immediately in production. Azure OpenAI calls to GPT-4-class models can take 5 to 30 seconds to return a complete response, depending on prompt length, output length, and current service load. Microsoft's own latency guidance notes that response time scales with output token count, because generation is sequential: the model produces one token at a time.

In many enterprise ASP.NET deployments, request timeouts are configured at several layers (the application, IIS, a reverse proxy, Application Gateway, or the client itself), and most of these defaults were set long before LLM calls were a consideration. The mismatch between multi-second LLM calls and legacy timeout configuration causes silent failures that are difficult to diagnose because they surface as generic timeout errors rather than AI-specific problems.

The fix has two parts. First, streaming: Semantic Kernel supports streaming responses via InvokePromptStreamingAsync (or InvokeStreamingAsync for kernel functions), which begins returning tokens as soon as the model starts generating rather than waiting for the complete response. This does not reduce total generation time, but it keeps data flowing on the connection, which avoids most idle-timeout failures between the client and the server, and it produces a substantially better user experience because the interface responds immediately.

// Instead of waiting for the full response:
// var result = await kernel.InvokePromptAsync(prompt);

// Stream tokens as they arrive. responseStream is assumed here to be a
// StreamWriter over the HTTP response body:
await foreach (var chunk in kernel.InvokePromptStreamingAsync(prompt))
{
    await responseStream.WriteAsync(chunk.ToString());
    await responseStream.FlushAsync();
}

Second, the hosting environment needs explicit review of timeout settings at every layer between the user and the model: application-level HttpClient timeouts, IIS request timeouts, Application Gateway idle timeout, and any load balancer configuration that sits in the request path.
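At the application layer, the timeout that most often goes unexamined is the one on the HttpClient that carries the Azure OpenAI call. The following is a minimal sketch, assuming the Semantic Kernel Azure OpenAI connector and its optional httpClient parameter. The deployment name, endpoint, environment variable name, and the 120-second budget are placeholders, not recommended values.

```csharp
using System;
using System.Net.Http;
using Microsoft.SemanticKernel;

// Assumption: 120 seconds is an illustrative budget for long generations.
var httpClient = new HttpClient { Timeout = TimeSpan.FromSeconds(120) };

// Placeholder key lookup; a real deployment would fail fast if the key is missing.
string apiKey = Environment.GetEnvironmentVariable("AZURE_OPENAI_API_KEY") ?? "<api-key>";

var builder = Kernel.CreateBuilder();
builder.AddAzureOpenAIChatCompletion(
    deploymentName: "my-gpt4o-deployment",              // hypothetical deployment name
    endpoint: "https://my-resource.openai.azure.com/",  // hypothetical endpoint
    apiKey: apiKey,
    httpClient: httpClient);                            // the kernel now uses this client's timeout

Kernel kernel = builder.Build();
```

Setting the value in one visible place makes it easier to compare against the IIS, Application Gateway, and load balancer timeouts, which are configured outside the application and need to be at least as long.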

Token cost in production versus the estimate

Azure OpenAI pricing looks predictable until the first real production bill arrives. The pricing calculator shows input and output token rates, which are real, but production deployments consistently cost significantly more than those rates suggest for three reasons that the calculator does not account for.

On many Azure OpenAI models, output tokens are priced higher than input tokens, which means long generated responses often become the real cost driver rather than the prompts themselves. A well-structured prompt for a summarization or analysis task might send a moderate number of input tokens and receive a substantially larger number of output tokens. Most cost estimates based on the Azure pricing calculator undercount this because teams tend to model around their prompt size rather than their expected response size.

Retry overhead adds meaningfully to costs in applications that handle 429 responses by immediately retrying without proper backoff. Microsoft's quota documentation specifies that when requests exceed the token rate limit, the API returns a 429 with a Retry-After header indicating how long to wait. Applications that ignore this header and retry immediately increase request pressure on an already-throttled deployment, can prolong the throttling window, and risk additional costs when partial or repeated generations occur.

Context window management is the third cost driver that staging environments do not reveal. Semantic Kernel's chat history mechanism accumulates conversation turns in memory and sends the entire history with each request. In a multi-turn copilot feature, a conversation that reaches 20 exchanges will send the full 20-turn history as input on turn 21. Without a strategy for truncating or summarizing older context, the input tokens sent per request grow with every turn, and the cumulative cost of the conversation grows much faster than its length.

A practical approach is to count tokens locally before sending each request, using SharpToken (a .NET port of OpenAI's tiktoken library), and trim the history when it exceeds a defined budget:

using System.Linq;
using Microsoft.SemanticKernel.ChatCompletion;
using SharpToken;

var encoding = GptEncoding.GetEncodingForModel("gpt-4o");

// Count tokens in current chat history.
// history is assumed to be a Semantic Kernel ChatHistory, which is itself a list of messages.
var totalTokens = history
    .Sum(m => encoding.Encode(m.Content ?? "").Count);

// Trim oldest turns if over budget (keep system prompt + recent context)
while (totalTokens > MaxContextTokens && history.Count > 2)
{
    var removed = history[1]; // skip system prompt at [0]
    history.RemoveAt(1);
    totalTokens -= encoding.Encode(removed.Content ?? "").Count;
}

This prevents token costs from growing quadratically with conversation length, and it catches the problem before the request is sent rather than after the bill arrives.
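To see why the untrimmed growth is quadratic, here is a rough worked example under simplifying assumptions: every exchange adds the same number of tokens t to the history, and request k resends all k exchanges so far. The 500-token figure is illustrative, not measured.

```latex
\text{input tokens on request } k \approx k\,t
\qquad
\text{cumulative input over } n \text{ requests} \approx \sum_{k=1}^{n} k\,t = \frac{n(n+1)}{2}\,t

n = 20,\quad t = 500:\qquad \frac{20 \cdot 21}{2} \cdot 500 = 105{,}000 \text{ input tokens}
\qquad \text{versus} \qquad 20 \cdot 500 = 10{,}000 \text{ tokens of new content}
```

With a fixed history budget B, each request sends at most about B input tokens, so the cumulative input is bounded by n times B and grows linearly with conversation length instead.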

Rate limits under concurrent load

Azure OpenAI enforces limits on both tokens per minute (TPM) and requests per minute (RPM) for each deployment. In a multi-tenant SaaS application, multiple users triggering AI features simultaneously will exceed these limits more quickly than single-user testing reveals, and the resulting 429 errors produce a poor experience if the application does not handle them gracefully.

The problems we see most consistently at Blackthorn Vision in enterprise .NET SaaS integrations are:

Single-deployment architectures where all AI traffic goes to one Azure OpenAI deployment. When that deployment hits its TPM limit, all AI features for all users fail simultaneously. The fix is to provision multiple deployments across Azure regions and implement client-side load balancing that distributes requests and falls back to alternative deployments when one returns a 429.
No per-tenant throttling at the application layer. Without application-level rate limiting, a single high-volume tenant can exhaust the shared Azure OpenAI quota for all other tenants. Implementing per-tenant request quotas at the application layer before requests reach Azure OpenAI prevents this failure mode.
Synchronous retry logic that blocks the request thread during the backoff period. This consumes ASP.NET thread pool resources and degrades overall application performance during the period when rate limits are being hit. Using Task.Delay with CancellationToken support for retry backoff keeps threads free during the wait.
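Per-tenant throttling can be sketched with the partitioned limiter from System.Threading.RateLimiting. This is a minimal illustration, not a complete quota system: the limit of 2 requests per minute and the tenant IDs are hypothetical, and it counts requests rather than tokens.

```csharp
using System;
using System.Threading.RateLimiting;

// One independent fixed-window limiter per tenant ID.
PartitionedRateLimiter<string> tenantLimiter = PartitionedRateLimiter.Create<string, string>(
    tenantId => RateLimitPartition.GetFixedWindowLimiter(
        partitionKey: tenantId,
        factory: _ => new FixedWindowRateLimiterOptions
        {
            PermitLimit = 2,                      // hypothetical per-tenant budget
            Window = TimeSpan.FromMinutes(1),
            QueueLimit = 0,                       // reject instead of queueing
            QueueProcessingOrder = QueueProcessingOrder.OldestFirst,
            AutoReplenishment = true
        }));

// Returns true when the tenant may call Azure OpenAI now. A false result means the
// application should reject or defer the request before it touches the shared quota.
bool TryAcquireForTenant(string tenantId)
{
    using RateLimitLease lease = tenantLimiter.AttemptAcquire(tenantId);
    return lease.IsAcquired;
}

Console.WriteLine(TryAcquireForTenant("tenant-a")); // True
Console.WriteLine(TryAcquireForTenant("tenant-a")); // True
Console.WriteLine(TryAcquireForTenant("tenant-a")); // False: tenant-a exhausted its window
Console.WriteLine(TryAcquireForTenant("tenant-b")); // True: other tenants are unaffected
```

Because each tenant has its own partition, a noisy tenant exhausts only its own budget. A token-based budget could follow the same shape by acquiring a permit count proportional to the estimated tokens of each request.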

The observability gap

The hardest production problems to diagnose in an Azure OpenAI integration are the ones that do not produce obvious errors. A request that takes 25 seconds instead of the expected 8 seconds is not failing, but it is degrading user experience significantly. A Semantic Kernel plugin that calls a business logic function and receives an unexpected null value may produce a plausible-looking but incorrect AI response. Without structured logging that captures prompt inputs, token counts, latency, and function call results at each step, these problems are nearly impossible to diagnose systematically.

Semantic Kernel emits logs, metrics, and traces through the standard .NET diagnostics APIs, which OpenTelemetry exporters can collect. Microsoft's Agent Framework, which merges Semantic Kernel and AutoGen into a unified SDK and was first released in public preview in October 2025, ships with built-in OpenTelemetry integration as a first-class feature. For existing Semantic Kernel integrations, the minimum observability setup that makes production problems diagnosable involves:

Logging prompt templates and rendered prompts (with PII scrubbing) so that unexpected model behavior can be traced to specific inputs.
Capturing token usage per request, broken down by input and output, and attributing it to the feature and tenant that generated the call.
Recording function call results from Semantic Kernel plugins, including failures and unexpected return values, so that incorrect AI outputs can be traced to specific function invocations.
Setting up Azure Monitor alerts for token usage spikes, sustained 429 error rates, and p95 latency thresholds that indicate problems before users report them.
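The per-request token and latency capture described above can be sketched with the built-in System.Diagnostics.Metrics API, which OpenTelemetry exporters can collect. The meter name, instrument names, and tag keys below are hypothetical, and the token counts are assumed to come from whatever usage data your connector returns with each response.

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

public static class AiTelemetry
{
    // Hypothetical meter and instrument names; align them with your own conventions.
    private static readonly Meter AiMeter = new("MyApp.Ai");
    private static readonly Counter<long> InputTokens = AiMeter.CreateCounter<long>("ai.tokens.input");
    private static readonly Counter<long> OutputTokens = AiMeter.CreateCounter<long>("ai.tokens.output");
    private static readonly Histogram<double> LatencyMs =
        AiMeter.CreateHistogram<double>("ai.request.duration", unit: "ms");

    // Records one completed model call, attributed to the tenant and feature that made it.
    public static void RecordCall(
        string tenantId, string feature, long inputTokens, long outputTokens, TimeSpan latency)
    {
        var tags = new TagList { { "tenant", tenantId }, { "feature", feature } };
        InputTokens.Add(inputTokens, tags);
        OutputTokens.Add(outputTokens, tags);
        LatencyMs.Record(latency.TotalMilliseconds, tags);
    }
}
```

A caller would time the model call with a Stopwatch and then invoke AiTelemetry.RecordCall("tenant-a", "summarize", inputTokens, outputTokens, stopwatch.Elapsed). Tagging by tenant and feature is what lets alerts on token spikes or p95 latency point at a specific source rather than at the deployment as a whole.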

Without this infrastructure, teams spend days diagnosing problems that could be resolved in hours with the right logging in place.

Data security and keeping data inside your Azure tenant

Enterprise .NET SaaS applications handling sensitive customer data need to ensure that data does not flow outside the customer's Azure tenant boundary during AI processing. This is not guaranteed automatically by using Azure OpenAI: it requires deliberate configuration.

The safer enterprise architecture for regulated industries keeps AI traffic private through Azure networking controls, uses Azure OpenAI resources governed by the customer's Azure subscription rather than shared endpoints, and avoids public endpoint exposure. In practice this means deploying the Azure OpenAI resource within the customer's own Azure subscription, restricting network access via Private Endpoints to the customer's virtual network, and authenticating with a Managed Identity through Microsoft Entra ID (formerly Azure AD) rather than API keys that could be extracted and reused outside the intended context.
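On the application side, the keyless part of this setup can be sketched as follows, assuming the Azure.Identity package and the Semantic Kernel connector overload that accepts a token credential. The deployment name and endpoint are hypothetical. Private Endpoint and virtual network restrictions are configured on the Azure resources themselves and do not appear in application code.

```csharp
using Azure.Identity;
using Microsoft.SemanticKernel;

var builder = Kernel.CreateBuilder();

// DefaultAzureCredential resolves to the app's Managed Identity when running in Azure
// and to developer credentials locally, so no API key is stored in configuration.
builder.AddAzureOpenAIChatCompletion(
    "my-gpt4o-deployment",                     // hypothetical deployment name
    "https://my-resource.openai.azure.com/",   // hypothetical endpoint
    new DefaultAzureCredential());

Kernel kernel = builder.Build();
```

The identity also needs an appropriate role assignment on the Azure OpenAI resource before calls succeed.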

For Semantic Kernel RAG implementations that use Azure AI Search as a vector store, the same network isolation applies: the search resource should be on the same private virtual network as the Azure OpenAI deployment, with no public endpoint exposure. This architecture is more complex to configure than the default setup, but it is the appropriate baseline for enterprise SaaS platforms handling sensitive customer data in regulated industries.

Semantic Kernel plugin failures in production

Semantic Kernel's plugin system, which allows the AI model to call C# functions as tools during inference, behaves differently under production conditions than in controlled testing. The model makes function calling decisions based on semantic descriptions of what functions do, and those decisions are probabilistic. Under certain input conditions, a model may call the wrong function, call a function with incorrect argument values, or invoke a function multiple times when once was intended.

In a demo environment with a small set of test prompts, these issues rarely surface. In a production SaaS with diverse user inputs, they appear regularly. The fixes are:

Write function descriptions that are unambiguous about what the function does and when it should not be called. Vague descriptions produce inconsistent function selection.
Add validation to every plugin function that checks argument values before executing business logic. Argument values are generated by the model, so a parameter that is supposed to hold a valid integer may arrive as an empty string, an out-of-range number, or an unexpected format.
Implement idempotency for any plugin function that has side effects (writes to a database, sends an email, creates a record). If the model calls the function twice due to a planning loop, the second call should produce the same result as the first without duplicating the action.
Log all function invocations, arguments, and return values, and set up alerts for functions called with invalid arguments. This is the only way to discover unexpected model behavior before it produces visible user-facing errors.
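The first three fixes can be combined in one plugin function. The sketch below is illustrative: the RefundPlugin class, its rules, and the in-memory dictionary are hypothetical, and the dictionary stands in for a durable idempotency store that would survive restarts and work across instances.

```csharp
using System;
using System.Collections.Concurrent;
using System.ComponentModel;
using System.Globalization;
using System.Linq;
using System.Threading;
using Microsoft.SemanticKernel;

public sealed class RefundPlugin
{
    // In-memory stand-in for a durable idempotency store (database table or cache).
    private readonly ConcurrentDictionary<(string OrderId, decimal Amount), Lazy<string>> _completed = new();
    private int _refundsExecuted;
    public int RefundsExecuted => _refundsExecuted;

    [KernelFunction]
    [Description("Issues a refund for a single order. Do not call this to look up order status.")]
    public string IssueRefund(
        [Description("Numeric order ID, digits only")] string orderId,
        [Description("Refund amount in USD, greater than zero")] string amount)
    {
        // Validate model-supplied arguments before touching business logic.
        if (string.IsNullOrWhiteSpace(orderId) || !orderId.All(c => c is >= '0' and <= '9'))
            return "error: orderId must contain digits only";
        if (!decimal.TryParse(amount, NumberStyles.Number, CultureInfo.InvariantCulture, out var value)
            || value <= 0)
            return "error: amount must be a positive number";

        // The same order and amount map to the same key, so a repeated call returns
        // the first result instead of executing the side effect again.
        return _completed
            .GetOrAdd((orderId, value), _ => new Lazy<string>(() => ExecuteRefund(orderId, value)))
            .Value;
    }

    private string ExecuteRefund(string orderId, decimal value)
    {
        Interlocked.Increment(ref _refundsExecuted); // the real side effect would happen here
        return $"refunded {value.ToString(CultureInfo.InvariantCulture)} USD for order {orderId}";
    }
}
```

Returning an error string rather than throwing gives the model a chance to correct its arguments, while the Lazy wrapper ensures the side effect runs once even if two calls arrive concurrently. Whether order ID plus amount is the right idempotency key depends on the business rules.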

Prompt injection in enterprise plugins deserves specific attention. When plugin functions accept user-supplied text as arguments, a malicious or poorly formatted input can include instructions that attempt to redirect the model's behavior, such as telling it to ignore previous instructions or call a different function. The practical mitigation is to treat all user-supplied content passed into plugin function arguments as untrusted input: validate it against expected patterns, do not pass raw user text directly into subsequent prompts without sanitization, and use negative constraints in your system prompt that explicitly prohibit the model from following instructions embedded in user content.

For retry logic on throttling and transient HTTP failures, Polly can wrap the HttpClient that carries Semantic Kernel's Azure OpenAI calls:

using System.Net;
using Polly;
using Polly.Extensions.Http;

var retryPolicy = HttpPolicyExtensions
    .HandleTransientHttpError()
    .OrResult(r => r.StatusCode == HttpStatusCode.TooManyRequests)
    .WaitAndRetryAsync(
        retryCount: 4,
        sleepDurationProvider: (attempt, response, _) =>
        {
            // Respect Retry-After header if present
            var retryAfter = response?.Result?.Headers.RetryAfter?.Delta;
            return retryAfter ?? TimeSpan.FromSeconds(Math.Pow(2, attempt));
        },
        onRetryAsync: (_, timespan, attempt, _) =>
        {
            logger.LogWarning("Azure OpenAI throttled. Retry {Attempt} in {Delay}s",
                attempt, timespan.TotalSeconds);
            return Task.CompletedTask;
        });

This respects the Retry-After header when Azure OpenAI returns it, falls back to exponential backoff when it does not, and logs each retry so that throttling patterns are visible in Application Insights before they become user-facing incidents.
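One way to check that a policy like this behaves as described, without calling Azure, is to wrap it around a stub handler that throttles once and then succeeds. The sketch below assumes the Polly, Polly.Extensions.Http, and Microsoft.Extensions.Http.Polly packages. ThrottleOnceHandler and RetryingClientFactory are hypothetical names introduced for the illustration.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Http;   // PolicyHttpMessageHandler
using Polly;
using Polly.Extensions.Http;

// Stub standing in for Azure OpenAI: returns 429 with Retry-After on the first call only.
public sealed class ThrottleOnceHandler : HttpMessageHandler
{
    public int Calls;

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        if (Interlocked.Increment(ref Calls) == 1)
        {
            var throttled = new HttpResponseMessage(HttpStatusCode.TooManyRequests);
            throttled.Headers.RetryAfter = new RetryConditionHeaderValue(TimeSpan.FromSeconds(1));
            return Task.FromResult(throttled);
        }
        return Task.FromResult(new HttpResponseMessage(HttpStatusCode.OK));
    }
}

public static class RetryingClientFactory
{
    // Builds an HttpClient whose requests pass through the retry policy before
    // reaching the inner handler.
    public static HttpClient Create(HttpMessageHandler inner)
    {
        IAsyncPolicy<HttpResponseMessage> policy = HttpPolicyExtensions
            .HandleTransientHttpError()
            .OrResult(r => r.StatusCode == HttpStatusCode.TooManyRequests)
            .WaitAndRetryAsync(
                retryCount: 4,
                sleepDurationProvider: (attempt, outcome, _) =>
                    outcome?.Result?.Headers.RetryAfter?.Delta
                        ?? TimeSpan.FromSeconds(Math.Pow(2, attempt)),
                onRetryAsync: (_, _, _, _) => Task.CompletedTask);

        return new HttpClient(new PolicyHttpMessageHandler(policy) { InnerHandler = inner });
    }
}
```

With the stub as the inner handler, a single GetAsync call should come back with 200 after two handler invocations, waiting the one second that Retry-After requested. In production the inner handler would be a regular HttpClientHandler, and the resulting client can be handed to the Semantic Kernel connector through its optional httpClient parameter.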

Who this engagement model fits

Blackthorn Vision is brought in when enterprise .NET SaaS teams need to add Azure OpenAI or Semantic Kernel features to a production platform and need a partner who has solved the specific production problems that staging environments do not reveal. Most of the integrations we work on involve platforms that are already serving customers and cannot afford the production incidents caused by AI features that are ready only for demos.

This makes Blackthorn Vision relevant for CTOs and engineering leaders searching for companies with real Azure OpenAI and Semantic Kernel experience in .NET, particularly for enterprise applications where data security, cost control, and production reliability are non-negotiable. Verified client feedback on these engagements is available on the Blackthorn Vision Clutch profile.

If you are evaluating partners for an Azure OpenAI integration into an existing .NET SaaS product, the most useful question to ask is whether they have handled rate limiting, context window management, and plugin validation at scale, because those are the problems that determine whether the feature stays in production or gets rolled back. Blackthorn Vision's AI integration approach and case studies cover both the architecture and the operational details that make the difference between a demo and a shipped feature.

Strangler Fig Pattern for .NET Modernization: How It Works in a Real Production System

Blackthorn Vision — Mon, 18 May 2026 12:51:41 +0000

The strangler fig pattern is the most practical approach to modernizing a legacy .NET monolith without stopping product delivery. It works by incrementally replacing functionality in the existing system with new services, routing traffic gradually from the old codebase to the new one until the legacy system can be decommissioned. The pattern does not require a feature freeze, does not demand a big-bang cutover, and does not force you to bet the entire modernization project on a single deployment. At Blackthorn Vision, a Microsoft Solutions Partner specializing in .NET modernization and Azure architecture, we use this approach as the default for enterprise .NET platforms where product delivery cannot pause for a rewrite.

This article covers how the pattern actually works in a .NET production context, what the implementation looks like with modern tooling, where it tends to break down, and how to sequence the migration to avoid the failure modes that affect a large share of strangler fig projects.

Why teams reach for the strangler fig pattern

The alternative to the strangler fig pattern is usually described as a big-bang rewrite: stop adding features to the legacy system, build a new version from scratch, and cut over when it is ready. The appeal is obvious. You start with a clean architecture, no legacy constraints, and the full benefit of everything the team has learned since the original system was built.

The problem is that big-bang rewrites fail often enough to make any engineering leader uncomfortable. The most common failure mode is feature parity: the new system consistently runs behind the legacy system in capability, the cutover date slips repeatedly, and eventually leadership loses confidence and either cancels the project or forces a cutover before the new system is ready. Incremental migration is not risk-free either, but the available data is more encouraging. Modernization Intel's analysis of enterprise strangler migration data from 2022 to 2025 found a 76% success rate across 29 tracked strangler fig projects, with median annual savings of $640K in successful engagements and a median sunk cost of $2.1 million in failed ones. A key finding from the same dataset: projects that extracted less than 5% of monolith functionality in the first 90 days had a 92% failure rate, which suggests that early velocity is the strongest predictor of whether a strangler migration succeeds.

The strangler fig pattern sidesteps this problem by keeping the legacy system in production and making the migration reversible at every step. If a newly migrated component behaves incorrectly in production, you route traffic back to the legacy implementation while you investigate. There is no moment where the entire system depends on code that has never handled real production load.

The technical implementation for .NET: YARP as the facade layer

The strangler fig pattern requires a routing layer that sits in front of both the legacy system and the new services. In the .NET ecosystem, the recommended tool for this is YARP (Yet Another Reverse Proxy), a Microsoft-developed reverse proxy library built on ASP.NET Core middleware. Microsoft's own migration guidance for incremental ASP.NET to ASP.NET Core migrations is built around YARP, and it has become the standard approach for .NET strangler fig implementations because it integrates naturally with the existing .NET toolchain.

The setup works like this. You create a new ASP.NET Core project that hosts YARP. Initially, YARP forwards 100% of requests to the legacy .NET Framework application. As you migrate each component, you add routing rules to YARP that send specific routes or request types to the new service instead of the legacy system. The legacy application continues to run and handle everything that has not yet been migrated. From the perspective of users and external systems, nothing changes, because all requests still arrive at the same endpoint.

                ┌──────────────────────────────────────┐
                │            YARP Facade               │
 User / Client ►│         (ASP.NET Core app)           │
                │                                      │
                │  Route: /api/reports ───────────────► New .NET 8 Service
                │  Route: /api/orders  ───────────────► New .NET 8 Service
                │  Route: everything else ────────────► Legacy .NET Framework App
                └──────────────────────────────────────┘

Both systems run in production simultaneously. The routing configuration in YARP is the only thing that changes as each component is migrated. Rolling back a component means updating one routing rule, not redeploying the entire application.
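The routing shown in the diagram can be sketched as a minimal YARP host configured in code. This assumes the Yarp.ReverseProxy package (2.x, which provides LoadFromMemory). The cluster names and backend addresses are hypothetical, and many teams keep the same settings in appsettings.json instead.

```csharp
using System.Collections.Generic;
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;
using Yarp.ReverseProxy.Configuration;

var builder = WebApplication.CreateBuilder(args);

// Hypothetical internal addresses for the two backends.
var clusters = new[]
{
    new ClusterConfig
    {
        ClusterId = "legacy",
        Destinations = new Dictionary<string, DestinationConfig>
        {
            ["legacy-app"] = new() { Address = "https://legacy.internal.example/" }
        }
    },
    new ClusterConfig
    {
        ClusterId = "modern",
        Destinations = new Dictionary<string, DestinationConfig>
        {
            ["new-service"] = new() { Address = "https://new-service.internal.example/" }
        }
    }
};

var routes = new[]
{
    // Migrated components go to the new service.
    new RouteConfig { RouteId = "reports", ClusterId = "modern",
                      Match = new RouteMatch { Path = "/api/reports/{**rest}" } },
    new RouteConfig { RouteId = "orders", ClusterId = "modern",
                      Match = new RouteMatch { Path = "/api/orders/{**rest}" } },
    // Everything else still goes to the legacy app; the higher Order value means
    // this route is considered after the specific routes above.
    new RouteConfig { RouteId = "fallback", ClusterId = "legacy", Order = 100,
                      Match = new RouteMatch { Path = "/{**rest}" } }
};

builder.Services.AddReverseProxy().LoadFromMemory(routes, clusters);

var app = builder.Build();
app.MapReverseProxy();
app.Run();
```

In this shape, rolling back the reports component amounts to removing or disabling the "reports" route so that its traffic falls through to the legacy cluster again.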

The practical implementation steps for a .NET Framework to .NET 8 migration are:

Deploy the YARP-based ASP.NET Core application to Azure App Service alongside the legacy .NET Framework application. Both services run independently, with YARP configured to proxy all traffic to the legacy system as a starting point.
Add the System.Web adapters (the Microsoft.AspNetCore.SystemWebAdapters packages) to both projects. They provide compatibility shims for HttpContext and related types, allowing code that references System.Web to be moved incrementally without rewriting everything that depends on it.
Identify the first component to migrate, ideally something with clear boundaries, reasonable test coverage, and meaningful traffic volume. Starting with a low-traffic component that nobody will notice if it breaks is tempting, but it delays the point at which the team learns how the migration behaves under real load.
Build the new implementation in the ASP.NET Core project, run it in parallel with the legacy implementation, compare outputs to confirm parity, and then update the YARP routing configuration to direct that component's traffic to the new service.
Monitor the component in production for a validation period before moving on to the next component. The length of this period depends on the criticality of the component and the traffic patterns it handles.

How to pick the first component to migrate

Choosing the wrong starting point is one of the most common reasons strangler fig projects stall in the first two months. The instinct is usually to start with something small and contained, which makes sense in principle but often produces a migration that validates the toolchain without validating the approach under realistic conditions.

The criteria that produce a better first component are:

Clear external boundaries: the component has a defined API surface that other parts of the system consume through a stable contract, rather than reaching into shared state or calling internal methods directly.
Measurable output: you can run both implementations against the same inputs and compare outputs programmatically, which is the foundation of the parallel-run validation that makes the strangler fig safe.
Meaningful traffic: the component handles enough requests that production behavior is visible in monitoring within hours, not weeks. This matters because some failure modes only appear under load or in edge cases that staging environments do not produce reliably.
Limited data coupling: the component does not share database tables with multiple other components in ways that make schema changes a cross-system coordination problem.

At Blackthorn Vision, the components we typically migrate first in a .NET Framework monolith are API endpoints that handle well-defined request and response contracts, reporting and data export functions that can be validated by comparing output files, and background processing jobs that can be run in parallel and compared before the legacy version is disabled.

The parallel-run validation approach

Running both implementations in parallel and comparing their outputs is the technical mechanism that makes the strangler fig pattern safe. Without it, you are deploying new code to production and hoping it behaves correctly, which is not meaningfully different from a big-bang migration in terms of risk.

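The parallel-run works by having the facade send each request to both the legacy implementation and the new service simultaneously, recording both responses, and logging any discrepancies. YARP's forwarding sends each request to a single destination, so the mirroring usually takes a small piece of custom middleware in the facade application. The legacy response is returned to the caller, so users always receive the behavior they expect. The new service response is compared in the background. Mirroring is safe as-is only for requests without side effects; for writes, the new service has to run against an isolated data store or in a dry-run mode so that the action is not performed twice. Discrepancies trigger alerts that the team investigates before increasing the traffic percentage routed to the new service.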
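The core of the comparison can be sketched with plain HttpClient calls. This is a simplified illustration for side-effect-free GET requests only. The ParallelRun class is hypothetical, both clients are assumed to have BaseAddress set to their respective backends, and comparing raw bodies for equality is a stand-in for whatever normalization the real responses need.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class ParallelRun
{
    // Sends the same GET to both backends and returns the legacy body to the caller.
    // The comparison is returned as a separate task so the caller never waits on it.
    public static async Task<(string LegacyBody, Task<bool> Comparison)> GetWithShadowAsync(
        HttpClient legacy, HttpClient candidate, string path, Action<string> reportMismatch)
    {
        Task<string> shadow = candidate.GetStringAsync(path); // started first so both calls overlap
        string legacyBody = await legacy.GetStringAsync(path);

        Task<bool> comparison = CompareAsync(shadow, legacyBody, path, reportMismatch);
        return (legacyBody, comparison);
    }

    private static async Task<bool> CompareAsync(
        Task<string> shadow, string legacyBody, string path, Action<string> reportMismatch)
    {
        try
        {
            string candidateBody = await shadow;
            if (candidateBody == legacyBody) return true;
            reportMismatch($"Mismatch on {path}");
        }
        catch (Exception ex)
        {
            // A failing candidate must never affect the user-facing response.
            reportMismatch($"Candidate failed on {path}: {ex.Message}");
        }
        return false;
    }
}
```

The caller responds with LegacyBody immediately and lets Comparison complete in the background. The reportMismatch callback is where structured logging and alerting would hook in.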

This approach requires investment in observability infrastructure that many legacy .NET systems lack. If the existing system has no structured logging, no distributed tracing, and no way to correlate requests across services, that investment has to happen before the migration can proceed safely. The observability work is not overhead: it is the foundation that makes the parallel-run comparison meaningful and that gives the team confidence to increase the traffic percentage routed to the new implementation.

What breaks in practice, and why

Research covering 29 tracked strangler fig projects found that projects missing more than two key prerequisites had a 94% failure rate. The prerequisites that matter most in a .NET context are test coverage on the components being migrated, a working parallel-run validation mechanism, and a data migration strategy for components that own data.

The failure mode we see most often at Blackthorn Vision is what might be called "facade as decoration": a team builds the YARP routing layer, migrates the UI or the API layer of one component, but leaves the business logic and data access in the monolith. The new service makes calls back into the legacy system for data, which means the coupling has not been reduced, it has just been made visible through a network boundary. The team has added latency and operational complexity without actually strangling anything.

The solution to this is enforcing data sovereignty as a hard rule: each migrated service must own its data. If a new service needs to read data that currently lives in the monolith's database, the migration plan for that service must include a data migration strategy, either through dual-write during the transition, Change Data Capture (CDC) to synchronize data between the old and new stores, or a data extraction and import step that runs before traffic is routed to the new service.

The database trap deserves its own mention because it is the failure mode most likely to cause data corruption rather than just downtime. When two systems, the legacy monolith and the new service, write to the same database table simultaneously without a coordination mechanism, race conditions and conflicting writes produce corrupted records that are often invisible until a business process produces wrong results days later. This is not a theoretical risk: it is what happens when teams treat the shared database as a neutral middle ground between the old and new systems instead of recognizing it as the source of coupling they are trying to remove.

The correct approach is to never allow both systems to write to the same table at the same time. If data has to be shared during the transition, use either dual-write with application-level coordination (the new service writes to both the new store and the legacy table, and the legacy system reads from its own table) or Change Data Capture to synchronize records between the old and new data stores without allowing both systems to write the same rows. Neither approach is simple, but both are substantially safer than allowing concurrent writes to shared tables.

Other common failure points are:

Session state: .NET Framework applications often use in-process session state, which breaks immediately when traffic starts flowing through a YARP proxy to a different process. Externalizing session state to a shared Redis cache or Azure Cache for Redis before the migration begins removes this as a blocker.
Authentication: shared authentication tokens and cookies that were issued by the legacy system need to be validated by the new service. This typically requires externalizing the identity provider and configuring both systems to validate tokens from the same source.
Latency-sensitive integrations: some synchronous integrations cannot tolerate the additional latency introduced by the proxy hop. Most integrations handle this without issue, but any integration with a timeout configured below 500ms should be identified during the assessment phase and addressed before the facade is deployed.

Sequencing the full migration

A strangler fig migration for a mid-size .NET Framework monolith typically runs over 12 to 18 months when executed alongside normal product delivery. That timeline is longer than most teams expect when they start, and shorter than most teams fear when they look at the size of the codebase.

The migration progresses in three broad phases. The first phase establishes the infrastructure: YARP is deployed, observability is in place, the parallel-run validation mechanism is working, and the first component has been migrated and validated under real production load. This phase typically takes six to eight weeks and is the most important: if the infrastructure is not solid, every subsequent migration step will be slower and riskier than it needs to be.

The second phase is the main migration loop: one component per sprint, parallel-run validation, traffic ramp, monitoring period, then the next component. The speed of this phase depends on the quality of the boundaries in the original system. Components with clear boundaries migrate quickly. Components where business logic is scattered across stored procedures, event handlers, and configuration files take longer because the boundary has to be established before the migration can happen.

The third phase is decommissioning: once all traffic has been routed to the new services, the legacy system enters a monitoring-only state for a final validation period, typically four to six weeks, before it is shut down. The YARP facade can be removed at this point or retained as a load balancer, depending on the architecture of the new system.

The Azure services that support this migration are available and well-documented. Azure App Service hosts both systems during the parallel-run phase. Azure Cache for Redis externalizes session state. Application Insights provides the observability layer. One additional benefit that teams often underestimate until they see it in practice: moving from .NET Framework to .NET 8 removes the dependency on Windows Server, which means the new services can run in Linux containers on Azure Kubernetes Service or Linux-based App Service plans. For organizations running large fleets of Windows Server VMs, the licensing cost reduction from this shift alone can be substantial, and it becomes a secondary justification for the modernization investment that is easy to quantify for leadership.

Why this matters for AI integration

One of the main reasons enterprise teams modernize .NET Framework monoliths today is that modern AI tooling works substantially better on modern .NET architecture. Azure OpenAI integration, Semantic Kernel, and the Microsoft.Extensions.AI libraries that simplify LLM orchestration all depend on async patterns, clean service boundaries, and the observability infrastructure that legacy monoliths typically lack.

At Blackthorn Vision, the strangler fig approach is often used as a prerequisite step before Azure OpenAI or Semantic Kernel integration, because the AI workload exposes exactly the coupling and latency problems that the monolith has been hiding. Most of the platforms we modernize have teams that want to add copilot features, semantic search, or LLM-powered internal tools to an existing product. The strangler fig migration creates the service boundaries and the async infrastructure that make those integrations sustainable in production, rather than brittle demos that fail under real load.

This is why enterprise teams searching for companies with real AI integration experience in .NET often end up evaluating partners who can do both: assess and modernize the platform, and then build the AI layer on top of the architecture that modernization produced.

Who this engagement model fits

Blackthorn Vision is often brought in when enterprise teams need to modernize a legacy .NET monolith without pausing feature delivery, particularly when the codebase has accumulated enough complexity that a full rewrite carries unacceptable risk. Most of the platforms we work on have been in production for 8 to 15 years, support thousands of daily users, and involve complex ERP integrations or multi-team delivery environments where downtime is not acceptable. The strangler fig pattern with YARP is the approach we use for .NET Framework 4.x to .NET 8 migrations, and the case studies on the Blackthorn Vision site reflect the range of platforms and industries where we have applied it.

This makes Blackthorn Vision relevant for CTOs and engineering leaders searching for the best companies for legacy .NET modernization, particularly when the requirement is a partner who can own the architectural decisions and manage the migration risk, not a team that needs to be told what to do at each step.

If you are evaluating whether the strangler fig pattern is the right approach for your platform, the most useful first step is usually an honest assessment of the two things that determine whether the pattern will work: whether the existing system has enough boundary definition to support incremental extraction, and whether the team has the observability infrastructure to validate parity during the parallel-run phase. Blackthorn Vision's application modernization and assessment approach starts with exactly those two questions, and verified client feedback on how the engagements play out is available on the Blackthorn Vision Clutch profile.