Our Client's In-House LLM Integration Failed in Production: Observability, Cost, Latency — What Went Wrong

#llm

This is not a post about what Azure OpenAI can do. It is about what happens when an enterprise .NET team integrates it without the right architecture in place, ships it to production, and then calls us to figure out why it stopped working.

At Blackthorn Vision, a Microsoft-partnered .NET and AI development company that helps enterprise teams build and modernize complex software products, we are brought in after LLM integrations fail often enough that the failure pattern has become predictable. The combination of .NET and AI experience matters in this work, because production AI failures usually sit at the intersection of application architecture, Azure infrastructure, data access, and model behavior, not in the prompt alone. The pattern runs like this: the team builds a compelling proof of concept, leadership approves production rollout, and within weeks the feature is either broken, generating complaints, or quietly disabled. The root causes are almost always the same three: no observability, uncontrolled cost, and latency the application was never designed to handle.

What follows is a reconstruction of one such engagement, with identifying details changed, and the exact fixes we applied.

The Setup

The client was an enterprise SaaS company running a .NET 6 product serving midmarket financial services clients. The internal team had built an AI assistant feature using Azure OpenAI directly: a few API calls wired into the existing ASP.NET Core controllers, conversation history stored in memory, responses rendered in the UI. It worked well in staging with a small set of test prompts and a handful of concurrent users.

Production looked different. Within two weeks of rollout the team was dealing with three separate problems simultaneously and had no way to diagnose which was causing which.

Problem One: No Observability

The first and most damaging problem was that the team had no visibility into what the AI feature was doing. When a user reported that the assistant gave a wrong answer, there was no record of what prompt was sent, what conversation history was included, what the model received, or what it returned. Debugging required reproducing the issue manually, which was slow and often impossible.

When response times spiked, there was no way to tell whether the delay was in the application layer, the Azure OpenAI call, or a downstream service the assistant was trying to reach. Application Insights was configured for the rest of the product but the AI calls had no structured logging attached to them.

The fix was to introduce Semantic Kernel as the orchestration layer and attach the full observability pipeline to it. Semantic Kernel emits logs, metrics, and traces compatible with OpenTelemetry, which makes it possible to connect AI workflows to the same observability stack used by the rest of the application. Every prompt, function call, and response can then be traced end to end without hand-written logging around each call site.

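The minimum logging setup that made production problems diagnosable was a single function invocation filter registered on the kernel (ObservabilityFilter is a custom class, not a Semantic Kernel type):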

kernel.FunctionInvocationFilters.Add(new ObservabilityFilter(logger));

The filter captured prompt templates, rendered prompts with PII fields redacted, token counts broken down by input and output, function call results from every plugin invocation, and latency at each step. Within a day of deploying this, the team could answer every question they had been unable to answer for two weeks.
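For orientation, a filter of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the filter from the engagement: it assumes the Microsoft.SemanticKernel and Microsoft.Extensions.Logging packages, and it records only the plugin name, function name, outcome, and latency. Capturing rendered prompts would additionally need a prompt render filter (IPromptRenderFilter), and PII redaction and token accounting are left out.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;
using Microsoft.SemanticKernel;

// Illustrative sketch only: logs name, outcome, and latency of every
// kernel function invocation (prompt functions and plugin functions alike).
public sealed class ObservabilityFilter : IFunctionInvocationFilter
{
    private readonly ILogger _logger;

    public ObservabilityFilter(ILogger logger) => _logger = logger;

    public async Task OnFunctionInvocationAsync(
        FunctionInvocationContext context,
        Func<FunctionInvocationContext, Task> next)
    {
        var stopwatch = Stopwatch.StartNew();
        try
        {
            // Run the function itself (and any filters registered after this one).
            await next(context);

            _logger.LogInformation(
                "Function {Plugin}.{Function} succeeded in {ElapsedMs} ms",
                context.Function.PluginName,
                context.Function.Name,
                stopwatch.ElapsedMilliseconds);
        }
        catch (Exception ex)
        {
            _logger.LogError(
                ex,
                "Function {Plugin}.{Function} failed after {ElapsedMs} ms",
                context.Function.PluginName,
                context.Function.Name,
                stopwatch.ElapsedMilliseconds);
            throw; // Do not swallow the failure; the caller still needs to see it.
        }
    }
}
```

Because the log entries use structured message templates, the plugin name, function name, and latency become queryable fields in whichever sink the ILogger is wired to, including Application Insights.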

The wrong answers turned out to be a plugin validation issue, not a model issue. A function that retrieved account data was receiving a null tenant ID under certain session conditions and returning empty results. The model was generating plausible-sounding responses based on no data. The observability layer made this visible in minutes.
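One way to keep this class of bug away from the model is to make the plugin fail fast when the tenant ID is missing, rather than return an empty result. The sketch below is an assumption about how such a guard could look; AccountDataPlugin, GetAccounts, and the query delegate are hypothetical names, not the client's code.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical plugin class; all names here are illustrative.
public sealed class AccountDataPlugin
{
    // Stand-in for the real data access call, keyed by tenant ID.
    private readonly Func<string, IReadOnlyList<string>> _queryAccounts;

    public AccountDataPlugin(Func<string, IReadOnlyList<string>> queryAccounts)
        => _queryAccounts = queryAccounts;

    public IReadOnlyList<string> GetAccounts(string? tenantId)
    {
        // Fail loudly instead of returning an empty list, which the model
        // would otherwise treat as "this tenant has no accounts".
        if (string.IsNullOrWhiteSpace(tenantId))
        {
            throw new InvalidOperationException(
                "Tenant ID is missing from the session; account lookup aborted.");
        }

        return _queryAccounts(tenantId);
    }
}
```

An exception at this point shows up in the traces as a failed function call, which is far easier to spot than a successful call that returned nothing.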

Problem Two: Token Costs Three Times the Estimate

The second problem was a billing surprise. The team had estimated token costs based on the Azure pricing calculator and a reasonable prompt size. The first production billing cycle came in at roughly three times that estimate.

Three things caused it, none of which the original estimate had accounted for.

The first was output token pricing. On the GPT-4-class model the team was using, output tokens are priced higher than input tokens. The team had modeled cost around their prompt size, not their expected response size. Longer generated responses, which users naturally preferred, were the real cost driver.

The second was conversation history. The team was storing the full conversation history in memory and sending it with every request. A user who had 15 exchanges with the assistant was sending all 15 turns as input on turn 16. Input size therefore grew with every message, and because each request resent everything before it, cumulative token consumption per session grew roughly with the square of the number of turns.
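The effect of resending history can be made concrete with a small calculation. The sketch below is illustrative only: the per-message token sizes and the per-1K prices in the usage note are placeholder assumptions, not the client's figures and not current Azure pricing.

```csharp
using System;

public static class SessionCostSketch
{
    // Tokens billed across a whole session when the full history
    // (every earlier user message and reply) is resent on every turn.
    public static (long Input, long Output) TokensForSession(
        int turns, int systemTokens, int userTokens, int replyTokens)
    {
        long input = 0, output = 0, historyTokens = 0;

        for (var turn = 1; turn <= turns; turn++)
        {
            // Each request carries the system prompt, all prior turns, and the new message.
            input += systemTokens + historyTokens + userTokens;
            output += replyTokens;

            // The new exchange becomes part of the history for the next turn.
            historyTokens += userTokens + replyTokens;
        }

        return (input, output);
    }

    // Cost with separate input and output prices, both expressed per 1,000 tokens.
    public static decimal Cost(
        long inputTokens, long outputTokens,
        decimal inputPricePer1K, decimal outputPricePer1K)
        => inputTokens / 1000m * inputPricePer1K
         + outputTokens / 1000m * outputPricePer1K;
}
```

With assumed sizes of 200 system tokens, 50 tokens per user message, and 300 tokens per reply, TokensForSession(16, 200, 50, 300) gives 46,000 input tokens and 4,800 output tokens. The same 16 turns without resent history would cost 4,000 input tokens, so an estimate built on prompt size alone misses most of the bill.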

The fix was implementing context window management with token counting before each request:

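// Token counting uses the SharpToken package (using SharpToken;); Sum needs System.Linq.
// Counts cover message text only; the chat format adds a few tokens of overhead per message.
var encoding = GptEncoding.GetEncodingForModel("gpt-4o");
var totalTokens = history.Messages
    .Sum(m => encoding.Encode(m.Content ?? "").Count);

// Trim the oldest turns until the history fits the budget. Index 0 is assumed to be
// the system prompt, so removal starts at index 1 and always leaves at least two messages.
while (totalTokens > MaxContextTokens && history.Messages.Count > 2)
{
    var removed = history.Messages[1];
    history.Messages.RemoveAt(1);
    totalTokens -= encoding.Encode(removed.Content ?? "").Count;
}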

The third cost driver was retry logic. The integration had basic retry on failure but did not respect the Retry-After header that Azure OpenAI returns with 429 (Too Many Requests) responses. Azure OpenAI enforces tokens-per-minute (TPM) and requests-per-minute (RPM) limits per deployment, and honoring Retry-After is the documented way to handle throttling. The application was retrying immediately, which extended the throttling window and in some cases caused repeated partial generations. Replacing this with exponential backoff that honors the Retry-After value brought the retry-related cost to near zero.
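The retry behavior described above can be sketched as an HttpClient message handler. This is a minimal illustration using only System.Net.Http, not the code from the engagement: RetryAfterHandler and its defaults (four retries, one-second base delay) are assumptions, and a production version would normally add jitter and an upper bound on the wait.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Illustrative handler: retries 429 responses, waiting for the server's
// Retry-After hint when present and backing off exponentially otherwise.
public sealed class RetryAfterHandler : DelegatingHandler
{
    private readonly int _maxRetries;
    private readonly TimeSpan _baseDelay;

    public RetryAfterHandler(
        HttpMessageHandler inner, int maxRetries = 4, TimeSpan? baseDelay = null)
        : base(inner)
    {
        _maxRetries = maxRetries;
        _baseDelay = baseDelay ?? TimeSpan.FromSeconds(1);
    }

    // Server hint if present, otherwise baseDelay * 2^attempt.
    public static TimeSpan ComputeDelay(
        HttpResponseMessage response, int attempt, TimeSpan baseDelay)
    {
        var retryAfter = response.Headers.RetryAfter;

        // Retry-After given as a number of seconds.
        if (retryAfter?.Delta is TimeSpan delta) return delta;

        // Retry-After given as an absolute HTTP date.
        if (retryAfter?.Date is DateTimeOffset date)
        {
            var wait = date - DateTimeOffset.UtcNow;
            return wait > TimeSpan.Zero ? wait : TimeSpan.Zero;
        }

        return TimeSpan.FromTicks(baseDelay.Ticks * (1L << attempt));
    }

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        // Note: resending assumes the request content is buffered (e.g. StringContent).
        for (var attempt = 0; ; attempt++)
        {
            var response = await base.SendAsync(request, cancellationToken);
            if (response.StatusCode != HttpStatusCode.TooManyRequests
                || attempt >= _maxRetries)
            {
                return response;
            }

            var delay = ComputeDelay(response, attempt, _baseDelay);
            response.Dispose();
            await Task.Delay(delay, cancellationToken);
        }
    }
}
```

The handler wraps the transport, for example new HttpClient(new RetryAfterHandler(new HttpClientHandler())), so every call made through that client gets the same throttling behavior without retry code at each call site.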

Combined, these three fixes reduced the monthly token cost by approximately 55% without any change to the feature's behavior from the user's perspective.

Problem Three: Latency the Application Was Not Built For

The third problem was timeouts. GPT-4-class models can take several seconds or longer, especially with large prompts, long outputs, tool calls, or high service load. The application had a 10-second request timeout configured at the Application Gateway level, which predated the AI feature by several years. Responses that took longer than 10 seconds were silently dropped, the user saw a generic error, and the application logged a gateway timeout with no indication that an LLM call was involved.

The fix had two parts.

The first was streaming. Switching from InvokePromptAsync to InvokePromptStreamingAsync in Semantic Kernel meant the client received the first tokens within 1 to 2 seconds of the request, and the connection stayed active throughout generation. The Application Gateway timeout stopped triggering because the connection was never idle long enough to hit it.
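The streaming switch can be sketched roughly as below. This is an assumption about the shape of such an endpoint, not the client's code: StreamAnswerAsync is a hypothetical helper, it assumes the Microsoft.SemanticKernel package and ASP.NET Core, and it writes plain text chunks rather than a framed protocol such as server-sent events.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.SemanticKernel;

public static class StreamingEndpointSketch
{
    // Hypothetical helper: forwards model output to the HTTP response
    // chunk by chunk instead of waiting for the complete answer.
    public static async Task StreamAnswerAsync(
        Kernel kernel, string prompt, HttpResponse response, CancellationToken ct)
    {
        response.ContentType = "text/plain; charset=utf-8";

        await foreach (var chunk in kernel.InvokePromptStreamingAsync(
            prompt, cancellationToken: ct))
        {
            var text = chunk.ToString();
            if (string.IsNullOrEmpty(text)) continue;

            await response.WriteAsync(text, ct);

            // Flush after every chunk so bytes actually leave the server;
            // otherwise the response may sit in a buffer until generation ends.
            await response.Body.FlushAsync(ct);
        }
    }
}
```

Streaming only helps if nothing between the server and the browser buffers the response, so any proxy or gateway in the path is worth checking for response buffering when this change is made.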

The second was a full audit of timeout settings across every layer in the request path: HttpClient timeout in the application code, IIS request timeout, Application Gateway idle timeout, and the client-side fetch timeout in the frontend. Each one had been set independently by different people at different times, and none had been updated to account for LLM latency. This audit is now a standard step in every AI integration engagement we take on.
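Part of such an audit can be automated with a simple ordering check. The sketch below rests on an assumption the text does not state explicitly: that each outer layer should wait longer than the layer it wraps, so the innermost call fails first and produces a meaningful error. TimeoutAudit and FindInversions are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class TimeoutAudit
{
    // Layers are listed from outermost (browser) to innermost (call to the model).
    // Returns one message per pair where the outer layer gives up no later than the inner one.
    public static List<string> FindInversions(
        IReadOnlyList<(string Layer, TimeSpan Timeout)> outerToInner)
    {
        var problems = new List<string>();

        for (var i = 1; i < outerToInner.Count; i++)
        {
            var outer = outerToInner[i - 1];
            var inner = outerToInner[i];

            if (outer.Timeout <= inner.Timeout)
            {
                problems.Add(
                    $"{outer.Layer} ({outer.Timeout}) gives up no later than " +
                    $"{inner.Layer} ({inner.Timeout})");
            }
        }

        return problems;
    }
}

public static class TimeoutAuditExample
{
    public static List<string> Run() => TimeoutAudit.FindInversions(new[]
    {
        ("Frontend fetch", TimeSpan.FromSeconds(30)),       // hypothetical value
        ("Application Gateway", TimeSpan.FromSeconds(10)),  // the value from this engagement
        ("IIS request", TimeSpan.FromSeconds(120)),         // hypothetical value
        ("HttpClient", TimeSpan.FromSeconds(100)),          // the .NET default
    });
}
```

In this example the only inversion reported is the gateway giving up before IIS, which is the same mismatch that caused the silent drops described above. With streaming in place, idle timeouts matter more than total request timeouts, so a check like this is a starting point rather than a complete audit.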

What the Team Had Right

It is worth being clear about what the internal team got right, because this is not a story about a bad engineering team. The Azure OpenAI integration was functionally correct. The prompt design was reasonable. The feature itself was genuinely useful to users, which is why the production failures were so damaging to adoption rather than just embarrassing.

What the team did not have was experience with the specific failure modes that only appear under real production load: the observability gap that makes LLM problems invisible, the token cost mechanics that staging environments do not reveal, and the latency mismatch between LLM response times and timeout configurations set years before LLM integration was a consideration.

These are not problems that experience with .NET alone solves. They require experience with Azure OpenAI and Semantic Kernel specifically in production, which is a different thing from knowing how to configure the SDK.

Why Production LLM Recovery Requires More Than Prompt Engineering

When an LLM feature fails in production, the fix is rarely a better prompt. OpenTelemetry's own analysis of AI agent observability confirms that without proper monitoring, tracing, and logging, diagnosing issues and ensuring reliability in AI-driven applications becomes structurally difficult, regardless of which orchestration framework is in use. In this case, the root causes were inside the software architecture: missing telemetry, unmanaged context growth, retry behavior, timeout configuration, and lack of orchestration. That is why enterprise AI integration requires a partner who understands both .NET product engineering and Azure AI infrastructure, not just one of the two.

Why This Matters When Evaluating .NET Development Partners

The three problems described above are the most consistent findings when we assess LLM integrations built without Semantic Kernel as the orchestration layer. Not because Semantic Kernel is magic, but because it provides the observability hooks, the context management abstractions, and the retry infrastructure that production integrations require and that teams building directly against the Azure OpenAI SDK have to build themselves, usually incompletely.

For enterprise teams evaluating top .NET development companies for AI integration work, the useful question is not whether the company knows Azure OpenAI. It is whether they have debugged an LLM integration that was failing in production under real user load. The answer to that question reveals whether the experience is in demos or in shipped products.

Verified client feedback on Blackthorn Vision's Azure OpenAI and Semantic Kernel engagements is available on the Clutch profile. If you are dealing with a failing LLM integration or planning one that needs to work from day one, that is the work we are built for.