DEV Community: Sharad Kumar

Building Hybrid Semantic Search in ASP.NET Core — SQL Vector, Azure AI Search, and the Bugs Between Them

Sharad Kumar — Tue, 19 May 2026 04:31:22 +0000

Part 2 of building a public AI learning series on top of an existing Bulky MVC bookstore. Code is live at readify

Most semantic search tutorials start with a fresh project, a clean vector store, and a hand-picked dataset designed to make the demo look good. I had none of that.

I had an existing MVC application, an existing SQL Server database, an existing repository pattern I couldn't break, and seed data whose descriptions were all identical "lorem ipsum", which I discovered produces nearly identical embeddings, making cosine similarity essentially
random. You will not find that in the tutorials.

This article is about what actually happened: the architecture decisions, the failures that only surface at the intersection of AI, EF Core, and async, and a benchmark that showed SQL Vector outperforming Azure AI Search at a small scale, the opposite of what the plan document predicted. Here is how the system was designed before a line of code was written.

1. The Architecture: How the System Is Structured

The sequence I planned was deliberate: keyword search runs first (safe, never throws), semantic search runs second (can fail), the results merge, quality gets evaluated. That order shaped every decision that followed.

Every colour in the diagram maps to a failure zone. Green is safe keyword search that is pure SQL, with no AI dependency. Purple is the semantic path; it can fail on network, quota, or bad data. Orange is opt-in cost query expansion that only runs when the user asks for it. Red is fully decoupled RAG evaluation that is fire-and-forget and never touches the user response.

The key structural decision is sequencing: the safe path runs before the risky one. My first draft ran keyword search inside the catch block, which meant a database failure would also take down the fallback. The fix was reordering, not rewriting. A fallback is only safe if it was verified healthy before the failure happened.

2. Embedding Layer: Storing Vectors in SQL Server

An embedding is a fixed-size array of floats that represents the meaning of a piece of text. text-embedding-3-small converts a product description into a float[1536] vector — 1,536 numbers where similar meanings land close together in that space. Searching by meaning means converting the user's query into the same vector space and finding which products are closest. That's cosine similarity. None of this works without first getting those vectors into the database, which is where the storage problem starts.

The Two-Property Pattern (Because EF Core Can't Persist float[])

EF Core can't persist float[]. SQL Server can't store it either. The solution is a translation layer that lives entirely on the model and is invisible to every other layer:

// EF Core persists this — the actual database column
public byte[]? SearchEmbeddingData { get; set; }

// Application code uses this — [NotMapped] means EF ignores it entirely
[NotMapped]
public float[]? SearchEmbedding
{
    get => SearchEmbeddingData is null ? null
        : MemoryMarshal.Cast<byte, float>(SearchEmbeddingData).ToArray();
    set
    {
        if (value is null) { SearchEmbeddingData = null; return; }
        SearchEmbeddingData = MemoryMarshal.AsBytes(value.AsSpan()).ToArray();
    }
}

MemoryMarshal reinterprets the same memory block under a different type, no copy, no allocation. The getter converts bytes to floats on read. The setter converts floats to bytes on write. Every other layer works with float[] naturally and never knows bytes exist.

Seed Data Quality Is a Precondition, Not Configuration

The seed format is:

"{Title} by {Author}. Category: {Category}. {Description}"

"Gone Girl" alone tells the model less than "Gone Girl by Gillian Flynn. Category: Thriller. A psychological thriller about a missing woman and her husband's dark secrets." The surrounding context pulls the vector toward the right region of embedding space.

I discovered this the hard way: my initial seed data had identical lorem ipsum descriptions across all six products. The embeddings were nearly identical vectors cosine similarity had no meaningful signal to rank against. Search results looked random because they essentially were. Seed data quality is a precondition for RAG correctness. This is the kind of failure mode that only surfaces when you actually run the math and see the results make no sense.

The Embeddings That Were Generated But Never Saved

Embeddings generated successfully. Logs confirmed it. Then:

SELECT COUNT(*) FROM Products WHERE SearchEmbeddingData IS NOT NULL
-- Result: 0

Root cause: the existing repository's Update() method maps properties manually, one by one, a standard pattern for the rest of the codebase. SearchEmbeddingData was never added to that list. EF Core tracked the entity, SaveChanges() was called, and the byte array column was silently skipped. No exception. No warning. Nothing.

This is the specific tax of adding AI to a system with manual property-copy update methods. Every new column added to the model must be added to the update method by hand, with no compiler enforcement and no runtime signal when you forget.

[NotMapped] Is Invisible to EF — Including in Where Clauses

// Produces no SQL WHERE clause — EF silently ignores it
.Where(p => p.SearchEmbedding != null)

// Correct — filters on the actual mapped column
.Where(p => p.SearchEmbeddingData != null)

SearchEmbedding is [NotMapped], a computed property that exists only in memory. EF Core cannot translate it to SQL. Rather than throwing, it silently drops the filter entirely and performs a full table scan. The code compiles, runs, and loads every product, not only the ones with embeddings.

This only surfaces when you have a translation layer between your database type and your application type. Without the embedding system, you'd never have a [NotMapped] computed property in a LINQ filter. Add one, and your filter assumptions break without warning.

3. Search Layer: Hybrid Retrieval and Confidence Scoring

With vectors in the database, the search layer has three jobs: retrieve candidates, merge the semantic and keyword paths, and decide whether to trust the result.

Why Not a Vector Database

Reaching for Qdrant or Pinecone when someone says "embeddings" is a reasonable instinct. I didn't.

Each embedding from text-embedding-3-small is a float[1536] array: 1,536 × 4 bytes = 6 KB per product. 500 products are roughly 3 MB in memory. At that scale, loading vectors and computing cosine similarity in-process in C# is fast enough. One database, one EF Core provider, one migration. No second Azure resource to provision, bill, or debug.

The upgrade path to Azure AI Search exists, and I built it too. But building the simple path first forced me to implement cosine similarity from scratch and understand what the managed service abstracts. You can't explain why HNSW is faster if you've never felt the O(n) scan problem. The SQL Vector path is live in production. Azure AI Search is the comparison baseline. Both are benchmarked in Section 8.

Composite Confidence: Why One Threshold Wasn't Enough

My first implementation used LowConfidenceThreshold = 0.75f. A top score of 0.57 with a second score of 0.37, a gap of 0.20, was being flagged as low confidence. That result won clearly. The threshold was penalising it for not reaching an arbitrary ceiling.

A single threshold can't distinguish "this result won clearly" from "the top two results are nearly tied." A score of 0.57 with a gap of 0.20 is high confidence. The same score with a gap of 0.01 is a coin flip. The gap tells you something the absolute score cannot.

private const float LowConfidenceThreshold = 0.4f;

float topScore    = scored.ElementAtOrDefault(0)?.Score ?? 0;
float secondScore = scored.ElementAtOrDefault(1)?.Score ?? 0;
var   scoreGap    = topScore - secondScore;

bool lowConfidence =
    topScore < LowConfidenceThreshold ||           // absolute floor
    (topScore < 0.50f && scoreGap < 0.10f) ||     // mediocre + indistinct
    (topScore < 0.60f && scoreGap < 0.05f);       // decent score + coin-flip ranking

LowConfidence is driven purely by the semantic path's score, not by whether keyword results filled the remaining slots. The flag is about AI retrieval confidence, not result completeness. Mixing those signals would make it meaningless.

Keyword Search: Word-Splitting vs Single Phrase Match

The original plan described a single .Contains(query) checking if the full phrase appeared in a title or description. The actual implementation splits first:

private IReadOnlyList<Product> KeywordSearch(string query, int topK = 5)
{
    var queryWords = query
        .Split(' ', StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries)
        .ToList();

    return _unitOfWork.Product
        .GetAll(includeProperties: "Category,ProductImages")
        .Where(p => queryWords.Any(q => p.Title.Contains(q) || p.Description.Contains(q)))
        .Take(topK)
        .ToList();
}

Searching "cozy mystery" with a full-phrase match fails if no title contains that exact substring. Splitting into words and matching any of them handles multi-word queries correctly. This is also synchronous by design, pure in-memory LINQ with no I/O. There's nothing to await.

SearchResult<T>: Why I Didn't Reuse the Week 1 Envelope

I already had AIResponse<T> from Week 1. The temptation was to add TopScore and LowConfidence to it and avoid a new type. I didn't.

AIResponse<T> carries FromCache, an infrastructure concern. SearchResult<T> carries TopScore and LowConfidence domain concerns specific to retrieval. Merging them would mean the description generator's response type carries unused search fields. A field on a type implies it's relevant to that type's callers. That's a misleading model design.

public class SearchResult<T>
{
    public IReadOnlyList<T> Items { get; init; } = [];
    public float TopScore { get; init; }
    public bool LowConfidence { get; init; }
}

4. Query Expansion: Bridging the Vocabulary Gap

Even searching for "psychological thriller remote mountain town" words that appear verbatim in a product description scored only 0.57. The stored vector was generated from a 278-character structured string. The surrounding context shifts the vector, and a short query doesn't land in the same neighbourhood even when the words are identical.

Query expansion bridges this: a GPT call reformulates a short query into richer language before embedding it:

public async Task<string> ExpandQueryAsync(string query, CancellationToken ct)
{
    var prompt = $"""
        Expand this book search query into a short descriptive phrase
        that includes genre, mood, and themes. 2-3 sentences max.
        Query: {query}
        """;

    var history = new ChatHistory();
    history.AddUserMessage(prompt);
    var response = await _chatCompletionService.GetChatMessageContentAsync(
        history, cancellationToken: ct);
    return response.Content;
}

Scores improved by ~0.05 on SQL Vector and ~0.03 on Azure AI Search when a matching product existed. When no relevant product existed, scores improved marginally but low confidence flags remained correct. Expansion cannot manufacture relevance that isn't in the catalogue. It's a vocabulary bridge, not a relevance fix.

Expansion is opt-in useQueryExpansion = false by default. The default path makes zero extra API calls. When confidence is low, the UI shows a "Search Harder" button that resubmits with expand=true. Users who get good results pay nothing extra. Users who don't can ask for more.

5. Azure AI Search: The Second Retrieval Path

I built both SQL Vector and Azure AI Search as separate, independently routable paths. A CompareSearch admin endpoint runs both against the same query and returns side-by-side results, which is how the benchmark data below was collected.

The One-Character Fix That Changed All Scores

Azure AI Search initially returned scores in the 0.016–0.033 range. The confidence threshold (0.75f) flagged every query as low confidence "dark skies" (a literal book title) and "quantum oxford" (genuinely irrelevant) scored identically at 0.033.

Root cause: passing a query string alongside vector options to SearchAsync triggers hybrid RRF (Reciprocal Rank Fusion) scoring, which blends BM25 text scores and vector scores. RRF always produces compressed tiny numbers regardless of semantic similarity. Changing SearchAsync(query, options, ct) to SearchAsync("*", options, ct) forces pure vector search and returns cosine-like scores (0.54–0.70). One character.

Separate Confidence Threshold Per Path

SQL Vector scores and Azure AI Search scores are not on the same scale, even after fixing the RRF issue. Applying the same LowConfidenceThreshold constant to both produces a meaningful signal on whichever path it was tuned for and a meaningless one on the other.

After plotting 10 queries, the natural gap in Azure scores sat between 0.5939 and 0.6258. A dedicated constant:

private const float AzureLowConfidenceThreshold = 0.61f;

cleanly separates relevant from irrelevant results. The constant name makes the intent explicit, this is not the SQL Vector threshold, and the two should never be merged.

Result Ordering Disappears After the EF Query

Azure AI Search returns product IDs ranked by relevance. A subsequent EF WHERE id IN (...) query doesn't preserve that order. SQL set operations have no ordering guarantee.

// Build a dictionary for O(1) lookup
var productMap = _unitOfWork.Product
    .GetAll(p => ids.Contains(p.Id), includeProperties: "Category,ProductImages")
    .ToDictionary(p => p.Id);

// Re-project using the original Azure-ranked order
var products = ids
    .Where(id => productMap.ContainsKey(id))
    .Select(id => productMap[id])
    .ToList();

If you fetch products by ID without re-ordering, you lose the ranking signal entirely. The most relevant result stays first only because you put it there explicitly.

6. RAG Evaluation Layer: A Second Opinion on Every Result

After the hybrid search returns results, a second LLM call scores how well the retrieved context answers the query (1 to 5), fire-and-forget, logged to App Insights, never blocking the user response.

The judge only sees the raw query and the retrieved product descriptions as plain text. No embeddings, no cosine scores, no knowledge of how retrieval worked. It reasons purely as a reader, which is the point.

The Fire-and-Forget Task That Was Never Actually Background

// Wrong — task gets cancelled the moment the HTTP response is sent
_ = _ragEvaluationService.ScoreFaithfulnessAsync(query, context, ct);

// Correct — task runs to completion regardless of request lifetime
_ = _ragEvaluationService.ScoreFaithfulnessAsync(query, context, CancellationToken.None);

The HTTP request's CancellationToken is cancelled when the response is sent. Passing it to a fire-and-forget task means the background work gets cancelled at exactly the moment it's supposed to be running independently. This only surfaces in systems that deliberately decouple a second AI call from the request lifecycle; you don't hit it with synchronous code or normally awaited async calls.

Two Design Decisions Worth Explaining

Why -1 instead of Math.Clamp: When the judge returns "3/5" or a prose explanation instead of a bare integer, int.TryParse fails. Clamping that garbage to a valid range silently corrupts dashboard averages a faithfulness trend chart fed by clamped invalid responses tells you nothing real. The -1 sentinel gets excluded from aggregates, and the warning log tells you exactly how often the judge is misbehaving.

Why ToList() before the fire-and-forget call: .Select(p => p.Description) uses deferred execution. Without materialising first, enumeration happens on a background thread after the request ends, at which point the EF Core DbContext may already be disposed. ToList() runs everything while the request is still alive. Removing it looks like a valid simplification (the types are compatible), but it introduces a bug that only surfaces at runtime under load.

The cosine similarity score and the faithfulness score measure fundamentally different things. Cosine measures mathematical distance before you look at content. Faithfulness measures whether the actual retrieved text serves the user's intent. You can have a high cosine score and a faithfulness score of 2 embeddings matched on surface features rather than meaning. Both signals together are more honest than either alone.

7. Production Hygiene: Rate Limiting and Input Validation

Each search query hits the Azure OpenAI embedding API there's a real cost attached. Three layers of defence address three different threat shapes, and none replaces the others:

Input validation (3–200 character length check) rejects garbage before any API call is made
Response caching (normalized query key: query.ToLowerInvariant().Trim()) eliminates repeat costs entirely; identical queries cost nothing after the first hit
Rate limiting caps volume from any single client

[EnableRateLimiting("search")]
public async Task<IActionResult> Search(string q, bool expand = false, CancellationToken ct = default)
{
    if (string.IsNullOrWhiteSpace(q) || q.Trim().Length < 3 || q.Trim().Length > 200)
    {
        TempData["Error"] = "Search query must be between 3 and 200 characters.";
        return RedirectToAction("Index", "Home");
    }
    // ...
}

Rate limiting policy names are a runtime concern, not a compile-time contract. If policyName: "search" in Program.cs doesn't exactly match [EnableRateLimiting("search")] on the action, the attribute is silently ignored, no error, no indication, just unenforced limits. UseRateLimiter() must also appear in the middleware pipeline after UseAuthorization(), otherwise the policy exists but is never applied.

8. The Benchmark That Contradicted the Plan

The plan document cited ~40ms for Azure AI Search vs ~120ms for SQL Vector, based on estimates from reference material. I measured before publishing.

Actual results across 10 queries:

Query	SQL #1 Result	Azure #1 Result	Same?
"sugar candy in the fair"	Cotton Candy	Cotton Candy	✓
"sunset on the beach"	Vanish in the Sunset	Vanish in the Sunset	✓
"psychological thriller remote location"	Dark Skies	Dark Skies	✓
"coming of age adventure"	The Road to Redemption	The Road to Redemption	✓
"elon musk"	(both low confidence)	(both low confidence)	✓
"quantum mechanics at oxford"	(both low confidence)	(both low confidence)	✓

10 of 10 queries returned the same top result. This validates the in-process cosine implementation as correct at this catalogue size. The divergence cases were both genuinely ambiguous queries where neither path had a strong signal, which is the expected behaviour when two different algorithms are both estimating without clear data.

Timing: SQL Vector averaged ~584ms, Azure AI Search averaged ~671ms SQL was faster. Both numbers were dominated by the shared GetEmbeddingAsync network call to Azure OpenAI (~400–750ms). The actual search operation on either side was under 100ms.

At small catalogue sizes, SQL Vector avoids a second network round trip to Azure. Azure AI Search's ANN indexing advantage only materialises when the brute-force O(n) cosine scan becomes the bottleneck, which requires hundreds of thousands of products, not six.

Measure before you claim. The plan was wrong.

9. What's Next

The Week 2 system is fully in production: hybrid retrieval across two paths, composite confidence scoring, opt-in query expansion, RAG faithfulness evaluation running after every search, and a CompareSearch endpoint that benchmarks both paths side by side.

Week 3 is Semantic Kernel plugins, MediatR/CQRS, and the first xUnit tests.

Built on: ASP.NET Core MVC · Azure OpenAI · Semantic Kernel · GPT-4o-mini · text-embedding-3-small · Azure SQL Vector · Azure AI Search

🤝 Connect with Me

If you're building AI into .NET or just following along, let's connect.

💼 LinkedIn
🐙 GitHub

I Built a Production AI Layer Inside a Legacy ASP.NET Core App — and It Broke in Ways Tutorials Never Mention

Sharad Kumar — Tue, 12 May 2026 00:34:14 +0000

Introduction

Most LLM tutorials assume you start from nothing. A blank project. A clean architecture. No constraints. No legacy code. No deployment history. No production traffic.

That is not how real systems work.

I spent one week integrating a production-grade AI service layer into an existing ASP.NET Core MVC e-commerce system that was already live, already structured, and already dependent on architectural decisions I couldn't change. The challenge wasn't calling an LLM API. It was designing an AI layer that could survive inside a real backend system without breaking testability, without leaking cost, without becoming tightly coupled to the domain, and without collapsing the moment the model or provider behaved unexpectedly.

The feature itself, a tone-aware product description generator, is simple. The system design behind it is not. This article is about the architectural decisions, the production failure modes, and the assumptions that broke the moment an LLM entered a real backend system.

AI System Design

The Provider/Domain Seam: The Most Important Structural Decision

The first instinct when adding AI to any backend is to create one service class that does everything. One class, one interface, done.

That instinct produces a system you cannot test, cannot swap, and cannot extend without touching everything.

The correct design draws a hard seam between two concerns:

Provider layer (IAIService / AzureOpenAIService): knows how to talk to an LLM. Accepts two strings (system prompt, user prompt), returns a string. Knows nothing about what a book or product is.
Domain layer (IProductAIService / BookAIService): knows what a description request looks like, what tone means, and how prompts should be constructed. Knows nothing about Azure, HTTP, or Semantic Kernel.

Above that seam: domain logic, testable with zero real network calls. Below it: provider mechanics, swappable without touching anything above. Swap Azure OpenAI for Ollama, write one new class, and change one registration. The domain layer doesn't notice.

"Naming is a design smell detector. My original class was called OpenAIService and it implemented both interfaces. When I tried to give it an honest name, I couldn't because it was doing two things."

The Generic Wrapper That Eliminates Try/Catch Everywhere

With the seam in place, the next question was: what does every AI call return? My first draft returned plain strings or domain objects directly. That looked fine until I had to handle failures, and I started writing the same try/catch block in three different places.

Every AI call in the system returns AIResponse<T>, not a raw result type. The wrapper carries:

public class AIResponse<T>
{
    public bool    Success      { get; init; }
    public T?      Data         { get; init; }
    public string? ErrorMessage { get; init; }
    public bool    FromCache    { get; init; }
    public int     TokensUsed   { get; init; }

    public static AIResponse<T> Ok(T data, int tokens = 0, bool fromCache = false) => ...
    public static AIResponse<T> Fail(string error) => ...
}

The payoff is architectural: no caller above the provider layer ever writes a try/catch. Every feature just checks result.Success. Error handling is decided once, at the boundary where it's caught, and that discipline holds automatically across every AI feature you add later.

AIResponse<string> at the provider layer. AIResponse<ProductDescriptionResult> at the domain layer. AIResponse<ChatResult> when the chatbot arrives in Week 3. Same envelope, different payloads.

The static factory methods aren't just convenience they make invalid states unrepresentable. You cannot call Ok() and get Success = false. You cannot call Fail() and accidentally leave ErrorMessage null.

The Inheritance Trap I Almost Fell Into

My first instinct was to write ProductDescriptionResult : AIResponse. It compiles. It works. But the moment you try to store a ProductDescriptionResult in a database or pass it across a service boundary, it drags Success, ErrorMessage, and TokensUsed with it, infrastructure concerns that mean nothing in the domain layer.

Composition is correct here. The result class is a pure data payload. The wrapper is the envelope. Neither inherits from the other.

System Prompts as Architectural Contracts

This is where prompt engineering actually lives in the codebase, not scattered across controllers, not inline in HTTP calls, but centralised in a single switch expression:

private static string BuildSystemPrompt(DescriptionTone tone) => tone switch
{
    DescriptionTone.Professional =>
        "You are a professional copywriter for a premium book retailer. " +
        "Write concise, authoritative product descriptions. " +
        "Never fabricate awards, authors, or facts not provided.",
    DescriptionTone.Casual =>
        "You are a friendly book recommender. Keep it warm and enthusiastic.",
    // ...
};

Two things worth calling out:

BuildSystemPrompt() and BuildUserPrompt() are separate private methods by design. Developer rules and user input must never accidentally merge. That separation is what makes the service layer both testable and secure.
Hardcoded system prompts are a deliberate decision, not laziness. They represent fixed feature contracts that don't change at runtime, they don't bleed across features, and they're easy to audit.

The temperature misconception is worth correcting explicitly: temperature is not a safety control. It's a creativity dial. The system prompt is where your behavioural constraints live. Three controls, three separate jobs: system prompt = instructions, temperature = style, max_tokens = budget.

ChatHistory: Single-Turn Today, Multi-Turn Ready Tomorrow

Most beginners send one big string to an LLM API. Using Semantic Kernel's ChatHistory correctly is a signal that you understand how chat models actually work:

var chatHistory = new ChatHistory();
chatHistory.AddSystemMessage(systemPrompt);
chatHistory.AddUserMessage(userPrompt);

The description generator creates a new ChatHistory per call by design. When the chatbot arrives in Week 3, the same pattern extends with no architecture change, just persistence added.

CancellationToken: The Hidden Cost Leak

Every async AI method in the chain accepts and propagates a CancellationToken. This isn't just politeness; it's cost control. If a user closes their browser tab mid-request and the token isn't propagated all the way to the Azure SDK call, your app completes the API call and gets billed for a response nobody receives.

A token accepted but not passed to the next await is worse than not accepting it at all; it gives false confidence that cancellation is handled. The chain must be unbroken: Controller → Service → Provider → Azure SDK call.

Response Caching: Proving It Works in the UI

Caching is wired at the provider layer with a hash-based key:

var cacheKey = $"ai:text:{systemPrompt.GetHashCode()}:{userPrompt.GetHashCode()}";

The AIResponse<T> wrapper carries a FromCache boolean all the way to the admin UI, where it renders as ⚡ Cached vs ✨ Generated. That's not decoration, it's verification. You can prove your caching is working in a live demo without opening Application Insights.

EnableCaching is a feature flag in AISettings, not a hardcoded true. You can flip it in the Azure App Service configuration without redeploying. That's a production pattern, not a tutorial habit.

The AI Cost Dashboard: Why Almost Nobody Builds This

Most AI portfolio projects show features. Mine also shows cost. The admin dashboard tracks tokens per feature per day, cost per request, and cache hit rate. This makes the economics of AI visible and is the difference between someone who built a feature and someone who shipped one.

Concrete number: approximately $15 total API spend over 8 weeks of development, with gpt-4o-mini during development (roughly 15x cheaper than gpt-4o) and caching in place.

Supporting Architecture Decisions

Three decisions that shaped how the AI layer was placed and wired were included because they caused real problems, not because they're interesting trivia.

AI infrastructure belongs at the application root, not inside feature folders. The app used area-based routing. The instinct was to put Services/AI/ inside an Area. Wrong call, UI grouping doesn't determine where cross-cutting infrastructure lives. AI features span the whole application. Scoping them to an Area implies boundaries that don't exist and creates coupling that becomes painful when the chatbot and search features arrive.

All AI wiring lives in one extension method. One line in Program.cs: builder.Services.AddAIServices(builder.Configuration). Everything else is encapsulated. The subtle thing to know: if you register the same concrete class twice under different interfaces without careful use of GetRequiredService, you can end up with two separate instances per request instead of one. Most tutorials never flag this.

Secrets never touch the config file. ApiKey exists in the settings class but is absent from appsettings.json. User Secrets fill it locally, App Service Application Settings fill it in production. The application code is identical in both environments everything is injected through the same configuration pipeline.

Challenges & Learnings

The Debugging Arc: 404 → 400 → 200

Getting the AI endpoint working produced three sequential errors, each from a different layer:

404 : routing was treating AI as an area name. Fix: explicit [Route("AI")] on the controller.
400 : the JavaScript payload sends "Professional" as a string; the C# model declares a DescriptionTone enum. ASP.NET Core's default deserializer returns a silent 400 without bridging them. Fix: add JsonStringEnumConverter. This is an AI-specific integration point — the LLM-facing tone selector has to round-trip correctly through the API layer.
200 : success.

Each error was a different subsystem. Real integrations rarely have a single root cause.

The Cascade Failure at Deployment

At this point, the feature was working locally. Deployment looked straightforward.

It wasn't.

After deploying, MaxTokens was reading as 0. Completely unrelated to the actual cause: a missing Azure App Settings entry was causing AIServiceExtensions.cs to crash at startup, leaving the settings object in a zeroed-out state. A NullReferenceException deep in Semantic Kernel setup was the symptom. A missing config key was the cause.

The lesson applies specifically to AI service registration: validate that your AI configuration is present and well-formed at startup, loudly, before any service gets built. A ?? throw on the config read gives you a readable failure message at the right location, not a cryptic error three layers downstream.

The Two Hours I Lost to a URL

This one stings to write. I spent the better part of an afternoon convinced my Semantic Kernel wiring was broken. Checked the DI registration twice. Re-read the SK docs. Added logging everywhere. Silent failure, no exception, no response.

The actual problem: I had copied the full REST endpoint URL from the Azure portal, something like https://your-resource.openai.azure.com/openai/deployments/gpt-4/chat/completions?api-version=2024-02-01 directly into my config. Semantic Kernel only wants the base domain. It constructs the rest itself. One wrong URL format, zero helpful error messages, two hours gone.

I'm including this because the Azure portal genuinely shows you the full path and it looks correct. It isn't.

Use https://your-resource.openai.azure.com/, the base domain only. Semantic Kernel builds the path from there based on your deployment name and API version.

LLM Error Messages Are a Security Leak

A raw Azure OpenAI 401 or timeout exception carries endpoint URLs, subscription hints, and SDK internals in ex.Message. None of that should reach a client. The AI service boundary catches the exception, logs it internally with full context, and returns a sanitised message to the caller. One line of difference, significant surface reduction.

The TinyMCE Silent Failure

The AI card writes to the product description field via textarea.value = text. This silently does nothing when TinyMCE is active. TinyMCE replaces the DOM element entirely. No error, no feedback. The button appears to work, but nothing changes. Lesson: The AI integration point in the UI needs to know what the UI is actually made of.

Key Takeaways

Draw the seam before you write any code. The provider/domain split is the foundational decision. Everything else, testability, swappability, and cost control, follows from it.

Make failure a return value, not a control flow mechanism. An AI system has expected failure modes: timeouts, rate limits, and degraded responses. AIResponse<T> handles them at the boundary. No caller above it needs a try/catch.

System prompts are architectural contracts, not strings. They live in one place, they don't change at runtime, and they're the only place your business rules for AI behaviour exist. Treat them with the same discipline as interface contracts.

Temperature is a creativity dial, not a safety control. The system prompt is where constraints live. Knowing the difference affects every feature you build on top of the same provider.

Propagate CancellationToken all the way to the LLM call. A broken chain gives false confidence and silently leaks API cost when users abandon requests.

Make cache hits observable. The FromCache signal in the response wrapper isn't decoration; it's the only way to verify your caching is working without opening a monitoring dashboard.

Cost is a first-class concern, not an afterthought. Token tracking per feature per day is what separates a portfolio project from a production system. Almost no one builds this. Build it anyway.

AI startup failures should be loud and specific. A missing config key that crashes deep inside Semantic Kernel setup is far worse than a clean, early "AI configuration is missing" exception at the service boundary.

Conclusion & What's Next

Week 1 is one AI feature with a production-grade system behind it: a layered architecture with a hard provider/domain seam, a generic response envelope, observable caching, structured logging, graceful degradation, and cost tracking. The feature is modest. The system thinking is not.

Week 2 is RAG semantic search using text-embedding-3-small and Azure SQL Vector, with hybrid search (vector + keyword in parallel) to handle exact matches that pure vector retrieval handles poorly. The agentic chatbot in Weeks 3 - 4 depends on RAG to ground its responses, which is why RAG comes first. Dependency-first sequencing isn't just planning hygiene; it prevents retrofitting at the seam where two features meet.

The design decisions in Week 1 were made with Week 4 in mind. That's what production AI system design actually is.

Built on: ASP.NET Core MVC · Azure OpenAI · Semantic Kernel · Microsoft.Extensions.AI · gpt-4.1-mini · IMemoryCache

🤝 Connect with Me

If you're building AI into .NET or just following along, let's connect.

💼 LinkedIn
🐙 GitHub