<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sharad Kumar</title>
    <description>The latest articles on DEV Community by Sharad Kumar (@sharad_kumar_45b990921489).</description>
    <link>https://dev.to/sharad_kumar_45b990921489</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3916780%2Fac06fee2-be79-4289-8730-b12b23a3416a.png</url>
      <title>DEV Community: Sharad Kumar</title>
      <link>https://dev.to/sharad_kumar_45b990921489</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sharad_kumar_45b990921489"/>
    <language>en</language>
    <item>
      <title>Building Hybrid Semantic Search in ASP.NET Core — SQL Vector, Azure AI Search, and the Bugs Between Them</title>
      <dc:creator>Sharad Kumar</dc:creator>
      <pubDate>Tue, 19 May 2026 04:31:22 +0000</pubDate>
      <link>https://dev.to/sharad_kumar_45b990921489/building-hybrid-semantic-search-in-aspnet-core-sql-vector-azure-ai-search-and-the-bugs-between-aed</link>
      <guid>https://dev.to/sharad_kumar_45b990921489/building-hybrid-semantic-search-in-aspnet-core-sql-vector-azure-ai-search-and-the-bugs-between-aed</guid>
      <description>&lt;p&gt;&lt;em&gt;Part 2 of building a public AI learning series on top of an existing Bulky MVC bookstore. Code is live at &lt;a href="https://readify-eph9gsh4exanaafg.canadacentral-01.azurewebsites.net" rel="noopener noreferrer"&gt;readify&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Most semantic search tutorials start with a fresh project, a clean vector store, and a hand-picked dataset designed to make the demo look good. I had none of that.&lt;/p&gt;

&lt;p&gt;I had an existing MVC application, an existing SQL Server database, an existing repository pattern I couldn't break, and seed data whose descriptions were all identical "lorem ipsum". As I discovered, identical descriptions produce nearly identical embeddings, which makes cosine similarity essentially random. You will not find that in the tutorials.&lt;/p&gt;

&lt;p&gt;This article is about what actually happened: the architecture decisions, the failures that only surface at the intersection of AI, EF Core, and async, and a benchmark that showed SQL Vector outperforming Azure AI Search at a small scale, the opposite of what the plan document predicted. Here is how the system was designed before a line of code was written.&lt;/p&gt;
&lt;h2&gt;
  
  
  1. The Architecture: How the System Is Structured
&lt;/h2&gt;

&lt;p&gt;The sequence I planned was deliberate: keyword search runs first (safe, never throws), semantic search runs second (can fail), then the results merge and quality gets evaluated. That order shaped every decision that followed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbs5rke64st78hpa4km93.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbs5rke64st78hpa4km93.png" alt="Hybrid search architecture diagram showing the full retrieval pipeline from user query through AIController, keyword and semantic paths, query expansion, embedding service, SQL Vector and Azure AI Search, merge and confidence scoring, and RAG faithfulness evaluation" width="800" height="901"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every colour in the diagram maps to a failure zone. Green is the safe path: keyword search that is pure SQL, with no AI dependency. Purple is the semantic path; it can fail on network, quota, or bad data. Orange is the opt-in cost: query expansion that only runs when the user asks for it. Red is the fully decoupled RAG evaluation, which is fire-and-forget and never touches the user response.&lt;/p&gt;

&lt;p&gt;The key structural decision is sequencing: the safe path runs before the risky one. My first draft ran keyword search inside the catch block, which meant a database failure would also take down the fallback. The fix was reordering, not rewriting. &lt;strong&gt;A fallback is only safe if it was verified healthy before the failure happened.&lt;/strong&gt;&lt;/p&gt;
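
&lt;p&gt;The ordering can be sketched as follows. This is a minimal, synchronous illustration rather than the controller from the project: &lt;code&gt;HybridSearch&lt;/code&gt; and its two delegates are hypothetical stand-ins for the real keyword and semantic services.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: the delegates stand in for the real keyword and semantic services.
static List&amp;lt;string&amp;gt; HybridSearch(
    Func&amp;lt;List&amp;lt;string&amp;gt;&amp;gt; keywordSearch,
    Func&amp;lt;List&amp;lt;string&amp;gt;&amp;gt; semanticSearch)
{
    // Safe path first: if this throws, there was never a fallback to rely on.
    var keywordHits = keywordSearch();

    try
    {
        // Risky path second: network, quota, or bad data can fail here.
        var semanticHits = semanticSearch();
        return semanticHits.Union(keywordHits).ToList(); // semantic results ranked first
    }
    catch (Exception)
    {
        // The fallback was computed before the failure, so it is already in hand.
        return keywordHits;
    }
}

var results = HybridSearch(
    () =&amp;gt; new List&amp;lt;string&amp;gt; { "Gone Girl" },
    () =&amp;gt; throw new TimeoutException("embedding service unavailable"));

Console.WriteLine(string.Join(", ", results)); // Gone Girl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because the keyword results exist before the &lt;code&gt;try&lt;/code&gt; block is entered, the &lt;code&gt;catch&lt;/code&gt; block does no work that could itself fail.&lt;/p&gt;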


&lt;h2&gt;
  
  
  2. Embedding Layer: Storing Vectors in SQL Server
&lt;/h2&gt;

&lt;p&gt;An embedding is a fixed-size array of floats that represents the meaning of a piece of text. &lt;code&gt;text-embedding-3-small&lt;/code&gt; converts a product description into a &lt;code&gt;float[1536]&lt;/code&gt; vector: 1,536 numbers where similar meanings land close together in that space. Searching by meaning means converting the user's query into the same vector space and finding which products are closest. Closeness is measured with cosine similarity, the cosine of the angle between two vectors. None of this works without first getting those vectors into the database, which is where the storage problem starts.&lt;/p&gt;
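
&lt;p&gt;For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The function below is a minimal sketch with a hypothetical name (&lt;code&gt;CosineSimilarity&lt;/code&gt;), not the project's implementation; it assumes both vectors have the same dimension and neither is all zeros.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;

// Cosine similarity: dot(a, b) / (|a| * |b|).
// 1 means same direction, 0 means orthogonal (unrelated).
static float CosineSimilarity(float[] a, float[] b)
{
    if (a.Length != b.Length)
        throw new ArgumentException("Vectors must have the same length.");

    float dot = 0f, normA = 0f, normB = 0f;
    for (int i = 0; i &amp;lt; a.Length; i++)
    {
        dot   += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (MathF.Sqrt(normA) * MathF.Sqrt(normB));
}

Console.WriteLine(CosineSimilarity(new[] { 1f, 0f }, new[] { 1f, 0f })); // 1
Console.WriteLine(CosineSimilarity(new[] { 1f, 0f }, new[] { 0f, 1f })); // 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;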
&lt;h3&gt;
  
  
  The Two-Property Pattern (Because EF Core Can't Persist float[])
&lt;/h3&gt;

&lt;p&gt;EF Core can't persist &lt;code&gt;float[]&lt;/code&gt;. SQL Server can't store it either. The solution is a translation layer that lives entirely on the model and is invisible to every other layer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// EF Core persists this — the actual database column&lt;/span&gt;
&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;[]?&lt;/span&gt; &lt;span class="n"&gt;SearchEmbeddingData&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Application code uses this — [NotMapped] means EF ignores it entirely&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;NotMapped&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="p"&gt;[]?&lt;/span&gt; &lt;span class="n"&gt;SearchEmbedding&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;get&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;SearchEmbeddingData&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;
        &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MemoryMarshal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Cast&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;SearchEmbeddingData&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;ToArray&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;set&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;value&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;SearchEmbeddingData&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;SearchEmbeddingData&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MemoryMarshal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AsBytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AsSpan&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;ToArray&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;MemoryMarshal&lt;/code&gt; reinterprets the same memory block under a different type, so the conversion itself needs no per-element parsing; the only cost is the &lt;code&gt;ToArray()&lt;/code&gt; call, which copies the result into a new array. The getter converts bytes to floats on read. The setter converts floats to bytes on write. Every other layer works with &lt;code&gt;float[]&lt;/code&gt; naturally and never knows bytes exist.&lt;/p&gt;
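
&lt;p&gt;A quick round trip shows the two conversions in isolation. The sample values are arbitrary, and the byte layout follows the machine's byte order, so this sketch assumes the data is written and read on the same architecture.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;
using System.Linq;
using System.Runtime.InteropServices;

float[] original = { 0.25f, -1.5f, 3f };

// float[] to byte[]: view the floats as raw bytes, then copy them into a new array for storage.
byte[] stored = MemoryMarshal.AsBytes(original.AsSpan()).ToArray();

// byte[] to float[]: view the bytes as floats, then copy them back out.
float[] restored = MemoryMarshal.Cast&amp;lt;byte, float&amp;gt;(stored.AsSpan()).ToArray();

Console.WriteLine(stored.Length);                    // 12 (3 floats x 4 bytes)
Console.WriteLine(restored.SequenceEqual(original)); // True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;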

&lt;h3&gt;
  
  
  Seed Data Quality Is a Precondition, Not Configuration
&lt;/h3&gt;

&lt;p&gt;The seed format is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"{Title} by {Author}. Category: {Category}. {Description}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;"Gone Girl" alone tells the model less than "Gone Girl by Gillian Flynn. Category: Thriller. A psychological thriller about a missing woman and her husband's dark secrets." The surrounding context pulls the vector toward the right region of embedding space.&lt;/p&gt;

&lt;p&gt;I discovered this the hard way: my initial seed data had identical lorem ipsum descriptions across all six products. The embeddings were nearly identical vectors, so cosine similarity had no meaningful signal to rank against. Search results looked random because they essentially were. &lt;strong&gt;Seed data quality is a precondition for RAG correctness.&lt;/strong&gt; This is the kind of failure mode that only surfaces when you actually run the math and see the results make no sense.&lt;/p&gt;
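
&lt;p&gt;The seed format shown earlier can be assembled with a one-line helper. &lt;code&gt;BuildEmbeddingText&lt;/code&gt; is a hypothetical name used for illustration; the parameters simply mirror the placeholders in the format string.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;

// Hypothetical helper: builds the text that gets sent to the embedding model.
static string BuildEmbeddingText(string title, string author, string category, string description)
    =&amp;gt; $"{title} by {author}. Category: {category}. {description}";

Console.WriteLine(BuildEmbeddingText(
    "Gone Girl", "Gillian Flynn", "Thriller",
    "A psychological thriller about a missing woman and her husband's dark secrets."));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;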

&lt;h3&gt;
  
  
  The Embeddings That Were Generated But Never Saved
&lt;/h3&gt;

&lt;p&gt;Embeddings generated successfully. Logs confirmed it. Then:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;Products&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;SearchEmbeddingData&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="c1"&gt;-- Result: 0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Root cause: the existing repository's &lt;code&gt;Update()&lt;/code&gt; method maps properties manually, one by one, a standard pattern for the rest of the codebase. &lt;code&gt;SearchEmbeddingData&lt;/code&gt; was never added to that list. EF Core tracked the entity, &lt;code&gt;SaveChanges()&lt;/code&gt; was called, and the byte array column was silently skipped. No exception. No warning. Nothing.&lt;/p&gt;

&lt;p&gt;This is the specific tax of adding AI to a system with manual property-copy update methods. Every new column added to the model must be added to the update method by hand, with no compiler enforcement and no runtime signal when you forget.&lt;/p&gt;
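
&lt;p&gt;The failure mode is easy to reproduce without EF Core at all. The sketch below uses a simplified, hypothetical &lt;code&gt;Product&lt;/code&gt; class and &lt;code&gt;CopyFields&lt;/code&gt; method to mirror the manual-copy pattern; it is not the repository code from the project.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;

var tracked  = new Product { Id = 1, Title = "Gone Girl" };
var incoming = new Product { Id = 1, Title = "Gone Girl", SearchEmbeddingData = new byte[] { 1, 2, 3, 4 } };

CopyFields(tracked, incoming);
Console.WriteLine(tracked.SearchEmbeddingData is null); // False, because the last copy line is present

// Hypothetical manual-copy update, one assignment per property.
static void CopyFields(Product target, Product source)
{
    target.Title = source.Title;
    target.Description = source.Description;
    // Remove the next line and everything still compiles and runs,
    // but the embedding never reaches the tracked entity.
    target.SearchEmbeddingData = source.SearchEmbeddingData;
}

class Product
{
    public int Id { get; set; }
    public string Title { get; set; } = "";
    public string Description { get; set; } = "";
    public byte[]? SearchEmbeddingData { get; set; }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;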

&lt;h3&gt;
  
  
  [NotMapped] Is Invisible to EF — Including in Where Clauses
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Produces no SQL WHERE clause — EF silently ignores it&lt;/span&gt;
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SearchEmbedding&lt;/span&gt; &lt;span class="p"&gt;!=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// Correct — filters on the actual mapped column&lt;/span&gt;
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SearchEmbeddingData&lt;/span&gt; &lt;span class="p"&gt;!=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



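&lt;p&gt;&lt;code&gt;SearchEmbedding&lt;/code&gt; is &lt;code&gt;[NotMapped]&lt;/code&gt;, a computed property that exists only in memory. EF Core cannot translate it to SQL. In my setup nothing threw: no WHERE clause reached the database, so the query scanned the full table. The code compiles, runs, and loads every product, not only the ones with embeddings. (On EF Core 3.0 and later, applying an untranslatable filter directly to an &lt;code&gt;IQueryable&lt;/code&gt; throws an &lt;code&gt;InvalidOperationException&lt;/code&gt; instead, so the silent version typically appears when the filter runs on a sequence the repository has already materialised.)&lt;/p&gt;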

&lt;p&gt;This only surfaces when you have a translation layer between your database type and your application type. Without the embedding system, you'd never have a &lt;code&gt;[NotMapped]&lt;/code&gt; computed property in a LINQ filter. Add one, and your filter assumptions break without warning.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Search Layer: Hybrid Retrieval and Confidence Scoring
&lt;/h2&gt;

&lt;p&gt;With vectors in the database, the search layer has three jobs: retrieve candidates, merge the semantic and keyword paths, and decide whether to trust the result.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Not a Vector Database
&lt;/h3&gt;

&lt;p&gt;Reaching for Qdrant or Pinecone when someone says "embeddings" is a reasonable instinct. I didn't.&lt;/p&gt;

&lt;p&gt;Each embedding from &lt;code&gt;text-embedding-3-small&lt;/code&gt; is a &lt;code&gt;float[1536]&lt;/code&gt; array: 1,536 × 4 bytes = 6 KB per product. 500 products are roughly 3 MB in memory. At that scale, loading vectors and computing cosine similarity in-process in C# is fast enough. One database, one EF Core provider, one migration. No second Azure resource to provision, bill, or debug.&lt;/p&gt;

&lt;p&gt;The upgrade path to Azure AI Search exists, and I built it too. But building the simple path first forced me to implement cosine similarity from scratch and understand what the managed service abstracts. You can't explain why HNSW is faster if you've never felt the O(n) scan problem. The SQL Vector path is live in production. Azure AI Search is the comparison baseline. Both are benchmarked in Section 8.&lt;/p&gt;
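
&lt;p&gt;The O(n) scan is short enough to show in full. This is a minimal sketch, not the production search service: &lt;code&gt;TopK&lt;/code&gt; and &lt;code&gt;Dot&lt;/code&gt; are hypothetical names, and the code assumes every vector is unit-length so that the dot product equals cosine similarity.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;
using System.Collections.Generic;
using System.Linq;

// Brute-force top-K: score every stored vector against the query, sort, keep the best k.
// Assumes unit-length vectors, so the dot product equals cosine similarity.
static List&amp;lt;(int Id, float Score)&amp;gt; TopK(
    float[] query, IEnumerable&amp;lt;(int Id, float[] Vector)&amp;gt; catalogue, int k)
{
    return catalogue
        .Select(item =&amp;gt; (item.Id, Score: Dot(query, item.Vector)))
        .OrderByDescending(x =&amp;gt; x.Score)
        .Take(k)
        .ToList();
}

static float Dot(float[] a, float[] b)
{
    float sum = 0f;
    for (int i = 0; i &amp;lt; a.Length; i++) sum += a[i] * b[i];
    return sum;
}

var catalogue = new List&amp;lt;(int Id, float[] Vector)&amp;gt;
{
    (1, new[] { 1f, 0f }),
    (2, new[] { 0f, 1f }),
    (3, new[] { 0.6f, 0.8f }),
};

var hits = TopK(new[] { 1f, 0f }, catalogue, 2);
Console.WriteLine(string.Join(", ", hits.Select(h =&amp;gt; h.Id))); // 1, 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Every query touches every vector, which is fine for a few hundred products and is exactly the cost an approximate index such as HNSW is designed to avoid.&lt;/p&gt;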

&lt;h3&gt;
  
  
  Composite Confidence: Why One Threshold Wasn't Enough
&lt;/h3&gt;

&lt;p&gt;My first implementation used &lt;code&gt;LowConfidenceThreshold = 0.75f&lt;/code&gt;. A top score of 0.57 with a second score of 0.37, a gap of 0.20, was being flagged as low confidence. That result won clearly. The threshold was penalising it for not reaching an arbitrary ceiling.&lt;/p&gt;

&lt;p&gt;A single threshold can't distinguish "this result won clearly" from "the top two results are nearly tied." A score of 0.57 with a gap of 0.20 is high confidence. The same score with a gap of 0.01 is a coin flip. The gap tells you something the absolute score cannot.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;LowConfidenceThreshold&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;0.4f&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;topScore&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scored&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ElementAtOrDefault&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)?.&lt;/span&gt;&lt;span class="n"&gt;Score&lt;/span&gt; &lt;span class="p"&gt;??&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;secondScore&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scored&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ElementAtOrDefault&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)?.&lt;/span&gt;&lt;span class="n"&gt;Score&lt;/span&gt; &lt;span class="p"&gt;??&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt;   &lt;span class="n"&gt;scoreGap&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;topScore&lt;/span&gt; &lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="n"&gt;secondScore&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="n"&gt;lowConfidence&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;
    &lt;span class="n"&gt;topScore&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;LowConfidenceThreshold&lt;/span&gt; &lt;span class="p"&gt;||&lt;/span&gt;           &lt;span class="c1"&gt;// absolute floor&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topScore&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;0.50f&lt;/span&gt; &lt;span class="p"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;scoreGap&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;0.10f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;||&lt;/span&gt;     &lt;span class="c1"&gt;// mediocre + indistinct&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topScore&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;0.60f&lt;/span&gt; &lt;span class="p"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;scoreGap&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;0.05f&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;       &lt;span class="c1"&gt;// decent score + coin-flip ranking&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;LowConfidence&lt;/code&gt; is driven purely by the semantic path's score, not by whether keyword results filled the remaining slots. The flag is about AI retrieval confidence, not result completeness. Mixing those signals would make it meaningless.&lt;/p&gt;
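
&lt;p&gt;To see how the three rules interact, here they are wrapped in a small function and applied to a few score pairs. &lt;code&gt;IsLowConfidence&lt;/code&gt; is a hypothetical wrapper added for illustration; the thresholds are the ones from the snippet above, and the last two score pairs are made-up inputs.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;

// Same three rules as above, wrapped in a function so they can be tried on sample scores.
static bool IsLowConfidence(float topScore, float secondScore)
{
    const float LowConfidenceThreshold = 0.4f;
    float scoreGap = topScore - secondScore;

    return topScore &amp;lt; LowConfidenceThreshold                 // absolute floor
        || (topScore &amp;lt; 0.50f &amp;amp;&amp;amp; scoreGap &amp;lt; 0.10f)  // mediocre + indistinct
        || (topScore &amp;lt; 0.60f &amp;amp;&amp;amp; scoreGap &amp;lt; 0.05f); // decent score + coin-flip ranking
}

Console.WriteLine(IsLowConfidence(0.57f, 0.37f)); // False: clear winner, gap of 0.20
Console.WriteLine(IsLowConfidence(0.57f, 0.56f)); // True: same top score, gap of 0.01
Console.WriteLine(IsLowConfidence(0.35f, 0.05f)); // True: below the absolute floor
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;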

&lt;h3&gt;
  
  
  Keyword Search: Word-Splitting vs Single Phrase Match
&lt;/h3&gt;

&lt;p&gt;The original plan described a single &lt;code&gt;.Contains(query)&lt;/code&gt; checking if the full phrase appeared in a title or description. The actual implementation splits first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="n"&gt;IReadOnlyList&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Product&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;KeywordSearch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;topK&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;queryWords&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sc"&gt;' '&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StringSplitOptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RemoveEmptyEntries&lt;/span&gt; &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;StringSplitOptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TrimEntries&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ToList&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_unitOfWork&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Product&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;includeProperties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Category,ProductImages"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;queryWords&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Title&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;||&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Description&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Take&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topK&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ToList&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Searching "cozy mystery" with a full-phrase match fails if no title contains that exact substring. Splitting into words and matching any of them handles multi-word queries correctly. This is also synchronous by design, pure in-memory LINQ with no I/O. There's nothing to await.&lt;/p&gt;

&lt;h3&gt;
  
  
  SearchResult&amp;lt;T&amp;gt;: Why I Didn't Reuse the Week 1 Envelope
&lt;/h3&gt;

&lt;p&gt;I already had &lt;code&gt;AIResponse&amp;lt;T&amp;gt;&lt;/code&gt; from Week 1. The temptation was to add &lt;code&gt;TopScore&lt;/code&gt; and &lt;code&gt;LowConfidence&lt;/code&gt; to it and avoid a new type. I didn't.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;AIResponse&amp;lt;T&amp;gt;&lt;/code&gt; carries &lt;code&gt;FromCache&lt;/code&gt;, an infrastructure concern. &lt;code&gt;SearchResult&amp;lt;T&amp;gt;&lt;/code&gt; carries &lt;code&gt;TopScore&lt;/code&gt; and &lt;code&gt;LowConfidence&lt;/code&gt;, domain concerns specific to retrieval. Merging them would mean the description generator's response type carries unused search fields. A field on a type implies it's relevant to that type's callers, so that would be a misleading model design.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SearchResult&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="n"&gt;IReadOnlyList&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Items&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;init&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;TopScore&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;init&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="n"&gt;LowConfidence&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;init&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  4. Query Expansion: Bridging the Vocabulary Gap
&lt;/h2&gt;

&lt;p&gt;Even searching for "psychological thriller remote mountain town", words that appear verbatim in a product description, scored only 0.57. The stored vector was generated from a 278-character structured string. The surrounding context shifts the vector, and a short query doesn't land in the same neighbourhood even when the words are identical.&lt;/p&gt;

&lt;p&gt;Query expansion bridges this: a GPT call reformulates a short query into richer language before embedding it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;ExpandQueryAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CancellationToken&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;$"""
&lt;/span&gt;        &lt;span class="n"&gt;Expand&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt; &lt;span class="n"&gt;book&lt;/span&gt; &lt;span class="n"&gt;search&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="kt"&gt;short&lt;/span&gt; &lt;span class="n"&gt;descriptive&lt;/span&gt; &lt;span class="n"&gt;phrase&lt;/span&gt;
        &lt;span class="n"&gt;that&lt;/span&gt; &lt;span class="n"&gt;includes&lt;/span&gt; &lt;span class="n"&gt;genre&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mood&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="n"&gt;themes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;-&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt; &lt;span class="n"&gt;sentences&lt;/span&gt; &lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
        &lt;span class="n"&gt;Query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="s"&gt;""";
&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;ChatHistory&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddUserMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_chatCompletionService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetChatMessageContentAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cancellationToken&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Content&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Scores improved by ~0.05 on SQL Vector and ~0.03 on Azure AI Search when a matching product existed. When no relevant product existed, scores improved marginally but low confidence flags remained correct. &lt;strong&gt;Expansion cannot manufacture relevance that isn't in the catalogue.&lt;/strong&gt; It's a vocabulary bridge, not a relevance fix.&lt;/p&gt;

&lt;p&gt;Expansion is opt-in: &lt;code&gt;useQueryExpansion = false&lt;/code&gt; by default. The default path makes zero extra API calls. When confidence is low, the UI shows a "Search Harder" button that resubmits with &lt;code&gt;expand=true&lt;/code&gt;. Users who get good results pay nothing extra. Users who don't can ask for more.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Azure AI Search: The Second Retrieval Path
&lt;/h2&gt;

&lt;p&gt;I built both SQL Vector and Azure AI Search as separate, independently routable paths. A &lt;code&gt;CompareSearch&lt;/code&gt; admin endpoint runs both against the same query and returns side-by-side results, which is how the benchmark data below was collected.&lt;/p&gt;

&lt;h3&gt;
  
  
  The One-Character Fix That Changed All Scores
&lt;/h3&gt;

&lt;p&gt;Azure AI Search initially returned scores in the 0.016–0.033 range. The confidence threshold (&lt;code&gt;0.75f&lt;/code&gt;) flagged every query as low confidence: "dark skies" (a literal book title) and "quantum oxford" (genuinely irrelevant) both scored 0.033.&lt;/p&gt;

&lt;p&gt;Root cause: passing a query string alongside vector options to &lt;code&gt;SearchAsync&lt;/code&gt; triggers hybrid RRF (Reciprocal Rank Fusion) scoring, which fuses the BM25 text ranking with the vector ranking. RRF scores are built from rank positions rather than similarity values, so they come out as compressed, tiny numbers regardless of semantic similarity. Changing &lt;code&gt;SearchAsync(query, options, ct)&lt;/code&gt; to &lt;code&gt;SearchAsync("*", options, ct)&lt;/code&gt; forces pure vector search and returns cosine-like scores (0.54–0.70). One character.&lt;/p&gt;
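
&lt;p&gt;The compressed range follows from the way RRF is usually defined: each ranked list contributes 1 / (k + rank) to a document's score. The sketch below uses a hypothetical &lt;code&gt;RrfScore&lt;/code&gt; helper and assumes k = 60, the constant commonly used with RRF. Under those assumptions, a document ranked first in one list scores about 0.016 and a document ranked first in both lists scores about 0.033, which is consistent with the range observed above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;
using System.Globalization;

// Reciprocal Rank Fusion: each ranked list contributes 1 / (k + rank), with rank starting at 1.
// k = 60 is an assumption here (the commonly used constant).
static double RrfScore(int[] ranks, int k = 60)
{
    double score = 0;
    foreach (int rank in ranks) score += 1.0 / (k + rank);
    return score;
}

var inv = CultureInfo.InvariantCulture;
Console.WriteLine(RrfScore(new[] { 1 }).ToString("F4", inv));    // 0.0164 (first in one list)
Console.WriteLine(RrfScore(new[] { 1, 1 }).ToString("F4", inv)); // 0.0328 (first in both lists)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because only rank positions enter the formula, a literal title match and an irrelevant query can end up with the same score whenever they hold the same ranks.&lt;/p&gt;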

&lt;h3&gt;
  
  
  Separate Confidence Threshold Per Path
&lt;/h3&gt;

&lt;p&gt;SQL Vector scores and Azure AI Search scores are not on the same scale, even after fixing the RRF issue. Applying the same &lt;code&gt;LowConfidenceThreshold&lt;/code&gt; constant to both produces a meaningful signal on whichever path it was tuned for and a meaningless one on the other.&lt;/p&gt;

&lt;p&gt;After plotting 10 queries, the natural gap in Azure scores sat between 0.5939 and 0.6258. A dedicated constant:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;AzureLowConfidenceThreshold&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;0.61f&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;cleanly separates relevant from irrelevant results. The constant name makes the intent explicit: this is not the SQL Vector threshold, and the two should never be merged.&lt;/p&gt;

&lt;h3&gt;
  
  
  Result Ordering Disappears After the EF Query
&lt;/h3&gt;

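&lt;p&gt;Azure AI Search returns product IDs ranked by relevance. A subsequent EF &lt;code&gt;WHERE id IN (...)&lt;/code&gt; query doesn't preserve that order, because a SQL result set has no guaranteed order without an explicit &lt;code&gt;ORDER BY&lt;/code&gt;. The fix is to re-project the fetched rows in the original ranked order:&lt;br&gt;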
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Build a dictionary for O(1) lookup&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;productMap&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_unitOfWork&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Product&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;includeProperties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Category,ProductImages"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ToDictionary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Re-project using the original Azure-ranked order&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ids&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;productMap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ContainsKey&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;productMap&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ToList&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you fetch products by ID without re-ordering, you lose the ranking signal entirely. The most relevant result stays first only because you put it there explicitly.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. RAG Evaluation Layer: A Second Opinion on Every Result
&lt;/h2&gt;

&lt;p&gt;After the hybrid search returns results, a second LLM call scores how well the retrieved context answers the query on a scale of 1 to 5. The call is fire-and-forget: the score is logged to App Insights and never blocks the user response.&lt;/p&gt;

&lt;p&gt;The judge only sees the raw query and the retrieved product descriptions as plain text. No embeddings, no cosine scores, no knowledge of how retrieval worked. It reasons purely as a reader, which is the point.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Fire-and-Forget Task That Was Never Actually Background
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Wrong — task gets cancelled the moment the HTTP response is sent&lt;/span&gt;
&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_ragEvaluationService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ScoreFaithfulnessAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Correct — task runs to completion regardless of request lifetime&lt;/span&gt;
&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_ragEvaluationService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ScoreFaithfulnessAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CancellationToken&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;None&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The HTTP request's &lt;code&gt;CancellationToken&lt;/code&gt; is cancelled when the response is sent. Passing it to a fire-and-forget task means the background work gets cancelled at exactly the moment it's supposed to be running independently. This only surfaces in systems that deliberately decouple a second AI call from the request lifecycle; you don't hit it with synchronous code or normally awaited async calls.&lt;/p&gt;

&lt;h3&gt;
  
  
  Two Design Decisions Worth Explaining
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why -1 instead of &lt;code&gt;Math.Clamp&lt;/code&gt;:&lt;/strong&gt; When the judge returns "3/5" or a prose explanation instead of a bare integer, &lt;code&gt;int.TryParse&lt;/code&gt; fails. Clamping that garbage to a valid range silently corrupts dashboard averages; a faithfulness trend chart fed by clamped invalid responses tells you nothing real. The -1 sentinel gets excluded from aggregates, and the warning log tells you exactly how often the judge is misbehaving.&lt;/p&gt;
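
&lt;p&gt;A minimal sketch of that sentinel parsing, under stated assumptions: &lt;code&gt;JudgeScore&lt;/code&gt; and &lt;code&gt;Parse&lt;/code&gt; are hypothetical names, and treating an out-of-range integer (for example 9) as invalid is my assumption rather than something the original code is known to do.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System.Globalization;

public static class JudgeScore
{
    // Sentinel for "the judge did not return a usable score".
    public const int Invalid = -1;

    public static int Parse(string? raw)
    {
        // NumberStyles.None accepts digits only, so "3/5" or prose fails to parse.
        bool parsed = int.TryParse(raw?.Trim(), NumberStyles.None,
            CultureInfo.InvariantCulture, out int score);

        // Assumption: values outside 1..5 are also treated as invalid, never clamped.
        return parsed &amp;&amp; score &gt;= 1 &amp;&amp; score &lt;= 5 ? score : Invalid;
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With this shape, &lt;code&gt;JudgeScore.Parse("4")&lt;/code&gt; yields 4, while &lt;code&gt;JudgeScore.Parse("3/5")&lt;/code&gt; yields -1, which an aggregate can filter out before averaging.&lt;/p&gt;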

&lt;p&gt;&lt;strong&gt;Why &lt;code&gt;ToList()&lt;/code&gt; before the fire-and-forget call:&lt;/strong&gt; &lt;code&gt;.Select(p =&amp;gt; p.Description)&lt;/code&gt; uses deferred execution. Without materialising first, enumeration happens on a background thread after the request ends, at which point the EF Core &lt;code&gt;DbContext&lt;/code&gt; may already be disposed. &lt;code&gt;ToList()&lt;/code&gt; runs everything while the request is still alive. Removing it looks like a valid simplification (the types are compatible), but it introduces a bug that only surfaces at runtime under load.&lt;/p&gt;
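
&lt;p&gt;The deferred-execution behaviour can be reproduced without EF Core. In this sketch a plain &lt;code&gt;List&amp;lt;string&amp;gt;&lt;/code&gt; stands in for the query source, and clearing it stands in for the request ending; with a real disposed &lt;code&gt;DbContext&lt;/code&gt; the late enumeration would typically throw rather than return nothing. All variable names are illustrative.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;
using System.Collections.Generic;
using System.Linq;

// A plain list stands in for the EF Core query source in this sketch.
var source = new List&lt;string&gt; { "First description", "Second description" };

IEnumerable&lt;string&gt; deferred = source.Select(d =&gt; d.Trim()); // nothing has run yet
List&lt;string&gt; snapshot = deferred.ToList();                    // runs now, results are copied

source.Clear(); // stands in for the request ending and its data source going away

Console.WriteLine(deferred.Count()); // 0: re-enumerates the now-empty source
Console.WriteLine(snapshot.Count);   // 2: captured before the source changed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;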

&lt;p&gt;The cosine similarity score and the faithfulness score measure fundamentally different things. Cosine measures mathematical distance before you look at content. Faithfulness measures whether the actual retrieved text serves the user's intent. You can have a high cosine score and a faithfulness score of 2, because the embeddings matched on surface features rather than meaning. Both signals together are more honest than either alone.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Production Hygiene: Rate Limiting and Input Validation
&lt;/h2&gt;

&lt;p&gt;Each search query hits the Azure OpenAI embedding API, so there's a real cost attached. Three layers of defence address three different threat shapes, and none replaces the others:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Input validation&lt;/strong&gt; (3–200 character length check) rejects garbage before any API call is made&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Response caching&lt;/strong&gt; (normalized query key: &lt;code&gt;query.ToLowerInvariant().Trim()&lt;/code&gt;) eliminates repeat costs entirely; identical queries cost nothing after the first hit&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate limiting&lt;/strong&gt; caps volume from any single client
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;EnableRateLimiting&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"search"&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;IActionResult&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;Search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="n"&gt;expand&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CancellationToken&lt;/span&gt; &lt;span class="n"&gt;ct&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;IsNullOrWhiteSpace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;||&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Trim&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;Length&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt; &lt;span class="p"&gt;||&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Trim&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;Length&lt;/span&gt; &lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;TempData&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"Error"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Search query must be between 3 and 200 characters."&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;RedirectToAction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Index"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Home"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rate limiting policy names are a runtime concern, not a compile-time contract. If &lt;code&gt;policyName: "search"&lt;/code&gt; in &lt;code&gt;Program.cs&lt;/code&gt; doesn't exactly match &lt;code&gt;[EnableRateLimiting("search")]&lt;/code&gt; on the action, the attribute is silently ignored: no error, no indication, just unenforced limits. &lt;code&gt;UseRateLimiter()&lt;/code&gt; must also appear in the middleware pipeline after &lt;code&gt;UseRouting()&lt;/code&gt;, otherwise the policy exists but is never applied.&lt;/p&gt;
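
&lt;p&gt;A sketch of the matching &lt;code&gt;Program.cs&lt;/code&gt; wiring, assuming .NET 7 or later with the Web SDK's implicit usings. The fixed-window limiter type, the limit of 10 requests per minute, and the &lt;code&gt;RateLimitPolicies&lt;/code&gt; class are all illustrative assumptions, not the project's actual configuration. Note that a named fixed-window policy registered this way is shared by all callers; capping each client separately would need a partitioned policy.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using Microsoft.AspNetCore.RateLimiting;

var builder = WebApplication.CreateBuilder(args);
builder.Services.AddControllersWithViews();

builder.Services.AddRateLimiter(options =&gt;
{
    options.RejectionStatusCode = StatusCodes.Status429TooManyRequests;

    // The name registered here must equal the name used in [EnableRateLimiting(...)].
    options.AddFixedWindowLimiter(policyName: RateLimitPolicies.Search, limiter =&gt;
    {
        limiter.PermitLimit = 10;                 // assumed value
        limiter.Window = TimeSpan.FromMinutes(1); // assumed value
        limiter.QueueLimit = 0;                   // reject instead of queueing
    });
});

var app = builder.Build();

app.UseRouting();
app.UseAuthorization();
app.UseRateLimiter(); // placed after UseRouting so endpoint policies can be resolved
app.MapDefaultControllerRoute();
app.Run();

// Hypothetical holder for the policy name, shared by registration and attribute.
public static class RateLimitPolicies
{
    public const string Search = "search";
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Referencing the same constant from the action, as in &lt;code&gt;[EnableRateLimiting(RateLimitPolicies.Search)]&lt;/code&gt;, turns a typo in the policy name into a compile error instead of a runtime surprise.&lt;/p&gt;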




&lt;h2&gt;
  
  
  8. The Benchmark That Contradicted the Plan
&lt;/h2&gt;

&lt;p&gt;The plan document cited ~40ms for Azure AI Search vs ~120ms for SQL Vector, based on estimates from reference material. I measured before publishing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Actual results across 10 queries:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Query&lt;/th&gt;
&lt;th&gt;SQL #1 Result&lt;/th&gt;
&lt;th&gt;Azure #1 Result&lt;/th&gt;
&lt;th&gt;Same?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;"sugar candy in the fair"&lt;/td&gt;
&lt;td&gt;Cotton Candy&lt;/td&gt;
&lt;td&gt;Cotton Candy&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"sunset on the beach"&lt;/td&gt;
&lt;td&gt;Vanish in the Sunset&lt;/td&gt;
&lt;td&gt;Vanish in the Sunset&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"psychological thriller remote location"&lt;/td&gt;
&lt;td&gt;Dark Skies&lt;/td&gt;
&lt;td&gt;Dark Skies&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"coming of age adventure"&lt;/td&gt;
&lt;td&gt;The Road to Redemption&lt;/td&gt;
&lt;td&gt;The Road to Redemption&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"elon musk"&lt;/td&gt;
&lt;td&gt;(both low confidence)&lt;/td&gt;
&lt;td&gt;(both low confidence)&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"quantum mechanics at oxford"&lt;/td&gt;
&lt;td&gt;(both low confidence)&lt;/td&gt;
&lt;td&gt;(both low confidence)&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;8 of 10 queries returned the same top result.&lt;/strong&gt; This validates the in-process cosine implementation as correct at this catalogue size. The two divergence cases were both genuinely ambiguous queries where neither path had a strong signal, which is the expected behaviour when two different algorithms are both estimating without clear data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Timing:&lt;/strong&gt; SQL Vector averaged ~584ms and Azure AI Search averaged ~671ms, so SQL was faster. Both numbers were dominated by the shared &lt;code&gt;GetEmbeddingAsync&lt;/code&gt; network call to Azure OpenAI (~400–750ms). The actual search operation on either side was under 100ms.&lt;/p&gt;

&lt;p&gt;At small catalogue sizes, SQL Vector avoids a second network round trip to Azure. Azure AI Search's ANN indexing advantage only materialises when the brute-force O(n) cosine scan becomes the bottleneck, which requires hundreds of thousands of products, not six.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Measure before you claim. The plan was wrong.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  9. What's Next
&lt;/h2&gt;

&lt;p&gt;The Week 2 system is fully in production: hybrid retrieval across two paths, composite confidence scoring, opt-in query expansion, RAG faithfulness evaluation running after every search, and a &lt;code&gt;CompareSearch&lt;/code&gt; endpoint that benchmarks both paths side by side.&lt;/p&gt;

&lt;p&gt;Week 3 is Semantic Kernel plugins, MediatR/CQRS, and the first xUnit tests. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Built on: ASP.NET Core MVC · Azure OpenAI · Semantic Kernel · GPT-4o-mini · text-embedding-3-small · Azure SQL Vector · Azure AI Search&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🤝 Connect with Me
&lt;/h2&gt;

&lt;p&gt;If you're building AI into .NET or just following along, let's connect.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;💼 &lt;a href="https://www.linkedin.com/in/sharad9kumar/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🐙 &lt;a href="https://github.com/sharad99kr" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>azure</category>
      <category>dotnet</category>
    </item>
    <item>
      <title>I Built a Production AI Layer Inside a Legacy ASP.NET Core App — and It Broke in Ways Tutorials Never Mention</title>
      <dc:creator>Sharad Kumar</dc:creator>
      <pubDate>Tue, 12 May 2026 00:34:14 +0000</pubDate>
      <link>https://dev.to/sharad_kumar_45b990921489/i-built-a-production-ai-layer-inside-a-legacy-aspnet-core-app-and-it-broke-in-ways-tutorials-4edg</link>
      <guid>https://dev.to/sharad_kumar_45b990921489/i-built-a-production-ai-layer-inside-a-legacy-aspnet-core-app-and-it-broke-in-ways-tutorials-4edg</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Most LLM tutorials assume you start from nothing. A blank project. A clean architecture. No constraints. No legacy code. No deployment history. No production traffic.&lt;/p&gt;

&lt;p&gt;That is not how real systems work.&lt;/p&gt;

&lt;p&gt;I spent one week integrating a production-grade AI service layer into an existing ASP.NET Core MVC e-commerce system that was already live, already structured, and already dependent on architectural decisions I couldn't change. The challenge wasn't calling an LLM API. It was designing an AI layer that could survive inside a real backend system without breaking testability, without leaking cost, without becoming tightly coupled to the domain, and without collapsing the moment the model or provider behaved unexpectedly.&lt;/p&gt;

&lt;p&gt;The feature itself, a tone-aware product description generator, is simple. The system design behind it is not. This article is about the architectural decisions, the production failure modes, and the assumptions that broke the moment an LLM entered a real backend system.&lt;/p&gt;




&lt;h2&gt;
  
  
  AI System Design
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Provider/Domain Seam: The Most Important Structural Decision
&lt;/h3&gt;

&lt;p&gt;The first instinct when adding AI to any backend is to create one service class that does everything. One class, one interface, done.&lt;/p&gt;

&lt;p&gt;That instinct produces a system you cannot test, cannot swap, and cannot extend without touching everything.&lt;/p&gt;

&lt;p&gt;The correct design draws a hard seam between two concerns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Provider layer&lt;/strong&gt; (&lt;code&gt;IAIService&lt;/code&gt; / &lt;code&gt;AzureOpenAIService&lt;/code&gt;): knows how to talk to an LLM. Accepts two strings (system prompt, user prompt), returns a string. Knows nothing about what a book or product is.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Domain layer&lt;/strong&gt; (&lt;code&gt;IProductAIService&lt;/code&gt; / &lt;code&gt;BookAIService&lt;/code&gt;): knows what a description request looks like, what tone means, and how prompts should be constructed. Knows nothing about Azure, HTTP, or Semantic Kernel.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Above that seam: domain logic, testable with zero real network calls. Below it: provider mechanics, swappable without touching anything above. To swap Azure OpenAI for Ollama, write one new class and change one registration. The domain layer doesn't notice.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Naming is a design smell detector. My original class was called &lt;code&gt;OpenAIService&lt;/code&gt; and it implemented both interfaces. When I tried to give it an honest name, I couldn't because it was doing two things."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  The Generic Wrapper That Eliminates Try/Catch Everywhere
&lt;/h3&gt;

&lt;p&gt;With the seam in place, the next question was: what does every AI call return? My first draft returned plain strings or domain objects directly. That looked fine until I had to handle failures, and I started writing the same &lt;code&gt;try/catch&lt;/code&gt; block in three different places.&lt;/p&gt;

&lt;p&gt;Every AI call in the system returns &lt;code&gt;AIResponse&amp;lt;T&amp;gt;&lt;/code&gt;, not a raw result type. The wrapper carries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AIResponse&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;    &lt;span class="n"&gt;Success&lt;/span&gt;      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;init&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt;      &lt;span class="n"&gt;Data&lt;/span&gt;         &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;init&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="n"&gt;ErrorMessage&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;init&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;    &lt;span class="n"&gt;FromCache&lt;/span&gt;    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;init&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;     &lt;span class="n"&gt;TokensUsed&lt;/span&gt;   &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;init&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;AIResponse&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="n"&gt;fromCache&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;AIResponse&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;Fail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The payoff is architectural: &lt;strong&gt;no caller above the provider layer ever writes a try/catch.&lt;/strong&gt; Every feature just checks &lt;code&gt;result.Success&lt;/code&gt;. Error handling is decided once, at the boundary where it's caught, and that discipline holds automatically across every AI feature you add later.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;AIResponse&amp;lt;string&amp;gt;&lt;/code&gt; at the provider layer. &lt;code&gt;AIResponse&amp;lt;ProductDescriptionResult&amp;gt;&lt;/code&gt; at the domain layer. &lt;code&gt;AIResponse&amp;lt;ChatResult&amp;gt;&lt;/code&gt; when the chatbot arrives in Week 3. Same envelope, different payloads.&lt;/p&gt;

&lt;p&gt;The static factory methods aren't just convenience; they make invalid states unrepresentable. You cannot call &lt;code&gt;Ok()&lt;/code&gt; and get &lt;code&gt;Success = false&lt;/code&gt;. You cannot call &lt;code&gt;Fail()&lt;/code&gt; and accidentally leave &lt;code&gt;ErrorMessage&lt;/code&gt; null.&lt;/p&gt;
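
&lt;p&gt;One way the elided factory bodies could look is sketched below. This is a guess at the shape, not the article's actual implementation: the private constructor and &lt;code&gt;private init&lt;/code&gt; accessors are my assumptions, added so that &lt;code&gt;Ok&lt;/code&gt; and &lt;code&gt;Fail&lt;/code&gt; are the only ways to build an instance.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;public sealed class AIResponse&lt;T&gt;
{
    public bool Success { get; private init; }
    public T? Data { get; private init; }
    public string? ErrorMessage { get; private init; }
    public bool FromCache { get; private init; }
    public int TokensUsed { get; private init; }

    // Assumption: a private constructor forces callers through Ok or Fail.
    private AIResponse() { }

    public static AIResponse&lt;T&gt; Ok(T data, int tokens = 0, bool fromCache = false) =&gt;
        new() { Success = true, Data = data, TokensUsed = tokens, FromCache = fromCache };

    public static AIResponse&lt;T&gt; Fail(string error) =&gt;
        new() { Success = false, ErrorMessage = error };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A caller then branches on &lt;code&gt;result.Success&lt;/code&gt; and reads either &lt;code&gt;result.Data&lt;/code&gt; or &lt;code&gt;result.ErrorMessage&lt;/code&gt;, with no exception handling of its own.&lt;/p&gt;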

&lt;h3&gt;
  
  
  The Inheritance Trap I Almost Fell Into
&lt;/h3&gt;

&lt;p&gt;My first instinct was to write &lt;code&gt;ProductDescriptionResult : AIResponse&lt;/code&gt;. It compiles. It works. But the moment you try to store a &lt;code&gt;ProductDescriptionResult&lt;/code&gt; in a database or pass it across a service boundary, it drags &lt;code&gt;Success&lt;/code&gt;, &lt;code&gt;ErrorMessage&lt;/code&gt;, and &lt;code&gt;TokensUsed&lt;/code&gt; with it: infrastructure concerns that mean nothing in the domain layer.&lt;/p&gt;

&lt;p&gt;Composition is correct here. The result class is a pure data payload. The wrapper is the envelope. Neither inherits from the other.&lt;/p&gt;

&lt;h3&gt;
  
  
  System Prompts as Architectural Contracts
&lt;/h3&gt;

&lt;p&gt;This is where prompt engineering actually lives in the codebase, not scattered across controllers, not inline in HTTP calls, but centralised in a single switch expression:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="nf"&gt;BuildSystemPrompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DescriptionTone&lt;/span&gt; &lt;span class="n"&gt;tone&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;tone&lt;/span&gt; &lt;span class="k"&gt;switch&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;DescriptionTone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Professional&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;
        &lt;span class="s"&gt;"You are a professional copywriter for a premium book retailer. "&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt;
        &lt;span class="s"&gt;"Write concise, authoritative product descriptions. "&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt;
        &lt;span class="s"&gt;"Never fabricate awards, authors, or facts not provided."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;DescriptionTone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Casual&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;
        &lt;span class="s"&gt;"You are a friendly book recommender. Keep it warm and enthusiastic."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things worth calling out:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;BuildSystemPrompt()&lt;/code&gt; and &lt;code&gt;BuildUserPrompt()&lt;/code&gt; are &lt;em&gt;separate private methods by design&lt;/em&gt;. Developer rules and user input must never accidentally merge. That separation is what makes the service layer both testable and secure.&lt;/li&gt;
&lt;li&gt;Hardcoded system prompts are a deliberate decision, not laziness. They represent fixed feature contracts: they don't change at runtime, they don't bleed across features, and they're easy to audit.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The temperature misconception is worth correcting explicitly: &lt;strong&gt;temperature is not a safety control.&lt;/strong&gt; It's a creativity dial. The system prompt is where your behavioural constraints live. Three controls, three separate jobs: system prompt = instructions, temperature = style, &lt;code&gt;max_tokens&lt;/code&gt; = budget.&lt;/p&gt;

&lt;h3&gt;
  
  
  ChatHistory: Single-Turn Today, Multi-Turn Ready Tomorrow
&lt;/h3&gt;

&lt;p&gt;Most beginners send one big string to an LLM API. Using Semantic Kernel's &lt;code&gt;ChatHistory&lt;/code&gt; correctly is a signal that you understand how chat models actually work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;chatHistory&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;ChatHistory&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;chatHistory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddSystemMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;systemPrompt&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;chatHistory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddUserMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;userPrompt&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The description generator creates a new &lt;code&gt;ChatHistory&lt;/code&gt; per call by design. When the chatbot arrives in Week 3, the same pattern extends with no architecture change, just persistence added.&lt;/p&gt;

&lt;h3&gt;
  
  
  CancellationToken: The Hidden Cost Leak
&lt;/h3&gt;

&lt;p&gt;Every async AI method in the chain accepts and propagates a &lt;code&gt;CancellationToken&lt;/code&gt;. This isn't just politeness; it's cost control. If a user closes their browser tab mid-request and the token isn't propagated all the way to the Azure SDK call, your app completes the API call and gets billed for a response nobody receives.&lt;/p&gt;

&lt;p&gt;A token accepted but not passed to the next &lt;code&gt;await&lt;/code&gt; is worse than not accepting it at all; it gives false confidence that cancellation is handled. The chain must be unbroken: Controller → Service → Provider → Azure SDK call.&lt;/p&gt;
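
&lt;p&gt;The unbroken chain can be sketched with two stand-in classes. Everything here is hypothetical: &lt;code&gt;SketchProvider&lt;/code&gt; and &lt;code&gt;SketchDescriptionService&lt;/code&gt; are illustrative names, and &lt;code&gt;Task.Delay&lt;/code&gt; plays the role of the Azure SDK call so the example runs without a network.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the provider layer.
public sealed class SketchProvider
{
    public async Task&lt;string&gt; CompleteAsync(string prompt, CancellationToken ct)
    {
        // Stand-in for the SDK call; throws OperationCanceledException if ct is cancelled.
        await Task.Delay(TimeSpan.FromSeconds(5), ct);
        return "generated text";
    }
}

// Hypothetical stand-in for the domain service.
public sealed class SketchDescriptionService
{
    private readonly SketchProvider _provider = new();

    // The token received here is handed to the next call, never dropped.
    public Task&lt;string&gt; DescribeAsync(string title, CancellationToken ct) =&gt;
        _provider.CompleteAsync($"Describe the book '{title}'.", ct);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In an MVC action, a &lt;code&gt;CancellationToken&lt;/code&gt; parameter is bound to the request's abort token, so passing it into &lt;code&gt;DescribeAsync&lt;/code&gt; is what lets a closed browser tab stop the work at the bottom of the chain.&lt;/p&gt;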

&lt;h3&gt;
  
  
  Response Caching: Proving It Works in the UI
&lt;/h3&gt;

&lt;p&gt;Caching is wired at the provider layer with a hash-based key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;cacheKey&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;$"ai:text:&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;systemPrompt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetHashCode&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;userPrompt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetHashCode&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;AIResponse&amp;lt;T&amp;gt;&lt;/code&gt; wrapper carries a &lt;code&gt;FromCache&lt;/code&gt; boolean all the way to the admin UI, where it renders as &lt;code&gt;⚡ Cached&lt;/code&gt; vs &lt;code&gt;✨ Generated&lt;/code&gt;. That's not decoration; it's verification. You can prove your caching is working in a live demo without opening Application Insights.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;EnableCaching&lt;/code&gt; is a feature flag in &lt;code&gt;AISettings&lt;/code&gt;, not a hardcoded &lt;code&gt;true&lt;/code&gt;. You can flip it in the Azure App Service configuration without redeploying. That's a production pattern, not a tutorial habit.&lt;/p&gt;
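
&lt;p&gt;A sketch of how the flag and the cache could meet at the provider layer, assuming &lt;code&gt;IMemoryCache&lt;/code&gt; and standard options binding. &lt;code&gt;CachingTextProvider&lt;/code&gt;, &lt;code&gt;CacheMinutes&lt;/code&gt;, and the &lt;code&gt;generate&lt;/code&gt; delegate are hypothetical names introduced for the example.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Memory;
using Microsoft.Extensions.Options;

public sealed class AISettings
{
    public bool EnableCaching { get; set; } = true;
    public int CacheMinutes { get; set; } = 60; // hypothetical setting
}

public sealed class CachingTextProvider
{
    private readonly IMemoryCache _cache;
    private readonly AISettings _settings;

    public CachingTextProvider(IMemoryCache cache, IOptions&amp;lt;AISettings&amp;gt; options)
    {
        _cache = cache;
        _settings = options.Value;
    }

    // Returns the text plus a flag saying whether it was served from cache.
    public async Task&amp;lt;(string Text, bool FromCache)&amp;gt; GetAsync(
        string cacheKey,
        Func&amp;lt;CancellationToken, Task&amp;lt;string&amp;gt;&amp;gt; generate,
        CancellationToken ct)
    {
        if (_settings.EnableCaching
            &amp;amp;&amp;amp; _cache.TryGetValue(cacheKey, out string? cached)
            &amp;amp;&amp;amp; cached is not null)
        {
            return (cached, true);
        }

        // Cache miss (or caching disabled): call the model, then store if allowed.
        var text = await generate(ct);
        if (_settings.EnableCaching)
        {
            _cache.Set(cacheKey, text, TimeSpan.FromMinutes(_settings.CacheMinutes));
        }
        return (text, false);
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;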

&lt;h3&gt;
  
  
  The AI Cost Dashboard: Why Almost Nobody Builds This
&lt;/h3&gt;

&lt;p&gt;Most AI portfolio projects show features. Mine also shows cost. The admin dashboard tracks tokens per feature per day, cost per request, and cache hit rate. This makes the economics of AI visible and is the difference between someone who &lt;em&gt;built&lt;/em&gt; a feature and someone who &lt;em&gt;shipped&lt;/em&gt; one.&lt;/p&gt;

&lt;p&gt;Concrete number: approximately $15 total API spend over 8 weeks of development, with &lt;code&gt;gpt-4o-mini&lt;/code&gt; during development (roughly 15x cheaper than &lt;code&gt;gpt-4o&lt;/code&gt;) and caching in place.&lt;/p&gt;
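
&lt;p&gt;The three dashboard metrics reduce to a few aggregations over a usage log. The sketch below assumes one logged row per AI call; the &lt;code&gt;UsageEntry&lt;/code&gt; record and its fields are hypothetical, and the prices are deliberately parameters, because real rates have to come from your own pricing page.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;
using System.Collections.Generic;
using System.Linq;

// One row per AI call; field names are hypothetical.
public sealed record UsageEntry(
    string Feature, DateOnly Day, int InputTokens, int OutputTokens, bool FromCache);

public static class CostReport
{
    // Prices are per one million tokens and are supplied by the caller.
    public static decimal CostOf(UsageEntry e, decimal inputPricePerM, decimal outputPricePerM) =&amp;gt;
        e.FromCache
            ? 0m // a cache hit never reaches the API
            : (e.InputTokens * inputPricePerM + e.OutputTokens * outputPricePerM) / 1_000_000m;

    // Fraction of requests served from cache, in the range 0..1.
    public static double CacheHitRate(IReadOnlyCollection&amp;lt;UsageEntry&amp;gt; entries) =&amp;gt;
        entries.Count == 0
            ? 0
            : (double)entries.Count(e =&amp;gt; e.FromCache) / entries.Count;

    // Total tokens grouped by feature and day.
    public static Dictionary&amp;lt;(string Feature, DateOnly Day), int&amp;gt; TokensPerFeaturePerDay(
        IEnumerable&amp;lt;UsageEntry&amp;gt; entries) =&amp;gt;
        entries.GroupBy(e =&amp;gt; (e.Feature, e.Day))
               .ToDictionary(g =&amp;gt; g.Key, g =&amp;gt; g.Sum(e =&amp;gt; e.InputTokens + e.OutputTokens));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;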




&lt;h2&gt;
  
  
  Supporting Architecture Decisions
&lt;/h2&gt;

&lt;p&gt;Three decisions shaped how the AI layer was placed and wired. They're included here because they caused real problems, not because they're interesting trivia.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI infrastructure belongs at the application root, not inside feature folders.&lt;/strong&gt; The app used area-based routing. The instinct was to put &lt;code&gt;Services/AI/&lt;/code&gt; inside an Area. Wrong call: UI grouping doesn't determine where cross-cutting infrastructure lives. AI features span the whole application. Scoping them to an Area implies boundaries that don't exist and creates coupling that becomes painful when the chatbot and search features arrive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;All AI wiring lives in one extension method.&lt;/strong&gt; One line in &lt;code&gt;Program.cs&lt;/code&gt;: &lt;code&gt;builder.Services.AddAIServices(builder.Configuration)&lt;/code&gt;. Everything else is encapsulated. The subtle thing to know: if you register the same concrete class twice under different interfaces without careful use of &lt;code&gt;GetRequiredService&lt;/code&gt;, you can end up with two separate instances per request instead of one. Most tutorials never flag this.&lt;/p&gt;
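
&lt;p&gt;A sketch of the forwarding pattern that avoids the double-instance trap. The interfaces and the &lt;code&gt;AzureAIProvider&lt;/code&gt; class are hypothetical placeholders, and the &lt;code&gt;"AISettings"&lt;/code&gt; section name is an assumption. Only the &lt;code&gt;AddAIServices&lt;/code&gt; signature mirrors the call shown above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;

// Hypothetical types standing in for the real AI services.
public interface ITextProvider { }
public interface IUsageReporter { }
public sealed class AzureAIProvider : ITextProvider, IUsageReporter { }
public sealed class AISettings { public bool EnableCaching { get; set; } }

public static class AIServiceExtensions
{
    public static IServiceCollection AddAIServices(
        this IServiceCollection services, IConfiguration configuration)
    {
        // Section name is an assumption for this sketch.
        services.Configure&amp;lt;AISettings&amp;gt;(configuration.GetSection("AISettings"));

        // Register the concrete type once...
        services.AddScoped&amp;lt;AzureAIProvider&amp;gt;();

        // ...then forward each interface to that same scoped instance.
        services.AddScoped&amp;lt;ITextProvider&amp;gt;(sp =&amp;gt; sp.GetRequiredService&amp;lt;AzureAIProvider&amp;gt;());
        services.AddScoped&amp;lt;IUsageReporter&amp;gt;(sp =&amp;gt; sp.GetRequiredService&amp;lt;AzureAIProvider&amp;gt;());

        return services;
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;By contrast, calling &lt;code&gt;AddScoped&amp;lt;ITextProvider, AzureAIProvider&amp;gt;()&lt;/code&gt; and &lt;code&gt;AddScoped&amp;lt;IUsageReporter, AzureAIProvider&amp;gt;()&lt;/code&gt; side by side makes the container build a separate object for each interface within the same request.&lt;/p&gt;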

&lt;p&gt;&lt;strong&gt;Secrets never touch the config file.&lt;/strong&gt; &lt;code&gt;ApiKey&lt;/code&gt; exists in the settings class but is absent from &lt;code&gt;appsettings.json&lt;/code&gt;. User Secrets fill it locally; App Service Application Settings fill it in production. The application code is identical in both environments, because everything is injected through the same configuration pipeline.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenges &amp;amp; Learnings
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Debugging Arc: 404 → 400 → 200
&lt;/h3&gt;

&lt;p&gt;Getting the AI endpoint working produced three sequential errors, each from a different layer:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;404&lt;/strong&gt;: routing was treating &lt;code&gt;AI&lt;/code&gt; as an area name. Fix: explicit &lt;code&gt;[Route("AI")]&lt;/code&gt; on the controller.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;400&lt;/strong&gt;: the JavaScript payload sends &lt;code&gt;"Professional"&lt;/code&gt; as a string; the C# model declares a &lt;code&gt;DescriptionTone&lt;/code&gt; enum. ASP.NET Core's default JSON deserializer doesn't convert strings to enums, so the request fails with a 400 and no obvious hint as to why. Fix: add &lt;code&gt;JsonStringEnumConverter&lt;/code&gt;. This is an AI-specific integration point: the LLM-facing tone selector has to round-trip correctly through the API layer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;200&lt;/strong&gt;: success.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each error was a different subsystem. Real integrations rarely have a single root cause.&lt;/p&gt;
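
&lt;p&gt;The 400 from the second step can be reproduced outside the web stack with plain &lt;code&gt;System.Text.Json&lt;/code&gt;. In this sketch the &lt;code&gt;GenerateRequest&lt;/code&gt; record and the &lt;code&gt;Casual&lt;/code&gt; enum member are hypothetical; only &lt;code&gt;DescriptionTone&lt;/code&gt; and &lt;code&gt;"Professional"&lt;/code&gt; come from the text above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System.Text.Json;
using System.Text.Json.Serialization;

public enum DescriptionTone { Professional, Casual } // Casual is a made-up member

// Hypothetical request model for the AI endpoint.
public sealed record GenerateRequest(string Title, DescriptionTone Tone);

public static class EnumBindingDemo
{
    public static GenerateRequest? Parse(string json)
    {
        // Web defaults give camelCase, case-insensitive property matching.
        var options = new JsonSerializerOptions(JsonSerializerDefaults.Web);

        // Without this converter, the string "Professional" cannot bind to the
        // enum and Deserialize throws a JsonException.
        options.Converters.Add(new JsonStringEnumConverter());

        return JsonSerializer.Deserialize&amp;lt;GenerateRequest&amp;gt;(json, options);
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In an MVC app, one place to register the converter globally is &lt;code&gt;AddControllersWithViews().AddJsonOptions(o =&amp;gt; o.JsonSerializerOptions.Converters.Add(new JsonStringEnumConverter()))&lt;/code&gt;. Whether the project does it there or with an attribute on the enum is not stated here.&lt;/p&gt;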

&lt;h3&gt;
  
  
  The Cascade Failure at Deployment
&lt;/h3&gt;

&lt;p&gt;At this point, the feature was working locally. Deployment looked straightforward.&lt;/p&gt;

&lt;p&gt;It wasn't.&lt;/p&gt;

&lt;p&gt;After deploying, &lt;code&gt;MaxTokens&lt;/code&gt; was reading as &lt;code&gt;0&lt;/code&gt;, which looked completely unrelated to the actual cause: a missing Azure App Settings entry was causing &lt;code&gt;AIServiceExtensions.cs&lt;/code&gt; to crash at startup, leaving the settings object in a zeroed-out state. A &lt;code&gt;NullReferenceException&lt;/code&gt; deep in Semantic Kernel setup was the symptom. A missing config key was the cause.&lt;/p&gt;

&lt;p&gt;The lesson applies specifically to AI service registration: &lt;strong&gt;validate that your AI configuration is present and well-formed at startup, loudly, before any service gets built.&lt;/strong&gt; A &lt;code&gt;?? throw&lt;/code&gt; on the config read gives you a readable failure message at the right location, not a cryptic error three layers downstream.&lt;/p&gt;
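
&lt;p&gt;A minimal version of that guard might look like the following. The &lt;code&gt;AIConfigGuard&lt;/code&gt; helper and the key name in the usage note are hypothetical; the technique is simply &lt;code&gt;?? throw&lt;/code&gt; on the configuration read, plus an extra check for empty values.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;
using Microsoft.Extensions.Configuration;

public static class AIConfigGuard
{
    // Reads a required key, or fails with a message that names the missing key.
    public static string Require(IConfiguration config, string key)
    {
        var value = config[key]
            ?? throw new InvalidOperationException($"AI configuration is missing: '{key}'.");

        // A key that exists but is blank is just as unusable as a missing one.
        if (string.IsNullOrWhiteSpace(value))
        {
            throw new InvalidOperationException($"AI configuration is empty: '{key}'.");
        }

        return value;
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Called as &lt;code&gt;AIConfigGuard.Require(configuration, "AISettings:Endpoint")&lt;/code&gt; at the top of the registration method, it fails before any AI service is constructed.&lt;/p&gt;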

&lt;h3&gt;
  
  
  The Two Hours I Lost to a URL
&lt;/h3&gt;

&lt;p&gt;This one stings to write. I spent the better part of an afternoon convinced my Semantic Kernel wiring was broken. Checked the DI registration twice. Re-read the SK docs. Added logging everywhere. Silent failure, no exception, no response.&lt;/p&gt;

&lt;p&gt;The actual problem: I had copied the full REST endpoint URL from the Azure portal, something like &lt;code&gt;https://your-resource.openai.azure.com/openai/deployments/gpt-4/chat/completions?api-version=2024-02-01&lt;/code&gt;, directly into my config. Semantic Kernel only wants the base domain. It constructs the rest itself. One wrong URL format, zero helpful error messages, two hours gone.&lt;/p&gt;

&lt;p&gt;I'm including this because the Azure portal genuinely shows you the full path and it looks correct. It isn't.&lt;/p&gt;

&lt;p&gt;Use &lt;code&gt;https://your-resource.openai.azure.com/&lt;/code&gt;, the base domain only. Semantic Kernel builds the path from there based on your deployment name and API version.&lt;/p&gt;
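
&lt;p&gt;If you'd rather not rely on remembering this, a small defensive step is to normalise whatever was pasted into config before handing it to the SDK. This is a sketch of that idea, not something the project is known to do; &lt;code&gt;EndpointNormalizer&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;

public static class EndpointNormalizer
{
    // Strips any path and query string, keeping only scheme and host.
    public static string ToBaseEndpoint(string configured)
    {
        var uri = new Uri(configured, UriKind.Absolute);

        // GetLeftPart(Authority) returns e.g. "https://host" with no trailing slash.
        return uri.GetLeftPart(UriPartial.Authority) + "/";
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Given the full portal URL quoted above, &lt;code&gt;ToBaseEndpoint&lt;/code&gt; returns &lt;code&gt;https://your-resource.openai.azure.com/&lt;/code&gt;.&lt;/p&gt;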

&lt;h3&gt;
  
  
  LLM Error Messages Are a Security Leak
&lt;/h3&gt;

&lt;p&gt;A raw Azure OpenAI &lt;code&gt;401&lt;/code&gt; or timeout exception carries endpoint URLs, subscription hints, and SDK internals in &lt;code&gt;ex.Message&lt;/code&gt;. None of that should reach a client. The AI service boundary catches the exception, logs it internally with full context, and returns a sanitised message to the caller. One line of difference, significant surface reduction.&lt;/p&gt;
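
&lt;p&gt;In code, the boundary can be as small as one catch block. The sketch below is an assumption about its shape: &lt;code&gt;SafeAIBoundary&lt;/code&gt;, the &lt;code&gt;call&lt;/code&gt; delegate, and the wording of the user-facing message are all hypothetical.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;

public sealed class SafeAIBoundary
{
    private readonly ILogger&amp;lt;SafeAIBoundary&amp;gt; _logger;

    public SafeAIBoundary(ILogger&amp;lt;SafeAIBoundary&amp;gt; logger) =&amp;gt; _logger = logger;

    // Returns (text, null) on success or (null, safe message) on failure.
    public async Task&amp;lt;(string? Text, string? Error)&amp;gt; TryGenerateAsync(
        Func&amp;lt;CancellationToken, Task&amp;lt;string&amp;gt;&amp;gt; call, CancellationToken ct)
    {
        try
        {
            return (await call(ct), null);
        }
        catch (Exception ex)
        {
            // Full detail stays in the logs; ex.Message never leaves the server.
            _logger.LogError(ex, "AI call failed");
            return (null, "The AI service is temporarily unavailable. Please try again.");
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;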

&lt;h3&gt;
  
  
  The TinyMCE Silent Failure
&lt;/h3&gt;

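&lt;p&gt;The AI card writes to the product description field via &lt;code&gt;textarea.value = text&lt;/code&gt;. This has no visible effect when TinyMCE is active: TinyMCE hides the original &lt;code&gt;textarea&lt;/code&gt; and renders its own editor in its place, so the write lands on an element nobody sees. No error, no feedback. The button appears to work, but nothing changes. The write has to go through the editor's own API instead, for example &lt;code&gt;tinymce.get(id).setContent(text)&lt;/code&gt;. Lesson: the AI integration point in the UI needs to know what the UI is actually made of.&lt;/p&gt;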




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Draw the seam before you write any code.&lt;/strong&gt; The provider/domain split is the foundational decision. Everything else, testability, swappability, and cost control, follows from it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Make failure a return value, not a control flow mechanism.&lt;/strong&gt; An AI system has expected failure modes: timeouts, rate limits, and degraded responses. &lt;code&gt;AIResponse&amp;lt;T&amp;gt;&lt;/code&gt; handles them at the boundary. No caller above it needs a try/catch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;System prompts are architectural contracts, not strings.&lt;/strong&gt; They live in one place, they don't change at runtime, and they're the only place your business rules for AI behaviour exist. Treat them with the same discipline as interface contracts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Temperature is a creativity dial, not a safety control.&lt;/strong&gt; The system prompt is where constraints live. Knowing the difference affects every feature you build on top of the same provider.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Propagate &lt;code&gt;CancellationToken&lt;/code&gt; all the way to the LLM call.&lt;/strong&gt; A broken chain gives false confidence and silently leaks API cost when users abandon requests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Make cache hits observable.&lt;/strong&gt; The &lt;code&gt;FromCache&lt;/code&gt; signal in the response wrapper isn't decoration; it's the only way to verify your caching is working without opening a monitoring dashboard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost is a first-class concern, not an afterthought.&lt;/strong&gt; Token tracking per feature per day is what separates a portfolio project from a production system. Almost no one builds this. Build it anyway.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI startup failures should be loud and specific.&lt;/strong&gt; A missing config key that crashes deep inside Semantic Kernel setup is far worse than a clean, early "AI configuration is missing" exception at the service boundary.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion &amp;amp; What's Next
&lt;/h2&gt;

&lt;p&gt;Week 1 is one AI feature with a production-grade system behind it: a layered architecture with a hard provider/domain seam, a generic response envelope, observable caching, structured logging, graceful degradation, and cost tracking. The feature is modest. The system thinking is not.&lt;/p&gt;

&lt;p&gt;Week 2 is RAG semantic search using &lt;code&gt;text-embedding-3-small&lt;/code&gt; and Azure SQL Vector, with hybrid search (vector + keyword in parallel) to handle exact matches that pure vector retrieval handles poorly. The agentic chatbot in Weeks 3-4 depends on RAG to ground its responses, which is why RAG comes first. Dependency-first sequencing isn't just planning hygiene; it prevents retrofitting at the seam where two features meet.&lt;/p&gt;

&lt;p&gt;The design decisions in Week 1 were made with Week 4 in mind. That's what production AI system design actually is.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built on: ASP.NET Core MVC · Azure OpenAI · Semantic Kernel · Microsoft.Extensions.AI · gpt-4.1-mini · IMemoryCache&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🤝 Connect with Me
&lt;/h2&gt;

&lt;p&gt;If you're building AI into .NET or just following along, let's connect.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;💼 &lt;a href="https://www.linkedin.com/in/sharad9kumar/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🐙 &lt;a href="https://github.com/sharad99kr" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>dotnet</category>
      <category>azure</category>
      <category>csharp</category>
    </item>
  </channel>
</rss>
