Sharad Kumar

Posted on May 19

Building Hybrid Semantic Search in ASP.NET Core — SQL Vector, Azure AI Search, and the Bugs Between Them

#ai #rag #azure #dotnet

Part 2 of building a public AI learning series on top of an existing Bulky MVC bookstore. Code is live at readify

Most semantic search tutorials start with a fresh project, a clean vector store, and a hand-picked dataset designed to make the demo look good. I had none of that.

I had an existing MVC application, an existing SQL Server database, an existing repository pattern I couldn't break, and seed data whose descriptions were all identical "lorem ipsum", which I discovered produces nearly identical embeddings, making cosine similarity essentially
random. You will not find that in the tutorials.

This article is about what actually happened: the architecture decisions, the failures that only surface at the intersection of AI, EF Core, and async, and a benchmark that showed SQL Vector outperforming Azure AI Search at a small scale, the opposite of what the plan document predicted. Here is how the system was designed before a line of code was written.

1. The Architecture: How the System Is Structured

The sequence I planned was deliberate: keyword search runs first (safe, never throws), semantic search runs second (can fail), the results merge, quality gets evaluated. That order shaped every decision that followed.

Every colour in the diagram maps to a failure zone. Green is safe keyword search that is pure SQL, with no AI dependency. Purple is the semantic path; it can fail on network, quota, or bad data. Orange is opt-in cost query expansion that only runs when the user asks for it. Red is fully decoupled RAG evaluation that is fire-and-forget and never touches the user response.

The key structural decision is sequencing: the safe path runs before the risky one. My first draft ran keyword search inside the catch block, which meant a database failure would also take down the fallback. The fix was reordering, not rewriting. A fallback is only safe if it was verified healthy before the failure happened.

2. Embedding Layer: Storing Vectors in SQL Server

An embedding is a fixed-size array of floats that represents the meaning of a piece of text. text-embedding-3-small converts a product description into a float[1536] vector — 1,536 numbers where similar meanings land close together in that space. Searching by meaning means converting the user's query into the same vector space and finding which products are closest. That's cosine similarity. None of this works without first getting those vectors into the database, which is where the storage problem starts.

The Two-Property Pattern (Because EF Core Can't Persist float[])

EF Core can't persist float[]. SQL Server can't store it either. The solution is a translation layer that lives entirely on the model and is invisible to every other layer:

// EF Core persists this — the actual database column
public byte[]? SearchEmbeddingData { get; set; }

// Application code uses this — [NotMapped] means EF ignores it entirely
[NotMapped]
public float[]? SearchEmbedding
{
    get => SearchEmbeddingData is null ? null
        : MemoryMarshal.Cast<byte, float>(SearchEmbeddingData).ToArray();
    set
    {
        if (value is null) { SearchEmbeddingData = null; return; }
        SearchEmbeddingData = MemoryMarshal.AsBytes(value.AsSpan()).ToArray();
    }
}

MemoryMarshal reinterprets the same memory block under a different type, no copy, no allocation. The getter converts bytes to floats on read. The setter converts floats to bytes on write. Every other layer works with float[] naturally and never knows bytes exist.

Seed Data Quality Is a Precondition, Not Configuration

The seed format is:

"{Title} by {Author}. Category: {Category}. {Description}"

"Gone Girl" alone tells the model less than "Gone Girl by Gillian Flynn. Category: Thriller. A psychological thriller about a missing woman and her husband's dark secrets." The surrounding context pulls the vector toward the right region of embedding space.

I discovered this the hard way: my initial seed data had identical lorem ipsum descriptions across all six products. The embeddings were nearly identical vectors cosine similarity had no meaningful signal to rank against. Search results looked random because they essentially were. Seed data quality is a precondition for RAG correctness. This is the kind of failure mode that only surfaces when you actually run the math and see the results make no sense.

The Embeddings That Were Generated But Never Saved

Embeddings generated successfully. Logs confirmed it. Then:

SELECT COUNT(*) FROM Products WHERE SearchEmbeddingData IS NOT NULL
-- Result: 0

Root cause: the existing repository's Update() method maps properties manually, one by one, a standard pattern for the rest of the codebase. SearchEmbeddingData was never added to that list. EF Core tracked the entity, SaveChanges() was called, and the byte array column was silently skipped. No exception. No warning. Nothing.

This is the specific tax of adding AI to a system with manual property-copy update methods. Every new column added to the model must be added to the update method by hand, with no compiler enforcement and no runtime signal when you forget.

[NotMapped] Is Invisible to EF — Including in Where Clauses

// Produces no SQL WHERE clause — EF silently ignores it
.Where(p => p.SearchEmbedding != null)

// Correct — filters on the actual mapped column
.Where(p => p.SearchEmbeddingData != null)

SearchEmbedding is [NotMapped], a computed property that exists only in memory. EF Core cannot translate it to SQL. Rather than throwing, it silently drops the filter entirely and performs a full table scan. The code compiles, runs, and loads every product, not only the ones with embeddings.

This only surfaces when you have a translation layer between your database type and your application type. Without the embedding system, you'd never have a [NotMapped] computed property in a LINQ filter. Add one, and your filter assumptions break without warning.

3. Search Layer: Hybrid Retrieval and Confidence Scoring

With vectors in the database, the search layer has three jobs: retrieve candidates, merge the semantic and keyword paths, and decide whether to trust the result.

Why Not a Vector Database

Reaching for Qdrant or Pinecone when someone says "embeddings" is a reasonable instinct. I didn't.

Each embedding from text-embedding-3-small is a float[1536] array: 1,536 × 4 bytes = 6 KB per product. 500 products are roughly 3 MB in memory. At that scale, loading vectors and computing cosine similarity in-process in C# is fast enough. One database, one EF Core provider, one migration. No second Azure resource to provision, bill, or debug.

The upgrade path to Azure AI Search exists, and I built it too. But building the simple path first forced me to implement cosine similarity from scratch and understand what the managed service abstracts. You can't explain why HNSW is faster if you've never felt the O(n) scan problem. The SQL Vector path is live in production. Azure AI Search is the comparison baseline. Both are benchmarked in Section 8.

Composite Confidence: Why One Threshold Wasn't Enough

My first implementation used LowConfidenceThreshold = 0.75f. A top score of 0.57 with a second score of 0.37, a gap of 0.20, was being flagged as low confidence. That result won clearly. The threshold was penalising it for not reaching an arbitrary ceiling.

A single threshold can't distinguish "this result won clearly" from "the top two results are nearly tied." A score of 0.57 with a gap of 0.20 is high confidence. The same score with a gap of 0.01 is a coin flip. The gap tells you something the absolute score cannot.

private const float LowConfidenceThreshold = 0.4f;

float topScore    = scored.ElementAtOrDefault(0)?.Score ?? 0;
float secondScore = scored.ElementAtOrDefault(1)?.Score ?? 0;
var   scoreGap    = topScore - secondScore;

bool lowConfidence =
    topScore < LowConfidenceThreshold ||           // absolute floor
    (topScore < 0.50f && scoreGap < 0.10f) ||     // mediocre + indistinct
    (topScore < 0.60f && scoreGap < 0.05f);       // decent score + coin-flip ranking

LowConfidence is driven purely by the semantic path's score, not by whether keyword results filled the remaining slots. The flag is about AI retrieval confidence, not result completeness. Mixing those signals would make it meaningless.

Keyword Search: Word-Splitting vs Single Phrase Match

The original plan described a single .Contains(query) checking if the full phrase appeared in a title or description. The actual implementation splits first:

private IReadOnlyList<Product> KeywordSearch(string query, int topK = 5)
{
    var queryWords = query
        .Split(' ', StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries)
        .ToList();

    return _unitOfWork.Product
        .GetAll(includeProperties: "Category,ProductImages")
        .Where(p => queryWords.Any(q => p.Title.Contains(q) || p.Description.Contains(q)))
        .Take(topK)
        .ToList();
}

Searching "cozy mystery" with a full-phrase match fails if no title contains that exact substring. Splitting into words and matching any of them handles multi-word queries correctly. This is also synchronous by design, pure in-memory LINQ with no I/O. There's nothing to await.

SearchResult<T>: Why I Didn't Reuse the Week 1 Envelope

I already had AIResponse<T> from Week 1. The temptation was to add TopScore and LowConfidence to it and avoid a new type. I didn't.

AIResponse<T> carries FromCache, an infrastructure concern. SearchResult<T> carries TopScore and LowConfidence domain concerns specific to retrieval. Merging them would mean the description generator's response type carries unused search fields. A field on a type implies it's relevant to that type's callers. That's a misleading model design.

public class SearchResult<T>
{
    public IReadOnlyList<T> Items { get; init; } = [];
    public float TopScore { get; init; }
    public bool LowConfidence { get; init; }
}

4. Query Expansion: Bridging the Vocabulary Gap

Even searching for "psychological thriller remote mountain town" words that appear verbatim in a product description scored only 0.57. The stored vector was generated from a 278-character structured string. The surrounding context shifts the vector, and a short query doesn't land in the same neighbourhood even when the words are identical.

Query expansion bridges this: a GPT call reformulates a short query into richer language before embedding it:

public async Task<string> ExpandQueryAsync(string query, CancellationToken ct)
{
    var prompt = $"""
        Expand this book search query into a short descriptive phrase
        that includes genre, mood, and themes. 2-3 sentences max.
        Query: {query}
        """;

    var history = new ChatHistory();
    history.AddUserMessage(prompt);
    var response = await _chatCompletionService.GetChatMessageContentAsync(
        history, cancellationToken: ct);
    return response.Content;
}

Scores improved by ~0.05 on SQL Vector and ~0.03 on Azure AI Search when a matching product existed. When no relevant product existed, scores improved marginally but low confidence flags remained correct. Expansion cannot manufacture relevance that isn't in the catalogue. It's a vocabulary bridge, not a relevance fix.

Expansion is opt-in useQueryExpansion = false by default. The default path makes zero extra API calls. When confidence is low, the UI shows a "Search Harder" button that resubmits with expand=true. Users who get good results pay nothing extra. Users who don't can ask for more.

5. Azure AI Search: The Second Retrieval Path

I built both SQL Vector and Azure AI Search as separate, independently routable paths. A CompareSearch admin endpoint runs both against the same query and returns side-by-side results, which is how the benchmark data below was collected.

The One-Character Fix That Changed All Scores

Azure AI Search initially returned scores in the 0.016–0.033 range. The confidence threshold (0.75f) flagged every query as low confidence "dark skies" (a literal book title) and "quantum oxford" (genuinely irrelevant) scored identically at 0.033.

Root cause: passing a query string alongside vector options to SearchAsync triggers hybrid RRF (Reciprocal Rank Fusion) scoring, which blends BM25 text scores and vector scores. RRF always produces compressed tiny numbers regardless of semantic similarity. Changing SearchAsync(query, options, ct) to SearchAsync("*", options, ct) forces pure vector search and returns cosine-like scores (0.54–0.70). One character.

Separate Confidence Threshold Per Path

SQL Vector scores and Azure AI Search scores are not on the same scale, even after fixing the RRF issue. Applying the same LowConfidenceThreshold constant to both produces a meaningful signal on whichever path it was tuned for and a meaningless one on the other.

After plotting 10 queries, the natural gap in Azure scores sat between 0.5939 and 0.6258. A dedicated constant:

private const float AzureLowConfidenceThreshold = 0.61f;

cleanly separates relevant from irrelevant results. The constant name makes the intent explicit, this is not the SQL Vector threshold, and the two should never be merged.

Result Ordering Disappears After the EF Query

Azure AI Search returns product IDs ranked by relevance. A subsequent EF WHERE id IN (...) query doesn't preserve that order. SQL set operations have no ordering guarantee.

// Build a dictionary for O(1) lookup
var productMap = _unitOfWork.Product
    .GetAll(p => ids.Contains(p.Id), includeProperties: "Category,ProductImages")
    .ToDictionary(p => p.Id);

// Re-project using the original Azure-ranked order
var products = ids
    .Where(id => productMap.ContainsKey(id))
    .Select(id => productMap[id])
    .ToList();

If you fetch products by ID without re-ordering, you lose the ranking signal entirely. The most relevant result stays first only because you put it there explicitly.

6. RAG Evaluation Layer: A Second Opinion on Every Result

After the hybrid search returns results, a second LLM call scores how well the retrieved context answers the query (1 to 5), fire-and-forget, logged to App Insights, never blocking the user response.

The judge only sees the raw query and the retrieved product descriptions as plain text. No embeddings, no cosine scores, no knowledge of how retrieval worked. It reasons purely as a reader, which is the point.

The Fire-and-Forget Task That Was Never Actually Background

// Wrong — task gets cancelled the moment the HTTP response is sent
_ = _ragEvaluationService.ScoreFaithfulnessAsync(query, context, ct);

// Correct — task runs to completion regardless of request lifetime
_ = _ragEvaluationService.ScoreFaithfulnessAsync(query, context, CancellationToken.None);

The HTTP request's CancellationToken is cancelled when the response is sent. Passing it to a fire-and-forget task means the background work gets cancelled at exactly the moment it's supposed to be running independently. This only surfaces in systems that deliberately decouple a second AI call from the request lifecycle; you don't hit it with synchronous code or normally awaited async calls.

Two Design Decisions Worth Explaining

Why -1 instead of Math.Clamp: When the judge returns "3/5" or a prose explanation instead of a bare integer, int.TryParse fails. Clamping that garbage to a valid range silently corrupts dashboard averages a faithfulness trend chart fed by clamped invalid responses tells you nothing real. The -1 sentinel gets excluded from aggregates, and the warning log tells you exactly how often the judge is misbehaving.

Why ToList() before the fire-and-forget call: .Select(p => p.Description) uses deferred execution. Without materialising first, enumeration happens on a background thread after the request ends, at which point the EF Core DbContext may already be disposed. ToList() runs everything while the request is still alive. Removing it looks like a valid simplification (the types are compatible), but it introduces a bug that only surfaces at runtime under load.

The cosine similarity score and the faithfulness score measure fundamentally different things. Cosine measures mathematical distance before you look at content. Faithfulness measures whether the actual retrieved text serves the user's intent. You can have a high cosine score and a faithfulness score of 2 embeddings matched on surface features rather than meaning. Both signals together are more honest than either alone.

7. Production Hygiene: Rate Limiting and Input Validation

Each search query hits the Azure OpenAI embedding API there's a real cost attached. Three layers of defence address three different threat shapes, and none replaces the others:

Input validation (3–200 character length check) rejects garbage before any API call is made
Response caching (normalized query key: query.ToLowerInvariant().Trim()) eliminates repeat costs entirely; identical queries cost nothing after the first hit
Rate limiting caps volume from any single client

[EnableRateLimiting("search")]
public async Task<IActionResult> Search(string q, bool expand = false, CancellationToken ct = default)
{
    if (string.IsNullOrWhiteSpace(q) || q.Trim().Length < 3 || q.Trim().Length > 200)
    {
        TempData["Error"] = "Search query must be between 3 and 200 characters.";
        return RedirectToAction("Index", "Home");
    }
    // ...
}

Rate limiting policy names are a runtime concern, not a compile-time contract. If policyName: "search" in Program.cs doesn't exactly match [EnableRateLimiting("search")] on the action, the attribute is silently ignored, no error, no indication, just unenforced limits. UseRateLimiter() must also appear in the middleware pipeline after UseAuthorization(), otherwise the policy exists but is never applied.

8. The Benchmark That Contradicted the Plan

The plan document cited ~40ms for Azure AI Search vs ~120ms for SQL Vector, based on estimates from reference material. I measured before publishing.

Actual results across 10 queries:

Query	SQL #1 Result	Azure #1 Result	Same?
"sugar candy in the fair"	Cotton Candy	Cotton Candy	✓
"sunset on the beach"	Vanish in the Sunset	Vanish in the Sunset	✓
"psychological thriller remote location"	Dark Skies	Dark Skies	✓
"coming of age adventure"	The Road to Redemption	The Road to Redemption	✓
"elon musk"	(both low confidence)	(both low confidence)	✓
"quantum mechanics at oxford"	(both low confidence)	(both low confidence)	✓

10 of 10 queries returned the same top result. This validates the in-process cosine implementation as correct at this catalogue size. The divergence cases were both genuinely ambiguous queries where neither path had a strong signal, which is the expected behaviour when two different algorithms are both estimating without clear data.

Timing: SQL Vector averaged ~584ms, Azure AI Search averaged ~671ms SQL was faster. Both numbers were dominated by the shared GetEmbeddingAsync network call to Azure OpenAI (~400–750ms). The actual search operation on either side was under 100ms.

At small catalogue sizes, SQL Vector avoids a second network round trip to Azure. Azure AI Search's ANN indexing advantage only materialises when the brute-force O(n) cosine scan becomes the bottleneck, which requires hundreds of thousands of products, not six.

Measure before you claim. The plan was wrong.

9. What's Next

The Week 2 system is fully in production: hybrid retrieval across two paths, composite confidence scoring, opt-in query expansion, RAG faithfulness evaluation running after every search, and a CompareSearch endpoint that benchmarks both paths side by side.

Week 3 is Semantic Kernel plugins, MediatR/CQRS, and the first xUnit tests.

Built on: ASP.NET Core MVC · Azure OpenAI · Semantic Kernel · GPT-4o-mini · text-embedding-3-small · Azure SQL Vector · Azure AI Search

🤝 Connect with Me

If you're building AI into .NET or just following along, let's connect.

💼 LinkedIn
🐙 GitHub

DEV Community