Build a Document Search Engine in C#
Most search implementations fall into one of two camps: send everything to Elasticsearch, or call a search API. Both work. Both add infrastructure.
Here's a third option. Index local files, search them by keyword, by meaning, or both, in about 10 lines of C#. No external services.
dotnet add package Kjarni
using Kjarni;
using var indexer = new Indexer(model: "minilm-l6-v2", quiet: true);
indexer.Create("my_index", new[] { "docs/" });
using var searcher = new Searcher(
    model: "minilm-l6-v2",
    rerankerModel: "minilm-l6-v2-cross-encoder");

var results = searcher.Search("my_index", "how do returns work?",
    mode: SearchMode.Hybrid);

foreach (var r in results)
    Console.WriteLine($" {r.Score:F4}: {r.Text}");
The indexer reads your files, splits them into chunks, encodes each chunk as a vector, and builds a BM25 keyword index. The searcher queries both indexes and combines the results.
Setup
Create a few text files to search over:
mkdir -p docs
docs/returns.txt:
Our return policy allows customers to return any unused item within 30 days
of purchase for a full refund. Items must be in their original packaging.
Shipping costs are non-refundable.
docs/shipping.txt:
We ship to all 50 US states and internationally to over 40 countries.
Standard shipping takes 5-7 business days. Express shipping is available
for an additional fee.
docs/account.txt:
To reset your password, click "Forgot Password" on the login page.
You will receive an email with a reset link. The link expires after 24 hours.
Three short documents. In practice these could be product manuals, support articles, internal wikis, or any text files.
Indexing
using var indexer = new Indexer(model: "minilm-l6-v2", quiet: true);
indexer.Create("my_index", new[] { "docs/" });
The indexer does three things:
- Reads all files in the given directories
- Chunks each file into passages (for long documents)
- Encodes each chunk into a 384-dimensional vector using the embedding model
It also builds a BM25 keyword index over the same chunks. The result is a local index on disk that you can query repeatedly without re-indexing.
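Chunking is handled for you, but conceptually it is a sliding window over the text. A minimal sketch, with illustrative sizes rather than Kjarni's actual parameters:

// Fixed-size chunks with overlap, so a sentence that straddles a boundary
// still appears intact in at least one chunk. Sizes are illustrative.
static IEnumerable<string> Chunk(string text, int size = 500, int overlap = 50)
{
    for (int start = 0; start < text.Length; start += size - overlap)
        yield return text.Substring(start, Math.Min(size, text.Length - start));
}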
Three Search Modes
Keyword Search (BM25)
Matches documents that contain the query words. The same algorithm that powers Elasticsearch and Solr.
var results = searcher.Search("my_index", "return policy refund",
mode: SearchMode.Keyword);
7.8795: Our return policy allows customers to return any unused item
within 30 days of purchase for a full refund...
This works because the query words — "return", "policy", "refund" — appear in the document. If you searched for "send items back and get money" instead, keyword search would find nothing.
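You can confirm the failure mode against the same index; the paraphrased query shares no terms with the documents:

// Keyword mode has no overlapping terms to match here, so it comes back empty.
var noHits = searcher.Search("my_index", "send items back and get money",
    mode: SearchMode.Keyword);
Console.WriteLine($"matches: {noHits.Count()}"); // expected: 0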
For the theory behind BM25, see BM25 vs TF-IDF: Keyword Search Explained.
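As a rough sketch of what BM25 computes (the standard k1/b formulation, not Kjarni's internal code), the score contributed by a single query term looks like this:

// IDF rewards rare terms; the tf factor saturates, so repeating a word has
// diminishing returns; b normalizes for document length.
static double Bm25Term(double tf, double docLen, double avgDocLen,
                       int docCount, int docsWithTerm,
                       double k1 = 1.2, double b = 0.75)
{
    double idf = Math.Log(1 + (docCount - docsWithTerm + 0.5) / (docsWithTerm + 0.5));
    double norm = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * docLen / avgDocLen));
    return idf * norm;
}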
Semantic Search
Matches documents by meaning, regardless of the exact words used.
var results = searcher.Search("my_index", "can I send items back and get money?",
mode: SearchMode.Semantic);
This finds the returns document even though none of those exact words appear in it. The embedding model understands that "send items back" means "return" and "get money" means "refund."
For how embeddings and similarity work, see Semantic Search in C#.
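The comparison step itself is ordinary vector math: cosine similarity between the query embedding and each chunk embedding (the same cosine-similarity lookup described under the hood below). A minimal sketch:

// Cosine similarity: 1.0 means same direction, near 0 means unrelated.
static float Cosine(float[] a, float[] b)
{
    float dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (MathF.Sqrt(na) * MathF.Sqrt(nb));
}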
Hybrid Search
Combines keyword and semantic results. This is usually the best default.
var results = searcher.Search("my_index", "how do returns work?",
mode: SearchMode.Hybrid);
1.3282: Our return policy allows customers to return any unused item
within 30 days of purchase for a full refund. Items must be in
their original packaging. Shipping costs are non-refundable.
-10.5874: To reset your password, click "Forgot Password" on the login
page. You will receive an email with a reset link. The link
expires after 24 hours.
-11.0939: We ship to all 50 US states and internationally to over 40
countries. Standard shipping takes 5-7 business days. Express
shipping is available for an additional fee.
Hybrid search catches both exact keyword matches and semantically related content. The scores are from the reranker (more on that below), which is why the gap between relevant and irrelevant results is so large. The returns document scores 1.3, while the other two are deep in the negatives.
Reranking
The results above use a cross-encoder reranker. This is the difference between good search and great search.
The Problem with Embeddings Alone
Embedding models are fast because they encode the query and each document independently. But this means they can't model the interaction between query and document directly. They're comparing summaries, not reading both texts together.
How Reranking Fixes This
A cross-encoder takes the query and a document as a single input and outputs a relevance score. It reads both texts at the same time, so it can attend to specific words in the document that answer the specific question.
Bi-encoder (embedding):
  Query    -> Vector
  Document -> Vector
  Compare the two vectors

Cross-encoder (reranker):
  [Query + Document] -> Relevance Score
The cross-encoder is slower because it processes each query-document pair individually. That's why it's used as a second stage: the embedding model retrieves candidates quickly, then the cross-encoder reranks the top results precisely.
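To make the two stages concrete, here is the same pattern written out by hand, using the standalone Reranker shown in the next section. When rerankerModel is set, Searcher does this for you internally; the candidate count of 20 is an arbitrary illustrative choice:

using var searcher = new Searcher(model: "minilm-l6-v2");
using var reranker = new Reranker();

// Stage 1: fast candidate retrieval via embeddings.
var candidates = searcher.Search("my_index", "how do returns work?",
    mode: SearchMode.Semantic);

// Stage 2: precise scoring of the top candidates with the cross-encoder.
var reranked = reranker.Rerank("how do returns work?",
    candidates.Take(20).Select(r => r.Text).ToArray());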
Using the Reranker Directly
You can also use the reranker on its own:
using var reranker = new Reranker();

var results = reranker.Rerank(
    "What is machine learning?",
    new[] {
        "Machine learning is a subset of artificial intelligence.",
        "Deep learning uses neural networks with many layers.",
        "The weather today is sunny.",
    });

foreach (var r in results)
    Console.WriteLine($" {r.Score:F4}: {r.Document}");
10.5139: Machine learning is a subset of artificial intelligence.
-5.5301: Deep learning uses neural networks with many layers.
-11.1001: The weather today is sunny.
The scores are logits, not probabilities. What matters is the relative ordering and the gap between scores. A positive score means the cross-encoder thinks the document is relevant. A negative score means it's not.
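If you want a probability-like number for display or thresholding, you can squash the logit with a sigmoid. This is plain math, not part of the Kjarni API:

// Sigmoid maps a logit into (0, 1); positive logits land above 0.5.
static double ToProbability(double logit) => 1.0 / (1.0 + Math.Exp(-logit));

// ToProbability(10.5139)  ≈ 0.99997   (relevant)
// ToProbability(-11.1001) ≈ 0.000015  (irrelevant)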
The Full Pipeline
Here's how the pieces fit together:
Query
  |
  +-- BM25 Keyword Index ----> Top N candidates by word match
  |
  +-- Vector Index ----------> Top N candidates by meaning
  |
  v
Merge candidates (union or intersection)
  |
  v
Cross-Encoder Reranker ----> Final ranked results
  |
  v
Return to user
Each stage filters and refines. BM25 is cheap and catches exact matches. The vector index catches semantic matches that keywords miss. The reranker reads both query and document together to produce a precise ranking.
using var indexer = new Indexer(model: "minilm-l6-v2", quiet: true);
indexer.Create("my_index", new[] { "docs/" });
using var searcher = new Searcher(
    model: "minilm-l6-v2",
    rerankerModel: "minilm-l6-v2-cross-encoder");

// Hybrid = BM25 + Semantic + Reranker
var results = searcher.Search("my_index", "how do returns work?",
    mode: SearchMode.Hybrid);
When to Use Each Mode
| Mode | Best for | Misses |
|---|---|---|
| Keyword | Exact terms, error codes, IDs | Synonyms, rephrased queries |
| Semantic | Intent matching, fuzzy queries | Exact phrases, rare terms |
| Hybrid | General purpose (recommended) | Slightly slower |
Start with Hybrid. Switch to Keyword if your users search for exact identifiers. Switch to Semantic if your users describe what they want in natural language.
Practical Patterns
Filtering Results
Apply a score threshold to filter out irrelevant results:
var results = searcher.Search("my_index", query, mode: SearchMode.Hybrid);
var relevant = results.Where(r => r.Score > 0.0);
With reranking, a score above 0 is a reasonable default threshold for "probably relevant."
Search + Classification
Find relevant documents, then classify their sentiment. This chains search and classification in one pipeline:
using var searcher = new Searcher(model: "minilm-l6-v2");
using var classifier = new Classifier("roberta-sentiment");
var results = searcher.Search("reviews_index", "battery life",
mode: SearchMode.Hybrid);
foreach (var r in results.Take(10))
{
var sentiment = classifier.Classify(r.Text);
Console.WriteLine($" {sentiment} \"{r.Text}\"");
}
See Sentiment Analysis in C# Without Python for more on classification.
Re-indexing
When documents change, re-create the index:
indexer.Create("my_index", new[] { "docs/" });
This rebuilds the full index. For large corpora where incremental updates matter, you'd manage the vector storage separately.
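For small corpora, a simple pattern is to watch the directory and rebuild when files change. A sketch using the standard FileSystemWatcher (debouncing and error handling omitted):

using var indexer = new Indexer(model: "minilm-l6-v2", quiet: true);
using var watcher = new FileSystemWatcher("docs/") { EnableRaisingEvents = true };

// Full rebuild on any change, per the re-indexing note above.
watcher.Changed += (_, e) =>
{
    Console.WriteLine($"{e.Name} changed, rebuilding index...");
    indexer.Create("my_index", new[] { "docs/" });
};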
How It Compares
| Approach | Setup | Cost | Offline |
|---|---|---|---|
| Elasticsearch | Cluster + config | Server costs | No |
| Azure AI Search | Portal + API key | Per-query pricing | No |
| Algolia | Dashboard + API key | Per-search pricing | No |
| Kjarni | dotnet add package | Free | Yes |
The tradeoff: Kjarni runs in-process on a single machine. If you need distributed search across billions of documents, use Elasticsearch. If you need search over thousands to millions of documents on a single server, a local engine works well and eliminates a dependency.
How It Works Under the Hood
Kjarni builds two indexes per collection:
- BM25 index — inverted index over tokenized text, with term frequency saturation and document length normalization
- Vector index — encoded embeddings for each chunk, queried by cosine similarity
At search time, both indexes return candidates. The results are merged and optionally reranked by a cross-encoder model that reads the query and each candidate together.
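The exact merge strategy isn't documented here; one common, simple choice for combining two ranked candidate lists is reciprocal rank fusion, sketched below with hypothetical document IDs:

// RRF: each list contributes 1 / (k + rank) per document, so documents that
// rank well in either list float to the top. k = 60 is the usual constant.
static Dictionary<string, double> Rrf(IList<string> bm25Ids,
                                      IList<string> vectorIds,
                                      double k = 60)
{
    var scores = new Dictionary<string, double>();
    void Accumulate(IList<string> ids)
    {
        for (int rank = 0; rank < ids.Count; rank++)
        {
            scores.TryGetValue(ids[rank], out var s);
            scores[ids[rank]] = s + 1.0 / (k + rank + 1);
        }
    }
    Accumulate(bm25Ids);
    Accumulate(vectorIds);
    return scores; // highest-scoring candidates go on to the cross-encoder
}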
The engine is written in Rust. The C# package wraps a single native library. There is no Python runtime, no JVM, and no external service.
NuGet: https://www.nuget.org/packages/Kjarni
GitHub: https://github.com/olafurjohannsson/kjarni
Other Resources
- Semantic Search in C# - Embeddings and similarity from scratch
- Build a Document Search Engine in C# - Full hybrid search with indexing and reranking
- BM25 vs TF-IDF: Keyword Search Explained - How keyword search works under the hood
- What are Vector Embeddings? - How machines understand meaning through numbers