Breno Prestes

Posted on Jun 30

RAG: Why Vectors, Cosine Distance, and a Vector Database? (.NET + Postgres)

#ai #webdev #postgres #development

Note: this article shows the simplest way to build a RAG system with .NET. The goal is to explain the concepts clearly, not to provide a production-ready implementation. It leaves out error handling, security checks, and other best practices you would want in real code.

I had used and implemented RAG for a while without truly understanding the theory behind it. Why do we need a vector database? Why does semantic search rely on cosine distance rather than some other method?

To answer those questions, I wrote this article, which details an implementation built with .NET, Postgres (with the vector extension), Gemini, and Ollama running the qwen3-embedding:0.6b model.

What is RAG?

RAG stands for Retrieval-Augmented Generation. The idea is to provide an LLM with some context and have it answer based on that context rather than on the general knowledge it acquired during training.

A concrete example is Google's NotebookLM. You upload your own material to the application and ask questions about it, and the answers are drawn from that specific content. In other words, you supply a closed context and expect responses limited to that context. Customer support chatbots that answer based on a company's documentation follow the same principle.

This raises a question: how do we know that a given resource, text, or image is relevant to the answer the user is looking for? How do we gather all of that material and send it as context, and how do we do it as efficiently as possible?

Why vectors, not text

The answer comes down to mathematics. The best way to represent data for semantic search is not as text, but as a sequence of numbers, that is, a vector. Working directly with characters is far more complex and computationally expensive, while comparing numbers is fast.

To turn each of our resources into a sequence of numbers, we use a technique called embeddings. An embedding is a machine learning method that converts data into vectors within a multidimensional space, and these vectors carry meaning: semantically similar data ends up close together in that space. In our case, we will generate these embeddings locally using Ollama with the qwen3-embedding:0.6b model.

Measuring similarity: cosine

Once each piece of data has been converted into a multidimensional vector by the embedding model, we need a way to measure how similar two vectors are. The most common technique is cosine similarity.

The idea is to calculate the angle between two vectors. The smaller the angle between them, the more similar they are. Put another way, the closer the cosine of the angle is to 1, the more semantically similar the data is.

Written component by component, for vectors of n dimensions, the formula becomes a summation:

\cos(\theta) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \times \sqrt{\sum_{i=1}^{n} B_i^2}}

In the numerator we multiply the components at the same position and add them up, which is the dot product. In the denominator we normalize by the length of each vector. This holds the same way whether the vectors have 2 dimensions or 1,024, which is exactly the case for the vectors produced by the embedding model.

To make this concrete, consider a 2D space and two pieces of data (a text, a word, and so on) A and B, represented by the vectors A = (3, 5) and B = (6, 3) shown in the figure below. Applying the formula gives:

\cos(\theta) = \frac{\vec{A} \cdot \vec{B}}{|\vec{A}| \times |\vec{B}|} = \frac{33}{\sqrt{34} \times \sqrt{45}} = \frac{33}{\sqrt{1530}} \approx \frac{33}{39.115} \approx 0.844

A cosine of roughly 0.84 indicates that A and B point in fairly close directions, so they are quite similar.

Why cosine and not sine?

Sine is tied to the cross product, which works well only in two and three dimensions. Embedding models operate with hundreds or even thousands of dimensions, so the computational cost would be enormous. Beyond that, restricting ourselves to two or three dimensions would introduce a contradiction, since opposite angles could share the same value: sin(0) and sin(180) are both equal to 0. Cosine does not have this problem, because cos(0) and cos(180) are distinct.

Let's build it

With this overview in place, and an understanding of why embeddings are necessary, we can solve the problem as follows.

Because I did not get satisfactory results with a local LLM, the Gemini API is used here, as it offers a generous daily quota and solid models. The embeddings, on the other hand, run locally through Ollama, as mentioned earlier. For storage we use Postgres with the Vector extension. On the .NET side, we use Entity Framework with Npgsql and the vector package for Entity Framework, which lets us use LINQ (Language Integrated Query) along with some features of Microsoft Semantic Kernel.

Packages

First, we need to install the following packages:

PgVector
Npgsql
EntityFramework
PdfPig
OllamaSharp
Semantic Kernel
Google GenAI (for LLM; you can also use Ollama directly, depending on your machine)

Configuring PostgreSQL to accept PgVector

var config = builder.Configuration;

builder.Services.AddOpenApi();
builder.Services.AddControllers();
builder.Services.AddDbContext<DatabaseContext>(options =>
    options.UseNpgsql(config.GetConnectionString("database"), o => o.UseVector()));

var app = builder.Build();

The key detail in this snippet is o.UseVector(). It registers the pgvector type mappings with Npgsql so that Entity Framework knows how to translate a vector column between PostgreSQL and .NET. Without it, the Vector type would not be recognized when reading from or writing to the database.

using Microsoft.EntityFrameworkCore;

public class DatabaseContext : DbContext
{
    public DbSet<Source> Sources { get; set; }

    public DatabaseContext(DbContextOptions<DatabaseContext> options) : base(options) {}

    protected override void OnModelCreating(ModelBuilder modelBuilder)
    {
        modelBuilder.HasPostgresExtension("vector");

        modelBuilder.Entity<Source>(entity =>
        {
            entity.Property(s => s.Embedding)
                .HasColumnType("vector(1024)");

            entity.HasIndex(s => s.Embedding)
                .HasMethod("hnsw")
                .HasOperators("vector_cosine_ops");
        });
    }
}

This DbContext is where the vector configuration actually takes shape. Three lines deserve attention. HasPostgresExtension("vector") ensures the pgvector extension is enabled in the database when migrations are applied. HasColumnType("vector(1024)") declares that the Embedding column stores a vector of exactly 1024 dimensions, which must match the output size of the embedding model; qwen3-embedding:0.6b produces 1024-dimensional vectors, so the numbers have to agree. Finally, the index uses the HNSW method (Hierarchical Navigable Small World) together with the vector_cosine_ops operator class. HNSW is an approximate nearest neighbor index that makes similarity searches fast even over large tables, and vector_cosine_ops tells it to organize the data for cosine distance specifically, which is the metric we rely on for semantic search.

Reading the documents

First, we need to read the documents. Here I considered only PDF files, using PdfPig:

using UglyToad.PdfPig;
using UglyToad.PdfPig.DocumentLayoutAnalysis.TextExtractor;

public class ReadFile
{
    public static async Task<string> FormFilePdf(IFormFile formFile)
    {
        string pathToSave = Path.Combine(Directory.GetCurrentDirectory(), "uploads");

        if (!Directory.Exists(pathToSave))
            Directory.CreateDirectory(pathToSave);

        using FileStream stream = new(Path.Combine(pathToSave, formFile.FileName), FileMode.Create);
        await formFile.CopyToAsync(stream);

        string text = "";

        using (var pdf = PdfDocument.Open($"{Path.Combine(pathToSave, formFile.FileName)}"))
        {
            foreach (var page in pdf.GetPages())
            {
                text += ContentOrderTextExtractor.GetText(page);
            }
        }

        return text;
    }
}

This method receives an uploaded file, saves it to an uploads directory on disk, and then opens it with PdfPig. It iterates over every page and extracts the text using ContentOrderTextExtractor, which attempts to read the content in natural reading order rather than in the arbitrary order the characters may be stored in the PDF.

Generating the embeddings

For the embeddings, I used:

using Microsoft.SemanticKernel.Text;
using OllamaSharp;
using OllamaSharp.Models;
using Pgvector;

public class EmbeddingService
{
    public static async Task<IEnumerable<CreateSource>> Generate(string text)
    {
        var lines = TextChunker.SplitPlainTextLines(text, maxTokensPerLine: 30);
        var chunks = TextChunker.SplitPlainTextParagraphs(lines, maxTokensPerParagraph: 100, overlapTokens: 20);

        var ollama = new OllamaApiClient("http://localhost:11434");

        var rs = await ollama.EmbedAsync(new EmbedRequest
        {
            Model = "qwen3-embedding:0.6b",
            Input = chunks
        });

        var sources = rs.Embeddings.Zip(chunks, (embedding, text) =>
        {
            return new CreateSource { Text = text, Embedding = new Vector(embedding) };
        });

        return sources;
    }
}

This service does the actual work of turning text into vectors, and it happens in two stages. First, the text is split into manageable pieces, a process known as chunking. SplitPlainTextLines breaks the raw text into lines of at most 30 tokens, and SplitPlainTextParagraphs then groups those lines into paragraphs of at most 100 tokens, with an overlap of 20 tokens between consecutive chunks. The overlap matters: it keeps a small amount of shared context at the boundaries so that a sentence split across two chunks does not lose its meaning. Chunking is necessary because embedding a very long text as a single vector blurs its meaning, and because we want retrieval to return small, focused passages rather than entire documents.

Second, the chunks are sent to Ollama through EmbedAsync, which runs the qwen3-embedding:0.6b model locally and returns one vector per chunk. The Zip call then pairs each chunk of text with its corresponding vector and wraps them in a CreateSource, so that the original text and its embedding stay together and can be stored side by side.

The database model

For the database model, I created the following table with the Source model:

public class Source
{
    public int Id { get; set; }
    public string? Content { get; set; }
    public Pgvector.Vector? Embedding { get; set; }
}

And the DTOs:

using Pgvector;

public class CreateSource
{
    public string Text;
    public Vector Embedding;
}

using System.ComponentModel.DataAnnotations;

public class FileUploadDto
{
    [Required]
    public IFormFile? FormFile { get; set; }
    public string? Description { get; set; }
}

The Source entity is what gets persisted: it holds the original text in Content and its vector in Embedding. CreateSource is an intermediate object used while generating embeddings, before the data is mapped to the entity that is saved. FileUploadDto describes the shape of the upload request, and the [Required] attribute on FormFile ensures the request is rejected if no file is provided.

The file controller

using Microsoft.AspNetCore.Mvc;

[ApiController]
[Route("api/[controller]")]
public class FileController : ControllerBase
{
    private DatabaseContext dbContext;

    public FileController([FromServices] DatabaseContext context)
    {
        this.dbContext = context;
    }

    [HttpPost("upload")]
    public async Task<IActionResult> UploadFile([FromForm] FileUploadDto form)
    {
        if (form.FormFile == null)
        {
            return BadRequest();
        }

        var text = await ReadFile.FormFilePdf(form.FormFile);
        var res = await EmbeddingService.Generate(text);

        var sources = res.Select(x => new Source { Embedding = x.Embedding, Content = x.Text }).ToList();
        await dbContext.Sources.AddRangeAsync(sources);
        await dbContext.SaveChangesAsync();

        return Ok(sources);
    }
}

This controller ties the ingestion pipeline together. When a file is uploaded, it reads the text from the PDF, generates the embeddings for each chunk, maps the results to Source entities, and saves them all to the database in a single batch with AddRangeAsync followed by SaveChangesAsync. After this runs, the database holds the text of each chunk alongside its vector, ready to be searched. This is the offline half of RAG, the indexing step, which prepares the data so that questions can be answered later.

The chat controller

using Google.GenAI;
using Google.GenAI.Types;
using Microsoft.AspNetCore.Mvc;
using Microsoft.EntityFrameworkCore;
using Pgvector.EntityFrameworkCore;

[ApiController]
[Route("api/[controller]")]
public class ChatController : ControllerBase
{
    private DatabaseContext dbCOntext;

    public ChatController([FromServices] DatabaseContext context)
    {
        this.dbCOntext = context;
    }

    [HttpPost]
    public async Task<IActionResult> Chat([FromBody] string inputUser)
    {
        if(inputUser is null)
            return BadRequest(new { reason = "Input vazio"});

        var inputUserEmbedding = await EmbeddingService.generate(inputUser);

        var results = await dbCOntext.Sources
            .Where(x => x.Embedding.CosineDistance(inputUserEmbedding.First().Embedding) < 0.3)
            .OrderBy(x => x.Embedding.CosineDistance(inputUserEmbedding.First().Embedding))
            .Select(x => x.Content)
            .ToListAsync();

        string context = String.Concat(results);

        var client = new Client(apiKey: "API_KEY");

        Schema schema = new()
        {
            Properties = new Dictionary<string, Schema>
            {
                { 
                    "response", new Schema { Type =  Google.GenAI.Types.Type.String }
                }
            },
            Type = Google.GenAI.Types.Type.Object
        };

        GenerateContentConfig config = new()
        {
            ResponseMimeType = "application/json",
            SystemInstruction = new Content 
            {
                Parts = [
                    new Part {Text = $"Only answer questions based on {context}. For everything else, say I don't know."}
                ]
            },
            ResponseJsonSchema = schema,
            MaxOutputTokens = 1024,
            Temperature = 0.1,
            TopP = 0.8,
            TopK = 40,
        };

        var response = await client.Models.GenerateContentAsync(
            model: "gemini-3.1-flash-lite",
            contents: inputUser,
            config: config
        );

        return Ok(response.Text);
    }
}

This controller is where retrieval and generation come together, and it is the heart of the whole system. It runs in three steps.

First, the user's question is converted into a vector with the same embedding service used during ingestion. This is essential: the question and the stored chunks must live in the same vector space for the comparison to mean anything.

Second, the database is queried for the chunks most similar to the question. The CosineDistance call is translated by Entity Framework into the pgvector cosine operator, so the comparison runs inside PostgreSQL using the HNSW index rather than in application memory. The OrderBy then sorts the results from closest to farthest, and Select keeps only the text content. The matching chunks are concatenated into a single context string. One point to verify in this query: cosine distance is the complement of similarity, so smaller values mean more similar.

Third, the question and the retrieved context are sent to the LLM. The system instruction constrains the model to answer only from the provided context and to say that it does not know otherwise, which is what keeps the answer grounded in your data rather than in the model's general training. A low Temperature of 0.1 makes the output more deterministic and less prone to invention, which is desirable when the goal is faithful retrieval. The result returned by Gemini is then sent back to the caller.

This second controller is the online half of RAG, the query step. Together with the ingestion controller, it completes the loop: documents are embedded and stored ahead of time, and at query time the most relevant pieces are retrieved and handed to the model as context.

A closer look at CosineDistance

CosineDistance deserves a dedicated explanation, because it is the bridge between the theory discussed earlier and the actual query. Earlier we used cosine similarity, where a value close to 1 means two vectors point in nearly the same direction and are therefore semantically close. Cosine distance is simply the complement of that quantity:

\text{cosine distance} = 1 - \cos(\theta)

This inversion changes how the values are read. With cosine similarity, larger is more similar; with cosine distance, smaller is more similar. The endpoints make this clear: identical vectors have a similarity of 1 and a distance of 0, while opposite vectors have a similarity of -1 and a distance of 2. The distance always falls between 0 and 2.

The reason pgvector exposes distance rather than similarity is practical. Database indexes and ordering operations are built around the idea of "nearest" meaning "smallest," so framing the metric as a distance lets ORDER BY return the closest matches first in the natural way. In the code, x.Embedding.CosineDistance(inputUserEmbedding.First().Embedding) is not executed in C#. Entity Framework translates it into the pgvector cosine distance operator <=>, so the calculation runs inside PostgreSQL and can take advantage of the HNSW index defined earlier with vector_cosine_ops. This is what allows the search to remain fast even as the number of stored chunks grows, since the database does not have to compare the query against every row one by one.

This also explains the earlier caution about the filter. Because smaller distance means more similar, OrderBy(... CosineDistance ...) correctly lists the closest chunks first < 0.3. Choosing the exact threshold is a matter of tuning against your own data, since the right cutoff depends on the embedding model and the nature of the documents.

Conclusion

The implementation tied these ideas together in two halves. The ingestion step reads documents, splits them into chunks, turns each chunk into a vector with a local Ollama model, and stores everything in Postgres through pgvector. The query step embeds the user's question, retrieves the nearest chunks by cosine distance using the HNSW index, and hands that context to Gemini with an instruction to answer only from what it was given.

This is a deliberately simple version, and there is plenty of room to grow: better chunking strategies, reranking the retrieved results, evaluating answer quality, and adding the error handling and security that production code requires. Even so, the core loop stays the same, and understanding it makes the rest far easier to reason about.

CODE HERE