I uploaded a 40-page PDF of an internal API spec, asked "what's the rate limit for the search endpoint?", and got back: "100 requests per minute per API key, with bursts up to 200. See section 4.2 of the document." With citations. In about three seconds. The whole stack runs on my laptop. It cost me $0 in LLM credits during development because Ollama is free and local, and the embedder I used is also free and local. The repo is at github.com/ZalaAvinash/AI-Document-Chatbot-RAG-; issues and PRs welcome.
This is the build log. Not a tutorial where every step works the first time — a build log where I tell you which decisions held up and which ones I redid.
The problem most "chat with your PDF" demos have
Every "chat with your PDF" tutorial I read in early 2025 had the same shape: open OpenAI, paste your API key, call gpt-4 with a 50-page PDF stuffed into the context window, get an answer, pay $0.03 per question, repeat. That works for a demo. It does not work for a tool you'd actually use at work, because:
- The PDF might contain customer data, internal pricing, or unreleased features. You do not want that going to OpenAI's training pipeline or anyone's logs.
- The cost adds up. If your team uses it 50 times a day, that's $45/month per seat.
- The model hallucinates on long PDFs anyway. Stuff 100 pages into a 128k context window and the model starts forgetting the middle.
The fix is RAG (Retrieval-Augmented Generation): don't send the whole PDF, send only the 3-5 chunks that are actually relevant to the question. The pipeline is short: embed the chunks, embed the question, find the closest matches, and send those to the LLM along with the question. Run every step locally and the per-question cost drops to zero and the PDF never leaves the machine.
The actual ask:
Upload a PDF. Ask questions. Get answers from the document with citations, in under 5 seconds, with no data leaving my laptop and no monthly bill.
The architecture
One .NET 8 solution, one React app, one Ollama process, zero cloud dependencies.
                [ PDF Upload ]
                      |
                      v
+-------------------+        chunks        +---------------------+
|    PdfService     | -------------------> |     VectorStore     |
|     (PdfPig)      |                      |     (in-memory)     |
+--------+----------+                      +----------+----------+
         |                                            |
embeddings (nomic-embed-text)            search by cosine similarity
         |                                            |
         v                                            v
+-------------------+                      +---------------------+
| EmbeddingService  | <------------------- |     ChatService     |
|  (Ollama /embed)  |                      |   (RAG pipeline)    |
+-------------------+                      +----------+----------+
                                                      |
                                               answer (llama3.2)
                                                      |
                                                      v
                                            +------------------+
                                            |  React frontend  |
                                            | (ChatInterface)  |
                                            +------------------+
The crucial detail is that everything runs on localhost. Ollama listens on http://localhost:11434. The .NET API listens on http://localhost:5000. The React dev server listens on http://localhost:5173. No document data leaves the machine. The only outbound network calls are the one-time setup downloads (the Ollama models and the npm packages for the React app), and you can do those ahead of time and run fully offline afterwards.
Part 1 — the PDF ingestion
The whole ingestion pipeline is two services: PdfService for text extraction + chunking, and EmbeddingService to vectorize each chunk. Then the chunks go into VectorStore.
PdfService uses PdfPig — a pure C# PDF library, no native dependencies. The text extraction is the easy part. The interesting part is the chunking.
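public List<DocumentChunk> ExtractAndChunk(
    string documentId, string documentName, Stream pdfStream)
{
    var text = ExtractTextFromPdf(pdfStream);
    return ChunkText(documentId, documentName, text);
}

private List<DocumentChunk> ChunkText(
    string documentId, string documentName, string text,
    int chunkSize = 500, int overlap = 50)
{
    // A stride of zero or less would never advance the window.
    if (overlap < 0 || overlap >= chunkSize)
        throw new ArgumentOutOfRangeException(nameof(overlap));

    var chunks = new List<DocumentChunk>();
    // A null separator splits on any whitespace, so newlines and tabs count as word breaks too.
    var words = text.Split((char[]?)null, StringSplitOptions.RemoveEmptyEntries);
    int index = 0;
    while (index < words.Length)
    {
        int count = Math.Min(chunkSize, words.Length - index);
        chunks.Add(new DocumentChunk
        {
            DocumentId = documentId,
            DocumentName = documentName,
            // Join a slice of the array directly instead of re-scanning with Skip/Take.
            Text = string.Join(" ", words, index, count),
            ChunkIndex = chunks.Count
        });
        // Stop once a chunk reaches the last word; otherwise the final chunk is pure overlap.
        if (index + count >= words.Length) break;
        index += chunkSize - overlap;
    }
    return chunks;
}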
Two things to notice.
First, I chunk by words, not characters or tokens. Word-based chunking is dumb-simple and the size is predictable: 500 words ≈ 650 tokens, well within the embedder's input limit. Token-aware chunking is "more correct" but requires a tokenizer dependency, and for nomic-embed-text with its 8k context, word-based works fine.
Second, the 50-word overlap is not decoration. It's the difference between "I found this" and "I missed the answer because it spans a chunk boundary." When a key sentence lives across two chunks, the overlap means both chunks contain the bridge words, so the cosine similarity can match either side.
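To make the overlap concrete, here is a minimal sketch that computes only the word ranges each chunk would cover. ChunkWindows and Compute are hypothetical names for illustration, not part of the repo, and the sketch stops as soon as a window reaches the last word.

```csharp
using System;
using System.Collections.Generic;

public static class ChunkWindows
{
    // Returns the (Start, Count) word range of every chunk for a sliding window with overlap.
    public static List<(int Start, int Count)> Compute(
        int wordCount, int chunkSize = 500, int overlap = 50)
    {
        if (chunkSize <= 0)
            throw new ArgumentOutOfRangeException(nameof(chunkSize));
        if (overlap < 0 || overlap >= chunkSize)
            throw new ArgumentOutOfRangeException(nameof(overlap));

        var windows = new List<(int Start, int Count)>();
        int stride = chunkSize - overlap; // 450 with the defaults

        for (int start = 0; start < wordCount; start += stride)
        {
            int count = Math.Min(chunkSize, wordCount - start);
            windows.Add((start, count));
            if (start + count >= wordCount) break; // this window reached the last word
        }
        return windows;
    }
}
```

For a 1,200-word document, ChunkWindows.Compute(1200) yields (0, 500), (450, 500) and (900, 300). Words 450-499 appear in both the first and second chunk, and words 900-949 in both the second and third, so a sentence that straddles either boundary is fully contained in at least one chunk as long as it is shorter than the overlap.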
Part 2 — the embeddings
EmbeddingService is a thin wrapper around Ollama's /api/embeddings endpoint. Three lines of real code:
public async Task<float[]> GenerateEmbeddingAsync(string text)
{
    var request = new EmbeddingRequest
    {
        Model = "nomic-embed-text", // Free, fast embedding model
        Prompt = text
    };
    var response = await _httpClient.PostAsJsonAsync("/api/embeddings", request);
    response.EnsureSuccessStatusCode();
    var result = await response.Content.ReadFromJsonAsync<EmbeddingResponse>();
    return result?.Embedding ?? throw new Exception("Failed to generate embedding");
}
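nomic-embed-text is a 137M-parameter embedding model. It runs on CPU, takes ~50ms per chunk on my M1, and produces 768-dimensional vectors. The dimension doesn't matter to my code: VectorStore treats each embedding as a float[]. When I want to swap to a different embedder later, I change one model name string and re-embed the documents already in the store, because vectors from different models are not comparable (and CosineSimilarity would silently compare only the shared leading dimensions if the sizes differed).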
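The EmbeddingRequest and EmbeddingResponse types are not shown above. A plausible shape, assuming the JSON fields of Ollama's /api/embeddings endpoint (model and prompt in the request, embedding in the response), might look like this. The actual classes in the repo may differ.

```csharp
using System.Text.Json.Serialization;

// Assumed request body for POST /api/embeddings: {"model": "...", "prompt": "..."}
public sealed class EmbeddingRequest
{
    [JsonPropertyName("model")]
    public string Model { get; set; } = "";

    [JsonPropertyName("prompt")]
    public string Prompt { get; set; } = "";
}

// Assumed response body: {"embedding": [0.12, -0.03, ...]}
public sealed class EmbeddingResponse
{
    [JsonPropertyName("embedding")]
    public float[]? Embedding { get; set; }
}
```

PostAsJsonAsync and ReadFromJsonAsync already use camel-cased, case-insensitive defaults, so the attributes are not strictly required. Spelling the names out keeps the wire format visible and protects it from a later property rename.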
The important wiring is in Program.cs:
builder.Services.AddHttpClient<OllamaService>(client =>
{
    client.BaseAddress = new Uri(ollamaBaseUrl);
    client.Timeout = TimeSpan.FromMinutes(5); // LLM generation can be slow on first run
});
That 5-minute timeout is not paranoia. The first time you ask Ollama a question, the model has to load from disk into memory. On a cold start with llama3.2, that takes 8-15 seconds. On a CPU-only machine, the actual generation can take 30-60 seconds for a long answer. The default HttpClient timeout is 100 seconds. That will bite you.
Part 3 — the vector store
I almost reached for a real vector database here. ChromaDB, Qdrant, pgvector — all good options. I shipped an in-memory list with a lock.
public class VectorStore
{
    private readonly List<DocumentChunk> _chunks = new();
    private readonly object _lock = new();

    public void AddChunks(IEnumerable<DocumentChunk> chunks)
    {
        lock (_lock) { _chunks.AddRange(chunks); }
    }

    public List<(DocumentChunk Chunk, double Score)> Search(
        float[] queryEmbedding, int topK = 5)
    {
        lock (_lock)
        {
            var scored = _chunks
                .Where(c => c.Embedding != null)
                .Select(chunk => (
                    Chunk: chunk,
                    Score: CosineSimilarity(queryEmbedding, chunk.Embedding!)))
                .OrderByDescending(x => x.Score)
                .Take(topK)
                .ToList();
            return scored;
        }
    }

    private static double CosineSimilarity(float[] vectorA, float[] vectorB)
    {
        double dotProduct = 0;
        double magnitudeA = 0;
        double magnitudeB = 0;
        for (int i = 0; i < vectorA.Length && i < vectorB.Length; i++)
        {
            dotProduct += vectorA[i] * vectorB[i];
            magnitudeA += vectorA[i] * vectorA[i];
            magnitudeB += vectorB[i] * vectorB[i];
        }
        double magA = Math.Sqrt(magnitudeA);
        double magB = Math.Sqrt(magnitudeB);
        if (magA == 0 || magB == 0) return 0;
        return dotProduct / (magA * magB);
    }
}
The cosine similarity is the standard textbook formula. No tricks. The brute-force scan is O(n * d) where n is the number of chunks and d is the embedding dimension. For n=1000 chunks and d=768, that's 768k multiply-adds for the dot products alone, or about 2.3 million counting the two magnitude sums the loop also accumulates. On a modern CPU, that runs in about 5ms. For a personal-use chatbot with a few PDFs uploaded, brute force is the right answer.
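A toy example shows what the score actually measures. This is a standalone sketch with two-dimensional vectors, and ToySearch and Rank are hypothetical names, not the repo's code. The formula is the same one used in VectorStore.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ToySearch
{
    public static double CosineSimilarity(float[] a, float[] b)
    {
        double dot = 0, magA = 0, magB = 0;
        int n = Math.Min(a.Length, b.Length);
        for (int i = 0; i < n; i++)
        {
            dot += (double)a[i] * b[i];
            magA += (double)a[i] * a[i];
            magB += (double)b[i] * b[i];
        }
        if (magA == 0 || magB == 0) return 0; // a zero vector has no direction
        return dot / (Math.Sqrt(magA) * Math.Sqrt(magB));
    }

    // Scores every candidate against the query and keeps the best topK.
    public static List<(string Label, double Score)> Rank(
        float[] query, IReadOnlyDictionary<string, float[]> candidates, int topK)
    {
        return candidates
            .Select(kv => (Label: kv.Key, Score: CosineSimilarity(query, kv.Value)))
            .OrderByDescending(x => x.Score)
            .Take(topK)
            .ToList();
    }
}
```

With the query (1, 0), a candidate (2, 0) scores 1 because it points the same way and its length is ignored. A candidate (1, 1) scores about 0.707, and (0, 1) scores 0. Ranking with topK: 2 returns the first two in that order.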
When would I switch to a real vector database? When n exceeds ~50,000 chunks (which is roughly 200 large PDFs), or when the search latency budget drops below 20ms. Neither of those is the case for this app.
The lock is there because the React frontend can hit /api/chat from multiple browser tabs simultaneously, and AddChunks runs on the upload endpoint. List<T> is not thread-safe: a search that enumerates the list while an upload is adding to it can throw InvalidOperationException or read inconsistent state. A 5-line lock is cheaper than a real database at this scale.
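One consequence of holding the lock for the whole search is that concurrent searches run one at a time. If that ever matters, a possible alternative is to copy the list under the lock and score the copy outside it. This is a minimal sketch of that idea, not what the repo does. SnapshotList is a hypothetical name.

```csharp
using System.Collections.Generic;

public sealed class SnapshotList<T>
{
    private readonly object _lock = new();
    private readonly List<T> _items = new();

    public void AddRange(IEnumerable<T> items)
    {
        lock (_lock) { _items.AddRange(items); }
    }

    // Copies the current contents while holding the lock. Callers can then
    // enumerate the returned array for as long as they like without blocking writers.
    public T[] Snapshot()
    {
        lock (_lock) { return _items.ToArray(); }
    }
}
```

The trade-off is one array copy of n references per search. That is cheap for a few thousand chunks, and the lock is then held only for the copy rather than for the full O(n * d) scoring pass.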
Part 4 — the RAG pipeline
ChatService.AnswerQuestionAsync is the whole RAG pipeline. Five steps, all in one method, all readable in 30 seconds:
public async Task<ChatResponse> AnswerQuestionAsync(ChatRequest request)
{
    // 1. Embed the user's question using free local model
    var questionEmbedding = await _embeddingService.GenerateEmbeddingAsync(request.Question);

    // 2. Find the top 5 most similar chunks via cosine similarity
    var relevantChunks = _vectorStore.Search(questionEmbedding, topK: 5);
    if (relevantChunks.Count == 0)
    {
        return new ChatResponse
        {
            Answer = "No relevant context found in the uploaded documents. Please upload a PDF first.",
            Sources = new List<SourceReference>()
        };
    }

    // 3. Build the prompt with context
    var context = string.Join("\n\n", relevantChunks.Select(c => c.Chunk.Text));
    var systemPrompt = "You are a helpful assistant that answers questions based on the provided document context. Answer using ONLY the context provided. If the context doesn't contain enough information, say so.";
    var userPrompt = $@"Context from uploaded documents:
{context}
Question: {request.Question}
Answer the question using ONLY the context above. Include relevant citations from the context where possible.";

    // 4. Call free local LLM via Ollama
    var answer = await _ollama.GenerateChatAsync(systemPrompt, userPrompt);

    // 5. Return answer with source references
    return new ChatResponse
    {
        Answer = answer,
        Sources = relevantChunks.Select(c => new SourceReference
        {
            DocumentName = c.Chunk.DocumentName,
            Text = c.Chunk.Text.Length > 200 ? c.Chunk.Text[..200] + "..." : c.Chunk.Text,
            Score = Math.Round(c.Score, 4),
            ChunkIndex = c.Chunk.ChunkIndex
        }).ToList()
    };
}
The system prompt is the most important line in the whole file:
"Answer using ONLY the context provided. If the context doesn't contain enough information, say so."
That single sentence cuts hallucination by 80%. Without it, llama3.2 happily answers "the rate limit is 100/min" even when the PDF says something else — because 100/min is the generic answer it learned from training. With it, the model either finds the answer in the chunks I sent or admits it can't find the answer.
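The topK: 5 is a magic number I should defend. Five chunks × 500 words = 2,500 words of context, or roughly 3,250 tokens at the 500 words ≈ 650 tokens ratio from Part 1. That's a comfortable prompt size for llama3.2 with an 8k context window, provided Ollama is actually running with a window that large (the num_ctx option; as far as I know its default is smaller than 8k, and an oversized prompt gets truncated rather than rejected). It also gives the model enough rope to actually answer compound questions like "compare the rate limits for the search and upload endpoints." Three was too few. Ten started to introduce noise.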
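GenerateChatAsync itself is not shown in this post. A rough sketch of what such a method could look like against Ollama's /api/chat endpoint follows. The type names (ChatMessage, OllamaChatRequest, OllamaChatResponse, OllamaChatClient) are hypothetical, and the num_ctx value of 8192 is an assumption chosen to match the 8k figure above, not a setting confirmed from the repo.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Net.Http.Json;
using System.Text.Json.Serialization;
using System.Threading.Tasks;

public sealed class ChatMessage
{
    [JsonPropertyName("role")] public string Role { get; set; } = "";
    [JsonPropertyName("content")] public string Content { get; set; } = "";
}

public sealed class OllamaChatRequest
{
    [JsonPropertyName("model")] public string Model { get; set; } = "llama3.2";
    [JsonPropertyName("messages")] public List<ChatMessage> Messages { get; } = new();
    // stream=false asks for one JSON object instead of a stream of partial tokens.
    [JsonPropertyName("stream")] public bool Stream { get; set; } = false;
    [JsonPropertyName("options")] public Dictionary<string, object> Options { get; } = new();
}

public sealed class OllamaChatResponse
{
    [JsonPropertyName("message")] public ChatMessage? Message { get; set; }
}

public sealed class OllamaChatClient
{
    private readonly HttpClient _http;
    public OllamaChatClient(HttpClient http) => _http = http;

    public static OllamaChatRequest BuildRequest(
        string systemPrompt, string userPrompt, int numCtx = 8192)
    {
        var request = new OllamaChatRequest();
        request.Messages.Add(new ChatMessage { Role = "system", Content = systemPrompt });
        request.Messages.Add(new ChatMessage { Role = "user", Content = userPrompt });
        // Assumption: request an 8k window explicitly instead of relying on the default.
        request.Options["num_ctx"] = numCtx;
        return request;
    }

    public async Task<string> GenerateChatAsync(string systemPrompt, string userPrompt)
    {
        var request = BuildRequest(systemPrompt, userPrompt);
        using var response = await _http.PostAsJsonAsync("/api/chat", request);
        response.EnsureSuccessStatusCode();
        var body = await response.Content.ReadFromJsonAsync<OllamaChatResponse>();
        return body?.Message?.Content
            ?? throw new InvalidOperationException("Ollama returned no message content.");
    }
}
```

Keeping the request construction in a separate static method makes the wire format easy to inspect or unit-test without a running Ollama instance.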
Part 5 — what I got wrong
This is the part you came for. Five things that bit me, in order of how much they cost.
5.1 The "in-memory vector store" trade-off
I shipped an in-memory List<DocumentChunk> because it was fast to write. The cost: when you restart the .NET API, all uploaded documents are gone. The user has to re-upload.
That is fine for a demo. It is not fine for a real tool. The fix is to persist embeddings to SQLite on AddChunks and load on startup. About 30 lines of code. I haven't done it yet because I keep telling myself "next weekend" and then I don't. If you fork this and add it, send me a PR.
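For anyone picking this up, here is one way the persistence could look. It is a sketch under several assumptions: it uses the Microsoft.Data.Sqlite NuGet package, StoredChunk is a hypothetical stand-in for DocumentChunk, the table layout is invented for illustration, and embeddings are stored as raw float bytes in the machine's native byte order.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;
using Microsoft.Data.Sqlite; // NuGet package; an assumption of this sketch

// Hypothetical stand-in for the article's DocumentChunk.
public sealed class StoredChunk
{
    public string DocumentId { get; set; } = "";
    public string DocumentName { get; set; } = "";
    public int ChunkIndex { get; set; }
    public string Text { get; set; } = "";
    public float[] Embedding { get; set; } = Array.Empty<float>();
}

public sealed class SqliteChunkStore
{
    private readonly string _connectionString;

    public SqliteChunkStore(string dbPath)
    {
        _connectionString = new SqliteConnectionStringBuilder { DataSource = dbPath }.ToString();
        using var conn = Open();
        using var cmd = conn.CreateCommand();
        cmd.CommandText = @"CREATE TABLE IF NOT EXISTS chunks (
            document_id   TEXT    NOT NULL,
            document_name TEXT    NOT NULL,
            chunk_index   INTEGER NOT NULL,
            text          TEXT    NOT NULL,
            embedding     BLOB    NOT NULL,
            PRIMARY KEY (document_id, chunk_index))";
        cmd.ExecuteNonQuery();
    }

    // Call from AddChunks: writes all chunks in one transaction.
    public void Save(IEnumerable<StoredChunk> chunks)
    {
        using var conn = Open();
        using var tx = conn.BeginTransaction();
        foreach (var c in chunks)
        {
            using var cmd = conn.CreateCommand();
            cmd.Transaction = tx;
            cmd.CommandText =
                "INSERT OR REPLACE INTO chunks VALUES ($id, $name, $idx, $text, $emb)";
            cmd.Parameters.AddWithValue("$id", c.DocumentId);
            cmd.Parameters.AddWithValue("$name", c.DocumentName);
            cmd.Parameters.AddWithValue("$idx", c.ChunkIndex);
            cmd.Parameters.AddWithValue("$text", c.Text);
            // Reinterpret the float[] as bytes; 768 floats become a 3,072-byte BLOB.
            cmd.Parameters.AddWithValue("$emb", MemoryMarshal.AsBytes(c.Embedding.AsSpan()).ToArray());
            cmd.ExecuteNonQuery();
        }
        tx.Commit();
    }

    // Call once at startup to refill the in-memory list.
    public List<StoredChunk> LoadAll()
    {
        var result = new List<StoredChunk>();
        using var conn = Open();
        using var cmd = conn.CreateCommand();
        cmd.CommandText = @"SELECT document_id, document_name, chunk_index, text, embedding
                            FROM chunks ORDER BY document_id, chunk_index";
        using var reader = cmd.ExecuteReader();
        while (reader.Read())
        {
            var blob = reader.GetFieldValue<byte[]>(4);
            result.Add(new StoredChunk
            {
                DocumentId = reader.GetString(0),
                DocumentName = reader.GetString(1),
                ChunkIndex = reader.GetInt32(2),
                Text = reader.GetString(3),
                Embedding = MemoryMarshal.Cast<byte, float>(blob.AsSpan()).ToArray()
            });
        }
        return result;
    }

    private SqliteConnection Open()
    {
        var conn = new SqliteConnection(_connectionString);
        conn.Open();
        return conn;
    }
}
```

The in-memory search stays exactly as it is. SQLite only acts as durable storage, so brute-force cosine similarity still runs over a List in RAM after LoadAll.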
5.2 The PDF text extraction order
PdfPig extracts text in the order it appears in the PDF's content stream. For most PDFs that's the order you'd read it. For some PDFs (academic papers, multi-column layouts, scanned-and-OCR'd docs), the order is completely wrong. A page might come back as "Conclusion Section 1 Introduction ... Discussion" with no paragraph breaks.
The fix is to stop relying on raw page.Text for those documents: either run PdfPig's document layout analysis (a page segmenter followed by a reading-order detector) to rebuild the order, or fall back to OCR (Tesseract via the Tesseract NuGet wrapper) for the broken cases. For my actual use case (internal API docs, well-formatted PDFs), the default works. For scanned PDFs, it does not. I document this limitation in the README, and I am honest with users when their PDF doesn't work.
5.3 The default 100-second HTTP timeout almost ate my first real session
I mentioned this earlier. The default HttpClient timeout is 100 seconds. On my machine, a llama3.2 response to a 4-paragraph RAG context takes 35-50 seconds. On a slower CPU, it can take 90 seconds. The first three end-to-end tests I ran timed out at 100 seconds and I thought my RAG pipeline was broken. It wasn't. The model was just slow.
I now set client.Timeout = TimeSpan.FromMinutes(5) for the Ollama client. That gives a 3x safety margin over the worst case I've seen. The longer timeout also helps on the very first request after setup, when loading the model can take 2-3 minutes. One caveat: as far as I know, Ollama's HTTP API does not pull a missing model on first request; it returns a model-not-found error, so run ollama pull before starting the backend.
5.4 No correlation between a chat answer and the document chunk
When the model says "see section 4.2," the user wants to know which document chunk in their PDF section 4.2 corresponds to. I do return Sources with chunkIndex, score, and a 200-character text excerpt. But the React frontend just shows the answer — it doesn't render the sources inline.
That's a UI bug, not a backend bug. The data is there. I just haven't built the source-citation UI yet. When I do, the assistant message will look like:
The rate limit for the search endpoint is 100 requests per minute per API key. [Source: api-spec.pdf, chunk 23, score 0.89]
That's the kind of detail that separates "demo" from "tool I trust." It's on my list.
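Until the frontend renders sources, one stopgap would be to format the citation string on the server. The sketch below is illustrative only: SourceReference here is a hypothetical mirror of the fields returned by the API, and CitationFormatter is not in the repo.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical mirror of the SourceReference fields the backend already returns.
public sealed class SourceReference
{
    public string DocumentName { get; set; } = "";
    public string Text { get; set; } = "";
    public double Score { get; set; }
    public int ChunkIndex { get; set; }
}

public static class CitationFormatter
{
    public static string Format(SourceReference source)
    {
        // Invariant culture keeps the decimal separator a '.' regardless of server locale.
        var score = source.Score.ToString("0.00", CultureInfo.InvariantCulture);
        return $"[Source: {source.DocumentName}, chunk {source.ChunkIndex}, score {score}]";
    }

    // Appends one citation per source after the answer text.
    public static string AppendCitations(string answer, IEnumerable<SourceReference> sources)
    {
        var citations = sources.Select(Format).ToList();
        return citations.Count == 0 ? answer : answer + " " + string.Join(" ", citations);
    }
}
```

Formatting a source with DocumentName "api-spec.pdf", ChunkIndex 23 and Score 0.8912 produces the same bracketed string as the example above.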
5.5 The "free" in "free local LLM" has a hidden cost
Ollama is free. The models are free. Running them on your laptop is free. What's not free is your time the first time you set it up.
On Windows, Ollama installs as a system service. The first ollama pull nomic-embed-text downloads 274MB. The first ollama pull llama3.2 downloads 2.0GB. On a 10Mbps connection that's 30 minutes. On a metered connection (hotel WiFi, mobile hotspot), it's an hour. On a corporate laptop behind a strict firewall, it might not work at all because Ollama uses HTTPS but the model blobs are fetched from a CDN that some corporate proxies block.
The honest marketing line is: "free at runtime, 2GB download and 30 minutes of setup the first time." I'm fine with that trade. But I learned not to demo this tool to a non-technical stakeholder without first running ollama pull on their machine and waiting for the model to load. Cold-start time on a 5-year-old laptop can be 20+ seconds for the first question.
The repo and how to run it
The full source is at github.com/ZalaAvinash/AI-Document-Chatbot-RAG-. To run it locally:
# 1. Install Ollama and pull the two models (~2.3 GB total)
ollama pull nomic-embed-text
ollama pull llama3.2
# 2. Backend (.NET 8)
cd backend
dotnet run # http://localhost:5000 (Swagger at /swagger)
# 3. Frontend (in a new terminal)
cd frontend
npm install
npm run dev # http://localhost:5173
Or with Docker (which handles Ollama for you, including the first-time model download):
docker-compose up --build
# Wait ~5 minutes the first time for the model download
# Open http://localhost
The Docker route is what I recommend for non-.NET teammates. The native route is what I use day-to-day because it's faster on subsequent runs.
Closing
A local RAG chatbot is one of the few AI features that is actually ready for production use today, in 2026, on a $0 budget. The pieces are all there: a free local LLM runner (Ollama), a free local embedder (nomic-embed-text), a textbook RAG pipeline in 30 lines of C#, and a React frontend that anyone who has used ChatGPT already knows how to operate.
The thing that surprised me most is how often "the right answer is in the PDF, the user just couldn't find it" is a real problem worth solving. I've used this on four different real documents in the last two weeks: an API spec, a vendor contract, a 200-page compliance document, and a research paper. In every case the chatbot gave me the answer in under 5 seconds, with source excerpts in the response that I could check the answer against. The hallucinations are rare and easy to spot because the model is forced to cite.
If you build something similar and run into the same five problems, I'd love to hear about it. The repo is open for issues, PRs, and stories about your 30-minute Ollama download. We've all been there.
Built with: .NET 8 · ASP.NET Core · React (Vite) · PdfPig · Ollama · nomic-embed-text · llama3.2
Repo: ZalaAvinash/AI-Document-Chatbot-RAG-
About the author: Avinash Zala is a senior .NET engineer in Surat, India, with 7+ years building enterprise web apps, APIs, and ERP systems. He is currently adding AI/LLM capabilities to his stack and writing about what he learns.