DEV Community: Suman Nath

Anatomy of a Full RAG Application: Every Concept, One Self-Hosted Stack

Suman Nath — Sun, 12 Jul 2026 13:35:29 +0000

"Chat with your documents" sounds simple. Then you build it, and you discover a good RAG system is really eight systems wearing a trench coat.

I recently finished myRAG — a fully self-hosted RAG stack: FastAPI backend, React frontend, and three storage engines (Qdrant, PostgreSQL, Neo4j), all orchestrated with Docker Compose. This post walks through every stage of the pipeline and the concept behind it, with real code from the project.

For a link to the codebase, scroll to the bottom.

The architecture at a glance

Document ─► Docling ─► Chunker ─┬─► Dense embed (OpenRouter) ──┐
                                ├─► Sparse BM25 (fastembed) ───┼─► Qdrant (named vectors)
                                └─► LLM triple extraction ─────┼─► Neo4j (knowledge graph)
                                                               └─► PostgreSQL (metadata, doc_uuid)

Question ─► hybrid search (RRF) ─► rerank ─► + graph facts ─► + memory ─► token budget ─► LLM ─► SSE stream

Six containers: the app, the React/nginx frontend, docling-serve, Qdrant, Postgres, and Neo4j.

1. Parsing: retrieval quality is capped by parsing quality

Everything starts with converting PDFs/DOCX/HTML into text an LLM can use. I run Docling as a separate service that returns clean Markdown, preserving headings and tables.

Why Markdown? Because structure survives. A table flattened into a character soup is unsearchable no matter how good your embedding model is. Garbage in, garbage out — this stage silently determines your ceiling.

2. Chunking: the most underrated decision in RAG

Chunks are the retrieval unit. Too large and the embedding becomes a blurry average of many topics; too small and the chunk loses the context needed to be understood alone.

chunking:
  strategy: recursive
  chunk_size: 512
  chunk_overlap: 64

Recursive splitting respects paragraph and sentence boundaries, and the overlap is essential: without it, answers that straddle a chunk boundary fall into the gap and are never retrieved whole.
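A minimal sketch of what recursive splitting with overlap can look like. It assumes character counts rather than tokens, and the names split_recursive and add_overlap are hypothetical; the project's actual chunker may work differently.

```python
SEPARATORS = ["\n\n", "\n", ". ", " "]  # paragraph, line, sentence, word

def split_recursive(text, chunk_size, separators=SEPARATORS):
    """Split text into pieces of at most chunk_size characters, trying the
    coarsest separator first and recursing with finer ones when needed."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    for idx, sep in enumerate(separators):
        if sep not in text:
            continue
        pieces, current = [], ""
        for part in text.split(sep):
            candidate = current + sep + part if current else part
            if len(candidate) <= chunk_size:
                current = candidate          # still fits: keep merging
                continue
            if current:
                pieces.append(current)
            if len(part) > chunk_size:       # too big on its own: go finer
                pieces.extend(split_recursive(part, chunk_size, separators[idx + 1:]))
                current = ""
            else:
                current = part
        if current:
            pieces.append(current)
        return pieces
    # No separator left: fall back to a hard cut.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def add_overlap(pieces, overlap):
    """Prefix every chunk after the first with the tail of its predecessor.
    Chunks can therefore grow to chunk_size + overlap characters."""
    out = []
    for i, piece in enumerate(pieces):
        prefix = pieces[i - 1][-overlap:] if i > 0 and overlap > 0 else ""
        out.append(prefix + piece)
    return out
```

With the overlap in place, a sentence that ends one chunk also opens the next, so an answer sitting on the boundary is still retrievable from a single chunk.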

3. Index twice: dense for meaning, sparse for precision

Most tutorials embed once and call it done. I index every chunk two ways:

Dense vectors (Qwen3-Embedding, 4096-dim, via OpenRouter) — semantic similarity. "Q3 revenue" matches "third-quarter earnings."
Sparse BM25 vectors (computed locally with fastembed, no API cost) — lexical similarity. Exact part numbers, names, acronyms — the things embedding models fumble.

Qdrant stores both on the same point as named vectors, with BM25's IDF computed server-side:

self.client.create_collection(
    collection_name=...,
    vectors_config={"dense": VectorParams(size=4096, distance=Distance.COSINE)},
    sparse_vectors_config={"bm25": models.SparseVectorParams(modifier=models.Modifier.IDF)},
)

One subtlety: BM25 weights documents and queries differently, so ingestion uses embed() (term weighting) while queries use query_embed() (term presence).

4. Hybrid search: Reciprocal Rank Fusion

At query time, both searches run and Qdrant fuses them with RRF — which merges ranked lists instead of trying to normalize incomparable score scales:

result = self.client.query_points(
    collection_name=...,
    prefetch=[
        models.Prefetch(query=dense_vector, using="dense", limit=20),
        models.Prefetch(query=sparse_query, using="bm25", limit=20),
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),
    limit=10,
)

RRF scores each document by Σ 1/(k + rank) across lists. A chunk ranked highly by either method surfaces; one ranked highly by both wins. No score calibration, no tuning — and it works disturbingly well.
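To make the formula concrete, here is a small pure-Python sketch of rank fusion. The function name rrf_fuse and the constant k = 60 (a commonly used value) are assumptions for illustration; this is not Qdrant's internal implementation.

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60, limit=10):
    """Fuse several ranked lists of document ids with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) for every id it contains,
    with rank starting at 1 for the top hit.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))[:limit]

dense_hits = ["a", "b", "c"]    # hypothetical dense ranking
sparse_hits = ["c", "a", "d"]   # hypothetical BM25 ranking
print(rrf_fuse([dense_hits, sparse_hits]))  # -> ['a', 'c', 'b', 'd']
```

Chunks "a" and "c" appear in both lists, so they outrank "b" and "d", which each appear in only one. Only ranks enter the sum, which is why the raw cosine and BM25 scores never need to be put on a common scale.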

5. Reranking: cheap recall, then expensive precision

Vector search compares a query against chunks that were embedded without knowing the question. A cross-encoder reranker (Cohere rerank, via OpenRouter) reads query and chunk together and produces a much sharper relevance score.

The pattern is a funnel: hybrid-retrieve 10 candidates cheaply, rerank down to the best 5. It's the same two-stage architecture search engines have used for decades — recall first, precision second.

6. The knowledge graph: RAG that connects the dots

Pure vector RAG struggles with relational questions — "Who reports to the person who founded X?" spans facts that live in different chunks.

So during ingestion, an LLM extracts (subject, relation, object) triples from every chunk into Neo4j:

You are a knowledge-graph extraction engine. From the text below, extract factual
relationships as a JSON array of triples... Only extract relationships explicitly
stated in the text. Return ONLY the JSON array, no prose.

At query time the flow is: extract entities from the question → match them against graph nodes via Neo4j's fulltext index → pull their 1-hop neighborhood → inject the triples into the prompt as structured facts:

Knowledge graph facts:
- Acme Corp --acquired--> Widget Inc
- Widget Inc --founded_by--> Jane Doe

Call it GraphRAG-lite: the vector store answers "what's relevant," the graph answers "how things relate." Every relationship is tagged with doc_uuid, so deleting a document prunes its facts (and any orphaned entities) cleanly.
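The query-time flow can be sketched without a database. The example below uses a plain Python list as a stand-in for Neo4j, and every function name in it is hypothetical; it only illustrates the 1-hop lookup, the prompt formatting, and the doc_uuid-based delete described above.

```python
def one_hop(edges, entities):
    """Return (subject, relation, object) triples touching any named entity.
    edges is a list of dicts with subject, relation, object, doc_uuid keys."""
    wanted = {e.lower() for e in entities}
    return [
        (e["subject"], e["relation"], e["object"])
        for e in edges
        if e["subject"].lower() in wanted or e["object"].lower() in wanted
    ]

def format_graph_facts(triples):
    """Render triples in the prompt format shown above."""
    lines = ["Knowledge graph facts:"]
    lines += [f"- {s} --{r}--> {o}" for s, r, o in triples]
    return "\n".join(lines)

def delete_document(edges, doc_uuid):
    """Drop every relationship that was extracted from the given document."""
    return [e for e in edges if e["doc_uuid"] != doc_uuid]

edges = [
    {"subject": "Acme Corp", "relation": "acquired",
     "object": "Widget Inc", "doc_uuid": "doc-1"},
    {"subject": "Widget Inc", "relation": "founded_by",
     "object": "Jane Doe", "doc_uuid": "doc-2"},
]
print(format_graph_facts(one_hop(edges, ["widget inc"])))
```

Because each edge carries its source doc_uuid, removing a document is a filter on that one property rather than a search through the graph.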

7. Memory + token budgeting: context is a budget

Multi-turn chat needs history, but the context window is finite. Two mechanisms:

Rolling summarization — after N turns, older exchanges are compressed into a running summary by a small LLM. Long conversations cost a paragraph, not pages.

Token budgeting — before generation, the prompt is assembled against a hard cap with explicit priorities:

# Priority (always kept): system prompt, summary, graph facts, the question.
# Then newest history, then chunks best-first; lowest-ranked chunks drop first.
while history_msgs and fixed + est(history_msgs) > budget:
    history_msgs.pop(0)          # oldest history goes first

for chunk in chunks:             # rerank order: best first
    if used + est(chunk) > avail and len(kept) >= min_chunks:
        break
    kept.append(chunk)
    used += est(chunk)           # track spend so the cap actually binds

The mental model: every token of history you keep is a token of evidence you can't include. Making that trade-off explicit — instead of letting the API truncate arbitrarily — noticeably improves long conversations.
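A self-contained sketch of the same budgeting idea, for readers who want to run it. The 4-characters-per-token estimate, the function names, and the argument layout are all assumptions for illustration, not the project's actual code.

```python
def est(text):
    """Rough token estimate: about 4 characters per token (an assumption)."""
    return (len(text) + 3) // 4

def assemble(system, summary, facts, question, history, chunks, budget, min_chunks=1):
    """Return (history, chunks) that fit the budget.

    Fixed parts are always kept. Oldest history is dropped first, then
    chunks are added best-first until the budget is spent, but at least
    min_chunks are kept even if that overshoots.
    """
    fixed = sum(est(t) for t in (system, summary, facts, question))
    history = list(history)
    while history and fixed + sum(est(m) for m in history) > budget:
        history.pop(0)                      # oldest history goes first
    used = fixed + sum(est(m) for m in history)
    kept = []
    for chunk in chunks:                    # rerank order: best first
        cost = est(chunk)
        if used + cost > budget and len(kept) >= min_chunks:
            break
        kept.append(chunk)
        used += cost
    return history, kept

history, kept = assemble(
    system="s" * 40, summary="", facts="f" * 20, question="q" * 20,
    history=["h" * 40] * 3, chunks=["c" * 80] * 3, budget=45,
)
print(len(history), len(kept))  # -> 2 1
```

In the example the fixed parts cost 20 tokens, so one of the three history messages is dropped to get under 45, and a single chunk is still kept because min_chunks guarantees the model sees some evidence.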

8. Parallelism: same models, several times faster

Ingestion is embarrassingly parallel — embedding API calls, local BM25 encoding, and per-chunk graph extraction are all independent I/O. A single shared thread pool handles the fan-out:

# executor.py — one process-wide pool, order-preserving map
def map_parallel(fn, items):
    items = list(items)
    if len(items) <= 1:
        return [fn(i) for i in items]
    return list(get_pool().map(fn, items))

Three applications:

# 1. Dense embedding: sub-batches of 32 texts, requests in flight concurrently
batches = [texts[i:i+32] for i in range(0, len(texts), 32)]
vectors = [v for batch in map_parallel(self._embed_request, batches) for v in batch]

# 2. Sparse BM25 runs concurrently with the dense API round-trips
sparse_future = get_pool().submit(self.sparse_embedder.embed_batch, texts)
vectors = self.embedder.embed_batch(texts)
sparse_vectors = sparse_future.result()

# 3. Graph extraction: one LLM call per chunk, previously sequential — now fanned out
futures = [get_pool().submit(self.graph_extractor.extract, t) for t in texts]
for i, future in enumerate(as_completed(futures)):
    ...  # upsert triples + yield progress events as each completes

At query time, the graph fact lookup is submitted before retrieval starts, so it overlaps with retrieval + reranking instead of running after them. Graph extraction was the slowest ingestion phase by far — parallelizing it is the difference between 90 seconds and 15 for a mid-sized document. No async rewrite needed; threads are plenty for I/O-bound work.
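The get_pool() helper used above is not shown in the post. One plausible shape, assuming a lazily created process-wide ThreadPoolExecutor and an arbitrary max_workers=8, is sketched below together with the order-preserving map.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

_pool = None
_pool_lock = threading.Lock()

def get_pool(max_workers=8):
    """Create the shared pool on first use and reuse it afterwards.
    max_workers=8 is an assumed default, not a value from the project."""
    global _pool
    with _pool_lock:
        if _pool is None:
            _pool = ThreadPoolExecutor(max_workers=max_workers)
        return _pool

def map_parallel(fn, items):
    """Apply fn to every item concurrently, returning results in input order."""
    items = list(items)
    if len(items) <= 1:
        return [fn(i) for i in items]       # not worth a thread hop
    return list(get_pool().map(fn, items))

def fake_api_call(x):
    time.sleep(0.05)                        # stands in for a network round-trip
    return x * 2

print(map_parallel(fake_api_call, range(6)))  # -> [0, 2, 4, 6, 8, 10]
```

Executor.map yields results in the order of its inputs regardless of which call finishes first, which is what keeps vectors aligned with their chunks.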

9. The unglamorous parts that make it "production"

Referential integrity. One doc_uuid ties each document across Postgres (primary key), Qdrant (point payload), and Neo4j (relationship property). Deletes cascade through all three stores.
Streaming UX. Ingestion and chat are generator-based and stream progress over SSE — the UI shows parsing → chunking → embedding → storing → graph → done live. One gotcha: EventSource can't send headers, so the client streams via fetch + ReadableStream to pass the auth token.
Auth. Every /api/* endpoint requires a bearer token, checked with secrets.compare_digest (constant-time). The React build bakes the key in at build time from an env var — no login screen; docker-compose sources frontend and backend from the same .env so they can't drift.
Config as code. One config.yaml for every tunable, Pydantic Settings overlays secrets from .env, zero hardcoded values.
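The constant-time token check mentioned under Auth can be sketched in a few lines of standard-library Python. The function name is_authorized and the header handling are assumptions; the project wires this into FastAPI, which is not shown here.

```python
import secrets

def is_authorized(authorization_header, expected_token):
    """Validate an 'Authorization: Bearer <token>' header value.

    secrets.compare_digest takes time independent of where the two
    values differ, so response timing does not leak how much of a
    guessed token was correct.
    """
    if not authorization_header:
        return False
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not token:
        return False
    # Compare bytes so non-ASCII input cannot raise a TypeError.
    return secrets.compare_digest(token.encode(), expected_token.encode())

print(is_authorized("Bearer s3cret", "s3cret"))  # -> True
print(is_authorized("Bearer guess", "s3cret"))   # -> False
```

A plain == comparison would return the same answers; the difference is only in timing behavior, which matters once the endpoint is reachable by untrusted clients.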

The takeaway

RAG isn't one technique. It's a pipeline of small, well-understood ideas:

Parse cleanly → chunk thoughtfully → index twice → fuse ranks → rerank deeply → add structure with a graph → budget your tokens → parallelize the waits.

None of these steps is hard alone. The engineering is in making them agree with each other — sharing one UUID, one config, one thread pool, and one honest token budget.

Access the full codebase here: https://github.com/sumannath/myRAG

Questions about any layer? The hybrid search and knowledge-graph stages delivered the biggest quality jumps for me, and I'm happy to go deeper on either in the comments.

Practical RAG, Part 1: The Simplest RAG That Actually Works

Suman Nath — Thu, 02 Jul 2026 17:28:58 +0000

By Suman. Part 1 of the Practical RAG series. All code is in a runnable notebook: https://www.kaggle.com/code/sumannath88/ep01-simple-rag

Everyone talks about RAG. Far fewer people have built the simplest version end to end and looked at exactly where it falls over.

That's what this series does. We start with the most naive RAG pipeline that actually works, understand it completely, and then — one concrete problem at a time — make it better. No frameworks hiding the moving parts. Just Python you can read.

By the end of this post you'll have a working pipeline in about 40 lines that answers questions correctly — and you'll understand exactly why that success is misleading. Those hidden weaknesses are the roadmap for the rest of the series.

What RAG actually is

RAG — Retrieval-Augmented Generation — is one idea: before you ask the model a question, go find relevant text and paste it into the prompt. That's it. The "retrieval" finds the text; the "generation" is the LLM answering with that text in front of it.

Why bother? Because it lets a model answer questions about your data — documents it was never trained on — without fine-tuning, and it grounds answers in real sources instead of the model's memory.

The naive pipeline has five steps:

Load your documents
Chunk them into pieces
Embed each chunk into a vector
Retrieve the chunks most similar to the question
Generate an answer with those chunks as context

Let's build each one.

Setup

We'll use local embeddings (via sentence-transformers) so retrieval is free and needs no API key, and OpenRouter for generation because it exposes an OpenAI-compatible API across many models.

pip install sentence-transformers openai numpy

import os
import numpy as np

# On Kaggle, store OPENROUTER_API_KEY as a notebook Secret; elsewhere use an
# env var or paste it inline.
try:
    from kaggle_secrets import UserSecretsClient
    os.environ.setdefault(
        "OPENROUTER_API_KEY",
        UserSecretsClient().get_secret("OPENROUTER_API_KEY"),
    )
except ModuleNotFoundError:
    os.environ.setdefault("OPENROUTER_API_KEY", "sk-or-...")  # your key

LLM_MODEL   = "deepseek/deepseek-v4-flash"
EMBED_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
TOP_K = 3

The notebook runs on Kaggle, Colab, or locally. Embeddings are computed locally, so only generation touches the network.

1 & 2. Load and chunk

To keep everything self-contained, our "corpus" is a handful of short passages about planets. And our chunking strategy is the simplest one imaginable: one chunk per document.

DOCUMENTS = [
    "Mercury is the smallest planet ... no moons ...",
    "Venus is the hottest planet ... 465 degrees Celsius.",
    "Earth ... the only known world with liquid water and life ...",
    "Mars ... two small moons, Phobos and Deimos.",
    "Jupiter is the largest planet ... at least 95 known moons.",
    "Saturn ... famous for its prominent ring system ...",
]
chunks = DOCUMENTS  # naive: each doc is one chunk

This is fine because the passages are already short. Hold onto that caveat — it's the first thing that breaks on real data.

3. Embed

An embedding turns text into a vector of numbers such that similar meanings land near each other in space. We compute one vector per chunk, once, up front.

from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer(EMBED_MODEL)
chunk_embeddings = embedder.encode(chunks, normalize_embeddings=True)

We normalize the vectors so that cosine similarity — the standard measure of "how close are these two meanings" — collapses to a plain dot product.
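A quick check of that claim in plain Python, using tiny hand-made vectors instead of real embeddings (the helper names are illustrative):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    """Textbook cosine similarity: dot product divided by both lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [4.0, 3.0]
print(cosine(a, b))                     # full formula on raw vectors
print(dot(normalize(a), normalize(b)))  # plain dot product on unit vectors
```

Both lines print 0.96 up to floating-point rounding. Once every vector has length 1, the denominator of the cosine formula is 1, so scoring a query against all chunks reduces to one matrix-vector product.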

4. Retrieve

To answer a question, embed the question the same way, score it against every chunk, and keep the top k.

def retrieve(question, k=TOP_K):
    q_emb = embedder.encode([question], normalize_embeddings=True)[0]
    scores = chunk_embeddings @ q_emb        # cosine similarity
    top_idx = np.argsort(scores)[::-1][:k]
    return [(chunks[i], float(scores[i])) for i in top_idx]

Ask "Which planet has the most moons?" and the Jupiter chunk comes back on top. No LLM involved yet — this is pure vector search.

5. Generate

Now stitch the retrieved chunks into a prompt and ask the model — instructing it to answer only from the provided context. That instruction is the heart of RAG discipline: it's what keeps the model grounded instead of guessing.

from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1",
                api_key=os.environ["OPENROUTER_API_KEY"])

def answer(question, k=TOP_K):
    retrieved = retrieve(question, k)
    context = "\n\n".join(f"[{i+1}] {c}" for i, (c, _) in enumerate(retrieved))
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    resp = client.chat.completions.create(
        model=LLM_MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content, retrieved

answer("Which planet has the most moons?")[0]
# -> "Jupiter, with at least 95 known moons."

That's a complete RAG system. Load → chunk → embed → retrieve → generate.

It works — and that's the trap

Here's the twist: this pipeline handles the hard-looking questions just fine.

A question outside the corpus:

answer("How far is Pluto from the Sun?")[0]
# -> "I don't know."

Pluto isn't in our documents, and the model correctly refuses to invent an answer. Grounding is doing its job.

A comparison spanning two chunks:

answer("Which is hotter, Venus or Mercury, and why?")[0]
# -> "Venus is hotter (~465°C) because its thick CO2 atmosphere traps heat,
#     while Mercury has almost no atmosphere."

The answer lives across two chunks, and top-k retrieval pulls both. Correct, and even well-reasoned.

So naive RAG works. It works flawlessly. And that is exactly the problem — because it's working on six clean, short, hand-picked paragraphs. A small, tidy corpus hides every weakness the technique has.

The weaknesses hiding behind the demo — and the roadmap

Clean answers on toy data prove almost nothing. Each of these breaks the moment you point naive RAG at real documents, and each is exactly what a later part of the series fixes:

Chunking is naive. One-chunk-per-document collapses when documents are long — the right passage gets buried in noise or split apart.
Retrieval is purely semantic. Exact keywords — names, IDs, error codes — can slip past vector similarity. Hybrid (keyword + vector) search helps.
No reranking. With hundreds of chunks, the top k by cosine similarity aren't reliably the most useful k.
No evaluation. We're eyeballing two answers. Without numbers, we can't tell whether any "improvement" actually improved anything.

Part 2 takes on chunking and retrieval quality — and adds a small evaluation harness so every change from here on is measurable.

The full runnable notebook for this part is here: https://www.kaggle.com/code/sumannath88/ep01-simple-rag

If this was useful, follow along — the series gets more interesting as the naive version starts to hurt.

Next: Part 2 — Better chunks, hybrid retrieval, and how to actually measure RAG.

A Better LLM Judge? The Rubric Made My Small Model Worse

Suman Nath — Mon, 29 Jun 2026 08:07:48 +0000

In Part 2 I built the laziest possible LLM judge — a tiny model (Qwen2.5-1.5B) and a one-line rubric — and it agreed with human votes only ~43% of the time, crammed every score into a 7–8 band, and tied a third of the comparisons humans had no trouble separating.

Two things were wrong with that judge, and people usually fix only one:

The model was too small.
The rubric told it almost nothing.

I fixed each independently and measured the effect. The result wasn't the tidy "write a better rubric, it's free" story I expected — it was more interesting than that.

The big judge runs on an API (and why)

A genuinely large judge doesn't fit a free Kaggle GPU, and fighting transformers versions / OOM / sharding is exactly the yak-shaving real teams skip by calling a hosted endpoint. So the big judge runs on OpenRouter — one OpenAI-compatible API across many models, so swapping the judge is a one-line BIG_ID change. The small baseline still runs locally (no reason to spend API calls on a 1.5B model).

Two things keep the calls cheap and short: cap the output (max_tokens=160) and turn reasoning off (these models reason by default, which bloats output). Plus a small retry on the occasional 429:

import time

BIG_ID = 'deepseek/deepseek-v4-pro'   # one-line swap; also ran qwen/qwen3-32b

def big_judge(question, answer, rubric, max_tokens=160, retries=4):
    kw = dict(model=BIG_ID, messages=build_messages(question, answer, rubric),
              temperature=0, max_tokens=max_tokens)
    for attempt in range(retries):
        try:
            try:   # disable reasoning (OpenRouter-specific); fall back if rejected
                resp = or_client.chat.completions.create(
                    extra_body={'reasoning': {'enabled': False}}, **kw)
            except Exception as inner:
                if 'reasoning' in str(inner).lower():
                    resp = or_client.chat.completions.create(**kw)
                else:
                    raise
            return parse_score(resp.choices[0].message.content or ''), None
        except Exception as e:
            if ('rate' in str(e).lower() or '429' in str(e)) and attempt < retries - 1:
                time.sleep(2 * (attempt + 1))   # linear backoff before retrying
                continue
            return float('nan'), None

Since the API calls are network-bound, the 2x2 runner fans them out across a thread pool (ThreadPoolExecutor), so each big-judge condition finishes in a fraction of the sequential time. (Lesson learned the hard way on an earlier provider: with max_tokens=512 and no reasoning cap, a reasoning model spent ~4.5K tokens thinking per call and blew straight through that provider's rate limit. Capping output is the biggest lever.)

The two rubrics — the actual variable

The naive rubric is what most people write and stop at:

NAIVE_RUBRIC = (
    'Score from 1 (terrible) to 10 (excellent) based on correctness and helpfulness. '
    'Respond EXACTLY as:\nSCORE: <number>'
)

The good rubric names explicit criteria, anchors the scale (what each score band from 1-2 up to 9-10 means), and demands reasoning before the score:

GOOD_RUBRIC = (
    'You are an expert evaluator. Judge the answer on CORRECTNESS, COMPLETENESS, and '
    'INSTRUCTION-FOLLOWING. Use the FULL 1-10 scale, anchored:\n'
    '  1-2 = wrong/irrelevant.  3-4 = major errors.  5-6 = partial.\n'
    '  7-8 = correct, minor issues.  9-10 = fully correct and on-task.\n'
    'A confident, fluent answer that is factually WRONG must score 1-2, not high. '
    'First one sentence of reasoning, then:\nREASON: <one sentence>\nSCORE: <number>'
)
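Both rubrics ask for a `SCORE: <number>` line, and big_judge hands the reply to a parse_score helper that is not shown in the post. A hypothetical version is sketched below; taking the last match and rejecting out-of-range values are assumptions of this sketch, not confirmed behavior of the notebook.

```python
import math
import re

SCORE_RE = re.compile(r"SCORE:\s*([0-9]+(?:\.[0-9]+)?)")

def parse_score(reply, lo=1.0, hi=10.0):
    """Extract the numeric score from a judge reply, or NaN if unusable."""
    matches = SCORE_RE.findall(reply)
    if not matches:
        return float("nan")
    # With the good rubric the reasoning comes first, so take the last
    # SCORE line in case the model mentions a score while reasoning.
    score = float(matches[-1])
    return score if lo <= score <= hi else float("nan")

print(parse_score("REASON: factually wrong.\nSCORE: 2"))  # -> 2.0
print(math.isnan(parse_score("I think it is good.")))     # -> True
```

Returning NaN rather than a default score keeps unparseable replies visible in the results instead of silently counting them as a vote.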

The 2x2 (run twice, two different big judges)

Same human-voted Chatbot Arena pairs as Part 2 (N=30), same independent single-answer scoring. The only things that change are model and rubric. To make sure the effect wasn't a quirk of one model, I ran the big judge twice — deepseek/deepseek-v4-pro and qwen/qwen3-32b — via OpenRouter. The small baseline is the same local Qwen2.5-1.5B in both.

Big judge = DeepSeek:

Condition	Agreement (decisive)	Agreement (overall)	Ties	Scale
small + naive	67%	47%	9/30	2–10
small + good rubric	54% ⬇	43%	6/30	1–10
big + naive	65%	37%	10/30	1–10
big + good rubric	79% ⬆	50%	7/30	1–10

Big judge = Qwen 32B (same pattern, milder):

Condition	Agreement (decisive)	Ties
small + naive	67%	9/30
small + good rubric	54% ⬇	6/30
big + naive	70%	7/30
big + good rubric	71% ⬆	4/30

Read the rubric column carefully, on both. The good rubric hurt the small model (67%→54% — same on both runs) but helped the big one (DeepSeek: 65%→79%, a +14pt jump; Qwen: 70%→71% but with far fewer ties). The detailed, multi-criteria instructions that sharpened a capable model just confused the 1.5B.

One more thing the DeepSeek run exposes: big + naive landed at 65% decisive / 37% overall — no better than the small model, and its worst tie count. A bigger, pricier judge with a lazy rubric bought nothing. The leap to 79% only came when the big model and a real rubric were used together.

The point

I expected "a better rubric is the cheap win." The data said something more useful: a good rubric is an instruction, and the model has to be capable enough to follow it.

A bigger model only helped when paired with a real rubric. With the lazy rubric, the big model was no better than the small one: small + naive scored 67% on decisive pairs and DeepSeek big + naive scored 65%, essentially flat, and the big model also had its worst tie count (10/30).
A better rubric only paid off on the capable model. On the small model, the careful rubric was worse than the lazy one-liner — on both big-judge runs.

So the two fixes aren't independent levers you can add up. Hand a precise rubric to a weak model and you can make your eval worse than doing nothing; pay for a big model and skip the rubric and you've bought nothing. The best judge was the combination — big model and real rubric (DeepSeek hit 79%) — but the instructive results are the two traps on either side of it.

An LLM judge is an instrument: the model is the sensor, the rubric is the calibration. A precise calibration on a cheap sensor can read worse than no calibration at all. Specify both, and always check against human labels — because intuition (mine included) gets this wrong.

That wraps the series

Three episodes, one thread: a metric is only as honest as the conditions you measured it under.

Ep 1 — accuracy hid the classes a model silently abandoned.
Ep 2 — an LLM judge's confident score hid that it disagreed with humans.
Ep 3 — a "better" rubric helped a strong model and hurt a weak one; the headline hid that.

Evaluation isn't a box you tick once and quote forever — it's an instrument you specify, calibrate, and keep checking against ground truth, because the convenient number will always flatter you. Thanks for following along.

📓 Full runnable notebook on Kaggle: https://www.kaggle.com/code/sumannath88/ep03-better-judge-model-and-rubric

Built with Hugging Face Transformers (small judge, local) + OpenRouter (big judges: deepseek-v4-pro and qwen3-32b). Data: LMSYS Chatbot Arena. Questions or corrections welcome in the comments.

LLM-as-a-Judge: I Built One From Scratch, Then Checked It Against Humans

Suman Nath — Mon, 29 Jun 2026 08:05:50 +0000

In Part 1 the model's job was to pick one of 77 labels, so I could check it with ==. But most real LLM output isn't like that — it's a paragraph, a summary, a support reply. There's no label to compare against.

So people reach for the obvious move: use an LLM to grade the LLM. Show it a question and an answer, ask "how good is this, 1–10?", trust the number. It works shockingly well... right up until it doesn't, in ways that don't show up unless you go looking.

I built that judge from scratch and checked it against a dataset that comes with real human votes: the LMSYS Chatbot Arena conversations (via the ungated mirror agie-ai/lmsys-chatbot_arena_conversations, so this runs cold on Kaggle). Each row is a real user prompt, two chatbot answers, and a human verdict on which was better.

The judge is one prompt and a regex

import re

JUDGE_RUBRIC = (
    'You are grading the quality of an answer to a question. '
    'Score from 1 (terrible) to 10 (excellent) based on correctness and helpfulness. '
    'Respond in EXACTLY this format:\nSCORE: <number>\nREASON: <one short sentence>'
)

def judge(question, answer, temperature=0.0):
    prompt = f'{JUDGE_RUBRIC}\n\nQUESTION:\n{question}\n\nANSWER:\n{answer}\n\nYour grade:'
    reply = generate(prompt, max_new_tokens=64, temperature=temperature)
    m = re.search(r'SCORE:\s*([0-9]+(?:\.[0-9]+)?)', reply)
    return (float(m.group(1)) if m else float('nan')), reply

That's it — Qwen2.5-1.5B-Instruct reading one answer and emitting a number. The rest of the notebook is about not trusting it blindly. Note the rubric is deliberately naive ("correctness and helpfulness, 1–10") — it's the lazy version most people actually write, which is the point.

Failure #1 — It barely used the scale

I had the judge score one unchanged answer eight times at a realistic temperature:

scores = [judge(sample_q, sample_a, temperature=0.7)[0] for _ in range(8)]
# [8.0, 7.0, 8.0, 7.0, 8.0, 7.0, 8.0, 8.0]
# range: 7-8 | stdev: 0.48

Two problems. First, the score isn't stable — same answer, different numbers. Second, and worse: a "1–10" judge that only ever emits 7 or 8 isn't really using a 10-point scale. It has almost no resolution to separate "good" from "great." So when you A/B-test two prompts and one scores 7.6 vs 7.9, that gap is noise dressed up as a decimal.
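The spread figures are easy to recompute from the eight scores listed above. The sketch below uses the population standard deviation, which reproduces the 0.48 figure; the sample standard deviation of the same numbers is about 0.52.

```python
import statistics

scores = [8.0, 7.0, 8.0, 7.0, 8.0, 7.0, 8.0, 8.0]

def spread(values):
    """Summarize repeated judge scores for one answer."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "stdev": statistics.pstdev(values),   # population standard deviation
    }

summary = spread(scores)
print(summary["min"], summary["max"], round(summary["stdev"], 2))  # -> 7.0 8.0 0.48
```

Reporting min, max, and standard deviation next to the mean makes it obvious when a difference between two prompts is smaller than the judge's own run-to-run wobble.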

Failure #2 — It didn't agree with humans

For each pair, I scored answer A and answer B independently (the judge never sees both at once — this avoids position bias entirely), took the higher score as the judge's pick, and compared to the human winner:

for p in pairs[:60]:
    s_a, _ = judge(p['question'], p['ans_a'])
    s_b, _ = judge(p['question'], p['ans_b'])
    judge_pick = 'tie' if s_a == s_b else ('model_a' if s_a > s_b else 'model_b')
    ...
# Pairs scored: 60 (judge gave equal scores on 20 of them)
# On the 40 it scored decisively, it agreed with the HUMAN winner: 26/40 = 65%

Read those two numbers together:

20 of 60 were ties — on a third of the pairs, the judge gave both answers the same score even though a human saw a clear winner. (Remember that 7–8 band? When everything scores 7 or 8, lots of things tie.) It was blind to a difference real people could see.
65% agreement on decisive calls — better than a coin flip, but it disagreed with humans on more than 1 in 3 of its confident calls.

Count the ties as misses and the judge lined up with human judgment on just 26/60 = 43% of all pairs.
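The two agreement numbers differ only in their denominator. A small sketch, with a hypothetical agreement function and synthetic records built to match the counts above (20 ties, 26 agreements, 14 disagreements):

```python
def agreement(records):
    """records: list of (judge_pick, human_winner) pairs.
    judge_pick is 'model_a', 'model_b', or 'tie'."""
    decisive = [(j, h) for j, h in records if j != "tie"]
    hits = sum(1 for j, h in decisive if j == h)
    return {
        "ties": len(records) - len(decisive),
        "decisive": len(decisive),
        "decisive_agreement": hits / len(decisive) if decisive else float("nan"),
        "overall_agreement": hits / len(records) if records else float("nan"),
    }

# Synthetic records that reproduce the counts reported in this post.
records = (
    [("tie", "model_a")] * 20
    + [("model_a", "model_a")] * 26
    + [("model_b", "model_a")] * 14
)
stats = agreement(records)
print(stats["ties"], round(stats["decisive_agreement"], 2),
      round(stats["overall_agreement"], 2))  # -> 20 0.65 0.43
```

Quoting only the decisive figure hides the ties, so it is worth reporting both numbers along with the tie count.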

The receipt that stung

The disagreement cases tell you why it fails. My favorite:

Q : When is it today?
  judge scores -> answer_a: 3, answer_b: 10  => judge picked model_b
  but HUMANS preferred: model_a

The model has no idea what day it is, so a confident date is the wrong answer. The human caught that. The judge gave the confident-wrong answer a 10 and the honest hedge a 3. It wasn't grading correctness — it was grading confidence. (Other receipts showed the same thing: 1-vs-2 and 8-vs-7 "decisions" that were really just noise around a tie.)

The point

This notebook scored nothing new about a model. It audited the judge — the thing handing out the scores — and found two failures hiding behind a clean-looking number: it disagreed with itself run to run, and it agreed with people only 43% of the time.

The fix isn't "don't use judges." It's evaluate your evaluator: grade with repeats and report the spread, not a single number, and calibrate against human labels before you trust the judge on data nobody has labeled.

What's next

Part 3: two obvious ways to fix a bad judge — a bigger model and a better rubric. I run all four combinations against the same human votes and measure how much each lever actually moves the needle. (The cheaper fix does more of the work than you'd expect.)

📓 Full runnable notebook on Kaggle: https://www.kaggle.com/code/sumannath88/ep02-llm-as-a-judge

Built with PyTorch + Hugging Face Transformers. Data: LMSYS Chatbot Arena (ungated mirror). Questions or corrections welcome in the comments.

Breaking down the accuracy number: Building an LLM Eval Harness From Scratch

Suman Nath — Fri, 26 Jun 2026 06:32:00 +0000

In my last series I fine-tuned models and kept quoting one proud number: ~96% accuracy. This series is about the thing I didn't do carefully enough back then — actually checking what that number meant.

Here's the trap. Accuracy is a single number trying to summarize a task with many possible answers. It blends the cases the model nails together with the ones it quietly fails, and hands you back one confident percentage. So I built a small eval harness from scratch — no evaluate, no lm-eval-harness — and ran it on the base Qwen2.5-1.5B-Instruct (no fine-tuning, so anyone can run it cold on a Kaggle T4).

The point isn't "frameworks are bad." It's that once you write the loop yourself, you understand exactly what those frameworks are doing for you — and why a single metric can hide a broken model.

The task and the loop

I reused the Banking77 intent-classification dataset from my fine-tuning series (77 customer-support intents) and the same parse_prediction() helper, so the parsing here is identical to what I trained/served with. Evaluating with a different parser than you served with is a classic way to produce numbers that don't mean anything.

N_EVAL = 400
LABELS_BLOCK = ', '.join(label_names)   # give the base model the label space

def predict_one(query: str) -> str:
    prompt = build_chat_prompt(tokenizer, query)
    prompt = prompt.replace('and nothing else.',
                            f'and nothing else. Valid intents: {LABELS_BLOCK}.')
    inputs = tokenizer(prompt, return_tensors='pt').to(DEVICE)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=16,
                             do_sample=False, pad_token_id=tokenizer.eos_token_id)
    gen = tokenizer.decode(out[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
    return parse_prediction(gen, label_names)

Greedy decoding (do_sample=False) keeps the eval deterministic. Now the fun part: score the same predictions five different ways.

Metric #1 — Accuracy (the number everyone quotes)

from sklearn.metrics import accuracy_score
acc = accuracy_score(y_true, y_pred)
# Accuracy: 50.0%
# Unparseable / no-match predictions: 0 (0.0%)

50%. Mediocre, but "functional" — the kind of number you'd note and move on from. And notably, 0% unparseable: every prediction was a clean, valid intent label. By every surface check, the model looked fine.

Metric #2 — Per-class precision / recall / F1

Now the same predictions, broken out by intent:

from sklearn.metrics import classification_report
report = classification_report(y_true, y_pred, labels=label_names,
                               output_dict=True, zero_division=0)

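The ten worst intents had F1 = 0.0 and support = 0.0. Strictly, scikit-learn's support counts the true examples of a class in y_true, not how often it was predicted, so the "never predicted" claim comes from counting labels in y_pred: the model predicted these intents literally zero times. It wasn't getting them wrong. It was pretending they didn't exist.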
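A never-predicted list can be built directly from the predictions, independent of the report. The sketch below uses a hypothetical helper name and toy labels rather than Banking77 data.

```python
from collections import Counter

def never_predicted(y_true, y_pred, labels):
    """Return (label, true_count) for every label the model never output."""
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    return [(lab, true_counts[lab]) for lab in labels if pred_counts[lab] == 0]

labels = ["card_lost", "refund", "top_up"]          # toy label space
y_true = ["card_lost", "refund", "refund", "top_up"]
y_pred = ["card_lost", "card_lost", "card_lost", "card_lost"]

print(never_predicted(y_true, y_pred, labels))
# -> [('refund', 2), ('top_up', 1)]
```

Pairing each missing label with its true count separates classes the model ignored despite seeing examples from classes that simply did not occur in the eval sample.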

Metric #3 — Macro vs. micro F1

This is where the headline falls apart:

from sklearn.metrics import f1_score
micro = f1_score(y_true, y_pred, labels=label_names, average='micro', zero_division=0)  # 50.0%
macro = f1_score(y_true, y_pred, labels=label_names, average='macro', zero_division=0)  # 7.5%
# Gap: 42.5 percentage points

Micro F1: 50%. Macro F1: 7.5%. That 42.5-point gap is the story. Micro lets the common classes dominate; macro weights every intent equally, so all the abandoned classes drag it to the floor. When micro ≫ macro, your model is carrying a few common cases and ignoring the rest.
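To see the micro/macro gap without any library, here is a small sketch that computes both by hand on a toy "collapsed" classifier. The data and function names are made up for illustration; the zero-denominator handling mirrors the zero_division=0 convention used above.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for one label, returning 0.0 when the label has no TP, FP, or FN."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(y_true, y_pred, labels):
    # Macro: unweighted mean of per-class F1, so every class counts equally.
    macro = sum(per_class_f1(y_true, y_pred, name) for name in labels) / len(labels)
    # Micro: for single-label classification over all labels this equals accuracy.
    micro = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return micro, macro

# Toy data: class "a" is common and the model always predicts "a".
toy_labels = ["a", "b", "c", "d"]
toy_true = ["a"] * 7 + ["b", "c", "d"]
toy_pred = ["a"] * 10

micro, macro = micro_macro_f1(toy_true, toy_pred, toy_labels)
print(f"micro={micro:.2f} macro={macro:.2f}")  # micro=0.70 macro=0.21
```

Three of the four classes score zero, so macro F1 falls to about 0.21 while micro F1 stays at 0.70. The Banking77 result shows the same pattern at a larger scale.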

Metric #4 — The confusion matrix

Accuracy says how often. The confusion matrix says what gets mistaken for what — and it's the only view that shows the model collapsing 77 intents into a handful of favorites:

from sklearn.metrics import confusion_matrix
import seaborn as sns
cm = confusion_matrix(y_true, y_pred, labels=label_names)
sns.heatmap(cm, cmap='magma', square=True)

The matrix had almost no diagonal, bright vertical streaks (the model's favorite default labels, absorbing many true intents), and a near-empty lower half (dozens of intents never predicted at all). I also ranked the off-diagonal cells to name the worst confusions in plain English — the actually-actionable output.
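Ranking the off-diagonal cells does not need the matrix itself; counting (true, predicted) pairs gives the same answer. A minimal sketch with made-up predictions (the helper name top_confusions is an assumption, not the notebook's):

```python
from collections import Counter

def top_confusions(y_true, y_pred, k=3):
    """Return the k most frequent (true, predicted) pairs where the two differ."""
    mistakes = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return mistakes.most_common(k)

# Hypothetical predictions, for illustration only.
toy_true = ["card_arrival", "card_arrival", "exchange_rate",
            "card_arrival", "exchange_rate"]
toy_pred = ["card_delivery_estimate", "card_delivery_estimate", "card_arrival",
            "card_arrival", "exchange_rate"]

for (true_label, pred_label), count in top_confusions(toy_true, toy_pred):
    print(f"{count}x: '{true_label}' predicted as '{pred_label}'")
```

Each printed line is a sentence you can act on: relabel, merge, or add training examples for that pair.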

The point

I scored the exact same predictions five ways and got five different impressions:

Accuracy said 50% — "functional."
Macro F1 said 7.5% — the model abandoned most classes.
Per-class F1 / never-predicted list named which classes, with zero recall.
The confusion matrix showed what it collapses into what.
0% unparseable meant none of this showed up in any surface check.

A metric doesn't just measure your model. It decides what you're allowed to notice. Pick the wrong one and you'll ship blind spots you never knew were there — not from carelessness, but because your one number was never capable of showing them to you.

What's next

Part 2: when there's no label to compare against — paragraphs, summaries, support replies — people reach for an LLM to grade the LLM. I build that judge from scratch and check whether it agrees with actual humans. (Spoiler: not as often as you'd hope.)

📓 Full runnable notebook on Kaggle: https://www.kaggle.com/code/sumannath88/ep01-eval-harness-from-scratch

Built with PyTorch + Hugging Face Transformers + scikit-learn. Questions or corrections welcome in the comments.

If a 270M Model Already Worked, Why Did I Fine-Tune a 7B One?

Suman Nath — Sun, 21 Jun 2026 12:23:55 +0000

Over three posts I built three fine-tuned models for the same banking-intent task — full fine-tuning a 270M model, LoRA on 1.5B, QLoRA on 7B. They all landed around the same accuracy.

Which raises an honest, slightly uncomfortable question: if a 270M model on my laptop already worked, why reach for a 7B model at all?

The answer most "bigger is better" content skips

For this task — you wouldn't. A good engineer picks the smallest model that clears the bar, not the biggest one available. The small model is cheaper to serve, runs in milliseconds, and you fully own it. Choosing the 7B here would be over-engineering.

Reaching for a bigger model isn't a flex. It's a response to a requirement the small one can't meet. Here are the four cases where small stops being enough:

1. The task is genuinely hard

Banking77 is easy — 77 fixed labels, short clean queries. Small models saturate it. But ask for reasoning ("which of these three issues is the primary one?"), open-ended generation (write the reply, don't just classify), or real nuance, and there's a capability floor that more parameters buy. No amount of fine-tuning gives a 270M model abilities it doesn't have.

2. You have little data

I had ~10,000 labeled examples — plenty for a small model. With 50, a small model can't learn the task, but a 7B model already "knows" banking concepts from pretraining and only needs a nudge. Bigger models need less task data because they bring more prior knowledge.

3. You need one model for many tasks

This is the quiet superpower of LoRA/QLoRA. A single frozen 7B base can host dozens of swappable adapters — intent classifier, reply writer, summarizer, sentiment — all from one ~5GB footprint in memory. The 270M is single-purpose. This is why companies serve hundreds of fine-tunes from one base model.

4. Accuracy compounds at scale

93% means 7 in 100 queries misrouted. At 10M queries/month, that's 700,000 mistakes. If each costs a support escalation, the 2–3 points a bigger model buys can pay for itself many times over. At small scale, nobody notices. At large scale, it's the whole budget.
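The arithmetic is worth writing down once, because it is the whole business case. A small sketch; the 95% figure is a hypothetical "two points better" model, not a measured result.

```python
def monthly_misroutes(queries_per_month: int, accuracy: float) -> int:
    """Expected number of misrouted queries per month at a given accuracy."""
    return round(queries_per_month * (1.0 - accuracy))

QUERIES = 10_000_000
baseline = monthly_misroutes(QUERIES, 0.93)
improved = monthly_misroutes(QUERIES, 0.95)   # hypothetical +2 points
avoided = baseline - improved

print(baseline, improved, avoided)  # 700000 500000 200000
```

Multiply the avoided count by the cost of one escalation and compare it with the extra serving cost of the bigger model; that comparison, not the accuracy delta alone, decides the question.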

So why did I build all three?

Not to beat the small model — they tied. And that tie is the lesson: on an easy task, the technique barely matters.

I built them to learn the techniques, so that when I hit a task where small isn't enough, I can fine-tune a 7B model on a 16GB card without flinching. It's like learning to change a tire in your driveway on a sunny day. The driveway didn't need it — but now I can do it on the highway, in the rain.

The bug no model size could fix

One thread ran through all three projects. Every model — 270M, 1.5B, 7B — confused the same two intents: card_arrival and card_delivery_estimate. Three model scales, three techniques, the same mistake in every confusion matrix.

That's not a capacity problem you can buy your way out of. "Where's my card?" and "when will my card arrive?" genuinely overlap — the ambiguity is in the labels themselves, not the model. Three models agreeing on a "mistake" is a strong signal the data, not the model, is the limit.

Sometimes the answer isn't a bigger model. It's better data.

That might be the most useful thing I learned across the whole series.

The takeaway, as a decision

Is the small model good enough for the actual requirement?
   YES → ship it. Bigger is wasted cost, latency, and complexity.
   NO  → why not?
          capability ceiling? → bigger base model
          too little data?    → bigger base + LoRA (needs less data)
          many tasks at once? → one big base + swappable adapters

Match the model to the requirement. That instinct is worth more than any of the fine-tuning mechanics in Parts 1–3.

📓 All three notebooks on Kaggle: https://www.kaggle.com/work/collections/18659493

Thanks for reading the series. If it was useful, a reaction or a comment helps it reach the next person debugging their first OOM.

QLoRA: Fine-Tuning a 7B Model on a 16GB GPU (It Shrank to 5.4GB in Front of Me)

Suman Nath — Sun, 21 Jun 2026 12:20:53 +0000

In Part 2, LoRA let me fine-tune a 1.5B model by freezing it and training tiny adapters. But the frozen base still sat in memory in 16-bit (~3GB). Now I wanted to go to Qwen2.5-7B — and hit a wall that LoRA alone doesn't solve.

The problem

A 7B model is ~15GB in 16-bit precision. A free-tier T4 GPU has 16GB. It would barely load, with no room left to actually train.

The QLoRA insight

QLoRA asks the question that naturally follows from LoRA: the base is frozen and only ever read — so why store it in full precision?

So you quantize the frozen base to 4-bit (NF4, a format tuned for how neural-net weights are distributed) and run the LoRA adapters on top in normal precision. The base shrinks dramatically; the trainable part stays small and precise.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4
    bnb_4bit_use_double_quant=True,        # quantize the quant constants too
    bnb_4bit_compute_dtype=torch.float16,  # dequantize to fp16 for the matmuls
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=bnb_config, device_map="auto")

Each flag earns its place:

load_in_4bit — store frozen weights in 4 bits instead of 16.
nf4 — a 4-bit type matched to the bell-curve distribution of neural-net weights (better than plain int4).
double_quant — quantize the quantization constants too, for a bit more savings.
compute_dtype — dequantize to fp16 for the actual matmuls, so storage is 4-bit but compute stays precise.
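To make "store in 4 bits, compute in higher precision" concrete, here is a deliberately simplified sketch of blockwise absmax quantization. It uses a uniform integer grid, which is an assumption made for readability: NF4 uses non-uniformly spaced levels, and the real bitsandbytes kernels differ in many details.

```python
def quantize_block(weights, levels=7):
    """Symmetric absmax quantization of one block to integers in [-levels, levels].

    With levels=7 there are 15 possible codes, which fit in 4 bits.
    """
    scale = max(abs(w) for w in weights) or 1.0   # one float constant per block
    codes = [round(w / scale * levels) for w in weights]
    return codes, scale

def dequantize_block(codes, scale, levels=7):
    """Recover approximate float weights for use in the matmul."""
    return [c / levels * scale for c in codes]

# Made-up weights for one small block.
block = [0.02, -0.15, 0.07, 0.31, -0.28, 0.0, 0.11, -0.05]
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
max_err = max(abs(a - b) for a, b in zip(block, restored))

print(codes)
print(f"max error {max_err:.4f} with scale {scale}")
```

The per-block scale is the "quantization constant" that double quantization compresses further. The rounding error is bounded by half a grid step, here scale / 14.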

The moment it clicked

One line of output:

loaded in 4-bit. footprint: 5.44 GB

I downloaded 15.2GB of weights and they sat in memory as 5.44GB. A model that would barely load in 16-bit, let alone train, was now training on a single 16GB T4 with room to spare. (The download is still 15GB; bitsandbytes quantizes on the fly during load.)
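A back-of-the-envelope check of those sizes, as a sketch. The 7.6e9 parameter count is my assumption for a "7B" model of this family; the function only counts weight storage and ignores activations, optimizer state, and quantization constants.

```python
def weight_gb(n_params: float, bits_per_param: float) -> float:
    """Storage for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 7.6e9  # assumption: approximate parameter count, not read from the model

fp16_gb = weight_gb(N_PARAMS, 16)
int4_gb = weight_gb(N_PARAMS, 4)
print(f"16-bit: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")  # 16-bit: 15.2 GB, 4-bit: 3.8 GB
```

The measured 5.44GB sits above the naive 4-bit figure. A plausible explanation, which I have not verified against this checkpoint, is that not every tensor is quantized (embeddings typically stay in 16-bit) and the per-block quantization constants add some overhead.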

The QLoRA-standard recipe

Three more pieces beyond Part 2's LoRA setup: prepare the quantized model for training, target all linear layers (the QLoRA paper found this matters), and use a paged 8-bit optimizer:

from peft import prepare_model_for_kbit_training
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
# ... attach LoRA to every linear layer ...
TrainingArguments(optim="paged_adamw_8bit", gradient_checkpointing=True, ...)

It's slow — and that's fine

A 7B forward pass through 4-bit weights with gradient checkpointing is heavy: ~1 hour for one epoch on a T4, ~3 examples/second. But QLoRA isn't about speed — it's about fit. The model runs at all, on hardware that couldn't otherwise hold it. That's the entire point.

⚠️ Hardware note: bitsandbytes 4-bit is CUDA-first. It does not run on Apple MPS, and AMD/ROCm support exists but is less mature. Run this one on an NVIDIA GPU (Kaggle/Colab T4 works).

Result

QLoRA accuracy: 92.848% (4-bit base was 16.000%)
macro-F1: 0.928

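At roughly 93%, it landed within a few points of the ~96% from the smaller models in Parts 1 and 2: a rough tie, and certainly no win for the bigger model.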

And the card_arrival vs card_delivery_estimate confusion that haunted both smaller models? Still there at 7B: the same two intents blurred together in the confusion matrix. That sets up the question I tackle in Part 4: if the 270M model already worked, why did I build any of this?

📓 Full runnable notebook on Kaggle: https://www.kaggle.com/code/sumannath88/03-qlora-qwen2-5-7b

Built with PyTorch + Transformers + PEFT + bitsandbytes. Questions or corrections welcome in the comments.

LoRA: I Trained <1% of a 1.5B Model and Matched a Full Fine-Tune

Suman Nath — Sun, 21 Jun 2026 12:17:16 +0000

In Part 1 I fully fine-tuned a 270M model — updating every weight. That's fine for a tiny model. It gets painful as models grow, because full fine-tuning needs gradients and optimizer state for every parameter (~4× the model size in memory).

So: what do you do when the model is too big to comfortably fine-tune all of?

The idea behind LoRA

LoRA (Low-Rank Adaptation) rests on one observation: the change fine-tuning makes to a weight matrix is "low rank" — it lives in a small subspace. You don't need to learn the full update ΔW; you can learn it as the product of two skinny matrices, B·A:

output = W·x  +  (B·A)·x
         ↑frozen    ↑trainable (tiny)

For a single 1536×1536 layer at rank 16, that's about 49,000 trainable numbers instead of ~2.4 million. And you freeze the entire base model — only the adapters train. B starts at zero, so at step 0 the model behaves exactly like the original and training nudges it from there.
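The parameter count and the "B starts at zero" property are both easy to verify by hand. A minimal pure-Python sketch with a tiny 3×3 layer; the helper names are illustrative and this is not how PEFT implements it internally.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(W, A, B, x):
    """output = W·x + B·(A·x); W is frozen, A and B are the trainable adapters."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))   # never materializes the full B·A matrix
    return [b + u for b, u in zip(base, update)]

def lora_param_count(d_in, d_out, r):
    """A has shape r × d_in and B has shape d_out × r."""
    return r * d_in + d_out * r

# Tiny example: d = 3, rank 1, B initialized to zero.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]
A = [[0.5, -0.5, 0.25]]
B = [[0.0], [0.0], [0.0]]
x = [1.0, 2.0, 3.0]

print(lora_forward(W, A, B, x) == matvec(W, x))            # True
print(lora_param_count(1536, 1536, 16), 1536 * 1536)       # 49152 2359296
```

With B at zero the adapter contributes nothing, so training starts from exactly the pretrained behavior. The second line reproduces the 49,000 vs 2.4 million comparison.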

The config

from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                 # rank — adapter capacity
    lora_alpha=32,        # scaling; effective scale = alpha / r
    lora_dropout=0.05,
    target_modules=["q_proj","k_proj","v_proj","o_proj",
                    "gate_proj","up_proj","down_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# -> trainable params are ~1% of the model. The other 99% is frozen.

I ran this on Qwen2.5-1.5B-Instruct — 5× bigger than the Gemma model from Part 1. Same Banking77 task. Then the GPU fought back.

Wall #1: `ValueError: Attempting to unscale FP16 gradients`

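I'd loaded the model in fp16 to save memory. Wrong move: with fp16 training enabled, the gradient scaler expects the trainable weights (and therefore their gradients) to be fp32, and raises this error when they are fp16. Mixed precision is applied at train time by the trainer, not baked into the loaded weights.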

# load weights in fp32; let the Trainer's AMP do fp16 during training
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float32)
# and set fp16=True in TrainingArguments (on CUDA) for the mixed-precision part

Wall #2: CUDA out of memory at batch size 64

Adapter training still holds activations and optimizer state. Fix: smaller batch + gradient accumulation (keeps the effective batch) + gradient checkpointing (recompute activations in the backward pass):

per_device_train_batch_size=16,
gradient_accumulation_steps=2,     # effective batch 32 per device, lower peak memory
gradient_checkpointing=True,       # ~30% more compute, big memory savings
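Why accumulation preserves the effective batch: averaging the gradients of equal-sized micro-batches gives the same gradient as one large batch. A toy sketch with a one-parameter linear model and made-up data (real trainers do this on tensors, but the arithmetic is the same):

```python
def mse_grad(w, batch):
    """Gradient of mean squared error for the model y ≈ w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

# Made-up (x, y) pairs.
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 9.0)]
w = 0.5

# One step on the full batch of 4.
full = mse_grad(w, data)

# Two micro-batches of 2, each gradient scaled by 1 / accumulation_steps.
micro_batches = [data[:2], data[2:]]
accumulated = sum(mse_grad(w, mb) / len(micro_batches) for mb in micro_batches)

print(full, accumulated)  # -25.0 -25.0
```

Peak memory follows the micro-batch size, because only one micro-batch of activations is alive at a time, while the optimizer sees the gradient of the larger batch.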

Wall #3: my laptop and a cloud GPU showed the same speed

This one was sneaky. My Mac (MPS) and a Kaggle T4 reported nearly identical it/s. How is a datacenter GPU no faster than a laptop?

It wasn't. The Kaggle session had 2 GPUs running data-parallel — each step processed 2× the data, so the total step count halved (626 vs 1250) while it/s stayed flat. The fix isn't code, it's how you read the number: compare examples/second, never iterations/second. Once I did, the GPU was clearly ~3× faster.
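Converting it/s into examples/second is one multiplication, so it is worth doing before comparing any two runs. A sketch with hypothetical readings; the 1.5 it/s value is invented for illustration and is not a measurement from these runs.

```python
def examples_per_second(it_per_s, per_device_batch, n_devices=1, grad_accum=1):
    """Throughput in examples/s when one iteration is one optimizer step."""
    return it_per_s * per_device_batch * n_devices * grad_accum

# Hypothetical: both progress bars show the same 1.5 it/s.
laptop = examples_per_second(1.5, per_device_batch=16, n_devices=1)
cloud = examples_per_second(1.5, per_device_batch=16, n_devices=2)

print(laptop, cloud, cloud / laptop)  # 24.0 48.0 2.0
```

Identical it/s hides a 2× difference in this toy case purely from the device count. The sketch assumes the progress bar counts optimizer steps; if yours counts micro-batches, leave grad_accum at 1.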

Result

~96% accuracy again — a frozen 1.5B model + a few-MB adapter matched the fully-fine-tuned 270M model from Part 1, with a saved artifact roughly 1000× smaller.

And that card_arrival vs card_delivery_estimate confusion from Part 1? Still there. Bigger model, different technique, identical mistake. (We resolve that mystery in Part 4.)

What's next

Part 3: I fit a 7-billion-parameter model onto a 16GB GPU that can't even load it normally. That's QLoRA.

📓 Full runnable notebook on Kaggle: https://www.kaggle.com/code/sumannath88/02-lora-qwen2-5-1-5b

Built with PyTorch + Hugging Face Transformers + PEFT. Questions or corrections welcome in the comments.

I Fine-Tuned a 270M Model on My Laptop (Full Fine-Tuning, From Scratch)

Suman Nath — Sun, 21 Jun 2026 12:08:45 +0000

I wanted to actually understand fine-tuning — not run a tutorial and nod along. So I gave myself a constraint: same task, three techniques, smallest model to largest. Full fine-tuning, then LoRA, then QLoRA. Hold the task fixed and the only variable is the method.

This first post is full fine-tuning — the most powerful and most expensive option: update every weight in the model.

The task

Banking77: ~13,000 real bank customer-support messages, 77 intents like card_arrival, lost_or_stolen_card, exchange_rate. The model reads a message and names the intent.

The model: deliberately tiny

I picked Gemma 3, 270M parameters — small enough to fully fine-tune on a laptop (Apple Silicon / MPS). That's intentional: full fine-tuning stores gradients and optimizer state for every parameter, roughly 4× the model's size in memory. I wanted to feel that, not read about it.

One design decision: generate the label, don't classify it

The obvious approach is to bolt a 77-way classification head onto the model. I didn't. Instead I had the model generate the intent as text — literally output card_arrival. Why? Because that's the same shape as instruction-tuning, so the later LoRA/QLoRA projects build naturally on this one.

The key detail is masking the loss so the model is graded only on the label tokens, not the prompt:

# build "prompt + label", but set prompt tokens to -100 so the loss ignores them
prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
target_ids = tokenizer(" " + label_name + tokenizer.eos_token,
                       add_special_tokens=False)["input_ids"]
input_ids = prompt_ids + target_ids
labels    = [-100] * len(prompt_ids) + target_ids   # only the label is graded

If you skip that masking, the model spends its capacity learning to reproduce the prompt instead of the answer.
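What the -100 actually does can be shown without a model: positions labeled -100 (the default ignore_index of PyTorch's cross-entropy loss) are dropped before the loss is averaged. A simplified sketch; the probabilities and token ids are made up, and real implementations work on logits and also shift the labels by one position.

```python
import math

IGNORE_INDEX = -100

def masked_nll(token_probs, labels):
    """Mean negative log-likelihood over positions whose label is not IGNORE_INDEX.

    token_probs[i] is the probability the model gave to the correct token at i.
    """
    graded = [-math.log(p) for p, label in zip(token_probs, labels)
              if label != IGNORE_INDEX]
    return sum(graded) / len(graded)

# 3 prompt positions (masked) followed by 2 label positions (graded).
labels = [-100, -100, -100, 4127, 1]
token_probs = [0.01, 0.02, 0.05, 0.5, 0.25]

loss = masked_nll(token_probs, labels)
print(f"{loss:.4f}")  # 1.0397
```

The very low probabilities on the prompt positions never reach the loss, so no gradient is spent teaching the model to reproduce the prompt.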

The thing that surprised me: full FT is fragile

Because you're updating all the pretrained weights, a too-high learning rate shreds the model's existing knowledge. I used 5e-5 and it trained cleanly. Bumping to 2e-4 destabilized it. The training config is otherwise unremarkable — and that's the point:

TrainingArguments(
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=5e-5,            # small, on purpose
    lr_scheduler_type="cosine",
    bf16=False, fp16=False,        # fp32 on MPS for stability
)

(The later projects freeze the base, which is exactly why they can tolerate a much higher learning rate: the pretrained weights never change, so there's no fragile knowledge to wreck.)

Result

~96% on the common intents. A near-perfect diagonal confusion matrix. A 270M model, fully fine-tuned on a laptop, nailing the task.

The one persistent slip: it confused card_arrival with card_delivery_estimate. Keep that in mind — it shows up in every project in this series, and the reason why is the punchline of Part 4.

What's next

In Part 2, I take a model 5× bigger and train less than 1% of it — and get the same accuracy. That's LoRA.

📓 Full runnable notebook on Kaggle: https://www.kaggle.com/code/sumannath88/01-full-finetune-gemma270m

Built with PyTorch + Hugging Face Transformers. Questions or corrections welcome in the comments.
