You wired up an LLM, pointed it at a user question about your own product, and it confidently invented an API endpoint that doesn't exist. Welcome to the moment every engineer hits: the base model is smart but knows nothing about your domain.
There are two mainstream fixes - RAG and fine-tuning - and the internet is full of hand-wavy "it depends" answers. This post is the opposite: when retrieval beats retraining, when it doesn't, what each actually looks like in code, and the gotchas I've hit shipping both.
The one-line mental model
RAG = the model looks things up at inference time. Knowledge lives outside the weights.
Fine-tuning = you bake new behavior into the weights via training. Knowledge/behavior lives inside the model.
The single most useful question to disambiguate them:
Is my problem that the model doesn't know my facts, or that it doesn't behave the way I want?
Knowledge gap → RAG. Behavior gap → fine-tuning. Most "we need fine-tuning" requests are actually knowledge gaps in disguise.
RAG in code (runnable)
The whole RAG loop is: embed your docs → store the vectors → at query time, embed the question, find the nearest chunks (semantic search), stuff them into the prompt.
This example runs as-is. Embeddings use sentence-transformers (local, no API key); generation uses Claude. Swap the embedding model for a hosted one (OpenAI, Voyage, Cohere) by replacing the two embedder.encode(...) calls.
pip install sentence-transformers numpy anthropic
export ANTHROPIC_API_KEY=sk-... # for the generation step
import numpy as np
from sentence_transformers import SentenceTransformer
from anthropic import Anthropic
embedder = SentenceTransformer("all-MiniLM-L6-v2") # small, fast, local
client = Anthropic() # reads ANTHROPIC_API_KEY from env
# 1. Offline: chunk + embed your knowledge base
docs = [
    "Refunds are processed within 5 business days.",
    "Enterprise plans include SSO and a 99.9% SLA.",
    "The API rate limit is 100 requests per minute on the Pro tier.",
]
doc_vecs = embedder.encode(docs, normalize_embeddings=True) # (n_docs, dim)

def retrieve(query, k=2):
    q = embedder.encode([query], normalize_embeddings=True)[0]
    sims = doc_vecs @ q # cosine sim (vectors are normalized)
    top = sims.argsort()[-k:][::-1] # indices of the k highest scores, best first
    return [docs[i] for i in top]
# 2. Online: retrieve, then generate grounded in retrieved context
def answer(query):
    context = "\n".join(retrieve(query))
    msg = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": (
                "Answer using ONLY the context below. "
                "If it's not in the context, say you don't know.\n\n"
                f"Context:\n{context}\n\nQuestion: {query}"
            ),
        }],
    )
    return msg.content[0].text
print(answer("What's the rate limit on Pro?"))
# -> The API rate limit is 100 requests per minute on the Pro tier.
In production you'd swap the in-memory NumPy search for a real vector database (pgvector, Qdrant, Weaviate, Pinecone) so the index persists between runs and search stays fast as the corpus grows - but the logic above is RAG. It's nothing more mystical than "nearest-neighbor search + prompt assembly."
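The offline step above is labeled "chunk + embed", but the sample docs are already one sentence each. For longer documents, a minimal word-based chunker with overlap might look like the sketch below. The names `chunk_text`, `chunk_size`, and `overlap` are illustrative, and splitting on whitespace is a stand-in for whatever tokenizer your embedding model uses.

```python
def chunk_text(text, chunk_size=200, overlap=40):
    """Split text into word-based chunks; each chunk repeats the last
    `overlap` words of the previous one so sentences cut at a boundary
    still appear whole in at least one chunk."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


# Tiny demo with made-up "words" so the overlap is easy to see.
sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_text(sample, chunk_size=4, overlap=1):
    print(chunk)
# w0 w1 w2 w3
# w3 w4 w5 w6
# w6 w7 w8 w9
```

Each chunk would then go through the same embedder.encode(...) call as the one-line docs in the example.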
Why engineers reach for it:
- Update knowledge by changing a document - no retraining.
- Answers are traceable; you know which chunk produced them.
- Sensitive data stays in your store, not in model weights.
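To make the traceability point concrete, here is a rough sketch of a retriever that returns the chunk index and similarity score alongside the text, and drops weak matches instead of passing them to the model. Everything here is an assumption for illustration: `embed` is a toy bag-of-words stand-in for the sentence-transformers embedder (so the snippet runs without a model download), and `retrieve_with_sources` and the `min_score=0.2` cutoff are hypothetical and would need tuning against real embeddings.

```python
import re
import numpy as np

docs = [
    "Refunds are processed within 5 business days.",
    "Enterprise plans include SSO and a 99.9% SLA.",
    "The API rate limit is 100 requests per minute on the Pro tier.",
]

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

# Toy stand-in embedder: normalized word counts over the docs' vocabulary.
vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in tokenize(d)}))}

def embed(text):
    vec = np.zeros(len(vocab))
    for word in tokenize(text):
        if word in vocab:
            vec[vocab[word]] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

doc_vecs = np.stack([embed(d) for d in docs])

def retrieve_with_sources(query, k=2, min_score=0.2):
    """Return up to k hits with chunk id, score, and text; drop weak matches."""
    sims = doc_vecs @ embed(query)
    top = sims.argsort()[-k:][::-1]
    return [
        {"chunk_id": int(i), "score": float(sims[i]), "text": docs[i]}
        for i in top
        if sims[i] >= min_score
    ]

for hit in retrieve_with_sources("What is the rate limit on the Pro tier?"):
    print(f"[chunk {hit['chunk_id']}] score={hit['score']:.2f} {hit['text']}")

# An off-topic question returns no hits, so the caller can answer
# "I don't know" without calling the model at all.
print(retrieve_with_sources("Do you support dark mode?"))  # []
```

Logging the chunk ids next to each generated answer is what lets you later ask "which document produced this?"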
Fine-tuning in code
Fine-tuning doesn't retrieve anything. You show the model hundreds (or more) of input→output examples until the desired behavior is baked in. The data format is the real work; the training call is almost an afterthought.
A typical supervised fine-tuning dataset is JSONL, one example per line:
{"messages": [{"role": "system", "content": "Classify the support ticket into: billing, bug, feature_request, other."}, {"role": "user", "content": "I was charged twice this month."}, {"role": "assistant", "content": "billing"}]}
{"messages": [{"role": "system", "content": "Classify the support ticket into: billing, bug, feature_request, other."}, {"role": "user", "content": "The export button does nothing on Safari."}, {"role": "assistant", "content": "bug"}]}
{"messages": [{"role": "system", "content": "Classify the support ticket into: billing, bug, feature_request, other."}, {"role": "user", "content": "Can you add dark mode?"}, {"role": "assistant", "content": "feature_request"}]}
And a representative training kickoff (provider SDKs vary, but the shape is consistent):
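# Adapt to your provider's fine-tuning SDK. The call shape below follows
# OpenAI's Python SDK: `client` is an OpenAI() client (not the Anthropic
# client used above), and training_file is the ID of an uploaded file,
# not a local path.
uploaded = client.files.create(
    file=open("ticket_classifier.jsonl", "rb"),
    purpose="fine-tune",
)
job = client.fine_tuning.jobs.create(
    training_file=uploaded.id,
    model="base-model-name", # placeholder: use a model your provider lets you tune
    hyperparameters={"n_epochs": 3},
)
# poll job.status until "succeeded", then call the resulting model id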
Why engineers reach for it:
- Consistent format/tone/structure at scale without re-explaining it every prompt.
- Specialized judgment that's hard to express as retrievable documents.
- Shorter prompts at inference (the behavior is internalized), which can cut per-call cost at high volume.
The catch: every time the requirement changes, you curate data and retrain again. Data prep is the real cost, not the GPU time.
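To put a rough shape on the "shorter prompts" point, here is a back-of-the-envelope calculation. Every number in it is made up for illustration (token counts, call volume, and price are not from any real provider), and `monthly_input_cost` is a hypothetical helper. It also ignores output tokens, training cost, and any price difference your provider charges between a base and a fine-tuned model.

```python
def monthly_input_cost(prompt_tokens, calls_per_month, usd_per_million_tokens):
    """Input-token spend per month for a fixed prompt size."""
    return prompt_tokens * calls_per_month * usd_per_million_tokens / 1_000_000


# Hypothetical scenario: a 1,200-token prompt full of instructions and
# few-shot examples vs. a 150-token prompt for a fine-tuned model.
CALLS = 2_000_000  # assumed calls per month
PRICE = 3.0        # assumed USD per million input tokens

long_prompt = monthly_input_cost(1200, CALLS, PRICE)
short_prompt = monthly_input_cost(150, CALLS, PRICE)

print(f"long prompt:  ${long_prompt:,.2f}/month")                  # $7,200.00/month
print(f"short prompt: ${short_prompt:,.2f}/month")                 # $900.00/month
print(f"difference:   ${long_prompt - short_prompt:,.2f}/month")   # $6,300.00/month
```

Plug in your own volume and pricing; at low call volume the difference rarely pays for the data prep.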
The decision checklist
Run down this list; stop at the first strong signal.
- Knowledge changes often? → RAG. Retraining on every doc update is masochism.
- Need source-cited / auditable answers? → RAG. Fine-tuned weights can't tell you why.
- The model keeps getting the format/tone/judgment wrong, not the facts? → Fine-tuning.
- Don't have a few hundred clean labeled examples? → RAG. Fine-tuning requires them; without them, RAG is the faster start.
- Still unsure? → Start with RAG. It's cheaper to build, easier to debug, and solves the most common failure (the model not knowing your stuff).
And the part most posts skip: they're not mutually exclusive. Mature systems fine-tune for behavior and bolt RAG on top for fresh facts. "RAG vs Fine-Tuning" is increasingly "RAG and Fine-Tuning."
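The checklist can be written down as a small function, which is handy for making the "stop at the first strong signal" ordering explicit. This is only one reading of the list: `choose_approach` and its parameter names are hypothetical, and treating "behavior gap but no labeled examples" as "start with RAG" is an interpretation of the fourth bullet.

```python
def choose_approach(*, knowledge_changes_often, needs_citations,
                    behavior_is_wrong, has_labeled_examples):
    """Walk the checklist top to bottom and stop at the first strong signal.

    Returns a (choice, reason) tuple.
    """
    if knowledge_changes_often:
        return ("RAG", "knowledge changes often")
    if needs_citations:
        return ("RAG", "answers must be source-cited")
    if behavior_is_wrong and has_labeled_examples:
        return ("fine-tuning", "behavior gap with labeled examples available")
    if behavior_is_wrong:
        return ("RAG", "behavior gap, but no labeled examples yet")
    return ("RAG", "default starting point")


choice, reason = choose_approach(
    knowledge_changes_often=False,
    needs_citations=False,
    behavior_is_wrong=True,
    has_labeled_examples=True,
)
print(choice, "-", reason)
# fine-tuning - behavior gap with labeled examples available
```

Note that the function returns one answer, while a mature system may end up using both approaches.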
Gotchas I've hit
Chunking quietly decides your accuracy. Bad chunk boundaries (splitting a table mid-row, 2000-token mega-chunks) wreck retrieval before the model ever sees the question. Tune chunk size and overlap before you blame the LLM.
RAG is not plug-and-play. "Retrieved the wrong context → confidently wrong answer" is the #1 RAG failure mode. Add an eval set early.
Fine-tuning a knowledge problem is the classic expensive mistake. If the fix is "teach it our current pricing," fine-tuning is slow, costly, and goes stale. Use RAG.
No eval = no progress. Build a small labeled test set before launch. Without it you're tuning blind and "it feels better" becomes your only metric.
Garbage in, confident garbage out. Both approaches amplify whatever you feed them. Clean the data first.
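For the eval-set gotchas, a starting point can be as small as a list of (question, expected chunk) pairs and a hit-rate metric over the retriever. The sketch below is illustrative only: `hit_rate_at_k`, `keyword_retrieve_ids`, and the four questions are made up, and the keyword-overlap retriever is a deliberately naive stand-in so the snippet runs without any model. In practice you would pass in a wrapper around your real retriever.

```python
import re

docs = [
    "Refunds are processed within 5 business days.",
    "Enterprise plans include SSO and a 99.9% SLA.",
    "The API rate limit is 100 requests per minute on the Pro tier.",
]

def tokenize(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_retrieve_ids(question, k):
    """Naive stand-in retriever: rank docs by number of shared words."""
    q = tokenize(question)
    ranked = sorted(range(len(docs)),
                    key=lambda i: len(q & tokenize(docs[i])),
                    reverse=True)
    return ranked[:k]

def hit_rate_at_k(eval_set, retrieve_ids, k=2):
    """Fraction of questions whose expected chunk id is in the top-k results."""
    if not eval_set:
        raise ValueError("eval_set is empty")
    hits = sum(1 for question, expected_id in eval_set
               if expected_id in retrieve_ids(question, k))
    return hits / len(eval_set)

# (question, index of the doc that should be retrieved)
eval_set = [
    ("How long do refunds take?", 0),
    ("Is SSO included in enterprise plans?", 1),
    ("How many calls can I make each minute?", 2),
    ("What is the uptime guarantee?", 1),  # paraphrase: shares no words with doc 1
]

print(hit_rate_at_k(eval_set, keyword_retrieve_ids, k=1))  # 0.75
```

The last question is the interesting one: it fails because the retriever matches "is" and "the" in the rate-limit doc, which is exactly the "retrieved the wrong context" failure the metric is meant to surface before users do.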
RAG vs Fine-Tuning: Key Differences
RAG
- Fixes: knowledge gaps
- Update knowledge: edit a document, no retraining
- Traceable answers: yes - you know which chunk produced them
- Upfront cost: lower
- Best start for most teams: yes
Fine-Tuning
- Fixes: behavior gaps
- Update knowledge: retrain
- Traceable answers: no
- Upfront cost: higher (data prep is the real cost)
- Best start for most teams: add later, once behavior is the blocker
Conclusion
The "RAG vs Fine-Tuning" debate isn't a turf war - it's a routing decision. Point a knowledge problem at RAG and a behavior problem at fine-tuning, and most of the confusion disappears.
For the large majority of teams, the right first move is RAG: it's cheaper to build, far easier to debug, ships in days instead of weeks, and directly solves the most common failure mode - the model not knowing your stuff. Reach for fine-tuning when the model already has the facts but keeps getting the format, tone, or judgment wrong, and you've got a few hundred clean examples to teach it. When you've earned the complexity, run both: fine-tune for behavior, layer RAG on top for fresh, traceable facts.
The expensive mistake to avoid is reaching for fine-tuning to fix a knowledge gap. It's slow, it goes stale, and a retrieval layer would've done the job in a fraction of the time. Start simple, measure with a real eval set, and only add weight to the system when the evidence says you need it.