DEV Community

EncodeDots Technolabs


RAG vs Fine-Tuning: Which One Should You Actually Choose?

You wired up an LLM, pointed it at a user question about your own product, and it confidently invented an API endpoint that doesn't exist. Welcome to the moment every engineer hits: the base model is smart but knows nothing about your domain.

There are two mainstream fixes - retrieval-augmented generation (RAG) and fine-tuning - and the internet is full of hand-wavy "it depends" answers. This post is the opposite: when retrieval beats retraining, when it doesn't, what each actually looks like in code, and the gotchas I've hit shipping both.

The one-line mental model

RAG = the model looks things up at inference time. Knowledge lives outside the weights.

Fine-tuning = you bake new behavior into the weights via training. Knowledge/behavior lives inside the model.

The single most useful question to disambiguate them:

Is my problem that the model doesn't know my facts, or that it doesn't behave the way I want?

Knowledge gap → RAG. Behavior gap → fine-tuning. Most "we need fine-tuning" requests are actually knowledge gaps in disguise.

RAG in code (runnable)

The whole RAG loop is: embed your docs → store the vectors → at query time, embed the question, find the nearest chunks (semantic search), stuff them into the prompt.

This example runs as-is. Embeddings use sentence-transformers (local, no API key); generation uses Claude. Swap the embedding model for a hosted one (OpenAI, Voyage, Cohere) by replacing the two embedder.encode(...) calls.

pip install sentence-transformers numpy anthropic
export ANTHROPIC_API_KEY=sk-...   # for the generation step
import numpy as np
from sentence_transformers import SentenceTransformer
from anthropic import Anthropic

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small, fast, local
client = Anthropic()  # reads ANTHROPIC_API_KEY from env

# 1. Offline: chunk + embed your knowledge base
docs = [
    "Refunds are processed within 5 business days.",
    "Enterprise plans include SSO and a 99.9% SLA.",
    "The API rate limit is 100 requests per minute on the Pro tier.",
]
doc_vecs = embedder.encode(docs, normalize_embeddings=True)  # (n_docs, dim)

def retrieve(query, k=2):
    q = embedder.encode([query], normalize_embeddings=True)[0]
    sims = doc_vecs @ q                       # cosine sim (vectors are normalized)
    top = sims.argsort()[-k:][::-1]
    return [docs[i] for i in top]

# 2. Online: retrieve, then generate grounded in retrieved context
def answer(query):
    context = "\n".join(retrieve(query))
    msg = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": (
                "Answer using ONLY the context below. "
                "If it's not in the context, say you don't know.\n\n"
                f"Context:\n{context}\n\nQuestion: {query}"
            ),
        }],
    )
    return msg.content[0].text

print(answer("What's the rate limit on Pro?"))
# -> e.g. The API rate limit is 100 requests per minute on the Pro tier. (exact wording varies by run)

In production you'd swap the in-memory NumPy search for a real vector database (pgvector, Qdrant, Weaviate, Pinecone) so the index persists and search stays fast as the corpus grows - but the logic above is RAG. Nothing more mystical than "nearest-neighbor search + prompt assembly."

Why engineers reach for it:

  • Update knowledge by changing a document - no retraining.
  • Answers are traceable; you know which chunk produced them.
  • Sensitive data stays in your store, not in model weights.
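
The traceability point is easier to see in code. Here is a minimal sketch, not the setup from the example above: each chunk carries a source label, and retrieval returns that label with the score so the prompt can number its citations. The file names, the helper names, and toy_embed (a word-count stand-in for embedder.encode, used only so the block runs without a model download) are all assumptions for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge base: every chunk keeps a source label for citation.
CHUNKS = [
    {"source": "billing.md", "text": "Refunds are processed within 5 business days."},
    {"source": "plans.md", "text": "Enterprise plans include SSO and a 99.9% SLA."},
    {"source": "api.md", "text": "The API rate limit is 100 requests per minute on the Pro tier."},
]

def toy_embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve_with_sources(query, k=2):
    """Return the top-k chunks as (score, source, text), best first."""
    q = toy_embed(query)
    scored = [(cosine(q, toy_embed(c["text"])), c["source"], c["text"]) for c in CHUNKS]
    return sorted(scored, key=lambda row: row[0], reverse=True)[:k]

def build_context(hits):
    """Number each chunk so the model can cite [1], [2], ... in its answer."""
    return "\n".join(f"[{i}] ({src}) {text}" for i, (_, src, text) in enumerate(hits, start=1))

hits = retrieve_with_sources("What is the rate limit on the Pro tier?")
print(build_context(hits))
```

Logging the (score, source) pairs next to each answer is usually enough to debug a wrong response: you can see whether retrieval or generation was at fault.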

Fine-tuning in code

Fine-tuning doesn't retrieve anything. You show the model hundreds (or more) of input→output examples until the desired behavior is baked in. Preparing that data is the real work; the training call is almost an afterthought.

A typical supervised fine-tuning dataset is JSONL, one example per line:

{"messages": [{"role": "system", "content": "Classify the support ticket into: billing, bug, feature_request, other."}, {"role": "user", "content": "I was charged twice this month."}, {"role": "assistant", "content": "billing"}]}
{"messages": [{"role": "system", "content": "Classify the support ticket into: billing, bug, feature_request, other."}, {"role": "user", "content": "The export button does nothing on Safari."}, {"role": "assistant", "content": "bug"}]}
{"messages": [{"role": "system", "content": "Classify the support ticket into: billing, bug, feature_request, other."}, {"role": "user", "content": "Can you add dark mode?"}, {"role": "assistant", "content": "feature_request"}]}

And a representative training kickoff (provider SDKs vary, but the shape is consistent):

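# OpenAI-style SDK shown as one example (an assumption); adapt to your provider.
from openai import OpenAI
ft_client = OpenAI()  # reads OPENAI_API_KEY from env

# Upload the JSONL first: the job takes an uploaded file ID, not a local path.
with open("ticket_classifier.jsonl", "rb") as f:
    upload = ft_client.files.create(file=f, purpose="fine-tune")

job = ft_client.fine_tuning.jobs.create(
    training_file=upload.id,
    model="base-model-name",  # placeholder: a model your provider lets you fine-tune
    hyperparameters={"n_epochs": 3},
)
# poll ft_client.fine_tuning.jobs.retrieve(job.id).status until "succeeded", then call the resulting model id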

Why engineers reach for it:

  • Consistent format/tone/structure at scale without re-explaining it every prompt.
  • Specialized judgment that's hard to express as retrievable documents.
  • Shorter prompts at inference (the behavior is internalized), which can cut per-call cost at high volume.

The catch: every time the requirement changes, you curate data and retrain again. Data prep is the real cost, not the GPU time.
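
Since data prep is where the effort goes, a cheap pre-flight check on the JSONL can save a wasted training run. The sketch below assumes the three-message shape and the four labels from the ticket examples above; the function names and the duplicate check are my own additions, not any provider's validator.

```python
import json

ALLOWED_LABELS = {"billing", "bug", "feature_request", "other"}
EXPECTED_ROLES = ["system", "user", "assistant"]

def validate_line(line, lineno):
    """Return a list of problems found in one JSONL line (empty list = OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"line {lineno}: invalid JSON ({exc.msg})"]
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list):
        return [f"line {lineno}: missing 'messages' list"]
    if not all(isinstance(m, dict) and isinstance(m.get("content"), str) for m in messages):
        return [f"line {lineno}: every message needs a string 'content'"]
    roles = [m.get("role") for m in messages]
    if roles != EXPECTED_ROLES:
        return [f"line {lineno}: roles are {roles}, expected {EXPECTED_ROLES}"]
    label = messages[-1]["content"]
    if label not in ALLOWED_LABELS:
        return [f"line {lineno}: label {label!r} is not an allowed class"]
    return []

def validate_jsonl(lines):
    """Check every non-blank line and flag repeated user inputs."""
    problems, seen = [], set()
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        found = validate_line(line, lineno)
        problems.extend(found)
        if not found:
            user_text = json.loads(line)["messages"][1]["content"]
            if user_text in seen:
                problems.append(f"line {lineno}: duplicate user input")
            seen.add(user_text)
    return problems

sample = [
    '{"messages": [{"role": "system", "content": "Classify."}, {"role": "user", "content": "I was charged twice."}, {"role": "assistant", "content": "billing"}]}',
    '{"messages": [{"role": "user", "content": "Add dark mode?"}, {"role": "assistant", "content": "feature"}]}',
    "not json",
]
for problem in validate_jsonl(sample):
    print(problem)
```

For a real file, pass the open file handle instead of the sample list: validate_jsonl(open("ticket_classifier.jsonl")).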

The decision checklist

Run down this list; stop at the first strong signal.

  1. Knowledge changes often? → RAG. Retraining on every doc update is masochism.
  2. Need source-cited / auditable answers? → RAG. Fine-tuned weights can't tell you why.
  3. The model keeps getting the format/tone/judgment wrong, not the facts? → Fine-tuning.
  4. Don't have a few hundred clean labeled examples? → RAG. Fine-tuning requires them, so without that data RAG is the faster start.
  5. Still unsure? → Start with RAG. It's cheaper to build, easier to debug, and solves the most common failure (the model not knowing your stuff).
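
If it helps to see the checklist as a routing function, here is one way to write it. Treat it as a sketch: the parameter names are made up for illustration, and MIN_EXAMPLES = 300 is only my stand-in for "a few hundred", not a hard threshold.

```python
MIN_EXAMPLES = 300  # assumption: a rough reading of "a few hundred" examples

def choose_approach(knowledge_changes_often, needs_citations,
                    behavior_is_wrong, labeled_examples):
    """Walk the checklist in order and stop at the first strong signal."""
    if knowledge_changes_often:
        return "RAG"  # 1. retraining on every doc update is too costly
    if needs_citations:
        return "RAG"  # 2. retrieved chunks give you an audit trail
    if behavior_is_wrong and labeled_examples >= MIN_EXAMPLES:
        return "fine-tuning"  # 3 + 4. behavior gap, and the data to fix it
    return "RAG"  # 4 + 5. not enough data, or still unsure: start with RAG

print(choose_approach(knowledge_changes_often=False, needs_citations=False,
                      behavior_is_wrong=True, labeled_examples=500))
```

The order of the checks matters: a fast-changing knowledge base wins over a behavior complaint, which matches "stop at the first strong signal."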

And the part most posts skip: they're not mutually exclusive. Mature systems fine-tune for behavior and bolt RAG on top for fresh facts. "RAG vs Fine-Tuning" is increasingly "RAG and Fine-Tuning."

Gotchas I've hit

  • Chunking quietly decides your accuracy. Bad chunk boundaries (splitting a table mid-row, 2000-token mega-chunks) wreck retrieval before the model ever sees the question. Tune chunk size and overlap before you blame the LLM.

  • RAG is not plug-and-play. "Retrieved the wrong context → confidently wrong answer" is the failure mode I hit most often. Add an eval set early.

  • Fine-tuning a knowledge problem is the classic expensive mistake. If the fix is "teach it our current pricing," fine-tuning is slow, costly, and goes stale. Use RAG.

  • No eval = no progress. Build a small labeled test set before launch. Without it you're tuning blind and "it feels better" becomes your only metric.

  • Garbage in, confident garbage out. Both approaches amplify whatever you feed them. Clean the data first.
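
The chunking and eval gotchas go together, so here is a rough sketch of both: overlapping word-window chunking, plus a hit-rate check over a tiny labeled set. Several things are assumptions for illustration: sizes are counted in words rather than tokens, the eval pairs are invented, and keyword_retrieve is a toy stand-in for embedding search so the block runs on its own.

```python
def chunk_words(text, size=200, overlap=40):
    """Split text into windows of `size` words, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("need 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

def hit_rate(eval_set, retrieve_fn, k=2):
    """Fraction of questions whose expected snippet shows up in the top-k chunks."""
    hits = 0
    for question, expected_snippet in eval_set:
        retrieved = retrieve_fn(question, k)
        if any(expected_snippet in chunk for chunk in retrieved):
            hits += 1
    return hits / len(eval_set) if eval_set else 0.0

def keyword_retrieve(chunks):
    """Build a toy retriever that ranks chunks by shared words with the question."""
    def retrieve(question, k):
        q_words = set(question.lower().split())
        ranked = sorted(chunks, key=lambda c: len(q_words & set(c.lower().split())), reverse=True)
        return ranked[:k]
    return retrieve

text = ("Refunds are processed within 5 business days. "
        "The API rate limit is 100 requests per minute on the Pro tier.")
chunks = chunk_words(text, size=8, overlap=2)

# Hypothetical eval set: (question, snippet that must appear in a retrieved chunk)
eval_set = [
    ("how many business days do refunds take", "5 business days"),
    ("what is the api rate limit", "100 requests"),
]
print(len(chunks), hit_rate(eval_set, keyword_retrieve(chunks), k=1))
```

Swap keyword_retrieve for your real retriever and rerun hit_rate whenever you change chunk size or overlap; that gives you a number to compare instead of "it feels better."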

Difference Between RAG and Fine-Tuning

RAG

  • Fixes: knowledge gaps
  • Update knowledge: edit a document, no retraining
  • Traceable answers: yes - you know which chunk produced them
  • Upfront cost: lower
  • Best start for most teams: yes

Fine-Tuning

  • Fixes: behavior gaps
  • Update knowledge: retrain
  • Traceable answers: no
  • Upfront cost: higher (data prep is the real cost)
  • Best start for most teams: add later, once behavior is the blocker

Conclusion

The "RAG vs Fine-Tuning" debate isn't a turf war - it's a routing decision. Point a knowledge problem at RAG and a behavior problem at fine-tuning, and most of the confusion disappears.

For the large majority of teams, the right first move is RAG: it's cheaper to build, far easier to debug, ships in days instead of weeks, and directly solves the most common failure mode - the model not knowing your stuff. Reach for fine-tuning when the model already has the facts but keeps getting the format, tone, or judgment wrong, and you've got a few hundred clean examples to teach it. When you've earned the complexity, run both: fine-tune for behavior, layer RAG on top for fresh, traceable facts.

The expensive mistake to avoid is reaching for fine-tuning to fix a knowledge gap. It's slow, it goes stale, and a retrieval layer would've done the job in a fraction of the time. Start simple, measure with a real eval set, and only add weight to the system when the evidence says you need it.
