EncodeDots Technolabs

Posted on Jun 29 • Edited on Jul 1

RAG vs Fine-Tuning: Which One Should You Actually Choose?

#ai #rag #finetuning #llm

You wired up an LLM, pointed it at a user question about your product, and it confidently invented an API endpoint that doesn't exist. Welcome to the moment every AI engineer eventually hits: the base model is smart, but it doesn't know your company's data, documentation, or latest product changes.

There are two mainstream ways to fix that, RAG and fine-tuning, but most explanations stop at "it depends." This article goes further, breaking down when retrieval beats retraining, when fine-tuning is the better choice, and how to choose the right approach for real-world AI applications.

The one-line mental model

RAG = the model looks things up at inference time. Knowledge lives outside the model, in documents you control.

Fine-tuning = you bake new behavior into the weights via training. Behavior, and sometimes domain-specific patterns, lives inside the model.

The single most useful question to disambiguate them:

Is my problem that the model doesn't know my facts, or that it doesn't behave the way I want?

Knowledge gap → RAG. Behavior gap → fine-tuning. Most "we need fine-tuning" requests turn out to be knowledge gaps that RAG can solve more efficiently.

RAG in code (runnable)

The whole RAG loop is: embed your docs → store the vectors → at query time, embed the question, find the nearest chunks (semantic search), stuff them into the prompt.

This example runs as-is. Embeddings use sentence-transformers (local, no API key); generation uses Claude. Swap the local embedding model for a hosted one (OpenAI, Voyage, or Cohere) by replacing the embedder.encode(...) calls.

pip install sentence-transformers numpy anthropic
export ANTHROPIC_API_KEY=sk-...   # for the generation step

import numpy as np
from sentence_transformers import SentenceTransformer
from anthropic import Anthropic

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small, fast, local
client = Anthropic()  # reads ANTHROPIC_API_KEY from env

# 1. Offline: chunk + embed your knowledge base
docs = [
    "Refunds are processed within 5 business days.",
    "Enterprise plans include SSO and a 99.9% SLA.",
    "The API rate limit is 100 requests per minute on the Pro tier.",
]
doc_vecs = embedder.encode(docs, normalize_embeddings=True)  # (n_docs, dim)

def retrieve(query, k=2):
    q = embedder.encode([query], normalize_embeddings=True)[0]
    sims = doc_vecs @ q                       # cosine sim (vectors are normalized)
    top = sims.argsort()[-k:][::-1]
    return [docs[i] for i in top]

# 2. Online: retrieve, then generate grounded in retrieved context
def answer(query):
    context = "\n".join(retrieve(query))
    msg = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": (
                "Answer using ONLY the context below. "
                "If it's not in the context, say you don't know.\n\n"
                f"Context:\n{context}\n\nQuestion: {query}"
            ),
        }],
    )
    return msg.content[0].text

print(answer("What's the rate limit on Pro?"))
# -> e.g. "The API rate limit is 100 requests per minute on the Pro tier." (exact wording varies between runs)

In production, you'd replace the in-memory NumPy search with a vector database such as pgvector, Qdrant, Weaviate, or Pinecone to keep retrieval fast as your knowledge base grows. The retrieval logic stays the same; you've simply replaced a linear search with an index built for scale.
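The docs list in the example is already split into one-fact strings, but real documents need the "chunk" half of step 1 first. Here is a minimal sketch of one common approach, fixed-size word windows with overlap. The function name chunk_text and the default sizes are assumptions for illustration, not values from any library; structured content such as tables usually needs a splitter that respects its boundaries.

```python
def chunk_text(text: str, max_words: int = 120, overlap: int = 20) -> list[str]:
    """Split text into word windows; consecutive windows share `overlap` words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    words = text.split()
    step = max_words - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Each returned chunk would then go through embedder.encode(...) exactly like the hand-written docs list. The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk.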

Why engineers reach for RAG:

Update knowledge by changing a document - no retraining.
Answers are traceable; you know which chunk produced them.
Sensitive data stays in your store, not in model weights.

Fine-tuning in code

Fine-tuning doesn't retrieve anything. You train the model on hundreds or thousands of input→output examples until the desired behavior becomes consistent.

Most providers expect a JSONL file, where each line contains a complete training conversation.

{"messages": [{"role": "system", "content": "Classify the support ticket into: billing, bug, feature_request, other."}, {"role": "user", "content": "I was charged twice this month."}, {"role": "assistant", "content": "billing"}]}
{"messages": [{"role": "system", "content": "Classify the support ticket into: billing, bug, feature_request, other."}, {"role": "user", "content": "The export button does nothing on Safari."}, {"role": "assistant", "content": "bug"}]}
{"messages": [{"role": "system", "content": "Classify the support ticket into: billing, bug, feature_request, other."}, {"role": "user", "content": "Can you add dark mode?"}, {"role": "assistant", "content": "feature_request"}]}
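Writing those lines by hand gets error-prone past a few dozen examples. A small sketch of generating and sanity-checking the file from (ticket, label) pairs might look like the following. The helper names to_record, to_jsonl, and validate_jsonl are hypothetical, and the checks only cover the three-message shape shown above; your provider may enforce additional rules.

```python
import json

SYSTEM_PROMPT = "Classify the support ticket into: billing, bug, feature_request, other."
LABELS = {"billing", "bug", "feature_request", "other"}


def to_record(ticket: str, label: str) -> dict:
    """Build one chat-format training record for a (ticket, label) pair."""
    if label not in LABELS:
        raise ValueError(f"unknown label: {label!r}")
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": ticket},
            {"role": "assistant", "content": label},
        ]
    }


def to_jsonl(pairs: list[tuple[str, str]]) -> str:
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(to_record(t, l)) for t, l in pairs) + "\n"


def validate_jsonl(text: str) -> int:
    """Check each non-empty line has the expected roles and a known label; return the count."""
    count = 0
    for n, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        messages = json.loads(line)["messages"]
        roles = [m["role"] for m in messages]
        if roles != ["system", "user", "assistant"]:
            raise ValueError(f"line {n}: unexpected roles {roles}")
        if messages[-1]["content"] not in LABELS:
            raise ValueError(f"line {n}: unknown label")
        count += 1
    return count
```

Running validate_jsonl over the file before uploading catches typos in labels early, which is cheaper than discovering them after a training run.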

Once the dataset is ready, you upload it and start a fine-tuning job. The exact SDK differs between providers, but the workflow looks like this:

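# Sketch in the shape of the OpenAI Python SDK; adapt to your provider's fine-tuning SDK.
from openai import OpenAI

ft_client = OpenAI()  # reads OPENAI_API_KEY from env

# Upload the JSONL first: the job takes an uploaded file ID, not a local path.
upload = ft_client.files.create(file=open("ticket_classifier.jsonl", "rb"), purpose="fine-tune")
job = ft_client.fine_tuning.jobs.create(
    training_file=upload.id,
    model="base-model-name",  # placeholder: use a model your provider lets you fine-tune
    hyperparameters={"n_epochs": 3},
)
# poll job.status until "succeeded", then call the resulting model id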

Notice that the model never retrieves external documents at inference time. Everything it learned comes from the training examples you provided.

Why engineers reach for it:

Consistent formatting, tone, and behavior at scale without repeating detailed instructions in every prompt.
Specialized decision-making that's difficult to capture through retrieved documents alone.
Shorter prompts at inference because the desired behavior is learned during training, which can reduce token costs at high request volumes.

The catch: every time your requirements change, you need to update the training data and run another fine-tuning job. Data preparation is usually the biggest investment, not the training itself.

The decision checklist

Run down this list; stop at the first strong signal.

Knowledge changes often? → RAG. Retraining on every document update is masochism.
Need source-cited / auditable answers? → RAG. Fine-tuned weights can't tell you where an answer came from.
The model keeps getting the format, tone, or judgment wrong, but not the facts? → Fine-tuning.
Have a few hundred clean, labeled examples? → Fine-tuning becomes a realistic option. If not, RAG is usually the faster place to start.
Still unsure? → Start with RAG. It's cheaper to build, easier to debug, and solves the most common problem: the model doesn't have access to your knowledge.
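If it helps to see the checklist as a routing function, here is one possible encoding. The Signals fields and the min_examples threshold of 200 are assumptions standing in for "a few hundred"; treat the function as a conversation starter, not a rule.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    knowledge_changes_often: bool = False
    needs_cited_answers: bool = False
    behavior_is_the_problem: bool = False  # format, tone, or judgment is wrong, not the facts
    labeled_examples: int = 0              # clean, labeled input/output pairs on hand


def choose_approach(s: Signals, min_examples: int = 200) -> str:
    """Walk the checklist top to bottom and stop at the first strong signal."""
    if s.knowledge_changes_often or s.needs_cited_answers:
        return "rag"
    if s.behavior_is_the_problem and s.labeled_examples >= min_examples:
        return "fine-tuning"
    # Unsure, or a behavior gap without enough data yet: start with RAG.
    return "rag"
```

For example, choose_approach(Signals(behavior_is_the_problem=True, labeled_examples=500)) returns "fine-tuning", while the same behavior gap with only 50 examples falls through to "rag".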

And here's the part most posts skip: they're not mutually exclusive. Mature AI systems often fine-tune for behavior and layer RAG on top for fresh knowledge. "RAG vs. Fine-Tuning" is increasingly becoming "RAG + Fine-Tuning."

Gotchas I've hit

Chunking quietly decides your accuracy. Bad chunk boundaries (like splitting tables mid-row or creating 2,000-token mega-chunks) hurt retrieval before the model ever sees the question.
RAG is not plug-and-play. Retrieve the wrong context, and the model will confidently produce the wrong answer.
Fine-tuning a knowledge problem is the classic expensive mistake. If the goal is simply to "teach the model your latest pricing," fine-tuning is slow, costly, and goes stale. Use RAG.
No eval = no progress. Build a small labeled test set before launch. Without one, you're optimizing blind, and "it feels better" becomes your only metric.
Garbage in, confident garbage out. Both approaches amplify whatever you feed them. Clean the data first.
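To make the "no eval = no progress" point concrete, here is a minimal sketch of a retrieval eval: a handful of (question, expected chunk) pairs and a hit rate at k. The test questions and the toy keyword_retrieve function are made up for illustration; in practice you would pass in the retrieve(query, k) function from the RAG example, which has the same call shape.

```python
from typing import Callable, Sequence

Retriever = Callable[[str, int], Sequence[str]]


def hit_rate_at_k(retrieve: Retriever, test_set: Sequence[tuple[str, str]], k: int = 2) -> float:
    """Fraction of questions whose expected chunk appears in the top-k results."""
    if not test_set:
        raise ValueError("test_set is empty")
    hits = sum(1 for question, expected in test_set if expected in retrieve(question, k))
    return hits / len(test_set)


DOCS = [
    "Refunds are processed within 5 business days.",
    "Enterprise plans include SSO and a 99.9% SLA.",
    "The API rate limit is 100 requests per minute on the Pro tier.",
]


def keyword_retrieve(query: str, k: int = 2) -> list[str]:
    """Toy stand-in retriever: rank docs by how many lowercase words they share with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(DOCS, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return ranked[:k]


# Hypothetical labeled test set: each question maps to the chunk that should be retrieved.
TEST_SET = [
    ("how long do refunds take", DOCS[0]),
    ("does enterprise include sso", DOCS[1]),
    ("what is the api rate limit", DOCS[2]),
]

score = hit_rate_at_k(keyword_retrieve, TEST_SET, k=1)
```

Track this number every time you change chunking, the embedding model, or k. A fine-tuned classifier gets the same treatment with plain accuracy over held-out labeled tickets.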

RAG vs. Fine-Tuning at a Glance

RAG

Best for: Knowledge gaps
Knowledge update: Edit your documents, no retraining
Knowledge freshness: Always reflects your latest data
Traceable answers: Yes, via retrieved context
Upfront cost: Lower
Best first step: For most AI applications

Fine-Tuning

Best for: Behavior gaps
Knowledge update: Retrain the model
Knowledge freshness: Fixed until the next training run
Traceable answers: Not inherently
Upfront cost: Higher (data preparation is the biggest cost)
Best first step: Once behavior becomes the bottleneck

Conclusion

The "RAG vs Fine-Tuning" debate isn't a turf war - it's a routing decision. Point a knowledge problem at RAG, and a behavior problem at fine-tuning, and most of the confusion disappears.

For the vast majority of teams, the right first move is RAG: it's cheaper to build, far easier to debug, ships in days instead of weeks, and directly solves the most common failure mode - the model not knowing your stuff. Reach for fine-tuning when the model already has the facts but keeps getting the format, tone, or judgment wrong, and you've got a few hundred clean, labeled examples to teach it. When you've earned the complexity, run both: fine-tune for behavior, layer RAG on top for fresh, traceable facts.

The expensive mistake to avoid is reaching for fine-tuning to fix a knowledge gap. It's slow, it goes stale, and a retrieval layer would've done the job in a fraction of the time. Start simple, measure with a real eval set, and only add weight to the system when the evidence says you need it.

DEV Community