"Should we fine-tune or use RAG?" is one of the most common architecture questions when building LLM-powered applications. Most discussions frame it as a debate. It is better framed as a decision tree: the answer depends on what problem you are actually trying to solve.
This article gives you a concrete framework with criteria, a cost and latency comparison table, and working code for both approaches.
The core distinction
Before the framework: understand what each technique actually changes.
RAG (Retrieval-Augmented Generation) changes what the model knows at inference time by injecting retrieved context into the prompt. The base model weights are unchanged.
Fine-tuning changes how the model behaves by updating weights on domain-specific examples. No retrieval is involved: whatever the model knows at inference time was fixed when its last training run ended.
This distinction immediately rules out common misuses:
- Fine-tuning is a poor way to keep a model up to date (anything it learns is frozen when the training run ends, so every knowledge update means another run)
- RAG cannot teach a model a new response format or writing style (you cannot retrieve "tone")
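One way to see the distinction is to look at which part of a chat request each technique touches. The sketch below only builds plain request dictionaries and calls no API; the helper names and the `ft:` model ID are hypothetical placeholders.

```python
BASE_MODEL = "gpt-4o-mini"
# Hypothetical fine-tuned model ID, for illustration only.
FT_MODEL = "ft:gpt-4o-mini:my-org:example:abc123"


def rag_request(question: str, chunks: list[str]) -> dict:
    """RAG: same weights, but the prompt carries retrieved context."""
    context = "\n\n".join(chunks)
    return {
        "model": BASE_MODEL,
        "messages": [
            {"role": "user", "content": f"Sources:\n{context}\n\nQuestion: {question}"}
        ],
    }


def fine_tuned_request(question: str) -> dict:
    """Fine-tuning: same prompt shape, but a different set of weights."""
    return {
        "model": FT_MODEL,
        "messages": [{"role": "user", "content": question}],
    }
```

With RAG the `messages` change and the model stays the same; with fine-tuning the `model` changes and the prompt stays short.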
The decision framework
Work through these questions in order:
Q1: Does your application need information that changes over time?
If yes → use RAG. Fine-tuning a new model every time your knowledge base updates is impractical and expensive. RAG lets you update the document store without touching the model.
Examples where this applies: internal wikis, product documentation, legal/regulatory references, security advisories.
Q2: Do you have a large corpus of existing domain documents?
If yes → RAG is likely sufficient. If your documents cover the domain well, retrieval will surface the right context. Fine-tuning adds cost and complexity without a clear return.
If no → consider fine-tuning on synthetic or curated examples to inject domain knowledge directly.
Q3: Is your problem primarily about output format, style, or classification?
If yes → fine-tuning wins. Style and format are behavioral properties baked into weights, not knowledge properties that can be retrieved. If you need the model to always respond in a specific JSON structure, always use a specific tone, or classify inputs into a taxonomy, fine-tuning is the right tool.
Q4: Is latency critical?
Fine-tuning does not shrink a model, but it often lets a smaller base model with a shorter prompt do the job, and it needs no retrieval round-trip. RAG typically adds on the order of 50–200ms for vector search, plus the time to process a longer prompt containing the retrieved chunks. If p95 latency is a hard requirement, fine-tuning has a structural advantage.
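If p95 latency is the constraint, it helps to measure it rather than assume it. A minimal sketch using only the standard library; `p95_latency_ms` and `fits_budget` are hypothetical helpers, not part of any SDK, and `fn` stands for whatever retrieval or generation call you want to time.

```python
import statistics
import time
from typing import Callable


def p95_latency_ms(fn: Callable[[], object], runs: int = 50) -> float:
    """Call fn repeatedly and return the 95th-percentile wall-clock time in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    return statistics.quantiles(samples, n=20)[-1]


def fits_budget(retrieval_p95_ms: float, generation_p95_ms: float, budget_ms: float) -> bool:
    """Rough check only: the p95 of a sum is not exactly the sum of the p95s."""
    return retrieval_p95_ms + generation_p95_ms <= budget_ms
```

For a trustworthy number, time the whole request end to end under realistic load instead of adding up the stages.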
Q5: Do you need both knowledge grounding AND behavioral consistency?
If yes → use both. RAG handles knowledge, fine-tuning handles behavior. This is the setup for production systems that have both a large knowledge base and strict output requirements.
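The five questions can be written down as a small function. This is one possible reading of the framework, not a definitive encoding: the `Requirements` fields and the `recommend` helper are hypothetical names, and the notes it returns are reminders rather than hard rules.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    knowledge_changes: bool       # Q1: information changes over time
    has_domain_corpus: bool       # Q2: large corpus of domain documents exists
    format_or_style_bound: bool   # Q3: problem is about format, style, or classification
    latency_critical: bool        # Q4: hard latency requirement


def recommend(req: Requirements) -> tuple[str, list[str]]:
    """Return (choice, notes) following Q1-Q5 in order."""
    notes: list[str] = []
    needs_rag = req.knowledge_changes or req.has_domain_corpus
    needs_ft = req.format_or_style_bound

    if needs_rag and needs_ft:       # Q5: both knowledge and behavior
        choice = "rag+fine-tuning"
    elif needs_rag:
        choice = "rag"
    else:
        choice = "fine-tuning"
        if not needs_ft:
            notes.append("no corpus to retrieve from: consider curated or synthetic examples")

    if req.latency_critical and choice != "fine-tuning":
        notes.append("retrieval adds latency: measure p95 against your budget")
    return choice, notes
```

For example, `recommend(Requirements(True, True, True, False))` returns `"rag+fine-tuning"` with no notes.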
Cost and latency comparison
| | RAG | Fine-tuning | RAG + Fine-tuning |
|---|---|---|---|
| Setup cost | Low–Medium | High | High |
| Per-query cost | Higher (longer context) | Lower | Medium |
| Latency overhead | +50–200ms retrieval | None | +50–200ms retrieval |
| Knowledge update | Instant (update index) | Requires retraining | Instant for knowledge |
| Format consistency | Poor without prompting | Excellent | Excellent |
| Factual grounding | Strong (with sources) | Weak (hallucination risk) | Strong |
| Best for | Knowledge-heavy Q&A | Classification, style | Enterprise assistants |
Illustrative cost numbers for a typical 500-token query, assuming a flat $0.01 per 1k tokens (real pricing varies by provider and model, and fine-tuned models are often billed at a different rate than their base model):
- RAG: query (~500 tokens) + retrieved context (~300 tokens) = ~800 tokens per call → $0.008
- Fine-tuned: shorter prompt, ~400 tokens per call → $0.004, plus amortized training cost
- Training cost for a fine-tuned GPT-4o-mini on 10k examples: ~$40 one-time
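The arithmetic above also gives a break-even point for the training cost. A small sketch, using only the illustrative figures from this section (the flat $0.01 per 1k tokens rate and the ~$40 training cost are the article's example numbers, not quoted prices; the function names are hypothetical):

```python
PRICE_PER_1K = 0.01  # USD per 1k tokens, the illustrative flat rate used above


def query_cost(tokens: int, price_per_1k: float = PRICE_PER_1K) -> float:
    """Cost in USD of one call that processes `tokens` tokens."""
    return tokens / 1000 * price_per_1k


def break_even_queries(training_cost: float, cost_a: float, cost_b: float) -> float:
    """Number of queries after which the cheaper option (b) has paid for training."""
    return training_cost / (cost_a - cost_b)


rag_cost = query_cost(500 + 300)   # query plus retrieved context
ft_cost = query_cost(400)          # shorter prompt on the fine-tuned model
queries = break_even_queries(40.0, rag_cost, ft_cost)
print(f"RAG: ${rag_cost:.3f}, fine-tuned: ${ft_cost:.3f}, break-even: {queries:,.0f} queries")
```

Under these assumptions the one-time training cost is recovered after roughly 10,000 queries. If the fine-tuned model is billed at a higher per-token rate, the break-even point moves out accordingly.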
Code: a simple RAG retrieval pipeline
This shows the retrieval side — how context gets injected at inference time.
import os
import json
import numpy as np
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# --- Indexing phase (run once) ---

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return np.array(resp.data[0].embedding, dtype=np.float32)

def build_index(documents: list[dict]) -> list[dict]:
    """
    documents: list of {"id": str, "content": str, "metadata": dict}
    Returns documents with "embedding" field added.
    """
    indexed = []
    for doc in documents:
        vec = embed(doc["content"])
        indexed.append({**doc, "embedding": vec})
    return indexed

def retrieve(query: str, index: list[dict], top_k: int = 3) -> list[dict]:
    q_vec = embed(query)
    scored = []
    for doc in index:
        doc_vec = np.array(doc["embedding"], dtype=np.float32)
        score = float(np.dot(q_vec, doc_vec) / (np.linalg.norm(q_vec) * np.linalg.norm(doc_vec)))
        scored.append((score, doc))
    scored.sort(key=lambda x: x[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

# --- Inference phase ---

def rag_answer(query: str, index: list[dict]) -> str:
    docs = retrieve(query, index, top_k=3)
    context = "\n\n".join(
        f"[Source {i+1}] {doc['content']}" for i, doc in enumerate(docs)
    )
    messages = [
        {
            "role": "system",
            "content": (
                "You are a helpful assistant. Answer the user's question using ONLY "
                "the provided sources. If the sources do not contain the answer, say so. "
                "Cite sources by their [Source N] label."
            )
        },
        {
            "role": "user",
            "content": f"Sources:\n{context}\n\nQuestion: {query}"
        }
    ]
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        temperature=0
    )
    return resp.choices[0].message.content

# Usage example
documents = [
    {
        "id": "nis2-scope",
        "content": "NIS 2 applies to medium and large entities in 18 essential and important sectors. "
                   "Small entities are generally excluded except in specific sectors like DNS.",
        "metadata": {"source": "NIS2 Directive Article 2"}
    },
    {
        "id": "nis2-penalties",
        "content": "Under NIS 2, essential entities can face fines up to €10M or 2% of annual turnover. "
                   "Important entities face fines up to €7M or 1.4% of annual turnover.",
        "metadata": {"source": "NIS2 Directive Article 34"}
    }
]

index = build_index(documents)
answer = rag_answer("What fines apply to essential entities under NIS 2?", index)
print(answer)
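The per-document loop in `retrieve` is fine for a handful of documents. For a larger in-memory index, one common variation is to stack the embeddings into a matrix once and score every document with a single matrix product. The sketch below assumes you already have the embedding vectors; `build_matrix` and `top_k_indices` are hypothetical helper names, and it makes no API calls.

```python
import numpy as np


def build_matrix(vectors: list[np.ndarray]) -> np.ndarray:
    """Stack embeddings into one (n_docs, dim) matrix with unit-length rows."""
    mat = np.vstack(vectors).astype(np.float32)
    norms = np.linalg.norm(mat, axis=1, keepdims=True)
    return mat / np.clip(norms, 1e-12, None)  # guard against zero vectors


def top_k_indices(query_vec: np.ndarray, matrix: np.ndarray, top_k: int = 3) -> list[int]:
    """Return the row indices of the top_k most similar documents."""
    q = query_vec / max(float(np.linalg.norm(query_vec)), 1e-12)
    scores = matrix @ q                 # cosine similarity, since rows and q are unit length
    order = np.argsort(-scores)[:top_k]  # highest score first
    return [int(i) for i in order]
```

The returned indices map back to positions in the original document list. Beyond what fits comfortably in memory, a dedicated vector store is the usual next step.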
Code: calling a fine-tuned model
Fine-tuned models expose the same API; the only difference is the model name. Here is the full pattern, including how you prepare your training data. The snippet reuses `json` and `client` from the RAG example above:
# Training data format (saved to a JSONL file for upload).
# The prompt mentions JSON explicitly: the API's JSON mode
# (response_format={"type": "json_object"}) rejects requests whose
# messages never mention JSON. Use the same prompt for training and inference.
SYSTEM_PROMPT = (
    "Classify the following security event as: phishing, malware, brute_force, "
    "data_exfiltration, or unknown. Respond with a JSON object."
)

TRAINING_EXAMPLES = [
    {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": "User received email with link to fake Microsoft login page, credentials entered."},
            {"role": "assistant", "content": '{"classification": "phishing", "confidence": "high"}'}
        ]
    },
    {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": "SSH server shows 847 failed login attempts from single IP in 10 minutes."},
            {"role": "assistant", "content": '{"classification": "brute_force", "confidence": "high"}'}
        ]
    },
    # ... add 50+ examples for a useful fine-tune
]

def save_training_data(examples: list[dict], path: str = "/tmp/training.jsonl"):
    # One JSON object per line, as the fine-tuning upload expects.
    with open(path, "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")
    return path

def upload_and_fine_tune(training_file_path: str) -> str:
    """Upload training data and start a fine-tuning job. Returns the job ID."""
    with open(training_file_path, "rb") as f:
        upload = client.files.create(file=f, purpose="fine-tune")
    job = client.fine_tuning.jobs.create(
        training_file=upload.id,
        model="gpt-4o-mini"
    )
    print(f"Fine-tuning job started: {job.id}")
    return job.id

# Once fine-tuning completes, use the model:
def classify_with_fine_tuned(event_text: str, model_id: str) -> dict:
    """
    model_id looks like: ft:gpt-4o-mini:your-org:classifier:abc123
    """
    resp = client.chat.completions.create(
        model=model_id,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": event_text}
        ],
        temperature=0,
        response_format={"type": "json_object"}
    )
    return json.loads(resp.choices[0].message.content)

# Usage (replace with your actual fine-tuned model ID)
FINE_TUNED_MODEL = "ft:gpt-4o-mini:my-org:sec-classifier:abc123"
result = classify_with_fine_tuned(
    "Unusual outbound traffic: 2.3GB sent to unknown IP at 3AM on weekend.",
    FINE_TUNED_MODEL
)
print(result)
# {"classification": "data_exfiltration", "confidence": "high"}
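Even a well-trained classifier can occasionally return a label outside the taxonomy, so it is worth validating the output before anything downstream consumes it. A minimal sketch, assuming the five labels from the system prompt above; `ALLOWED_CLASSES` and `parse_classification` are hypothetical names, and falling back to `unknown` is one design choice among several (raising an error or retrying are others).

```python
import json

ALLOWED_CLASSES = {"phishing", "malware", "brute_force", "data_exfiltration", "unknown"}


def parse_classification(raw: str) -> dict:
    """Parse the model's raw output; fall back to 'unknown' on anything unexpected."""
    fallback = {"classification": "unknown", "confidence": "low"}
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return fallback  # not valid JSON (or not a string at all)
    label = data.get("classification") if isinstance(data, dict) else None
    if not isinstance(label, str) or label not in ALLOWED_CLASSES:
        return fallback  # valid JSON, but outside the taxonomy
    return data
```

Logging how often the fallback fires gives direct evidence of whether the fine-tune is actually enforcing the format.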
When to combine both
The RAG + fine-tuning combination makes sense when you have:
- A large, updating knowledge base (RAG handles this)
- Strict output format requirements (fine-tuning handles this)
- Domain-specific reasoning patterns (fine-tuning handles this)
Architecture in that case:
User query
    │
    ▼
[Vector retrieval] → top-k relevant chunks
    │
    ▼
[Fine-tuned model] ← receives query + retrieved context
    │
    ▼
Formatted, domain-appropriate response
Security operations centers are a common use case: the knowledge base (threat intelligence, runbooks, asset inventory) updates constantly and must be retrieved dynamically, but the response format — incident severity, MITRE ATT&CK tactic, recommended action — must be consistent and structured, which fine-tuning enforces.
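The combined flow can be sketched as a single function that takes the retrieval step and the model call as parameters. Everything here is illustrative: `combined_answer`, `retrieve_fn`, and `generate_fn` are hypothetical names, and the system prompt is a placeholder. In practice `retrieve_fn` would wrap a vector search and `generate_fn` would wrap a chat completion call against the fine-tuned model ID.

```python
from typing import Callable


def combined_answer(
    query: str,
    retrieve_fn: Callable[[str], list[str]],
    generate_fn: Callable[[list[dict]], str],
) -> str:
    """Retrieve context, then hand query plus context to the fine-tuned model."""
    chunks = retrieve_fn(query)
    context = "\n\n".join(f"[Source {i + 1}] {c}" for i, c in enumerate(chunks))
    messages = [
        # Placeholder instruction; the fine-tune is what enforces the output format.
        {"role": "system", "content": "Answer using only the provided sources."},
        {"role": "user", "content": f"Sources:\n{context}\n\nQuestion: {query}"},
    ]
    return generate_fn(messages)
```

Passing the two steps in as callables keeps them independently testable and swappable. One thing to watch: the fine-tuning examples should include retrieved context in the same layout, otherwise the model sees a prompt shape at inference time that it never saw in training.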
The actual decision
Make the call concrete:
- Knowledge changes weekly or more often → RAG
- You have > 1,000 domain documents → start with RAG
- You need a specific output format enforced across 100% of responses → fine-tune
- You are building a classifier with < 10 output classes → fine-tune
- Your application needs both → RAG for context, fine-tune for behavior
- You have neither domain documents nor labeled examples → neither; fix your data problem first
The teams we work with at AYI NEDJIMI Consultants on AI-assisted security tooling typically end up with RAG for their knowledge base and a lightweight fine-tune for classification tasks — not because that is always optimal, but because those are the two problems they actually have.
Start with RAG. Add fine-tuning when you have clear evidence that format or behavioral consistency is the limiting factor. Build from scratch before reaching for a framework.