Ayi NEDJIMI

Posted on May 22

Fine-tuning vs RAG: a decision framework with examples

#llm #ai #python #machinelearning

"Should we fine-tune or use RAG?" is one of the most common architecture questions when building LLM-powered applications. Most discussions frame it as a debate. It is better framed as a decision tree: the answer depends on what problem you are actually trying to solve.

This article gives you a concrete framework with criteria, a cost and latency comparison table, and working code for both approaches.

The core distinction

Before the framework: understand what each technique actually changes.

RAG (Retrieval-Augmented Generation) changes what the model knows at inference time by injecting retrieved context into the prompt. The base model weights are unchanged.

Fine-tuning changes how the model behaves by updating its weights on domain-specific examples. Nothing is retrieved at inference time, so the model knows only what was in its pre-training and fine-tuning data.

This distinction immediately rules out common misuses:

Fine-tuning cannot keep a model up to date (anything it learned is frozen once training ends, so new facts require another training run)
RAG cannot teach a model a new response format or writing style (you cannot retrieve "tone")
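
The distinction shows up directly in the request you send. The following is a minimal sketch, assuming an OpenAI-style chat payload; the helper names, the sample question, and the `ft:...` model ID are hypothetical placeholders, not part of any library.

```python
def rag_request(base_model: str, question: str, retrieved_chunks: list[str]) -> dict:
    """RAG: the model is unchanged, but the prompt carries retrieved context."""
    context = "\n\n".join(retrieved_chunks)
    return {
        "model": base_model,
        "messages": [
            {"role": "user", "content": f"Sources:\n{context}\n\nQuestion: {question}"}
        ],
    }


def fine_tuned_request(fine_tuned_model: str, question: str) -> dict:
    """Fine-tuning: the prompt stays short; the model (its weights) is what changed."""
    return {
        "model": fine_tuned_model,
        "messages": [{"role": "user", "content": question}],
    }


# Illustrative inputs only (made up for this sketch).
rag = rag_request(
    "gpt-4o-mini",
    "What is our refund window?",
    ["Refunds are accepted within 30 days."],
)
ft = fine_tuned_request("ft:gpt-4o-mini:my-org:support:abc123", "What is our refund window?")
```

In the first payload the knowledge travels in the prompt; in the second, the only thing that differs from a plain call is the model name.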

The decision framework

Work through these questions in order:

Q1: Does your application need information that changes over time?

If yes → use RAG. Fine-tuning a new model every time your knowledge base updates is impractical and expensive. RAG lets you update the document store without touching the model.

Examples where this applies: internal wikis, product documentation, legal/regulatory references, security advisories.
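
Updating the document store can be as small as overwriting one entry. The sketch below assumes an in-memory index keyed by document ID (the pipeline later in this article uses a plain list instead) and a toy `fake_embed` function standing in for a real embedding call; both are hypothetical.

```python
import numpy as np


def fake_embed(text: str) -> np.ndarray:
    """Stand-in for a real embedding call (hypothetical): letter-count vector."""
    vec = np.zeros(26, dtype=np.float32)
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec


def upsert(index: dict[str, dict], doc_id: str, content: str, embed=fake_embed) -> None:
    """Insert a document, or replace its text and embedding if the ID already exists."""
    index[doc_id] = {"id": doc_id, "content": content, "embedding": embed(content)}


index: dict[str, dict] = {}
upsert(index, "advisory-001", "Patch available for the VPN gateway.")
# The advisory is revised: re-embed and overwrite. No retraining is involved.
upsert(index, "advisory-001", "Patch superseded; apply the hotfix instead.")
```

The next query sees the revised advisory immediately, which is the property Q1 is testing for.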

Q2: Do you have a large corpus of existing domain documents?

If yes → RAG is likely sufficient. If your documents cover the domain well, retrieval will surface the right context. Fine-tuning adds cost and complexity without a clear return.

If no → consider fine-tuning on synthetic or curated examples to inject domain knowledge directly.

Q3: Is your problem primarily about output format, style, or classification?

If yes → fine-tuning wins. Style and format are behavioral properties baked into weights, not knowledge properties that can be retrieved. If you need the model to always respond in a specific JSON structure, always use a specific tone, or classify inputs into a taxonomy, fine-tuning is the right tool.

Q4: Is latency critical?

A fine-tuned model needs no retrieval round-trip and can often run with a shorter prompt, or on a smaller base model, than the RAG equivalent. Fine-tuning does not shrink the model itself. RAG typically adds 50–200ms for vector search, plus the time to process a longer context window (the retrieved chunks). If p95 latency is a hard requirement, fine-tuning has a structural advantage.
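
Before deciding on latency grounds, it helps to measure the retrieval step rather than rely on a rule of thumb. A minimal timing sketch follows; `fake_retrieval` is a hypothetical stand-in for your vector search call, and the run count is arbitrary.

```python
import time

import numpy as np


def p95_latency_ms(fn, runs: int = 50) -> float:
    """Call fn() `runs` times and return the 95th-percentile wall-clock latency in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return float(np.percentile(samples, 95))


def fake_retrieval() -> None:
    """Hypothetical stand-in for a vector search round-trip."""
    time.sleep(0.002)


retrieval_p95 = p95_latency_ms(fake_retrieval, runs=20)
print(f"retrieval p95: {retrieval_p95:.1f} ms")
```

Swap `fake_retrieval` for a call into your own retrieval function and compare the result against the latency budget.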

Q5: Do you need both knowledge grounding AND behavioral consistency?

If yes → use both. RAG handles knowledge, fine-tuning handles behavior. This is the setup for production systems that have both a large knowledge base and strict output requirements.

Cost and latency comparison

Criterion	RAG	Fine-tuning	RAG + Fine-tuning
Setup cost	Low–Medium	High	High
Per-query cost	Higher (longer context)	Lower	Medium
Latency overhead	+50–200ms retrieval	None	+50–200ms retrieval
Knowledge update	Instant (update index)	Requires retraining	Instant for knowledge
Format consistency	Poor without prompting	Excellent	Excellent
Factual grounding	Strong (with sources)	Weak (hallucination risk)	Strong
Best for	Knowledge-heavy Q&A	Classification, style	Enterprise assistants

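Illustrative cost numbers for a typical 500-token query, assuming a flat $0.01 per 1k tokens for both paths (fine-tuned models are often billed at a higher per-token rate than their base model, so check current pricing):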

RAG: base model call (~500 tokens) + retrieval context (~300 tokens) = 800 token call → $0.008 at $0.01/1k
Fine-tuned: smaller model, shorter prompt = ~400 tokens → $0.004 at $0.01/1k (plus amortized training cost)
Training cost for a fine-tuned GPT-4o-mini on 10k examples: ~$40 one-time
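
The figures above imply a break-even point for the one-time training cost. The sketch below simply redoes that arithmetic; the function names are made up for illustration, and the prices are the article's example numbers, not current list prices.

```python
def cost_per_query(tokens: int, price_per_1k: float) -> float:
    """Cost of a single call at a flat per-1k-token price."""
    return tokens * price_per_1k / 1000


def break_even_queries(rag_cost: float, ft_cost: float, training_cost: float) -> float:
    """Queries needed before the per-query saving repays the one-time training cost."""
    saving = rag_cost - ft_cost
    if saving <= 0:
        raise ValueError("the fine-tuned path is not cheaper per query")
    return training_cost / saving


rag_cost = cost_per_query(800, 0.01)  # 500-token call + 300 tokens of retrieved context
ft_cost = cost_per_query(400, 0.01)   # shorter prompt, no retrieved context
queries = break_even_queries(rag_cost, ft_cost, training_cost=40.0)
print(f"RAG ${rag_cost:.3f}/query, fine-tuned ${ft_cost:.3f}/query, "
      f"break-even after about {queries:,.0f} queries")
```

With these inputs the saving is $0.004 per query, so the $40 training run pays for itself after roughly 10,000 queries.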

Code: a simple RAG retrieval pipeline

This shows the retrieval side — how context gets injected at inference time.

import os
import json
import numpy as np
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# --- Indexing phase (run once) ---

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return np.array(resp.data[0].embedding, dtype=np.float32)

def build_index(documents: list[dict]) -> list[dict]:
    """
    documents: list of {"id": str, "content": str, "metadata": dict}
    Returns documents with "embedding" field added.
    """
    indexed = []
    for doc in documents:
        vec = embed(doc["content"])
        indexed.append({**doc, "embedding": vec})
    return indexed

def retrieve(query: str, index: list[dict], top_k: int = 3) -> list[dict]:
    q_vec = embed(query)
    scored = []
    for doc in index:
        doc_vec = np.array(doc["embedding"], dtype=np.float32)
        score = float(np.dot(q_vec, doc_vec) / (np.linalg.norm(q_vec) * np.linalg.norm(doc_vec)))
        scored.append((score, doc))
    scored.sort(key=lambda x: x[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

# --- Inference phase ---

def rag_answer(query: str, index: list[dict]) -> str:
    docs = retrieve(query, index, top_k=3)
    context = "\n\n".join(
        f"[Source {i+1}] {doc['content']}" for i, doc in enumerate(docs)
    )

    messages = [
        {
            "role": "system",
            "content": (
                "You are a helpful assistant. Answer the user's question using ONLY "
                "the provided sources. If the sources do not contain the answer, say so. "
                "Cite sources by their [Source N] label."
            )
        },
        {
            "role": "user",
            "content": f"Sources:\n{context}\n\nQuestion: {query}"
        }
    ]

    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        temperature=0
    )
    return resp.choices[0].message.content

# Usage example
documents = [
    {
        "id": "nis2-scope",
        "content": "NIS 2 applies to medium and large entities in 18 essential and important sectors. "
                   "Small entities are generally excluded except in specific sectors like DNS.",
        "metadata": {"source": "NIS2 Directive Article 2"}
    },
    {
        "id": "nis2-penalties",
        "content": "Under NIS 2, essential entities can face fines up to €10M or 2% of annual turnover. "
                   "Important entities face fines up to €7M or 1.4% of annual turnover.",
        "metadata": {"source": "NIS2 Directive Article 34"}
    }
]

index = build_index(documents)
answer = rag_answer("What fines apply to essential entities under NIS 2?", index)
print(answer)
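
The `retrieve` function above recomputes document norms on every query, which is fine for a handful of documents. For a larger in-memory index, one possible variant is to normalize the embeddings once and rank with a single matrix-vector product. This is a sketch with made-up 2-D vectors, not a replacement for a real vector store; `normalize_rows` and `top_k_cosine` are hypothetical helper names.

```python
import numpy as np


def normalize_rows(matrix: np.ndarray) -> np.ndarray:
    """Scale each row to unit length (guarding against zero-length rows)."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.clip(norms, 1e-12, None)


def top_k_cosine(query_vec: np.ndarray, unit_matrix: np.ndarray, top_k: int = 3) -> list[int]:
    """Return row indices of the top_k most similar documents, best first."""
    q = query_vec / max(float(np.linalg.norm(query_vec)), 1e-12)
    scores = unit_matrix @ q  # cosine similarity, since both sides are unit length
    return np.argsort(-scores)[:top_k].tolist()


# Toy 2-D "embeddings" (made up) so the ranking is easy to check by eye.
doc_matrix = normalize_rows(np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], dtype=np.float32))
ranked = top_k_cosine(np.array([1.0, 0.1], dtype=np.float32), doc_matrix, top_k=2)
print(ranked)
```

The query points almost along the first axis, so document 0 ranks first and the diagonal document 2 second.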

Code: calling a fine-tuned model

Fine-tuned models expose the same API; the only difference is the model name. Here is the full pattern, including how to prepare the training data:

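import json
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Training data format (saved to a JSONL file for upload)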
TRAINING_EXAMPLES = [
    {
        "messages": [
            {"role": "system", "content": "Classify the following security event as: phishing, malware, brute_force, data_exfiltration, or unknown."},
            {"role": "user", "content": "User received email with link to fake Microsoft login page, credentials entered."},
            {"role": "assistant", "content": '{"classification": "phishing", "confidence": "high"}'}
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "Classify the following security event as: phishing, malware, brute_force, data_exfiltration, or unknown."},
            {"role": "user", "content": "SSH server shows 847 failed login attempts from single IP in 10 minutes."},
            {"role": "assistant", "content": '{"classification": "brute_force", "confidence": "high"}'}
        ]
    },
    # ... add 50+ examples for a useful fine-tune
]

def save_training_data(examples: list[dict], path: str = "/tmp/training.jsonl"):
    with open(path, "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")
    return path

def upload_and_fine_tune(training_file_path: str) -> str:
    """Upload training data and start a fine-tuning job. Returns the job ID."""
    with open(training_file_path, "rb") as f:
        upload = client.files.create(file=f, purpose="fine-tune")

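    job = client.fine_tuning.jobs.create(
        training_file=upload.id,
        # The API may require a dated snapshot name (for example
        # "gpt-4o-mini-2024-07-18") rather than the bare alias; check the
        # current fine-tuning docs for the supported model names.
        model="gpt-4o-mini"
    )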
    print(f"Fine-tuning job started: {job.id}")
    return job.id

# Once fine-tuning completes, use the model:
def classify_with_fine_tuned(event_text: str, model_id: str) -> dict:
    """
    model_id looks like: ft:gpt-4o-mini:your-org:classifier:abc123
    """
    resp = client.chat.completions.create(
        model=model_id,
        messages=[
            {
                "role": "system",
                "content": "Classify the following security event as: phishing, malware, brute_force, data_exfiltration, or unknown."
            },
            {
                "role": "user",
                "content": event_text
            }
        ],
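        temperature=0
        # JSON mode (response_format={"type": "json_object"}) is left out on purpose:
        # the API rejects it unless the word "JSON" appears in the messages, and this
        # prompt must stay identical to the one used in the training data.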
    )
    return json.loads(resp.choices[0].message.content)

# Usage (replace with your actual fine-tuned model ID)
FINE_TUNED_MODEL = "ft:gpt-4o-mini:my-org:sec-classifier:abc123"

result = classify_with_fine_tuned(
    "Unusual outbound traffic: 2.3GB sent to unknown IP at 3AM on weekend.",
    FINE_TUNED_MODEL
)
print(result)
# {"classification": "data_exfiltration", "confidence": "high"}
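
Even a well-trained classifier can occasionally return malformed or out-of-taxonomy output, so it is worth validating before anything downstream consumes it. A minimal sketch follows; the fallback value and the `parse_classification` name are assumptions for illustration, and the label set is copied from the system prompt above.

```python
import json

ALLOWED_LABELS = {"phishing", "malware", "brute_force", "data_exfiltration", "unknown"}


def parse_classification(raw: str) -> dict:
    """Parse model output; fall back to 'unknown' if it is not valid, in-taxonomy JSON."""
    fallback = {"classification": "unknown", "confidence": "low"}  # assumed fallback
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return fallback
    label = data.get("classification") if isinstance(data, dict) else None
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        return fallback
    return data


ok = parse_classification('{"classification": "phishing", "confidence": "high"}')
bad_label = parse_classification('{"classification": "spam", "confidence": "high"}')
not_json = parse_classification("This looks like phishing.")
```

Counting how often the fallback fires on a held-out set gives a concrete measure of format consistency.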

When to combine both

The RAG + fine-tuning combination makes sense when you have:

A large, updating knowledge base (RAG handles this)
Strict output format requirements (fine-tuning handles this)
Domain-specific reasoning patterns (fine-tuning handles this)

Architecture in that case:

User query
    │
    ▼
[Vector retrieval] → top-k relevant chunks
    │
    ▼
[Fine-tuned model] ← receives query + retrieved context
    │
    ▼
Formatted, domain-appropriate response

Security operations centers are a common use case: the knowledge base (threat intelligence, runbooks, asset inventory) updates constantly and must be retrieved dynamically, but the response format — incident severity, MITRE ATT&CK tactic, recommended action — must be consistent and structured, which fine-tuning enforces.
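
The architecture above can be sketched as one function that wires retrieval into a call to the fine-tuned model. The retrieval and completion steps are passed in as callables so the wiring can be tried without network access; `stub_retrieve`, `stub_complete`, and the sample runbook text are hypothetical.

```python
from typing import Callable


def answer_with_rag_and_fine_tune(
    query: str,
    retrieve_fn: Callable[[str], list[str]],
    complete_fn: Callable[[str, list[dict]], str],
    fine_tuned_model: str,
) -> str:
    """Retrieve context, then let the fine-tuned model produce the formatted answer."""
    chunks = retrieve_fn(query)
    context = "\n\n".join(f"[Source {i + 1}] {c}" for i, c in enumerate(chunks))
    messages = [{"role": "user", "content": f"Sources:\n{context}\n\nQuestion: {query}"}]
    return complete_fn(fine_tuned_model, messages)


# Stubs standing in for vector search and the chat completion call (both hypothetical).
seen_messages: list[list[dict]] = []


def stub_retrieve(query: str) -> list[str]:
    return ["Runbook 12: isolate the host, then reset credentials."]


def stub_complete(model: str, messages: list[dict]) -> str:
    seen_messages.append(messages)
    return f"answered by {model}"


result = answer_with_rag_and_fine_tune(
    "How do we respond to a compromised workstation?",
    stub_retrieve,
    stub_complete,
    "ft:gpt-4o-mini:my-org:sec-classifier:abc123",
)
```

In a real system `retrieve_fn` would wrap the vector search and `complete_fn` the chat completion call, with the fine-tuned model's training data using the same "Sources / Question" prompt layout.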

The actual decision

Make the call concrete:

Knowledge changes weekly or more often → RAG
You have > 1,000 domain documents → start with RAG
You need a specific output format enforced across 100% of responses → fine-tune
You are building a classifier with < 10 output classes → fine-tune
Your application needs both → RAG for context, fine-tune for behavior
You have neither domain documents nor labeled examples → neither; fix your data problem first
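
The rules above are simple enough to write down as code. The sketch below is one possible encoding; the `Requirements` fields and the `recommend` function are made up for illustration, and the default of "RAG" when nothing else applies is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    knowledge_changes_weekly: bool
    domain_documents: int
    labeled_examples: int
    needs_strict_format: bool
    is_small_classifier: bool  # fewer than 10 output classes


def recommend(req: Requirements) -> str:
    """Map the rules of thumb above onto a single recommendation string."""
    if req.domain_documents == 0 and req.labeled_examples == 0:
        return "neither: fix the data problem first"
    wants_rag = req.knowledge_changes_weekly or req.domain_documents > 1000
    wants_ft = req.needs_strict_format or req.is_small_classifier
    if wants_rag and wants_ft:
        return "RAG + fine-tuning"
    if wants_ft:
        return "fine-tuning"
    return "RAG"  # assumed default when no rule points elsewhere


soc = recommend(Requirements(True, 5000, 200, True, False))
wiki = recommend(Requirements(True, 2000, 0, False, False))
classifier = recommend(Requirements(False, 0, 500, False, True))
no_data = recommend(Requirements(False, 0, 0, True, False))
```

Real projects rarely reduce to five booleans, but writing the rules out this way makes it obvious which input is driving the decision.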

The teams we work with at AYI NEDJIMI Consultants on AI-assisted security tooling typically end up with RAG for their knowledge base and a lightweight fine-tune for classification tasks — not because that is always optimal, but because those are the two problems they actually have.

Start with RAG. Add fine-tuning when you have clear evidence that format or behavioral consistency is the limiting factor. Build the first version from scratch, as in the examples above, before reaching for a framework.
