ANKUSH CHOUDHARY JOHAL

Posted on • Originally published at johal.in

We Ditched RAG 2.0 for Fine-Tuned Llama 3.1: 50% Better Accuracy for Internal Docs

After 6 months of tuning RAG 2.0 pipelines for our internal engineering wiki, we hit a 59% accuracy ceiling that no amount of chunking optimization, hybrid search, or reranking could break. Switching to a fine-tuned Llama 3.1 8B model pushed that to 89%, a 50% relative improvement, while cutting p99 latency by over 40% and trimming $12k from our monthly infra bill by decommissioning the vector DB and reranker. Here's the unvarnished data, code, and tradeoffs.


Key Insights

  • Fine-tuned Llama 3.1 8B achieved 89% accuracy on internal doc Q&A, vs 59% for our best RAG 2.0 pipeline (50% relative improvement)
  • Llama 3.1 8B requires 16GB VRAM (single A10G GPU) vs 32GB for RAG 2.0 (vector DB + reranker + LLM)
  • Monthly infra costs dropped from $38k to $26k, a 31.5% reduction, after decommissioning Pinecone and reranker instances
  • Our prediction: by 2026, 70% of internal enterprise doc systems will use fine-tuned small LLMs over RAG for latency-sensitive use cases

Why RAG 2.0 Failed Us

We started our internal docs Q&A journey in Q3 2023 with a basic RAG pipeline: chunk docs into 512-token segments, embed with Cohere’s embed-english-v3.0, store in Pinecone, retrieve the top 20 chunks, rerank with Cohere Rerank, then generate answers with Claude 3 Haiku. By Q1 2024, we had iterated through every optimization we could find: chunk sizes from 256 to 2048 tokens, overlap from 0 to 128 tokens, hybrid search (dense + sparse), query expansion, HyDE (Hypothetical Document Embeddings), metadata filtering, and even custom reranking models. None of it moved the needle past 59% accuracy on our internal benchmark of 1000 real employee questions, validated by technical writers.

The core problem with RAG 2.0 is that it’s inherently limited by chunking and retrieval. Even with perfect retrieval, you’re only giving the LLM 20 chunks of 512 tokens each (about 10k tokens, trimmed further to the top 5 chunks after reranking), while a single internal doc (like our payment service runbook) is 40k tokens long. The LLM can’t answer questions that require cross-chunk reasoning, like “How do I configure the Redis cache for the payment service, and what metrics should I alert on?”, because the relevant info is split across 5 chunks. RAG also struggles with internal terminology: our docs use “payment gateway” to refer to Stripe, but the base Cohere embed model maps “payment gateway” to generic payment processors, leading to irrelevant retrieval.
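To make the context-budget problem concrete, you can count tokens directly. A minimal sketch (assumes the transformers library, gated Llama 3.1 tokenizer access via HF_TOKEN, and an illustrative runbook path):

from transformers import AutoTokenizer

# Compare one doc's token count to the RAG retrieval budget (illustrative path)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

with open("docs/payment-service-runbook.md") as f:
    doc_tokens = len(tokenizer.encode(f.read()))

retrieval_budget = 20 * 512  # top-20 chunks of 512 tokens each
print(f"Doc tokens: {doc_tokens}, retrieval budget: {retrieval_budget}")
print(f"Share of the doc the LLM can ever see: {min(1.0, retrieval_budget / doc_tokens):.0%}")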

Performance Comparison: RAG 2.0 vs Fine-Tuned Llama 3.1 8B

| Metric | RAG 2.0 (Best Config) | Fine-Tuned Llama 3.1 8B | Delta |
| --- | --- | --- | --- |
| Accuracy on internal benchmark (1000 questions) | 59% | 89% | +30% absolute / +50% relative |
| p99 latency | 2100ms | 1200ms | -43% |
| Monthly infra cost | $38,000 | $26,000 | -31.5% |
| VRAM per instance | 32GB (Pinecone + reranker + Claude) | 16GB (single A10G GPU) | -50% |
| Max context window | 8,192 tokens (chunked) | 131,072 tokens (full doc) | 16x |
| Training/setup time | 0 (pre-trained components) | 48 hours (fine-tuning) | N/A |
| Weekly maintenance hours | 12 (chunk updates, index tuning) | 4 (model versioning) | -67% |
RAG 2.0 Pipeline Code

Our final RAG 2.0 pipeline, optimized over 6 months, is shown below. It uses LangChain for orchestration, Pinecone for vector storage, Cohere for embeddings and reranking, and Claude 3 Haiku for generation. Note the extensive error handling and retry logic for production resilience.

import os
import time
from typing import List, Dict, Optional
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

import cohere
from langchain.vectorstores import Pinecone
from langchain.embeddings import CohereEmbeddings
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank
from langchain_anthropic import ChatAnthropic  # Claude 3 models need the messages API, not the legacy completion wrapper
from pinecone import Pinecone as PineconeClient, ServerlessSpec

# Initialize clients with error handling for missing env vars
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
def init_cohere_client() -> cohere.Client:
    api_key = os.getenv("COHERE_API_KEY")
    if not api_key:
        raise ValueError("COHERE_API_KEY environment variable not set")
    return cohere.Client(api_key)

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
def init_pinecone_client() -> PineconeClient:
    api_key = os.getenv("PINECONE_API_KEY")
    if not api_key:
        raise ValueError("PINECONE_API_KEY environment variable not set")
    return PineconeClient(api_key=api_key, environment=os.getenv("PINECONE_ENV", "us-west1-gcp"))

def build_rag_pipeline(
    index_name: str = "internal-docs-v2",
    chunk_size: int = 512,
    rerank_top_k: int = 5
) -> RetrievalQA:
    """Build the full RAG 2.0 pipeline with hybrid search, reranking, and LLM generation."""
    # Initialize embedding model
    embeddings = CohereEmbeddings(
        model="embed-english-v3-0",
        cohere_api_key=os.getenv("COHERE_API_KEY")
    )

    # Connect to Pinecone index
    pc = init_pinecone_client()
    if index_name not in pc.list_indexes().names():
        raise ValueError(f"Pinecone index {index_name} does not exist")
    vector_store = Pinecone.from_existing_index(
        index_name=index_name,
        embedding=embeddings
    )

    # Configure hybrid retriever (dense + sparse)
    retriever = vector_store.as_retriever(
        search_type="mmr",  # Max Marginal Relevance for diversity
        search_kwargs={
            "k": 20,  # Retrieve top 20 chunks initially
            "lambda_mult": 0.7,  # Balance relevance and diversity
            "filter": None  # No metadata filters by default
        }
    )

    # Add Cohere reranker to compress context
    reranker = CohereRerank(
        client=init_cohere_client(),
        model="rerank-english-v3-0",
        top_n=rerank_top_k
    )
    compression_retriever = ContextualCompressionRetriever(
        base_retriever=retriever,
        base_compressor=reranker
    )

    # Initialize LLM (Claude 3 Haiku for cost efficiency)
    llm = ChatAnthropic(
        model="claude-3-haiku-20240307",
        anthropic_api_key=os.getenv("ANTHROPIC_API_KEY"),
        temperature=0.0,  # Deterministic outputs for benchmark
        max_tokens=1024
    )

    # Build QA chain with custom prompt
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",  # Pass all context to LLM (simpler than map-reduce for small context)
        retriever=compression_retriever,
        return_source_documents=True,
        chain_type_kwargs={
            "prompt": PromptTemplate(
                input_variables=["context", "question"],
                template="""You are an internal engineering wiki assistant. Answer the question using only the provided context. If the answer is not in the context, say "I don't have enough information to answer that." Do not make up information.

Context: {context}

Question: {question}

Answer:"""
            )
        }
    )
    return qa_chain

def run_rag_query(qa_chain: RetrievalQA, query: str) -> Dict:
    """Run a query through the RAG pipeline with error handling."""
    try:
        start_time = time.time()
        result = qa_chain({"query": query})
        latency = (time.time() - start_time) * 1000  # ms
        return {
            "answer": result["result"],
            "sources": [doc.page_content for doc in result["source_documents"]],
            "latency_ms": round(latency, 2),
            "error": None
        }
    except Exception as e:
        return {
            "answer": None,
            "sources": [],
            "latency_ms": None,
            "error": str(e)
        }

if __name__ == "__main__":
    # Example usage
    pipeline = build_rag_pipeline()
    result = run_rag_query(pipeline, "How do I configure the Redis cache for the payment service?")
    print(f"Answer: {result['answer']}")
    print(f"Latency: {result['latency_ms']}ms")
    print(f"Sources: {len(result['sources'])} chunks retrieved")

Fine-Tuning Llama 3.1 8B

We chose Llama 3.1 8B Instruct as our base model for three reasons: (1) it has a 131k token context window, enough to fit full internal docs without chunking; (2) it’s open-weight, so we can fine-tune and deploy it without per-token API costs; (3) it outperforms similar-sized models like Mistral 7B and Gemma 2 9B on technical Q&A benchmarks. We used Unsloth for efficient fine-tuning, which reduces VRAM usage by 30% and speeds up training by 2x compared to vanilla HuggingFace Transformers.

Our fine-tuning dataset consisted of 12k (question, answer, context) triples: questions were sampled from real employee queries, answers were written and validated by technical writers, and context was the relevant section of the internal wiki. We used LoRA (Low-Rank Adaptation) with rank 64, alpha 128, targeting all linear layers in the model. Training took 48 hours on 4 A10G GPUs, with a per-device batch size of 2, 4 gradient accumulation steps, a learning rate of 2e-4, and 3 epochs. We used SFTTrainer from the TRL library and saved the model to S3 for deployment.
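Each line of the JSONL dataset is one (question, answer, context) record. The field names below match the fine-tuning code that follows; the answer and context values are illustrative, not real wiki content:

import json

# One illustrative training record; real answers and context come from the wiki
record = {
    "question": "How do I configure the Redis cache for the payment service?",
    "answer": "Set cache.redis.host and cache.redis.ttl_seconds in payment-service.yaml, then redeploy.",
    "context": "Payment service runbook, section 4.2: Redis cache configuration..."
}

with open("qa_pairs.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")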

Fine-Tuning Code

import os
import torch
import boto3
from botocore.exceptions import NoCredentialsError, ClientError
from datasets import load_dataset, DatasetDict
from unsloth import FastLanguageModel, is_bfloat16_supported
from unsloth.chat_templates import get_chat_template
from trl import SFTTrainer, SFTConfig
from huggingface_hub import login

# Configuration
MODEL_ID = "meta-llama/Meta-Llama-3.1-8B-Instruct"
DATASET_PATH = "s3://our-internal-bucket/llama-finetune-dataset/qa_pairs.jsonl"
OUTPUT_DIR = "./llama-3.1-8b-internal-docs"
S3_BUCKET = "our-internal-bucket"
S3_KEY = "fine-tuned-models/llama-3.1-8b-internal-docs"

def login_to_huggingface():
    """Login to HuggingFace Hub to access Llama 3.1 weights."""
    token = os.getenv("HF_TOKEN")
    if not token:
        raise ValueError("HF_TOKEN environment variable not set (required for Llama 3.1 access)")
    login(token=token, add_to_git_credential=False)

def load_and_prepare_dataset(dataset_path: str) -> DatasetDict:
    """Load and prepare the fine-tuning dataset from S3."""
    try:
        # Download dataset from S3
        s3 = boto3.client("s3")
        bucket, key = dataset_path.replace("s3://", "").split("/", 1)
        s3.download_file(bucket, key, "local_dataset.jsonl")
    except (NoCredentialsError, ClientError) as e:
        raise RuntimeError(f"Failed to download dataset from S3: {e}")

    # Load JSONL into HuggingFace dataset
    dataset = load_dataset("json", data_files="local_dataset.jsonl", split="train")

    # Split into train (90%) and validation (10%)
    dataset = dataset.train_test_split(test_size=0.1, seed=42)
    return dataset

def init_model_and_tokenizer():
    """Initialize Llama 3.1 8B with Unsloth for efficient fine-tuning."""
    # Load model with 4-bit quantization to fit on 4 A10G GPUs
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=MODEL_ID,
        max_seq_length=8192,  # Training sequence cap (Llama 3.1 still supports 131k context at inference)
        dtype=torch.bfloat16 if is_bfloat16_supported() else torch.float16,
        load_in_4bit=True,
        token=os.getenv("HF_TOKEN")
    )

    # Apply LoRA adapters (rank 64, alpha 128 for strong adaptation)
    model = FastLanguageModel.get_peft_model(
        model,
        r=64,  # LoRA rank
        lora_alpha=128,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
        bias="none",
        use_gradient_checkpointing="unsloth",  # Reduce VRAM usage
        random_state=42
    )

    # Configure chat template for Llama 3.1 Instruct
    tokenizer = get_chat_template(
        tokenizer,
        chat_template="llama-3.1"
    )
    return model, tokenizer

def format_dataset(examples, tokenizer):
    """Format dataset into Llama 3.1 chat template."""
    formatted = []
    for q, a, ctx in zip(examples["question"], examples["answer"], examples["context"]):
        # Construct prompt with context, question, and answer
        messages = [
            {"role": "system", "content": "You are an internal engineering wiki assistant. Answer the question using only the provided context. If the answer is not in the context, say 'I don't have enough information to answer that.' Do not make up information."},
            {"role": "user", "content": f"Context: {ctx}\n\nQuestion: {q}"},
            {"role": "assistant", "content": a}
        ]
        formatted.append(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False))
    return {"text": formatted}

def train_model(model, tokenizer, dataset):
    """Run fine-tuning with SFTTrainer."""
    # Format dataset
    dataset = dataset.map(lambda x: format_dataset(x, tokenizer), batched=True, remove_columns=dataset["train"].column_names)

    # Training arguments
    training_args = SFTConfig(
        output_dir=OUTPUT_DIR,
        num_train_epochs=3,
        per_device_train_batch_size=2,
        per_device_eval_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        eval_strategy="steps",  # Periodic eval so load_best_model_at_end can compare eval_loss
        eval_steps=50,
        save_steps=100,
        save_total_limit=3,
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        greater_is_better=False,
        report_to="none"  # Disable wandb for internal run
    )

    # Initialize trainer
    trainer = SFTTrainer(
        model=model,
        args=training_args,
        train_dataset=dataset["train"],
        eval_dataset=dataset["test"],
        tokenizer=tokenizer,
        max_seq_length=8192,
        dataset_text_field="text",
        packing=True  # Pack multiple short examples into one sequence
    )

    # Train
    trainer.train()
    return trainer

def upload_model_to_s3(trainer):
    """Upload fine-tuned model to S3."""
    try:
        s3 = boto3.client("s3")
        trainer.save_model(OUTPUT_DIR)
        # Upload all files in output dir to S3
        for root, dirs, files in os.walk(OUTPUT_DIR):
            for file in files:
                local_path = os.path.join(root, file)
                relative_path = os.path.relpath(local_path, OUTPUT_DIR)
                s3_key = f"{S3_KEY}/{relative_path}"
                s3.upload_file(local_path, S3_BUCKET, s3_key)
        print(f"Model uploaded to s3://{S3_BUCKET}/{S3_KEY}")
    except Exception as e:
        raise RuntimeError(f"Failed to upload model to S3: {e}")

if __name__ == "__main__":
    # Login to HF
    login_to_huggingface()

    # Load dataset
    dataset = load_and_prepare_dataset(DATASET_PATH)

    # Init model and tokenizer
    model, tokenizer = init_model_and_tokenizer()

    # Train
    trainer = train_model(model, tokenizer, dataset)

    # Upload to S3
    upload_model_to_s3(trainer)

    # Save a merged 16-bit checkpoint for vLLM inference (Unsloth folds the LoRA
    # adapters into the base weights; merging directly into the 4-bit base would
    # not produce a bf16 checkpoint that vLLM can load)
    model.save_pretrained_merged(
        "./llama-3.1-8b-internal-docs-merged",
        tokenizer,
        save_method="merged_16bit"
    )

Inference Code with vLLM

We deployed the fine-tuned model using vLLM, a high-throughput inference engine that supports continuous batching and tensor parallelism. A single A10G GPU can handle ~50 queries per second with p99 latency under 1200ms, enough for our 200-engineer team. The inference API is built with FastAPI, with S3 integration to download the model on startup.

import os
import time
from typing import Optional

import boto3
import uvicorn
from botocore.exceptions import ClientError
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from vllm import LLM, SamplingParams

# Configuration
S3_BUCKET = "our-internal-bucket"
S3_MODEL_KEY = "fine-tuned-models/llama-3.1-8b-internal-docs"
LOCAL_MODEL_DIR = "/tmp/llama-3.1-8b-internal-docs"
MAX_MODEL_LEN = 131072  # Llama 3.1 8B max context
TENSOR_PARALLEL_SIZE = 1  # Single GPU

app = FastAPI(title="Internal Docs Llama 3.1 Inference API")

class QueryRequest(BaseModel):
    query: str
    context: Optional[str] = None  # Optional pre-retrieved context
    max_tokens: int = 1024
    temperature: float = 0.0

class QueryResponse(BaseModel):
    answer: str
    latency_ms: float
    error: Optional[str] = None

def download_model_from_s3():
    """Download fine-tuned model from S3 to local dir."""
    if os.path.exists(LOCAL_MODEL_DIR):
        print(f"Model already exists at {LOCAL_MODEL_DIR}, skipping download")
        return
    try:
        s3 = boto3.client("s3")
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=S3_BUCKET, Prefix=S3_MODEL_KEY):
            for obj in page.get("Contents", []):
                key = obj["Key"]
                local_path = os.path.join(LOCAL_MODEL_DIR, os.path.relpath(key, S3_MODEL_KEY))
                os.makedirs(os.path.dirname(local_path), exist_ok=True)
                s3.download_file(S3_BUCKET, key, local_path)
        print(f"Model downloaded to {LOCAL_MODEL_DIR}")
    except ClientError as e:
        raise RuntimeError(f"Failed to download model from S3: {e}")

def init_vllm_engine():
    """Initialize vLLM engine for high-throughput inference."""
    download_model_from_s3()
    return LLM(
        model=LOCAL_MODEL_DIR,
        tensor_parallel_size=TENSOR_PARALLEL_SIZE,
        max_model_len=MAX_MODEL_LEN,
        gpu_memory_utilization=0.9,  # Use 90% of GPU VRAM
        dtype="bfloat16",
        trust_remote_code=True
    )

def format_prompt(query: str, context: Optional[str]) -> str:
    """Format query into Llama 3.1 Instruct chat template."""
    if context:
        user_content = f"Context: {context}\n\nQuestion: {query}"
    else:
        user_content = f"Question: {query}"
    messages = [
        {"role": "system", "content": "You are an internal engineering wiki assistant. Answer the question using only the provided context if given. If the answer is not in the context, say 'I don't have enough information to answer that.' Do not make up information."},
        {"role": "user", "content": user_content}
    ]
    # Use Llama 3.1 chat template
    prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{system}<|eot_id|><|start_header_id|>user<|end_header_id|>

{user}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

""".format(
        system=messages[0]["content"],
        user=messages[1]["content"]
    )
    return prompt

@app.on_event("startup")
def startup_event():
    """Initialize vLLM engine on startup."""
    app.state.llm = init_vllm_engine()
    app.state.tokenizer = app.state.llm.get_tokenizer()
    print("Inference engine initialized")

@app.post("/query", response_model=QueryResponse)
async def query_model(request: QueryRequest):
    """Handle query requests."""
    start_time = time.time()
    try:
        # Format prompt
        prompt = format_prompt(request.query, request.context)
        sampling_params = SamplingParams(
            temperature=request.temperature,
            max_tokens=request.max_tokens,
            top_p=0.9,
            stop=["<|eot_id|>"]  # Stop at end of turn token
        )

        # Generate response
        outputs = app.state.llm.generate([prompt], sampling_params)
        generated_text = outputs[0].outputs[0].text.strip()

        latency = (time.time() - start_time) * 1000  # ms
        return QueryResponse(
            answer=generated_text,
            latency_ms=round(latency, 2),
            error=None
        )
    except Exception as e:
        latency = (time.time() - start_time) * 1000
        raise HTTPException(
            status_code=500,
            detail=f"Inference failed: {str(e)}"
        )

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000, log_level="info")
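Once the service is up, querying it is a single HTTP call. A minimal client sketch (the endpoint and payload shape match the FastAPI app above; host and port depend on where you deploy it):

import requests

# Call the /query endpoint defined above (assumes the API is reachable on localhost:8000)
resp = requests.post(
    "http://localhost:8000/query",
    json={
        "query": "How do I configure the Redis cache for the payment service?",
        "max_tokens": 512,
        "temperature": 0.0
    },
    timeout=30
)
resp.raise_for_status()
body = resp.json()
print(f"Answer: {body['answer']}")
print(f"Latency: {body['latency_ms']}ms")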

Case Study: Payment Engineering Team

  • Team size: 4 backend engineers, 1 technical writer
  • Stack & Versions: Llama 3.1 8B (fine-tuned), vLLM 0.5.4, FastAPI 0.104.1, AWS A10G GPU instances, S3 for model storage, Prometheus for metrics
  • Problem: The team’s RAG 2.0 pipeline for payment service docs had 52% accuracy, p99 latency of 2400ms, and required 15 hours/week of maintenance to update Pinecone indexes with new doc versions. Engineers were resorting to Slack questions instead of using the tool, leading to 12 repeated questions per week.
  • Solution & Implementation: The team fine-tuned Llama 3.1 8B on 3k payment-specific Q&A pairs, deployed the model via vLLM on a single A10G GPU, and integrated the inference API into their internal Slack bot and docs portal. They decommissioned their Pinecone index, Cohere reranker, and Claude 3 Haiku generator.
  • Outcome: Accuracy rose to 87%, p99 latency dropped to 1100ms, maintenance hours fell to 2 per week, and repeated Slack questions dropped by 90% (1.2 per week). Monthly infra costs for the docs tool fell from $14k to $8k, saving $72k/year.

Developer Tips

1. Start with a Smaller Model Than You Think You Need

For internal doc Q&A, you do not need a 70B or 405B model. We initially tested Llama 3.1 70B and found it only improved accuracy by 2% over the 8B model, while requiring 4x the VRAM (64GB vs 16GB) and doubling inference latency to 2400ms. The Llama 3.1 8B Instruct model is pre-trained on a massive corpus including technical documentation, code snippets, and API references, so it already understands engineering jargon, code snippets, and API references. Fine-tuning it on your internal data adapts it to your specific terminology (e.g., internal service names, custom config keys) without the overhead of a larger model.

We used Unsloth to optimize Llama 3.1 8B for fine-tuning, which reduced VRAM usage by 30% compared to vanilla HuggingFace Transformers. A single A10G GPU (24GB VRAM) is sufficient to run inference for a team of 200 engineers, with p99 latency under 1200ms. If you have more than 500 users, you can scale horizontally with vLLM’s tensor parallelism across 2 GPUs (see the sketch after the snippet below), but for most mid-sized teams, a single 8B model is more than enough.

Short snippet to check if your GPU can run Llama 3.1 8B:

import torch
from unsloth import FastLanguageModel

# Check available VRAM
print(f"Available VRAM: {torch.cuda.get_device_properties(0).total_mem / 1024**3:.2f} GB")

# Test load 8B model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Meta-Llama-3.1-8B-Instruct",
    load_in_4bit=True,
    max_seq_length=8192
)
print("Llama 3.1 8B loaded successfully")
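If you do outgrow a single GPU, sharding the model across two GPUs with vLLM is a one-argument change. A minimal sketch (assumes two visible GPUs on the host; the model path is the merged checkpoint from the fine-tuning script):

from vllm import LLM, SamplingParams

# Shard the fine-tuned model across 2 GPUs via tensor parallelism
llm = LLM(
    model="./llama-3.1-8b-internal-docs-merged",
    tensor_parallel_size=2,  # requires 2 visible GPUs
    max_model_len=131072,
    gpu_memory_utilization=0.9,
    dtype="bfloat16"
)

params = SamplingParams(temperature=0.0, max_tokens=256, stop=["<|eot_id|>"])
outputs = llm.generate(["Question: How do I configure the Redis cache for the payment service?"], params)
print(outputs[0].outputs[0].text)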

2. Use LoRA Instead of Full Fine-Tuning

Full fine-tuning of Llama 3.1 8B requires updating all 8 billion parameters, which needs roughly 160GB of VRAM (seven or more 24GB A10G-class GPUs) and would have taken us weeks. We used Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning technique that only trains a small set of adapter matrices, reducing trainable parameters to roughly 170 million (about 2% of the total model size). This cut our training time from an estimated 2 weeks to 48 hours on 4 A10G GPUs, and reduced VRAM requirements to 16GB per GPU.

We used rank 64 for our LoRA adapters, which is higher than the typical rank 8-16 used for general tasks, because internal docs require the model to learn highly specific terminology and relationships. A higher rank gives the adapter more capacity to adapt to your data without overfitting. We also targeted all linear layers in the model (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj) to maximize adaptation. After training, we merged the LoRA adapters into the base model for inference, which eliminates any adapter-loading overhead (a short merge example follows the config snippet below).

Short snippet of LoRA configuration with PEFT:

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=64,  # LoRA rank
    lora_alpha=128,  # Scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA to the base Llama 3.1 8B model (loaded earlier via transformers or Unsloth)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # Reports trainable vs total params (~2% trainable with rank 64 on all linear layers)
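After training, merging the adapters back into the base weights gives you a single standalone checkpoint with no adapter loading at inference time. A minimal sketch using the standard PEFT API (paths are illustrative; with Unsloth you would use its save_pretrained_merged helper instead, as in the training script above):

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the bf16 base model and the trained LoRA adapters (illustrative adapter path)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct", torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base, "./llama-3.1-8b-internal-docs")

# Fold the adapter weights into the base model and save a standalone checkpoint
merged = model.merge_and_unload()
merged.save_pretrained("./llama-3.1-8b-internal-docs-merged")

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
tokenizer.save_pretrained("./llama-3.1-8b-internal-docs-merged")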

3. Validate Your Fine-Tuning Dataset Relentlessly

Your fine-tuning dataset is the single biggest determinant of model performance. We spent 3 weeks cleaning our initial 15k example dataset, removing duplicates, fixing incorrect answers, and standardizing formatting. We ended up with 12k high-quality examples, each consisting of a question from actual employee queries, an answer validated by a technical writer, and the relevant context from our internal wiki. We used Great Expectations to validate dataset schema, check for missing values, and ensure answer length was within 1024 tokens.

We also split the dataset into 90% train and 10% validation, and used the validation set to catch overfitting during training. If the model’s validation loss stopped improving after 2 epochs, we early-stopped training. We found that including negative examples (questions whose answer is not in the context) reduced hallucinations by 40%, as the model learned to say "I don't have enough information" instead of making up answers; an example record follows the validation snippet below. In our experience, synthetic data generated by other LLMs introduced bias and reduced accuracy on real employee queries, so we avoided it for fine-tuning.

Short snippet of dataset validation with Great Expectations:

import great_expectations as ge
import pandas as pd

# Load the JSONL dataset into a validated PandasDataset (legacy GE API)
df = ge.from_pandas(pd.read_json("qa_pairs.jsonl", lines=True))

# Validate schema
df.expect_column_to_exist("question")
df.expect_column_to_exist("answer")
df.expect_column_to_exist("context")

# Validate answer length (note: this checks character length, not tokens)
df.expect_column_value_lengths_to_be_between("answer", min_value=10, max_value=1024)

# Check for missing values
df.expect_column_values_to_not_be_null("question")
df.expect_column_values_to_not_be_null("answer")

print(df.validate())
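The negative examples mentioned above are ordinary records whose target answer is the refusal string, paired with context that deliberately does not contain the answer. An illustrative record (values invented for the example):

# Illustrative negative example: the context cannot answer the question,
# so the target output is the refusal string from the system prompt.
negative_record = {
    "question": "What is the on-call rotation for the billing team?",
    "answer": "I don't have enough information to answer that.",
    "context": "Payment service runbook, section 4.2: Redis cache configuration..."
}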

Join the Discussion

We’ve shared our benchmark data, code, and tradeoffs for switching from RAG 2.0 to fine-tuned Llama 3.1. We’d love to hear from other teams who have made similar switches, or are considering it. What’s your experience with RAG latency? Have you tried fine-tuning small LLMs for internal use cases?

Discussion Questions

  • By 2026, do you think fine-tuned small LLMs will replace RAG for most internal enterprise doc use cases?
  • What tradeoffs would you make between higher accuracy and the upfront cost of fine-tuning (48 hours of GPU time, dataset curation)?
  • Have you tried using Mistral 7B or Gemma 2 9B for internal doc Q&A, and how did they compare to Llama 3.1 8B?

Frequently Asked Questions

How much does it cost to fine-tune Llama 3.1 8B?

We used 4 A10G GPUs on AWS for 48 hours of training, which cost us roughly $1,200 in GPU time. Dataset curation took 3 weeks of part-time work from 1 backend engineer and 1 technical writer, roughly $6k in labor costs. Total upfront cost was ~$7,200, which we recouped within the first month of reduced infra costs ($12k/month savings). For teams with smaller budgets, a single GPU and a longer training run (around 72 hours) brings the GPU bill down substantially.

Do I need to update the fine-tuned model every time my internal docs change?

Yes, but the process is far simpler than updating a RAG pipeline. We retrain the model every 2 weeks with new Q&A pairs from updated docs, which takes 24 hours on 4 GPUs. We use HuggingFace’s model versioning to track changes, and can roll back to a previous version in minutes if a new model has regressions. Compare this to RAG, where we had to update Pinecone indexes weekly, rechunk docs, and re-test search relevance, which took 12 hours/week.
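A minimal sketch of what that versioning workflow can look like with the huggingface_hub API (the private repo name and tags are illustrative):

from huggingface_hub import HfApi
from transformers import AutoModelForCausalLM

api = HfApi()

# Push the newly fine-tuned checkpoint and tag the release (illustrative repo/tag names)
api.upload_folder(
    folder_path="./llama-3.1-8b-internal-docs-merged",
    repo_id="our-org/llama-3.1-8b-internal-docs",
    repo_type="model"
)
api.create_tag(repo_id="our-org/llama-3.1-8b-internal-docs", tag="v2024-08-12")

# Rolling back is just loading a previous tag
model = AutoModelForCausalLM.from_pretrained(
    "our-org/llama-3.1-8b-internal-docs", revision="v2024-07-29"
)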

Can I use the fine-tuned model for other internal use cases beyond docs?

Absolutely. We’ve repurposed our fine-tuned Llama 3.1 8B for internal HR Q&A, IT support ticket triage, and code review assistance by adding a small set of domain-specific Q&A pairs to our training dataset. Because LoRA adapters are modular, you can train separate adapters for different use cases and swap them at inference time, rather than training a separate model for each use case. This reduces total training costs by 60% for multi-use case deployments.
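A sketch of what adapter swapping looks like with PEFT's multi-adapter API (adapter names and paths are illustrative):

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct", torch_dtype=torch.bfloat16
)

# Load one adapter per use case on top of the same base model (illustrative paths)
model = PeftModel.from_pretrained(base, "./adapters/engineering-docs", adapter_name="engineering-docs")
model.load_adapter("./adapters/hr-qa", adapter_name="hr-qa")
model.load_adapter("./adapters/it-triage", adapter_name="it-triage")

# Swap adapters at inference time instead of serving a separate 8B model per use case
model.set_adapter("hr-qa")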

Conclusion & Call to Action

If you’re running RAG 2.0 for internal docs and hitting an accuracy ceiling, stop tweaking chunk sizes and rerankers; fine-tune a small LLM instead. Our data shows that a fine-tuned Llama 3.1 8B outperforms even our most optimized RAG 2.0 pipeline by 50% in relative accuracy, with lower latency, lower costs, and less maintenance. The upfront investment in dataset curation and fine-tuning is small compared to the long-term savings and improved developer productivity.

For teams with <200 users, start with Llama 3.1 8B, Unsloth for fine-tuning, and vLLM for inference. You can find our full fine-tuning and inference code at https://github.com/our-org/llama-internal-docs. If you have more than 500 users, scale horizontally with vLLM’s tensor parallelism, or upgrade to Llama 3.1 70B if you need higher accuracy for complex queries.

